You may be aware that I'm doing
Advent of Code in
assembly. You might be wondering where someone learns the (dark) art
of programming in assembly. Perhaps you'd like to write a compiler, or
just want to recreate how the programmers of yore did things. So
here's a brief overview of assembly programming on x86_64 and Linux -
I'm going to assume that you've read the nasm
manual well
enough to understand assembly syntax, and are a proficient C
programmer.
Learning assembly is really mostly unwrapping the layer of abstraction
the C compiler provides - which is actually a really thick layering of
multiple abstractions. Let's look at one of the thinner layers - the
compiler driver. What you think of as a C compiler is actually a few
programs working together, all orchestrated by the compiler driver.
You can see this if you give gcc
the option
-v
. Here's an example, with the irrelevant stuff omitted.
justin@Carbon ~/AdventOfCode> gcc -v hello.c -o hello cc1 hello.c -o /tmp/ccEY5VjO.s as -o /tmp/ccJ9QAxD.o /tmp/ccEY5VjO.s collect2 -o hello /tmp/ccJ9QAxD.o -lc
Let's break this down a bit. When I ask gcc
to compile a
file, first, it delegates the transformation of the file from C to
assembly over to cc1
, the actual compiler program. This
is a part that gets skipped when you're writing assembly directly.
Then, it asks the system assembler, as
, to assemble the
file into an object file. Finally, it asks the linker,
collect2
(which is itself a wrapper around the actual
linker, ld
), to link the object file with the C standard
library (another thing that you don't have when working in assembly)
and produce the executable named hello
.
The actual sequence of commands to produce some assembly is largely similar to what your compiler does.
justin@Carbon ~/AdventOfCode> nasm -f elf64 common.s justin@Carbon ~/AdventOfCode> nasm -f elf64 5a.s justin@Carbon ~/AdventOfCode> ld -o 5a 5a.o common.o
We ask the assembler (I'm using nasm
since its syntax is
a lot nicer to work with) to compile our assembly files into object
files (the output file name can be implicit), and then we ask the
linker to produce a executable. (nasm
unhelpfully
defaults to 32-bit ELF files, even on my 64-bit system, so we need to
tell it otherwise.)
So now let's talk about ELF files - the Executable and Linkable Format is a file format for binary data. Both object files and executables are ELF files, and I'll get into the exact differences later. In general, an ELF file consists of sections; when a program is run, Linux will load those sections into memory (from where in the file and to where in memory based on information stored in the ELF file - in the program header table), set all registers except the instruction pointer and stack pointer to zero, set up the stack, and start userspace execution with the instruction pointer set to the value specified in the ELF file.
All of the ELF file manipulation is the province of the assembler and
the linker. Your job, as the programmer, is to tell the assembler what
sections you would like to have, and what the contents of those
sections are. So what sections could you have? There's four main types
your assembler recognizes. There's .text
sections (yes,
the period is intentional - section names always start with a dot,
don't ask me why), which contain the actual assembly code of your
program. There's .data
sections - these contain data that
your program may read and modify. There's
.rodata
sections - these contain data that your program
may only read. This is usually where string literals are kept.
Finally, there's .bss
sections, where zero-initialized
writable data is kept. (.bss
stands for block started by
symbol - it's historical baggage; x86_64 assembly has enough
historical baggage to fill an airliner.) You'd use a
.bss
section because however much you have, it takes up a
constant amount of disk space (constant because the section still
needs to have its metadata and size stored in the ELF file).
Only the linker and the assembler really care about section types - they're a construct that your assembler has decided to provide. There's nothing in ELF stopping you from declaring a section that is not readable and only writable, or a section that is readable, writable, and executable.
Now, let's get to the difference between object files and executable
files. The linker takes in object files, decides where everything goes
in memory in absolute terms, and produces an executable. The main
difference between object files and executables is that object files
have a relocation table, and executable files don't. Suppose I want
one object file to call a function defined elsewhere. I need to know
the address of the function to call it, but I only know that once the
linker decides where everything goes in memory in absolute terms. So
the assembler will add an entry into the relocation section
instructing the linker to edit the code and insert the absolute
address of the function once it's known. Likewise, the assembler, for
the other file, will add an entry to the symbol table section
declaring that the function with this name may be found at some offset
in this file's text section. And then the linker must reconcile these
tables, except it also needs to know where the entry point of the
program should be - where the instruction pointer should point when
the program starts. That's traditionally pointed at the symbol
_start
in the text section. Interestingly, the linker
keeps the symbol table around in the finished executable (the
strip
command discards these). This is quite convenient
since assemblers don't include any more detailed kind of debugging
symbols, but gdb
is perfectly happy to read the included
symbol table.
As an aside, your C compiler won't actually produce any code for the
_start
symbol - the main function is at a symbol called
main
(more or less - some C compilers prefix C symbols
with an underscore). The C standard library supplies a wrapper at
_start
that does some work to get the command line
arguments (see below) into the expected locations before calling main
and then making the exit
syscall, among other things.
This is necessarily written in assembly - which is, perhaps, a reason
you might want to know how to at least read assembly: so you can
understand how your program's runtime environment gets set up. This is
also where, in languages with non-constant global variables, those
variables are initialized. For more information, see
GCC's documentation on constructors and initialization.
Okay - let's cover actual assembly programming. Remember that, since
this is assembly, almost everything is a matter of programmer
convention and almost nothing is enforced by your environment. You've
got 16 registers, accessible as either a single byte, a word (two
bytes), a doubleword (four bytes), or a quadword (eight bytes). Notice
that I didn't mention anything about what those bytes might store.
It's up to you, the programmer, to enforce type rules. The first eight
registers are the original eight, and some of them are specifically
referenced in some instructions as an implicit operand - this
shouldn't stop you from using them as general purpose registers most
of the time.
rax
was the accumulator, and it's often the implicit
source operand of some instructions, like the
mul
instruction. rbx
isn't often a special
register, but rcx
is special, and is often used as a
counter. rdx
is occasionally used to extend
rax
to twice its usual length, again, see the
mul
instruction for an example. rsi
and
rdi
are used in string operations. rsp
is
the only one that's special all the time - it's the stack pointer.
rbp
is the base pointer (i.e. it points to the return
address on the stack), but it can also be used as a special purpose
register. The rest, which are numbered and not named, are general
purpose registers all the time.
Most of the basic arithmetic operations you'd be familiar with in a
high level language have direct counterparts in assembly - addition
(add
) and subtraction (sub
) work on signed
and unsigned numbers with the same instruction. Multiplication is
split into signed (imul
) and unsigned (mul
),
as is division (idiv
, div
). You'll often
want to avoid multiplication and especially division, since unsigned
multiplication and division require you to fiddle around with the
special purpose uses of rax
and rdx
.
Control flow, however, is a bit less congruent. You have three main control-flow primitives - unconditional jumps, conditional jumps, and function calls. Unconditional jumps and conditional jumps can be used to make the equivalent of if/else statements and loops. Here's an example of a simple if statement:
cmp rax, 0 jne .notZero ; rax is zero add rax, 5 .notZero:
We do a comparison of rax
with zero, and then, if it's
not equal to zero, we jump to the label .notZero
. The
overall effect of this is that if rax
is zero, then it
becomes five; otherwise, it stays the same. There's two main
comparison operations - cmp
, which subtracts the second
operand from the first and sets the flags according to the result, and
test
, which does a bitwise-and of the two operands and
sets the flags according to the result. The flags register (you can't
directly interact with this one) is queried when doing a conditional
jump. Note that a cmp
followed by a relational-style
conditional jump, like jle
, will jump if the first
operand has the indicated relation to the second operand. (In math
speak, a - b is related to zero identically to how
a is related to b). There's also the negations of each
sort of conditional jump - jnle
, for example (which,
incidentally, is equivalent, both logically and in the assembly, to
jg
).
In addition to the relational conditional jumps, you can also query
the state of individual flags directly - you can condition on the
carry flag, the overflow flag, the parity flag, the sign flag, and the
zero flag being either on or off. This is useful if, for example,
you've done a test
and you want to jump if the result was
zero - you'd do a jz
, which expresses your intent a lot
better than the equivalent je
.
Note that cmp x, 0
is equivalent to
test x, x
, except like most instructions, you can only
have one memory operand per test
instruction.
In addition to jumping forwards, you can also jump back, allowing you to construct loops and even more control flow than the (admittedly uncommonly felt) limitations of structured programming. Goto considered harmful? In assembly, there's only gotos.
The last bit of control flow is the call
instruction. It
takes a destination address (usually in the form of a label), and will
redirect the instruction pointer to that address until a
ret
is encountered, after which time the instruction
pointer returns to just after the call
instruction. Just
like a function call. Note, however, there's no arguments nor any
return value - you have to construct that yourself. To provide
arguments, you'd put them in registers or in specific locations on the
stack, and to get a return value, you'd read it from a register or off
of the stack. I'd recommend sticking to the
System V calling convention
if you're writing something intended to be used in a different file.
Finally, let's touch on memory access. At some point, you're going to
need to store some data somewhere that isn't a register. You've got
three options. You either store it in the stack - at the start of your
function, you subtract from the stack pointer to allocate space (on
Linux, the stack grows downwards), and store data there. Or you store
it in static memory - some space allocated in the
.data
or .bss
section. Or you store it in
the heap. You'd use the brk
syscall (see below) to
allocate some space, and store the data there. But remember that
you're going to need to store the pointer to that data on the heap
somewhere as well!
So how does one allocate some space in a data section? Like this:
section .data variable: dq 42
This declares a quadword worth of memory in the data section whose
address is represented in the symbol variable
. You can
access the value by saying [variable]
- you're using the
plain old displacement addressing mode. See
this blog post by William Woodruff
for a more in-depth discussion of addressing memory. An interesting
quirk of gdb
is that it allows you to directly view the
contents of the address, but only by asking it to print out the
contents of the name - while it is true in C and other higher level
languages that such a symbol ought to be interpreted as the location
of some static variable, in assembly, it's necessary to understand
that it's still a pointer, just a pointer to statically allocated
memory. Thus, in assembly code, when you say variable
,
you are getting the address pointed to by the label, while in
gdb
, you're getting the contents of the address labelled.
This tripped me up a few times when I was using gdb
to
debug my code. While gdb
supports assembly, its main
target is still the high level languages, like C, and in times like
this, it shows.
So you can now run instructions, but instructions without input are
not very useful. You can get some input by reading the command line
arguments - when Linux sets up your program's stack, it puts the
number of command line arguments to
QWORD [rsp + 0]
and pointers to the command line
arguments in QWORD [rsp + 8]
onwards, if there are any,
followed by a null pointer and then a pointer to the environment
string (see man env
for more details on this). Most
shells supply at least one - the name your command was invoked with.
For example, running ./compare one two
would result in
QWORD [rsp + 0]
being 3,
QWORD [rsp + 8]
begin a pointer to
"./compare"
, QWORD [rsp + 16]
being a
pointer to "one"
, QWORD [rsp + 24]
being a
pointer to "two"
, QWORD [rsp + 32]
being
zero, and QWORD [rsp + 40]
being a pointer to the
environment string.
So you can get input (in the form of strings) from the command line,
but you don't know how to exit the program (no, you can't just
hlt
- that would halt the entire core, and only Linux is
allowed to do that) and you don't know how to write to the command
line yet. Here's how you do every other bit of interacting with the
operating system: syscall
. syscall
is a
special instruction that raises an interrupt that Linux will catch and
process. See man 2 syscall
for more info, and
man 2 syscalls
for a list. To make a syscall in assembly,
you'd put the syscall number in rax
, the arguments in
rdi
, rsi
, rdx
,
r10
, r8
, r9
, and execute the
syscall
instruction. One example syscall is exit - number
60. Here's how you'd use it:
mov rdi, 0 mov rax, 60 syscall
You set the first and only argument to the exit code (which is truncated by the OS to a byte), and you set the system call number to 60, and make the syscall. You don't care what comes after, since your program will definitely have exited and will have ceased to be.
Some of the other system calls I use are open
,
fstat
, and mmap
- these allow me to
memory-map the input files from Advent of Code so I can just parse
through them using pointers. Additionally, brk
lets you
allocate heap space - but note that the assembly version of
brk
, unlike most system calls, does not have the same
interface as the C library version.
I would be the first person to tell you that doing assembly
programming in this day and age is useful only to compiler writers, OS
developers, and some folks writing very specialized libraries (like
the folks behind GMP, for example). There's no real reason why you'd
want to program in assembly, and in fact, you often can't beat GCC in
terms of performance, and you certainly can't beat even C for ease of
use. So why learn to program in assembly? For the same reason that
half this blog post isn't about assembly - you get to peek behind the
curtain and learn about the abstractions that back our modern
computing environment, and understand the influence this foundational
layer has. For example, the reason why the C modulo operator
(%
) actually does a remainder operation is because the
idiv
instruction specifies that it returns the quotient
and the remainder, not the quotient and the modulus; the reason why
right shift of a negative signed number is hazardous in C is because
there's two possible instructions the compiler could emit, and one
does an arithmetic shift, filling with the sign bit, while the other
does a logical shift, filling with zero; and the reason why you can
(mostly) run 32 bit programs on 64 bit computers is because x86 is
binary-compatible with x86_64.