Justin Hu | How to ELF

You may be aware that I'm doing Advent of Code in assembly. You might be wondering where someone learns the (dark) art of programming in assembly. Perhaps you'd like to write a compiler, or just want to recreate how the programmers of yore did things. So here's a brief overview of assembly programming on x86_64 and Linux - I'm going to assume that you've read the nasm manual well enough to understand assembly syntax, and are a proficient C programmer.

What Your Compiler Does

Learning assembly is really mostly unwrapping the layer of abstraction the C compiler provides - which is actually a really thick layering of multiple abstractions. Let's look at one of the thinner layers - the compiler driver. What you think of as a C compiler is actually a few programs working together, all orchestrated by the compiler driver. You can see this if you give gcc the option -v. Here's an example, with the irrelevant stuff omitted.

justin@Carbon ~/AdventOfCode> gcc -v hello.c -o hello
 cc1 hello.c -o /tmp/ccEY5VjO.s
 as -o /tmp/ccJ9QAxD.o /tmp/ccEY5VjO.s
 collect2 -o hello /tmp/ccJ9QAxD.o -lc

Let's break this down a bit. When I ask gcc to compile a file, first, it delegates the transformation of the file from C to assembly over to cc1, the actual compiler program. This is a part that gets skipped when you're writing assembly directly. Then, it asks the system assembler, as, to assemble the file into an object file. Finally, it asks the linker, collect2 (which is itself a wrapper around the actual linker, ld), to link the object file with the C standard library (another thing that you don't have when working in assembly) and produce the executable named hello.

The actual sequence of commands to produce some assembly is largely similar to what your compiler does.

justin@Carbon ~/AdventOfCode> nasm -f elf64 common.s
justin@Carbon ~/AdventOfCode> nasm -f elf64 5a.s
justin@Carbon ~/AdventOfCode> ld -o 5a 5a.o common.o

We ask the assembler (I'm using nasm since its syntax is a lot nicer to work with) to compile our assembly files into object files (the output file name can be implicit), and then we ask the linker to produce a executable. (nasm unhelpfully defaults to 32-bit ELF files, even on my 64-bit system, so we need to tell it otherwise.)

Just What is a Program Anyway

So now let's talk about ELF files - the Executable and Linkable Format is a file format for binary data. Both object files and executables are ELF files, and I'll get into the exact differences later. In general, an ELF file consists of sections; when a program is run, Linux will load those sections into memory (from where in the file and to where in memory based on information stored in the ELF file - in the program header table), set all registers except the instruction pointer and stack pointer to zero, set up the stack, and start userspace execution with the instruction pointer set to the value specified in the ELF file.

All of the ELF file manipulation is the province of the assembler and the linker. Your job, as the programmer, is to tell the assembler what sections you would like to have, and what the contents of those sections are. So what sections could you have? There's four main types your assembler recognizes. There's .text sections (yes, the period is intentional - section names always start with a dot, don't ask me why), which contain the actual assembly code of your program. There's .data sections - these contain data that your program may read and modify. There's .rodata sections - these contain data that your program may only read. This is usually where string literals are kept. Finally, there's .bss sections, where zero-initialized writable data is kept. (.bss stands for block started by symbol - it's historical baggage; x86_64 assembly has enough historical baggage to fill an airliner.) You'd use a .bss section because however much you have, it takes up a constant amount of disk space (constant because the section still needs to have its metadata and size stored in the ELF file).

Only the linker and the assembler really care about section types - they're a construct that your assembler has decided to provide. There's nothing in ELF stopping you from declaring a section that is not readable and only writable, or a section that is readable, writable, and executable.

Now, let's get to the difference between object files and executable files. The linker takes in object files, decides where everything goes in memory in absolute terms, and produces an executable. The main difference between object files and executables is that object files have a relocation table, and executable files don't. Suppose I want one object file to call a function defined elsewhere. I need to know the address of the function to call it, but I only know that once the linker decides where everything goes in memory in absolute terms. So the assembler will add an entry into the relocation section instructing the linker to edit the code and insert the absolute address of the function once it's known. Likewise, the assembler, for the other file, will add an entry to the symbol table section declaring that the function with this name may be found at some offset in this file's text section. And then the linker must reconcile these tables, except it also needs to know where the entry point of the program should be - where the instruction pointer should point when the program starts. That's traditionally pointed at the symbol _start in the text section. Interestingly, the linker keeps the symbol table around in the finished executable (the strip command discards these). This is quite convenient since assemblers don't include any more detailed kind of debugging symbols, but gdb is perfectly happy to read the included symbol table.

As an aside, your C compiler won't actually produce any code for the _start symbol - the main function is at a symbol called main (more or less - some C compilers prefix C symbols with an underscore). The C standard library supplies a wrapper at _start that does some work to get the command line arguments (see below) into the expected locations before calling main and then making the exit syscall, among other things. This is necessarily written in assembly - which is, perhaps, a reason you might want to know how to at least read assembly: so you can understand how your program's runtime environment gets set up. This is also where, in languages with non-constant global variables, those variables are initialized. For more information, see GCC's documentation on constructors and initialization.

Doing Things

Okay - let's cover actual assembly programming. Remember that, since this is assembly, almost everything is a matter of programmer convention and almost nothing is enforced by your environment. You've got 16 registers, accessible as either a single byte, a word (two bytes), a doubleword (four bytes), or a quadword (eight bytes). Notice that I didn't mention anything about what those bytes might store. It's up to you, the programmer, to enforce type rules. The first eight registers are the original eight, and some of them are specifically referenced in some instructions as an implicit operand - this shouldn't stop you from using them as general purpose registers most of the time. rax was the accumulator, and it's often the implicit source operand of some instructions, like the mul instruction. rbx isn't often a special register, but rcx is special, and is often used as a counter. rdx is occasionally used to extend rax to twice its usual length, again, see the mul instruction for an example. rsi and rdi are used in string operations. rsp is the only one that's special all the time - it's the stack pointer. rbp is the base pointer (i.e. it points to the return address on the stack), but it can also be used as a special purpose register. The rest, which are numbered and not named, are general purpose registers all the time.

Most of the basic arithmetic operations you'd be familiar with in a high level language have direct counterparts in assembly - addition (add) and subtraction (sub) work on signed and unsigned numbers with the same instruction. Multiplication is split into signed (imul) and unsigned (mul), as is division (idiv, div). You'll often want to avoid multiplication and especially division, since unsigned multiplication and division require you to fiddle around with the special purpose uses of rax and rdx.

Control flow, however, is a bit less congruent. You have three main control-flow primitives - unconditional jumps, conditional jumps, and function calls. Unconditional jumps and conditional jumps can be used to make the equivalent of if/else statements and loops. Here's an example of a simple if statement:

  cmp rax, 0
  jne .notZero

  ; rax is zero
  add rax, 5

.notZero:

We do a comparison of rax with zero, and then, if it's not equal to zero, we jump to the label .notZero. The overall effect of this is that if rax is zero, then it becomes five; otherwise, it stays the same. There's two main comparison operations - cmp, which subtracts the second operand from the first and sets the flags according to the result, and test, which does a bitwise-and of the two operands and sets the flags according to the result. The flags register (you can't directly interact with this one) is queried when doing a conditional jump. Note that a cmp followed by a relational-style conditional jump, like jle, will jump if the first operand has the indicated relation to the second operand. (In math speak, a - b is related to zero identically to how a is related to b). There's also the negations of each sort of conditional jump - jnle, for example (which, incidentally, is equivalent, both logically and in the assembly, to jg).

In addition to the relational conditional jumps, you can also query the state of individual flags directly - you can condition on the carry flag, the overflow flag, the parity flag, the sign flag, and the zero flag being either on or off. This is useful if, for example, you've done a test and you want to jump if the result was zero - you'd do a jz, which expresses your intent a lot better than the equivalent je.

Note that cmp x, 0 is equivalent to test x, x, except like most instructions, you can only have one memory operand per test instruction.

In addition to jumping forwards, you can also jump back, allowing you to construct loops and even more control flow than the (admittedly uncommonly felt) limitations of structured programming. Goto considered harmful? In assembly, there's only gotos.

The last bit of control flow is the call instruction. It takes a destination address (usually in the form of a label), and will redirect the instruction pointer to that address until a ret is encountered, after which time the instruction pointer returns to just after the call instruction. Just like a function call. Note, however, there's no arguments nor any return value - you have to construct that yourself. To provide arguments, you'd put them in registers or in specific locations on the stack, and to get a return value, you'd read it from a register or off of the stack. I'd recommend sticking to the System V calling convention if you're writing something intended to be used in a different file.

Finally, let's touch on memory access. At some point, you're going to need to store some data somewhere that isn't a register. You've got three options. You either store it in the stack - at the start of your function, you subtract from the stack pointer to allocate space (on Linux, the stack grows downwards), and store data there. Or you store it in static memory - some space allocated in the .data or .bss section. Or you store it in the heap. You'd use the brk syscall (see below) to allocate some space, and store the data there. But remember that you're going to need to store the pointer to that data on the heap somewhere as well!

So how does one allocate some space in a data section? Like this:

section .data
variable: dq 42

This declares a quadword worth of memory in the data section whose address is represented in the symbol variable. You can access the value by saying [variable] - you're using the plain old displacement addressing mode. See this blog post by William Woodruff for a more in-depth discussion of addressing memory. An interesting quirk of gdb is that it allows you to directly view the contents of the address, but only by asking it to print out the contents of the name - while it is true in C and other higher level languages that such a symbol ought to be interpreted as the location of some static variable, in assembly, it's necessary to understand that it's still a pointer, just a pointer to statically allocated memory. Thus, in assembly code, when you say variable, you are getting the address pointed to by the label, while in gdb, you're getting the contents of the address labelled. This tripped me up a few times when I was using gdb to debug my code. While gdb supports assembly, its main target is still the high level languages, like C, and in times like this, it shows.

The Environment

So you can now run instructions, but instructions without input are not very useful. You can get some input by reading the command line arguments - when Linux sets up your program's stack, it puts the number of command line arguments to QWORD [rsp + 0] and pointers to the command line arguments in QWORD [rsp + 8] onwards, if there are any, followed by a null pointer and then a pointer to the environment string (see man env for more details on this). Most shells supply at least one - the name your command was invoked with. For example, running ./compare one two would result in QWORD [rsp + 0] being 3, QWORD [rsp + 8] begin a pointer to "./compare", QWORD [rsp + 16] being a pointer to "one", QWORD [rsp + 24] being a pointer to "two", QWORD [rsp + 32] being zero, and QWORD [rsp + 40] being a pointer to the environment string.

So you can get input (in the form of strings) from the command line, but you don't know how to exit the program (no, you can't just hlt - that would halt the entire core, and only Linux is allowed to do that) and you don't know how to write to the command line yet. Here's how you do every other bit of interacting with the operating system: syscall. syscall is a special instruction that raises an interrupt that Linux will catch and process. See man 2 syscall for more info, and man 2 syscalls for a list. To make a syscall in assembly, you'd put the syscall number in rax, the arguments in rdi, rsi, rdx, r10, r8, r9, and execute the syscall instruction. One example syscall is exit - number 60. Here's how you'd use it:

  mov rdi, 0
  mov rax, 60
  syscall

You set the first and only argument to the exit code (which is truncated by the OS to a byte), and you set the system call number to 60, and make the syscall. You don't care what comes after, since your program will definitely have exited and will have ceased to be.

Some of the other system calls I use are open, fstat, and mmap - these allow me to memory-map the input files from Advent of Code so I can just parse through them using pointers. Additionally, brk lets you allocate heap space - but note that the assembly version of brk, unlike most system calls, does not have the same interface as the C library version.

In Conclusion - Why?

I would be the first person to tell you that doing assembly programming in this day and age is useful only to compiler writers, OS developers, and some folks writing very specialized libraries (like the folks behind GMP, for example). There's no real reason why you'd want to program in assembly, and in fact, you often can't beat GCC in terms of performance, and you certainly can't beat even C for ease of use. So why learn to program in assembly? For the same reason that half this blog post isn't about assembly - you get to peek behind the curtain and learn about the abstractions that back our modern computing environment, and understand the influence this foundational layer has. For example, the reason why the C modulo operator (%) actually does a remainder operation is because the idiv instruction specifies that it returns the quotient and the remainder, not the quotient and the modulus; the reason why right shift of a negative signed number is hazardous in C is because there's two possible instructions the compiler could emit, and one does an arithmetic shift, filling with the sign bit, while the other does a logical shift, filling with zero; and the reason why you can (mostly) run 32 bit programs on 64 bit computers is because x86 is binary-compatible with x86_64.

How to ELF

A brief introduction to below-C level programming on Linux

What Your Compiler Does

Just What is a Program Anyway

Doing Things

The Environment

In Conclusion - Why?