Assembly Language Programs

Assembly language is one step above machine language. You were already introduced to it in the section on machine language programs, where the names of the instructions were given along with their opcodes. The names are used in assembly language and the opcodes in machine language. Assembly language lets us think in terms of instruction names and variable identifiers rather than opcodes and numerical offsets or addresses.

Example 1: Basic Arithmetic and Logic

The first machine language program we looked at was actually written in three different languages (see below). We presented the algorithm in a Java-like language because that is what you were familiar with already. We then broke the program down into simpler steps using the names of the machine language instructions and variable identifiers (assembly language). Next, we replaced the names of the instructions with the machine language opcodes and replaced the variable identifiers with the corresponding offsets into the block of memory used to store the values of the local variables. In very early machines, we would have needed to take this process one more step by converting the decimal opcodes and operands into binary. The process we used to convert from assembly language to machine language is called hand-assembly. We assembled our machine language program by hand, one instruction at a time.

Java-Like      Assembly      Machine (Decimal and Binary)

C = A + B;     ILOAD A       21  0     00010101 00000000
               ILOAD B       21  1     00010101 00000001
               IADD          96        01100000
               ISTORE C      54  2     00110110 00000010
D = A - B;     ILOAD A       21  0     00010101 00000000
               ILOAD B       21  1     00010101 00000001
               ISUB         100        01100100
               ISTORE D      54  3     00110110 00000011
E = A AND B;   ILOAD A       21  0     00010101 00000000
               ILOAD B       21  1     00010101 00000001
               IAND         126        01111110
               ISTORE E      54  4     00110110 00000100
F = A OR B;    ILOAD A       21  0     00010101 00000000
               ILOAD B       21  1     00010101 00000001
               IOR          176        10110000
               ISTORE F      54  5     00110110 00000101
               HALT         255        11111111
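
The hand-assembly steps in the table are mechanical enough to sketch in a few lines of Python. The sketch below is only an illustration: the opcodes and variable offsets are copied from the table, while the names OPCODES, OFFSETS, hand_assemble, and to_binary are made up for this example.

```python
# Decimal opcodes copied from the table above.
OPCODES = {"ILOAD": 21, "ISTORE": 54, "IADD": 96, "ISUB": 100,
           "IAND": 126, "IOR": 176, "HALT": 255}

# Offsets of the local variables A through F, as used in the table.
OFFSETS = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5}

def hand_assemble(line):
    """Translate one assembly line into a list of decimal byte values."""
    parts = line.split()
    code = [OPCODES[parts[0]]]
    if len(parts) > 1:                  # the instruction has a variable operand
        code.append(OFFSETS[parts[1]])
    return code

def to_binary(code):
    """Write each byte as eight binary digits, as very early machines required."""
    return " ".join(format(byte, "08b") for byte in code)

for line in ["ILOAD A", "ILOAD B", "IADD", "ISTORE C", "HALT"]:
    code = hand_assemble(line)
    print(f"{line:<10} {code}  {to_binary(code)}")
```

Each printed row should match the corresponding row of the table: the mnemonic, its decimal bytes, and the same bytes in binary.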

The Assembler

It wasn't long before machine language programmers started writing programs to speed up the process of writing programs. In particular, they wrote a program called an assembler that automatically converted assembly language programs into binary machine language programs, eliminating the tedious hand-assembly process. This allowed the programmer to think solely in terms of assembly language, which was much easier than thinking in terms of machine language. An assembler is to assembly language what a compiler is to a Java-like language: both convert their source code into machine language. (Some compilers convert their source code to assembly code and then run an assembler to convert the assembly code to machine language code.)

An assembler has several basic tasks:

  1. Convert instruction names (often called mnemonics) to machine language opcodes.
  2. Convert variable and constant identifiers to numeric offsets.
  3. Convert labels used in branching instructions to numeric offsets.
  4. Convert method identifiers to numeric addresses.
  5. Initialize constants.
  6. Allocate memory for method parameters and local variables.

Let's consider this simple example:

Java-Like                       Assembly

final int SIX = 6;                  SIX 6
final int OBJREF = 0;               OBJREF 0

main {                          .main
    int a, b, min;                  .var
                                        a b min

    a = SIX;                        ldc_w SIX
                                    istore a
    b = 10;                         bipush 10
                                    istore b
    min = Minimum(a, b);            ldc_w OBJREF
                                    iload a
                                    iload b
                                    invokevirtual Minimum
                                    istore min
}                               .end-main

int Minimum(int n1, int n2) {   .method Minimum(n1 n2)
    int min;                        .var
                                        min

    if (n2 <= n1)                   iload n1
                                    iload n2
                                    isub
                                    iflt ELSE
        min = n2;                   iload n2
                                    istore min
                                    goto END
    else                          ELSE:
        min = n1;                   iload n1
                                    istore min
                                  END:
    return min;                     iload min
                                    ireturn
}                               .end-method

One of the simplest things the assembler has to do is convert the names of the instructions (mnemonics) into the corresponding opcodes since there is a one-to-one relationship between names and opcodes.

Converting constant, variable, and parameter identifiers into offsets is actually pretty easy too. The offsets are based on the position of the identifier in the list of identifiers, with the first identifier at offset zero. In our example program, the constant identifier SIX corresponds to offset 0 and the constant identifier OBJREF corresponds to offset 1. In main, the variables a, b, and min correspond to offsets 0, 1, and 2 respectively. Notice that in a method, the parameter identifiers precede the local variable identifiers in the program listing. Consequently, the identifiers n1, n2, and min correspond to offsets 0, 1, and 2 respectively.
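
Because an offset is just a position in a list, the conversion can be sketched with a dictionary built from list indexes. This is a minimal illustration, not the code of any particular assembler; the function name offsets_for is made up for this example.

```python
def offsets_for(identifiers):
    """Map each identifier to its position in the list, starting at 0."""
    return {name: position for position, name in enumerate(identifiers)}

# Constants are numbered in the order they are declared.
constant_offsets = offsets_for(["SIX", "OBJREF"])

# Local variables of main.
main_offsets = offsets_for(["a", "b", "min"])

# In a method, parameters come first and local variables follow.
parameters = ["n1", "n2"]
local_variables = ["min"]
minimum_offsets = offsets_for(parameters + local_variables)

print(constant_offsets)
print(main_offsets)
print(minimum_offsets)
```

Note that main and Minimum each have their own table, so the identifier min can appear in both without any conflict.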

Converting labels to offsets requires a bit more work but, conceptually, is the same as what we do when we hand-assemble a program. The offset is the number of bytes from the branching instruction up to and including the opcode of the first instruction at the new location. Consider the "if-less-than-zero" instruction in our example above. The IFLT opcode is followed by a 2-byte operand, an iload (1-byte opcode, 1-byte operand), an istore (1-byte opcode, 1-byte operand), and a goto (1-byte opcode, 2-byte operand), for a total of 9 bytes. Counting the opcode of the instruction at ELSE as well gives 10, so the offset associated with the ELSE label in this instruction is 10.
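
The byte counting above can be sketched in Python. This is only an illustration: the SIZES table and the addresses helper are made-up names, and the instruction sizes are the ones listed in the paragraph. It also assumes that the distance we want is the address of the target opcode minus the address of the branch opcode.

```python
# Instruction sizes in bytes (opcode plus operand), as listed in the text.
SIZES = {"iload": 2, "istore": 2, "iflt": 3, "goto": 3}

def addresses(mnemonics, start=0):
    """Return the address of each instruction's opcode, in order."""
    result = []
    address = start
    for mnemonic in mnemonics:
        result.append(address)
        address += SIZES[mnemonic]
    return result

# From the branch in Minimum up to the first instruction at ELSE.
fragment = ["iflt", "iload", "istore", "goto", "iload"]
opcode_addresses = addresses(fragment)

branch = opcode_addresses[0]        # address of the iflt opcode
target = opcode_addresses[-1]       # address of the opcode at ELSE

bytes_between = target - branch - 1  # bytes after the iflt opcode, before the target
distance = target - branch           # target address minus branch address

print(bytes_between, distance)
```

Here bytes_between is the count of bytes that sit between the two opcodes, and distance is one more than that because it lands on the target opcode itself.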

The label itself ends with a colon (as in "ELSE:" and "END:") and corresponds to the address of the next opcode to be executed. In practice, the assembler might store the machine language program as an array of bytes starting at index 0. (You can't help but notice that this array is nothing more than a simulation of the block of memory in the computer where a machine language program resides.) The index into this array always points at the next available location. Consequently, as soon as the assembler encounters a label, it associates the label with the current value of the array index. 

Notice that when a backward branch occurs, the operand of the branching instruction can be determined immediately. The offset is the value of the label (which has already been encountered) minus the address of the branching instruction. This result will be negative. Forward branches are more complicated because the assembler has not yet encountered the corresponding label and, as a result, does not know what its value will be. Many assemblers require a second pass through the assembly language program in order to resolve these forward references. Notice that requiring constants to be declared at the very beginning of the program and requiring parameters and variables to be declared before executable code eliminates forward references for these identifiers.
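
A two-pass approach can be sketched as follows. This is a simplified illustration under stated assumptions, not a real assembler: it handles only the four instructions whose sizes were given earlier, every instruction is assumed to have exactly one operand, and the names SIZES, BRANCHES, and resolve_branches are made up for this example. The offset is computed as the value of the label minus the address of the branching instruction, as described above.

```python
SIZES = {"iload": 2, "istore": 2, "iflt": 3, "goto": 3}   # bytes per instruction
BRANCHES = {"iflt", "goto"}

def resolve_branches(lines):
    """Return {address of branch: offset} for a list of assembly lines."""
    # Pass 1: give every instruction an address and record each label.
    labels = {}
    placed = []                          # (address, mnemonic, operand)
    address = 0
    for line in lines:
        line = line.strip()
        if line.endswith(":"):           # a label names the next available address
            labels[line[:-1]] = address
            continue
        mnemonic, operand = line.split()
        placed.append((address, mnemonic, operand))
        address += SIZES[mnemonic]

    # Pass 2: every label is known now, so forward references can be resolved.
    offsets = {}
    for address, mnemonic, operand in placed:
        if mnemonic in BRANCHES:
            offsets[address] = labels[operand] - address
    return offsets

forward = ["iflt ELSE", "iload n2", "istore min", "goto END",
           "ELSE:", "iload n1", "istore min",
           "END:", "iload min"]
backward = ["TOP:", "iload n1", "goto TOP"]   # hypothetical loop, not from the example

print(resolve_branches(forward))
print(resolve_branches(backward))
```

The backward branch produces a negative offset, and it could have been resolved during the first pass because TOP had already been recorded when the goto was reached. The two forward branches cannot be resolved until the first pass has finished.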

The code for methods follows the code for main (in both assembly language and machine language) since the execution of main must start at address 0 in memory. The value of a method identifier is the offset into the constant pool for the address of the first byte in the method. Method addresses follow the declared constants in the constant pool section of memory. In our example then, the value of "Minimum" is 2 (because it follows SIX and OBJREF in the constant pool) and the address of the first byte in "Minimum" is stored at CPP+2.
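
The layout of the constant pool can be sketched like this. It is only an illustration: build_constant_pool is a made-up name, and the address 0x40 for the first byte of Minimum is a hypothetical value chosen for the example, since the real address depends on the length of main.

```python
def build_constant_pool(constants, methods):
    """Place declared constants first, then method addresses.

    constants: list of (identifier, value) pairs
    methods:   list of (identifier, address of the method's first byte) pairs
    Returns the pool contents and a table of identifier -> offset from CPP.
    """
    pool = []
    offsets = {}
    for name, value in constants + methods:
        offsets[name] = len(pool)     # next free slot in the pool
        pool.append(value)
    return pool, offsets

pool, offsets = build_constant_pool(
    constants=[("SIX", 6), ("OBJREF", 0)],
    methods=[("Minimum", 0x40)],      # 0x40 is a hypothetical method address
)

print(offsets["Minimum"])             # the value of the identifier Minimum
print(pool[offsets["Minimum"]])       # what is stored at CPP+2
```

With this layout, "invokevirtual Minimum" would be assembled with the operand 2, and the machine would look at CPP+2 to find where the method begins.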

Assembled programs are loaded into memory by a loader. Loading the code itself is straightforward because all the loader has to do is read the code and copy it into memory starting at address 0. Constants, however, go into the constant pool instead. This means that the output file generated by the assembler must somehow keep code and constants separate.

In order for the INVOKEVIRTUAL instruction to set up the stack frame for a method invocation, it must know the number of parameters that were pushed onto the stack and the number of local variables. Both of these numbers are stored as 2-byte integers as part of the method itself and precede the opcode of the first executable instruction. (You may have noticed that I was careful to say that the method address in the constant pool was the address of the first byte in the method as opposed to the address of the first instruction. Now you know why.)
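
The bytes at the start of a method can be sketched as follows. Several details here are assumptions made for the illustration and are not stated in the text: the two counts are written high byte first, the parameter count comes before the local variable count, and Minimum is counted as having two parameters and one local variable. The name method_image is also made up.

```python
HEADER_SIZE = 4   # two 2-byte integers precede the first opcode

def method_image(num_params, num_locals, body):
    """Prefix a method body with its parameter and local variable counts."""
    header = num_params.to_bytes(2, "big") + num_locals.to_bytes(2, "big")
    return header + bytes(body)

# The body here is just "iload n1" (opcode 21, offset 0), for brevity.
image = method_image(num_params=2, num_locals=1, body=[21, 0])

print(list(image))
print(image[HEADER_SIZE])   # the opcode of the first executable instruction
```

The address stored in the constant pool would point at the first byte of this image, and the first instruction would be found HEADER_SIZE bytes later.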