An Incremental Approach to Compiler Construction (2006)
October 23, 2023An Incremental Approach to Compiler Construction shows an incremental approach to build a compiler that accepts a large subset of the Scheme programming language. The compiler produces assembly code for the Intel-X86 architecture. Because real-life compilers are too complex to serve as an educational tool, the gap between real-life compilers and the educational toy compilers is too wide.
The goal of the paper is to break the barrier. The development of the compiler is broken down into 24 incremental steps. Every step yields a fully working compiler for progressively expanding a subset of Scheme.
The compiler is a function in Scheme, compile-program, that accepts a subset of Scheme.
The function calls emit to produce assembly code that can be called by C.
The compiler uses the upper bits of the machine word for storing the value, and the rest of the space to encode the type information.
For example, the lower two bits of the representation of fixednums are 00.
Characters are tagged with 00001111, boolean are given a 0011111, and the empty list is given the value 00101111.
(define (compile-program x)
(define (immediate-rep x)
(cond
((integer? x) (shift x fixnum-shift))
...))
(emit "movl $~a, %eax" (immediate-rep x))
(emit "ret"))
#include <stdio.h>
#define fixnum_mask 3
#define fixnum_tag 0
#define fixnum_shift 2
...
int main(int argc, char** argv){
int val = scheme_entry();
if((val & fixnum_mask) == fixnum_tag){
printf("%d\n", val >> fixnum_shift);
} else if(val == empty_list){
printf("()\n");
} ...
return 0;
}
All local variables in the let form are saved on the stack.
The following function compiles the let form to assembly.
si represents the index of the current stack, and rhs means the right-hand-side expression.
(define (emit-expr x si env)
(cond
((immediate? x) ...)
((variable? x)
(emit "movl ~a(%esp), %eax" (lookup x env)))
((let? x)
(emit-let (bindings x) (body x) si env))
...))
(define (emit-let bindings body si env)
(let f ((b* bindings) (new-env env) (si si))
(cond
((null? b*) (emit-expr body si new-env))
(else
(let ((b (car b*)))
(emit-expr (rhs b) si env)
(emit "movl %eax, ~a(%esp)" si)
(f (cdr b*)
(extend-env (lhs b) si new-env)
(- si wordsize)))))))
Pairs, vectors, and strings are allocated in the heap because they do not fit in a machine word.
Similar to fixnums, every pointer to a heap-allocated object is tagged by a 3-bit mask(e.g., 001 for pairs, 010 for vectors).
The pointers are allocated at double-word boundaries so that the tag and the value of the pointer do not interfare.
The following code is an implementation of (cons 10 20).
movl $40, 0(%esi)
movl $80, 4(%esi)
movl %esi, %eax
orl $1, %eax
addl $8, %esi
The language supports procedure calls.
lvar is bound to a code expression containing a list of formal paramters and a body expression.
<Prog> ::= (labels ((lvar <LExpr>) ...) <Expr>)
<LExpr> ::= (code (var ...) <Expr>)
<Expr> ::= immediate
| var
| (let ((var <Expr>) ...) <Expr>)
For each code expression, the label is first emmitted, followed by the code of the body.
The following views show the stack from the Caller’s side and the Callee’s side.

The image above is cited from the paper.