This repository contains the header files for a compiler front-end and the implementation files for a compiler back-end built atop the LLVM C++ API. The front-end implementation files are not public due to their being part of a graded project for the spring 2020 compilers elective at the University of South Florida; the professor requests that we keep those assignments private.
The source language for this compiler is Diminished Java (DJ), a language devised by Dr. Ligatti specifically for his compilers course. His language documentation contains more information about the features and capabilities of DJ.
The development of the LLVM-related code herein took place over the course of
ten weeks, as part of an independent study assignment with supervision from Dr.
Ligatti. The compiler translates the original AST (written in C) into one in
C++, which is more amenable to the C++ API provided by LLVM. The compiler then
walks the translated AST to generate an object file; dj2ll
calls clang
to
deal with linking.
The files provided here include:
- Header and implementation files for all LLVM-related work (
codeGen*
files), - Header files of my own design (the implementation of these deals directly with the original AST, so they remain private),
- Header files provided by Dr. Ligatti to guide the design of the original compiler,
- Test programs provided by Dr. Ligatti, and
- A Python script to run the compiler against the test programs.
The releases tab provides an executable version of dj2ll
. I built the
executable using Version 10.0.1 of clang
and clang++
on Arch Linux. The
compiler should work on any Linux system that has access to clang
and the LLVM
libraries. dj2ll
knows about four flags:
--skip-codegen
: lex, parse, and typecheck, but skip code generation.--run-optis
: create an optimized executable.--emit-llvm
: output to the console the LLVM IR produced by the source file.--verbose
: output the translated AST.
If you run dj2ll
on some file test.dj, it will produce an executable with name
test
in the same directory from which you called the compiler.
The llvm_includes.hpp
file contains significant global variables used either
by LLVM methods or by the code generation methods. The NamedValues
map maps a
string (representing methods and “main”) to the appropriate symbol table that
method should use. This is a limitation of my approach! If a user declares a
class named (exactly) C1_method_bar
and a class C1
exists that implements a
method bar
, a colision will occur. Because the global number of DJ users is
low, I felt this was an acceptable risk. (I get the feeling this is part of the
reason why clang
and gcc
mangle function names…)
The code in llast.cpp
and its header file define the interface for the new
AST. I opted to design the new AST as a series of DJNodes
, of which
DJProgram
and DJExpression
are the main child classes. All other DJ
expression types are child classes that inherit from DJExpression.
I chose
early on to not translate the original symbol tables, which turned out to be a
small mistake; it would have made much of code generation more simple to have
translated these data structures into C++ as well.
The code in translateAST.cpp
takes the validated AST produced by the
typechecker and translates it into the LLAST.
The files codegen.cpp
and codeGenClass.cpp
contain DJExpression
and DJProgram
methods and related helper functions for code generation.
I implemented classes as structs, with their methods implemented as functions that take a pointer to a particular structure. I laid out structs in memory like this (for some example class C):
- pointer to C (the
this
pointer) - class ID (this is the same as the class’ number in the
classesST
symbol table) - any fields
This comes into play when accessing struct fields using LLVM’s notorious
get element pointer instruction.
To get the value of the first field declared in a struct, we would create a GEP
using index (0,2), with the offset being for the this
pointer and the class ID
as layed out above.
Due to the restrictions of the LLVM IR type system, I implemented virtual
dispatch tables as nine separate functions, one for each of the possible pairs
of return type and parameter type that can occur in DJ. When a return type or
parameter type is a class, I casted the value to one of Object
type. I
confirmed that this would have no effect on accessing class fields later; a cast
back to the original type is all that it took. When inside a VTable function, a
chain of if/then/else statements accomplishes the dispatch. Even for the two DJ
class declarations below, the VTable becomes verbose:
class C1 extends Object {
nat whoami(nat unused) { printNat(1); }
}
class C2 extends C1 {
nat whoami(nat unused) { printNat(2); }
}
define i32 @natVTablenat(%Object* %0, i32 %1, i32 %2, i32 %3) {
entry:
%4 = bitcast %Object* %0 to %C1*
%5 = getelementptr %C1, %C1* %4, i32 0, i32 1
%6 = load i32, i32* %5
%7 = icmp eq i32 %1, 1
%8 = icmp eq i32 %6, 1
%9 = and i1 %7, %8
%10 = icmp eq i32 %2, 0
%11 = and i1 %9, %10
br i1 %11, label %then, label %else
then: ; preds = %entry
%12 = call i32 @C1_method_whoami(%C1* %4, i32 %3)
ret i32 %12
else: ; preds = %entry
%13 = bitcast %C1* %4 to %C2*
%14 = getelementptr %C2, %C2* %13, i32 0, i32 1
%15 = load i32, i32* %14
%16 = icmp eq i32 %1, 1
%17 = icmp eq i32 %15, 2
%18 = and i1 %16, %17
%19 = icmp eq i32 %2, 0
%20 = and i1 %18, %19
br i1 %20, label %then1, label %else2
then1: ; preds = %else
%21 = call i32 @C2_method_whoami(%C2* %13, i32 %3)
ret i32 %21
else2: ; preds = %else
%22 = getelementptr %C2, %C2* %13, i32 0, i32 1
%23 = load i32, i32* %22
%24 = icmp eq i32 %1, 2
%25 = icmp eq i32 %23, 2
%26 = and i1 %24, %25
%27 = icmp eq i32 %2, 0
%28 = and i1 %26, %27
br i1 %28, label %then3, label %else4
then3: ; preds = %else2
%29 = call i32 @C2_method_whoami(%C2* %13, i32 %3)
ret i32 %29
else4: ; preds = %else2
ret i32 0
I implemented instanceof tables in a manner similar to virtual dispatch tables; namely, a mess of chained if/then/else statements. The type system was not a concern here, and so there is only one ITable function in the resulting IR.