A tiny compiler for a language featuring LL(2) grammar. With Lexer, Parser, ASM-like codegen and VM. Complex enough to give you a flavour of how the "real" thing works whilst not being a mere toy example.
- Lexer (
lex.h
) - Parser (
parser.h
) - Assembler like code generator (
gen.h
) - Virtual machine (
vm.h
) - Symbol table (
sym.h
) - Abstract syntax tree (
ast.h
)
It is by no means a complete industry standard implementation. Some parts are simplified for the sake of better understanding.
$ make
$ ./compiler program.src
cath1 = 3;
cath2 = 4;
hypsquare = cath1 * cath1 + cath2 * cath2;
Execution result:
hypsquare = 25
Generated ASM:
PUSH 3
WRITE cath1
PUSH 4
WRITE cath2
READ cath1
READ cath1
MUL POP, POP
READ cath2
READ cath2
MUL POP, POP
ADD POP, POP
WRITE hypsquare
The language grammar in EBNF:
program = expr, ";", { program } ;
expr = id, "=", expr | term, { ("+"|"-"), term } ;
term = factor, { ("*"|"/"), factor } ;
factor = id | num | "(", expr, ")" ;
This grammar can be parsed with an LL (left-to-right, leftmost derivation) parser with only 2 tokens of lookahead. In other words, we only need to examine at most 2 upcoming lexical tokens to unambiguously determine which production rule to apply.
If we look at a single id
token, we can't tell whether it's an expr
or a factor
. We need to look at one more token to remove the ambiguity. That makes it an LL(2) grammar.
It is also a context-free grammar, which means that on the left side of every production rule there is a single non-terminal symbol.