Skip to content

Latest commit

 

History

History
739 lines (498 loc) · 23.4 KB

README.md

File metadata and controls

739 lines (498 loc) · 23.4 KB

Alia

The ultimate Alia compiler and interpreter.

Written in TypeScript.

Alia follows c-like syntax. Example code:

int x;

int double(int x) {
    return x * 2;
}

int main() {
  output "Hello World!\n";
  
  // Function Pointer
  fn(int)->int f = double;
  
  output "Input a number: ";
  input x;
  
  if(x == 42) {
    output "The answer to the \"ultimate question of life, the universe, and everything,\"\n";
  }
  
  return f(x);
}

A larger example is provided in ./infile.alia. It includes an implementation of a tiny testing library, that is then used to test the very language it is written in.

Alia can be compiled down to x64, MIPS, and LLVM assembly. It can also be run in an interpreter.

Contents

Prerequisites

  • Node 18
  • NPM 8

Installation

  1. Follow llvm-bindings installation instructions

    Also, you may have to explicitly specify the location of LLVM , if it isn't found automatically.

  2. Install alia dependencies:

    npm install

Running Compiler

To see available options, run the script with the --help argument:

./alia --help

Compiling to x64

Example call:

./alia infile.alia -o outfile.s

By default, compilation runs intermediate code optimization and final code optimization (using peephole optimization and dead code elimination). You can disable those by running the compiler with the --dontOptimize flag (useful when trying to debug the output)

After compiling, you need to run a linker. This can be done using the make link command, which accepts optional arguments:

make link OUTFILE=outfile.s PROG=dev.prog

Where:

  • OUTFILE is the name of the assembly file generated in the previous step (outfile.s is the default value).
  • PROG is the name of the executable for the program (dev.prog is the default value).

Example of generated x64 assembly:

Example x64 output

Full example

Compiling to MIPS

Example call:

./alia infile.alia -m mips.asm 2> errors.txt

Then, open mips.asm in MARS and run it.

MIPS assembly code was written to work with MARS 4.5

NOTE: when compiling to MIPS, registers are stored as 32-bit values to simplify the multiplication and division operations

Example of generated MIPS assembly:

Example MIPS output

Full example

Compiling to LLVM Assembly

Workflow for an x64 machine:

# Compile to LLVM assembly
./alia infile.alia -l llvm.ll

# Optimize the LLVM assembly
# NOTE, you will have to change the paths to the LLVM binaries
/home/a807d786/llvm_dist/bin/opt -O3 -S llvm.ll -o llvm.opt.ll

# Compile to target machine assembly
/home/a807d786/llvm_dist/bin/llc llvm.opt.ll -o llvm.s

# Create an executable and run it
make link OUTFILE=llvm.s

Example LLVM output:

Example LLVM output

Full example

Running Interpreter

Start the interpreter:

./aliainterp

The interpreter supports several meta-commands. Type :help to see those.

Several restrictions are lifted when running Alia in an interpreter:

  • Semicolon for the last line is optional
  • Variable re-declarations are treated as shadowing the variable, rather than emit an error
  • Statements can be included outside of the function
  • output statements always include a trailing newline

Example Interpreter session:

Example Interpreter session

Testing

Jest is used for unit testing.

You can run it like this:

npm test

For example, automated tests for the interpreter are in ./src/interpreter/__ tests__/runtime.test.ts

Example test run:

Example test run

Architecture

A deep dive into the architecture of the compiler follows.

alia consists of 5 parts. When compiling a program, you go through each of these stages in order:

Lexical Analysis (Tokenizing)

Lexical analysis means grouping a stream of letters or sounds into sets of units that represent meaningful syntax

This process converts the source code into a stream of tokens. Example of tokens:

  • Reserved keywords (if, else, while, etc.)
  • Identifiers (foo, bar, etc.)
  • Operators (+, -, etc.)
  • Literals (1, 2, 3, etc.)
  • Comments (// this is a comment)

Source code for tokenizer is in ./src/tokenize

Each file in that directory has a comment at the top describing the purpose of the file.

You can run the compiler in tokenizing-only mode like this:

./alia infile.alia --tokensOutput tokens.out

This would output the token stream into ./tokens.out text file.

Example token stream:

Example token stream

Syntactic analysis (Parsing)

The syntactic analysis makes sure that the token stream created in the previous step conforms to the grammar, and sets the stage up for Syntax Directed Translation.

The grammar for the language is defined using Context Free Grammar.

See ./src/grammar for the grammar definition.

Three parsers have been implemented:

  • SLR Parser (default)

    This parser is used by default. You can explicitly specify it like this:

    ./alia infile.alia --parser SLR -m mips.asm

    The source code for the SLR parser is in ./src/slrParser

    More information about SLR parsers

    Additionally, there is an option to generate a Graphviz graph based on the SLR automata.

  • CYK Parser

    Benefits:

    • Supports ambiguous grammar

    Drawbacks:

    • Runs as n^3 complexity.
    • Requires converting grammar to Chomsky Normal Form, thus was harder to implement
    • Hard to convert the result of the CYK parser into a parse tree

    For these reasons, the CYK parser was implemented only out of curiosity rather than for practical reasons. You can run the CYK parser to see whether parsing resulted in any errors, but you cannot move on to the "Syntax directed translation" if CYK parser was used.

    You can run the CYK Parser like this:

    ./alia infile.alia --parser CYK

    The program will exit with exit code 0 on successful parsing.

    Otherwise, it will print an error and exit with 1.

    The source code for the CYK parser is in ./src/cykParser

    More information about CYK parsers

  • LL(1) Parser

    LL(1) parser was implemented in a separate project and is available at maxpatiiuk/leto.

    Unlike this project, maxpatiiuk/leto is geared towards making it trivially easy to customize the grammar and the syntax-directed translation.

Syntax-directed translation

Parse tree generation

As part of running an SLR parser, a parse tree is generated. The parse tree is an intermediate form between the token stream and the abstract syntax tree. It closely resembles the token stream and can be though of as the same token stream, but with all nodes rearranged to have some children and parent nodes assigned.

More information about parse trees

(optional) Prettify the code

There is an ability to format the input code based on just the AST.

The result is not always as great as formatting the AST, but it stays closer to the original code (preserves redundant parenthesis and other features that don't change the semantics). It is also faster than formatting based on AST.

Source code is in ./src/unparseParseTree

Example call:

./alia infile.alia --unparseMode parseTree --unparse pretty-output.alia
Abstract syntax tree (AST) generation

A tree representation of the code that constitutes a processed parse tree, with non-essential information striped (like parenthesis and whitespace).

This is later used for name analysis, type analysis, and intermediate code generation. AST is also where the first optimization steps are taken.

More information about ASTs

The source code for the AST generation is in ./src/ast

The definitions of all AST nodes are in ./src/ast/definition/

(optional) Prettify the code

AST can be used as an excellent source for pretty-printing the code.

Example call:

./alia infile.alia --unparse pretty-output.alia

(you don't have to specify the --unparseMode ast argument, as it is the default unparse mode)

Semantic analysis

Name analysis annotates the AST with information about used identifiers, their scope, and their type. This is the last stage where compile-time errors can be thrown.

Name analysis

Name analysis is the process of assigning a scope to each identifier in the program. It also checks for duplicate declarations and undeclared identifiers.

The definitions of all AST nodes are in ./src/ast/definition/

Each AST node has a nameAnalysis method that is called during name analysis.

It collects information about used variables, and their scope and reports errors.

Example name analysis output:

Example name analysis output

Type analysis

Type analysis is the process of assigning a type to each identifier in the program. It also checks for type mismatches and other type-related errors.

More information about type analysis

The definitions of all AST nodes are in ./src/ast/definition/

Each AST node has a typeAnalysis method that is called during type analysis.

It collects information about the type of each variable, and function and reports any incorrect type conversions.

Example type analysis output:

Example type analysis output

You can also output the prettified source code with name analysis and type analysis annotations like this:

./alia infile.alia --namedUnparse named.alia

This is useful for debugging.

Example annotated output: Example annotated output

Intermediate code generation

The intermediate code is generated from the AST. The intermediate code is a pseudo-assembly language (also called 3AC or Quads) that is used for intermediate code optimization and generation for MIPS, x64 and LLVM assembly code.

More information about 3AC

The definitions of all AST nodes are in ./src/ast/definition/.

Each node contains a toQuads() method that is used for converting AST node to a list of Quads.

All Quads are defined in the ./src/quads directory.

For debugging purposes, you can make the compiler output the intermediate code:

./alia infile.alia -a 3ac.txt

Example Quads output:

Example Quads output

Full example

Intermediate code optimization

During the generation of quads, the first steps of optimization are taken.

Mainly, this stage is concerned with dead code elimination (like removing useless arithmetic operations). This code also replaces + 1 and - 1 operations with post-increment and post-decrement, as those, can be more efficiently represented in x64 assembly.

Note, a whole lot more optimization could be done at this stage (using frame analysis, more dead code elimination, reducing stack usage, and reusing registers more), but those are outside of the scope of this project.

You can disable all optimizations by passing a --dontOptimize argument. This is useful for debugging as that way generated intermediate code is a 1:1 representation of the AST.

Final code generation

The final code is assembly. The compiler can generate code for MIPS (32-bit), x64 (64-bit) and LLVM (64-bit).

The code of the compiler is trying to output universal assembly as much as possible. This helps ensure consistency between the assembly code for different architectures, as well as allows for sharing optimizations between architectures.

All Quads are defined in the ./src/quads directory.

Each Quad has a toMips() and toMipsValue() methods for converting to assembly and toAmd() and toAmdValue() for converting to x64 assembly. These are described in more detail in ./src/quads/definitions/index.ts.

To simplify optimization and reduce the possibility of typos, an intermediate level has been added. Rather than converting Quads directly into assembly strings, they are converted into instances of the Instruction class.

For example, it is much easier to permute an object, than it is to permute a string. It also increases type safety, as instructions can specify the types and order of arguments.

Documentation on compiling and executing generated MIPS assembly

Documentation on compiling and executing generated x64 assembly

Documentation on compiling and executing generated LLVM assembly

Final code optimization

Final code optimization consists of peephole optimization

The source code is located in ./src/instructions/optimize.

The optimization is made as platform-independent as possible.

You can disable all optimizations by passing a --dontOptimize argument. This is useful for debugging as that way generated intermediate code is a 1:1 representation of the AST.

Execution

Runtime environment

MIPS

MIPS runtime environment is MARS

The code takes advantage of its syscalls.

x64 and LLVM

x64 runtime environment is any x64 capable machine. LLVM assembly can be compiled to run on any machine.

The generated assembly code does not make syscalls directly, but rather uses a tiny C helpers library.

This greatly simplifies the code generation process, as printing numbers/strings to the screen or receiving input from the IO stream using bare syscalls is too hardcore.

Interpreter

Besides compiling down to assembly, a TypeScript-based interpreter is also available.

Source code is located in ./src/interpreter

Usage instructions

The definitions of all AST nodes are in ./src/ast/definition/.

Each AST node has an evaluate() method that is called during interpretation.

Each variable declaration has a .value property, that stores the current value.

Since function pointers are supported, the variable value may point to function declaration.

alia Documentation

Language Specification includes detailed description of the Lexical system, Syntactic analysis and the Type System.

Development Documentation

alia is built using TypeScript.

You can run TypeScript in Node.js using on-the-fly transpilation like this:

npx ts-node-esm src/index.ts --your --arguments --here

Node.js debugger could also be used. Instructions for WebStorm/PyCharm and VS Code

As far as dependencies, chalk is used in the interpreter for adding color and commander is used in the compiler for parsing the CLI arguments.

For debugging generated MIPS code, MARS is very helpful - it provided a step-by-step debugger.

For debugging generated x64 gdbtui is very helpful.

Just start the debugger like gdbtui ./dev.prog, then enter the assembly layout using layout asm, and then follow the assembly debugging commands .

Appendix 1: Extending the language

A high-level overview of the steps necessary to extend the Alia language:

If you are interested in a language that is easier to customize, take a look at maxpatiiuk/leto.

Unlike this project, maxpatiiuk/leto is geared towards making it trivially easy to customize the grammar and the syntax-directed translation.

Appendix 2: Generating a graph of the grammar

Optional step of the SLR parser that helps to visualize (and debug) the grammar.

Convert the SLR automata into a Graphviz graph definition, which in turn can be converted into an SVG or a PNG.

NOTE: due to grammar being quite large, the resulting graph is enormous.

To make it more practical, I commented out large portions of the grammar before running the graph generation to reduce the size of the graph.

First, create a parser.dot text file (in the DOT format):

./alia infile.alia --diagramPath parser.dot

Then, convert it into a parser.svg image (assuming dot is installed):

dot -Tsvg parser.dot -o parser.svg

Tested with dot v6.0.1

Example of a tiny slice of the generated graph: Example of a tiny slice of the generated graph

Appendix 3: Sources

The following pages have been helpful during the development of the compiler.

CYK Parser

SLR Parser

MIPS

x64

LLVM

Misc

Appendix 4: Naming

The language is named after "Alia Atreides" from the Dune book series.

My #1 source of sci-fi sounding terms is the Dune terminology page. Also, glossaries of sci-fi terms are helpful (techrepublic.com, bowdoin.edu)