
The output parser

Hadron67 edited this page Feb 28, 2018 · 2 revisions

The output source code consists of two parts: the tables for the tokenizer and parser, and the parser's class definition. All internal functions and variables start with jj, so avoid using the prefix jj in action blocks.

The specific source code depends on the output language.

TypeScript

For TypeScript, the output parser consists of the following:

  • DFA tables and parse table.
  • Interface definitions.
  • Token class.
  • A function that creates the parser.
  • Some enum objects.
  • Some functions.

JSDoc comments will also be generated.

Invoke the parser

To invoke the parser, create a parser instance with the function createParser(), which returns a closure object of type Parser, defined as an interface in TypeScript:

let parser = createParser();

The name of the interface and function can be redefined using an option className in %option. For example, if className = "MyParser", there will be interface MyParser and function createMyParser(): MyParser.

Next, initialize it by calling init():

parser.init();

This method initializes the internal variables in the closure object. By default it takes no arguments. You can use the %init directive to customize its arguments and body so that it initializes the variables you defined with %extra_arg.

Finally, pass an input to the parser with parse(input). The argument can be either a string:

parser.parse("5 + 9 * 8");

or a stream-like object ParseInput:

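var s = "5 + 9 * 8";
var i = 0;
parser.parse({
    // char code at the cursor, or null at end of input
    current: () => i < s.length ? s.charCodeAt(i) : null,
    next: () => i++,
    isEof: () => i >= s.length,
    // push `text` back onto the input by rewinding the cursor
    backup: (text: string) => i -= text.length
});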

Here backup is used to push a string back onto the input.
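The stream-like form can be wrapped in a small helper when the source is a plain string. The following is a minimal sketch: the ParseInput interface is re-declared locally from the four members shown above (the generated file supplies its own declaration, whose exact return types may differ), and stringInput is a hypothetical helper name that is not part of the generated code.

```typescript
// Local re-declaration for illustration only; the generated parser file
// defines its own ParseInput. Member names are taken from the example above,
// the return types are assumptions.
interface ParseInput {
    current(): number | null;
    next(): void;
    isEof(): boolean;
    backup(text: string): void;
}

// Hypothetical helper: wraps a string in a ParseInput.
function stringInput(source: string): ParseInput {
    let pos = 0;
    return {
        // Char code at the cursor, or null at end of input.
        current: () => (pos < source.length ? source.charCodeAt(pos) : null),
        // Advance the cursor by one character.
        next: () => { pos++; },
        isEof: () => pos >= source.length,
        // Push `text` back by rewinding the cursor by its length.
        backup: (text: string) => { pos -= text.length; },
    };
}
```

With the generated parser in scope, the result would be passed as parser.parse(stringInput("5 + 9 * 8")).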

Calling halt() stops the parser, and no more input will be processed. This function is usually called in an error handler: when an error is emitted, the parser otherwise ignores it and continues parsing, which may result in a flood of meaningless follow-on errors.

You can also read tokens one at a time by calling nextToken(). In that case, first call load(input) to specify the input source. Its argument is exactly the same as that of parse(input).

parser.load('5 + 8 * 9');
var t: Token;
while((t = parser.nextToken()).id !== TokenKind.EOF){
	console.log(`read token: ${t.toString()}`);
}

nextToken() scans the next token, passes it to the parser, and returns it. Since only one token instance is kept inside the parser, this function returns a reference to that instance. If you need to use a token after the next one has been read, call clone() on it to make a copy:

var tokens: Token[] = [];
/* ... */
var t = parser.nextToken().clone();
tokens.push(t);
parser.nextToken();
/* ... */

Line terminator

A specified line terminator is necessary to generate correct line information for every token. Call setLineTermiator(lt: LineTerm) or getLineTermiator() to set or get the line terminator, where LineTerm is an enum type:

| Name | Description |
| --- | --- |
| LineTerm.CR | \r, used in classic Mac OS |
| LineTerm.LF | \n, used in *nix |
| LineTerm.CRLF | \r\n, used in Windows |
| LineTerm.NONE | No line terminator; the input is treated as a single line. |
| LineTerm.AUTO | Default value. Detects the line terminator automatically: it is set to whichever of the first three values appears first in the input. |
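One way to read the AUTO rule is sketched below. This is an illustration of the rule as described in the table, not the generated parser's actual detection code, which may differ (for example in how it handles input that arrives in pieces). detectLineTerm and LineTermName are hypothetical names; the function returns the name of the matching LineTerm member as a string so that the sketch does not depend on the generated enum.

```typescript
// Names of the LineTerm members from the table above. The generated code
// exposes these as an enum; plain strings keep this sketch self-contained.
type LineTermName = "CR" | "LF" | "CRLF" | "NONE" | "AUTO";

// Returns the name of whichever of CR, LF, CRLF occurs first in `text`,
// or "AUTO" if the text contains no line terminator at all.
function detectLineTerm(text: string): LineTermName {
    for (let i = 0; i < text.length; i++) {
        const c = text.charAt(i);
        if (c === "\n") return "LF";
        if (c === "\r") {
            // "\r\n" counts as one CRLF terminator, not CR followed by LF.
            return text.charAt(i + 1) === "\n" ? "CRLF" : "CR";
        }
    }
    return "AUTO"; // nothing seen yet, so detection is still pending
}
```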

Token kind

A token's kind is identified by a number. An enum object TokenKind is generated to hold all of these numbers. Each member is named after the corresponding token, so you can refer to token kinds by name in your code.

The members EOF and ERROR are always generated; these two special kinds are added automatically. EOF marks the end of file, and a token of kind ERROR is emitted when a lexical error occurs.

Token class

Tokens are used in the tokenizer. When a token is emitted, it can be accessed through variable $token in the action of the lexical rule (see Actions for a detailed explanation).

The id property is the token's kind, a member of TokenKind, and val is the string matched from the input. The remaining four properties locate the token in the input, which allows more readable error messages.

To avoid creating large numbers of objects during parsing, the parser does not create a new token object each time it reads a token. Instead, it keeps a single object and modifies its properties. If you want to keep a token in your program, be sure to copy it with clone(); otherwise it will be overwritten when the next token is read.
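The effect of the shared token object can be seen in a small stand-alone sketch. DemoToken and makeScanner are hypothetical stand-ins written for this illustration; they mimic only the reuse behaviour described above and carry just the id and val properties, not the full generated Token class.

```typescript
// Stand-in token with only the two properties named above, plus clone().
class DemoToken {
    id: number;
    val: string;
    constructor(id: number, val: string) {
        this.id = id;
        this.val = val;
    }
    clone(): DemoToken {
        return new DemoToken(this.id, this.val);
    }
}

// Stand-in scanner that reuses one token instance for every call.
function makeScanner(words: string[]) {
    const shared = new DemoToken(0, "");
    let i = 0;
    return {
        nextToken(): DemoToken {
            shared.id = i;
            shared.val = words[i] || "";
            i++;
            return shared; // always the same object
        },
    };
}

const scanner = makeScanner(["5", "+", "9"]);
const kept = scanner.nextToken();            // reference to the shared object
const copied = scanner.nextToken().clone();  // independent copy of "+"
scanner.nextToken();                         // overwrites the shared object with "9"
console.log(kept.val, copied.val);           // prints "9 +"
```

The uncloned reference ends up showing the most recently read token, while the cloned one still holds the value it was copied from.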

Events

The parser emits events in certain circumstances, and it also uses events to report errors. An event handler can be registered with on("[event name]", [callback]). The following table lists the events:

| Name | Callback | Description |
| --- | --- | --- |
| lexicalerror | (c: string, line: number, column: number) => any | Emitted when a lexical error occurs. c is the unexpected character; it is an empty string if the unexpected character is end of file. |
| syntaxerror | (t: Token, state: number) => any | Emitted when a syntax error occurs. t is the unexpected token, and state is the state number at which the error was detected. You may call getExpectedTokens(state: number) to find out which tokens were expected in this state and generate an error message. |
| accept | () => any | Emitted when the input is accepted. |
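As an illustration of combining error events with halt(), the sketch below registers handlers that record a message and stop the parser at the first error. It is written against a structural stand-in (ErrorSource) that declares only on() and halt(), and it reads only the val property of the token, so it does not depend on the generated declarations. haltOnFirstError and ErrorSource are hypothetical names, and the message wording is arbitrary.

```typescript
// Structural stand-in for the part of the generated Parser interface used
// here. Only on() and halt() are taken from this page.
interface ErrorSource {
    on(event: string, callback: (...args: any[]) => any): void;
    halt(): void;
}

// Hypothetical helper: records error messages and halts on the first error.
function haltOnFirstError(parser: ErrorSource): string[] {
    const messages: string[] = [];
    parser.on("lexicalerror", (c: string, line: number, column: number) => {
        // An empty string means the unexpected character was end of file.
        const what = c === "" ? "end of file" : `character '${c}'`;
        messages.push(`unexpected ${what} at line ${line}, column ${column}`);
        parser.halt(); // stop so one mistake does not cascade
    });
    parser.on("syntaxerror", (t: { val: string }, state: number) => {
        messages.push(`unexpected token "${t.val}" (parser state ${state})`);
        parser.halt();
    });
    return messages;
}
```

With a generated parser, this would be called once after createParser() and before parse(input); the returned array fills in as errors are emitted.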

Syntax highlighting support

The generated parser can also be used to implement syntax highlighting in editors such as CodeMirror.

First, when a parser is used for syntax highlighting, all it needs to do is scan tokens and check the grammar; it need not, and should not, execute the semantic actions. You can disable all semantic action blocks with disableBlocks(), which disables all blocks of the form {...}. Semantic actions of the form [...], such as [+IN_BLOCK], and blocks with the prefix %always are not disabled, since they may be relevant to syntax highlighting. enableBlocks() re-enables all blocks.

Second, editors like CodeMirror usually cache the scanner's state after each line is scanned to improve performance, and the scanner must restore that state each time it resumes scanning. Use getParserState() and loadParserState(state) to get or set the parser's current state.
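The per-line caching idea can be sketched as follows. This is only an outline under stated assumptions: the type returned by getParserState() is not specified on this page, so it is left as a type parameter S, and the sketch assumes the returned value is a snapshot that stays valid after further scanning. StatefulParser and LineStateCache are hypothetical names, not part of the generated code or of CodeMirror.

```typescript
// Structural stand-in for the two state methods named above.
interface StatefulParser<S> {
    getParserState(): S;
    loadParserState(state: S): void;
}

// Hypothetical cache: states[n] holds the parser state at the start of line n.
class LineStateCache<S> {
    private states: S[] = [];

    constructor(initial: S) {
        this.states[0] = initial;
    }

    // Restores the state for `line`, runs `scan` over that line, then records
    // the resulting state as the start state of the following line.
    scanLine(parser: StatefulParser<S>, line: number, scan: () => void): void {
        if (line < 0 || line >= this.states.length) {
            throw new RangeError(`no cached state for line ${line}`);
        }
        parser.loadParserState(this.states[line] as S);
        scan();
        this.states[line + 1] = parser.getParserState();
        // States cached for later lines may be stale now, so drop them.
        this.states.length = line + 2;
    }
}
```

In a highlighter, scan would typically load the line's text and read its tokens with nextToken(), after disableBlocks() has been called once on the parser.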

JavaScript

The generated code for JavaScript is the same as for TypeScript, with type annotations and interface declarations removed and the class definition replaced by a constructor definition.
