The output parser
The output source code consists of two parts: the tables for the tokenizer and parser, and the parser's class definition. All internal functions and variables start with jj, so you should avoid using the prefix jj in action blocks.
The specific source code depends on the output language.
For TypeScript, the output parser consists of the following things:
- DFA tables and parse table.
- Interface definitions.
- Token class.
- A function that creates the parser.
- Some enum objects.
- Some functions.
JSDoc comments will also be generated.
To invoke the parser, you need to create a parser instance via the function createParser(), which returns a closure object of type Parser, defined as an interface in TypeScript:
let parser = createParser();
The names of the interface and the function can be changed with the className option in %option. For example, if className = "MyParser", the generated code contains the interface MyParser and the function createMyParser(): MyParser.
Next, initialize it by calling init():
parser.init();
This method initializes the internal variables of the closure object and takes no arguments by default. You can use the %init directive to customize its arguments and body, for example to initialize variables defined with %extra_arg.
Finally, pass the input to the parser by calling parse(input). The argument can either be a string:
parser.parse("5 + 9 * 8");
or a stream-like object of type ParseInput:
var s = "5 + 9 * 8";
var i = 0;
parser.parse({
current: () => i < s.length ? s.charCodeAt(i) : null,
next: () => i++,
isEof: () => i >= s.length,
backup: (t: string) => { i -= t.length; } // parameter renamed so it does not shadow the outer s
});
Here, backup is used to push a string back to the input.
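The object literal above can be wrapped in a small reusable helper. The following is a minimal sketch, assuming ParseInput has exactly the four members used in the example; the interface declared here is a local stand-in for the generated one, and stringInput is a hypothetical name that is not part of the generated code.

```typescript
// Local stand-in for the generated ParseInput interface (assumed shape,
// taken from the four members used in the example above).
interface ParseInput {
    current(): number | null;
    next(): void;
    isEof(): boolean;
    backup(s: string): void;
}

// Hypothetical helper: wraps a string in a ParseInput-like object.
function stringInput(source: string): ParseInput {
    let pos = 0;
    return {
        // Character code at the cursor, or null at end of input.
        current: () => (pos < source.length ? source.charCodeAt(pos) : null),
        // Advance the cursor by one character.
        next: () => { pos++; },
        isEof: () => pos >= source.length,
        // Push a string back by moving the cursor left by its length
        // (clamped at 0, which is an addition of this sketch).
        backup: (s: string) => { pos = Math.max(0, pos - s.length); },
    };
}
```

With a generated parser in scope, the result could then be passed as parser.parse(stringInput("5 + 9 * 8")).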
Calling halt() stops the parser, and no more input will be processed. This function is usually used in an error handler: when an error is emitted, the parser otherwise ignores it and continues parsing, which may result in a flood of meaningless follow-on errors.
You can also read tokens one by one by calling nextToken(). In that case you need to call load(input) first to specify the input source. Its argument is exactly the same as that of parse(input).
parser.load('5 + 8 * 9');
var t: Token;
while((t = parser.nextToken()).id !== TokenKind.EOF){
console.log(`read token: ${t.toString()}`);
}
nextToken() scans the next token, passes it to the parser, and returns it. Since only one token instance is kept inside the parser, this function returns only a reference to it. If you need to use this token after the next token is read, call clone() on the token to copy it:
var tokens: Token[] = [];
/* ... */
var t = parser.nextToken().clone();
tokens.push(t);
parser.nextToken();
/* ... */
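To see why the copy matters, the following self-contained sketch imitates the single-instance behaviour described above. FakeToken and makeScanner are hypothetical stand-ins written for this illustration, not the generated Token class or parser; only a val field and clone() are modelled.

```typescript
// Stand-in for the generated Token class: only val and clone() are modelled.
class FakeToken {
    val = "";
    clone(): FakeToken {
        const copy = new FakeToken();
        copy.val = this.val;
        return copy;
    }
}

// Stand-in scanner that, like the generated parser, reuses one token object.
function makeScanner(words: string[]) {
    const shared = new FakeToken();
    let i = 0;
    return {
        nextToken(): FakeToken {
            shared.val = words[i++] ?? "";
            return shared; // the same reference on every call
        },
    };
}

// Without clone(): every array slot points at the same object.
const a = makeScanner(["5", "+", "8"]);
const aliased = [a.nextToken(), a.nextToken(), a.nextToken()];

// With clone(): each slot holds its own snapshot.
const b = makeScanner(["5", "+", "8"]);
const copied = [b.nextToken().clone(), b.nextToken().clone(), b.nextToken().clone()];

console.log(aliased.map(t => t.val).join(" ")); // prints "8 8 8"
console.log(copied.map(t => t.val).join(" "));  // prints "5 + 8"
```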
A line terminator must be specified in order to generate correct line information for every token. Call setLineTermiator(lt: LineTerm) or getLineTermiator() to set or get the line terminator, where LineTerm is an enum type:
Name | Description |
---|---|
LineTerm.CR | \r , used in classic Mac OS |
LineTerm.LF | \n , used in *nix |
LineTerm.CRLF | \r\n , used in Windows |
LineTerm.NONE | No line terminator, the input will be treated as a single line. |
LineTerm.AUTO | Default value. Detects the line terminator automatically: it is set to whichever of the first three values appears first in the input. |
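The behaviour described for LineTerm.AUTO can be pictured with a short sketch. This only illustrates the rule in the table and is not the generator's implementation: the LineTerm object below is a plain-object stand-in for the generated enum with arbitrary numeric values, and detectLineTerm is a hypothetical name.

```typescript
// Plain-object stand-in mirroring the member names of the generated LineTerm
// enum; the numeric values here are arbitrary assumptions.
const LineTerm = { CR: 0, LF: 1, CRLF: 2, NONE: 3, AUTO: 4 } as const;
type LineTermValue = (typeof LineTerm)[keyof typeof LineTerm];

// Hypothetical helper: returns the terminator that appears first in text,
// or LineTerm.AUTO if the text contains no terminator at all.
function detectLineTerm(text: string): LineTermValue {
    for (let i = 0; i < text.length; i++) {
        const c = text.charAt(i);
        if (c === "\n") return LineTerm.LF;
        if (c === "\r") {
            // "\r\n" counts as one CRLF terminator, a lone "\r" as CR.
            return text.charAt(i + 1) === "\n" ? LineTerm.CRLF : LineTerm.CR;
        }
    }
    return LineTerm.AUTO;
}
```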
A token's kind is specified by a number. An enum object TokenKind is generated to hold all these token kind numbers. Each member is named after the corresponding token, allowing you to refer to token kinds in your code.
The members EOF and ERROR are special kinds that are always generated automatically. EOF marks the end of file, and a token of kind ERROR is emitted when a lexical error occurs.
Tokens are used in the tokenizer. When a token is emitted, it can be accessed through the variable $token in the action of the lexical rule (see Actions for a detailed explanation).
The id is the token's kind, which is a member of TokenKind, and val is the string matched from the input. The remaining four variables locate the token in the input, which allows more readable error messages.
To avoid creating large numbers of objects during parsing, the parser does not create a new token object each time it reads a token. Instead, it keeps the same object and modifies its properties. If you want to keep a token in your program, be sure to copy it using clone(), otherwise it will be overwritten when the next token is read.
The parser emits events in certain circumstances, and it also uses events to handle errors. An event handler can be registered with on("[event name]", [callback]). The events are listed in the following table:
Name | Callback | Description |
---|---|---|
lexicalerror | (c: string, line: number, column: number) => any | Emitted when a lexical error occurs. c is the unexpected character; it is an empty string if the unexpected character is the end of file. |
syntaxerror | (t: Token, state: number) => any | Emitted when a syntax error occurs. t is the unexpected token, and state is the state number at which the error was detected. You may call getExpectedTokens(state: number) to find out which tokens were expected in this state and generate an error message. |
accept | () => any | Emitted when input is accepted. |
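As an illustration of the table, the sketch below registers handlers for both error events and halts the parser on the first error. It assumes that on() and halt() are methods of the parser object and uses only the callback signatures listed above. The interfaces TokenLike and ErrorEvents are local stand-ins declared for this example, and collectFirstError is a hypothetical name.

```typescript
// Minimal local stand-ins (assumptions): only the members used below.
interface TokenLike { toString(): string; }
interface ErrorEvents {
    on(event: "lexicalerror", cb: (c: string, line: number, column: number) => any): void;
    on(event: "syntaxerror", cb: (t: TokenLike, state: number) => any): void;
    halt(): void;
}

// Hypothetical helper: records error messages and halts the parser as soon
// as an error is reported, so that follow-on errors are not produced.
function collectFirstError(parser: ErrorEvents): string[] {
    const errors: string[] = [];
    parser.on("lexicalerror", (c, line, column) => {
        // An empty string means the unexpected character was end of file.
        const what = c === "" ? "end of file" : `character '${c}'`;
        errors.push(`lexical error at ${line}:${column}: unexpected ${what}`);
        parser.halt();
    });
    parser.on("syntaxerror", (t, state) => {
        errors.push(`syntax error: unexpected token ${t.toString()} (state ${state})`);
        parser.halt();
    });
    return errors;
}
```

The returned array is filled in later, when the parser actually emits an event, so it should be inspected after parse(input) has returned.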
The generated parser can also be used to implement syntax highlighting, for example in editors such as CodeMirror.
First, when a parser is used for syntax highlighting, all it needs to do is scan tokens and check the grammar; it need not, and should not, execute the semantic actions. You can disable all semantic action blocks by calling disableBlocks(). It disables all blocks of the form {...}. Semantic actions of the form [...], such as [+IN_BLOCK], and blocks with the prefix %always are not disabled, since they may be relevant for syntax highlighting. enableBlocks() enables all blocks again.
Second, editors like CodeMirror usually cache the scanner's state after each line is scanned to improve performance, and the scanner needs to restore that state every time it starts scanning. You can use loadParserState(state) and getParserState() to set or get the parser's current state.
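One way an editor integration might organise these cached states is a per-line cache. The class below is a sketch under several assumptions: LineStateCache is a hypothetical helper that is not part of the generated code, the state type S stands for whatever getParserState() returns, and it is assumed that a returned state is not mutated afterwards by the parser (if it is, a copy would have to be stored instead).

```typescript
// Hypothetical cache of parser states, indexed by 0-based line number.
// S is whatever getParserState() returns; its shape is not assumed here.
class LineStateCache<S> {
    private states: S[] = [];
    private initial: S;

    constructor(initial: S) {
        this.initial = initial;
    }

    // State to restore (via loadParserState) before scanning the given line,
    // or undefined if the previous line has not been scanned yet.
    stateBefore(line: number): S | undefined {
        return line === 0 ? this.initial : this.states[line - 1];
    }

    // Record the state obtained (via getParserState) after scanning a line.
    record(line: number, state: S): void {
        this.states[line] = state;
    }

    // Drop the cached states of an edited line and all lines after it.
    invalidateFrom(line: number): void {
        this.states.length = Math.min(this.states.length, line);
    }
}
```

Before highlighting a line, the integration would pass stateBefore(line) to loadParserState(state), scan the line, and then call record(line, parser.getParserState()).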
The generated code for JavaScript is the same as for TypeScript, with type annotations and interface declarations removed, and the class definition replaced by a constructor definition.