Skip to content

Lexical rules

Hadron67 edited this page Feb 22, 2018 · 1 revision

Lexical rules are used to generate a tokenizer that converts the input string into a sequence of tokens, so that the parser could continue to process them.

In the grammar file you can use an expression similiar to regular expression to define a lexical rule.

As the tokenizer scans input string, it will store the currently matched string and line and column information, which can be accessed in the sematic actions.

Unicode is certainly supported. In fact, the maximum character code that tscc-compiler can handle is the maximum integer value of javascript, not just the largest encode of unicode.

Lexical state

sometimes one group of lexical rule is not enough. For example, the string interpolating in es6:

var a = `my name is ${name}, and sin(pi/6) = ${Math.sin(Math.PI / 2)}`;

under such circumstances we need more than one group of lexical rule, and use one group in the normal code, and the other one in the back quote strings.

tscc-compiler uses Lexical state to implement this. Each lexical state has a group of lexical rules. The tokenizer was marked with a lexical state, with the initial state being 'DEFAULT'. It will use the group of lexical rules that its current state corresponds to, and you could change the lexical state in the sematic actions.

In fact, the lexical state variable in the tokenizer is actually a stack, current state is given by the top of the stack. Push and pop state from this stack is a prefered way to do state transition.

Define lexical rules

To define lexical rules in grammar files, you could use the directive %lex:

%lex ( < [lexical state] (, [lexical state])* > )? { [lexical rule body] }

where [lexical state] is an identifier, the name of a lexical state.

The first optional part selects a set of lexical state the lexical rules to be added to. If not specified, the initial state 'DEFAULT' will be selected.

The lexical rule body

There are two types of definations in the lexical rule body: variable defination and rule defination. They can appear an arbitary number of times in the block.

Regular expressions

The regular expressions used by grammar file is different from ordinary ones like those in javascript, instead, they are similiar to those in Javacc, i.e., they aren't a literal like a string or something bounded by "/" as those in javascript, instead, they are part of the grammar file's syntax. Elements of regular expression are listed below:

  • string literals are just treated literally.
  • [...] is a set of characters. The elements needs to be single character strings, or two such strings connected by a dash "-", indicats a range of character. And elements needs to be seperated by comma ','. If a begining "^" is present, this set will be inverted.
  • (...) is a group of expressions.
  • [foo]? means [foo] is optional.
  • [foo]* means [foo] can repeat an arbitary number of times.
  • [foo]+ means [foo] needs to repeat more than one time.
  • expressions seperated by | means these expressions all can be accepted.
  • <[variable name]> references an expression variable.
  • %import("[name]") can import a built-in sub-expression, where [name] is the name of the sub-expression, see below for a list of supported sub-expression.
Sub-expression name Description
es3UnicodeIDStart unicode characters that can be the first character of an identifier in es3
es3UnicodeIDPart unicode characters that can appear in the middle of an identifier in es3
es5UnicodeIDStart same as es3UnicodeIDStart, but es5 instead of es3
es5UnicodeIDPart same as es3UnicodeIDPart, but es5 instead of es3

Allowing unicode characters in identifiers is common in many of today's programming languages. Since specifying what unicode characters can be accepted by hand is a lengthy work, this provides a more convenient way to introduce unicode characters in identifiers.

Here are some examples:

  • A common defination for identifiers: ['a'-'z', 'A'-'Z', '_', '$']['a'-'z', 'A'-'Z', '_', '$', '0'-'9']*
  • Identifier for ES5:
( ['a'-'z', 'A'-'Z', '_', '$'] | %import("es5UnicodeIDStart") )
( ['a'-'z', 'A'-'Z', '_', '$', '0'-'9'] | %import("es5UnicodeIDPart") )*
  • Integers: ['0'-'9']+
  • Line comment in C: "//" [^"\n"]*

Variable defination

Sometimes a common pattern of regular expression may appear in many lexical rules, so it is preferred to define a variable for that and reference them in the rules. The syntax is:

[variable name] = < [regular expression] >

where [variable name] is an identifier.

Note that in this [regular expression] you can also reference other variables, but circular dependence is not allowed. If you think you do need that, perhaps using grammar rules is a better solution.

Rule defination

The syntax:

< ( [token name] : )? [regular expression] > ( : [action] )?

if [token name] is present, this rule will emit a token, i.e., when the input is matched by this rule, the tokenizer will do the action, then pass the token to the parser, if not, it only do action. Checkout Actions for a detailed explaination of the [action].

Since every token in the grammar rule has a sematic value, you need to set this value in the [action] so that when a token is emitted this value could be pushed onto the sematic stack. By default, this value is simply null;

In fact, the tokenizer in tscc-compiler is somewhat low level: When a rule is matched, the tokenizer do action first, and then emit token if presented. This means more than one rule can emit the same token. It doesn't think a token has a specified pattern, nor it knows what is white space. In this way, the method to handle white space is simply write a rule to match white space and doesn't emit token:

< (["\n", " ", "\t"]|"\r\n")+ >: [='']

The action [=''] is used here to set the matched string to empty and reset the line information, otherwise the information of the next token will be incorrect.

Minimum matching

By default, the tokenizer will try to match the longest string specified by lexical rules. But sometimes it could be useful to match the shortest string. Use %least at the begining of a rule to indicate that this rule matches the shortest string.

Token names and alias

The name of the token is just the name specified in the rule, as mentioned above. Besides that, the alias of a token would be infered from the rule: if the rule only contains one string literal, then the alias is this string, otherwise this token has no alias. For example, the rule

< WHILE: 'while' >

then we have defined a token named WHILE, with its alias being 'while'.

In the grammar rules and options that requires a token, you can reference tokens by its name:

<[token name]>

or by its alias, with a string literal. This could be convenient, for instance, you could just write 'while' in the grammar rules instead of <WHILE>. But if a token has no alias or more than one token has the same alias, you can only reference them by name.