Syntax specification for the new parser
The syntax specification for the new parser has evolved from the current SDF format, with just minor changes. The syntax is formally specified in e-kore.k.
Before starting the full description, here are the main differences between K with SDF and the new parser:
- Tokens have been reworked. Now any production can create a constant/token, by adding the token attribute. This allows for context free tokens, instead of just regular ones.
- noAutoReject has been replaced with autoReject, and the default behavior reversed. If autoReject is specified, then all terminals from all productions (taking modularity into account) will be matched against the currently matched constant. If it matches, the parse fails. This is an easy way to implement keyword rejection.
- reject has been reworked, now it is an attribute on a token production, instead of a sort. It takes a regular expression, and it can extend autoReject.
- Token{<sdf lex>} has been replaced with r"<regex>".
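The rejection rules above can be illustrated with a small Python sketch. This is not the parser's implementation: the production encoding, the helper names collect_terminals and accept_token, and the reading of "matched against" as exact string equality are all assumptions made for illustration.

```python
import re

# Assumed encoding: ("t", text) is a terminal, ("nt", sort) a non-terminal.
PRODUCTIONS = [
    [("t", "if"), ("nt", "Exp"), ("t", "then"), ("nt", "Exp")],
    [("t", "while"), ("nt", "Exp"), ("t", "do"), ("nt", "Exp")],
]

def collect_terminals(productions):
    """Gather the text of every terminal that appears in any production."""
    return {text for prod in productions for kind, text in prod if kind == "t"}

def accept_token(text, token_regex, terminals=frozenset(), reject_regex=None):
    """Return True if `text` is a valid token under the given rejection rules."""
    if re.fullmatch(token_regex, text) is None:
        return False  # not a token of this sort at all
    if text in terminals:
        return False  # autoReject-style: the constant equals some terminal
    if reject_regex is not None and re.fullmatch(reject_regex, text):
        return False  # reject-style: the constant matches the given regex
    return True

ID_REGEX = r"[A-Za-z_][A-Za-z0-9_]*"      # simplified identifier pattern
KEYWORDS = collect_terminals(PRODUCTIONS)  # {"if", "then", "while", "do"}
```

With these definitions, accept_token("if", ID_REGEX, KEYWORDS) is rejected because "if" is a terminal of some production, while the same call without the terminals argument accepts it. Passing reject_regex adds a user-chosen pattern on top of that.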
The new parser takes only productions as input. Each production is a sequence of terminals, non-terminals, and regular expressions. Examples:
syntax Exp ::= "if" Exp "then" Exp r"(?!else)" // if_then, not followed by else
syntax Id ::= r"(?<![A-Za-z0-9\\_])[A-Za-z\\_][A-Za-z0-9\\_]*" [token, autoReject]
syntax BubbleItem ::= r"[^ \t\n\r]" [token, reject2("rule|syntax|endmodule|configuration|context")]
Terminals are delimited by double quotes and the parser will match exactly the sequence of characters enclosed.
RegexTerminals are prefixed by r and delimited by double quotes; they match the strings described by the enclosed regular expression.
NonTerminals will match on any production of the specified sort.
Taking into consideration only productions, the parser will create parse trees, but they are quite verbose, and hard to handle in a semantic environment (reminder: 1 + 2 produces the parse tree: (("1"), " ", "+", " ", ("2")), and the AST: _+_(1, 2)).
For this, the following attributes can be used in order to help build the proper AST:
- klabel - takes the current production's children (non-terminals), and creates a node with the label specified in the klabel attribute. All terminals and regular expressions are ignored.
- token - creates a node containing two fields: the sort of the production, and the exact string representation of the matched input. Two further attributes restrict what a token may match:
  - autoReject excludes any terminal defined in the syntax. This offers a quick way of rejecting keywords.
  - reject2 takes a regular expression as input, for users who need finer control.
- bracket - allowed only on productions that have one non-terminal and some terminals/regex-terminals; the node will be eliminated. It is commonly used to group productions and/or override priority or associativity.
- chain productions/subsorts - by default, productions with exactly one non-terminal are eliminated, unless the user specifies a klabel attribute.
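As a rough illustration of how these attributes shape the AST, here is a minimal Python sketch. The Token and Apply classes, the build and show helpers, and the dictionary encoding of attributes are hypothetical names chosen for this example, not part of the parser. Generated klabels are deliberately left out, since their format is not described here.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Token:
    sort: str   # sort of the production that matched
    text: str   # exact string that was matched

@dataclass
class Apply:
    klabel: str
    children: List["Ast"]

Ast = Union[Token, Apply]

def build(sort, attrs, items, text=""):
    """Build the AST node for one matched production.

    `items` holds the matched pieces in order: plain strings for terminals
    and regex terminals, already-built AST nodes for non-terminals.
    """
    children = [i for i in items if not isinstance(i, str)]
    if "token" in attrs:
        return Token(sort, text)
    if "klabel" in attrs:
        # terminals and regex terminals are dropped; only children remain
        return Apply(attrs["klabel"], children)
    if "bracket" in attrs or len(items) == 1:
        # bracket and chain productions: the node itself is eliminated
        assert len(children) == 1
        return children[0]
    raise ValueError("this sketch requires an explicit klabel")

def show(node):
    """Render a node in the _+_(1, 2) style used in the text."""
    if isinstance(node, Token):
        return node.text
    return f"{node.klabel}({', '.join(show(c) for c in node.children)})"

one = build("Int", {"token": True}, ["1"], text="1")
two = build("Int", {"token": True}, ["2"], text="2")
plus = build("Exp", {"klabel": "_+_"}, [one, "+", two])
paren = build("Exp", {"bracket": True}, ["(", plus, ")"])
```

Here show(plus) renders the same shape as the AST given for 1 + 2, and paren is the very same node as plus, because the bracket production leaves no trace in the tree.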
By default, every production that is neither a subsort nor annotated with bracket or token will be automatically tagged with a generated klabel.
The regular expression language can be found here: Regex Language, with two additions: negative lookahead (?!X) and negative lookbehind (?<!X).
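To see what the two lookaround forms do, here is a short sketch using Python's re module, which happens to support the same (?!X) and (?<!X) notation. It only illustrates the matching behavior and makes no claim about how the new parser implements it. The identifier pattern is a simplified version of the Id example, with the underscore left unescaped.

```python
import re

# Negative lookahead: succeeds (consuming nothing) only if "else" does not follow.
NOT_ELSE = re.compile(r"(?!else)")

# Identifier without and with a negative lookbehind on the preceding character.
PLAIN_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
GUARDED_ID = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*")

sample = "foo 1abc bar_2"
plain_hits = PLAIN_ID.findall(sample)      # also picks "abc" out of "1abc"
guarded_hits = GUARDED_ID.findall(sample)  # skips "abc": it follows a digit
```

The lookbehind keeps an identifier from starting in the middle of a longer alphanumeric run, which is the reason it appears in the Id production.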
Whitespace
Currently, whitespace/layout is added by default, and the default layout also covers C-style comments: //... and /*...*/.
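The exact default layout expression is not given here. As a sketch under that caveat, a pattern covering whitespace and both C-style comment forms could look like the following in Python; LAYOUT and skip_layout are hypothetical names.

```python
import re

# Assumed layout pattern: whitespace, // line comments, and /* */ block comments.
LAYOUT = re.compile(r"(?:\s+|//[^\n]*|/\*.*?\*/)*", re.DOTALL)

def skip_layout(text, pos=0):
    """Return the index of the first character after any leading layout."""
    return LAYOUT.match(text, pos).end()

source = "  // c\n/* a\nb */ x"
first = skip_layout(source)  # index of the first non-layout character
```

A parser using such a pattern would call skip_layout between terminals, so that comments and blanks never reach the AST.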
One possible way to specify a custom layout regular expression would be something like this:
syntax #Whitespace ::= r"<regex>"
in which case the non-terminal #Whitespace would be used instead of the default regular expression. There might also be languages where whitespace plays an important role in the parsing process, like Python or JavaScript, in which case the user might want more granularity when specifying where and which kind of whitespace can be parsed. One possibility would be to write:
syntax Exp ::= "if" [#Whitespace1] Exp "then" [#Whitespace2] Stmt [noDefaultWhitespace]
Specifying a non-terminal between square brackets would tell the parser to eliminate from the AST whatever was matched, and the noDefaultWhitespace attribute would tell the parser generator to not include the default whitespace.