Syntax specification for the new parser

Radu Mereuta edited this page Apr 16, 2015 · 4 revisions

The syntax specification for the new parser has evolved from the current SDF format, with just minor changes. The syntax is formally specified in e-kore.k.

Before starting the full description, here are the main differences between K with SDF and the new parser:

  • Tokens have been reworked. Any production can now create a constant/token by adding the token attribute. This allows for context-free tokens, instead of just regular ones.
  • noAutoReject has been replaced with autoReject, and the default behavior is reversed. If autoReject is specified, all terminals from all productions (taking modularity into account) are matched against the currently matched constant. If any terminal matches, the parse fails. This is an easy way to implement keyword rejection.
  • reject has been reworked: it is now an attribute on a token production, instead of a sort. It takes a regular expression, and it can extend autoReject.
  • Token{<sdf lex>} has been replaced with r"<regex>"

Syntax specification for the new parser

The new parser takes only productions as input. A production is a sequence of terminals, non-terminals, and regular expressions. Examples:
syntax Exp ::= "if" Exp "then" Exp r"(?!else)" // if_then, not followed by else
syntax Id ::= r"(?<![A-Za-z0-9\\_])[A-Za-z\\_][A-Za-z0-9\\_]*" [token, autoReject]
syntax BubbleItem ::= r"[^ \t\n\r]" [token, reject2("rule|syntax|endmodule|configuration|context")]

Terminals are delimited by double quotes and the parser will match exactly the sequence of characters enclosed.
RegexTerminals are prefixed by r and delimited by double quotes; they match the strings described by the enclosed regular expression.
NonTerminals will match on any production of the specified sort.
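How terminals and regex terminals consume input can be sketched in Python. This is only an illustration, not the parser's implementation: the helper names (match_terminal, match_regex, match_sequence) are hypothetical, non-terminals are left out because they would need a full grammar, layout is simplified to plain whitespace, and Python's re module is assumed to behave closely enough to the regex language used here. The Id regex is the one from the example above, with the escaped underscores written plainly.

```python
import re

# Id regex from the example above, adapted to Python string syntax.
ID = r"(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*"

def match_terminal(text, pos, literal):
    """Return the end position if `literal` occurs exactly at `pos`, else None."""
    return pos + len(literal) if text.startswith(literal, pos) else None

def match_regex(text, pos, pattern):
    """Return the end position if `pattern` matches starting at `pos`, else None."""
    m = re.compile(pattern).match(text, pos)
    return m.end() if m else None

def match_sequence(text, items):
    """Match items left to right, skipping whitespace before each one.

    Returns the list of matched substrings, or None if any item fails
    or the input is not fully consumed.
    """
    pos, matched = 0, []
    for kind, value in items:
        while pos < len(text) and text[pos] in " \t\n\r":
            pos += 1
        matcher = match_terminal if kind == "terminal" else match_regex
        end = matcher(text, pos, value)
        if end is None:
            return None
        matched.append(text[pos:end])
        pos = end
    return matched if pos == len(text) else None

# Hypothetical item sequence: "if" Id "then" Id
ITEMS = [("terminal", "if"), ("regex", ID), ("terminal", "then"), ("regex", ID)]
```

With these definitions, match_sequence("if a then b1", ITEMS) yields the four matched pieces, while match_sequence("ifa then b", ITEMS) fails, because the lookbehind in the Id regex refuses to start an identifier directly after another identifier character.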

AST modifiers (token, klabel, bracket, subsort)

From productions alone, the parser creates parse trees, but these are quite verbose and hard to handle in a semantic environment (reminder: 1 + 2 produces the parse tree: (("1")," ", "+", " ", ("2")), and the AST: _+_(1, 2)).

For this, the following attributes can be used in order to help build the proper AST:

  • klabel - takes the current production's children (non-terminals), and creates a node with the label specified in the klabel attribute. All terminals and regular-expressions are ignored.
  • token - creates a node containing two fields: the sort of the production, and the exact string representation of the matched input. Two further attributes restrict what a token production accepts:
    • autoReject excludes any terminal defined in the syntax. This offers a quick way of rejecting keywords.
    • reject2 takes a regular expression as input, for users who need better control over what is rejected.
  • bracket - allowed only on productions that have one non-terminal and some terminals/regex-terminals; the node for such a production is eliminated. It is commonly used to group productions and/or override priority or associativity.
  • chain productions/subsorts - by default, productions with exactly one non-terminal are eliminated, unless the user specifies a klabel attribute.
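The rejection attributes on token productions can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: TERMINALS is a hypothetical stand-in for the terminals collected from all productions in scope, accept_token is a hypothetical helper, and the reject regex is assumed to be tested against the whole matched string.

```python
import re

# Hypothetical: terminals collected from every production in scope.
TERMINALS = {"if", "then", "else", "rule", "syntax"}

# Body of the Id regex from the examples (lookbehind omitted, since the
# candidate string is checked on its own here).
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def accept_token(lexeme, auto_reject=True, reject=None):
    """Return True if `lexeme` is accepted as an Id token."""
    if ID.fullmatch(lexeme) is None:
        return False
    # autoReject: fail if the matched string equals any terminal.
    if auto_reject and lexeme in TERMINALS:
        return False
    # reject-style attribute: fail if the user-supplied regex matches.
    if reject is not None and re.fullmatch(reject, lexeme):
        return False
    return True
```

For example, accept_token("foo") succeeds and accept_token("if") fails, while accept_token("rule", auto_reject=False, reject="rule|syntax|endmodule|configuration|context") fails because of the explicit regular expression alone.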

By default, every production that is neither a subsort nor annotated with bracket or token is automatically tagged with a generated klabel.
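The way these attributes turn a parse tree into an AST can be sketched in Python. All class and function names below (Terminal, Token, Node, KApply, to_ast) are hypothetical and chosen for illustration only, and the sketch assumes that generated klabels have already been attached to the productions that need them.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Terminal:
    """Text matched by a terminal or regex terminal (layout included)."""
    text: str

@dataclass
class Token:
    """Result of a production carrying the token attribute."""
    sort: str
    value: str

@dataclass
class Node:
    """Parse-tree node for one production application."""
    children: List[object]
    klabel: Optional[str] = None   # explicit or generated label, if any
    bracket: bool = False

@dataclass
class KApply:
    """AST node: a label applied to the non-terminal children."""
    klabel: str
    args: List[object]

def to_ast(tree):
    """Drop terminals, eliminate brackets and chain productions, label the rest."""
    if isinstance(tree, Token):
        return tree
    kids = [to_ast(c) for c in tree.children if not isinstance(c, Terminal)]
    if tree.bracket or (tree.klabel is None and len(kids) == 1):
        return kids[0]             # bracket or chain production: node eliminated
    if tree.klabel is None:
        raise ValueError("production needs an explicit or generated klabel")
    return KApply(tree.klabel, kids)

# The "1 + 2" reminder from above, with the label _+_ on the production.
one_plus_two = Node(
    [Token("Int", "1"), Terminal(" "), Terminal("+"), Terminal(" "), Token("Int", "2")],
    klabel="_+_",
)
# The same expression wrapped in a bracket production: "(" Exp ")".
bracketed = Node([Terminal("("), one_plus_two, Terminal(")")], bracket=True)
```

Both to_ast(one_plus_two) and to_ast(bracketed) give the same labelled node with the two tokens as arguments, which mirrors how the bracket node disappears from the AST.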

The regular expression language can be found here: Regex Language, with two additions: negative lookahead (?!X) and negative lookbehind (?<!X).
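The effect of the two additions can be tried out with Python's re module, which supports the same (?!X) and (?<!X) notation. This is an assumption made for illustration; the regex engine used by the parser may differ in other details. The variable names are hypothetical.

```python
import re

# Negative lookahead: succeeds only if "else" does NOT follow.
# It consumes no input, so a successful match is empty.
NOT_BEFORE_ELSE = re.compile(r"(?!else)")

# Negative lookbehind: an identifier may not start right after
# another identifier character.
ID_WITH_LOOKBEHIND = re.compile(r"(?<![A-Za-z0-9_])[A-Za-z_][A-Za-z0-9_]*")
ID_PLAIN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

SAMPLE = "foo 1bar baz_2"
with_lookbehind = ID_WITH_LOOKBEHIND.findall(SAMPLE)   # skips the "bar" inside "1bar"
without_lookbehind = ID_PLAIN.findall(SAMPLE)          # also picks up "bar"
```

Here NOT_BEFORE_ELSE.match("else y") returns no match, while NOT_BEFORE_ELSE.match("+ 1") returns an empty match at position 0.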

Future work/questions

Whitespace
Currently, whitespace/layout is added by default, and it also covers C-style comments: //... and /*...*/. A custom regular expression could be specified with something like syntax #Whitespace ::= r"<regex>", in which case the non-terminal #Whitespace would be used instead of the default regular expression. There might also be languages where whitespace plays an important role in the parsing process, like Python or JavaScript, in which case the user might want finer granularity when specifying where and which kind of whitespace can be parsed. One possibility would be to write:
syntax Exp ::= "if" [#Whitespace1] Exp "then" [#Whitespace2] Stmt [noDefaultWhitespace]
Specifying a non-terminal between square brackets would tell the parser to eliminate from the AST whatever was matched, and the noDefaultWhitespace attribute would tell the parser generator to not include the default whitespace.
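The idea of swapping the default layout for a user-defined one can be sketched in Python. Everything here is hypothetical: DEFAULT_LAYOUT is only a guess at what a default covering whitespace and C-style comments might look like, and make_skipper stands in for whatever the parser generator would do with a #Whitespace production.

```python
import re

# Assumed default layout: whitespace, // line comments, /* block comments */.
DEFAULT_LAYOUT = r"[ \t\n\r]+|//[^\n\r]*|/\*.*?\*/"

def make_skipper(layout_regex):
    """Build a function that skips any run of layout starting at `pos`."""
    pattern = re.compile(r"(?:" + layout_regex + r")*", re.DOTALL)

    def skip(text, pos=0):
        # The starred group also matches the empty string, so this never fails.
        return pattern.match(text, pos).end()

    return skip

skip_default = make_skipper(DEFAULT_LAYOUT)

# Hypothetical custom layout for a language where newlines are significant:
# only spaces, tabs, and # comments are skipped.
skip_custom = make_skipper(r"[ \t]+|#[^\n]*")
```

With the default layout, skip_default("  // c\n/* b */ x") stops at the x. With the custom layout, skip_custom("  # note\nx") stops at the newline, leaving it for the grammar to handle.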