List of tests exposing particular behaviors of Typst:

- Test `positive/320`: indent takes comments into account, but redent is done at item, term or heading tokens.
- Tests `positive/328` and `positive/329`: math functions don't work with symbols.
- Test `positive/330`: comment blocks don't need a closing delimiter.
- Test `fixme/011`: Embedded code causes a segfault. With newer versions of Tree-sitter or Helix, this bug seems to be fixed.
- Test `fixme/014`: Right square bracket exits from all containers.
- Test `fixme/012`: Matching square brackets in text are paired.
- Test `fixme/013`: Point ending a ref is not part of the ref.
- Test `fixme/010`: Math shorthand and letter can be applied.
- Test `fixme/009`: Indentation and comments.
- Test `fixme/001`: Group termination in math.
- Test `fixme/002`: Import precedence over list.
- Test `fixme/008`: Conditional `set` statement (`set ... if`).
- Test `fixme/007`: Trailing comments before `else`.
- Test `fixme/003`: Spaces in method notation.
- Test `fixme/004`: Leading space not recognized.
- Test `fixme/005`: Inlined code absorbs the new line.
Failing tests are found in `corpus/fixme.scm`.
When searching for ways to optimize the parser and simplify the grammar, I thought about using the extras feature for spaces and comments (and line breaks as well). In the end, it significantly reduced the parser size. The only problem arises with function calls and, in inline code, field access: they must be directly joined (no space nor comment in between). The immediate feature won't solve the problem, as it only takes inline regexes into account (which would be fine for spaces, but not for comments, since comments have to appear in the output tree).
The solution is to rely on the external scanner when parsing spaces or comments. Let's call a token "pre-immediate" when it may be followed by an immediate token. When a pre-immediate token is parsed, the scanner sets a flag to `true`, and when a space or comment is parsed, it resets the flag to `false` (this flag is stored in the scanner's state as a boolean). This way, when a token has to be immediate, an external token can be required that will only match if the flag is `true`. It means any immediate token has to be preceded by a token that sets the flag to `true`.
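The flag mechanism above can be sketched as follows. This is a minimal illustration of the idea, not the scanner's actual code; all names are hypothetical:

```c
#include <stdbool.h>

/* Hypothetical minimal scanner state holding the flag described
   above; the real scanner stores more than this. */
typedef struct {
    bool immediate_allowed; /* set by pre-immediate tokens, cleared by blanks */
} Scanner;

/* Called when a pre-immediate token (e.g. a function name) is parsed. */
void on_pre_immediate(Scanner *s) { s->immediate_allowed = true; }

/* Called when a space or comment is parsed by the external scanner. */
void on_blank(Scanner *s) { s->immediate_allowed = false; }

/* The external "immediate" token only matches while the flag is set,
   i.e. when nothing blank separates it from the previous token. */
bool immediate_matches(const Scanner *s) { return s->immediate_allowed; }
```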
The scanner uses a stack of "containers" as internal state in order to simplify the parsing of some nodes like `emph`, `strong` and `content`. At any moment, the scanner knows what the containing nodes are. For instance:

```typ
* _ #[Hello *World*] _ *
```

Here, when the scanner is called at the beginning of "World", the container stack will be `strong/emph/content/strong`. When a `*` is encountered while the top of the stack is `strong`, it is interpreted as the end of this container. If the top of the stack were `emph`, it would be the start of a `strong` container, and it would be pushed onto the stack.
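The push/pop behavior on `*` can be sketched like this. It is a simplified model of the container stack, assuming a fixed-size array; the real scanner's representation may differ:

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative container kinds and stack. */
typedef enum { CONTENT, STRONG, EMPH } Container;

typedef struct {
    Container items[64];
    size_t len;
} Stack;

/* On '*': if a strong container is on top of the stack, the token
   closes it; otherwise it opens a new one and pushes it.
   Returns true when a container was opened. */
bool handle_star(Stack *st) {
    if (st->len > 0 && st->items[st->len - 1] == STRONG) {
        st->len--; /* '*' ends the current strong container */
        return false;
    }
    st->items[st->len++] = STRONG; /* '*' starts a strong container */
    return true;
}
```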
It is mandatory to use the external scanner for the containers because of indentation. Three distinct tokens are dedicated to indentation:

- `indent`: when the following line starts further than the current one
- `dedent`: when the following line starts at a column previously set as an indentation column
- `redent`: when the following line starts at the same column as the current one, or before, but further than the previous indentation
The concept of redent can be seen here:

```typ
- 1
    - 2
  - 3
- 4
```

Which has the same hierarchy as:

```typ
- 1
  - 2
  - 3
- 4
```
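The three-way decision above can be sketched as a classification over a stack of opened indentation columns. This is a hypothetical simplification, assuming the scanner only compares the new line's column against the two innermost columns; the real logic handles more cases:

```c
/* Illustrative indentation tokens and column stack. */
typedef enum { TOK_INDENT, TOK_DEDENT, TOK_REDENT } IndentTok;

typedef struct {
    int cols[64]; /* opened indentation columns, innermost last; cols[0] == 0 */
    int len;
} IndentStack;

IndentTok classify(const IndentStack *s, int col) {
    int current = s->cols[s->len - 1];
    if (col > current)
        return TOK_INDENT; /* next line starts further than the current one */
    int previous = s->len >= 2 ? s->cols[s->len - 2] : 0;
    if (col > previous)
        return TOK_REDENT; /* at or before the current column, but past the
                              previous indentation */
    return TOK_DEDENT; /* back to a previously set indentation column */
}
```

With the example above: after `- 2` (columns 0 and 4 are open), `- 3` at column 2 is past column 0 but not past column 4, hence a redent.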
Five character classes are defined in the external scanner:

- space (`is_sp`)
- line break (`is_lb`)
- xid start (`is_id_start`)
- xid continue (`is_id_continue`)
- word part (`is_word_part`)
The three functions `is_id_start`, `is_id_continue` and `is_word_part` are implemented as binary searches. The character lists are based on the Unicode database, which can be found here: https://www.unicode.org/Public/UCD/latest/ucd/

A utility is used to produce those tables: https://github.com/uben0/unicode-table
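The binary-search lookup works over a sorted table of inclusive codepoint ranges. Here is a sketch with a tiny hand-picked sample table (the real tables are generated from the Unicode database by the utility above and are much larger):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Each entry is an inclusive [start, end] codepoint range,
   sorted by start. Sample data for illustration only. */
typedef struct { uint32_t start; uint32_t end; } Range;

static const Range xid_start_sample[] = {
    {0x41, 0x5A},   /* A-Z */
    {0x61, 0x7A},   /* a-z */
    {0xC0, 0xD6},   /* À-Ö */
    {0x391, 0x3A9}, /* Greek capitals */
};

/* Binary search for the range containing the codepoint. */
bool in_ranges(const Range *ranges, size_t len, uint32_t cp) {
    size_t lo = 0, hi = len;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].start)
            hi = mid;           /* search the lower half */
        else if (cp > ranges[mid].end)
            lo = mid + 1;       /* search the upper half */
        else
            return true;        /* cp falls inside this range */
    }
    return false;
}
```

Each lookup is O(log n) in the number of ranges, which keeps the character-class tests cheap even with the full Unicode tables.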
The `heading`, `item` and `term` nodes are technically delimited contexts, but because they behave similarly, they share the same external token as containers, which is `barrier`.