Tweak Tokenizer docs
jakobnissen committed Sep 12, 2023
1 parent 96115dc commit 504497a
Showing 2 changed files with 25 additions and 9 deletions.
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "Automa"
uuid = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
authors = ["Kenta Sato <[email protected]>", "Jakob Nybo Nissen <[email protected]>"]
version = "1.0.0"
version = "1.0.1"

[deps]
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
32 changes: 24 additions & 8 deletions docs/src/tokenizer.md
Original file line number Diff line number Diff line change
@@ -22,30 +22,40 @@ Any text of this format can be broken down into a sequence of the following tokens:
* Spaces: `re" +"`
* Letters: `re"[A-Za-z]+"`

Such that e.g. `("XY", "A")` can be represented as `lparent, quote, XY, quote, comma, space, quote A quote rparens`.
Such that e.g. `("XY", "A")` can be represented as the token sequence `lparens, quote, XY, quote, comma, space, quote, A, quote, rparens`.

Breaking the text down to its tokens is called tokenization or lexing. Note that lexing in itself is not sufficient to parse the format: Lexing is _context unaware_, so e.g. the test `"((A` can be perfectly well tokenized to `quote lparens lparens A`, even if it's invalid.
Breaking the text down to its tokens is called tokenization or lexing.
Note that lexing in itself is not sufficient to parse the format:
Lexing is _context unaware_ and doesn't understand syntax, so e.g. the text `"((A` can be perfectly well tokenized to `quote lparens lparens A`, even if it's invalid syntax.

The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier to, for example, to search for letters in the input. Instead of having to muck around with regex to find the letters, you use regex once to classify all text.
The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier, for example, to search for letters in the input.
Instead of having to muck around with regex to find the letters, you use regex once to classify all text.

## Making and using a tokenizer
Let's use the example above to create a tokenizer.
The most basic default tokenizer uses `UInt32` as tokens: You pass in a list of regex matching each token, then evaluate the resulting code:
The most basic tokenizer defaults to using `UInt32` as tokens: You pass in a list of regexes, one matching each token, then evaluate the resulting code:

```jldoctest tok1
julia> make_tokenizer(
[re"\(", re"\)", re",", re"\"", re" +", re"[a-zA-Z]+"]
) |> eval
```
The `make_tokenizer` function creates Julia code (an `Expr` object) that, when evaluated, defines `Base.iterate` for the `Tokenizer` type.
The code above defined `Base.iterate(::Tokenizer{UInt32, D, 1}) where D` - we'll get back to the different type parameters of `Tokenizer` later.
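
Since `make_tokenizer` only generates code, you can also capture the returned expression and evaluate it in a separate step.
A minimal sketch (the variable name `tokenizer_code` is just illustrative):

```julia
# Capture the generated code instead of piping it straight into eval.
tokenizer_code = make_tokenizer(
    [re"\(", re"\)", re",", re"\"", re" +", re"[a-zA-Z]+"]
)
tokenizer_code isa Expr  # true: it is ordinary Julia code
eval(tokenizer_code)     # evaluating it defines the Base.iterate method
```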

Since the default tokenizer uses `UInt32` as tokens, you can then obtain a lazy iterator of tokens by calling `tokenize(UInt32, data)`:
`Tokenizer`s are most easily created with the `tokenize` function.
To create a `Tokenizer{UInt32}`, we can call `tokenize(UInt32, data)`:

```jldoctest tok1
julia> iterator = tokenize(UInt32, """("XY", "A")"""); typeof(iterator)
Tokenizer{UInt32, String, 1}
```

This will return `Tuple{Int64, Int32, UInt32}` elements, with each element being:
Meaning: a `Tokenizer` emitting `UInt32` tokens over `String` data, using version `1`.
Since we used `make_tokenizer` above to define iteration for this kind of tokenizer (one with `UInt32` tokens),
we can now iterate it.

When we iterate, we get `Tuple{Int64, Int32, UInt32}` elements, with each element being:
* The start index of the token
* The length of the token
* The token itself, in this example `UInt32(1)` for '(', `UInt32(2)` for ')' etc:
@@ -65,6 +75,9 @@ julia> collect(iterator)
(11, 1, 0x00000002)
```

The type of the last element in each tuple comes from the `Tokenizer` type parameter:
We specified `UInt32`, so we get `UInt32` tokens.
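
Because each element carries the start index and the length, the original text of every token can be sliced back out of the input.
A small sketch, not part of the original docs, assuming the `UInt32` tokenizer defined above has been evaluated:

```julia
data = """("XY", "A")"""
for (start, len, token) in tokenize(UInt32, data)
    # start:start+len-1 is the span this token covers in `data`
    println(token, " => ", repr(data[start:start+len-1]))
end
```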

Any data which could not be tokenized is given the error token `UInt32(0)`:
```jldoctest tok1
julia> collect(tokenize(UInt32, "XY!!)"))
@@ -75,7 +88,10 @@ julia> collect(tokenize(UInt32, "XY!!)"))
```
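
Since untokenizable input shows up as the error token rather than an exception, checking whether a string tokenizes cleanly is a one-liner.
A hedged sketch relying on the tokenizer defined above; the helper function is not part of Automa:

```julia
# The input is fully tokenizable iff no emitted token is the error token UInt32(0).
all_tokens_valid(data) = all(token != UInt32(0) for (_, _, token) in tokenize(UInt32, data))

all_tokens_valid("""("XY", "A")""")  # true
all_tokens_valid("XY!!)")            # false: "!!" cannot be tokenized
```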

Both `tokenize` and `make_tokenizer` take an optional argument `version`, which is `1` by default.
This sets the last parameter of the `Tokenizer` struct, and as such allows you to create multiple different tokenizers with the same element type.
This sets the last parameter of the `Tokenizer` struct - for example, `make_tokenizer(tokens::Vector{RE}; version=5)`
defines `Base.iterate` for `Tokenizer{UInt32, D, 5}`.

Letting the user freely choose the value of the last type parameter allows you to create multiple different tokenizers with the same element type.
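
For example, a second, independent tokenizer can coexist with the one defined earlier by picking a different version number.
A hedged sketch; that `tokenize` accepts the version as a trailing positional argument is an assumption based on the sentence above, not something shown on this page:

```julia
# Define iteration for Tokenizer{UInt32, D, 2}: a tokenizer that only recognizes letters.
make_tokenizer([re"[a-zA-Z]+"]; version=2) |> eval

# The version-1 tokenizer defined earlier is untouched; this one is selected
# by asking for version 2 (assumed to be an optional trailing argument).
collect(tokenize(UInt32, "XYZ", 2))
```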

## Using enums as tokens
Using `UInt32` as tokens is not very convenient - so it's possible to use enums to create the tokenizer:
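
As a rough, hedged sketch of what an enum-based tokenizer could look like (the tuple form passed to `make_tokenizer` and all names below are assumptions for illustration, not taken from this page):

```julia
# Illustrative enum; its first value maps to 0, matching the integer error token above.
@enum MyToken errtoken lparens rparens comma quot space letters

# Assumed form: a tuple of (error token, token => regex pairs).
make_tokenizer((errtoken, [
    lparens => re"\(",
    rparens => re"\)",
    comma   => re",",
    quot    => re"\"",
    space   => re" +",
    letters => re"[a-zA-Z]+",
])) |> eval

collect(tokenize(MyToken, """("XY", "A")"""))
```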
@@ -142,4 +158,4 @@ However, note that this may cause most tokenizers to error when being built, as
Automa.Tokenizer
Automa.tokenize
Automa.make_tokenizer
```
```

2 comments on commit 504497a

@jakobnissen (Member Author)

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/91238

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v1.0.1 -m "<description of version>" 504497ad3644e5c3ec68fee4f003624e55e8b24f
git push origin v1.0.1
