Breaking changes for v1 #119

Merged: 64 commits, Jul 19, 2023
Commits
4b6fe1f
Remove SIMD and inline generator
jakobnissen Jul 15, 2022
93aef95
Remove loop unrolling
jakobnissen Jul 16, 2022
67556af
Allow CodeGenContext to be elided
jakobnissen Jul 16, 2022
088b9e4
Update benchmarks
jakobnissen Jul 16, 2022
8ad0503
Saner default for p_eof and p_end
jakobnissen Jul 17, 2022
1e95ee4
Add generate_code convenience function
jakobnissen Jul 17, 2022
cbc8215
Add validator convenience function
jakobnissen Jul 17, 2022
7c57e3c
Enable ambiguity check
jakobnissen Jul 21, 2022
1e79027
Enforce action dict symbols are same as machines
jakobnissen Jul 23, 2022
5357b75
Add check for invalid RE.action keys
jakobnissen Jul 24, 2022
5d5e517
Add input error code to generate_code function
jakobnissen Jul 24, 2022
d238472
Export user-facing names
jakobnissen Jul 25, 2022
f333cfb
Check preconditions before declaring NFA ambiguous
jakobnissen Jul 27, 2022
a743327
Fix EOF check in machine error code
jakobnissen Jul 28, 2022
52bcd12
Do not store gensym symbols in default CodeGenContext
jakobnissen Jul 28, 2022
01e26e5
Add more comments
jakobnissen Jul 30, 2022
46f8cfb
Also trigger default error when cs > 0
jakobnissen Aug 1, 2022
d3b6ada
Remove checkbounds option
jakobnissen Aug 1, 2022
fe52c0a
Make clean work
jakobnissen Aug 1, 2022
8b50068
Make Variables easier to construct
jakobnissen Aug 1, 2022
f288142
Add tests for regex set operations
jakobnissen Aug 2, 2022
af92b38
Minor polish
jakobnissen Aug 3, 2022
2ffe564
Rename p_eof to is_eof
jakobnissen Aug 12, 2022
51fe1bc
Comment generate_reader better
jakobnissen Feb 22, 2023
82a2836
Small tweaks
jakobnissen Feb 22, 2023
d6351b8
Add default error in generated reader function
jakobnissen Feb 22, 2023
15b6389
Fix bug in execute_debug
jakobnissen Feb 23, 2023
48baf89
Export machine2dot
jakobnissen Feb 23, 2023
cd94e07
Make more use of magical macros
jakobnissen Feb 23, 2023
ba32c78
Add generate_io_validator
jakobnissen Feb 23, 2023
f078d29
Add documentation to pseudomacros
jakobnissen Feb 25, 2023
35ad643
Remove dead eps nodes
jakobnissen Feb 24, 2023
a1f67fa
Remove last Stream non-pseudomacros
jakobnissen Feb 25, 2023
56c63e1
Remove Stream module
jakobnissen Feb 25, 2023
c9d5756
Simplify Machine struct layout
jakobnissen Feb 27, 2023
9f09df9
Fix warnings when running tests
jakobnissen Feb 27, 2023
daf589c
Disallow direct modification of actions field
jakobnissen Feb 27, 2023
ce522bb
Use `using` over `import`
jakobnissen Feb 28, 2023
d6b5699
Update FASTA example
jakobnissen Feb 28, 2023
d4cb6ac
Disallow final actions in looping regex
jakobnissen Mar 1, 2023
afb2e47
Make TranscodingStreams an optional dependency
jakobnissen Mar 1, 2023
6ad6a99
Error with shortest known ambiguity
jakobnissen Mar 1, 2023
d37f804
Also check ambiguities for final and all actions
jakobnissen Mar 1, 2023
decd39f
Rewrite tokenizer
jakobnissen Mar 1, 2023
a2681b5
Rename generate_validator_function
jakobnissen Mar 7, 2023
23f47e2
Export regex struct instead of module
jakobnissen Mar 7, 2023
fb98649
Tweak: Allow | and & ops b/w chars/str and RE
jakobnissen Mar 7, 2023
35d578b
Remove report_col kwarg
jakobnissen Mar 8, 2023
2d16183
Add SnoopPrecompile
jakobnissen Mar 8, 2023
937bebb
Rewrite documentation
jakobnissen Feb 22, 2023
1dda546
Bump CI version to Julia 1.6
jakobnissen Mar 8, 2023
151a4c5
Make generate_buffer_validator goto into kwarg
jakobnissen Mar 8, 2023
2bb69f2
Update README.md
jakobnissen Mar 9, 2023
3e04679
Add documentation preview
jakobnissen Mar 9, 2023
3c953e6
Always remove dead nodes
jakobnissen Mar 9, 2023
6b7ef62
Do not make TranscodingStreams an extension
jakobnissen Apr 21, 2023
fec085b
Add todo to gitignore
jakobnissen Apr 25, 2023
76145dc
Migrate from SnoopPrecompile to PrecompileTools
jakobnissen Apr 25, 2023
52faf80
Disable SIMD capability
jakobnissen Jul 1, 2023
6e3b676
Fix preconditions
jakobnissen Jul 18, 2023
e2b285a
Add more tests
jakobnissen Jul 19, 2023
355c869
Some JET fixes
jakobnissen Jul 19, 2023
b72dae5
Improve string/char and RE operations
jakobnissen Jul 19, 2023
fe0feb1
Fix Project
jakobnissen Jul 19, 2023
17 changes: 17 additions & 0 deletions .github/workflows/Documentation.yml
@@ -0,0 +1,17 @@
name: Documentation

on:
push:
pull_request:

jobs:
Documenter:
name: Documentation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: julia-actions/julia-buildpkg@latest
- uses: julia-actions/julia-docdeploy@latest
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
1 change: 0 additions & 1 deletion .github/workflows/Downstream.yml
@@ -20,7 +20,6 @@ jobs:
- {user: BioJulia, repo: BED.jl, group: Automa}
- {user: BioJulia, repo: BigBed.jl, group: Automa}
- {user: BioJulia, repo: FASTX.jl, group: Automa}
- {user: BioJulia, repo: GeneticVariation.jl, group: Automa}
- {user: BioJulia, repo: GFF3.jl, group: Automa}
- {user: BioJulia, repo: XAM.jl, group: Automa}
- {user: dellison, repo: ConstituencyTrees.jl, group: Automa}
19 changes: 1 addition & 18 deletions .github/workflows/UnitTests.yml
@@ -12,7 +12,7 @@ jobs:
fail-fast: false
matrix:
julia-version:
- '1.5'
- '1.6' # LTS
- '1'
julia-arch: [x86]
os: [ubuntu-latest, windows-latest, macOS-latest]
@@ -42,20 +42,3 @@ jobs:
name: codecov-umbrella
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }}
docs:
name: Documentation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: julia-actions/setup-julia@latest
with:
version: '1'
- run: |
julia --project=docs -e '
using Pkg
Pkg.develop(PackageSpec(path=pwd()))
Pkg.instantiate()'
- run: julia --project=docs docs/make.jl
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
4 changes: 3 additions & 1 deletion .gitignore
@@ -5,4 +5,6 @@ docs/*.dot
docs/build/
docs/site/
.Rproj.user
/Manifest.toml
Manifest.toml
todo.md
/LocalPreferences.toml
11 changes: 6 additions & 5 deletions Project.toml
@@ -1,19 +1,20 @@
name = "Automa"
uuid = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
authors = ["Kenta Sato <[email protected]>", "Jakob Nybo Nissen <[email protected]"]
version = "0.8.3"
version = "1.0.0"

[deps]
ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[compat]
ScanByte = "0.4.0"
julia = "1.6"
PrecompileTools = "1"
TranscodingStreams = "0.9"
julia = "1.5"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[targets]
test = ["Test"]
test = ["Test", "TranscodingStreams"]
143 changes: 67 additions & 76 deletions README.md
@@ -1,92 +1,83 @@
Automa.jl
=========
# Automa.jl

[![Docs Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://biojulia.github.io/Automa.jl/latest/)
[![codecov.io](http://codecov.io/github/BioJulia/Automa.jl/coverage.svg?branch=master)](http://codecov.io/github/BioJulia/Automa.jl?branch=master)

A Julia package for text validation, parsing, and tokenizing based on state machine compiler.
Automa is a regex-to-Julia compiler.
By compiling regex to Julia code in the form of `Expr` objects,
Automa provides facilities to create efficient and robust regex-based lexers, tokenizers and parsers using Julia's metaprogramming capabilities.
You can view Automa as a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex matches an input.

![Schema of Automa.jl](/docs/src/figure/Automa.png)
![Schema of Automa.jl](figure/Automa.png)

Automa.jl compiles regular expressions into Julia code, which is then compiled
into low-level machine code by the Julia compiler. Automa.jl is designed to
generate very efficient code to scan large text data, which is often much faster
than handcrafted code. Automa.jl can insert arbitrary Julia code that will be
executed in state transitions. This makes it possible, for example, to extract
substrings that match a part of a regular expression.
Automa is designed to generate very efficient code to scan large text data, often much faster than handcrafted code.

This is a number literal tokenizer using Automa.jl ([numbers.jl](example/numbers.jl)):
For more information [read the documentation](https://biojulia.github.io/Automa.jl/latest/), or read the examples below and in the `examples/` directory in this repository.

## Examples
### Validate some text only is composed of ASCII alphanumeric characters
```julia
# A tokenizer of octal, decimal, hexadecimal and floating point numbers
# =====================================================================

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

# Describe patterns in regular expression.
oct = re"0o[0-7]+"
dec = re"[-+]?[0-9]+"
hex = re"0x[0-9A-Fa-f]+"
prefloat = re"[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)"
float = prefloat | re.cat(prefloat | re"[-+]?[0-9]+", re"[eE][-+]?[0-9]+")
number = oct | dec | hex | float
numbers = re.cat(re.opt(number), re.rep(re" +" * number), re" *")

# Register action names to regular expressions.
number.actions[:enter] = [:mark]
oct.actions[:exit] = [:oct]
dec.actions[:exit] = [:dec]
hex.actions[:exit] = [:hex]
float.actions[:exit] = [:float]

# Compile a finite-state machine.
machine = Automa.compile(numbers)

# This generates a SVG file to visualize the state machine.
# write("numbers.dot", Automa.machine2dot(machine))
# run(`dot -Tpng -o numbers.png numbers.dot`)

# Bind an action code for each action name.
actions = Dict(
:mark => :(mark = p),
:oct => :(emit(:oct)),
:dec => :(emit(:dec)),
:hex => :(emit(:hex)),
:float => :(emit(:float)),
)
using Automa

# Generate a tokenizing function from the machine.
context = Automa.CodeGenContext()
@eval function tokenize(data::String)
tokens = Tuple{Symbol,String}[]
mark = 0
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
emit(kind) = push!(tokens, (kind, data[mark:p-1]))
$(Automa.generate_exec_code(context, machine, actions))
return tokens, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end
generate_buffer_validator(:validate_alphanumeric, re"[a-zA-Z0-9]*") |> eval

tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23")
for s in ["abc", "aU81m", "!,>"]
println("$s is alphanumeric? $(isnothing(validate_alphanumeric(s)))")
end
```
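
The loop in the new README example treats a `nothing` return from the generated `validate_alphanumeric` as "input is valid". As a rough illustration of that contract, here is a hand-written stand-in in plain Julia. It is not the code Automa generates. The name `validate_alphanumeric_sketch` is hypothetical, and returning the 1-based index of the first offending byte on failure is an assumption about the generated function that this diff does not show.

```julia
# Hand-written stand-in for the generated validator; NOT Automa's output.
# Returns `nothing` when every byte is an ASCII letter or digit, otherwise
# the 1-based index of the first offending byte (assumed failure contract).
function validate_alphanumeric_sketch(data::AbstractString)
    for (i, byte) in enumerate(codeunits(data))
        ok = (UInt8('a') <= byte <= UInt8('z')) ||
             (UInt8('A') <= byte <= UInt8('Z')) ||
             (UInt8('0') <= byte <= UInt8('9'))
        ok || return i
    end
    return nothing
end

# Same driver loop as the README example, pointed at the stand-in.
for s in ["abc", "aU81m", "!,>"]
    println("$s is alphanumeric? $(isnothing(validate_alphanumeric_sketch(s)))")
end
```

Returning a position rather than a `Bool` lets a caller report where the input went wrong while still supporting the `isnothing(...)` check used in the README.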

This emits tokens and the final status:
### Making a lexer
```julia
using Automa

tokens = [
:identifier => re"[A-Za-z_][0-9A-Za-z_!]*",
:lparens => re"\(",
:rparens => re"\)",
:comma => re",",
:quot => re"\"",
:space => re"[\t\f ]+",
];
@eval @enum Token errortoken $(first.(tokens)...)
make_tokenizer((errortoken,
[Token(i) => j for (i,j) in enumerate(last.(tokens))]
)) |> eval

collect(tokenize(Token, """(alpha, "beta15")"""))
```
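
To make the longest-match idea behind a lexer like this concrete, here is a small tokenizer in plain Julia built on Base `Regex`. It is only a sketch of the behaviour, not how `make_tokenizer` works internally. The names `TokenSketch`, `RULES_SKETCH` and `tokenize_sketch`, the `sk_` prefixes, and the `(start, length, token)` tuple shape are all assumptions made for illustration.

```julia
# Hypothetical token kinds mirroring the README's list; the `sk_` prefix
# avoids clashing with the README's own enum if both are loaded together.
@enum TokenSketch sk_error sk_identifier sk_lparens sk_rparens sk_comma sk_quot sk_space

# Each rule is anchored with `^` so it can only match at the current position.
const RULES_SKETCH = [
    sk_identifier => r"^[A-Za-z_][0-9A-Za-z_!]*",
    sk_lparens    => r"^\(",
    sk_rparens    => r"^\)",
    sk_comma      => r"^,",
    sk_quot       => r"^\"",
    sk_space      => r"^[\t\f ]+",
]

# Returns (start, length, token) tuples; this shape is an assumption.
function tokenize_sketch(data::String)
    out = Tuple{Int,Int,TokenSketch}[]
    pos = 1
    while pos <= ncodeunits(data)
        rest = SubString(data, pos)
        best_len, best_tok = 0, sk_error
        # Try every rule and keep the longest match (first rule wins ties).
        for (tok, rx) in RULES_SKETCH
            m = match(rx, rest)
            if m !== nothing && ncodeunits(m.match) > best_len
                best_len, best_tok = ncodeunits(m.match), tok
            end
        end
        # No rule matched: emit a one-character error token and move on.
        best_len == 0 && (best_len = nextind(data, pos) - pos)
        push!(out, (pos, best_len, best_tok))
        pos += best_len
    end
    return out
end

tokens = tokenize_sketch("""(alpha, "beta15")""")
```

Unmatched input becomes an error token instead of an exception, which mirrors the role the `errortoken` variant appears to play in the README example.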

~/.j/v/Automa (master) $ julia -qL example/numbers.jl
julia> tokens
6-element Array{Tuple{Symbol,String},1}:
(:dec,"1")
(:hex,"0x0123BEEF")
(:oct,"0o754")
(:float,"3.14")
(:float,"-1e4")
(:float,"+6.022045e23")
### Make a simple TSV file parser
```julia
using Automa

machine = let
name = onexit!(onenter!(re"[^\t\r\n]+", :mark), :name)
field = onexit!(onenter!(re"[^\t\r\n]+", :mark), :field)
nameline = name * rep('\t' * name)
record = onexit!(field * rep('\t' * field), :record)
compile(nameline * re"\r?\n" * record * rep(re"\r?\n" * record) * rep(re"\r?\n"))
end

julia> status
:ok
actions = Dict(
:mark => :(pos = p),
:name => :(push!(headers, String(data[pos:p-1]))),
:field => quote
n_fields += 1
push!(fields, String(data[pos:p-1]))
end,
:record => quote
n_fields == length(headers) || error("Malformed TSV")
n_fields = 0
end
)

The compiled deterministic finite automaton (DFA) looks like this:
![DFA](/docs/src/figure/numbers.png)
@eval function parse_tsv(data)
headers = String[]
fields = String[]
pos = n_fields = 0
$(generate_code(machine, actions))
(headers, reshape(fields, length(headers), :))
end

For more details, see [fasta.jl](/example/fasta.jl) and read the docs page.
header, data = parse_tsv("a\tabc\n12\t13\r\nxyc\tz\n\n")
```
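
The last line of `parse_tsv` returns `reshape(fields, length(headers), :)`, and it is easy to misread which axis holds the records. The sketch below reproduces only that bookkeeping in plain Julia with `split`, without Automa. `parse_tsv_sketch` is a hypothetical name, and it is deliberately looser than the regex in the README: for example, it skips empty lines anywhere instead of only at the end.

```julia
# Plain-Julia stand-in for the README's parse_tsv; NOT generated by Automa.
# It mirrors the bookkeeping of the actions: collect headers, collect fields,
# check each record's width, then reshape the flat field list.
function parse_tsv_sketch(data::AbstractString)
    lines = filter(!isempty, split(data, r"\r?\n"))
    isempty(lines) && error("Malformed TSV")
    headers = String.(split(first(lines), '\t'))
    fields = String[]
    for line in lines[2:end]
        row = split(line, '\t')
        length(row) == length(headers) || error("Malformed TSV")
        append!(fields, String.(row))
    end
    # Julia arrays are column-major, so each COLUMN of the result is one record.
    return (headers, reshape(fields, length(headers), :))
end

header, data = parse_tsv_sketch("a\tabc\n12\t13\r\nxyc\tz\n\n")
```

With this input, `data[:, 1]` holds the first record (`"12"`, `"13"`) and `data[:, 2]` the second, so row `i` of the matrix lines up with `header[i]`.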
59 changes: 21 additions & 38 deletions benchmark/runbenchmarks.jl
@@ -1,5 +1,4 @@
import Automa
import Automa.RegExp: @re_str
using Automa
using BenchmarkTools
using Random: seed!

@@ -27,25 +26,21 @@ println("PCRE: ", @benchmark match(data))

machine = Automa.compile(re"([A-z]*\r?\n)*")
VISUALIZE && writesvg("case1", machine)
context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
context = Automa.CodeGenContext()
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl: ", @benchmark match(data))

context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
context = Automa.CodeGenContext(generator=:goto)
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl (unrolled): ", @benchmark match(data))
println("Automa.jl (goto): ", @benchmark match(data))


# Case 2
@@ -59,25 +54,21 @@ println("PCRE: ", @benchmark match(data))

machine = Automa.compile(re"([A-Za-z]*\r?\n)*")
VISUALIZE && writesvg("case2", machine)
context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
context = Automa.CodeGenContext()
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl: ", @benchmark match(data))

context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
context = Automa.CodeGenContext(generator=:goto)
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl (unrolled): ", @benchmark match(data))
println("Automa.jl (goto): ", @benchmark match(data))


# Case 3
@@ -91,25 +82,21 @@ println("PCRE: ", @benchmark match(data))

machine = Automa.compile(re"([ACGTacgt]*\r?\n)*")
VISUALIZE && writesvg("case3", machine)
context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
context = Automa.CodeGenContext()
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl: ", @benchmark match(data))

context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
context = Automa.CodeGenContext(generator=:goto)
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl (unrolled): ", @benchmark match(data))
println("Automa.jl (goto): ", @benchmark match(data))


# Case 4
@@ -123,22 +110,18 @@ println("PCRE: ", @benchmark match(data))

machine = Automa.compile(re"([A-Za-z\*-]*\r?\n)*")
VISUALIZE && writesvg("case4", machine)
context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
context = Automa.CodeGenContext()
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl: ", @benchmark match(data))

context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
context = Automa.CodeGenContext(generator=:goto)
@eval function match(data)
$(Automa.generate_init_code(context, machine))
p_end = p_eof = lastindex(data)
$(Automa.generate_exec_code(context, machine))
$(Automa.generate_code(context, machine))
return cs == 0
end
@assert match(data)
println("Automa.jl (unrolled): ", @benchmark match(data))
println("Automa.jl (goto): ", @benchmark match(data))
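
Across these benchmark hunks, three lines (init code, the `p_end = p_eof = lastindex(data)` assignment, exec code) collapse into a single `generate_code` call. The sketch below shows the general pattern of splicing several quoted expressions into one function body with `@eval`, using plain Julia only. Everything here is a hypothetical stand-in: `init_sketch`, `exec_sketch`, `generate_code_sketch` and the hand-written three-state machine are not Automa's output, and the diff does not show whether the real `generate_code` appends anything further, such as error handling.

```julia
# Initialisation: position `p` and current state `cs`.
init_sketch() = quote
    p = 1
    cs = 1
end

# Execution: a hand-written DFA for ([A-Za-z]*\r?\n)*.
# States: 1 = at line start (accepting), 2 = inside letters, 3 = after '\r'.
# Convention mirrored from the benchmark: cs == 0 means success.
exec_sketch() = quote
    while p <= p_end && cs > 0
        byte = codeunit(data, p)
        if byte == 0x0a
            cs = 1
        elseif cs != 3 && byte == 0x0d
            cs = 3
        elseif cs != 3 && (UInt8('A') <= byte <= UInt8('Z') ||
                           UInt8('a') <= byte <= UInt8('z'))
            cs = 2
        else
            cs = -cs   # negative state marks a failed match
        end
        p += 1
    end
    # At end of input, only the accepting state is turned into 0.
    if cs == 1
        cs = 0
    end
end

# Convenience wrapper: one expression that contains all three pieces.
generate_code_sketch() = quote
    $(init_sketch())
    p_end = ncodeunits(data)
    $(exec_sketch())
end

@eval function match_lines_sketch(data::String)
    $(generate_code_sketch())
    return cs == 0
end
```

Calling `match_lines_sketch("abc\r\nDEF\n")` walks the machine over the bytes and returns `true`, while input that stops mid-line, such as `"abc"`, leaves `cs` in a non-accepting state and returns `false`.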
6 changes: 4 additions & 2 deletions docs/Project.toml
@@ -1,7 +1,9 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Automa = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[compat]
Automa = "1"
TranscodingStreams = "0.9"
Documenter = "0.24 - 0.26"
Automa = "0.8 - 0.9"