Breaking changes for V1 #95

Merged Mar 9, 2023 (60 commits)
Commits
95f2606 Remove dependency on Printf (jakobnissen, Mar 1, 2023)
1eaf4e3 Remove SIMD and inline generator (jakobnissen, Jul 15, 2022)
9f97bfc Remove loop unrolling (jakobnissen, Jul 16, 2022)
2be04f3 Allow CodeGenContext to be elided (jakobnissen, Jul 16, 2022)
34bb28f Update benchmarks (jakobnissen, Jul 16, 2022)
39d8bc7 Saner default for p_eof and p_end (jakobnissen, Jul 17, 2022)
3d90f91 Add generate_code convenience function (jakobnissen, Jul 17, 2022)
bba0a5a Add validator convenience function (jakobnissen, Jul 17, 2022)
faf99e7 Enable ambiguity check (jakobnissen, Jul 21, 2022)
ff1da91 Enforce action dict symbols are same as machines (jakobnissen, Jul 23, 2022)
34f993a Add check for invalid RE.action keys (jakobnissen, Jul 24, 2022)
7e934fe Add input error code to generate_code function (jakobnissen, Jul 24, 2022)
fe5ef38 Export user-facing names (jakobnissen, Jul 25, 2022)
17d9e35 Check preconditions before declaring NFA ambiguous (jakobnissen, Jul 27, 2022)
6d77ca3 Fix EOF check in machine error code (jakobnissen, Jul 28, 2022)
9f47b3d Do not store gensym symbols in default CodeGenContext (jakobnissen, Jul 28, 2022)
a1300f3 Add more comments (jakobnissen, Jul 30, 2022)
f93b9d1 Also trigger default error when cs > 0 (jakobnissen, Aug 1, 2022)
2e30c1c Remove checkbounds option (jakobnissen, Aug 1, 2022)
8fefcd4 Make clean work (jakobnissen, Aug 1, 2022)
e9932a5 Make Variables easier to construct (jakobnissen, Aug 1, 2022)
a9b1636 Add tests for regex set operations (jakobnissen, Aug 2, 2022)
8c46b76 Minor polish (jakobnissen, Aug 3, 2022)
7bccf92 Rename p_eof to is_eof (jakobnissen, Aug 12, 2022)
ce90f37 Comment generate_reader better (jakobnissen, Feb 22, 2023)
ac6f65e Small tweaks (jakobnissen, Feb 22, 2023)
a78fe43 Add default error in generated reader function (jakobnissen, Feb 22, 2023)
761e7f6 Fix bug in execute_debug (jakobnissen, Feb 23, 2023)
45bec87 Export machine2dot (jakobnissen, Feb 23, 2023)
c396b75 Make more use of magical macros (jakobnissen, Feb 23, 2023)
a59b9f0 Add generate_io_validator (jakobnissen, Feb 23, 2023)
98e9581 Add documentation to pseudomacros (jakobnissen, Feb 25, 2023)
1b801f4 Remove dead eps nodes (jakobnissen, Feb 24, 2023)
175fd02 Remove last Stream non-pseudomacros (jakobnissen, Feb 25, 2023)
74671a6 Remove Stream module (jakobnissen, Feb 25, 2023)
40a07d9 Simplify Machine struct layout (jakobnissen, Feb 27, 2023)
ea8beec Fix warnings when running tests (jakobnissen, Feb 27, 2023)
c26a88e Disallow direct modification of actions field (jakobnissen, Feb 27, 2023)
97448f8 Use `using` over `import` (jakobnissen, Feb 28, 2023)
4cec9bf Update FASTA example (jakobnissen, Feb 28, 2023)
1d41382 Disallow final actions in looping regex (jakobnissen, Mar 1, 2023)
5bb87df WIP: Tokenizer: Remove deprecated method (jakobnissen, Feb 28, 2023)
0cca005 Make TranscodingStreams an optional dependency (jakobnissen, Mar 1, 2023)
da2320f Error with shortest known ambiguity (jakobnissen, Mar 1, 2023)
69680f1 Also check ambiguities for final and all actions (jakobnissen, Mar 1, 2023)
55d81c9 Rewrite tokenizer (jakobnissen, Mar 1, 2023)
cb7fe5f Rename generate_validator_function (jakobnissen, Mar 7, 2023)
603a783 Export regex struct instead of module (jakobnissen, Mar 7, 2023)
ef05382 Tweak: Allow | and & ops b/w chars/str and RE (jakobnissen, Mar 7, 2023)
87d1f83 Remove report_col kwarg (jakobnissen, Mar 8, 2023)
e5b22a5 Add SnoopPrecompile (jakobnissen, Mar 8, 2023)
7f2db10 Rewrite documentation (jakobnissen, Feb 22, 2023)
4f3acbe Add figures to docs (jakobnissen, Mar 8, 2023)
4229921 Bump CI version to Julia 1.6 (jakobnissen, Mar 8, 2023)
4868df3 Support Julia version 1.6 (jakobnissen, Mar 8, 2023)
2683fce Make generate_buffer_validator goto into kwarg (jakobnissen, Mar 8, 2023)
9eec66c SnoopPrecompile more stuff (jakobnissen, Mar 8, 2023)
e0b5e55 Various small fixes to documentation (jakobnissen, Mar 9, 2023)
6c4e244 Update README.md (jakobnissen, Mar 9, 2023)
d2b4a1c Add documentation preview (jakobnissen, Mar 9, 2023)
22 changes: 22 additions & 0 deletions .github/workflows/Documentation.yml
@@ -0,0 +1,22 @@
+name: Documentation
+
+on:
+  push:
+    branches:
+      - master
+      - develop
+      - release/.*
+    tags: '*'
+  pull_request:
+
+jobs:
+  Documenter:
+    name: Documentation
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - uses: julia-actions/julia-buildpkg@latest
+      - uses: julia-actions/julia-docdeploy@latest
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
1 change: 0 additions & 1 deletion .github/workflows/Downstream.yml
@@ -20,7 +20,6 @@ jobs:
           - {user: BioJulia, repo: BED.jl, group: Automa}
           - {user: BioJulia, repo: BigBed.jl, group: Automa}
           - {user: BioJulia, repo: FASTX.jl, group: Automa}
-          - {user: BioJulia, repo: GeneticVariation.jl, group: Automa}
           - {user: BioJulia, repo: GFF3.jl, group: Automa}
           - {user: BioJulia, repo: XAM.jl, group: Automa}
           - {user: dellison, repo: ConstituencyTrees.jl, group: Automa}
19 changes: 1 addition & 18 deletions .github/workflows/UnitTests.yml
@@ -12,7 +12,7 @@ jobs:
       fail-fast: false
       matrix:
         julia-version:
-          - '1.5'
+          - '1.6' # LTS
           - '1'
         julia-arch: [x86]
         os: [ubuntu-latest, windows-latest, macOS-latest]
@@ -42,20 +42,3 @@ jobs:
           name: codecov-umbrella
           fail_ci_if_error: false
           token: ${{ secrets.CODECOV_TOKEN }}
-  docs:
-    name: Documentation
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-      - uses: julia-actions/setup-julia@latest
-        with:
-          version: '1'
-      - run: |
-          julia --project=docs -e '
-            using Pkg
-            Pkg.develop(PackageSpec(path=pwd()))
-            Pkg.instantiate()'
-      - run: julia --project=docs docs/make.jl
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
2 changes: 1 addition & 1 deletion .gitignore
@@ -5,4 +5,4 @@ docs/*.dot
 docs/build/
 docs/site/
 .Rproj.user
-/Manifest.toml
+Manifest.toml
15 changes: 12 additions & 3 deletions Project.toml
@@ -1,19 +1,28 @@
 name = "Automa"
 uuid = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
 authors = ["Kenta Sato <[email protected]>", "Jakob Nybo Nissen <[email protected]"]
-version = "0.8.2"
+version = "1.0.0"

 [deps]
 ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
+SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
 TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"
+
+[weakdeps]
+TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"
+
+[extensions]
+AutomaStream = "TranscodingStreams"

 [compat]
 ScanByte = "0.3.3"
+SnoopPrecompile = "1"
 TranscodingStreams = "0.9"
-julia = "1.5"
+julia = "1.6"

 [extras]
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
+TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

 [targets]
-test = ["Test"]
+test = ["Test"]
+test = ["Test", "TranscodingStreams"]
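Reviewer note: the new `[weakdeps]` and `[extensions]` tables are what make TranscodingStreams optional. The sketch below, which uses only Julia's `TOML` standard library, reads an abridged copy of the post-change sections and lists which dependencies stay mandatory and which package triggers the `AutomaStream` extension. It assumes a Julia version with package-extension support, where a package listed in both `[deps]` and `[weakdeps]` is treated as weak; the helper name `triggers` is hypothetical.

```julia
using TOML

# Abridged copy of the post-change Project.toml sections from this PR.
project = TOML.parse("""
[deps]
ScanByte = "7b38b023-a4d7-4c5e-8d43-3f3097f304eb"
SnoopPrecompile = "66db9d55-30c0-4569-8b51-7e840670fc0c"
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[weakdeps]
TranscodingStreams = "3bb67fe8-82b1-5028-8e26-92a6c54297fa"

[extensions]
AutomaStream = "TranscodingStreams"
""")

# Hypothetical helper: the packages that trigger an extension.
# The TOML value may be a single string or a list of strings.
function triggers(ext)
    t = project["extensions"][ext]
    return t isa AbstractString ? [t] : t
end

# Assumption: anything in [deps] that is also in [weakdeps] counts as weak.
weak = Set(keys(project["weakdeps"]))
hard = sort!([name for name in keys(project["deps"]) if !(name in weak)])

println("hard deps: ", hard)
println("AutomaStream triggers: ", triggers("AutomaStream"))
```

Listing TranscodingStreams under `[extras]` and the `test` target, as the hunk does, keeps it available to the test suite even though it is no longer a mandatory dependency.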
143 changes: 67 additions & 76 deletions README.md
@@ -1,92 +1,83 @@
-Automa.jl
-=========
+# Automa.jl

 [![Docs Latest](https://img.shields.io/badge/docs-latest-blue.svg)](https://biojulia.github.io/Automa.jl/latest/)
 [![codecov.io](http://codecov.io/github/BioJulia/Automa.jl/coverage.svg?branch=master)](http://codecov.io/github/BioJulia/Automa.jl?branch=master)

-A Julia package for text validation, parsing, and tokenizing based on state machine compiler.
+Automa is a regex-to-Julia compiler.
+By compiling regex to Julia code in the form of `Expr` objects,
+Automa provides facilities to create efficient and robust regex-based lexers, tokenizers and parsers using Julia's metaprogramming capabilities.
+You can view Automa as a regex engine that can insert arbitrary Julia code into its input matching process, which will be executed when certain parts of the regex matches an input.

-![Schema of Automa.jl](/docs/src/figure/Automa.png)
+![Schema of Automa.jl](figure/Automa.png)

-Automa.jl compiles regular expressions into Julia code, which is then compiled
-into low-level machine code by the Julia compiler. Automa.jl is designed to
-generate very efficient code to scan large text data, which is often much faster
-than handcrafted code. Automa.jl can insert arbitrary Julia code that will be
-executed in state transitions. This makes it possible, for example, to extract
-substrings that match a part of a regular expression.
+Automa is designed to generate very efficient code to scan large text data, often much faster than handcrafted code.

-This is a number literal tokenizer using Automa.jl ([numbers.jl](example/numbers.jl)):
+For more information [read the documentation](https://biojulia.github.io/Automa.jl/latest/), or read the examples below and in the `examples/` directory in this repository.
+
+## Examples
+### Validate some text only is composed of ASCII alphanumeric characters
 ```julia
-# A tokenizer of octal, decimal, hexadecimal and floating point numbers
-# =====================================================================
-
-import Automa
-import Automa.RegExp: @re_str
-const re = Automa.RegExp
-
-# Describe patterns in regular expression.
-oct = re"0o[0-7]+"
-dec = re"[-+]?[0-9]+"
-hex = re"0x[0-9A-Fa-f]+"
-prefloat = re"[-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)"
-float = prefloat | re.cat(prefloat | re"[-+]?[0-9]+", re"[eE][-+]?[0-9]+")
-number = oct | dec | hex | float
-numbers = re.cat(re.opt(number), re.rep(re" +" * number), re" *")
-
-# Register action names to regular expressions.
-number.actions[:enter] = [:mark]
-oct.actions[:exit] = [:oct]
-dec.actions[:exit] = [:dec]
-hex.actions[:exit] = [:hex]
-float.actions[:exit] = [:float]
-
-# Compile a finite-state machine.
-machine = Automa.compile(numbers)
-
-# This generates a SVG file to visualize the state machine.
-# write("numbers.dot", Automa.machine2dot(machine))
-# run(`dot -Tpng -o numbers.png numbers.dot`)
-
-# Bind an action code for each action name.
-actions = Dict(
-    :mark => :(mark = p),
-    :oct => :(emit(:oct)),
-    :dec => :(emit(:dec)),
-    :hex => :(emit(:hex)),
-    :float => :(emit(:float)),
-)
+using Automa

-# Generate a tokenizing function from the machine.
-context = Automa.CodeGenContext()
-@eval function tokenize(data::String)
-    tokens = Tuple{Symbol,String}[]
-    mark = 0
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    emit(kind) = push!(tokens, (kind, data[mark:p-1]))
-    $(Automa.generate_exec_code(context, machine, actions))
-    return tokens, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
-end
+generate_buffer_validator(:validate_alphanumeric, re"[a-zA-Z0-9]*") |> eval

-tokens, status = tokenize("1 0x0123BEEF 0o754 3.14 -1e4 +6.022045e23")
+for s in ["abc", "aU81m", "!,>"]
+    println("$s is alphanumeric? $(isnothing(validate_alphanumeric(s)))")
+end
 ```

-This emits tokens and the final status:
+### Making a lexer
+```julia
+using Automa
+
+tokens = [
+    :identifier => re"[A-Za-z_][0-9A-Za-z_!]*",
+    :lparens => re"\(",
+    :rparens => re"\)",
+    :comma => re",",
+    :quot => re"\"",
+    :space => re"[\t\f ]+",
+];
+@eval @enum Token errortoken $(first.(tokens)...)
+make_tokenizer((errortoken,
+    [Token(i) => j for (i,j) in enumerate(last.(tokens))]
+)) |> eval
+
+collect(tokenize(Token, """(alpha, "beta15")"""))
+```

-    ~/.j/v/Automa (master) $ julia -qL example/numbers.jl
-    julia> tokens
-    6-element Array{Tuple{Symbol,String},1}:
-     (:dec,"1")
-     (:hex,"0x0123BEEF")
-     (:oct,"0o754")
-     (:float,"3.14")
-     (:float,"-1e4")
-     (:float,"+6.022045e23")
+### Make a simple TSV file parser
+```julia
+using Automa
+
+machine = let
+    name = onexit!(onenter!(re"[^\t\r\n]+", :mark), :name)
+    field = onexit!(onenter!(re"[^\t\r\n]+", :mark), :field)
+    nameline = name * rep('\t' * name)
+    record = onexit!(field * rep('\t' * field), :record)
+    compile(nameline * re"\r?\n" * record * rep(re"\r?\n" * record) * rep(re"\r?\n"))
+end

-    julia> status
-    :ok
+actions = Dict(
+    :mark => :(pos = p),
+    :name => :(push!(headers, String(data[pos:p-1]))),
+    :field => quote
+        n_fields += 1
+        push!(fields, String(data[pos:p-1]))
+    end,
+    :record => quote
+        n_fields == length(headers) || error("Malformed TSV")
+        n_fields = 0
+    end
+)

-The compiled deterministic finite automaton (DFA) looks like this:
-![DFA](/docs/src/figure/numbers.png)
+@eval function parse_tsv(data)
+    headers = String[]
+    fields = String[]
+    pos = n_fields = 0
+    $(generate_code(machine, actions))
+    (headers, reshape(fields, length(headers), :))
+end

-For more details, see [fasta.jl](/example/fasta.jl) and read the docs page.
+header, data = parse_tsv("a\tabc\n12\t13\r\nxyc\tz\n\n")
+```
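Reviewer note: the rewritten README describes Automa as compiling regex "to Julia code in the form of `Expr` objects". A minimal sketch of that workflow follows, using only the calls that appear in the new README (`generate_buffer_validator` and the `re"..."` string macro). It assumes the Automa version from this PR is installed; the function name `validate_hex` and the sample inputs are made up for illustration.

```julia
using Automa  # assumption: Automa v1 (this PR) is available in the environment

# The generator returns code rather than a function; nothing is defined yet.
code = generate_buffer_validator(:validate_hex, re"[0-9A-Fa-f]+")
println(typeof(code))

# Evaluating the generated code defines `validate_hex` in the current module.
eval(code)

# As in the README example, a return value of `nothing` means the input matched.
for s in ["BEEF", "0x12"]
    println("$s is hex? $(isnothing(validate_hex(s)))")
end
```

Keeping generation and evaluation as separate steps is what allows the README examples to splice generated code into a hand-written function body with `@eval` and `$(...)`.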
59 changes: 21 additions & 38 deletions benchmark/runbenchmarks.jl
@@ -1,5 +1,4 @@
-import Automa
-import Automa.RegExp: @re_str
+using Automa
 using BenchmarkTools
 using Random: seed!

@@ -27,25 +26,21 @@ println("PCRE: ", @benchmark match(data))

 machine = Automa.compile(re"([A-z]*\r?\n)*")
 VISUALIZE && writesvg("case1", machine)
-context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
+context = Automa.CodeGenContext()
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
 println("Automa.jl: ", @benchmark match(data))

-context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
+context = Automa.CodeGenContext(generator=:goto)
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
-println("Automa.jl (unrolled): ", @benchmark match(data))
+println("Automa.jl (goto): ", @benchmark match(data))


 # Case 2
@@ -59,25 +54,21 @@ println("PCRE: ", @benchmark match(data))

 machine = Automa.compile(re"([A-Za-z]*\r?\n)*")
 VISUALIZE && writesvg("case2", machine)
-context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
+context = Automa.CodeGenContext()
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
 println("Automa.jl: ", @benchmark match(data))

-context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
+context = Automa.CodeGenContext(generator=:goto)
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
-println("Automa.jl (unrolled): ", @benchmark match(data))
+println("Automa.jl (goto): ", @benchmark match(data))


 # Case 3
@@ -91,25 +82,21 @@ println("PCRE: ", @benchmark match(data))

 machine = Automa.compile(re"([ACGTacgt]*\r?\n)*")
 VISUALIZE && writesvg("case3", machine)
-context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
+context = Automa.CodeGenContext()
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
 println("Automa.jl: ", @benchmark match(data))

-context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
+context = Automa.CodeGenContext(generator=:goto)
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
-println("Automa.jl (unrolled): ", @benchmark match(data))
+println("Automa.jl (goto): ", @benchmark match(data))


 # Case 4
@@ -123,22 +110,18 @@ println("PCRE: ", @benchmark match(data))

 machine = Automa.compile(re"([A-Za-z\*-]*\r?\n)*")
 VISUALIZE && writesvg("case4", machine)
-context = Automa.CodeGenContext(generator=:goto, checkbounds=false)
+context = Automa.CodeGenContext()
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
 println("Automa.jl: ", @benchmark match(data))

-context = Automa.CodeGenContext(generator=:goto, checkbounds=false, loopunroll=10)
+context = Automa.CodeGenContext(generator=:goto)
 @eval function match(data)
-    $(Automa.generate_init_code(context, machine))
-    p_end = p_eof = lastindex(data)
-    $(Automa.generate_exec_code(context, machine))
+    $(Automa.generate_code(context, machine))
     return cs == 0
 end
 @assert match(data)
-println("Automa.jl (unrolled): ", @benchmark match(data))
+println("Automa.jl (goto): ", @benchmark match(data))
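Reviewer note: benchmark cases 1 and 2 differ only in the character class, `[A-z]` versus `[A-Za-z]`. The first is one contiguous range and so also accepts the punctuation that sits between `Z` and `a` in ASCII, which is presumably why the two are benchmarked separately. The check below is plain Julia, independent of Automa, and the variable name `extra` is only illustrative.

```julia
# Characters inside the single range 'A':'z' that are not ASCII letters.
extra = [c for c in 'A':'z' if !(c in 'A':'Z' || c in 'a':'z')]

println("[A-z] spans ", length('A':'z'), " characters")
println("non-letters accepted by [A-z]: ", extra)
```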