Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking changes for v1 #119

Merged
merged 64 commits into from
Jul 19, 2023
Merged

Breaking changes for v1 #119

merged 64 commits into from
Jul 19, 2023

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Mar 9, 2023

Supersedes #95

Todo

  • Fixup the commit history (review if fixes are necessary)
  • Get test coverage near 100%
  • Figure out if you can get CI on M1 mac

Closes #52
Closes #71
Closes #80
Closes #82
Closes #91
Closes #102
Closes #111
Closes #115
Closes #116

@codecov
Copy link

codecov bot commented Mar 9, 2023

Codecov Report

Patch coverage: 94.75% and project coverage change: +6.50 🎉

Comparison is base (3b5c913) 89.48% compared to head (fe0feb1) 95.98%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
+ Coverage   89.48%   95.98%   +6.50%     
==========================================
  Files          14       16       +2     
  Lines        1683     1793     +110     
==========================================
+ Hits         1506     1721     +215     
+ Misses        177       72     -105     
Flag Coverage Δ
unittests 95.98% <94.75%> (+6.50%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/action.jl 97.43% <ø> (ø)
src/dot.jl 95.78% <ø> (ø)
src/precond.jl 84.09% <ø> (ø)
src/codegen.jl 91.69% <87.50%> (+0.10%) ⬆️
src/workload.jl 89.47% <89.47%> (ø)
src/re.jl 93.40% <95.74%> (+0.73%) ⬆️
src/tokenizer.jl 96.92% <96.72%> (+4.87%) ⬆️
src/stream.jl 96.92% <96.92%> (ø)
src/Automa.jl 100.00% <100.00%> (ø)
src/byteset.jl 100.00% <100.00%> (ø)
... and 4 more

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@jakobnissen jakobnissen force-pushed the v1 branch 2 times, most recently from 2999123 to fedab05 Compare March 9, 2023 08:51
@kescobo
Copy link
Member

kescobo commented Apr 22, 2023

FYI, it is still on my radar to kick the tires of this. I think early next month is probably realistic.

@kescobo
Copy link
Member

kescobo commented Jul 5, 2023

I think early next month is probably realistic.

I was wroooong 😭

Do you have a timeline for when you want this merged. Realistically, I need some pressure to make it a priority

@jakobnissen
Copy link
Member Author

Haha don't I know that feeling. I also expected I would be done with this branch in the summer of 2022, so... 😅
My holidays start today, so I expect I can finish the work before the holidays are over in about 3 weeks. Maybe. No promises. Although if you want a deadline, then July 17th is probably when I'll pick this work up and finish it.

@jakobnissen jakobnissen force-pushed the v1 branch 2 times, most recently from 98283df to 9187d9e Compare July 19, 2023 05:52
Honestly, four generators and a tokenizer was excessive.
Inline generator was probably not really used, and the new simd generator made
the goto generator obsolete.
Remove inline and old goto generator, and rename simd generator goto.
Before this PR, a user could forget or mis-spell a symbol in the action Dict
passed to generate_exec_code. Now add a check to throw an error if this happens.
nfa2dfa now errors if any RE object has a .actions field with an unsupported
key. This prevents a user from mistyping e.g. `pat.actions[:etner] = [:foo]`
and having it silently do nothing.
The user probably wants to use the input error code in most use cases, except
when running a machine in execute mode, debug mode, or running a validator.
Adding the input error code to generated Readers is a challenge for another day
One of the issues with using Automa is that its lack of exports makes it unclear
what is internal/external. Also, the three typical using-statements needed to
use Automa is just visual noise.
If two edges in equivalent paths have distinct preconditions, these could be
used to distinguish the two edges, and so they should not provoke an error.
The gensym'd symbols in Automa's constant expressions are computed on
precompilation of Automa.
These can then clash with symbols that are computed on the precompilation of
any particular generated code in downstream packages leading to very confusing
bugs.
With the :goto generator, checking bounds is not permitted anyway.
And the table generator is so slow that it makes no sense to disable checking
of bounds - then you might as well use the goto generator
Many potential users of Automa are not interested in parsing from IOs, but only
buffers. For those users, the IO-parsing functionality of Automa is not needed,
and so there is no need for dependency on TranscodingStreams.
NFAs with ambiguities often contain multiple ambiguities. Displaying the
simplest ambiguity when erroring makes debugging easier - especially compared
to when the shown ambiguity can never happen due to another ambiguity.
An oversight in the ambiguity check meant that actions placed on non-epsilon
edges were accidentally not included in the paths for validation.
MWE: `compile(onfinal!(re"a", :a) | onfinal!(re"a", :b))`

This breaks tokenizers, so we manually skip ambiguity check in tokenizers.
In the case of conflicting actions in tokenizers, this will cause the longest
matching token to be emitted.
The tokenizer has a completely new design and API.
* It's now much easier to use
* It's now lazy by default
* It's much faster, although not completely optimised. Its API is amenable to
  further optimisation
* It handles errors automatically

See issue #116
Users should not have access to the module directly. Instead, export the RE
struct, and also allow users to construct regex with `RE(str)`.
Instead of buffering an entire line, simply keep track of the number of columns
cleared from the buffer. This reaches some more into TranscodingStreams privates,
but it's well tested.
Currently, the functions `re2nfa` and `nfa2dfa` can produce dead (unreachable)
nodes, which is pointless. Instead of relying on the user to themselves remove
dead nodes by calling `remove_dead_nodes`, this should just happen automatically.
Since Automa doesn't just extend an existing function but adds a new function
based on TranscodingStreams, this is not the intended use case of extensions.
I've come to the conclusion that Julia does not make it possible to robustly
check what CPU instructions the user has available.
The current options are all undocumented, complex and brittle, and not suitable
for code that cannot be accepted to break at any time

Whenever a robust way of checking for CPU instructions are available, the change
is easy to revert.
Allow preconditions to be set to `:enter` only, and have directly conflicting
preconditions resolve an ambiguous NFA.
@jakobnissen
Copy link
Member Author

Ok I'm going to merge this now and release Automa v1. @kescobo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment