Replies: 3 comments 2 replies
-
Hiya! I think the key examples you want are these:
OK, I've probably thrown enough at you to get started. This should hopefully spur some questions!
-
I think this is the biggest issue I want to discuss up-front. Most of my character classes are small, but XID_START and XID_CONTINUE are not. I suspect the right answer is to somehow cheat when I would otherwise match one of those classes. Relatedly, one kind of token I want to offer recognition for is a generalization of one of Rust's token forms. There's also the problem of continuing to explore the DFA after I exit this hand-implemented non-regular state. No idea how to do that.
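For what it's worth, the "explore the DFA by hand" loop is small no matter where the transition table comes from. Below is a toy, std-only sketch (this is *not* regex_automata's actual API) that walks a two-state table byte by byte for an ASCII-identifier token; with regex_automata the table and the start/next/match queries would come from its dense DFA instead, and the real XID_START/XID_CONTINUE classes are of course far larger than this ASCII stand-in:

```rust
// A toy two-state DFA for ASCII identifiers ([A-Za-z_][A-Za-z0-9_]*),
// walked byte by byte in the same shape one would walk a dense DFA:
// take a start state, feed bytes, remember the last position where
// we were in a match state, stop on the dead state.
fn longest_ident(input: &[u8]) -> Option<usize> {
    #[derive(Clone, Copy, PartialEq)]
    enum State { Start, Body, Dead }

    let mut state = State::Start;
    let mut last_match = None;
    for (i, &b) in input.iter().enumerate() {
        state = match (state, b) {
            (State::Start, b'a'..=b'z' | b'A'..=b'Z' | b'_') => State::Body,
            (State::Body, b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_') => State::Body,
            _ => State::Dead,
        };
        if state == State::Dead {
            break; // no outgoing transitions; stop exploring
        }
        last_match = Some(i + 1); // Body is the (only) match state here
    }
    last_match
}

fn main() {
    assert_eq!(longest_ident(b"foo42+bar"), Some(5));
    assert_eq!(longest_ident(b"9lives"), None);
}
```

Remembering the last match position (rather than returning at the first one) is what gives leftmost-longest semantics, which is what a lexer usually wants.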
Is it really the best idea to build an RE string and pass it into regex_syntax? I'm not being sarcastic, I genuinely want to know. I don't want to have to think too hard about quoting (even though regex_syntax does let me solve that problem, kinda). However, looking at the NFA builder was overwhelming and I didn't know where to start. I also don't understand how a string like "[abc]+" maps into transition IDs.
It sounds like this can be swapped out without terribly much pain, so that's a decision I can make later?
-
Andrew, this is extremely useful, and definitely enough for me to start messing around and hurting myself. I'm not sure when I'll get to that, but I will report back once I have concrete problems to debug.
-
Hello!
A month or so ago I chatted with Andrew about a somewhat unusual usage of regex_automata: taking a user-defined definition for a C-like language's tokens and generating a token-tree-yielding lexer from it. Andrew asked me to put together a discussion post so we can figure this out together!
The surface-level API for defining a syntax is described below, as a user would manipulate it. The resulting lexer produces something morally equivalent to proc_macro::TokenStream. I have not published this library yet, so I don't have any code I can point to... but ideally that should not be necessary.
Essentially I am building something similar to what a crate like logos does, but much more opinionated about what it can parse, in order to simplify common cases and provide a simpler API for converting tokens into an AST.
Currently the lexer is implemented (somewhat incorrectly) by building a trie of prefixes that can start a lexeme, and then dealing with some special cases around things like Unicode XIDs.
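For readers unfamiliar with the trie approach being replaced, here is a minimal sketch (names and structure are my own, not the actual library's) of a prefix trie queried for the longest token starting at the current position, which is roughly what a DFA over fixed literals gives you for free:

```rust
use std::collections::HashMap;

// A trie of the byte sequences that can form a lexeme, queried for
// the longest inserted token that prefixes the remaining input.
#[derive(Default)]
struct Trie {
    children: HashMap<u8, Trie>,
    terminal: bool, // true if the path to this node spells a whole token
}

impl Trie {
    fn insert(&mut self, token: &[u8]) {
        let mut node = self;
        for &b in token {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    /// Length of the longest inserted token that prefixes `input`.
    fn longest_match(&self, input: &[u8]) -> Option<usize> {
        let mut node = self;
        let mut best = None;
        for (i, &b) in input.iter().enumerate() {
            match node.children.get(&b) {
                Some(next) => node = next,
                None => break,
            }
            if node.terminal {
                best = Some(i + 1);
            }
        }
        best
    }
}

fn main() {
    let mut trie = Trie::default();
    for tok in [b"+".as_slice(), b"+=", b"::"] {
        trie.insert(tok);
    }
    assert_eq!(trie.longest_match(b"+=1"), Some(2)); // prefers "+=" over "+"
    assert_eq!(trie.longest_match(b":x"), None);     // ":" alone is no token
}
```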
This is... not ideal. Instead, I would like to compile this specification, at runtime, into a DFA that recognizes a single token, and explore that DFA by hand. (The by-hand part is necessary: the token grammar I am parsing is almost entirely regular, but technically context-sensitive in a few places, so I need to implement a very limited form of backreferences.)
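As an illustration of the kind of "very limited backreference" this can involve (assuming the context-sensitive tokens resemble Rust's raw strings, which is a guess on my part): once the regular part of the lexer sees the opening of such a token, a small hand-written scan counts the delimiter and finds the matching close, after which control returns to the DFA:

```rust
// Length of a Rust-style raw string (r, N hashes, '"', body, '"', N hashes)
// at the start of `input`, or None if there is no such token. The matching
// hash count is the non-regular step a plain DFA cannot express.
fn raw_string_len(input: &str) -> Option<usize> {
    let rest = input.strip_prefix('r')?;
    // Count leading '#' bytes: this is the "backreference" we remember.
    let hashes = rest.len() - rest.trim_start_matches('#').len();
    let body = rest[hashes..].strip_prefix('"')?;
    // The close delimiter must repeat exactly the same number of hashes.
    let close = format!("\"{}", "#".repeat(hashes));
    let end = body.find(&close)?;
    // 'r' + hashes + opening '"' + body + closing '"' + hashes
    Some(1 + hashes + 1 + end + close.len())
}

fn main() {
    assert_eq!(raw_string_len("r##\"has \"quotes\"\"##"), Some(19));
    assert_eq!(raw_string_len("r#\"unterminated"), None);
}
```

The open question raised above, how to resume the surrounding DFA after this detour, amounts to advancing the input cursor past the returned length and restarting from the DFA's start state.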
However, I have no desire to implement any of the DFA construction and evaluation algorithms necessary to do this for arbitrary user specifications, so I want to use regex_automata, and learn something about DFAs along the way. I would appreciate some broad pointers at what to look at first. I expect the description I've given to be insufficient, so please ask clarifying questions about what I'm trying to do and I'll do my best to answer them; I don't know what I don't know, so to speak.