Error with overlapping token definitions #420

Open
ccleve opened this issue Sep 13, 2024 · 6 comments
Labels
question Further information is requested

Comments

@ccleve

ccleve commented Sep 13, 2024

I'm getting a strange error when a regex could match the prefix of another regex. Maybe. I just don't know what the problem is. Here's a simplified case:


use logos::Logos; // brings the derive macro and the `lexer()` method into scope

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r".|[\r\n]")] // skip everything not recognized
pub enum LogosToken {
    // any letter except capital Z
    #[regex(r"[a-zA-Y]+", priority = 3)]
    WordExceptZ,

    // any number
    #[regex(r"[0-9]+", priority = 3)]
    Number,

    /*
    This expression is:
    (letter or number)* [Z] (letter or number)*
    In other words, a token with any number of letters or numbers,
    including at least one capital Z.
     */
    #[regex(r"[a-zA-Z0-9]*[Z][a-zA-Z0-9]*", priority = 3)]
    TermWithZ,
}

#[pg_extern]
fn test_logos() {
    let mut lex = LogosToken::lexer("hello 42world fooZfoo");
    while let Some(result) = lex.next() {
        let slice = lex.slice();
        println!("{:?} {:?}", slice, result);
    }
}

This generates:

"hello" Ok(WordExceptZ)
"42world" Err(())
"fooZfoo" Ok(TermWithZ)

If I replace the regex over TermWithZ with #[regex(r"Z", priority = 3)], I get:

"hello" Ok(WordExceptZ)
"42" Ok(Number)
"world" Ok(WordExceptZ)
"foo" Ok(WordExceptZ)
"Z" Ok(TermWithZ)
"foo" Ok(WordExceptZ)

Here "42world" is correctly recognized as a number followed by a word.

What I don't understand is, why does the first TermWithZ regex mess up the recognition of "42world"? It doesn't contain a Z, so TermWithZ should ignore it completely and let the first two variants do their job.

@ccleve
Author

ccleve commented Sep 13, 2024

After looking at this a bit more, I'm guessing I'm running into the "no backtracking" limitation.

Any workaround suggestions are welcome.

@jeertmans
Collaborator

Hello! Can you please run in debugging mode and print out the corresponding graph?

See https://logos.maciej.codes/debugging.html.

@ccleve
Author

ccleve commented Sep 13, 2024

@jeertmans

{
    1: ::<skip> (<skip>),
    2: [80-BF] ⇒ 1,
    3: [A0-BF][80-BF] ⇒ 1,
    4: [80-BF][80-BF] ⇒ 1,
    5: [80-9F][80-BF] ⇒ 1,
    6: [90-BF][80-BF][80-BF] ⇒ 1,
    7: [80-BF][80-BF][80-BF] ⇒ 1,
    8: [80-8F][80-BF][80-BF] ⇒ 1,
    10: ::WordExceptZ,
    13: ::Number,
    16: ::TermWithZ,
    17: {
        [0-9] ⇒ 17,
        [A-Z] ⇒ 17,
        [a-z] ⇒ 17,
        _ ⇒ 16,
    },
    18: Z ⇒ 17,
    19: {
        [0-9] ⇒ 19,
        [A-Z] ⇒ 19,
        [a-z] ⇒ 19,
        _ ⇒ 18,
    },
    22: {
        [0-9] ⇒ 24,
        [A-Y] ⇒ 19,
        Z ⇒ 25,
        [a-z] ⇒ 19,
        _ ⇒ 13,
    },
    24: {
        [0-9] ⇒ 24,
        [A-Y] ⇒ 19,
        Z ⇒ 25,
        [a-z] ⇒ 19,
        _ ⇒ 13,
    },
    25: {
        [0-9] ⇒ 25,
        [A-Z] ⇒ 25,
        [a-z] ⇒ 25,
        _ ⇒ 16,
    },
    27: {
        [0-9] ⇒ 19,
        [A-Y] ⇒ 29,
        Z ⇒ 25,
        [a-z] ⇒ 29,
        _ ⇒ 10,
    },
    29: {
        [0-9] ⇒ 19,
        [A-Y] ⇒ 29,
        Z ⇒ 25,
        [a-z] ⇒ 29,
        _ ⇒ 10,
    },
    32: {
        [0-9] ⇒ 25,
        [A-Z] ⇒ 25,
        [a-z] ⇒ 25,
        _ ⇒ 16,
    },
    33: {
        [00-/] ⇒ 1,
        [0-9] ⇒ 22,
        [:-@] ⇒ 1,
        [A-Y] ⇒ 27,
        Z ⇒ 32,
        [[-`] ⇒ 1,
        [a-z] ⇒ 27,
        [{-7F] ⇒ 1,
        [C2-DF] ⇒ 2,
        [E0] ⇒ 3,
        [E1-EC] ⇒ 4,
        [ED] ⇒ 5,
        [EE-EF] ⇒ 4,
        [F0] ⇒ 6,
        [F1-F3] ⇒ 7,
        [F4] ⇒ 8,
    },
}
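
Tracing "42world" through the dump above by hand suggests where the error comes from. Node 33 sends the first digit to node 22, the second digit moves to node 24, and the letter `w` then jumps to node 19. Node 19 consumes every remaining letter or digit and, on a miss, falls through to node 18, which only accepts a literal `Z`. The Number accept at node 13 that was available after "42" is never revisited. The following is a minimal sketch of that walk in plain Rust, hand-copied from the dump. The names `Outcome` and `run` are made up for illustration, the `<skip>` branches are not modelled, and this is not how Logos generates code.

```rust
/// Result of walking the hand-copied graph fragment (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    Number,
    WordExceptZ,
    TermWithZ,
    Error,
}

/// Walks the graph with no backtracking and returns the outcome
/// together with the number of bytes consumed.
fn run(input: &[u8]) -> (Outcome, usize) {
    let mut state = 33u8; // entry node in the dump
    let mut pos = 0usize;
    loop {
        let byte = input.get(pos).copied();
        let next = match (state, byte) {
            // Node 33: entry point (skip branches are left out of this sketch).
            (33, Some(b'0'..=b'9')) => 22,
            (33, Some(b'Z')) => 32,
            (33, Some(c)) if c.is_ascii_alphabetic() => 27,
            (33, _) => return (Outcome::Error, pos),
            // Nodes 22 and 24: only digits so far; a miss accepts Number (node 13).
            (22 | 24, Some(b'0'..=b'9')) => 24,
            (22 | 24, Some(b'Z')) => 25,
            (22 | 24, Some(c)) if c.is_ascii_alphabetic() => 19,
            (22 | 24, _) => return (Outcome::Number, pos),
            // Nodes 27 and 29: only non-Z letters so far; a miss accepts WordExceptZ (node 10).
            (27 | 29, Some(b'0'..=b'9')) => 19,
            (27 | 29, Some(b'Z')) => 25,
            (27 | 29, Some(c)) if c.is_ascii_alphabetic() => 29,
            (27 | 29, _) => return (Outcome::WordExceptZ, pos),
            // Node 19: loops on any letter or digit, including Z. A miss goes to
            // node 18, which needs a Z, but the byte that missed is not
            // alphanumeric, so node 18 always fails when reached this way.
            (19, Some(c)) if c.is_ascii_alphanumeric() => 19,
            (19, _) => return (Outcome::Error, pos),
            // Nodes 25 and 32: a Z has been seen; a miss accepts TermWithZ (node 16).
            (25 | 32, Some(c)) if c.is_ascii_alphanumeric() => 25,
            (25 | 32, _) => return (Outcome::TermWithZ, pos),
            _ => unreachable!("node not present in this fragment"),
        };
        state = next;
        pos += 1;
    }
}
```

Under these assumptions, `run(b"42world")` ends in `Outcome::Error` after consuming all 7 bytes, which lines up with the `"42world" Err(())` line in the original report, while `run(b"42")` stops at the Number accept.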

@jeertmans
Collaborator

Hmm, I think you are right that the error may come from the no-backtracking limitation, but a perfect Logos implementation shouldn't have that issue.

First question: are you using Logos >=0.14.0? It may have fixed some issues.

Second: did you try not setting any priority? Here you set the priority of every token to 3, which doesn't make much sense: the priority is only used when two or more patterns match the same slice, to choose between them. If the number is the same for all of them, it doesn't help. So please try without any priority, and then edit only one priority at a time.

Last, having patterns embedded in others, like TermWithZ containing both Word and Number, is often a source of backtracking issues.

jeertmans added the "question" (Further information is requested) label Sep 15, 2024
@ccleve
Author

ccleve commented Sep 15, 2024 via email

@jeertmans
Collaborator

> Yes, using 0.14.1. I had to set the priorities

Ok perfect.

> 3 because of the skip expression, #[logos(skip r".|[\r\n]")] The "." can match anything, so there was a conflict. Other than that, the priorities can all be the same.

Ok, seems legit, I wasn't aware of that.

> I'm not sure I understand the third point. How else would you do it?

Usually, you can break your logic down into unique, non-overlapping tokens, and then use callbacks and extras to handle more complex logic. Unfortunately, I don't have enough time to dig into this problem and really understand the root causes of why it doesn't work :-/
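
One possible reading of the non-overlapping suggestion, sketched here as an assumption rather than as the maintainers' recommended approach: keep the lexer to the three patterns from the second experiment in the original post (`[a-zA-Y]+`, `[0-9]+`, and `Z`), then merge touching pieces into a single TermWithZ in a separate pass whenever the run contains a `Z`. The sketch below is plain Rust with no Logos dependency and uses a post-processing pass instead of callbacks or extras. `Kind`, `merge_terms`, and the `(Kind, Range<usize>)` piece representation are hypothetical names chosen for illustration, with the ranges standing in for the byte spans a lexer would report.

```rust
use std::ops::Range;

/// Piece kinds; `Z` is the bare capital-Z piece, `TermWithZ` only appears after merging.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    WordExceptZ,
    Number,
    Z,
    TermWithZ,
}

/// Merges each run of touching pieces into one `TermWithZ` when the run
/// contains a `Z` piece; runs without a `Z` pass through unchanged.
fn merge_terms(pieces: &[(Kind, Range<usize>)]) -> Vec<(Kind, Range<usize>)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < pieces.len() {
        // Extend the run while each piece starts exactly where the previous one ended.
        let mut j = i + 1;
        while j < pieces.len() && pieces[j].1.start == pieces[j - 1].1.end {
            j += 1;
        }
        let run = &pieces[i..j];
        if run.iter().any(|(kind, _)| *kind == Kind::Z) {
            // Collapse the whole run into a single span.
            let start = run[0].1.start;
            let end = run[run.len() - 1].1.end;
            out.push((Kind::TermWithZ, start..end));
        } else {
            out.extend(run.iter().cloned());
        }
        i = j;
    }
    out
}
```

Fed the six pieces from the second experiment on "hello 42world fooZfoo", with byte spans 0..5, 6..8, 8..13, 14..17, 17..18, and 18..21, this pass leaves "hello", "42", and "world" as they are and folds the last three pieces into one TermWithZ spanning 14..21.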
