
can a callback return/append different tokens? #433

Closed
spearphishing opened this issue Oct 25, 2024 · 4 comments
Labels
question Further information is requested

Comments

@spearphishing

I am attempting to parse Lua type comments with the code below:

use logos::{Lexer, Logos};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
pub enum Token {
    // other tokens ...
    #[regex(r"--[^\[].*")]
    InlineComment,

    #[regex(r"--\[[=]*\[", parse_long_bracket)]
    LongBracketComment(String),

    #[regex(r#""([^"\\]*(\\.)?)*""#)]
    #[regex(r#"'([^'\\]*(\\.)?)*'"#)]
    QuotedString,

    // no pattern: intended to be produced for malformed comments
    BrokenComment,
}

fn parse_long_bracket(lex: &mut Lexer<Token>) -> Option<String> {
    let slice = lex.slice();

    if let Some(opening) = slice.strip_prefix("--[") {
        let mut equals_count = 0;
        let mut content = String::new();

        for ch in opening.chars() {
            match ch {
                '=' => equals_count += 1,
                '[' => break,
                _ => return None, // broken comment
            }
        }

        let closing_delimiter = format!("]{}]", "=".repeat(equals_count));

        while let Some(next_char) = lex.remainder().chars().next() {
            // advance by the full char width so multi-byte characters are not split
            lex.bump(next_char.len_utf8());
            content.push(next_char);

            if content.ends_with(&closing_delimiter) {
                content.truncate(content.len() - closing_delimiter.len());
                return Some(content);
            }
        }
    }

    // if this code is reached, the comment is broken
    None
}

fn main() {
    let source = r#"
        --[[
            Multi
            Line
            LongBracketComment
        ]]


        --[===[
            "Balanced"
            Multi
            Line
            LongBracketComment
        ]===]

        --[===[
            "UN-Balanced"
            Multi
            Line
            BrokenComment
        ]==]
    "#;

    let mut lex = Token::lexer(source);

    while let Some(token) = lex.next() {
        println!("{:?} | {}", token, lex.slice());
    }
}

With this code, the following tokens are produced:

Ok(LongBracketComment("\n            Multi\n            Line\n            LongBracketComment\n        "))
Ok(LongBracketComment("\n            \"Balanced\"\n            Multi\n            Line\n            LongBracketComment\n        "))
Err(()) <------ Ok(BrokenComment) should be returned here

What I'm aiming for is to be able to return the BrokenComment token from within the parse_long_bracket callback somehow. Is this possible?

Or maybe I'm doing something wrong here, as this is my first time doing any sort of lexical analysis.
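For what it's worth, the delimiter-matching logic in question can be sketched independently of logos as a plain function. This is only an illustration; `long_bracket_body` is a hypothetical helper name, not part of any crate:

```rust
// Hypothetical helper, independent of logos: given the text that follows
// the leading "--[" of a long-bracket comment, return the comment body if
// a matching closing delimiter with the same number of '=' signs exists.
fn long_bracket_body(after_prefix: &str) -> Option<&str> {
    // count the '=' signs in the opening delimiter, e.g. "===[" -> 3
    let equals_count = after_prefix.chars().take_while(|&c| c == '=').count();
    let rest = &after_prefix[equals_count..];
    // the opening delimiter must finish with a second '['
    let body = rest.strip_prefix('[')?;
    let closer = format!("]{}]", "=".repeat(equals_count));
    // a broken comment has no matching closer, so `find` returns None
    body.find(&closer).map(|end| &body[..end])
}
```

For example, `long_bracket_body("===[text]===]")` yields `Some("text")`, while the unbalanced `long_bracket_body("===[text]==]")` yields `None` — the case that currently surfaces as `Err(())`.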

@jeertmans jeertmans added the question Further information is requested label Oct 25, 2024
@spearphishing
Author

I ended up using a secondary enum instead of different token types.

@jeertmans
Collaborator

Hi @spearphishing, could you please share your solution with us? :-)

@spearphishing
Author

Instead of using an extra BrokenComment token type, I created a secondary enum like so:

#[derive(Debug, PartialEq)]
pub enum MultiLineTokenType {
    Valid,
    Broken,
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
pub enum Token {
    // ...

    #[regex(r"--\[[=]*\[", |lex| parse_multi_line_token(lex, "--["))]
    LongBracketComment(MultiLineTokenType)
}

// this function is also used for parsing multi-line strings
// multi-line strings behave the same way as comments, just with a different prefix
fn parse_multi_line_token(lex: &mut Lexer<Token>, prefix: &str) -> MultiLineTokenType {
    let slice = lex.slice();

    if let Some(opening) = slice.strip_prefix(prefix) {
        let mut equals_count = 0;

        for ch in opening.chars() {
            match ch {
                '=' => equals_count += 1,
                '[' => break,
                _ => return MultiLineTokenType::Broken,
            }
        }
        let closing_delimiter = format!("]{}]", "=".repeat(equals_count));

        while !lex.remainder().starts_with(&closing_delimiter) {
            match lex.remainder().chars().next() {
                // advance past the whole char so multi-byte input can't split a boundary
                Some(next_char) => lex.bump(next_char.len_utf8()),
                None => return MultiLineTokenType::Broken,
            }
        }

        lex.bump(closing_delimiter.len());
        return MultiLineTokenType::Valid;
    }

    MultiLineTokenType::Broken
}

Hopefully this can help others in the situation I was in. ❤️
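As a footnote, the scan loop inside parse_multi_line_token can be exercised without a lexer at all. This standalone sketch (hypothetical names, assuming the remainder text and equals count are already extracted) mirrors it, stepping by whole characters so multi-byte input never splits a char boundary:

```rust
#[derive(Debug, PartialEq)]
enum MultiLineTokenType {
    Valid,
    Broken,
}

// Hypothetical standalone version of the bump loop: walk `remainder`
// until the closing delimiter leads, and report how many bytes a lexer
// would have consumed (including the closer itself when it is found).
fn classify_long_bracket(remainder: &str, equals_count: usize) -> (MultiLineTokenType, usize) {
    let closer = format!("]{}]", "=".repeat(equals_count));
    let mut consumed = 0;
    while !remainder[consumed..].starts_with(&closer) {
        match remainder[consumed..].chars().next() {
            // step by the char's full byte width, like bump(len_utf8)
            Some(c) => consumed += c.len_utf8(),
            // input ran out before the closer: the comment is broken
            None => return (MultiLineTokenType::Broken, consumed),
        }
    }
    (MultiLineTokenType::Valid, consumed + closer.len())
}
```

For instance, `classify_long_bracket("héllo]==]", 2)` reports `Valid` after consuming 10 bytes (6 for the body, 4 for `]==]`), while input with no closer reports `Broken`.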

@jeertmans
Collaborator

Thanks! Another example solution (though not exactly the same problem) is given in #432.
