Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing the parser to a context-free grammar approach #168

Open
BlueDrink9 opened this issue Oct 20, 2022 · 4 comments
Open

Changing the parser to a context-free grammar approach #168

BlueDrink9 opened this issue Oct 20, 2022 · 4 comments

Comments

@BlueDrink9
Copy link

Opening a new issue to further discuss the comment in #164

I am planning to write a formal grammar for the config, but haven't taken the time to do it. Once this is done, writing the config parsing code using a separate library will be easier. Quite a few of them exist for Rust: pest and rust-peg seemed interesting, but a lot of alternatives also exist.

I've took a look at a few rust parsers, but hadn't seen either of those. My personal preference is to use parsers that use separate grammar files with dedicated specific syntax. In this case, pest looks like the best option by far. The reason I prefer this: it separates the concerns of defining a grammar from using the grammar in Rust - the grammar is more language-agnostic, and does not have to be concerned with computer syntax elements. You can see the different with rust-peg, which defines grammars in-line. Dedicated syntaxes are therefore much much more readable, which helps a lot when it comes to maintaining them. It does mean having to learn a new syntax, but my experience has been that they are very simple to learn, especially if you are used to regex.

Separating defining the structure from transforming it into a result is also probably more useful for a config parser like this, I think, because of how we don't necessarily want to do anything with the parsed result right away.

Example of a pest grammar (separate, dedicated PEG syntax):

num = @{ int ~ ("." ~ ASCII_DIGIT*)? ~ (^"e" ~ int)? }
    int = { ("+" | "-")? ~ ASCII_DIGIT+ }

operation = _{ add | subtract | multiply | divide | power }
    add      = { "+" }
    subtract = { "-" }
    multiply = { "*" }
    divide   = { "/" }
    power    = { "^" }

expr = { term ~ (operation ~ term)* }
term = _{ num | "(" ~ expr ~ ")" }

calculation = _{ SOI ~ expr ~ EOI }
WHITESPACE = _{ " " | "\t" }

Similar grammar in rust-peg (mixes grammar and resolution)

peg::parser!( grammar arithmetic() for str {
    pub rule expression() -> i64
        = sum()

    rule sum() -> i64
        = l:product() "+" r:product() { l+r }
        / product()

    rule product() -> i64
        = l:atom() "*" r:atom() { l*r }
        / atom()

    rule atom() -> i64
        = number()
        / "(" v:sum() ")" { v }

    rule number() -> i64
        = n:$(['0'..='9']+) { n.parse().unwrap() }
});

Admittedly, I don't know any rust so of course the second one is going to look more confusing. What I firmly believe though is that removing the burden of boilerplate and rust necessities from the grammar will make the grammar easier to work with.

As proof of how easy it is, here's a rough sketch of the swhkd/sxhkd syntax's core, from the top of my head:

main = {
    SOI ~ 
    content*
    ~ EOI
}
content = {
    | comment
    | binding
    | newline *
    }

// _ means it gets ignored in the parser
comment = _{ "#" ~ CHARACTER* }
// We can ignore whitespace in general. This is a special rule which is automatically inserted between every expression - if it is here, whitespace will always be ignored.
WHITESPACE = _{ " " | "\t" }

// Accept at least one modifier, plus a number of modifiers, plus a key
binding = { modifier ~ ("+" ~ modifier)* ~ "+" ~ key ~ comment? ~ NEWLINE ~
    // The ${} syntax ensures that whitespace is not ignored, unlike usual.
    ${WHITESPACE ~ command}
}

// Caret makes strings case insensitive
modifier = {
    ^"super" 
    | ^"ctrl" 
    | ^"alt" 
    | ^"shift"
}

// Probably want to define the "key" rule in another file/import from a wayland library or however swhkd does it currently, and merge that rule with the rest of the grammar in rust rather than define it here. Alternatively - accept any string, then validate it in rust.
key = { ^<list of keys> }

// Parsing the command should be the shell's responsibility, so just send everything that isn't an unescaped newline to the shell.
command = { not_newline* }
// Uses a negative lookahead: if the following is not a newline, consume a character.
not_newline = {!unescaped_newline ~ ANY}
// Match a newline that has no escape char before it - but an escape char literal is fine.
newline = { !unescaped_escape_char "\n"
escape_char = {"\"}
// positive lookahead - match both or none.
escape_char_literal = {&escape_char ~ escape_char}
unescaped_escape_char = {!escape_char escape_char}

The above grammar should match

super + shift + enter # enter = return
    kitty

# file manager
super + shift + f
    pcmanfm

# web-browser
super + w
    firefox
@Shinyzenith
Copy link
Member

I didn't write any of the parser and lexer so I'll just go ahead and ping the ones who did: @EdenQwQ @angelofallars

@Shinyzenith Shinyzenith pinned this issue Dec 19, 2022
@Shinyzenith Shinyzenith unpinned this issue Dec 24, 2022
@woojiq
Copy link

woojiq commented Feb 24, 2024

@Shinyzenith
Hello. I found this project idea as part of GSoC. I wonder what this means:

this idea plans to introduce a new parser with robust DSL rules by defining a formal grammar.
...
Create a new parser with proper parsing techniques

We need to create an EBNF grammar, that's clear. What about parser itself, should we write the parser manually (for example use recursive descent parser algo) or use existing tools for this, like pest, and not to reinvent the wheel?

@Shinyzenith
Copy link
Member

Shinyzenith commented Feb 24, 2024

@woojiq hi, I think it's best if we define a grammar and delegate parsing responsibilities to some other parser crate to not reinvent the wheel. However, if you can find some concrete reasons to not do so then feel free to explore such possibilities too.

Cc: @lavafroth

@BlueDrink9
Copy link
Author

BlueDrink9 commented Feb 24, 2024

I took a brief survey back when I thought I ha time for this, and found the existing options pretty good. The two i reference in the first post seemed sufficient to me. I doubt implementing a new one would be necessary (although it would be a great exercise!).

My other thoughts about this issue:

  • To handle {} multiple mapping syntax well, I struggled to define a CFG way. What i thought would be a better approach is to do multiple parsing approaches, where the first one only matches braces and spits out all the resulting mappings.
  • Is it worth writing a grammar that handles an entire file? Or do you separate each definition first and then parse them individually?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants