Changing the parser to a context-free grammar approach #168

BlueDrink9 · 2022-10-20T09:46:21Z

Opening a new issue to further discuss the comment in #164

I am planning to write a formal grammar for the config, but haven't taken the time to do it. Once this is done, writing the config parsing code using a separate library will be easier. Quite a few of them exist for Rust: pest and rust-peg seemed interesting, but a lot of alternatives also exist.

I've took a look at a few rust parsers, but hadn't seen either of those. My personal preference is to use parsers that use separate grammar files with dedicated specific syntax. In this case, pest looks like the best option by far. The reason I prefer this: it separates the concerns of defining a grammar from using the grammar in Rust - the grammar is more language-agnostic, and does not have to be concerned with computer syntax elements. You can see the different with rust-peg, which defines grammars in-line. Dedicated syntaxes are therefore much much more readable, which helps a lot when it comes to maintaining them. It does mean having to learn a new syntax, but my experience has been that they are very simple to learn, especially if you are used to regex.

Separating defining the structure from transforming it into a result is also probably more useful for a config parser like this, I think, because of how we don't necessarily want to do anything with the parsed result right away.

Example of a pest grammar (separate, dedicated PEG syntax):

num = @{ int ~ ("." ~ ASCII_DIGIT*)? ~ (^"e" ~ int)? }
    int = { ("+" | "-")? ~ ASCII_DIGIT+ }

operation = _{ add | subtract | multiply | divide | power }
    add      = { "+" }
    subtract = { "-" }
    multiply = { "*" }
    divide   = { "/" }
    power    = { "^" }

expr = { term ~ (operation ~ term)* }
term = _{ num | "(" ~ expr ~ ")" }

calculation = _{ SOI ~ expr ~ EOI }
WHITESPACE = _{ " " | "\t" }

Similar grammar in rust-peg (mixes grammar and resolution)

peg::parser!( grammar arithmetic() for str {
    pub rule expression() -> i64
        = sum()

    rule sum() -> i64
        = l:product() "+" r:product() { l+r }
        / product()

    rule product() -> i64
        = l:atom() "*" r:atom() { l*r }
        / atom()

    rule atom() -> i64
        = number()
        / "(" v:sum() ")" { v }

    rule number() -> i64
        = n:$(['0'..='9']+) { n.parse().unwrap() }
});

Admittedly, I don't know any rust so of course the second one is going to look more confusing. What I firmly believe though is that removing the burden of boilerplate and rust necessities from the grammar will make the grammar easier to work with.

As proof of how easy it is, here's a rough sketch of the swhkd/sxhkd syntax's core, from the top of my head:

main = {
    SOI ~ 
    content*
    ~ EOI
}
content = {
    | comment
    | binding
    | newline *
    }

// _ means it gets ignored in the parser
comment = _{ "#" ~ CHARACTER* }
// We can ignore whitespace in general. This is a special rule which is automatically inserted between every expression - if it is here, whitespace will always be ignored.
WHITESPACE = _{ " " | "\t" }

// Accept at least one modifier, plus a number of modifiers, plus a key
binding = { modifier ~ ("+" ~ modifier)* ~ "+" ~ key ~ comment? ~ NEWLINE ~
    // The ${} syntax ensures that whitespace is not ignored, unlike usual.
    ${WHITESPACE ~ command}
}

// Caret makes strings case insensitive
modifier = {
    ^"super" 
    | ^"ctrl" 
    | ^"alt" 
    | ^"shift"
}

// Probably want to define the "key" rule in another file/import from a wayland library or however swhkd does it currently, and merge that rule with the rest of the grammar in rust rather than define it here. Alternatively - accept any string, then validate it in rust.
key = { ^<list of keys> }

// Parsing the command should be the shell's responsibility, so just send everything that isn't an unescaped newline to the shell.
command = { not_newline* }
// Uses a negative lookahead: if the following is not a newline, consume a character.
not_newline = {!unescaped_newline ~ ANY}
// Match a newline that has no escape char before it - but an escape char literal is fine.
newline = { !unescaped_escape_char "\n"
escape_char = {"\"}
// positive lookahead - match both or none.
escape_char_literal = {&escape_char ~ escape_char}
unescaped_escape_char = {!escape_char escape_char}

The above grammar should match

super + shift + enter # enter = return
    kitty

# file manager
super + shift + f
    pcmanfm

# web-browser
super + w
    firefox

The text was updated successfully, but these errors were encountered:

Shinyzenith · 2022-10-26T19:11:19Z

I didn't write any of the parser and lexer so I'll just go ahead and ping the ones who did: @EdenQwQ @angelofallars

woojiq · 2024-02-24T09:25:26Z

@Shinyzenith
Hello. I found this project idea as part of GSoC. I wonder what this means:

this idea plans to introduce a new parser with robust DSL rules by defining a formal grammar.
...
Create a new parser with proper parsing techniques

We need to create an EBNF grammar, that's clear. What about parser itself, should we write the parser manually (for example use recursive descent parser algo) or use existing tools for this, like pest, and not to reinvent the wheel?

Shinyzenith · 2024-02-24T12:09:30Z

@woojiq hi, I think it's best if we define a grammar and delegate parsing responsibilities to some other parser crate to not reinvent the wheel. However, if you can find some concrete reasons to not do so then feel free to explore such possibilities too.

Cc: @lavafroth

BlueDrink9 · 2024-02-24T17:59:12Z

I took a brief survey back when I thought I ha time for this, and found the existing options pretty good. The two i reference in the first post seemed sufficient to me. I doubt implementing a new one would be necessary (although it would be a great exercise!).

My other thoughts about this issue:

To handle {} multiple mapping syntax well, I struggled to define a CFG way. What i thought would be a better approach is to do multiple parsing approaches, where the first one only matches braces and spits out all the resulting mappings.
Is it worth writing a grammar that handles an entire file? Or do you separate each definition first and then parse them individually?

BlueDrink9 mentioned this issue Oct 20, 2022

Feature: one-off /temporary/single press mode #164

Open

Shinyzenith pinned this issue Dec 19, 2022

Shinyzenith unpinned this issue Dec 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changing the parser to a context-free grammar approach #168

Changing the parser to a context-free grammar approach #168

BlueDrink9 commented Oct 20, 2022

Shinyzenith commented Oct 26, 2022

woojiq commented Feb 24, 2024

Shinyzenith commented Feb 24, 2024 •

edited

Loading

BlueDrink9 commented Feb 24, 2024 •

edited

Loading

Changing the parser to a context-free grammar approach #168

Changing the parser to a context-free grammar approach #168

Comments

BlueDrink9 commented Oct 20, 2022

Shinyzenith commented Oct 26, 2022

woojiq commented Feb 24, 2024

Shinyzenith commented Feb 24, 2024 • edited Loading

BlueDrink9 commented Feb 24, 2024 • edited Loading

Shinyzenith commented Feb 24, 2024 •

edited

Loading

BlueDrink9 commented Feb 24, 2024 •

edited

Loading