
RFC: Improved State Machine Codegen #3720

Open · wants to merge 7 commits from labeled-match into master
Conversation

@folkertdev folkertdev commented Oct 28, 2024

Rendered

This RFC adds loop match:

  • a loop and match can be fused into a loop match <scrutinee> { /* ... */ }
  • a loop match can be targeted by a continue <value>. The value is treated as a replacement operand to the match expression.

The state transitions (going from one branch of the match to another) can be annotated with the const keyword, providing more accurate CFG information to the backend. That means:

  • more valid programs are accepted, increasing expressivity
  • the backend can generate better code, leading to significant speedups

The goal of loop match is improved ergonomics and codegen for state machines. Rust, being a systems language, should make it possible to write efficient state machines, and currently falls short. Complex state machines are niche, but foundational to many programs (parsers, interpreters, networking protocols).

This RFC follows in part from work on zlib-rs. Today, we simply cannot achieve the same codegen as C implementations. This limitation actively harms the adoption of Rust in high-performance areas like compression.

Basic example

A loop match is similar to a match inside of a loop, with a mutable variable being updated to move to the next state. For instance, these two functions are semantically equivalent:

fn loop_match() -> Option<u8> {
    loop match 1u8 {
        1 => continue 2,
        2 => continue 3,
        3 => break Some(42),
        _ => None
    }
}

fn loop_plus_match() -> Option<u8> {
    let mut state = 1u8;
    loop {
        match state {
            1 => { state = 2; continue; }
            2 => { state = 3; continue; }
            3 => { break Some(42) }
            _ => { break None }
        }
    }
}

Interesting example

The real power of loop match lies in giving the compiler more accurate information about the control flow of a program. Consider

enum State { Foo, Bar, Baz, }

let owned = Box::new(1);
let mut state = State::Foo;
loop {
    match state {
        State::Foo => state = State::Bar,
        State::Bar => {
            // or any function that moves the value
            drop(owned); // ERROR use of moved value: `owned`
            state = State::Baz;
        }
        State::Baz => break,
    }
}

Reading the code, it is obvious that the state moves from Foo to Bar to Baz: no other path is possible. Specifically, we cannot end up in State::Bar twice, and hence the generated "use of moved value" error is not a problem in practice. This program is valid, but nonetheless rejected by the Rust compiler.

With loop const match and const continue the compiler now understands the control flow:

loop const match State::Foo {
    State::Foo => const continue State::Bar,
    State::Bar => {
        // or any function that moves the value
        drop(owned); // all good now!
        const continue State::Baz;
    }
    State::Baz => break,
}

Rendered

Tracking:

@folkertdev folkertdev force-pushed the labeled-match branch 2 times, most recently from 70ae186 to 72ce3d9 Compare October 28, 2024 14:20
@samueltardieu

In the given example, break 'foo Some(42); could be replaced by Some(42). If the goal is to be able to break from a more nested block, maybe a general break 'NAME VALUE, not specific to match, could be used on arbitrary statements or blocks – the 'NAME would stay mandatory to preserve the semantics of break VALUE, which exits the innermost loop.
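For reference, a value-carrying break 'NAME VALUE on plain labeled blocks is already stable Rust (label-break-value, since 1.65). A minimal sketch; the function and label names here are made up for illustration:

```rust
// `break 'NAME VALUE` already works on labeled blocks, not just loops.
fn classify(n: u8) -> &'static str {
    'found: {
        if n == 0 {
            break 'found "zero"; // early exit with a value
        }
        if n % 2 == 0 {
            break 'found "even";
        }
        "odd" // the block's normal tail value
    }
}

fn main() {
    assert_eq!(classify(0), "zero");
    assert_eq!(classify(4), "even");
    assert_eq!(classify(7), "odd");
}
```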

Concerning the state machine implementation mentioned as motivation, wouldn't that also be covered as part of the explicit tail call RFC? Both RFCs propose to change the language syntax, but the "explicit tail call RFC" seems more generic as it covers any case of tail call optimization. However, it requires the state machine to be in a standalone function, or at the beginning of a function.

@joshtriplett joshtriplett added the I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. label Oct 28, 2024
@digama0
Contributor

digama0 commented Oct 28, 2024

Pre-RFC discussion: Zulip

@folkertdev
Author

In the given example, break 'foo Some(42); could be replaced by Some(42). If the goal is to be able to break from a more nested block, maybe a general break 'NAME VALUE, not specific to match, could be used on arbitrary statements or blocks – the 'NAME would stay mandatory to preserve the semantics of break VALUE, which exits the innermost loop.

You're right, in this case the break is there because I needed some way to highlight that break can also target a labeled match.

Concerning the state machine implementation mentioned as motivation, wouldn't that also be covered as part of the explicit tail call RFC? Both RFCs propose to change the language syntax, but the "explicit tail call RFC" seems more generic as it covers any case of tail call optimization. However, it requires the state machine to be in a standalone function, or at the beginning of a function.

The section on guaranteed tail calls has some detail on why tail calls are useful (rust should have them!) but not always the best solution:

  • debugging a program that uses tail calls is not fun (I've done this with wasm runtimes)
  • having just a single stack frame makes code easier to optimize (both manually and by the compiler/LLVM)

Also in terms of ergonomics labeled match and guaranteed tail calls make very different tradeoffs, that work well in some cases and less well in others.

text/3720-labeled-match.md (outdated review thread, resolved)
@clarfonthey
Contributor

clarfonthey commented Oct 28, 2024

I'll be honest, I really don't like this. Having a match that can implicitly act like a loop is very confusing, although I understand the desire to simplify this kind of construct.

I would much prefer if this were some kind of dedicated construct like loop match, rather than just an ordinary match statement. loop match to me would not need explicit continues, replacing all expressions with continues until it hits a break. Having a match which defaults to breaks but still allows continues is just confusing.

And just adding onto this: my main issue is that it should be immediately obvious whether a statement loops. So, seeing a large match, label or not (note: there can be labeled blocks, and maybe eventually labelled statements, so the label is not a key giveaway), that can loop back on itself is very confusing. You should have to explicitly say that it can loop if you want looping.

@ehuss ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Oct 28, 2024
@folkertdev
Author

I agree that "a match can be a loop now" is the most unintuitive part of this proposal. I think any proposal tackling this problem will not be entirely obvious at first though.

This feature is unlikely to show up early in a beginner's journey (e.g. how often do we really see a labeled block in the wild?). The requirement of a label on both the match and the break/continue tries to make the logic, and the fact that something special is going on, as explicit as possible. In combination with the guide, I believe that should be sufficient to get folks unstuck.

The RFC specifically uses labeled loop/block syntax and intuition to make the feature as non-invasive as possible. Expanding the language further is something I'd like to avoid if possible, but I'm open to it if the lang team prefers that route.

@clarfonthey
Contributor

clarfonthey commented Oct 28, 2024

See, the thing is that labelled blocks can't loop. The only things that loop are loops.

I get not wanting to "change the language too much", but I think that subtle syntax with massive changes in functionality is a much bigger change than obvious syntax. It requires people to expand their mental model substantially more when reading Rust code, whereas something like loop match does not.

Plus, it's very easy to imagine making 'loop: match x { ... } simply a shorthand for 'loop: { match x { ... } } (I believe there's even an autofix for this) and this version could not support continues with values in the same way.
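To make that shorthand reading concrete (lookup is a made-up example): the label only permits an early exit from the block, never a jump back to the top, so no looping is possible.

```rust
// What `'l: match x { ... }` would mean if it were merely shorthand for a
// labeled block wrapping a match: `break 'l` exits early with a value, but
// nothing can re-enter the block.
fn lookup(x: u8) -> i32 {
    'l: {
        match x {
            0 => break 'l 100, // early exit with a value
            n => n as i32,     // falls out of the block normally
        }
    }
}

fn main() {
    assert_eq!(lookup(0), 100);
    assert_eq!(lookup(5), 5);
}
```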

@burdges

burdges commented Oct 28, 2024

I agree with @clarfonthey that this syntax looks unnecessary, including that loop match must trash the result of the match.

Yet if folks want this then the syntax should be loop break match { /* match body using continue */ } because

let mut tmp = initial;
loop { break match tmp {
     // match body using assignments plus continue
} }
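That shape, filled in with concrete states (the state values here are made up), already compiles today: arms either produce the loop's break value or update the state and continue.

```rust
// Filled-in sketch of the `loop { break match .. }` shape: `continue` arms
// have type `!`, so only value-producing arms feed the `break`.
fn run() -> Option<u8> {
    let mut tmp = 1u8;
    loop {
        break match tmp {
            1 => { tmp = 2; continue; }
            2 => { tmp = 3; continue; }
            3 => Some(42),
            _ => None,
        };
    }
}

fn main() {
    assert_eq!(run(), Some(42));
}
```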

@workingjubilee
Member

workingjubilee commented Oct 29, 2024

I am not so sure this actually proposes to just add labeled matches.

I think it proposes much more: to ascribe a very specific evaluation order to patterns. Otherwise I don't think it works, because without the order being very strict, I don't think the jump-to-block-like control flow works correctly. And currently we don't have a well-specified order: rust-lang/reference#1665

@folkertdev "This makes the language more complex" is "free space" in terms of drawbacks. You may wish to include, rather, that it is possible that adopting this proposal may require us to forfeit possible ways to evaluate patterns that may be more optimal in some cases. Especially ones that are not zlib's use-case.

In other words, there is not yet evidence that this cannot negatively impact the optimization of programs that don't use this feature.

@scottmcm
Member

scottmcm commented Oct 29, 2024

  • a match can be labeled: 'label: match x { ... }

I continue to think that adding labels in more places with special behaviour like this is overall bad. Rust syntax is best when there's a "here's what's coming" warning up-front, which is why it's good that loop is there at the beginning of code that's about to loop. A label doesn't do that, especially since today a label cannot by itself allow code to run multiple times that couldn't already -- a label allows leaving a block early, but only loops allow re-running it.

If the goal is

should make it possible to write efficient state machines

then my suggestion is still that we should have a dedicated state machine construct, not try to force it into a match, especially since

Complex state machines are niche

and thus there's no need to have particularly-terse notation for it.

Spitballing: what about some builtin macro like

finite_state_machine! {
    'start: {
        continue 'two;
    }
    'two: {
        continue 'three;
    }
    'three: {
        break Some(42);
    }
}

where the blocks can, like let-else, insist on being !-typed so you have to end with continue/break/return/etc?
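The let-else precedent referenced here already enforces exactly that requirement in stable Rust: the else block must diverge (be !-typed). A small example:

```rust
// `let .. else` requires the else block to diverge, the same
// requirement suggested above for the macro's state blocks.
fn first_byte(s: &str) -> u8 {
    let Some(&b) = s.as_bytes().first() else {
        return 0; // must end in return/break/continue/panic
    };
    b
}

fn main() {
    assert_eq!(first_byte("abc"), b'a');
    assert_eq!(first_byte(""), 0);
}
```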

EDIT: Aha, finally found one of the recent zulip threads about this, https://rust-lang.zulipchat.com/#narrow/channel/213817-t-lang/topic/Fallthrough.20in.20Match.20Statements/near/474070351

(That example is an uninteresting use of the feature, but it's the same as the one in the RFC summary.)

Then it's clear up-front that this is weird stuff that you have to read differently from normal blocks, there's an obvious wall we can use to say which labels you're allowed to use for this, etc.

And there's no _ => None arm, like the one in the summary example from the RFC, which was completely unreachable; I didn't understand what it was there for.


I think I'm basically echoing what @clarfonthey said above, so I'll repeat it for emphasis

I think that subtle syntax with massive changes in functionality is a much bigger change than obvious syntax

New syntax is fine. We have identifier space reserved to do it non-breakingly in current editions, then change to something else later if needed. But if it's rare, we don't necessarily even need to do that.

So I completely disagree with the RFC's phrasing of

no new keywords or syntactic constructions are needed

as an advantage. It's much easier to learn something with a macro.finite_state_machine.html page in the docs, say, than something that you don't even know what to search for since you don't know what it's called.


EDIT: For clarity, I do think something in this space would be a good thing for Rust to have. I just think that 'foo: match is the wrong way to spell it.


The labeled match proposal combines existing rust features of `match` and labeled blocks/loops. It is just the interaction between these concepts that has to be learned, no new keywords or syntactic constructions are needed. Occurrences of labeled match will be rare, and true beginners are unlikely to encounter them early on.

Labeled match does not introduce arbitrary control flow (like general `goto`) or surprising implicit control flow (like `switch` fallthrough in C and descendants). The mechanism fits nicely into how Rust works today.
Member

It does introduce irreducible control flow though, right? Which is otherwise impossible in surface rust.


@hanna-kruppe hanna-kruppe Oct 30, 2024


I think it’s possible to semantically understand the construct as a reducible CFG similar to loop { match { … } }, and not just in the trivial sense that any CFG can technically be made reducible. This would give a reasonable way to apply static analyses that are more naturally expressed in terms of structured control flow (with some loss of precision wrt possible paths through the CFG). If this is not the case, then that’s another drawback in my opinion.

However, it’s true that the sensible MIR lowering for achieving the desired codegen will usually be irreducible from the start. Today you can only(?) write programs that start out as reducible CFG and possibly become irreducible during optimization. Although I don’t know if any current MIR optimizations actually do that (jump threading is the usual suspect) or whether the ball is completely in LLVM’s court for this.

Member

If we get the borrowck to understand this then this will really be irreducible


On second thought, maybe this isn’t as novel as it seems. With async/await you can also describe what is conceptually irreducible control flow inside a state machine. For example, resuming loop { g().await; } after suspending at the await jumps into the middle of the loop body (after calling g().into_future(), before calling poll()) so there isn’t a single header block that dominates the loop. Here, too, it’s possible to pretend the state machine is a big loop + match to make it reducible but this loses precision about which state transitions are possible (e.g., you can’t go “backwards” within one iteration of the loop, but in the loop+match CFG it appears possible unless you do the equivalent of jump threading wrt the state variable). If I’m reading the lowered MIR correctly then this already generates irreducible MIR today without optimizations.


@hanna-kruppe hanna-kruppe Oct 30, 2024


If we get the borrowck to understand this then this will really be irreducible

Yes that’s one analysis that would have to choose between precision and dealing with irreducibility. How does it deal with irreducible async/await state machines (see my last comment) today?

Member

I suspect they're borrow-checked before the coroutine transform but I could be wrong.


The idea was first introduced in [this issue](https://github.com/ziglang/zig/issues/8220) which has a fair amount of background on how LLVM is not able to optimize certain cases, reasoning about not having a general `goto` in zig, and why tail calls do not cover all cases.

[This PR](https://github.com/ziglang/zig/pull/21257) implements the feature, and provides a nice summary of the feature and what guarantees zig makes about code generation.
Member

Note that Rust is not going to give any guarantees about code generation. That's a QoI issue, not a semantic one, and not something with stable guarantees for basically anything.

(A spec would describe how this is lowered to MIR, perhaps, but not to machine codegen.)

Author

right, the RFC text is (or should be, I tried to be) careful to not provide any hard guarantees about the generated code. The property I really want is that continue 'label value is lowered to a goto in a predictable way: expert programmers should be able to be fairly confident that that desugaring kicks in when they write their code in a specific way.

Even when the continue is desugared to a goto, MIR optimizations and LLVM can change the structure. That is fine, the assumption being that an unconditional jump is actually unlikely to be touched except if it can be removed entirely.

Zig has its own code generation backends, and hence can make hard guarantees about the assembly that they output.

Member

@RalfJung RalfJung Oct 31, 2024


Zig says:

"This lowering can inflate code size compared to a simple "switch in a loop" lowering, and any Zig implementation is, of course, free to lower this syntax however it wishes provided the language semantics are obeyed. However, the official ZSF compiler implementation will attempt to match the lowering described above, particularly in the ReleaseFast build mode."

That doesn't sound like a hard guarantee at all, it sounds like a QoI promise.

Where are you getting the idea from that Zig makes hard guarantees here?

Author

Sorry, I meant "can" in the sense of "could": because they literally put the bytes of the final program in the right order, they could guarantee things about the shape of that generated assembly. I'm not sure they would actually want to. Based on the OP of that PR the support is different depending on the backend right now.

The semantics of `'label: match`, `continue 'label value` and `break 'label value`, including any edge cases or error messages that have been missed so far.

The RFC text provides background on why this feature is needed for improved code generation, but from the language perspective, only the above three elements are required.
The exact details of the HIR to MIR lowering can be figured out through the implementation.
Member

This is one of the reasons I'm not a fan of the match phrasing of this. The lowering of match to MIR is already the most complicated part of that lowering -- including referencing papers about how to do it avoiding exponential blowup -- so if it's not obvious enough to be written here, I'm inclined to say that it shouldn't happen period. match is not C++'s switch, and needs to handle way more things than just integers.

OTOH, lowering a block to MIR is easy, and that lowering already has Goto terminators to other MIR blocks.

Author

This is a summary, more detail is provided later on. I think the two open questions are

  • what should the lowering of continue 'label value be if we can't determine which pattern value matches at compile time
  • how do we handle nested pattern matches

I don't think either of these questions is hard to answer: it's just work. Work that we did not yet do for the PoC, and are hesitant to do if it's unlikely to be used at some point.

This has nothing to do with the desugaring of pattern matching itself: we're just evaluating the pattern match at compile time if the necessary information is available.

Member

Ask me if you need concrete details on how we lower matches to MIR. What I've proposed in the discussion is something I am fairly confident is feasible.


State machines require flexible control flow. However, the unstructured control flow of C is in many ways too flexible: it is hard for programmers to follow, and for tools to reason about and give good errors for. Ideally, there is a middle ground: code that is easy to understand (by human and machine), interacts well with other Rust features, and is flexible enough to efficiently express state machine logic.

Today there is no good way to translate C code that uses implicit fallthroughs or similar control flow to Rust while preserving both the ergonomics (in particular, the number of levels of indentation) and the performance (due to LLVM using jump tables instead of an unconditional jump, see the next section). If we wanted to translate this C code to Rust:
Member

suggestion: phrase this as "a consistent level of indentation" or "a number of levels of indentation that's sub-linear in the number of states", rather than just "the number of levels".

"Rust has two levels more indentation than C" isn't good motivation, but "I shouldn't have to ski-jump off the page" is.

(Though even there, it'd be good to show you can't fix that issue by having a macro_rules macro that does the loop translation.)

Member

@RalfJung RalfJung Oct 31, 2024


There is such a macro, it was proposed on Zulip.

EDIT: Ah, but that's not a full loop, just fallthrough semantics.

@folkertdev
Author

@workingjubilee

I think it proposes much more: to ascribe a very specific evaluation order to patterns. Otherwise I don't think it works, because without the order being very strict, I don't think the jump-to-block-like control flow works correctly. And currently we don't have a well-specified order: rust-lang/reference#1665

The idea is that we do at compile time what would be done at runtime anyway: compute the next location to jump to. The process is short-circuited, but not changed. What suggests to you that evaluation order would change?

Also strictly speaking the specifics of the codegen approach are not part of the RFC itself, but they seemed important to talk about anyway because it's at least half of the justification for the feature (the other half being improved ergonomics for state machines and translating C-style code)

@folkertdev "This makes the language more complex" is "free space" in terms of drawbacks. You may wish to include, rather, that it is possible that adopting this proposal may require us to forfeit possible ways to evaluate patterns that may be more optimal in some cases. Especially ones that are not zlib's use-case.

Again, I don't think the evaluation order is changed at all. The only change is that a double jump between blocks a -> b -> c is short-circuited to a -> c when possible.

In other words, there is not yet evidence that this cannot negatively impact the optimization of programs that don't use this feature.

Any code not using labeled match would be unaffected. When you do use labeled match, you get its codegen. Also, I can't think of any cases where an unconditional jump is worse for performance (though there might be cases where other optimizations get missed? Seems unlikely to me, but theoretically possible).

@Nadrieril
Member

Nadrieril commented Oct 29, 2024

loop match is cute, I like it.

I think it proposes much more: to ascribe a very specific evaluation order to patterns. Otherwise I don't think it works, because without the order being very strict, I don't think the jump-to-block-like control flow works correctly.

Hm, kinda. If the new state is fully specified, then jumping to the right block is a straightforward optimization on top of the obvious desugaring. If the scrutinee is partially known then optimizing the desugaring may indeed not work reliably:

loop match (&0, 0) {
    (0, 1..) => continue (foo(), 0),
    (_, 0) => break,
    _ => unreachable!()
}
// desugars+compiles to:
let mut state = (&0, 0);
loop {
    if *state.0 == 0 {
        if state.1 >= 1 {
            state = (foo(), 0);
            continue;
        }
    }
    if state.1 == 0 {
        break;
    }
    unreachable!()
}

we can't optimize the continue to break directly here, because there's an intermediate dereference that (afaik) we can't optimize away.

But we don't need to compile it to a desugaring: we could say "if the continue expression is sufficiently precise to match only one arm of the loop match, we will compile it to a direct jump to that arm. If not, all bets are off". Then it's "just" down to clever MIR lowering.

What I don't know is how to specify "the continue expression is statically known to have a given value". Do we allow continue CONSTANT? continue if CONSTANT { Some(1) } else { Some(2) }? let x = 4; foo(); continue Some(x)? I don't know where to draw the line.

@Nadrieril
Member

We discussed this a bit with @traviscross, and found a sensible line: continue <expr> could be guaranteed to jump directly to the right arm if expr is const promotable, i.e. the same expressions that can be hoisted into a static (those such that &<expr>: &'static _). This makes it into a kind of constant-time-computed goto.

Then we can lint jumps that aren't of this form, either because the expression is not const-promotable or because it doesn't suffice to know which arm will be taken (e.g. if there's an arm guard).

I didn't see it discussed, but if we implement this as a MIR lowering property the borrow-checker would understand these direct jumps. This makes the feature pull more weight imo.
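For readers unfamiliar with const promotion: it is the existing rule deciding when &<expr> can be given a 'static lifetime. A rough sketch of the proposed boundary (the function and values here are made up for illustration):

```rust
// "Const promotable" = expressions that can already be borrowed for 'static.
// Under the proposed rule, `continue <expr>` would be a guaranteed direct
// jump exactly when `&<expr>: &'static _` type-checks.
fn promoted() -> (&'static i32, &'static (u8, bool)) {
    let a: &'static i32 = &(1 + 2);          // promotable: built from constants
    let b: &'static (u8, bool) = &(3, true); // promotable aggregate
    // let n = 4;
    // let c: &'static i32 = &n; // ERROR: `n` is a runtime value, not promotable
    (a, b)
}

fn main() {
    let (a, b) = promoted();
    assert_eq!(*a, 3);
    assert!(b.1);
}
```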

@digama0
Contributor

digama0 commented Oct 30, 2024

I don't think that's good enough for applications, I would want continue State1(x, y) to also jump to the right arm.

@programmerjake
Member

continue <expr> could be guaranteed to jump directly to the right arm if expr is const promotable

so then if I were to try to use this to write an interpreter it would have to be something like:

macro_rules! make_interpreter {
    (
        @variants $variants:tt $label:lifetime: loop match ($next_insn:expr) {
            $(
                $Variant:path => $block:block
            )+
        }
    ) => {
        $label: loop match $next_insn {
            $(
                $Variant => {
                    $block
                    make_interpreter! {
                        @variants $variants continue $label $next_insn
                    }
                }
            )+
        }
    };
    (
        $label:lifetime: loop match ($next_insn:expr) {
            $(
                $Variant:path => $block:block
            )+
        }
    ) => {
        make_interpreter! {
            @variants($($Variant,)+) $label: loop match ($next_insn) {
                $(
                    $Variant => $block
                )+
            }
        }
    };
    (
        @variants($($Variant:path,)+) continue $label:lifetime $next_insn:expr
    ) => {
        match $next_insn {
            $($Variant => continue $label $Variant,)+
        }
    };
}

#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug)]
#[repr(u8)]
pub enum Opcode {
    Add,
    Const,
    Ret,
    JumpIfNotEq,
}

#[repr(C)]
pub struct Insn {
    pub opcode: Opcode,
    pub args: [u8; 3],
}

pub unsafe fn interpreter(mut pc: *const Insn, regs: &mut [u64; 16]) {
    macro_rules! reg_a {
        () => {
            regs[((*pc).args[0] >> 4) as usize]
        };
    }
    macro_rules! reg_b {
        () => {
            regs[((*pc).args[0] & 0xF) as usize]
        };
    }
    macro_rules! reg_c {
        () => {
            regs[((*pc).args[1] >> 4) as usize]
        };
    }
    macro_rules! imm_i16 {
        () => {
            i16::from_ne_bytes([(*pc).args[1], (*pc).args[2]])
        };
    }
    unsafe {
        make_interpreter! {
            'a: loop match ((*pc).opcode) {
                Opcode::Add => {
                    reg_a!() = reg_b!().wrapping_add(reg_c!());
                    pc = pc.add(1);
                }
                Opcode::Const => {
                    reg_a!() = imm_i16!() as u64;
                    pc = pc.add(1);
                }
                Opcode::Ret => {
                    return;
                }
                Opcode::JumpIfNotEq => {
                    if reg_a!() == reg_b!() {
                        pc = pc.offset(imm_i16!() as isize);
                    } else {
                        pc = pc.add(1);
                    }
                }
            }
        }
    }
}

While this is a natural way to express a state machine, it is well-known that when translated to machine code in a straightforward way, this approach is inefficient on modern CPUs:

- The match is an unpredictable branch, causing many branch misses. Reducing the number of branch misses is crucial for good performance on modern hardware.
- The "loop + match" approach contains control flow paths (so, sequences of branches) that will never be taken in practice. The stack can be smaller if the control flow paths are known more precisely.


First off: having a prototype implementation and benchmarks to accompany the RFC is great, kudos for that.

Unfortunately what's written here doesn't seem to fit with the linked benchmark results. The reported branch mispredicts (absolute number!) don't change at all in any of the comparisons in the gist. Instead, the most drastic changes in wall clock time seem to correlate mostly with executing fewer instructions and having fewer cache misses. Branch misprediction rate even appears to be lower for the loop+match variant if it executes more instructions (including more branches) but has the same absolute number of mispredicts. The fact that cache misses sometimes differ drastically between loop+match and the labeled-match variants suggests these variants differ in more fundamental ways than what this RFC discusses. I don't understand the code / workload well enough to even speculate why this could be, but it really doesn't line up with the RFC motivation, so something's fishy.

Author

The text you cite makes a point based on a very simple example. The straightforward code generation would create an unpredictable branch, and its performance would be bad. The text there already mentions the other benefit of unconditional jumps: simpler CFGs for later compiler passes.

In practice, LLVM already does not use the completely naive approach (at least on x86_64, and I believe most modern platforms). Instead, a jump table is used (this is discussed in the LLVM section). Modern branch predictors are able to deal with those quite well.


The benchmark results are from:

  • a program that uses tail calls
  • a program that uses labeled match

I think it's expected that these have roughly the same number of branch misses, because the majority of jumps in the state machine logic are unconditional in both versions.

Next, the detailed benchmarks (linked in the RFC text) https://gist.github.com/folkertdev/977183fb706b7693863bd7f358578292 include a "loop + match" version. Here again we see that branch misprediction counts are roughly the same on my machine, with a barely significant difference on bjorn's machine. I'm honestly not completely sure why that is; my assumption is that modern branch predictors are that good, and that the branches in the benchmark are predictable enough in practice that we only see a minimal increase in branch mispredictions.


So, what explains the performance difference? I believe it is the second point mentioned: a simpler CFG has many downstream benefits: stack slots can be reused sooner, fewer instructions are pushed out of the instruction cache, etc. Diffing the assembly is a mess, the labeled match version is slightly shorter (~40 lines, so roughly that number of instructions). I guess that matters if you can get rid of instructions in tight inner loops, though I can't say for sure that that is where the instructions are saved.

So, with modern branch predictors, with the input I chose for the benchmark (which is a fairly standard input file for benchmarking zlib performance), it looks like branch mispredictions are not a significant factor in explaining the difference in performance in that benchmark. But the difference we see is real, and crucial for beating C.

Hopefully that clears things up?


Just to be sure, is it accurate to say: all of the variants essentially implement the same state machine and do the same work except for different representations of the state machine transitions plus whatever ripple effects that has for compiler optimizations? At the source level, there are no differences in algorithms or memory access patterns that could explain different numbers of instructions or cache misses?

This doesn't really clear things up, except confirming my suspicion that nobody actually understands where the performance improvement is coming from. That isn't necessarily a problem when one is trying to make a specific program faster: the speedups you can get for representative workloads are what matters in the end, even if they're different from what your mental performance model would have predicted. However, for a language feature intended to be used in countless yet-unwritten programs, I would hope for less hand-wavy understanding of how they'll be impacted.

You're right to point out that presenting the state machine transitions directly in the CFG can have impact throughout the entire optimization and codegen pipeline (not all of them positive, by the way). Comparing the generated assembly code of the hot loops would be a good first step to better understanding. Could you please share the relevant parts along with some profiling data such as perf record + perf annotate? Due to target-cpu=native and the big differences between CPU models, it may look very different on your machine than on mine.

That said, if the data access patterns are the same then I don't see how this could explain the big difference in cache misses. While instruction cache misses are very important for applications with huge code footprints, we're talking about tight inner loops here, and on my machine the entire text section of each binary is only about half the size of the compressed input file. Similarly, it's usually very hard for less optimal stack frame layouts to cause any measurable difference for cache misses because all the "important" stack slots will virtually always stay in L1 cache. But we don't have to limit ourselves to idle speculation: Linux perf can show you which exact instructions often cause which kinds of misses, at least on bare metal (e.g., not under WSL on a Windows machine like mine).

Author

well, here is more detail https://gist.github.com/folkertdev/0926f620e41509e2bda0da2c252921fe

This commit trifectatechfoundation/zlib-rs@85cf9cc really only turns continue 'label into continue 'label value. So now the only difference really is that labeled match in our PoC gets the improved MIR lowering.

You can just search for the number of jmp rax instructions in the assembly to confirm that we generate fewer (not 0, because in our PoC the initial pattern match does not become a direct jump, and we call a helper function after which we have to do a jump that is not statically known).

I don't really have a satisfying explanation for the increase in cache misses for a chunk size of 4. The cache misses appear to occur at reasonable locations.

Also, I'd hope that the results in general replicate on your machine, and that the specific architecture/target features don't matter too much.

Let me know if there is anything else you'd like to see

You're right to point out that presenting the state machine transitions directly in the CFG can have impact throughout the entire optimization and codegen pipeline (not all of them positive, by the way).

Are you thinking of anything specific here? I don't see how providing a more accurate CFG could give worse results.


Thanks for the data. With the len-as-match branch I got notably different results than in the original gist in some metrics. In particular, I see no significant changes to cache_misses with the labeled match variants in several cases where your results have enormous changes, so I thought it would be better to zoom in on one machine than to wrangle that variation as well. Similarly, I can't reproduce most of the results you report for the split-out-len branch except the broad strokes of wall time, instructions, and clock cycles improving by similar amounts for chunk size 4. For reference, my CPU is Intel i7-6700K @ 4GHz.

I haven't found the time to really dig into the assembly and perf annotate (which don't seem to match up, by the way). There seem to be very few samples recorded. The only really interesting thing I've gleaned from the assembly you shared is that the stack frame is the exact same size for both (256 bytes), and that both have almost the same number of instructions referencing [rsp + $offset]. This makes me even more skeptical that any correct explanation for the performance difference involves the stack.

Are you thinking of anything specific here? I don't see how providing a more accurate CFG could give worse results.

In general any tweak to the input program, even if it only gives "more information", can result in worse code generation due to incomplete and greedy heuristics. But there are some factors specific to irreducible control flow. A lot of analyses and transformations like to work in terms of reducible/structured control flow, or are less well-tuned for irreducible CFGs. Some examples:

  • LLVM's LoopInfo can only represent natural loops, so any optimization that works with loops won't work as well or at all on irreducible representations of control flow. There is a more general Cycle concept in LLVM nowadays, but it's less widely used, and I don't think it's realistically possible for it to replace the "natural loop" concept entirely.
  • It's not just transformations targeted at loops that rely on LoopInfo, it's also used in many heuristic cost models including (but not limited to) at the machine code level. For example, loop information is used to guide register allocation (spill weights), instruction scheduling, basic block placement, the x86-specific pass that converts between branches and cmovs, and many more. Of course irreducible control flow must be handled correctly, but the cost model will usually be less accurate (that doesn't have to result in worse codegen, but it can).
  • Sometimes these problems are solved by duplicating code ("node splitting"). For example, LLVM's ordinary jump threading pass deliberately limits itself w.r.t. loop+match state machines to avoid introducing irreducible control flow. The newer "DFA jump threading" pass tries to optimize such code without making the CFG irreducible by duplicating the successor basic block instead of directly jumping to the existing copy. But such duplication can easily cause massive code size increases, so any pass that does this has a hard limit on how much code it can duplicate, meaning it'll still miss optimizations for larger state machines.
  • Occasionally, the CFG must be made reducible for correctness. This is more common with research-y prototypes than with the default pipeline of industry compilers, but it comes up when compiling to WebAssembly. Any irreducible CFG will be made reducible at some point before emitting wasm, either by code duplication or by (re-)introducing explicit state variables and branches on them. I don't want to make any bets whether you'll get better results by carrying structured control flow through the entire optimization pipeline or showing the "middle end" irreducible control flow and letting the backend make it structured somehow.
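For readers unfamiliar with the pattern LLVM's "DFA jump threading" targets, here is a minimal stable-Rust sketch (names are illustrative) of the loop+match shape in question; the pass tries to replace the `state = ...; loop back to dispatch` back-edges with direct jumps between arms, duplicating blocks where necessary:

```rust
// A tiny loop + match state machine (illustrative). The single match is
// the dispatch point; every transition goes back through it. A DFA jump
// threading pass can thread `A -> B -> C -> Done` into direct jumps.
enum State {
    A,
    B,
    C,
    Done,
}

fn run() -> u32 {
    let mut state = State::A;
    let mut acc = 0u32;
    loop {
        state = match state {
            State::A => {
                acc += 1;
                State::B
            }
            State::B => {
                acc += 2;
                State::C
            }
            State::C => {
                acc += 3;
                State::Done
            }
            // `break` diverges, so this arm still type-checks against `State`.
            State::Done => break,
        };
    }
    acc
}

fn main() {
    assert_eq!(run(), 6);
}
```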

@digama0
Contributor

digama0 commented Oct 30, 2024

For a bunch of reasons. One is that we don't have these sort of guarantees in the language today. Also as explained previously, this is harder (and sometimes wrong) to do as an optimization if we want reliable guarantees. And doing as an optimization means we can't get borrow checker benefits which is too bad imo.

Note that this RFC does not get borrow checker benefits. (Nor does it claim to, based on my reading of the RFC.) This is actually one of the reasons I'm not a big fan of this particular proposal, although I very desperately want rust to gain some primitive for doing irreducible control flow. To get borrow checker benefits I think you really need a syntactic form that connects the match arm to the jump, and the proposed syntax of continue 'a <expr> is just not direct enough IMO to be able to support borrow checking magic when <expr> matches some specific forms. (It gets awfully close to guaranteed constant propagation (for non-constants) visible in the type system a la typescript...)

@Nadrieril
Member

To get borrow checker benefits I think you really need a syntactic form that connects the match arm to the jump, and the proposed syntax of continue 'a <expr> is just not direct enough IMO to be able to support borrow checking magic when <expr> matches some specific forms. (It gets awfully close to guaranteed constant propagation (for non-constants) visible in the type system a la typescript...)

Yeah, we should probably have a different syntactic expression to get the "direct jump" versus "do the match normally" behaviors, because if it affects borrowck this needs to be visible from reading the code. continue <expr> is such a good syntax though, not sure what else we could use.

@folkertdev
Author

Note that this RFC does not get borrow checker benefits. (Nor does it claim to, based on my reading of the RFC.)

This is not something I considered (so it's not currently part of the RFC), but you get the borrow checker changes for free, because of how the MIR is generated. e.g. this errors:

enum State { Foo, Bar, Baz, }

fn main() {
    let owned = Box::new(1);
    let mut state = State::Foo;
    loop {
        match state {
            State::Foo => {
                state = State::Bar;
            }
            State::Bar => {
                // or any function that moves the value
                drop(owned); // ERROR use of moved value: `owned`
                state = State::Baz;
            }
            State::Baz => {
                break;
            }
        }
    }
}

Whereas, if written using labeled match (or loop match, whatever syntax we settle on), the CFG can never reach State::Bar a second time:

'label: match State::Foo {
    State::Foo => {
        continue 'label State::Bar;
    }
    State::Bar => {
        // or any function that moves the value
        drop(owned); // all good now! 
        continue 'label State::Baz;
    }
    State::Baz => {
        break 'label;
    }
}

And hence this program would be accepted by the borrow checker: cool! But also whether the optimization fires can determine whether a program passes the borrow checker.
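As a side note, the erroring loop+match example can be made to compile on stable Rust today by wrapping the moved value in an Option, so the borrow checker sees a checked runtime move instead of a move the loop might repeat (a hedged, illustrative workaround, not part of the RFC):

```rust
// Stable-Rust workaround (illustrative): `take()` moves the Box out at
// most once; if the loop ever revisited State::Bar, it would simply
// observe `None` instead of a use of a moved value.
enum State {
    Foo,
    Bar,
    Baz,
}

fn run() -> i32 {
    let mut owned = Some(Box::new(1));
    let mut state = State::Foo;
    let mut last = 0;
    loop {
        match state {
            State::Foo => state = State::Bar,
            State::Bar => {
                if let Some(b) = owned.take() {
                    last = *b; // the Box is moved (and dropped) here
                }
                state = State::Baz;
            }
            State::Baz => break,
        }
    }
    last
}

fn main() {
    assert_eq!(run(), 1);
}
```

The cost is a runtime check (and an extra niche/flag in memory), which is exactly the overhead the const continue approach avoids by encoding the fact in the CFG.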

@digama0
Contributor

digama0 commented Oct 30, 2024

To get borrow checker benefits I think you really need a syntactic form that connects the match arm to the jump, and the proposed syntax of continue 'a <expr> is just not direct enough IMO to be able to support borrow checking magic when <expr> matches some specific forms. (It gets awfully close to guaranteed constant propagation (for non-constants) visible in the type system a la typescript...)

Yeah, we should probably have a different syntactic expression to get the "direct jump" versus "do the match normally" behaviors, because if it affects borrowck this needs to be visible from reading the code. continue <expr> is such a good syntax though, not sure what else we could use.

Other proposals in this space have suggested to use multiple labels, one per block, and I think that makes a lot more sense than marking a match expression and then having to describe an arm in it.

And hence this program would be accepted by the borrow checker: cool! But also whether the optimization fires can determine whether a program passes the borrow checker.

Right, I think this is a non-starter, optimizations cannot affect borrow checker behavior or else they are not optimizations. So either the real implementation will have to go to some extra work specifically to prevent this behavior from leaking, or else the RFC will have to "own it" and define precisely when the optimization which isn't an optimization actually kicks in.

@programmerjake
Member

programmerjake commented Oct 30, 2024

what if you had to write static continue 'a expr to get the jump-directly-to-match-arm semantics (this requires being able to figure out which match arm to jump to at compile time, otherwise you get a compile error) and the borrow checking stuff; otherwise it behaves as though jumping to the match (and may optimize to a jump to a match arm, but that's not observable from language semantics)

@Nadrieril
Member

Other proposals in this space have suggested to use multiple labels, one per block, and I think that makes a lot more sense than marking a match expression and then having to describe an arm in it.

I never like these proposals because labels feel like a last-resort kind of feature in rust, not a first-class one. Plus with a match you get to pass values around without having to define a bunch of mutable variables.

what if you had to write static continue 'a expr

Yes I like this! Or const continue given that it does something like compile-time computation.

@programmerjake
Member

programmerjake commented Oct 30, 2024

I never like these proposals because labels feel like a last-resort kind of feature in rust

but can you express computed goto using loop match (e.g. with an enum where its repr is a pointer to the block addresses in a function) where for branch prediction reasons you want every block to have a separate computed goto instruction?

That is needed for interpreters: CPUs can predict branches much better if every interpreted instruction kind has a separate computed goto, since branch predictors use the address of the branch instruction to differentiate which predictions to give. That lets the branch predictor learn more complex patterns, because it gets more information.
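For contrast, in stable Rust today all opcode dispatch funnels through a single match, i.e. one branch site, which is what the comment above wants to avoid (a minimal illustrative sketch, not from the RFC):

```rust
// Minimal bytecode interpreter sketch (illustrative). Every opcode
// dispatches through the one `match code[pc]` below, so the predictor
// sees a single indirect branch and cannot key predictions on which
// opcode ran last; per-arm computed gotos would give it N branch sites.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

fn eval(code: &[Op]) -> i64 {
    let mut stack = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Halt => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

fn main() {
    let prog = [Op::Push(2), Op::Push(3), Op::Add, Op::Halt];
    assert_eq!(eval(&prog), 5);
}
```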

@digama0
Contributor

digama0 commented Oct 31, 2024

Other proposals in this space have suggested to use multiple labels, one per block, and I think that makes a lot more sense than marking a match expression and then having to describe an arm in it.

I never like these proposals because labels feel like a last-resort kind of feature in rust, not a first-class one. Plus with a match you get to pass values around without having to define a bunch of mutable variables.

I tend to agree with this, I think it would look better with function-call syntax rather than using labels. But to be clear I did mean blocks with parameters here (something which would be its own RFC needing appropriate syntax). There are plenty of design proposals for this linked in the previous discussions.

but can you express computed goto using loop match (e.g. with an enum where its repr is a pointer to the block addresses in a function) where for branch prediction reasons you want every block to have a separate computed goto instruction?

Didn't a certain @programmerjake demonstrate above how to do this using a macro? It seems kind of orthogonal, given that there is no first class support for computed goto in this proposal.

@programmerjake
Member

Didn't a certain @programmerjake demonstrate above how to do this using a macro?

not really, because the macro I wrote would produce N separate jump tables for the N match arms and take O(N^2) space/time. computed goto can do all of that without any jump tables at all and doesn't need a crazy O(N^2) macro.

@folkertdev
Author

I never like these proposals because labels feel like a last-resort kind of feature in rust, not a first-class one.

I agree with the sentiments on macro + label solutions. Specifically, I think the connection between control flow and data flow is important. E.g. with the label solutions it is more laborious to store the current state (so that you can later resume there), because labels are not first-class.

what if you had to write static continue 'a expr

This is great! I think I slightly prefer const continue 'label expr but static could work too.

@Aloso

Aloso commented Nov 1, 2024

I think that a looping match construct in general would be useful, and not just some dedicated state machine syntax.

I beg to differ. Being able to write loop match {} instead of loop { match {} } is not a significant improvement in ergonomics and readability. But the state machine optimization is the main motivation of this RFC, and it is impossible to achieve in current Rust.

Moreover, I believe that the vast majority of loop { match {} } constructs are actually state machines, and would benefit from a more explicit syntax for state machines.

When your intention is to write a state machine, a macro called state_machine! would be much more explicit and readable than loop match.

@folkertdev changed the title from "RFC: Labeled match" to "RFC: Improved State Machine Codegen" on Nov 18, 2024
@folkertdev
Author

folkertdev commented Nov 18, 2024

updates

I've pushed a number of updates to the RFC text. The main changes are summarized below

syntax

The syntax has now been changed to loop match, as discussed.

borrow checker changes

The standard loop match is like a loop { match { ... } } (though it can potentially be used for e.g. computed goto in the future). To convey more accurate CFG information, state transitions can be annotated with const, e.g. this is accepted:

let x: u64;

loop const match true {
    true => {
        x = 42;
        const continue false;
    }
    false => {
        dbg!(x)
    }
}

Because the compiler can now know that all paths that lead to the dbg!(x) expression have initialized x, this is valid. Removing the const from either the loop match or the continue would give an error that x may not be initialized.

restrictions on when a state transition can be const

We've opted for a restrictive but straightforward criterion for which state transitions can be marked as const: the expression must be eligible for static promotion as introduced in RFC 1414. Notably, that excludes any partial patterns (where e.g. the outer constructor is known but inner fields are only known at runtime). That is unfortunate, but the static promotion rule is simple and covers most current cases.
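For reference, RFC 1414 static promotion can be illustrated on stable Rust; the commented-out line shows a case that is not promotable (an illustrative sketch):

```rust
// Static promotion (RFC 1414): a simple constant expression behind `&`
// can be promoted to a `'static` place; anything that needs runtime
// work, like a heap allocation, cannot.
fn main() {
    let promoted: &'static i32 = &42; // OK: `42` is promoted to a static
    // let rejected: &'static String = &String::from("x");
    // ^ ERROR: temporary value dropped while borrowed
    assert_eq!(*promoted, 42);
}
```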

irreducible control flow

The const-marked state transitions can introduce irreducible control flow. As far as we've been able to determine, that is not a problem for borrow checking. Of course, once introduced, irreducible control flow cannot be removed, so adding it should not be taken lightly.

Where do we go from here

Tweede Golf and Trifecta Tech Foundation are committed to working on improving state machine codegen, though the pace will depend on whether we can get some supplemental funding.

Some specific questions relevant to lang team review are:

Is irreducible control flow a dealbreaker?

I think there are compelling reasons to include it, but there is a path forward that provides just the codegen improvements without adding irreducible control flow to the surface language.

Syntax

The triple-stacking of keywords in loop const match is not the prettiest. Yet all three keywords have their function. It could be loop match const, or loop static match, or whatever. Syntax suggestions are easy, and everyone can contribute to discussions on syntax.

Ultimately, I'm interested in the backend changes, and not committed to a particular syntax. To me there are two particular requirements:

  • I'd really like it if syntax were not blocked on an edition
  • I think a connection between data flow and control flow (i.e. like match/pattern matching) is essential

@saethlin
Member

The triple-stacking of keywords in loop const match is not the prettiest. Yet all three keywords have their function. It could be loop match const, or loop static match, or whatever. Syntax suggestions are easy, and everyone can contribute to discussions on syntax.

I don't think this is a serious problem. We already have pub const unsafe extern fn, the only problem would be if reordering the keyword soup changes semantics. If there's only one valid order we can have a helpful diagnostic, as we do for the keywords that come before fn.


The advantages of having more accurate borrow checking, and accepting more valid programs, are compelling to me, but this more limited solution here could absolutely work from just a performance perspective.

## Syntax Squabbles
Member

Could you still put #3720 (comment) / #3720 (comment) as an alternative? Basically using a dedicated macro rather than introducing new syntax for this feature in the first place, so we can decide later on a dedicated syntax and whether it is worth it (like addr_of!(x) → &raw x, or matches!(x, p) → x is p).

finite_state_machine! {
    goto dyn count % 4;
    0 => {
        one();
        goto 3;
    }
    3 => {
        one();
        goto 2;
    }
    2 => {
        one();
        goto 1;
    }
    1 => {
        one();
        n -= 1;
        if n > 0 {
            goto 0;
        } else {
            break;
        }
    }
    _ => unreachable!(),
};

Author

yes, I added it.

I quite like goto dyn actually, that's good.

I am, at first glance, worried about the ergonomics of such a macro. To get the goto target validation and lowering to a MIR goto you need serious compiler support. You'd also need custom error messages for this special rust-like dialect that is valid within the macro, and it needs to be parsed. Maybe there is actually good tooling for this, but my experience with the asm! macros is that the parsing is very manual and kind of tedious.

text/3720-labeled-match.md (outdated)
```rust
loop match read_next_byte() {
    0 => continue 1,
    _ => continue,
```


I find this fairly surprising. If I think of loop match ... as just a shortcut for loop { match ... }, then it makes sense in isolation. But when it's mixed with the continue value form, it means the read_next_byte() part is only executed in some subset of loop iterations, depending on which state transitions are taken. As a reader I have to (1) remember this every time I see a plain continue, and (2) scroll back up to the start of the loop match to see the code that runs to determine the next state.

If plain continues were disallowed in loop match, I could instead read loop match read_next_byte() as: now we're entering a state machine, the initial state is determined by read_next_byte(), and to understand the state machine I just chase the transitions from one state to the next. This seems conceptually simpler to me, and doesn't fundamentally lose any expressive power, does it?

@Diggsey
Contributor

Diggsey commented Nov 23, 2024

I think the proposed syntax is confusing because the argument to loop match appears to be a constant, but can actually change.

If I see match true { false => ..., true => ... } then I would expect the true branch to always be taken... because it's written as a constant.

In other words, I think the syntax should hint at the fact that there is a hidden state variable being introduced.

For example:

loop match = true {
    ...
}

Indicates that some kind of state initialization is happening.

1 => continue 2,
2 => continue 3,
3 => break Some(42),
_ => None
Member

The loop match syntax is advertised as "a loop and match can be combined into a loop match", but with the proposed semantics that is clearly not the case:

// This will loop forever.
let a = loop { match true {
    true => {},
    false => unreachable!(),
}};

// This will assign `()` to `b`
let b = loop match true {
    true => {}, // a non-diverging value `x` implicitly becomes `break x;`
    false => unreachable!(),
};

For clarity, it is better to force every block inside the loop match to be diverging (have type !), similar to the else block in a let else declaration. Yes, this makes it more tedious to write, but it clears up much confusion when reading.

Member

For clarity it is better to force every block inside the loop match to be diverging (have type !), similar to the else block in a let else declaration.

I'm a big fan of this, but then think that the loop match fiction once again starts fitting worse, so think it should just make its own syntax that's not trying to look like a match (#3720 (comment)).

it can always be k#finite_state_machine or something on existing editions initially. This is not so common that a few more characters is a problem.

@ayosec

ayosec commented Nov 30, 2024

The const-annotated state transitions provide more accurate CFG information to the backend:

  • such transitions are lowered from HIR to MIR as a goto to the right match branch, and can hence express irreducible control flow

Does it mean that something like State::Foo(Vec::new()) will not generate a direct goto?

Could it be possible to relax the requirements in some cases? I don't know if this makes sense, but maybe by adding some extra rules, like: if the arm for the next state can be resolved by looking at the discriminant of an enum (similar to this example), it always generates a direct goto.

The motivation is that storing the data associated to each state in the state itself can simplify the implementation of the FSM. For example, the RFC includes this snippet:

let owned = Box::new(1);

loop const match State::Foo {
    State::Foo => const continue State::Bar,
    State::Bar => {
        // or any function that moves the value
        drop(owned); // all good now!
        const continue State::Baz;
    }
    State::Baz => break,
}

The const continue statement is needed because the compiler needs to determine that owned is not used after State::Bar, and this is only possible if the next state is State::Baz. This FSM could be rewritten like this:

enum State {
    Foo,
    Bar(Box<i32>),    // 🟢 `owned` is now a field in `State::Bar`.
    Baz,
}

loop match State::Foo {
    State::Foo => continue State::Bar(Box::new(1)),
    State::Bar(owned) => {
        // or any function that moves the value
        drop(owned);
        continue State::Baz;
    }
    State::Baz => break,
}

Since owned is inside of State::Bar, the code is (IMO) much easier to read, and the compiler does not need anything extra to verify that owned is not used after State::Bar. This is even more important in a FSM with dozens of states, where the relationship between data and states can be much more complex.

The only drawback of this approach is that it does not match the requirements of the section «Restrictions on the loop const match and const continue operand», so (I guess) it will not generate a direct goto to jump from Foo to Bar (even when a regular match can use a jump-table against the discriminant).


Another topic, related to the previous one: Do you think it would be useful to add a lint against non-direct state transitions?

Lints related to performance are very uncommon, but Rust already has large_assignments, which is a lint against correct-but-maybe-slow code. Since one of the main motivations for this RFC is to emit a more efficient representation of the loop { state = match state { … } } pattern, it makes sense to support something like #[warn(fsm_non_direct_jump)] (or similar). This allows developers to ensure that the FSM is generated with the expected gotos.


And, unrelated to the previous points: I know that the syntax is not the main goal of this RFC, but I would like to provide some feedback from the point of view of someone who discovered this proposal a few days ago:

  1. Do we need the explicit continue?

    My impression is that the continue is here because it was present in the original proposal, when it was 'label: match … { … => continue 'label … }. However, now the FSM is declared with loop match, and the loop keyword is enough to express that the match will have multiple iterations.

    The next state can be the result of the match arm. This is very close to how a FSM is written today.

    For example, in the current Rust stable we can write a FSM like this:

    let mut state = S::Init;
    let result = loop {
        state = match state {
            S::Init => {
                // Do something
                S::State0
            }
            S::State0 => {
                // Do something else
                S::State1
            }
            S::State1 => {
                // Emit the result.
                break 123
            }
        };
    };
    
    dbg!(result); // print `123`

    By making continue optional, that code can be translated to this:

    let result = loop match S::Init {
        S::Init => {
            // Do something
            S::State0
        }
        S::State0 => {
            // Do something else
            S::State1
        }
        S::State1 => {
            // Emit the result.
            break 123
        }
    };
    
    dbg!(result);

    The “implicit” continue is similar to the (also implicit) continue in other existing loops, like for and while.

    The equivalent to the const continue EXPR in the RFC can be const { EXPR }, which is already a valid syntax.

  2. As some other people have pointed out, the syntax to set the initial state can be a little bit confusing. To compare with the other loops:

    • In while EXPR, the expression EXPR is evaluated before each iteration, and it determines if the loop is still running.
    • In for ... in EXPR, the expression EXPR returns an iterator to yield the value for each iteration of the loop.

This is different from loop match EXPR: here, EXPR only determines the initial state; after the first iteration it has no effect.

    One of the alternative syntaxes proposed in the comments was using labels. Maybe we could use a special label, like 'start to define the entry point of the FSM. For example:

    loop match {
        'start => S::Init,
    
        S::Init => S::State0,
    
        S::State0 => S::State1,
    
        S::State1 => break 123,
    };

    In this approach, the FSM is always declared as loop match { 'start => <INIT>, <STATES> }. I think this also helps to reduce the confusion with a regular loop { match … { … } }.

    The compiler should throw an error if the first arm of the loop match statement is not a 'start label (i.e., it is not allowed to define 'start in the middle of the loop match { … }).

    Also, the loop const match should not be necessary, since its effect can be replicated with 'start => const { ... }. The example in the «loop const match and const continue can be freely mixed» section could be written as:

    loop match {
        'start => const { State::Start },
        State::Start => next_state(),
        State::S2 => break,
        State::S3 => const { State::S2 },
        State::Finished => break,
    }

@nikic
Copy link

nikic commented Dec 2, 2024

I'm curious whether your motivating examples achieve the desired optimization when passing -Cllvm-args=-enable-dfa-jump-thread.

@veera-sivarajan
Copy link

Related issue: rust-lang/rust#80630, but it seems to be fixed by the DFA jump threading pass.

@folkertdev
Copy link
Author

this is very interesting.

On the zlib-rs main branch, this flag does nothing. In particular, the assembly for this function does not change.

However, when we make the state machine a bit smaller, we do see an effect that matches the performance of the labeled match PoC:

On the https://github.com/trifectatechfoundation/zlib-rs/commits/split-out-len/ branch we split out the part of the state machine that is hottest into its own function: trifectatechfoundation/zlib-rs@ab60e1e

On this branch, commit trifectatechfoundation/zlib-rs@ab60e1e (split out a smaller state machine) with -Cllvm-args=-enable-dfa-jump-thread has the same performance as the next commit trifectatechfoundation/zlib-rs@85cf9cc with labeled match and without -Cllvm-args=-enable-dfa-jump-thread.

Combining labeled match and -Cllvm-args=-enable-dfa-jump-thread has no effect, providing some evidence that they do the same thing. The original code had 19 jmp rax instructions, with enable-dfa-jump-thread there are 12, with labeled match only 8. The difference between the last 2 is not measurable in practice, so the 4 jmp rax instructions that were removed must be used rarely.
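For readers unfamiliar with the transformation being discussed, here is a toy sketch (invented states, not zlib-rs code) of what jump threading does to constant transitions: in the dispatch form, every transition goes back through the match; in the threaded form, the constant transitions chain directly, which is what the labeled-match PoC expresses at the MIR level:

```rust
enum State {
    A,
    B,
    C,
    Done,
}

// Dispatch form: each constant transition still re-enters the `match`,
// which compiles to an indirect branch (the `jmp rax` pattern above).
fn dispatch() -> u32 {
    let mut acc = 0;
    let mut state = State::A;
    loop {
        state = match state {
            State::A => {
                acc += 1;
                State::B // constant transition
            }
            State::B => {
                acc += 10;
                State::C // constant transition
            }
            State::C => {
                acc += 100;
                State::Done // constant transition
            }
            State::Done => break acc,
        };
    }
}

// The same logic after "threading" the constant transitions by hand:
// each state falls straight through to the next, with no dispatch.
// This approximates what -enable-dfa-jump-thread (or the RFC's
// `const continue`) would produce automatically.
fn threaded() -> u32 {
    let mut acc = 0;
    acc += 1; // State::A
    acc += 10; // State::B
    acc += 100; // State::C
    acc // State::Done
}

fn main() {
    assert_eq!(dispatch(), threaded());
    println!("{}", dispatch()); // prints 111
}
```

Real state machines have data-dependent transitions mixed in, which is why the optimizer can only thread some of the edges.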

Practically speaking, this is great for us: the performance gain is worth duplicating ~240 lines of logic. So thanks for that suggestion @nikic!

Btw we really do need to duplicate that logic for performance, e.g. this is ~10% slower in one of our benchmarks than duplicating the LenExt etc branches

Mode::Len | Mode::LenExt | Mode::Lit | Mode::Dist | Mode::DistExt | Mode::Match => {
    match self.len_as_match() {
        None => {
            continue 'label;
        }
        Some(return_code) => {
            break 'label return_code;
        }
    }
}

From a performance standpoint, I think the fundamental points still stand, though it is now harder to make small examples:

  • manually copying hundreds of lines of tricky state machine code is not great; we still cannot achieve what C can without hacks and care

  • clearly when you make the logic large/complex enough LLVM still bails out; the optimization is not reliable

  • relying so heavily on llvm is unfortunate

  • the flag is introduced in https://reviews.llvm.org/D99205, which notes

    The current JumpThreading pass does not jump thread loops since it can result in irreducible control flow that harms other optimizations

    There are some details in the thread, and also the note that the final version does not (significantly) harm optimizations because it is positioned relatively late in the optimization pipeline. The risk of harming other optimizations however does appear to be the reason the flag is not on by default, though that may have been out of an abundance of caution?

    Anyway, in our case we don't see adverse effects of introducing irreducible control flow early: with the PoC branch we introduce the irreducible control flow before LLVM even gets its hands on the IR. My point: with first-class support this optimization can be applied more granularly. The best we can do with that flag today is per-crate, and that needs a build.rs.
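For reference, "per-crate" today means setting the flag for the whole compilation, e.g. via RUSTFLAGS or a .cargo/config.toml (this is a sketch of one way to do it; it cannot be scoped to a single function or block):

```shell
# Enable LLVM's DFA jump threading for the entire build.
RUSTFLAGS="-Cllvm-args=-enable-dfa-jump-thread" cargo build --release

# Or persistently, in .cargo/config.toml:
#   [build]
#   rustflags = ["-Cllvm-args=-enable-dfa-jump-thread"]
```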

@nikic
Copy link

nikic commented Dec 3, 2024

The reason it's not enabled by default is that it used to have some correctness issues and too much compile-time impact. I think those concerns were addressed in a recent rewrite of the pass, though. There's a pending PR to enable it by default at llvm/llvm-project#83033 (but no recent activity).

But yeah, as usual with these optimizations, there are no guarantees...

Labels
I-lang-nominated Indicates that an issue has been nominated for prioritizing at the next lang team meeting. T-lang Relevant to the language team, which will review and decide on the RFC.