A highly configurable hardware-accelerated Minimum-Weight Perfect Matching (MWPM) decoder for Quantum Error Correction (QEC).
Paper coming soon!!! Stay tuned!!!
Micro Blossom is a heterogeneous architecture that solves the exact MWPM decoding problem with sub-microsecond latency by exploiting fine-grained, vertex- and edge-level hardware acceleration. At the heart of Micro Blossom is an algorithm (equivalent to the original blossom algorithm) specifically optimized for a resource-efficient RTL (register-transfer level) implementation, with compact combinational logic and a pipelined design. Given an arbitrary decoding graph, Micro Blossom automatically generates a hardware implementation (either Verilog or VHDL, depending on your needs). The architecture of Micro Blossom is shown below:
- Correctness: Like Fusion Blossom, we have not only mathematically proven the optimality of the MWPM it generates, but also run extensive correctness tests in cycle-accurate Verilog simulators under various conditions.
- 14x latency reduction: On a surface code of code distance
- Better effective error rate than both Helios (a hardware UF decoder, which runs faster but loses accuracy) and Parity Blossom (running on an M1 Max) at various code distances (see the max_tree_size feature in the Fusion Blossom library), because UF and MWPM decoders share the same mathematical foundation.
- Better worst-case time complexity and thus a smaller tail in the latency distribution: We observe an exponential tail in Micro Blossom's latency distribution. The software implementation behaves similarly but with a much longer tail. Micro Blossom has a better latency distribution for two reasons. First, Micro Blossom has a better worst-case time complexity.
Note that improving latency is generally harder than improving throughput, because the latter can be achieved by increasing the number of cores using coarse-grained parallelism (see our Fusion Blossom paper), while latency is bounded by how sequential the algorithm is at its core. Given the complexity and sequential nature of the blossom algorithm, it was even believed in the community that hardware acceleration of exact MWPM decoding is impractical. Yet we redesign the algorithm to exploit the finest parallelism possible (vertex and edge parallelism) and achieve a significant reduction in latency.
Note that this does not come at the cost of decoding throughput: in fact, thanks to the pipelined design, the hardware accelerator has a large decoding throughput capability that supports real-time decoding of at most 110 logical qubits (
For those wondering why we evaluate the average latency rather than the worst-case latency: we believe only the average case matters, for the reasons below. Note that when we evaluate the decoding latency distribution, we accumulate 1000 logical errors (2.5e8 samples in total) to make sure we capture latencies with probability at or even below the logical error rate.
- Adding "idle QEC cycle" support at the lowest control layer is not hard, and it is desirable for reasons beyond QEC decoding: for example, some logical feedforward branches may have longer execution paths, and the idle logical qubits should simply run idle QEC cycles while waiting for the longer branch to finish.
- Mathematically, only the average case matters. The ultimate goal is that the overall logical error rate of a quantum circuit (including the idle time added by decoding latency and feedforward) is low. Let's calculate the overall logical error rate including the latency-induced idle errors. Suppose the latency distribution is $P(L)$; then $\int_0^\infty P(L) \, dL = 1$ and $\int_0^\infty P(L) \, L \, dL = \bar{L}$ (the average latency). The overall logical error rate is $p_L = \int_0^\infty P(L) \, p_{L0} (1 + L/d) \, dL = p_{L0} (1 + \bar{L}/d)$. This argument extends to more complicated circuits as well, but the idea is intuitive: the logical error rate is a statistical quantity that is linearly related to the latency.
- It is difficult to scale up QEC decoding to distributed systems while meeting hard deadline requirements, even though existing decoders achieve hard real-time for simple memory experiments at very low physical error rates and code distances. We believe that in the long run, when scaling up, all QEC decoding systems will face the problem of not being able to meet a hard $1 \mu s$ deadline, but according to the two points above, this does not matter that much.
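The average-case argument above is easy to check numerically. Below is a minimal sketch (all constants are illustrative assumptions, not measured values from this project) that compares a Monte-Carlo estimate of $p_L = \mathbb{E}[p_{L0}(1 + L/d)]$ against the closed form $p_{L0}(1 + \bar{L}/d)$, using an exponential latency distribution as an example:

```python
import random

# Illustrative (assumed) parameters: baseline logical error rate p_L0,
# code distance d, and an exponential latency distribution whose mean
# \bar{L} is measured in units of QEC rounds.
p_L0 = 1e-5
d = 13
mean_latency = 3.0  # \bar{L}

random.seed(0)
samples = [random.expovariate(1.0 / mean_latency) for _ in range(200_000)]

# Monte-Carlo estimate of p_L = E[p_L0 * (1 + L/d)]
p_L_mc = sum(p_L0 * (1 + L / d) for L in samples) / len(samples)

# Closed form from the text: p_L = p_L0 * (1 + \bar{L}/d)
p_L_closed = p_L0 * (1 + mean_latency / d)

print(p_L_mc, p_L_closed)
assert abs(p_L_mc - p_L_closed) / p_L_closed < 0.01
```

The closed form depends only on the mean $\bar{L}$, not on the shape of the distribution, which is exactly why the average latency is the quantity that determines the latency-induced contribution to the logical error rate.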
- src: source code
  - fpga: the FPGA source code, including generator scripts
    - microblossom: Scala code for the MicroBlossom module, using the SpinalHDL library
    - Xilinx: Xilinx project build scripts
  - cpu: the CPU source code
    - blossom: code for development and testing on a host machine
    - blossom-nostd: no_std code designed to run in an embedded environment, but it can also run under an OS
    - embedded: build binary for the embedded system
- benchmark: evaluation results that can be reproduced using the included scripts (and a VMK180 Xilinx evaluation board)
  - behavior: CPU simulation, with exactly the same RTL logic
  - hardware: evaluation on hardware
    - bram_speed: understand the CPU-FPGA communication cost under various clock frequencies
    - decoding_speed: evaluate the decoding latency and its distribution under various conditions on real VMK180 hardware
    - resource_estimate: post-implementation resource usage
Micro Blossom consists of two parts: the CPU program in Rust and the FPGA program in Scala.
The FPGA part has a CLI program that generates Verilog/VHDL files given a decoding graph input.
The decoding graph input is defined by a JSON format in src/fpga/microblossom/utils/SingleGraph.scala.
A few examples of how to generate and use such a decoding graph are below:
```shell
# 1. generate example graphs in ./resources/graphs/*.json
cd src/cpu/blossom
cargo run --release --bin generate_example_graphs
cd ..
# (if you want to use the visualization tool to see these graphs, check src/cpu/blossom/bin/generate_example_graphs.rs)

# 2. use one graph to generate Verilog
# you need to install Java and the sbt build tool first, check:
# https://spinalhdl.github.io/SpinalDoc-RTD/master/SpinalHDL/Getting%20Started/Install%20and%20setup.html
mkdir gen
sbt "runMain microblossom.MicroBlossomBusGenerator --graph ./resources/graphs/example_code_capacity_d3.json"
# (for a complete list of supported arguments, run: sbt "runMain microblossom.MicroBlossomBusGenerator --help")
# this command generates a Verilog file at ./gen/MicroBlossomBus.v with a default AXI4 interface
```
The generated Verilog can be used either in simulation (see the test cases in src/cpu/blossom/src/dual_module_axi4.rs and benchmark/behavior/tests/run.py) or in real hardware evaluation (see the test cases in benchmark/hardware/tests/run.py, which runs the same behavior tests but on real FPGA hardware).
The benchmark scripts automate this process of generating the graphs, Verilog, and the Xilinx project. For any questions about how to use the project, please email me.
It is recommended to use the provided Dockerfile to create an environment.
```shell
docker build --tag 'micro-blossom' .
docker run -itd --name 'mb' -v .:/root/micro-blossom 'micro-blossom'
docker exec -it mb bash  # enter the bash environment
```