
[Roadmap] FlashInfer v0.2 to v0.3 #675

Open
2 of 15 tasks
yzh119 opened this issue Dec 17, 2024 · 6 comments
@yzh119
Collaborator

yzh119 commented Dec 17, 2024

Milestones

Our tentative roadmap includes the following milestones:


We welcome your feedback and suggestions!
Let us know what features you'd like to see in FlashInfer.

@johnnynunez

johnnynunez commented Jan 21, 2025

Initial Blackwell support: #747
- SM 10.0: Blackwell B100/B200
- SM 12.0: Blackwell RTX 50 series
- Bonus: flex attention

@AgrawalAmey

Looking forward to POD-Attention support!

@AgrawalAmey

To add more context, we have the following piece of code in the mnemosyne codebase:

def _arrange_sequences_for_execution(
        self,
        seq_schedule_metadata_list: List[SequenceScheduleMetadata],
    ) -> List[SequenceScheduleMetadata]:
        """
        We need to arrange sequences in a way that allows us to perform
        attention computation efficiently. Because attention kernels handle
        mixed batches poorly, we first split the sequences into prefill and decode:
        | prefill seqs | decode seqs |

        Secondly, when we mix sequences of different lengths, the attention kernel's
        parallelization heuristics fail, resulting in high latency. Thus, we need to
        further split the sequences:
        | long seqs | short seqs |

        Furthermore, within each group, we can have kvp sequences. Some of these kvp
        sequences might not require the kv cache to be saved. So, within each group,
        we need to further organize sequences as follows:
        | non kvp seqs | kvp seqs w/ save_kv_cache | kvp seqs w/o save_kv_cache |
        """

In essence, we create 4 different instances of the FlashInfer prefill attention wrapper and call the kernel 4 times 😢 cc @yzh119
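For reference, here is a minimal sketch of that pattern against FlashInfer's `BatchPrefillWithPagedKVCacheWrapper` (v0.2-style `plan`/`run` API). The group names, head counts, and page-table metadata are illustrative assumptions, not the actual mnemosyne code:

```python
import torch
import flashinfer

# Illustrative model/cache configuration (not from mnemosyne).
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
groups = ["prefill_long", "prefill_short", "decode_long", "decode_short"]

# One wrapper per sequence group -> one plan() and one kernel launch each.
wrappers = {}
for name in groups:
    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrappers[name] = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

def run_group(name, qo_indptr, kv_indptr, kv_indices, kv_last_page_len, q, kv_cache):
    """Plan this group's batch metadata, then run its attention kernel."""
    w = wrappers[name]
    w.plan(
        qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size, causal=True,
    )
    return w.run(q, kv_cache)

# outputs = [run_group(name, *group_metadata[name]) for name in groups]  # 4 separate kernel calls
```

A unified batch attention API, as discussed below, would let these four plan/run calls collapse into one.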

@Edenzzzz

Edenzzzz commented Mar 4, 2025

Could POD-Attention potentially remove the need for separate prefill and decode batch scheduling logic, and instead just run all the decode and prefill requests together?

@yzh119
Collaborator Author

yzh119 commented Mar 5, 2025

@Edenzzzz good idea, there is no reason to keep two sets of APIs. Actually, the current prefill attention can already be used for decoding: just set the query length per request to 1.

We should use a unified BatchAttention API for all cases.
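As a concrete illustration of the "decode as length-1 prefill" point, here is a minimal sketch using the existing `BatchPrefillWithPagedKVCacheWrapper`. The tensor shapes and the one-page-per-request page table are made-up assumptions, and the unified `BatchAttention` API itself is still only a proposal at this point:

```python
import torch
import flashinfer

batch_size, num_qo_heads, num_kv_heads, head_dim, page_size = 4, 32, 8, 128, 16

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

# Decode is just prefill with one query token per request:
# qo_indptr = [0, 1, 2, ..., batch_size].
qo_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")

# Toy page table: one full KV-cache page per request (illustrative only).
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")
kv_indices = torch.arange(batch_size, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

wrapper.plan(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size, causal=True,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(
    batch_size, 2, page_size, num_kv_heads, head_dim, dtype=torch.float16, device="cuda"
)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim], same math as batch decode
```

With one wrapper handling both cases, keeping prefill and decode in separate batches becomes a scheduling optimization rather than an API requirement.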

@Edenzzzz

Edenzzzz commented Mar 5, 2025

@yzh119 Thanks! I plan to try employing similar logic in SGLang this week.
