
[Roadmap] FlashInfer v0.2 to v0.3 #675

Open
2 of 15 tasks
yzh119 opened this issue Dec 17, 2024 · 6 comments
@yzh119
Collaborator

yzh119 commented Dec 17, 2024

Milestones

Our tentative roadmap includes the following milestones:


We welcome your feedback and suggestions!
Let us know what features you'd like to see in FlashInfer.

@johnnynunez

johnnynunez commented Jan 21, 2025

Initial Blackwell support: #747
- SM 10.0: Blackwell B100/B200
- SM 12.0: Blackwell RTX 50 series
- Bonus: flex attention

@AgrawalAmey

Looking forward to POD-Attention support!

@AgrawalAmey

To add more context, we have the following piece of code in the mnemosyne codebase:

def _arrange_sequences_for_execution(
        self,
        seq_schedule_metadata_list: List[SequenceScheduleMetadata],
    ) -> List[SequenceScheduleMetadata]:
        """
        We need to arrange sequences in a way that allows us to perform
        attention computation efficiently. Because attention kernels handle
        mixed batches poorly, we first split the sequences into prefill and decode:
        | prefill seqs | decode seqs |

        Secondly, when we mix sequences of different lengths, the attention kernel's
        parallelization heuristics fail, resulting in high latency. Thus, we need to
        further split the sequences:
        | long seqs | short seqs |

        Furthermore, within each group, we can have kvp sequences. Some of these kvp
        sequences might not require the kv cache to be saved. So, within each group,
        we need to further organize sequences as follows:
        | non kvp seqs | kvp seqs w/ save_kv_cache | kvp seqs w/o save_kv_cache |
        """

In essence, we create 4 different instances of the FlashInfer prefill attention wrapper and call the kernel 4 times 😢 cc @yzh119
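For reference, here is a minimal sketch of that pattern against FlashInfer's `BatchPrefillWithPagedKVCacheWrapper` (v0.2-style `plan`/`run` API). The group names, head counts, and page-table metadata are illustrative assumptions, not the actual mnemosyne code:

```python
import torch
import flashinfer

# Illustrative model/cache configuration (not from mnemosyne).
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
groups = ["prefill_long", "prefill_short", "decode_long", "decode_short"]

# One wrapper per sequence group -> one plan() and one kernel launch each.
wrappers = {}
for name in groups:
    workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    wrappers[name] = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

def run_group(name, qo_indptr, kv_indptr, kv_indices, kv_last_page_len, q, kv_cache):
    """Plan this group's batch metadata, then run its attention kernel."""
    w = wrappers[name]
    w.plan(
        qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
        num_qo_heads, num_kv_heads, head_dim, page_size, causal=True,
    )
    return w.run(q, kv_cache)

# outputs = [run_group(name, *group_metadata[name]) for name in groups]  # 4 separate kernel calls
```

A unified batch attention API, as discussed below, would let these four plan/run calls collapse into one.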

@Edenzzzz

Edenzzzz commented Mar 4, 2025

Could POD-Attention potentially remove the need for separate prefill and decode batch scheduling logic, and instead just run all the decode and prefill requests together?

@yzh119
Collaborator Author

yzh119 commented Mar 5, 2025

@Edenzzzz good idea, there is no reason to keep two sets of APIs. Actually, the current prefill attention can already be used for decoding: just set the query length per request to 1.

We should use a unified BatchAttention API for all cases.
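As a concrete illustration of the "decode as length-1 prefill" point, here is a minimal sketch using the existing `BatchPrefillWithPagedKVCacheWrapper`. The tensor shapes and the one-page-per-request page table are made-up assumptions, and the unified `BatchAttention` API itself is still only a proposal at this point:

```python
import torch
import flashinfer

batch_size, num_qo_heads, num_kv_heads, head_dim, page_size = 4, 32, 8, 128, 16

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

# Decode is just prefill with one query token per request:
# qo_indptr = [0, 1, 2, ..., batch_size].
qo_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")

# Toy page table: one full KV-cache page per request (illustrative only).
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda")
kv_indices = torch.arange(batch_size, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

wrapper.plan(
    qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size, causal=True,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(
    batch_size, 2, page_size, num_kv_heads, head_dim, dtype=torch.float16, device="cuda"
)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim], same math as batch decode
```

With one wrapper handling both cases, keeping prefill and decode in separate batches becomes a scheduling optimization rather than an API requirement.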

@Edenzzzz

Edenzzzz commented Mar 5, 2025

@yzh119 Thanks! I plan to try employing similar logic in SGLang this week.
