Question about cascade inference #789

Open
sleepwalker2017 opened this issue Feb 5, 2025 · 3 comments

Comments

@sleepwalker2017 commented Feb 5, 2025

https://flashinfer.ai/2024/02/02/cascade-inference.html

Hi, I noticed this blog was posted a year ago.

I am wondering what setting the Evaluations section refers to.

Is it for the prefill stage, the decoding stage, or both?

@yzh119 (Collaborator) commented Feb 5, 2025

It only refers to the decode attention kernel, not to end-to-end results.
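
For context, the idea behind cascade inference is that attention over the shared prefix and attention over each request's unique suffix can be computed as separate passes and then merged, provided each pass also returns the log-sum-exp (LSE) of its scores. Below is a minimal PyTorch sketch of that merge step; the function and variable names are illustrative and are not FlashInfer's API.

```python
import torch

def merge_attention_states(o1, s1, o2, s2):
    """Merge two partial attention states into one.

    o1, o2: [num_heads, head_dim] partial attention outputs
    s1, s2: [num_heads] log-sum-exp (LSE) of the scores behind o1 / o2
    Returns the output and LSE of attention over the union of both KV sets.
    """
    s_max = torch.maximum(s1, s2)
    w1 = torch.exp(s1 - s_max)            # renormalization weight for state 1
    w2 = torch.exp(s2 - s_max)            # renormalization weight for state 2
    o = (o1 * w1[:, None] + o2 * w2[:, None]) / (w1 + w2)[:, None]
    s = s_max + torch.log(w1 + w2)        # merged LSE over both KV sets
    return o, s
```

In cascade inference, the shared-prefix pass produces one state that is reused across all requests in the batch, and each request's unique-suffix pass produces the other; merging them recovers the same result as attending over the full KV cache.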

@sleepwalker2017 (Author) commented Feb 6, 2025

> It only refers to decode attention kernel, not end-to-end results.

Thank you.

Is this optimization mainly aimed at the decoding stage?

How much does it benefit the prefill stage?

@yzh119 (Collaborator) commented Feb 6, 2025

> Is this optimization mainly aimed at the decoding stage?

Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (~128) in the decoding stage.
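
As a rough back-of-envelope for where that 128 comes from (my own estimate, assuming DeepSeek-style MLA with $H = 128$ query heads all sharing one compressed KV vector per cached token, an fp16 cache, and counting only the query–key dot products): each loaded KV vector of dimension $d$ is reused by every query head, so

$$
\text{operational intensity} \;\approx\; \frac{2\,H\,d \ \text{FLOPs}}{2\,d \ \text{bytes}} \;=\; H \;\approx\; 128 \ \text{FLOPs/byte}.
$$

In standard multi-head attention decode, by contrast, each KV head serves a single query head, so the intensity stays around 1 FLOP/byte, which is why the extra data reuse cascade inference gets from a shared prefix pays off there.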
