Question about cascade inference #789

Open
sleepwalker2017 opened this issue Feb 5, 2025 · 3 comments

Comments

@sleepwalker2017 commented Feb 5, 2025

https://flashinfer.ai/2024/02/02/cascade-inference.html

Hi, I noticed this blog was posted a year ago.

I am wondering what setting the Evaluations section refers to.

Is it for the prefill stage, the decoding stage, or both?

@yzh119 (Collaborator) commented Feb 5, 2025

It only refers to the decode attention kernel, not to end-to-end results.
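
For context, the idea behind cascade inference is that attention over the shared prefix and attention over each request's unique suffix can be computed as separate passes and then merged, provided each pass also returns the log-sum-exp (LSE) of its scores. Below is a minimal PyTorch sketch of that merge step; the function and variable names are illustrative and are not FlashInfer's API.

```python
import torch

def merge_attention_states(o1, s1, o2, s2):
    """Merge two partial attention states into one.

    o1, o2: [num_heads, head_dim] partial attention outputs
    s1, s2: [num_heads] log-sum-exp (LSE) of the scores behind o1 / o2
    Returns the output and LSE of attention over the union of both KV sets.
    """
    s_max = torch.maximum(s1, s2)
    w1 = torch.exp(s1 - s_max)            # renormalization weight for state 1
    w2 = torch.exp(s2 - s_max)            # renormalization weight for state 2
    o = (o1 * w1[:, None] + o2 * w2[:, None]) / (w1 + w2)[:, None]
    s = s_max + torch.log(w1 + w2)        # merged LSE over both KV sets
    return o, s
```

In cascade inference, the shared-prefix pass produces one state that is reused across all requests in the batch, and each request's unique-suffix pass produces the other; merging them recovers the same result as attending over the full KV cache.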

@sleepwalker2017 (Author) commented Feb 6, 2025

> It only refers to decode attention kernel, not end-to-end results.

Thank you.

Is this optimization mainly aimed at the decoding stage?

How much does it benefit the prefill stage?

@yzh119 (Collaborator) commented Feb 6, 2025

> Is this optimization mainly aimed at the decoding stage?

Yes, and it doesn't work for attention variants such as MLA (even for decoding), which exhibit very high operational intensity (~128) in the decoding stage.
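
As a rough back-of-envelope for where that 128 comes from (my own estimate, assuming DeepSeek-style MLA with $H = 128$ query heads all sharing one compressed KV vector per cached token, an fp16 cache, and counting only the query–key dot products): each loaded KV vector of dimension $d$ is reused by every query head, so

$$
\text{operational intensity} \;\approx\; \frac{2\,H\,d \ \text{FLOPs}}{2\,d \ \text{bytes}} \;=\; H \;\approx\; 128 \ \text{FLOPs/byte}.
$$

In standard multi-head attention decode, by contrast, each KV head serves a single query head, so the intensity stays around 1 FLOP/byte, which is why the extra data reuse cascade inference gets from a shared prefix pays off there.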
