support KV-Compress paged KV cache #27

Open
wants to merge 1 commit into main
Conversation

@IsaacRe commented Nov 27, 2024

This PR adds support for paged KV cache following the structure of KV-Compress, where cache blocks are paged out on a per-head basis.

In this case the cache shape becomes [num_blocks, block_size, d] (rather than [num_blocks, block_size, num_heads, d]) and the block table shape becomes [num_seqs, num_heads, max_blocks_per_seq] (rather than [num_seqs, max_blocks_per_seq]). Sequence lengths also need to be specified per head, so a tensor of size (num_seqs * num_heads) or (num_seqs * num_heads + 1) is provided, depending on whether per-head sequence lengths or cumulative offsets are used.

I configured it to use the dimensionality of the block tables tensor to detect whether a KV-Compress cache is being used, assuming KV-Compress when dim > 2 and following the existing logic otherwise.
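A minimal sketch of that detection rule, assuming the block table is a PyTorch tensor; this is illustrative only, not the PR's actual code:

```python
import torch

def uses_kvcompress_cache(block_table: torch.Tensor) -> bool:
    # Standard paged KV cache:    [num_seqs, max_blocks_per_seq]            -> dim == 2
    # KV-Compress paged KV cache: [num_seqs, num_heads, max_blocks_per_seq] -> dim == 3
    return block_table.dim() > 2
```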

@@ -159,6 +168,7 @@ void run_mha_fwd(Flash_fwd_params &params, cudaStream_t stream, bool force_split
HEADDIM_SWITCH(params.d, [&] {
BOOL_SWITCH(params.is_causal, Is_causal, [&] {
if (params.num_splits <= 1 && !force_split_kernel) { // If we don't set it num_splits == 0
assert(false);
What is going on here? Perhaps you mean to also remove the call that follows?

@IsaacRe (Author)

Good catch. That was just to make sure it was launching the right kernel when I was testing. It should be removed

@WoosukKwon

@IsaacRe Amazing! Excited to see this work 🚀 BTW, are you in the vLLM slack workspace?

@IsaacRe (Author) commented Nov 28, 2024

> @IsaacRe Amazing! Excited to see this work 🚀 BTW, are you in the vLLM slack workspace?

Thanks! Yes I am. I'm wrapping up chunked-prefill compat and will update in the channel when benchmarks are done
