<div align="center">
<h1><img src="static/images/ShadowKV.png" height="40px"> ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference</h1>

**Training-free, high-throughput long-context LLM inference**

</div>

<div align="center">
<b><a href="https://github.com/preminstrel">Hanshi Sun</a></b><sup>1,2</sup>,
</div>

## Environment Setup

To reproduce the results in the paper, set up the environment as follows on a single A100 GPU:

```bash
# create env
conda create -n ShadowKV python=3.10 -y
conda activate ShadowKV

pip install nemo_toolkit[all]==1.23

# flashinfer
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

# cutlass
mkdir 3rdparty
git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass

# build kernels for ShadowKV
python setup.py build_ext --inplace
```
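
As a quick sanity check (our suggestion, not part of the original instructions), you can confirm that CUDA and the flashinfer wheel are importable before running anything heavier:

```bash
# optional sanity check (suggested here, not in the original setup)
python -c "import torch; print(torch.cuda.is_available())"
python -c "import flashinfer"
```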

## Supported Models

Currently, we support the following LLMs:

- Llama-3-8B-1M: [gradientai/Llama-3-8B-Instruct-Gradient-1048k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k)
- GLM-4-9B-1M: [THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)
- Llama-3.1-8B: [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
- Yi-9B-200K: [01-ai/Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-200K)
- Phi-3-Mini-128K: [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) (only NIAH test supported)
- Qwen2-7B-128K: [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) (only NIAH test supported)

## Accuracy Evaluations

Here we provide an example of building the dataset and running the evaluation for the [RULER](https://github.com/hsiehjackson/RULER) benchmark with Llama-3-8B-1M.

### Build Datasets

To build the RULER dataset, run the following commands:

```bash
# build RULER
python -c "import nltk; nltk.download('punkt')"
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"
```
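
The same script should extend to the other supported models. A hypothetical invocation for Llama-3.1-8B is shown below; reusing the `llama-3` template for this model is our assumption:

```bash
# hypothetical: build RULER for Llama-3.1-8B, assuming the llama-3 template applies
bash create_dataset.sh "meta-llama/Meta-Llama-3.1-8B-Instruct" "llama-3"
```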

### Run Evaluations

For the accuracy evaluation, run the following commands with 8x A100 GPUs:

```bash
# Full attention
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

# ShadowKV
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8
```
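
In the ShadowKV command, `--sparse_budget` sets how many KV pairs are kept active for sparse attention, `--rank` the rank of the low-rank pre-RoPE key cache, and `--chunk_size` the granularity at which key chunks are scored for selection. This reading follows the paper's description of the method; consult it for the exact semantics.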

#### Compatibility with MInference

ShadowKV is compatible with pre-filling acceleration techniques, such as MInference. To enable MInference, add the `--minference` flag to the command. For example:

```bash
# Full attention with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --minference

# ShadowKV with MInference
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8 --minference
```

## Efficiency Evaluations

For the efficiency evaluation, run the following command on a single A100 GPU:

```bash
python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"
```
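
Other supported models should be swappable via `--model_name`. A hypothetical example with GLM-4-9B-1M follows; whether `122k` is an accepted `--datalen` value for this model is an assumption:

```bash
# hypothetical: efficiency test with GLM-4-9B-1M; the --datalen value is an assumption
python test/e2e.py --model_name "THUDM/glm-4-9b-chat-1m" --datalen "122k"
```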

## Citation

If you find ShadowKV useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@article{sun2024shadowkv,
  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
  journal={arXiv preprint arXiv:2410.XXXXX},
  year={2024}
}
```