Commit c2f7c87 (1 parent: 5cd68f6)

11 files changed: +130, -85 lines

.gitignore (+5)

@@ -1 +1,6 @@
 *.DS_Store
+build/
+3rdparty/
+*.pyc
+*.so
+*.jsonl
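As a quick sanity check on the new ignore rules (not part of the commit; the sample file names are hypothetical), `git check-ignore -v` reports which pattern matches each path:

```bash
# Each output line names the .gitignore rule that matches the path.
# Paths need not exist; kernels.so and niah.jsonl are made-up examples.
git check-ignore -v build/ 3rdparty/cutlass kernels.so test.pyc data/niah.jsonl
```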

README.md (+45 -9)

@@ -1,5 +1,7 @@
 <div align="center">
 <h1><img src="static/images/ShadowKV.png" height="40px"> ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference</h1>
+
+**training-free, high-throughput long-context LLM inference**
 </div>
 <div align="center">
 <b><a href="https://github.com/preminstrel">Hanshi Sun</a></b><sup>1,2</sup>,

@@ -28,6 +30,7 @@
 </div>

 ## Environment Set Up
+To reproduce the results in the paper, set up the environment as follows on a single A100 GPU:
 ```bash
 # create env
 conda create -n ShadowKV python=3.10 -y

@@ -45,9 +48,12 @@ pip install nemo_toolkit[all]==1.23

 # flashinfer
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
-pip install huggingface_hub==0.22.0

-# build kernels
+# cutlass
+mkdir 3rdparty
+git clone https://github.com/NVIDIA/cutlass.git 3rdparty/cutlass
+
+# build kernels for ShadowKV
 python setup.py build_ext --inplace
 ```
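The flashinfer wheel above targets CUDA 12.1 and torch 2.3, so a quick version check before `python setup.py build_ext --inplace` can catch a mismatched environment early. A minimal sketch, not part of the commit:

```bash
# Expect torch 2.3.x built against CUDA 12.1 before compiling the ShadowKV kernels.
python - <<'PY'
import torch
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU visible")
PY
```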

@@ -56,27 +62,57 @@ Currently, we support the following LLMs:
 - GLM-4-9B-1M: [THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)
 - Llama-3.1-8B: [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
 - Yi-9B-200K: [01-ai/Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-200K)
-- Phi-3-Mini-128K: [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
-- Qwen2-7B-128K: [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)
+- Phi-3-Mini-128K: [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) (only NIAH test supported)
+- Qwen2-7B-128K: [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) (only NIAH test supported)

 ## Accuracy Evaluations
 Here we provide an example to build the dataset and run evaluation for the [RULER](https://github.com/hsiehjackson/RULER) benchmark with Llama-3-8B-1M.

-### Build Dataset
-
+### Build Datasets
+To build the RULER dataset, run the following commands:
 ```bash
 # build RULER
 python -c "import nltk; nltk.download('punkt')"
 cd data/ruler
 bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"
 ```
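The `*.jsonl` pattern added to `.gitignore` above suggests `create_dataset.sh` emits JSON Lines files. A hypothetical spot-check from the repository root (the exact output location is an assumption):

```bash
# List a few generated files, then pretty-print the first record of one of them.
find data/ruler -name "*.jsonl" | head -n 3
head -n 1 "$(find data/ruler -name '*.jsonl' | head -n 1)" | python -m json.tool
```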

-### Run Evaluation
+### Run Evaluations
+For the accuracy evaluation, run the following commands on 8x A100 GPUs:

 ```bash
 # Full attention
-OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multikey_3,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/cwe,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
+OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --model_name "gradientai/Llama-3-8B-Instruct-Gradient-1048k"

 # ShadowKV
-OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multikey_3,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/cwe,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8
+OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8
+```
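The ShadowKV command exposes `--sparse_budget`, `--rank`, and `--chunk_size` as its accuracy/efficiency knobs. A hypothetical sweep over the sparse budget on a single RULER subtask, reusing only flags that appear above (values other than 2048 are illustrative):

```bash
# Sweep the sparse budget; all other flags match the ShadowKV command above.
for budget in 1024 2048 4096; do
  OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 \
    test/eval_acc.py --datalen 131072 --method shadowkv \
    --dataset_name "ruler/niah_single_1" \
    --sparse_budget "$budget" --rank 160 --chunk_size 8
done
```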
+
+#### Compatibility with MInference
+ShadowKV is compatible with pre-filling acceleration techniques such as MInference. To enable it, add the `--minference` flag to the command. For example:
+
+```bash
+# Full attention with MInference
+OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method full --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --minference
+
+# ShadowKV with MInference
+OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 test/eval_acc.py --datalen 131072 --method shadowkv --dataset_name "ruler/niah_single_1,ruler/niah_single_2,ruler/niah_single_3,ruler/niah_multikey_1,ruler/niah_multikey_2,ruler/niah_multiquery,ruler/niah_multivalue,ruler/vt,ruler/fwe,ruler/qa_1,ruler/qa_2" --sparse_budget 2048 --rank 160 --chunk_size 8 --minference
+```
+
+## Efficiency Evaluations
+For the efficiency evaluation, run the following command on a single A100 GPU:
+
+```bash
+python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "122k"
+```
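The same script can presumably compare settings across context lengths; a sketch, assuming `--datalen` accepts other values in the same format ("60k" is a made-up setting, only "122k" appears in the commit):

```bash
# Hypothetical sweep over context lengths; "60k" is an assumed valid value.
for len in 60k 122k; do
  python test/e2e.py --model_name "meta-llama/Meta-Llama-3.1-8B-Instruct" --datalen "$len"
done
```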
+## Citation
+If you find ShadowKV useful or relevant to your project or research, please cite our paper:
+
+```bibtex
+@article{sun2024shadowkv,
+  title={ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference},
+  author={Sun, Hanshi and Chang, Li-Wen and Bao, Wenlei and Zheng, Size and Zheng, Ningxin and Liu, Xin and Dong, Harry and Chi, Yuejie and Chen, Beidi},
+  journal={arXiv preprint arXiv:2410.XXXXX},
+  year={2024}
+}
 ```

data/ruler/create_dataset.sh (+1)

@@ -10,6 +10,7 @@

 # Model and Tokenizer
 SEQ_LENGTHS=(
+    65536
     131072
     262144
 )
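With 65536 added to `SEQ_LENGTHS`, rebuilding the dataset also produces 64K-context splits. A sketch of building and then evaluating at that length, assuming `--datalen 65536` is accepted the same way as the 131072 used above:

```bash
# Rebuild RULER (now including 64K), then evaluate ShadowKV at that length.
cd data/ruler
bash create_dataset.sh "gradientai/Llama-3-8B-Instruct-Gradient-1048k" "llama-3"
cd ../..
OMP_NUM_THREADS=48 torchrun --standalone --nnodes=1 --nproc_per_node 8 \
  test/eval_acc.py --datalen 65536 --method shadowkv \
  --dataset_name "ruler/niah_single_1" --sparse_budget 2048 --rank 160 --chunk_size 8
```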
