Official repo for paper: Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
This paper was accepted at AAAI 2024.
- Replace the LLaMA model files in the installed `transformers` package with the files in `transformers/models/llama`
- Download the models: `sh download.sh`
- Use GPTQ to quantize the weights: `sh run-gptq-llama.sh`
- Quantize the activations with `gptq_fq_quant_llama.py`
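The activation-quantization step boils down to mapping floating-point activations onto a low-bit integer grid. Below is a minimal, self-contained sketch of per-tensor asymmetric fake quantization in plain Python; the function name and details are illustrative only and do not reproduce the repo's actual `gptq_fq_quant_llama.py` logic.

```python
import random


def fake_quantize_activations(xs, n_bits=8):
    """Fake-quantize a list of activations: round to an n_bits integer
    grid spanning [min(xs), max(xs)], then map back to floats.
    Illustrative sketch only, not the Agile-Quant implementation."""
    qmax = 2 ** n_bits - 1
    x_min, x_max = min(xs), max(xs)
    # Per-tensor scale and zero point (asymmetric quantization).
    scale = (x_max - x_min) / qmax if x_max > x_min else 1.0
    zero_point = round(-x_min / scale)
    out = []
    for x in xs:
        # Quantize, clamp to the valid integer range, then dequantize.
        q = min(max(round(x / scale) + zero_point, 0), qmax)
        out.append((q - zero_point) * scale)
    return out


random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(64)]  # toy activations
deq = fake_quantize_activations(acts, n_bits=8)
```

At 8 bits the round-trip error of each value is bounded by the quantization step, which is why fake quantization is a common way to estimate accuracy loss before deploying true integer kernels.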
@inproceedings{shen2024agile,
  title     = {Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge},
  author    = {Shen, Xuan and Dong, Peiyan and Lu, Lei and Kong, Zhenglun and Li, Zhengang and Lin, Ming and Wu, Chao and Wang, Yanzhi},
  booktitle = {AAAI},
  year      = {2024},
}
The code is mainly based on the quantization works GPTQ and FQ-ViT.