Web UI for quantizing LLMs from Hugging Face. Inspired by Oobabooga and the Colab AutoQuant notebook.
At the moment, only GGUF quantization works reliably. Windows support is currently unstable.
Go to https://docs.docker.com/desktop/setup/install/windows-install/, then download and install Docker Desktop.
Right-click `start_windows.ps1` and select `Run with PowerShell`.
./start_linux.sh
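If the script is not yet marked executable, you may need to run `chmod +x start_linux.sh` first.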
- Multi-method Quantization: Choose from GGUF, GPTQ, ExLlamaV2, AWQ, and HQQ.
- Unified Docker Setup: Separate Dockerfiles for CPU-only and GPU (CUDA) builds.
- Dynamic Runtime Detection: Launch the appropriate container based on the detected hardware (via startup scripts for Linux and Windows); see the sketch after this list.
- Easy-to-Use Web UI: Built with Gradio, enabling interactive model quantization.
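The actual logic lives in `start_linux.sh` / `start_windows.ps1`. As a rough illustration only (not the real script), Linux-side detection could look like the following, assuming that a working `nvidia-smi` signals a usable CUDA GPU; the image tags and port are placeholders:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real start_linux.sh may differ.
PORT="${PORT:-7860}"

if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    # CUDA-capable GPU detected: build and run the GPU image.
    IMAGE_NAME="spongequant-gpu-cuda"   # hypothetical tag
    docker build -f Dockerfile.gpu-cuda -t "${IMAGE_NAME}" .
    GPU_FLAGS="--gpus all"
else
    # No GPU found: fall back to the CPU-only image.
    IMAGE_NAME="spongequant-cpu"        # hypothetical tag
    docker build -f Dockerfile.cpu -t "${IMAGE_NAME}" .
    GPU_FLAGS=""
fi

docker run ${GPU_FLAGS} -it -p "${PORT}:${PORT}" --rm "${IMAGE_NAME}"
```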
Method | CPU Quantization | CPU Inference | GPU Quantization | GPU Inference | Tradeoffs / Notes |
---|---|---|---|---|---|
GGUF | Yes | Yes | Yes (but not required) | Yes (but not required) | Designed for efficient CPU inference via llama.cpp; optimized for low-precision execution on CPUs. |
GPTQ | No | No | Yes | Yes | High compression & accuracy but built for CUDA; forcing CPU-only leads to very slow and unreliable processing. |
ExLlamaV2 | No | No | Yes | Yes | Optimized for GPU; CPU fallback is possible but performance is suboptimal. |
AWQ | No | No | Yes | Yes | Relies on CUDA kernels for fast quantization; CPU-only execution is generally impractical. |
HQQ | No | No | Yes | Yes | Designed primarily for GPU inference with specialized kernels; CPU usage is not widely validated and may be very slow. |
- GPTQ, ExLlamaV2, AWQ, and HQQ require a GPU for quantization (and inference). For now, only GGUF is reliably CPU-friendly for both quantization and inference.
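For context, GGUF quantization is usually performed with llama.cpp's own tooling. A typical manual workflow (outside the UI) looks roughly like this; the model paths, the build location of `llama-quantize`, and the `Q4_K_M` type are only examples, and the app may invoke these tools differently internally:

```bash
# Convert a Hugging Face checkpoint to GGUF (F16), then quantize it.
# Paths, file names, and the Q4_K_M quantization type are illustrative.
python llama.cpp/convert_hf_to_gguf.py ./models/my-model --outfile ./models/my-model-f16.gguf
llama.cpp/build/bin/llama-quantize ./models/my-model-f16.gguf ./quantized_models/my-model-Q4_K_M.gguf Q4_K_M
```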
SpongeQuant/
├── app/
│ ├── app.py # Main application code (Gradio UI)
│ ├── requirements.cpu.txt # CPU-only dependencies
│ ├── requirements.gpu-cuda.txt # GPU (CUDA) dependencies
│ └── ... # Other application files
├── Dockerfile.cpu # Dockerfile for CPU-only mode
├── Dockerfile.gpu-cuda # Dockerfile for GPU (CUDA) mode
├── Dockerfile.gpu-rocm # (Placeholder for future ROCm support)
├── start_linux.sh # Startup script for Linux
├── start_windows.ps1 # Startup script for Windows
├── README.md # This file
└── ... # Other files (models, quantized_models, etc.)
Contributions are welcome! Please feel free to open issues or submit pull requests on GitHub.
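# GPU (CUDA) mode: pass all GPUs through to the container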
docker run --gpus all -it -p "${PORT}:${PORT}" \
-v "$(pwd)/app/gguf:/app/gguf" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/quantized_models:/app/quantized_models" \
--rm "${IMAGE_NAME}"
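# CPU-only mode: same mounts, no GPU passthrough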
docker run -it -p "${PORT}:${PORT}" \
-v "$(pwd)/app/gguf:/app/gguf" \
-v "$(pwd)/models:/app/models" \
-v "$(pwd)/quantized_models:/app/quantized_models" \
--rm "${IMAGE_NAME}"
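Both commands mount `app/gguf`, `models`, and `quantized_models` from the host, so downloaded checkpoints and quantized outputs persist after the container exits (`--rm` only removes the container itself).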
Most modern x86-64 CPUs support AVX2 and FMA, which llama.cpp uses to accelerate tensor operations; these instructions generally deliver faster quantized inference than ARM NEON/DOTPROD.
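If you are unsure whether your CPU exposes these instructions, a quick check on Linux is to inspect the advertised CPU flags:

```bash
# Prints "avx2" and/or "fma" if the processor supports them.
grep -woE 'avx2|fma' /proc/cpuinfo | sort -u
```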