FlexGen-Lite replicates and extends FlexGen, a system that enables high-throughput generative inference of large language models (LLMs) on a single GPU. By splitting work between CPU and GPU and carefully scheduling data movement across GPU, CPU, and disk, FlexGen-Lite addresses the challenges of deploying LLMs on resource-constrained hardware.
Deploying large language models on single GPU setups presents significant computational and memory challenges. Traditional methods struggle due to high memory demands and inefficiencies in data movement between GPU, CPU, and disk. These challenges result in performance bottlenecks, particularly for throughput-centric tasks.
FlexGen-Lite introduces a hybrid CPU-GPU architecture that divides computational tasks to balance memory usage and processing efficiency: the CPU handles sequential steps such as generating the Key (K), Query (Q), and Value (V) tensors, while the GPU is reserved for parallelizable work such as computing activations in Multi-Head Attention (MHA).
- Efficient Tensor Management and Reduced I/O Costs: Optimized layer-wise loading minimizes frequent access to slower secondary storage, significantly reducing I/O costs (see the sketch after this list).
- Strategic Offloading: The zig-zag computational pattern aligns data more precisely with computational demands, enhancing overall system efficiency and throughput.
- Throughput Optimization via Batch and Block Sizing: Dynamic batch sizing and block scheduling maximize GPU utilization, increasing throughput.
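To make the layer-wise loading idea concrete, here is a minimal sketch of a block loop that keeps one layer's weights resident on the GPU, reuses them for every micro-batch in a block, and only then fetches the next layer. The helpers (`load_weights`, the `layers` list of modules, the `hidden_states` list) are illustrative assumptions, not FlexGen-Lite's actual interface.

```python
import torch

def run_block(layers, load_weights, hidden_states, device="cuda"):
    """Run one block of micro-batches through all layers, loading weights layer by layer.

    `layers` are torch.nn.Module objects whose parameters live on the GPU;
    `load_weights(i)` is assumed to return layer i's state dict from CPU RAM or disk.
    These names are illustrative, not the actual FlexGen-Lite interface.
    """
    # Fetch the first layer's weights.
    pending = load_weights(0)
    for i, layer in enumerate(layers):
        # Install the weights fetched for this layer.
        layer.load_state_dict(pending)
        # Fetch the next layer's weights; a real implementation would overlap
        # this transfer with compute using asynchronous copies.
        pending = load_weights(i + 1) if i + 1 < len(layers) else None
        # Reuse the resident weights for every micro-batch in the block, so each
        # layer's weights cross the disk/CPU/GPU boundary only once per block.
        for b, h in enumerate(hidden_states):
            hidden_states[b] = layer(h.to(device))
    return hidden_states
```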
The implementation follows a hybrid CPU-GPU architecture:
- CPU Responsibilities: Initial data processing and attention calculations are performed on the CPU, avoiding the high I/O cost of transferring large KV caches between GPU and CPU (a sketch of this split follows the list).
- GPU Responsibilities: The GPU handles parallelizable tasks and activation handling, reducing the time required for data transfers.
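A minimal sketch of what this split might look like for a single decoding step is given below. It assumes the KV cache is kept in CPU memory and uses plain PyTorch operations; the function signature and tensor layout are illustrative assumptions, not the project's actual API.

```python
import math
import torch

def decode_step_attention(q_gpu, k_cache_cpu, v_cache_cpu):
    """One decoding step: attention runs on the CPU, next to the KV cache.

    q_gpu:        (batch, heads, 1, head_dim) query for the new token, on the GPU
    k_cache_cpu:  (batch, heads, seq, head_dim) cached keys, kept in CPU memory
    v_cache_cpu:  (batch, heads, seq, head_dim) cached values, kept in CPU memory
    """
    head_dim = q_gpu.shape[-1]

    # Move only the small query to the CPU instead of moving the large KV cache to the GPU.
    q_cpu = q_gpu.to("cpu")

    # Attention scores and the weighted sum over values are computed on the CPU.
    scores = q_cpu @ k_cache_cpu.transpose(-1, -2) / math.sqrt(head_dim)
    probs = torch.softmax(scores, dim=-1)
    attn_out_cpu = probs @ v_cache_cpu

    # Only the small attention output returns to the GPU for the rest of the layer
    # (output projection, MLP, etc.).
    return attn_out_cpu.to(q_gpu.device)
```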
The zig-zag traversal pattern for data management between CPU, GPU, and disk optimizes computational efficiency by minimizing unnecessary data movement.
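One way to picture the traversal is as an order over (layer, micro-batch) pairs in which the sweep direction over batches alternates from layer to layer, so the last batch computed at one layer is the first consumed at the next. The snippet below only enumerates such an order; it is a simplification for illustration, not the project's actual scheduler.

```python
def zigzag_order(num_layers, num_batches, block_size):
    """Enumerate (layer, batch) visits so that each layer's weights, once loaded,
    are reused for a whole block of micro-batches before the next layer is fetched."""
    order = []
    for block_start in range(0, num_batches, block_size):
        block = list(range(block_start, min(block_start + block_size, num_batches)))
        for layer in range(num_layers):
            # Alternate sweep direction over the batches at each layer (the "zig-zag").
            batches = block if layer % 2 == 0 else list(reversed(block))
            for b in batches:
                order.append((layer, b))
    return order

# Example: 3 layers, 4 micro-batches, block size 2 ->
# layer 0 over batches 0,1; layer 1 over 1,0; layer 2 over 0,1; then the same for batches 2,3.
print(zigzag_order(3, 4, 2))
```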
Experiments were conducted using the OPT-1.3B language model on Google Colab. The results demonstrate significant improvements in throughput compared to traditional methods and the original FlexGen system. Detailed throughput analysis and comparisons are provided in the project report.
FlexGen-Lite offers an optimized way to run highly batched high-throughput inference jobs on single GPU setups. This project enhances understanding of systems and large language models, especially for production model serving and inference pipelines.
To install and run FlexGen-Lite, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/FlexGen-Lite.git
  cd FlexGen-Lite
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the main script:

  ```bash
  python main.py
  ```
Detailed instructions on how to use FlexGen-Lite can be found in the `docs` folder. Example usage:
```python
from flexgen_lite import FlexGenLite

# Load the model from a local checkpoint directory
model = FlexGenLite('path/to/model')

# Generate text for an input prompt
output = model.generate("Your input text here")
print(output)
```