63 changes: 61 additions & 2 deletions docsrc/tutorials/compile_hf_models.rst
@@ -18,6 +18,7 @@ Overview of tools/llm Directory
The ``tools/llm`` directory provides the following tools to compile LLM models from Huggingface:

* **run_llm.py**: Main entry point for model compilation, generating outputs, and benchmarking
* **run_vlm.py**: Entry point for compiling and benchmarking Visual Language Models (VLMs)
* **Static Cache Utilities**: ``static_cache_v1.py`` and ``static_cache_v2.py`` for KV cache optimization
* **SDPA Attention**: ``sdpa_converter.py`` and ``register_sdpa.py`` for registering the scaled dot-product attention converter and lowering pass
* **Testing Components**: Model-specific test files for validation
@@ -60,6 +61,30 @@ We have officially verified support for the following LLM families:
- FP16, FP32
- Yes

Supported VLM Models
--------------------
We have officially verified support for the following Visual Language Models (VLMs):

.. list-table::
:widths: 20 40 20 20 20
:header-rows: 1

* - Model Series
- HuggingFace Model Card
- Precision
- KV Cache Support?
- Component Support
* - Qwen 2.5 VL
- Qwen/Qwen2.5-VL-3B-Instruct
- FP16, FP32
- Yes (static_v1 only)
- Language Model only (Image Encoder not supported)
* - Eagle2
- nvidia/Eagle2-2B
- FP16, FP32
- Yes (static_v1 only)
- Language Model and Image Encoder both supported

Getting Started with run_llm.py
-------------------------------

@@ -112,6 +137,36 @@ Other Usage Examples
python tools/llm/run_llm.py --model Qwen/Qwen2.5-1.5B-Instruct --precision FP32 --benchmark


Getting Started with run_vlm.py
-------------------------------

For Visual Language Models (VLMs), use ``run_vlm.py`` to compile and benchmark models that process both text and images.

Basic Usage
^^^^^^^^^^^

.. code-block:: bash

python tools/llm/run_vlm.py \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--precision FP16 \
--num_tokens 128 \
--cache static_v1 \
--enable_pytorch_run \
--benchmark
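
Eagle2, for which both the language model and the image encoder are supported (see the table above), can be run with the same flags by swapping the model card; note the ``flash-attn`` requirement listed under Requirements below. For example:

.. code-block:: bash

    python tools/llm/run_vlm.py \
      --model nvidia/Eagle2-2B \
      --precision FP16 \
      --num_tokens 128 \
      --cache static_v1 \
      --enable_pytorch_run \
      --benchmark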

Key Arguments
^^^^^^^^^^^^^

* ``--model``: Name or path of the HuggingFace VLM
* ``--prompt``: Input prompt for generation
* ``--image_path``: (Optional) Path to an input image file. If not provided, a sample image is used (see the example below)
* ``--precision``: Precision mode (``FP16``, ``FP32``)
* ``--num_tokens``: Number of output tokens to generate
* ``--cache``: KV cache type (``static_v1`` or empty for no KV caching)
* ``--benchmark``: Enable benchmarking mode
* ``--enable_pytorch_run``: Also run and compare PyTorch baseline
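
Putting these together, a run that supplies its own prompt and image might look like the following sketch (``./my_image.jpg`` is an illustrative path, not a file shipped with the repository):

.. code-block:: bash

    python tools/llm/run_vlm.py \
      --model Qwen/Qwen2.5-VL-3B-Instruct \
      --precision FP16 \
      --num_tokens 128 \
      --cache static_v1 \
      --prompt "Describe this image." \
      --image_path ./my_image.jpg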

KV Caching in Torch-TensorRT
---------------------------------

@@ -122,7 +177,7 @@ The length of KV cache = input sequence length + output sequence length (specifi
Static Cache v1
^^^^^^^^^^^^^^^^

``static_cache_v1.py`` implements the KV cache in the model graph as follows:

.. code-block:: python

@@ -210,9 +265,13 @@ Limitations and Known Issues

* Sliding window attention (used in Gemma3 and Qwen 3 models) is not yet supported
* Some model architectures (e.g., Phi-4) have issues when exporting the torch model.
* For VLMs, compiling the Qwen2.5-VL image encoder is not supported because it contains dynamic operations that are incompatible with ``torch.export``; only its language model is compiled.

Requirements
^^^^^^^^^^^^

* Torch-TensorRT 2.8.0 or later
* Transformers v4.52.3
* For VLM models (``run_vlm.py``):

  * ``pip install qwen-vl-utils`` (for the Qwen2.5-VL-3B-Instruct model)
  * ``pip install flash-attn --no-build-isolation -v`` (for the Eagle2-2B model)
28 changes: 24 additions & 4 deletions tools/llm/README.md
@@ -1,10 +1,11 @@
# Optimizing LLMs in Torch-TensorRT

This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) and Visual Language Models (VLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry points are `run_llm.py` for text-only LLMs and `run_vlm.py` for vision-language models. Note that this is an **experimental release** and APIs may change in future versions.

### Key Features

- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
- **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.
@@ -24,20 +25,33 @@ We have officially verified support for the following models:
| Qwen 2.5 | Qwen/Qwen2.5-0.5B-Instruct<br>Qwen/Qwen2.5-1.5B-Instruct<br>Qwen/Qwen2.5-4B-Instruct<br>Qwen/Qwen2.5-7B-Instruct | FP16, FP32 | Yes |
| Qwen 3 | Qwen/Qwen3-0.6B<br>Qwen/Qwen3-1.7B<br>Qwen/Qwen3-4B<br>Qwen/Qwen3-8B | FP16, FP32 | Yes |

### Supported VLM Models

| Model Series | HF Model Card | Precision | KV Cache Supported? |
|--------------|---------------|-----------|-------------------|
| Qwen 2.5 VL | Qwen/Qwen2.5-VL-3B-Instruct | FP16, FP32 | Yes |
| Eagle2 | nvidia/Eagle2-2B | FP16, FP32 | Yes |

### Usage

#### Text-only LLMs: `run_llm.py`

```bash
python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
```

#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
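
Eagle2 can be run the same way by swapping the model card (it additionally requires `flash-attn`; see Requirements below). For example:

```bash
python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```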

#### Key Arguments

- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to model.
- `--prompt`: Input prompt for generation.
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.
- `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
@@ -60,8 +74,14 @@ This codebase can be extended to

## Limitations
- We do not yet support sliding window attention (used in Gemma3 and Qwen 3 models).
- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`), which require the `flash-attn` package to be installed. Without flash-attn, these models fail to load or run properly (a quick import check is sketched below).
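
A minimal way to confirm that `flash-attn` is importable before loading such a model (assuming `python` resolves to the environment you run these scripts from):

```bash
# Prints the installed flash-attn version, or fails with ModuleNotFoundError if the package is missing.
python -c "import flash_attn; print(flash_attn.__version__)"
```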

## Requirements

- Torch-TensorRT 2.8.0
- Transformers v4.52.3
- For VLM models (`run_vlm.py`):
  - `pip install qwen-vl-utils` (for the Qwen2.5-VL-3B-Instruct model)
  - **Flash Attention**: For models using flash attention operations (e.g., Eagle2-2B), install one of the following:
    - **Fast installation (recommended)**: `pip install flash-attn==2.8.1` (installs a pre-built wheel when one is available for your platform)
    - **Source build (slow)**: `pip install flash-attn --no-build-isolation -v` (builds from source; use this as a fallback if no pre-built wheel is available)