63 changes: 61 additions & 2 deletions docsrc/tutorials/compile_hf_models.rst
@@ -18,6 +18,7 @@ Overview of tools/llm Directory
The ``tools/llm`` directory provides the following tools to compile LLM models from Huggingface:

* **run_llm.py**: Main entry point for model compilation, generating outputs, and benchmarking
* **run_vlm.py**: Entry point for compiling and benchmarking Visual Language Models (VLMs)
* **Static Cache Utilities**: ``static_cache_v1.py`` and ``static_cache_v2.py`` for KV cache optimization
* **SDPA Attention**: ``sdpa_converter.py`` and ``register_sdpa.py`` for registering the scaled dot-product attention converter and lowering pass
* **Testing Components**: Model-specific test files for validation
@@ -60,6 +61,30 @@ We have officially verified support for the following LLM families:
- FP16, FP32
- Yes

Supported VLM Models
--------------------
We have officially verified support for the following Visual Language Models (VLMs):

.. list-table::
:widths: 20 40 20 20 20
:header-rows: 1

* - Model Series
- HuggingFace Model Card
- Precision
- KV Cache Support?
- Component Support
* - Qwen 2.5 VL
- Qwen/Qwen2.5-VL-3B-Instruct
- FP16, FP32
- Yes (static_v1 only)
- Language Model only (Image Encoder not supported)
* - Eagle2
- nvidia/Eagle2-2B
- FP16, FP32
- Yes (static_v1 only)
- Language Model and Image Encoder both supported

Getting Started with run_llm.py
-------------------------------

@@ -112,6 +137,36 @@ Other Usage Examples
python tools/llm/run_llm.py --model Qwen/Qwen2.5-1.5B-Instruct --precision FP32 --benchmark


Getting Started with run_vlm.py
-------------------------------

For Visual Language Models (VLMs), use ``run_vlm.py`` to compile and benchmark models that process both text and images.

Basic Usage
^^^^^^^^^^^

.. code-block:: bash

python tools/llm/run_vlm.py \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--precision FP16 \
--num_tokens 128 \
--cache static_v1 \
--enable_pytorch_run \
--benchmark
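
Eagle2, for which both the language model and the image encoder are supported (see the table above), can be run with the same flags by swapping the model card; note the ``flash-attn`` requirement listed under Requirements below. For example:

.. code-block:: bash

    python tools/llm/run_vlm.py \
      --model nvidia/Eagle2-2B \
      --precision FP16 \
      --num_tokens 128 \
      --cache static_v1 \
      --enable_pytorch_run \
      --benchmark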

Key Arguments
^^^^^^^^^^^^^

* ``--model``: Name or path of the HuggingFace VLM
* ``--prompt``: Input prompt for generation
* ``--image_path``: (Optional) Path to an input image file. If not provided, a sample image is used (see the example below)
* ``--precision``: Precision mode (``FP16``, ``FP32``)
* ``--num_tokens``: Number of output tokens to generate
* ``--cache``: KV cache type (``static_v1`` or empty for no KV caching)
* ``--benchmark``: Enable benchmarking mode
* ``--enable_pytorch_run``: Also run and compare PyTorch baseline
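
Putting these together, a run that supplies its own prompt and image might look like the following sketch (``./my_image.jpg`` is an illustrative path, not a file shipped with the repository):

.. code-block:: bash

    python tools/llm/run_vlm.py \
      --model Qwen/Qwen2.5-VL-3B-Instruct \
      --precision FP16 \
      --num_tokens 128 \
      --cache static_v1 \
      --prompt "Describe this image." \
      --image_path ./my_image.jpg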

KV Caching in Torch-TensorRT
---------------------------------

@@ -122,7 +177,7 @@ The length of KV cache = input sequence length + output sequence length (specifi
Static Cache v1
^^^^^^^^^^^^^^^^

``static_cache_v1.py`` implements the KV cache in the model graph as follows:

.. code-block:: python

@@ -210,9 +265,13 @@ Limitations and Known Issues

* Sliding window attention (used in Gemma3 and Qwen 3 models) is not yet supported
* Some model architectures (e.g., Phi-4) have issues when exporting the torch model.
* For VLMs, compiling the Qwen2.5-VL image encoder is not supported because it contains dynamic operations that are incompatible with ``torch.export``; only its language model is compiled.

Requirements
^^^^^^^^^^^^

* Torch-TensorRT 2.8.0 or later
* Transformers v4.52.3
* For VLM models (``run_vlm.py``):

  * ``pip install qwen-vl-utils`` (for the Qwen2.5-VL-3B-Instruct model)
  * ``pip install flash-attn --no-build-isolation -v`` (for the Eagle2-2B model)
28 changes: 24 additions & 4 deletions tools/llm/README.md
@@ -1,10 +1,11 @@
# Optimizing LLMs in Torch-TensorRT

This directory provides utilities and scripts for compiling, optimizing, and benchmarking Large Language Models (LLMs) and Visual Language Models (VLMs) using Torch-TensorRT, with a focus on efficient inference on NVIDIA GPUs. The main entry points are `run_llm.py` for text-only LLMs and `run_vlm.py` for vision-language models. Note that this is an **experimental release** and APIs may change in future versions.

### Key Features

- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
- **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.
@@ -24,20 +25,33 @@ We have officially verified support for the following models:
| Qwen 2.5 | Qwen/Qwen2.5-0.5B-Instruct<br>Qwen/Qwen2.5-1.5B-Instruct<br>Qwen/Qwen2.5-4B-Instruct<br>Qwen/Qwen2.5-7B-Instruct | FP16, FP32 | Yes |
| Qwen 3 | Qwen/Qwen3-0.6B<br>Qwen/Qwen3-1.7B<br>Qwen/Qwen3-4B<br>Qwen/Qwen3-8B | FP16, FP32 | Yes |

### Supported VLM Models

| Model Series | HF Model Card | Precision | KV Cache Supported? |
|--------------|---------------|-----------|-------------------|
| Qwen 2.5 VL | Qwen/Qwen2.5-VL-3B-Instruct | FP16, FP32 | Yes |
| Eagle2 | nvidia/Eagle2-2B | FP16, FP32 | Yes |

### Usage

#### Text-only LLMs: `run_llm.py`

```bash
python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --prompt "What is parallel programming?" --precision FP16 --num_tokens 128 --cache static_v2 --benchmark
```

#### Vision Language Models: `run_vlm.py`

```bash
python run_vlm.py --model Qwen/Qwen2.5-VL-3B-Instruct --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```
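
Eagle2 can be run the same way by swapping the model card (it additionally requires `flash-attn`; see Requirements below). For example:

```bash
python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --num_tokens 128 --cache static_v1 --enable_pytorch_run --benchmark
```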

#### Key Arguments

- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to model.
- `--prompt`: Input prompt for generation.
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.
- `--cache`: KV cache type (`static_v1`, `static_v2`, or empty for no KV caching).
@@ -60,8 +74,14 @@ This codebase can be extended to

## Limitations
- We do not yet support sliding window attention (used in Gemma3 and Qwen 3 models).
- **Flash Attention Limitation**: Some models (e.g., Eagle2-2B) internally use flash attention operations (`torch.ops.flash_attn._flash_attn_forward.default`), which require the `flash-attn` package to be installed. Without flash-attn, these models fail to load or run properly (a quick import check is sketched below).
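
A minimal way to confirm that `flash-attn` is importable before loading such a model (assuming `python` resolves to the environment you run these scripts from):

```bash
# Prints the installed flash-attn version, or fails with ModuleNotFoundError if the package is missing.
python -c "import flash_attn; print(flash_attn.__version__)"
```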

## Requirements

- Torch-TensorRT 2.8.0
- Transformers v4.52.3
- For VLM models (`run_vlm.py`):
  - `pip install qwen-vl-utils` (for the Qwen2.5-VL-3B-Instruct model)
  - **Flash Attention**: For models using flash attention operations (e.g., Eagle2-2B), install one of the following:
    - **Fast installation (recommended)**: `pip install flash-attn==2.8.1` (installs a pre-built wheel when one is available for your platform)
    - **Source build (slow)**: `pip install flash-attn --no-build-isolation -v` (builds from source; use this as a fallback if no pre-built wheel is available)