Note: we highly recommend turning on `attn_temperature_tuning` to improve accuracy for contexts longer than 32K tokens; `VLLM_DISABLE_COMPILE_CACHE=1` is required when doing so.
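As a minimal sketch, the two settings fit together on the command line as shown below; the model name is a placeholder, and your other serving flags stay unchanged:

```bash
# VLLM_DISABLE_COMPILE_CACHE=1 is required when attn_temperature_tuning is enabled.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve <your-llama-4-model> \
  --override-generation-config='{"attn_temperature_tuning": true}'
```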
**Multimodality:**
The Llama 4 models excel at image understanding with up to 8-10 images per prompt. By default, the vLLM server accepts one image per request. Please pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request with the OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama-4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
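For reference, a rough sketch of serving with a higher image limit and issuing a multi-image request through the OpenAI-compatible API follows; the model name and image URLs are placeholders:

```bash
# Allow up to 10 images per request (the default is 1).
vllm serve <your-llama-4-model> --limit-mm-per-prompt image=10

# Multi-image chat completion against the OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<your-llama-4-model>",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "What do these two images have in common?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/first.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/second.jpg"}}
          ]
        }]
      }'
```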
* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
* **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens. A combined serving sketch follows this list.
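Putting these tips together, a single-node serving sketch might look like the following; the model name, GPU count, and context length are illustrative placeholders rather than a recommended configuration:

```bash
# fp8 KV cache + tensor parallelism + attention temperature tuning for long contexts.
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve <your-llama-4-model> \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --max-model-len 1000000 \
  --override-generation-config='{"attn_temperature_tuning": true}'
```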
**Other Hardware Support & Quantizations:**
We extend our sincere thanks to the Meta team for their implementation of the model.
We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.
The vLLM team’s performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.