_posts/2025-04-05-llama4.md
* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context window (up to 10M tokens for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
* **Improve Long Context Accuracy (>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens; a combined serving example follows below.
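
For concreteness, a single-node serving command that combines these flags might look like the following sketch. The model id, `--tensor-parallel-size 8`, and `--max-model-len` value are illustrative assumptions rather than recommendations from this post; for multi-node pipeline parallelism, follow the distributed inference guide linked above.

```bash
# Sketch: serve a Llama 4 Scout checkpoint with the tips above applied.
# Assumptions: the Hugging Face model id, an 8-GPU node, and a 1M-token
# context length -- adjust these to your checkpoint and hardware.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 1000000 \
    --kv-cache-dtype fp8 \
    --override-generation-config='{"attn_temperature_tuning": true}'
```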