
Commit 447290f

remove tips for attn_temperature_tuning in llama4 blog (#51)
Since we auto-enable this for `max-model-len` > 32K in PR vllm-project/vllm#16439, this tip can be removed to avoid confusion.
1 parent e4a43da commit 447290f
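
In practice this means a long-context launch no longer needs the explicit override. A minimal sketch, assuming the Scout checkpoint id and illustrative flag values (neither is taken from this commit):

```bash
# On vLLM builds that include vllm-project/vllm#16439, attention temperature
# tuning is applied automatically once the requested context exceeds 32K tokens.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000

# On older builds, the override the blog previously recommended is still needed:
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --override-generation-config='{"attn_temperature_tuning": true}'
```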

1 file changed, +0 -1 lines changed


_posts/2025-04-05-llama4.md

@@ -72,7 +72,6 @@ While more performance enhancements are on the way, we believe the Llama 4 model
 
 * **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting.
 * **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
-* **Improve Long Context Accuracy (\>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
 
 **Other Hardware Support & Quantizations:**
 
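
The two tips that survive this commit map directly onto serve-time flags. A rough sketch, with an assumed model id, GPU count, and context lengths that are illustrative rather than taken from the post:

```bash
# Tip 1: fp8 KV cache to stretch the usable context window on a single node.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --kv-cache-dtype fp8 \
  --max-model-len 131072

# Tip 2: very long contexts (up to 10M on Scout) call for multi-node serving,
# e.g. tensor parallelism within a node and pipeline parallelism across nodes;
# see the distributed serving guide linked in the post.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 10000000
```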
