[Track] long context performance sglang vs vllm #3471

Open
zhyncs opened this issue Feb 10, 2025 · 2 comments
zhyncs commented Feb 10, 2025

Currently, the two most popular practical scenarios for LLMs are chatbot-style workloads and code completion. SGLang has shown good performance on the ShareGPT dataset in the past. With the rising popularity of open-source models such as Qwen2.5-Coder-7B-Instruct, which has a 128k context, some potential users, such as hot startups, are interested in customizing SGLang for their own use cases, especially long-context code scenarios. The following is a simple performance benchmark intended to give insight into the current capabilities of open-source LLM engines rather than to compare them directly, which will help guide future optimization efforts. This content will be updated regularly.

Performance ranking: SGLang (chunked prefill 32k) > vLLM default > SGLang default (chunked prefill 8k) > vLLM with chunked prefill enabled (2k)
Hardware: H200
Version: SGLang v0.4.2.post4, vLLM 0.7.2
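
As a quick sanity check on the numbers below, the token totals follow directly from the bench_serving flags (62 successful requests × 30,000 input tokens and × 500 output tokens), and each throughput figure is just tokens divided by the benchmark duration. A minimal sketch, not part of the benchmark tooling; small differences from the reported values come from the rounded duration:

# Reproduce the token totals and throughput figures from the run parameters.
num_ok_requests = 62                   # successful requests reported below
input_len, output_len = 30_000, 500    # --random-input-len / --random-output-len

total_input = num_ok_requests * input_len    # 1,860,000
total_output = num_ok_requests * output_len  # 31,000

duration_s = 74.37  # "SGLang (chunked prefill 32k)" run below
print(total_input / duration_s)                   # ~25,010 tok/s input throughput
print(total_output / duration_s)                  # ~417 tok/s output throughput
print((total_input + total_output) / duration_s)  # ~25,427 tok/s total throughput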

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct --disable-log-requests
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct --disable-log-requests --enable-chunked-prefill
python3 -m sglang.bench_serving --dataset-name random --random-input-len 30000 --random-output-len 500 --random-range-ratio 1 --request-rate 1 --num-prompts 64 --backend vllm
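
For reference, the server launched above exposes an OpenAI-compatible API, so a long-context request can also be sent by hand. A minimal client sketch, assuming vLLM's default port 8000 and the openai Python package; it is not part of the benchmark itself:

# Hypothetical client example: one request against the vLLM server above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    # A real long-context test would send a ~30k-token prompt, as bench_serving does.
    messages=[{"role": "user", "content": "Complete this function:\ndef quicksort(arr):"}],
    max_tokens=500,
)
print(resp.choices[0].message.content)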
vLLM default

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  79.55
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    29938
Request throughput (req/s):              0.78
Input token throughput (tok/s):          23381.02
Output token throughput (tok/s):         389.68
Total token throughput (tok/s):          23770.70
Concurrency:                             39.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51091.85
Median E2E Latency (ms):                 51920.22
---------------Time to First Token----------------
Mean TTFT (ms):                          4081.17
Median TTFT (ms):                        4106.11
P99 TTFT (ms):                           7798.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.21
Median TPOT (ms):                        95.48
P99 TPOT (ms):                           150.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           96.27
Median ITL (ms):                         39.60
P99 ITL (ms):                            120.09
==================================================
vLLM with chunked prefill enabled (2k)

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  91.71
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30164
Request throughput (req/s):              0.68
Input token throughput (tok/s):          20282.32
Output token throughput (tok/s):         338.04
Total token throughput (tok/s):          20620.36
Concurrency:                             33.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49099.86
Median E2E Latency (ms):                 50278.12
---------------Time to First Token----------------
Mean TTFT (ms):                          13002.48
Median TTFT (ms):                        12155.46
P99 TTFT (ms):                           27604.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.34
Median TPOT (ms):                        85.10
P99 TPOT (ms):                           94.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.66
Median ITL (ms):                         84.39
P99 ITL (ms):                            116.96
==================================================
python3 -m sglang.launch_server --model Qwen/Qwen2.5-Coder-7B-Instruct
python3 -m sglang.launch_server --model Qwen/Qwen2.5-Coder-7B-Instruct --chunked-prefill-size 32000
python3 -m sglang.bench_serving --dataset-name random --random-input-len 30000 --random-output-len 500 --random-range-ratio 1 --request-rate 1 --num-prompts 64
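
Since the ranking above turns mainly on the chunked-prefill size, sweeping that flag is a natural next step. A hypothetical sketch, not part of this issue's tooling, assuming a single GPU and SGLang's default port:

# Hypothetical sweep: relaunch the SGLang server with several chunked-prefill
# sizes and rerun the same bench_serving command for each.
import subprocess, time

BENCH = [
    "python3", "-m", "sglang.bench_serving",
    "--dataset-name", "random",
    "--random-input-len", "30000", "--random-output-len", "500",
    "--random-range-ratio", "1", "--request-rate", "1", "--num-prompts", "64",
]

for chunk_size in (8192, 16384, 32000):
    server = subprocess.Popen([
        "python3", "-m", "sglang.launch_server",
        "--model", "Qwen/Qwen2.5-Coder-7B-Instruct",
        "--chunked-prefill-size", str(chunk_size),
    ])
    try:
        time.sleep(120)  # crude wait for model load; a readiness probe would be more robust
        subprocess.run(BENCH, check=True)
    finally:
        server.terminate()
        server.wait()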
SGLang default (chunked prefill 8k) 

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  83.94
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30164
Request throughput (req/s):              0.74
Input token throughput (tok/s):          22157.42
Output token throughput (tok/s):         369.29
Total token throughput (tok/s):          22526.71
Concurrency:                             42.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57135.08
Median E2E Latency (ms):                 58910.28
---------------Time to First Token----------------
Mean TTFT (ms):                          8395.95
Median TTFT (ms):                        9529.31
P99 TTFT (ms):                           17141.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.67
Median TPOT (ms):                        97.71
P99 TPOT (ms):                           164.48
---------------Inter-token Latency----------------
Mean ITL (ms):                           97.67
Median ITL (ms):                         29.03
P99 ITL (ms):                            31.88
==================================================
SGLang (chunked prefill 32k)

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  74.37
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30206
Request throughput (req/s):              0.83
Input token throughput (tok/s):          25011.43
Output token throughput (tok/s):         416.86
Total token throughput (tok/s):          25428.28
Concurrency:                             38.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45938.30
Median E2E Latency (ms):                 46798.18
---------------Time to First Token----------------
Mean TTFT (ms):                          4318.49
Median TTFT (ms):                        3220.63
P99 TTFT (ms):                           9065.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.41
Median TPOT (ms):                        84.33
P99 TPOT (ms):                           140.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.74
Median ITL (ms):                         28.91
P99 ITL (ms):                            953.49
==================================================
zhyncs self-assigned this Feb 10, 2025

zhyncs commented Feb 10, 2025

Currently, ragged prefill in both SGLang and vLLM has FA3 enabled.
Some feasible optimization methods:


zhyncs commented Feb 10, 2025

FYI: this optimization work will begin after the DeepSeek V3/R1 optimization project is largely complete.
