[Track] long context performance sglang vs vllm #3471

Open
zhyncs opened this issue Feb 10, 2025 · 2 comments
zhyncs commented Feb 10, 2025

Currently, the two most popular practical scenarios for LLMs are chatbot-style workloads and code completion. SGLang has shown good performance on the ShareGPT dataset in the past. With the rising popularity of open-source models such as Qwen2.5-Coder-7B-Instruct, which has a 128k context, some potential users, such as hot startups, are interested in customizing SGLang for their own use cases, especially long-context code scenarios. The following is a simple performance benchmark intended to give insight into the current capabilities of open-source LLM engines rather than to compare them directly, which will help guide future optimization efforts. This content will be updated regularly.

Performance ranking: SGLang (chunked prefill 32k) > vLLM default > SGLang default (chunked prefill 8k) > vLLM with chunked prefill enabled (2k)
Hardware: H200
Version: SGLang v0.4.2.post4, vLLM 0.7.2
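
As a quick sanity check on the numbers below, the token totals follow directly from the bench_serving flags (62 successful requests × 30,000 input tokens and × 500 output tokens), and each throughput figure is just tokens divided by the benchmark duration. A minimal sketch, not part of the benchmark tooling; small differences from the reported values come from the rounded duration:

# Reproduce the token totals and throughput figures from the run parameters.
num_ok_requests = 62                   # successful requests reported below
input_len, output_len = 30_000, 500    # --random-input-len / --random-output-len

total_input = num_ok_requests * input_len    # 1,860,000
total_output = num_ok_requests * output_len  # 31,000

duration_s = 74.37  # "SGLang (chunked prefill 32k)" run below
print(total_input / duration_s)                   # ~25,010 tok/s input throughput
print(total_output / duration_s)                  # ~417 tok/s output throughput
print((total_input + total_output) / duration_s)  # ~25,427 tok/s total throughput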

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct --disable-log-requests
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct --disable-log-requests --enable-chunked-prefill
python3 -m sglang.bench_serving --dataset-name random --random-input-len 30000 --random-output-len 500 --random-range-ratio 1 --request-rate 1 --num-prompts 64 --backend vllm
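
For reference, the server launched above exposes an OpenAI-compatible API, so a long-context request can also be sent by hand. A minimal client sketch, assuming vLLM's default port 8000 and the openai Python package; it is not part of the benchmark itself:

# Hypothetical client example: one request against the vLLM server above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    # A real long-context test would send a ~30k-token prompt, as bench_serving does.
    messages=[{"role": "user", "content": "Complete this function:\ndef quicksort(arr):"}],
    max_tokens=500,
)
print(resp.choices[0].message.content)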
vLLM default

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  79.55
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    29938
Request throughput (req/s):              0.78
Input token throughput (tok/s):          23381.02
Output token throughput (tok/s):         389.68
Total token throughput (tok/s):          23770.70
Concurrency:                             39.82
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   51091.85
Median E2E Latency (ms):                 51920.22
---------------Time to First Token----------------
Mean TTFT (ms):                          4081.17
Median TTFT (ms):                        4106.11
P99 TTFT (ms):                           7798.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.21
Median TPOT (ms):                        95.48
P99 TPOT (ms):                           150.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           96.27
Median ITL (ms):                         39.60
P99 ITL (ms):                            120.09
==================================================
vLLM with chunked prefill enabled (2k)

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  91.71
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30164
Request throughput (req/s):              0.68
Input token throughput (tok/s):          20282.32
Output token throughput (tok/s):         338.04
Total token throughput (tok/s):          20620.36
Concurrency:                             33.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49099.86
Median E2E Latency (ms):                 50278.12
---------------Time to First Token----------------
Mean TTFT (ms):                          13002.48
Median TTFT (ms):                        12155.46
P99 TTFT (ms):                           27604.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.34
Median TPOT (ms):                        85.10
P99 TPOT (ms):                           94.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           73.66
Median ITL (ms):                         84.39
P99 ITL (ms):                            116.96
==================================================
python3 -m sglang.launch_server --model Qwen/Qwen2.5-Coder-7B-Instruct
python3 -m sglang.launch_server --model Qwen/Qwen2.5-Coder-7B-Instruct --chunked-prefill-size 32000
python3 -m sglang.bench_serving --dataset-name random --random-input-len 30000 --random-output-len 500 --random-range-ratio 1 --request-rate 1 --num-prompts 64
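
Since the ranking above turns mainly on the chunked-prefill size, sweeping that flag is a natural next step. A hypothetical sketch, not part of this issue's tooling, assuming a single GPU and SGLang's default port:

# Hypothetical sweep: relaunch the SGLang server with several chunked-prefill
# sizes and rerun the same bench_serving command for each.
import subprocess, time

BENCH = [
    "python3", "-m", "sglang.bench_serving",
    "--dataset-name", "random",
    "--random-input-len", "30000", "--random-output-len", "500",
    "--random-range-ratio", "1", "--request-rate", "1", "--num-prompts", "64",
]

for chunk_size in (8192, 16384, 32000):
    server = subprocess.Popen([
        "python3", "-m", "sglang.launch_server",
        "--model", "Qwen/Qwen2.5-Coder-7B-Instruct",
        "--chunked-prefill-size", str(chunk_size),
    ])
    try:
        time.sleep(120)  # crude wait for model load; a readiness probe would be more robust
        subprocess.run(BENCH, check=True)
    finally:
        server.terminate()
        server.wait()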
SGLang default (chunked prefill 8k) 

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  83.94
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30164
Request throughput (req/s):              0.74
Input token throughput (tok/s):          22157.42
Output token throughput (tok/s):         369.29
Total token throughput (tok/s):          22526.71
Concurrency:                             42.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57135.08
Median E2E Latency (ms):                 58910.28
---------------Time to First Token----------------
Mean TTFT (ms):                          8395.95
Median TTFT (ms):                        9529.31
P99 TTFT (ms):                           17141.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.67
Median TPOT (ms):                        97.71
P99 TPOT (ms):                           164.48
---------------Inter-token Latency----------------
Mean ITL (ms):                           97.67
Median ITL (ms):                         29.03
P99 ITL (ms):                            31.88
==================================================
SGLang (chunked prefill 32k)

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    1.0
Max request concurrency:                 not set
Successful requests:                     62
Benchmark duration (s):                  74.37
Total input tokens:                      1860000
Total generated tokens:                  31000
Total generated tokens (retokenized):    30206
Request throughput (req/s):              0.83
Input token throughput (tok/s):          25011.43
Output token throughput (tok/s):         416.86
Total token throughput (tok/s):          25428.28
Concurrency:                             38.30
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45938.30
Median E2E Latency (ms):                 46798.18
---------------Time to First Token----------------
Mean TTFT (ms):                          4318.49
Median TTFT (ms):                        3220.63
P99 TTFT (ms):                           9065.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          83.41
Median TPOT (ms):                        84.33
P99 TPOT (ms):                           140.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           83.74
Median ITL (ms):                         28.91
P99 ITL (ms):                            953.49
==================================================
zhyncs self-assigned this Feb 10, 2025

zhyncs commented Feb 10, 2025

Currently, ragged prefill in both SGLang and vLLM has FA3 enabled.
Some feasible optimization methods:


zhyncs commented Feb 10, 2025

FYI: this optimization work will begin after the DeepSeek V3/R1 optimization project is largely complete.
