Significant performance degradation when using OpenAI Frontend + streaming #8045

Closed
jolyons123 opened this issue Feb 28, 2025 · 3 comments

jolyons123 commented Feb 28, 2025

Hi,

We are experiencing very poor performance with genai (TRTLLM backend) when using the OpenAI Frontend chat endpoint in conjunction with streaming requests (at least according to genai_perf/perf_analyzer benchmarks).
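
For context, the streaming runs hit the OpenAI-compatible chat completions endpoint with `"stream": true`. A minimal sketch of such a request (port 9000 and the model name are assumptions based on a default OpenAI Frontend setup, not details taken from the benchmarks below):

```bash
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Write a short story."}],
        "max_tokens": 500,
        "stream": true
      }'
```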

I benchmarked with the following options: `--synthetic-input-tokens-mean 200`, `--output-tokens-mean 500`, `--concurrency 100`, `--measurement-interval 10000`.
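
For reproducibility, here is a sketch of the genai-perf invocation those options correspond to (model name, tokenizer, and URL are placeholders; the exact flag set may differ from what was actually run):

```bash
# OpenAI Frontend chat endpoint, streaming enabled
genai-perf profile \
  -m llama-3-8b-instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:9000 \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 500 \
  --concurrency 100 \
  --measurement-interval 10000
```

For the KServe runs, the same flags would apply with `--service-kind triton --backend tensorrtllm` in place of the OpenAI endpoint options; drop `--streaming` for the non-streaming cases.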

Triton+TRTLLM backend, OpenAI Frontend Endpoint

Streaming: ON
[screenshot: genai-perf results]

Streaming: OFF
[screenshot: genai-perf results]

Triton+TRTLLM backend, KServe Endpoint

Streaming: ON
[screenshot: genai-perf results]

Streaming: OFF
[screenshot: genai-perf results]

As you can see, the performance degradation when using streaming with the KServe endpoint is only about -23% (streaming: 17.38 r/s vs. non-streaming: 22.6 r/s), while the degradation when using streaming with the OpenAI Frontend chat endpoint is about -86% (streaming: 7.81 r/s vs. non-streaming: 54.87 r/s).

(Note that there seems to be an issue with how the default sampling parameters are set on the Triton/genai_perf side: the default Triton/KServe experiment produces about 3x the average output sequence length of the OpenAI Frontend chat endpoint experiment, which probably also explains the request throughput difference between the two non-streaming runs.)
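
One way to take the default sampling parameters out of the comparison would be to pin the output length explicitly on both endpoints. This is only a sketch, assuming the backend honors `max_tokens`/`ignore_eos` passed through genai-perf's `--extra-inputs`; the values mirror the `--output-tokens-mean 500` setting above:

```bash
# OpenAI chat endpoint: request ~500 output tokens and ignore EOS
genai-perf profile ... \
  --output-tokens-mean 500 \
  --extra-inputs max_tokens:500 \
  --extra-inputs ignore_eos:true

# Triton/KServe with the tensorrtllm backend: force a deterministic output length
genai-perf profile ... \
  --output-tokens-mean 500 \
  --output-tokens-mean-deterministic
```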

@statiraju
Contributor

@jolyons123 did you try the latest container? We made significant improvements to the performance of the OpenAI frontend relative to the KServe APIs.


jolyons123 commented Mar 3, 2025

Hi @statiraju,

I re-tested by building the engine with TRTLLM v0.17.0 and using the newest Triton image (25.02-trtllm-python-py3), and it is much better.

Triton+TRTLLM backend, OpenAI Frontend Endpoint
Streaming: ON
[screenshot: genai-perf results]

Still not on par with NVIDIA NIM images, but I guess there are also a few parameters that I missed when building the engine.
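
For anyone comparing against NIM, a sketch of the kind of engine-build options that typically influence throughput (flag names as I understand them for the TRT-LLM v0.17 `trtllm-build` CLI; paths and values are placeholders, not the settings NIM actually uses):

```bash
trtllm-build \
  --checkpoint_dir ./llama-3-8b-instruct-ckpt \
  --output_dir ./engines/llama-3-8b-instruct \
  --gemm_plugin auto \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --use_paged_context_fmha enable
```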

Thanks for your help :)


rmccorm4 commented Mar 7, 2025

Thanks for verifying the improved results with the latest image, @jolyons123!

> Still not on par with NVIDIA NIM images, but I guess there are also a few parameters that I missed when building the engine.

Any significant remaining delta is likely related to engine building and not the choice of frontend, as you've pointed out. Going to close this, but feel free to open a new issue if anything else comes up.

rmccorm4 closed this as completed Mar 7, 2025