Significant performance degradation when using OpenAI Frontend + streaming #8045

Closed
jolyons123 opened this issue Feb 28, 2025 · 3 comments

jolyons123 commented Feb 28, 2025

Hi,

We are experiencing very poor performance with genai (TRTLLM backend) when using the OpenAI Frontend chat endpoint in conjunction with streaming requests (at least according to genai_perf/perf_analyzer benchmarks).
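
For context, the streaming runs hit the OpenAI-compatible chat completions endpoint with `"stream": true`. A minimal sketch of such a request (port 9000 and the model name are assumptions based on a default OpenAI Frontend setup, not details taken from the benchmarks below):

```bash
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Write a short story."}],
        "max_tokens": 500,
        "stream": true
      }'
```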

I benchmarked with the following options: `--synthetic-input-tokens-mean 200`, `--output-tokens-mean 500`, `--concurrency 100`, `--measurement-interval 10000`.
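
For reproducibility, here is a sketch of the genai-perf invocation those options correspond to (model name, tokenizer, and URL are placeholders; the exact flag set may differ from what was actually run):

```bash
# OpenAI Frontend chat endpoint, streaming enabled
genai-perf profile \
  -m llama-3-8b-instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:9000 \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 500 \
  --concurrency 100 \
  --measurement-interval 10000
```

For the KServe runs, the same flags would apply with `--service-kind triton --backend tensorrtllm` in place of the OpenAI endpoint options; drop `--streaming` for the non-streaming cases.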

Triton+TRTLLM backend, OpenAI Frontend Endpoint

Streaming: ON
[screenshot: genai-perf results]

Streaming: OFF
[screenshot: genai-perf results]

Triton+TRTLLM backend, KServe Endpoint

Streaming: ON
[screenshot: genai-perf results]

Streaming: OFF
[screenshot: genai-perf results]

As you can see, the performance degradation when using streaming with the KServe endpoint is only about -23% (streaming: 17.38 r/s vs. non-streaming: 22.6 r/s), while the degradation when using streaming with the OpenAI Frontend chat endpoint is about -86% (streaming: 7.81 r/s vs. non-streaming: 54.87 r/s).

(Note that there seems to be an issue with how the default sampling parameters are set on the Triton/genai_perf side: the default Triton/KServe experiment produces about 3x the average output sequence length of the OpenAI Frontend chat endpoint experiment, which probably also explains the request throughput difference between the two non-streaming runs.)
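
One way to take the default sampling parameters out of the comparison would be to pin the output length explicitly on both endpoints. This is only a sketch, assuming the backend honors `max_tokens`/`ignore_eos` passed through genai-perf's `--extra-inputs`; the values mirror the `--output-tokens-mean 500` setting above:

```bash
# OpenAI chat endpoint: request ~500 output tokens and ignore EOS
genai-perf profile ... \
  --output-tokens-mean 500 \
  --extra-inputs max_tokens:500 \
  --extra-inputs ignore_eos:true

# Triton/KServe with the tensorrtllm backend: force a deterministic output length
genai-perf profile ... \
  --output-tokens-mean 500 \
  --output-tokens-mean-deterministic
```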

@statiraju
Contributor

@jolyons123 did you try the latest container? We made significant improvements to the performance of the OpenAI frontend relative to the KServe APIs.


jolyons123 commented Mar 3, 2025

Hi @statiraju,

I re-tested by building the engine with TRTLLM v0.17.0 and using the newest Triton image (25.02-trtllm-python-py3), and it is much better.

Triton+TRTLLM backend, OpenAI Frontend Endpoint
Streaming: ON
[screenshot: genai-perf results]

Still not on par with NVIDIA NIM images, but I guess there are also a few parameters that I missed when building the engine.
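
For anyone comparing against NIM, a sketch of the kind of engine-build options that typically influence throughput (flag names as I understand them for the TRT-LLM v0.17 `trtllm-build` CLI; paths and values are placeholders, not the settings NIM actually uses):

```bash
trtllm-build \
  --checkpoint_dir ./llama-3-8b-instruct-ckpt \
  --output_dir ./engines/llama-3-8b-instruct \
  --gemm_plugin auto \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --use_paged_context_fmha enable
```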

Thanks for your help :)


rmccorm4 commented Mar 7, 2025

Thanks for verifying the improved results with the latest image, @jolyons123!

> Still not on par with NVIDIA NIM images, but I guess there are also a few parameters that I missed when building the engine.

Any significant remaining delta is likely related to engine building and not the choice of frontend, as you've pointed out. Going to close this, but feel free to open a new issue if anything else comes up.

rmccorm4 closed this as completed Mar 7, 2025