Hi,

We are seeing very poor performance with Triton (TRTLLM backend) when using the OpenAI Frontend chat endpoint in conjunction with streaming requests (at least according to genai_perf/perf_analyzer benchmarks).
I benchmarked with these options: `--synthetic-input-tokens-mean 200 --output-tokens-mean 500 --concurrency 100 --measurement-interval 10000`
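For reference, the full invocations looked roughly like this (a sketch: the model name and URLs are placeholders, the `profile` subcommand depends on the genai_perf version, and `--streaming` was dropped for the non-streaming runs):

```bash
# OpenAI Frontend chat endpoint run (model name and URL are placeholders)
genai-perf profile \
  -m my_model \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:9000 \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 500 \
  --concurrency 100 \
  --measurement-interval 10000

# KServe endpoint run against the Triton gRPC frontend
genai-perf profile \
  -m my_model \
  --service-kind triton \
  --backend tensorrtllm \
  --url localhost:8001 \
  --streaming \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 500 \
  --concurrency 100 \
  --measurement-interval 10000
```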
Triton+TRTLLM backend, OpenAI Frontend chat endpoint (genai_perf screenshots omitted; request throughput below):
- Streaming ON: 7.81 req/s
- Streaming OFF: 54.87 req/s

Triton+TRTLLM backend, KServe endpoint:
- Streaming ON: 17.38 req/s
- Streaming OFF: 22.6 req/s
As you can see, the performance degradation when using streaming with the KServe endpoint is only about -23% (streaming: 17.38 req/s vs. non-streaming: 22.6 req/s), while the degradation when using streaming with the OpenAI Frontend chat endpoint is about -86% (streaming: 7.81 req/s vs. non-streaming: 54.87 req/s).
(Note that there also seems to be an issue with how the default sampling parameters are set on the Triton/genai_perf side: the default Triton/KServe experiment produces roughly three times the average output sequence length of the OpenAI Frontend chat endpoint experiment, which probably also explains the request-throughput difference between the default Triton/KServe and the OpenAI Frontend chat endpoint experiments.)
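A possible way to make the two experiments comparable is to pin the output length explicitly on both sides; something along these lines (hedged: `--output-tokens-mean-deterministic` only applies to the Triton service kind, and passing `max_tokens` via `--extra-inputs` assumes the frontend honors it):

```bash
# Pin the requested output length on the KServe/Triton endpoint
# (--output-tokens-mean-deterministic forces the minimum token count
#  to match the requested output tokens on the triton service kind)
genai-perf profile -m my_model --service-kind triton --backend tensorrtllm \
  --url localhost:8001 --output-tokens-mean 500 --output-tokens-mean-deterministic

# Pin max_tokens explicitly on the OpenAI chat endpoint
genai-perf profile -m my_model --service-kind openai --endpoint-type chat \
  --url localhost:9000 --output-tokens-mean 500 --extra-inputs max_tokens:500
```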
Thanks for verifying the improved results with the latest image, @jolyons123!
It's still not on par with the NVIDIA NIM images, but I guess there are also a few parameters that I missed when building the engine.
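For context, the kind of parameters I mean are the throughput-related `trtllm-build` options, roughly along these lines (illustrative only: flag names and defaults vary across TensorRT-LLM releases, and these values are not from my actual build):

```bash
# Illustrative trtllm-build flags that commonly affect throughput
# (checkpoint/output paths are placeholders; values are examples, not tuned)
trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./tllm_engine \
  --gemm_plugin auto \
  --max_batch_size 256 \
  --max_num_tokens 8192 \
  --use_paged_context_fmha enable \
  --multiple_profiles enable
```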
Any significant remaining delta is likely related to engine building rather than the choice of frontend, as you've pointed out. Going to close this - but feel free to open a new issue if anything else comes up.