Replies: 2 comments
-
@yechank-nvidia Hi Yechan, can you help follow up on this ask from the community? Thanks
-
@juney-nvidia @yechank-nvidia Thanks
-
I'm currently working on building a TensorRT-LLM engine from an LLM that was quantized using ModelOpt with SmoothQuant. However, I've run into some difficulties because the SmoothQuant implementations in ModelOpt and TensorRT-LLM appear to differ slightly, especially regarding output scaling (scale_y). While ModelOpt applies SmoothQuant without output scaling during inference, TensorRT-LLM expects it to be used as the input scale for the SmoothQuantGemm plugin. Because of this difference, a model quantized with ModelOpt cannot be directly used to build a TensorRT-LLM engine without additional modifications, leading to compatibility issues.
I would like to understand the reason for this discrepancy and how best to handle it. Any technical explanation or best practice for addressing this difference would be greatly appreciated.
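For reference, here is a minimal NumPy sketch of the two conventions as I currently understand them. The scale names (scale_x, scale_w, scale_y), the smoothing factor, and the max-based calibration are illustrative assumptions only, not the actual ModelOpt or TensorRT-LLM code paths.
```python
# Conceptual sketch (NumPy only) of the two dequantization conventions in question.
# All scale names and calibration choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x_fp = rng.standard_normal((4, 8)).astype(np.float32)   # activations
w_fp = rng.standard_normal((8, 8)).astype(np.float32)   # weights

# SmoothQuant-style smoothing: migrate activation outliers into the weights.
s = np.maximum(np.abs(x_fp).max(axis=0), 1e-5) ** 0.5
x_s, w_s = x_fp / s, w_fp * s[:, None]

# Per-tensor symmetric INT8 quantization of activations and weights.
scale_x = np.abs(x_s).max() / 127.0
scale_w = np.abs(w_s).max() / 127.0
x_q = np.clip(np.round(x_s / scale_x), -127, 127).astype(np.int8)
w_q = np.clip(np.round(w_s / scale_w), -127, 127).astype(np.int8)

acc = x_q.astype(np.int32) @ w_q.astype(np.int32)        # INT8 GEMM, INT32 accumulator

# Convention A (no output scaling): dequantize the accumulator with
# scale_x * scale_w only and keep the result in floating point.
y_fp = acc.astype(np.float32) * (scale_x * scale_w)

# Convention B (with an output scale): an additional scale, here called
# scale_y, requantizes the GEMM result so it can feed the next INT8 layer.
scale_y = np.abs(y_fp).max() / 127.0                      # hypothetical calibration
y_q = np.clip(np.round(y_fp / scale_y), -127, 127).astype(np.int8)

print(np.max(np.abs(y_fp - x_fp @ w_fp)))                 # error of the FP-output path
print(np.max(np.abs(y_q * scale_y - x_fp @ w_fp)))        # error after requantization
```
In this sketch the two conventions differ only in the extra requantization step driven by scale_y, which is the piece that seems to need reconciling when converting a ModelOpt-quantized checkpoint for the TensorRT-LLM plugin.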