Replies: 2 comments
-
@yechank-nvidia Hi Yechan, can you help follow up on this ask from the community? Thanks
-
@juney-nvidia @yechank-nvidia Thanks
-
I'm currently working on building a TensorRT-LLM engine from an LLM that was quantized using ModelOpt with SmoothQuant. However, I've run into some difficulties because the SmoothQuant implementations in ModelOpt and TensorRT-LLM appear to differ slightly, especially regarding output scaling (scale_y). While ModelOpt applies SmoothQuant without output scaling during inference, TensorRT-LLM expects it to be used as the input scale for the SmoothQuantGemm plugin. Because of this difference, a model quantized with ModelOpt cannot be directly used to build a TensorRT-LLM engine without additional modifications, leading to compatibility issues.
I would like to understand the reason for this discrepancy and how best to handle it. Any technical explanation or best practice for addressing this difference would be greatly appreciated.
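For reference, here is a minimal NumPy sketch of the two conventions as I currently understand them. The scale names (scale_x, scale_w, scale_y), the smoothing factor, and the max-based calibration are illustrative assumptions only, not the actual ModelOpt or TensorRT-LLM code paths.
```python
# Conceptual sketch (NumPy only) of the two dequantization conventions in question.
# All scale names and calibration choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x_fp = rng.standard_normal((4, 8)).astype(np.float32)   # activations
w_fp = rng.standard_normal((8, 8)).astype(np.float32)   # weights

# SmoothQuant-style smoothing: migrate activation outliers into the weights.
s = np.maximum(np.abs(x_fp).max(axis=0), 1e-5) ** 0.5
x_s, w_s = x_fp / s, w_fp * s[:, None]

# Per-tensor symmetric INT8 quantization of activations and weights.
scale_x = np.abs(x_s).max() / 127.0
scale_w = np.abs(w_s).max() / 127.0
x_q = np.clip(np.round(x_s / scale_x), -127, 127).astype(np.int8)
w_q = np.clip(np.round(w_s / scale_w), -127, 127).astype(np.int8)

acc = x_q.astype(np.int32) @ w_q.astype(np.int32)        # INT8 GEMM, INT32 accumulator

# Convention A (no output scaling): dequantize the accumulator with
# scale_x * scale_w only and keep the result in floating point.
y_fp = acc.astype(np.float32) * (scale_x * scale_w)

# Convention B (with an output scale): an additional scale, here called
# scale_y, requantizes the GEMM result so it can feed the next INT8 layer.
scale_y = np.abs(y_fp).max() / 127.0                      # hypothetical calibration
y_q = np.clip(np.round(y_fp / scale_y), -127, 127).astype(np.int8)

print(np.max(np.abs(y_fp - x_fp @ w_fp)))                 # error of the FP-output path
print(np.max(np.abs(y_q * scale_y - x_fp @ w_fp)))        # error after requantization
```
In this sketch the two conventions differ only in the extra requantization step driven by scale_y, which is the piece that seems to need reconciling when converting a ModelOpt-quantized checkpoint for the TensorRT-LLM plugin.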