❓ [Question] Manually Annotate Quantization Parameters in FX Graph #3522
Comments
cc @narendasan @peri044 maybe? 🙏
This should be possible, as this is effectively what the TensorRT Model Optimizer toolkit does. @peri044 or @lanluo-nvidia could maybe give more specific guidance.
We currently use the NVIDIA Model Optimizer toolkit, which inserts quantization nodes into the torch model via its quantize API.
You can also manually insert a quantization custom op by implementing a lowering pass which adds these nodes to the graph.
Please let me know if you have any further questions.
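For the manual route, here is a minimal sketch of what such a pass could look like, assuming the graph should carry PyTorch's quantized_decomposed q/dq ops and that convolutions are the target. The helper name insert_qdq_before_conv, the hard-coded scale/zero-point, and the assumption that Torch-TensorRT's converters consume these particular ops are all illustrative, not confirmed in this thread:

    import torch
    from torch.fx import GraphModule
    # Importing this module registers the torch.ops.quantized_decomposed.* ops.
    import torch.ao.quantization.fx._decomposed  # noqa: F401

    def insert_qdq_before_conv(gm: GraphModule, scale: float = 0.1, zero_point: int = 0) -> GraphModule:
        # Hypothetical pass: wrap the activation input of every conv in a quantize/dequantize pair.
        for node in list(gm.graph.nodes):
            if node.op == "call_function" and node.target in (
                torch.ops.aten.conv2d.default,
                torch.ops.aten.convolution.default,
            ):
                inp = node.args[0]
                with gm.graph.inserting_before(node):
                    q = gm.graph.call_function(
                        torch.ops.quantized_decomposed.quantize_per_tensor.default,
                        (inp, scale, zero_point, -128, 127, torch.int8),
                    )
                    dq = gm.graph.call_function(
                        torch.ops.quantized_decomposed.dequantize_per_tensor.default,
                        (q, scale, zero_point, -128, 127, torch.int8),
                    )
                # Reroute the conv to consume the dequantized activation.
                node.replace_input_with(inp, dq)
        gm.graph.lint()
        gm.recompile()
        return gm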
hey @peri044, thanks for the response. i tried modelopt -> export on a simple model below. am i using this wrong or missing something obvious? i'm using non-strict export (strict runs into an error).

    import modelopt.torch.quantization as mtq
    import torch
    from modelopt.torch.quantization.utils import export_torch_mode

    class JustAConv(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 3, 3)

        def forward(self, inputs):
            return self.conv(inputs)

    if __name__ == "__main__":
        model = JustAConv().to("cuda").eval()
        sample_input = torch.ones(1, 3, 224, 224).to("cuda")

        quant_cfg = mtq.INT8_DEFAULT_CFG
        mtq.quantize(
            model,
            quant_cfg,
            forward_loop=lambda model: model(sample_input),
        )

        with torch.no_grad():
            with export_torch_mode():
                exported_program = torch.export.export(model, (sample_input,), strict=False)
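One way to sanity-check what ModelOpt actually inserted before compiling is to dump the exported graph and count quantization-related nodes. A small sketch (the exact op names ModelOpt emits are not shown in this thread, so the string match below is only a heuristic):

    # Print the exported graph and list nodes that look quantization-related.
    exported_program.graph_module.print_readable()
    quant_nodes = [
        n for n in exported_program.graph.nodes if "quant" in str(n.target).lower()
    ]
    print(f"found {len(quant_nodes)} quantize/dequantize-related nodes")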
@patrick-botco I have tried your example with our latest main; with strict=False it works as expected.
hey @lanluo-nvidia, thanks for checking! here are my pytorch and modelopt versions:
@patrick-botco
thanks @lanluo-nvidia - upgrading to torch 2.6 resolves the issue. compiling the exported program gives me something unexpected though. for reference, here is the model (after quantization):
the issue: compiling the exported program

    # continuing from above
    import torch_tensorrt

    trt_model = torch_tensorrt.dynamo.compile(
        exported_program,
        inputs=(sample_input,),
        enabled_precisions={torch.int8},
        min_block_size=1,
        debug=True,
    )

the initial lowering passes look good
however, after constant folding,
per-channel weight quantization is not respected - it seems like
more importantly, the gemm kernel itself is
do you happen to know what the issue is? am i using this wrong / missing something? thanks! cc @peri044 @narendasan as well 🙏 i am using these versions to test:
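To narrow down whether the per-channel weight parameters survive up to the point where constant folding runs, one option is to inspect the q/dq node arguments in the exported program (and again on the graph produced by the lowering passes). A sketch, assuming PyTorch-style q/dq ops whose second argument is the scale; the op names in the ModelOpt-exported graph may differ:

    # Compare per-tensor vs. per-channel parameters by looking at q/dq node args.
    for node in exported_program.graph.nodes:
        if "quant" in str(node.target).lower():
            # For per-channel ops the scale argument references a tensor of scales
            # (it shows up as an FX Node here); for per-tensor ops it is a plain float.
            print(node.target, [type(a).__name__ for a in node.args[1:]])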
@patrick-botco
Let me first create a separate bug-fix PR for you, so that it can be merged to main asap.
Here is the PR I raised:
thanks so much @lanluo-nvidia!
❓ Question
is there a way to manually annotate quantization parameters that will be respected throughout torch_tensorrt conversion (e.g. manually adding q/dq nodes, or specifying some tensor metadata) via dynamo? thank you!