Onnx slim transform #536


Open · wants to merge 6 commits into base: main

Conversation


@tchawada commented Aug 12, 2025

Performance comparison between the onnx-slim transformed model and the original model, for GPT2:

| Metric | With Onnx-Slim | Without Onnx-Slim |
| --- | --- | --- |
| Average prefill time (TTFT) | 0.01 sec | 0.01 sec |
| Decode throughput | 408.75 tokens/sec | 384.3 tokens/sec |
| Total throughput | 399.23 tokens/sec | 374.99 tokens/sec |
| Total (E2E) inference time | 0.31 sec | 0.33 sec |
| onnx file size | 420K | 516K |
| qpc file size | 391M | 391M |
| GPT2LMHeadModel_0.onnx.data | 622.94M | 622.94M |
| Time taken to slim | 19 sec | n/a |
| Compile time | 39.90 sec | 36.44 sec |

@tchawada closed this Aug 12, 2025

@tchawada reopened this Aug 12, 2025
@quic-amitraj (Contributor)

Please also test once with the full model https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.

@quic-amitraj (Contributor) commented Aug 12, 2025

Apply `ruff check` and `ruff format` @tchawada.
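
For reference, a minimal sketch of the usual commands (assuming the repository's default ruff configuration; `--fix` applies the safe autofixes):

```sh
ruff check --fix .
ruff format .
```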

@tchawada (Author) commented Aug 12, 2025 via email

@quic-hemagnih (Contributor) left a comment

Please resolve the lint warnings

```diff
@@ -37,6 +39,8 @@ class FP16ClipTransform(OnnxTransform):
     Clips the tensor values to be in FP16 range, but preserves -inf values.
     """

+    print("FP16ClipTransform is applied")
```
Contributor:

I would suggest using a logger, rather than print, for any messages.
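
A minimal sketch of that change, assuming the standard library `logging` module rather than any repo-specific logging utility:

```python
import logging

# Module-level logger; the name and basicConfig call are illustrative,
# not the repository's actual logging setup.
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Replaces the bare print inside the transform:
logger.info("FP16ClipTransform is applied")
```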

```python
onnx_slim_transform = kwargs.get("enable_onnx_slim_transform", False)
temp_onnx_path = kwargs.get("temp_onnx_path", None)
if onnx_slim_transform:
    print("onnx slim transform done")
```
Contributor:

Remove the print.

print("onnx slim transform done")
transformed = True
slimmed_model = onnxslim.slim(model)
onnx.save(slimmed_model, temp_onnx_path)
Contributor:

Add type checking or validation: ensure temp_onnx_path is not None before saving.
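
A minimal sketch of the suggested guard, using the variables from the diff above; the helper name `_slim_and_save` is hypothetical, not an existing function in this PR:

```python
import os

import onnx
import onnxslim

def _slim_and_save(model, temp_onnx_path):
    """Slim `model` and save it to `temp_onnx_path`, validating the path first."""
    if temp_onnx_path is None:
        raise ValueError("temp_onnx_path must be set when the onnx-slim transform is enabled")
    if not isinstance(temp_onnx_path, (str, os.PathLike)):
        raise TypeError(f"temp_onnx_path must be a path, got {type(temp_onnx_path).__name__}")
    slimmed_model = onnxslim.slim(model)
    onnx.save(slimmed_model, temp_onnx_path)
    return slimmed_model
```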

@tchawada (Author) commented Aug 13, 2025 via email

@quic-rishinr (Contributor)

Instead of adding onnx_slim_transform to every AutoModel class, could we create a transform configuration module that returns the enabled/disabled transforms as a dict, and apply the transforms in the base class based on that config? This could cover both PyTorch and ONNX transforms.
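
A rough sketch of that idea; all names here (`DEFAULT_TRANSFORMS`, `get_enabled_transforms`) are hypothetical, not existing APIs in this repo:

```python
# Hypothetical transform-configuration module: one dict describes which
# PyTorch and ONNX transforms are enabled, instead of each AutoModel
# class hard-coding its own list.
DEFAULT_TRANSFORMS = {
    "pytorch": {"CustomOpsTransform": True},
    "onnx": {"FP16ClipTransform": True, "OnnxSlimTransform": False},
}

def get_enabled_transforms(overrides=None):
    """Merge user overrides into the defaults; return enabled transform names per stage."""
    merged = {stage: dict(flags) for stage, flags in DEFAULT_TRANSFORMS.items()}
    for stage, flags in (overrides or {}).items():
        merged.setdefault(stage, {}).update(flags)
    return {stage: [name for name, on in flags.items() if on] for stage, flags in merged.items()}

# The base class would consult this config at export time, e.g.:
print(get_enabled_transforms({"onnx": {"OnnxSlimTransform": True}}))
# {'pytorch': ['CustomOpsTransform'], 'onnx': ['FP16ClipTransform', 'OnnxSlimTransform']}
```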

@inisis commented Aug 21, 2025

Hi, I'm the author of onnxslim. Thanks for using it! onnxslim applies to every single ONNX model; feel free to message me if you have any problems. I'm looking forward to more cooperation and integration with your projects.
