
Operator level microbenchmarking #3154


Open · wants to merge 1 commit into main from export-D77676673

Conversation

@SSYernar (Contributor) commented on Jul 2, 2025

Summary:
This change introduces operator-level microbenchmarking for PyTorch operators.
Since we need to capture and measure each operator call (which happens under the hood in PyTorch), we use `torch.profiler.profile`. Example operators are `aten::mm`, `aten::sigmoid`, `cudaLaunchKernel`, etc.
Use `--benchmark_operators` to enable operator-level benchmarking.
Use `--limit_operator_results` to specify how many of the top operators by runtime to benchmark.
Use `--target_operators` to list the PyTorch operators to benchmark.

Example output:

```
TrainPipelineSparseDist             | Malloc retries (P50/P90/P100): 0.0 / 0.0 / 0.0 | Runtime (P90): 442.08 ms | Peak Memory alloc (P90): 24.23 GB | Peak Memory reserved (P90): 26.21 GB
operator_aten::copy_                | Malloc retries (P50/P90/P100): -1.0 / -1.0 / -1.0 | Runtime (P90): 39.21 ms | Peak Memory alloc (P90): 0.00 GB | Peak Memory reserved (P90): -0.00 GB
...
```

Differential Revision: D77676673
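
For context, here is a minimal sketch of how per-operator timings can be collected with `torch.profiler.profile`. This is not the code in this diff: the model, the input shapes, and the top-10 cutoff are placeholder assumptions standing in for what `--target_operators` and `--limit_operator_results` control.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; the actual benchmark profiles a train pipeline.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Sigmoid())
inputs = torch.randn(64, 512)

# Capture every operator call made during one forward pass.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(inputs)

# Aggregate per-operator stats and keep the top N by self CPU time,
# analogous to what --limit_operator_results controls.
events = prof.key_averages()
top_ops = sorted(events, key=lambda e: e.self_cpu_time_total, reverse=True)[:10]
for evt in top_ops:
    # self_cpu_time_total is reported in microseconds.
    print(f"operator_{evt.key:<30s} | Runtime: {evt.self_cpu_time_total / 1e3:.2f} ms")
```

On a GPU run, adding `ProfilerActivity.CUDA` to `activities` would also capture CUDA-side events such as the `cudaLaunchKernel` calls mentioned above.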

@facebook-github-bot added the CLA Signed label on Jul 2, 2025
@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D77676673

SSYernar added a commit to SSYernar/torchrec that referenced this pull request Jul 9, 2025
@SSYernar SSYernar force-pushed the export-D77676673 branch from da06c9d to f67ee85 on July 9, 2025 at 09:55
@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D77676673

@SSYernar SSYernar force-pushed the export-D77676673 branch from f67ee85 to 484444e on July 9, 2025 at 10:00
Labels: CLA Signed, fb-exported