
Update OpenXLA-pin to Nov24 #8

Merged
merged 34 commits into from
Dec 5, 2023

Conversation

@wbmc wbmc commented Dec 5, 2023

golechwierowicz and others added 30 commits November 28, 2023 09:05
* Collect CUDA/CPU profiling info into result sheets.

This PR:
0. Adds CUDA/CPU collection capabilities to the script.
1. Modifies result_analyzer.py to analyze newly collected results.
2. Moves CUDA synchronize/XLA device synchronize into the profiler.
3. Fixes list typing for Python 3.8+.

Tested with command:
python3 xla/benchmarks/experiment_runner.py --dynamo=openxla --xla=PJRT --test=train --filter=basic_gnn_gcn$ --suite-name=torchbench --accelerator=cuda --progress-bar --output-dirname=/tmp/output --repeat=2 --print-subprocess --no-resume --profile-cuda-cpu-collect --profile-cuda
python3 xla/benchmarks/result_analyzer.py --output-dir=/tmp/output
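Item 3 above refers to Python 3.8 compatibility: built-in generic annotations such as `list[float]` only work on Python 3.9+, so 3.8-compatible code uses `typing.List` instead. A minimal sketch of the pattern (the function name is illustrative, not from the PR):

```python
from typing import List

# On Python 3.8, `list[float]` in an annotation raises
# "TypeError: 'type' object is not subscriptable" when the `def` is
# evaluated; `typing.List[float]` works on 3.8 and later.
def mean_duration_s(samples: List[float]) -> float:
    return sum(samples) / len(samples)
```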

* Lint, and add _s suffix to metrics

---------

Co-authored-by: root <[email protected]>
…pytorch#5914)

* Add test.

* Create `base_` tensor for views.

* Use base tensor in `as_strided` operation.

* Set base tensor of `as_strided`.

* Fix lint errors.

* Fix for disabled functionalization.

* Address review.

(de)quantize_per_tensor/channel ops from PT2E quantization workflow are lowered to stablehlo uniform_dequantize/quantize.
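The uniform (affine) mapping behind those quantize/dequantize ops can be sketched in plain Python; the actual lowering emits StableHLO `uniform_quantize`/`uniform_dequantize` ops, so this shows only the underlying arithmetic:

```python
def quantize_per_tensor(xs, scale, zero_point, qmin=-128, qmax=127):
    # q = clamp(round(x / scale) + zero_point, qmin, qmax)
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]

def dequantize_per_tensor(qs, scale, zero_point):
    # x ≈ (q - zero_point) * scale
    return [(q - zero_point) * scale for q in qs]
```

Per-channel variants apply a separate scale/zero_point along one tensor axis instead of a single pair for the whole tensor.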
---------

Co-authored-by: Siyuan Liu <[email protected]>

* Don't fallback for pow

…utation (pytorch#5933)

* Truncate the python stack when outputting the frame that causes the graph execution

* add mp tests

* move tests to a new dir

---------

Co-authored-by: root <[email protected]>

Update some missing changes from `GPU` to `CUDA`
* Add benchmark noise reducing info.

Add info about knobs making benchmarks more stable
across different runs.

* Add more general info about setting clock freq.

* Move comments out of the code
…5948)

* Error when changing `PJRT_DEVICE` after runtime initialized

* format

* better error
…ch#5737)

* Refactor ExecuteReplicated to operate on sharded data directly

* Remove old handlers

* formatting

* Improve naming and logging

* update docstring

* Remove obsolete unit tests

* improve comment

* Remove slow calls to get output shapes.

* fix implicit sharding

* remove declarations of input/output handlers

* formatting

* give everything a manual placeholder sharding

* see if CI passes

* formatting

* Shard parameter and output handling

* Use absl::BlockingCounter

* formatting

* fix merge

* Assign valid output shardings

* tune and document costs

* formatting

* implicitly replicate output to match the output handler

* clarify ReplicateShardedData

* fix merge
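The `absl::BlockingCounter` mentioned above lets the main thread block until N worker tasks (such as the per-shard parameter and output handling) have all finished. A minimal Python analogue, for illustration only:

```python
import threading

class BlockingCounter:
    """Python sketch of absl::BlockingCounter: wait() blocks until
    decrement_count() has been called `initial_count` times."""

    def __init__(self, initial_count: int):
        self._count = initial_count
        self._cond = threading.Condition()

    def decrement_count(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait(self):
        with self._cond:
            while self._count > 0:
                self._cond.wait()

# Usage: one task per shard, main thread waits for all of them.
results = []
counter = BlockingCounter(4)

def handle_shard(i):
    results.append(i)        # stand-in for per-shard transfer work
    counter.decrement_count()

for i in range(4):
    threading.Thread(target=handle_shard, args=(i,)).start()
counter.wait()               # returns once all 4 shards are handled
```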

* Add graph hash and num input/output to PT_XLA_DEBUG

* Remove unnecessary checks

* fix typo

* static const
…custom op_name (pytorch#5838)

* Add python binding to allow custom op_name metadata for lowered HLO

* As discussed increase timeout on GPU tests by 20%

* Add lowering for stack frame index and stack frame id in metadata

* Add fix for stack depth when setting a custom op_name in a python context

* Changes after adding tests for lowered stack frames and finding several issues

* Add routine to XlaNode to search back through operands and recursively set metadata

* Fix recursion condition so we don't explore nodes with metadata
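A sketch of the operand walk described in the last two items, using a plain dict graph (the real code operates on `XlaNode` operands): recursion stops at nodes that already carry metadata, which both bounds the traversal and avoids overwriting already-annotated subgraphs.

```python
def set_metadata_recursive(node, metadata):
    # Stop at nodes that already carry metadata so annotated
    # subgraphs are neither revisited nor overwritten.
    if node.get("metadata") is not None:
        return
    node["metadata"] = metadata
    for operand in node.get("operands", []):
        set_metadata_recursive(operand, metadata)
```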

* Distribute Literal->Tensor copies across thread pool

* Update for pytorch#5799
This PR enables fast TF32 for PyTorch by default to mirror XLA
behaviour.
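TF32 trades float32's 23-bit mantissa for 10 bits while keeping the 8-bit exponent, which is why matmuls run faster at slightly reduced precision. A rough Python illustration of that precision loss (truncation is used for simplicity; real hardware rounding differs):

```python
import struct

def to_tf32(x: float) -> float:
    # Reinterpret as float32 bits, then drop the low 13 of the
    # 23 mantissa bits, leaving TF32's 10-bit mantissa.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]
```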

* Add all-gather and reduce-scatter coalescence support for FSDP.

Also allow using reduce-scatter's scale param in FSDP.
(revived pytorch#4145)

* clang-format-7 and python lint fixes

* Fix "SyntaxError: 'return' outside function" error

* Code/test fixes to get run_tests.sh to run on CPU

* Fix allgather to be compatible with openxla allgather tuple change without token

* Fix reduce-scatter-coalesce to be compatible with openxla reduce-scatter tuple change without token

* Separate out the reduce-scatter-coalesce changes into a separate PR

* Some cleanups

* Add separate BuildAllGatherCoalesced builder and AllGatherCoalesced class

* Use token_handler.GetInput to capture token

* Clean up

* Clean up

* Switch to GetOperandListWithToken naming for func GetOperandList
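Coalescence here means batching many small per-parameter buffers into one collective instead of issuing one all-gather/reduce-scatter per tensor. The flatten/split bookkeeping can be sketched like this (Python lists stand in for tensors; the names are illustrative, not the PR's API):

```python
def coalesce(tensors):
    # Flatten the buffers and remember each one's length so a single
    # collective op can cover all of them.
    sizes = [len(t) for t in tensors]
    flat = [x for t in tensors for x in t]
    return flat, sizes

def uncoalesce(flat, sizes):
    # Split the collective's result back into per-parameter buffers.
    out, i = [], 0
    for n in sizes:
        out.append(flat[i:i + n])
        i += n
    return out
```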
…#5939)

* Allow openxla for eval.

* Update readme.

* Revert `openxla_eval` rule.

* Only initialize once for the test suite instead of each test.

* remove comments

* removed unused lines

* fix linter

* fix a tpu issue

* fix minor issue

* Add script for updating core aten opset issue

* Add update function

* Add graph hash to save tensor output

* Add support for dynamo

* fix test
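Tagging saved tensor output with a graph hash lets dumps from different traced graphs be matched back to the graph that produced them. A hypothetical sketch of such a fingerprint (not the actual hashing torch_xla uses):

```python
import hashlib

def graph_hash(serialized_graph: str) -> str:
    # Stable short fingerprint of a serialized graph; equal graphs
    # get equal tags, so saved outputs can be grouped per graph.
    return hashlib.sha256(serialized_graph.encode("utf-8")).hexdigest()[:16]
```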

* Add profiler API for async capture

* Add unit test

@wbmc wbmc merged commit a610b9b into master Dec 5, 2023
1 of 2 checks passed