TensorOps kernels refactoring #3346

Open · wants to merge 34 commits into base: develop
Conversation

novakovicdj (Contributor)

This is a draft PR for refactoring the tensor ops kernels into the solver structure; so far, only the Op1dTensorGeneric kernel has been switched.
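(For context, a minimal sketch of the solver pattern the kernels are being moved into; the names and interfaces below are simplified and hypothetical, not the actual MIOpen ones:)

```cpp
#include <cstddef>
#include <string>

// Minimal stand-ins for the real MIOpen types (hypothetical).
struct ProblemDescription { std::size_t ndims; };
struct Solution { std::string kernel_name; };

// Each kernel becomes a solver: it reports whether it can handle a given
// problem and, if applicable, builds the kernel/invoker instead of being
// launched directly from the tensor-ops entry point.
struct Op1dTensorGenericSketch
{
    bool IsApplicable(const ProblemDescription& pd) const
    {
        return pd.ndims == 1; // this solver only covers 1-D tensor ops
    }

    Solution GetSolution(const ProblemDescription&) const
    {
        return {"Op1dTensorGeneric"}; // kernel name, launch dims, invoker, ...
    }
};
```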

src/include/miopen/tensor/solvers.hpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/tensor/problem_description.cpp (outdated; resolved)
Comment on lines 41 to 43
```cpp
const void* alpha0_,
const void* alpha1_,
const void* beta_,
```
Contributor:

Check this conversation:
https://github.com/ROCm/MIOpen/pull/3346/files#r1824480257

Probably alpha0/1 must not be part of the PD, ideally beta as well, but right now it has to be there.

Contributor:

Would a bool marking whether alpha0/... has a "default" value (meaning no additional work is required) suffice?
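(A minimal sketch of that flag idea — hypothetical names; it assumes the host-side scalars are floats, whereas the real code would dispatch on the tensor's data type:)

```cpp
#include <cstring>

// Hypothetical helper: does a host-side scalar equal its "default"
// value (1 for alpha0/alpha1, 0 for beta)? Assumes float storage; the
// real code would dispatch on the tensor's data type.
inline bool IsDefaultScalar(const void* p, float def)
{
    if(p == nullptr)
        return true; // treat a missing scalar as the default
    float v;
    std::memcpy(&v, p, sizeof(v));
    return v == def;
}

// The problem description then carries only cheap, comparable flags,
// while the actual scalar values travel with the invoke params.
struct ScalarFlags
{
    bool alpha0_is_one; // no scaling of A needed
    bool alpha1_is_one; // no scaling of B needed
    bool beta_is_zero;  // C is overwritten, not blended
};
```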

src/include/miopen/tensor/problem_description.hpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op2dTensorLite.cpp (outdated; resolved)
Comment on lines 88 to 90
```cpp
size_t Aoffset;
size_t Boffset;
size_t Coffset;
```
Contributor:

Do we need to handle this internally? IIRC it should be possible to pass any subtensor externally by changing the pointer and descriptor. If so, this is duplicated functionality.

Contributor:

I think the main point is that the pointer is a void* and the actual type is an miopen_Type_t enum. That's why you can't just add the offsets to them without special helpers.
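(A small illustration of that point — hypothetical helper; the real code would switch on the actual data-type enum:)

```cpp
#include <cstddef>
#include <cstdint>

// The buffers are void* and the element width is only known at runtime,
// so an element offset cannot be applied by plain pointer arithmetic.
// "DataType" stands in for the real data-type enum.
enum class DataType { Half, Float };

inline std::size_t ElemSize(DataType t)
{
    return t == DataType::Half ? 2 : 4;
}

inline const void* WithElemOffset(const void* base, std::size_t elems, DataType t)
{
    return static_cast<const std::uint8_t*>(base) + elems * ElemSize(t);
}
```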

src/include/miopen/tensor/invoke_params.hpp (outdated; resolved)
src/include/miopen/tensor/problem_description.hpp (outdated; resolved)
src/include/miopen/tensor/problem_description.hpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/solver/tensor/Op1dTensorGeneric.cpp (outdated; resolved)
src/include/miopen/tensor_ops.hpp (outdated; resolved)
src/solver/tensorOp/tensor_op_helpers.hpp (outdated; resolved)
src/solver/tensorOp/tensor_op_helpers.hpp (outdated; resolved)
src/solver/tensorOp/tensor_op_helpers.hpp (outdated; resolved)
src/solver/tensorOp/tensor_op_helpers.hpp (outdated; resolved)
src/solver/tensorOp/tensor_op_helpers.hpp (outdated; resolved)
src/solver/tensorOp/Op2dTensorLite.cpp (outdated; resolved)
src/solver/tensorOp/Op2dTensorSquash.cpp (outdated; resolved)
src/solver/tensorOp/Op4dTensorLite.cpp (outdated; resolved)
src/solver/tensorOp/Op4dTensorLite.cpp (outdated; resolved)
src/solver/tensorOp/OpTensorFwdBias.cpp (outdated; resolved)
novakovicdj marked this pull request as ready for review — November 7, 2024 15:19
@shurale-nkn (Contributor)

Please provide a comparison of the average CPU-only time (new solver vs. old API) measured over 100 calls with the same problem, and the cost associated with the first call of a unique problem configuration.
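(For reference, a measurement harness along these lines would match the requested protocol — a hedged sketch, not the harness actually used in this PR:)

```cpp
#include <chrono>
#include <cstdio>

// Time the first call of a unique problem configuration separately
// (it pays one-time costs such as kernel compilation / solver selection),
// then average host-side time over the next 100 identical calls.
template <typename F>
void MeasureHostTime(F&& call)
{
    using clock = std::chrono::steady_clock;
    using ms    = std::chrono::duration<double, std::milli>;

    auto t0 = clock::now();
    call(); // first run: includes one-time setup
    const double first = ms(clock::now() - t0).count();

    t0 = clock::now();
    for(int i = 0; i < 100; ++i)
        call(); // same problem: should hit internal caches
    const double avg = ms(clock::now() - t0).count() / 100.0;

    std::printf("first run: %.4f ms, avg of next 100 runs: %.4f ms\n", first, avg);
}
```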

@novakovicdj (Contributor, Author)

> Please provide a comparison of the average CPU-only time (new solver vs. old API) measured over 100 calls with the same problem, and the cost associated with the first call of a unique problem configuration.

Here is a comparison of average host time between the old and new structure:

| Kernel | Run | New structure [ms] | Old structure [ms] | diff [ms] |
|---|---|---|---|---|
| Op1dTensorGeneric | first run | 279.3786 | 291.3806 | -12.002 |
| Op1dTensorGeneric | other 100 runs | 0.2908 | 0.2549 | 0.0359 |
| Op2dTensorGeneric | first run | 281.8186 | 283.4622 | -1.6436 |
| Op2dTensorGeneric | other 100 runs | 0.356 | 0.2432 | 0.1128 |
| Op2dTensorLite | first run | 634.2228 | 662.2278 | -28.005 |
| Op2dTensorLite | other 100 runs | 0.335 | 0.2308 | 0.1042 |
| Op2dTensorSquash | first run | 668.978 | 699.9932 | -31.0152 |
| Op2dTensorSquash | other 100 runs | 0.3481 | 0.2548 | 0.0933 |
| Op3dTensorGeneric | first run | 642.1512 | 656.3394 | -14.1882 |
| Op3dTensorGeneric | other 100 runs | 0.2659 | 0.2485 | 0.0174 |
| OpTensorFwdBias | first run | 636.6204 | 654.8222 | -18.2018 |
| OpTensorFwdBias | other 100 runs | 0.3351 | 0.2321 | 0.103 |
| OpTensorFwdBiasGeneric | first run | 636.4756 | 662.4915 | -26.0159 |
| OpTensorFwdBiasGeneric | other 100 runs | 0.3498 | 0.2434 | 0.1064 |
| OpTensorLeadingOnes | first run | 644.8348 | 666.8713 | -22.0365 |
| OpTensorLeadingOnes | other 100 runs | 0.3466 | 0.2755 | 0.0711 |
| OpTensorLeadingOnesGeneric | first run | 648.6535 | 669.6379 | -20.9844 |
| OpTensorLeadingOnesGeneric | other 100 runs | 0.3552 | 0.2569 | 0.0983 |
| Op4dTensorLite | first run | 641.4747 | 664.4976 | -23.0229 |
| Op4dTensorLite | other 100 runs | 0.33 | 0.2206 | 0.1094 |
| Op4dTensorGeneric | first run | 650.7638 | 670.8961 | -20.1323 |
| Op4dTensorGeneric | other 100 runs | 0.3563 | 0.2456 | 0.1107 |
| Op5dTensorGeneric | first run | 655.6774 | 685.431 | -29.7536 |
| Op5dTensorGeneric | other 100 runs | 0.3745 | 0.2437 | 0.1308 |

On average, the new structure is about 20 ms faster for first runs, and about 0.1 ms slower over the other 100 calls, i.e. roughly 0.001 ms per call.

@shurale-nkn (Contributor)

> Please provide a comparison of the average CPU-only time (new solver vs. old API) measured over 100 calls with the same problem, and the cost associated with the first call of a unique problem configuration.

> Here is a comparison of average host time between the old and new structure: […] On average, the new structure is about 20 ms faster for first runs, and about 0.1 ms slower over the other 100 calls, i.e. roughly 0.001 ms per call.

The results are very strange; we need the full experiment protocol. How was the program executed, and what was used for measurement?
So far, according to the table, each subsequent launch is on average ~30% slower.

@CAHEK7 (Contributor) commented Dec 1, 2024

@randyspauldingamd @BrianHarrisonAMD I guess it's the final review round.

@BrianHarrisonAMD (Collaborator)

I think we need to coordinate these changes with #3402 to ensure const is correct after both are merged.
Or I guess we could decide to do a follow-up.

@randyspauldingamd (Contributor)

> I think we need to coordinate these changes with #3402 to ensure const is correct after both are merged. Or I guess we could decide to do a follow-up.

If you were asking for feedback, I'd be fine with a follow-up (in a timely fashion). @novakovicdj, are you going to be joining our scrum anytime soon? If not, perhaps we could ask @DrizztDoUrden to add a ticket and coordinate with you.

@BrianHarrisonAMD (Collaborator)

Yeah, I think it depends on what @DrizztDoUrden would like to do.
We either merge this first and fix it in #3402, or merge both and fix it in a separate follow-up.

@CAHEK7 (Contributor) commented Dec 4, 2024

> I think we need to coordinate these changes with #3402 to ensure const is correct after both are merged. Or I guess we could decide to do a follow-up.

It can be fixed later. The PR is quite big and requires extra effort to maintain.

@CAHEK7 (Contributor) commented Dec 4, 2024

> > I think we need to coordinate these changes with #3402 to ensure const is correct after both are merged. Or I guess we could decide to do a follow-up.
>
> If you were asking for feedback, I'd be fine with a follow-up (in a timely fashion). @novakovicdj, are you going to be joining our scrum anytime soon? If not, perhaps we could ask @DrizztDoUrden to add a ticket and coordinate with you.

Probably not in the short term; there is some bureaucracy involved.

@BrianHarrisonAMD (Collaborator)

Greetings @novakovicdj!

Can you update this branch with develop and resolve the conflicts?

@CAHEK7 (Contributor) commented Dec 17, 2024

Hi @BrianHarrisonAMD @BradPepersAMD,
I'm not sure about the latest merge policies and whether they apply to Djordje, but we probably have to merge this manually.
Keeping a PR this big up to date can be painful.

@BrianHarrisonAMD (Collaborator)

I'll kick off another CI run, and we can merge once it passes.

@BrianHarrisonAMD (Collaborator)

Looks like it failed, but the failure seems unrelated.
Restarted that stage.

DrizztDoUrden — this comment was marked as duplicate.

@DrizztDoUrden (Contributor) left a comment:

lgtm
