TL/MLX5: a2a various optimizations #1067
base: master
Conversation
```c
    size_t t1    = power2(ucc_max(msgsize, 8));
    size_t tsize = height * ucc_max(power2(width) * t1, MAX_MSG_SIZE);

    return tsize <= MAX_TRANSPOSE_SIZE && msgsize <= 128 && height <= 64 &&
```
define literals and not hardcode
Are you talking about the 128 and 64? If so, yes agreed.
@@ -28,6 +28,19 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
     ucc_offsetof(ucc_tl_mlx5_lib_config_t, dm_buf_num),
     UCC_CONFIG_TYPE_ULUNITS},

    {"FORCE_REGULAR", "y",
     "Force the regular case where the block dimensions "
I would word this a bit differently, up to you.
"Enforce the regular case where the block dimensions evenly divide ppn. This option requires BLOCK_SIZE = 0."
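For reference, a full config-table entry with this suggested wording might look as follows; the field name `force_regular` and the `UCC_CONFIG_TYPE_BOOL` type are assumptions based on the surrounding entries, not confirmed by the visible diff:

```c
/* Sketch only: the struct field name is hypothetical. */
{"FORCE_REGULAR", "y",
 "Enforce the regular case where the block dimensions evenly divide ppn. "
 "This option requires BLOCK_SIZE = 0.",
 ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_regular),
 UCC_CONFIG_TYPE_BOOL},
```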
@@ -104,6 +117,24 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
     ucc_offsetof(ucc_tl_mlx5_lib_config_t, mcast_conf.one_sided_reliability_enable),
     UCC_CONFIG_TYPE_BOOL},

    {"SEND_BATCH_SIZE", "2",
     "number of blocks that are transposed "
We should probably capitalize the start of every option description
     UCC_CONFIG_TYPE_UINT},

    {"NBR_SERIALIZED_BATCHES", "4",
     "number of block batches "
Same here. We should probably capitalize the start of every option description
     UCC_CONFIG_TYPE_UINT},

    {"NBR_BATCHES_PER_PASSAGE", "1",
     "number of batches of blocks sent to one remote node before enqueing",
And here. We should probably capitalize the start of every option description
@@ -786,12 +835,15 @@ UCC_TL_MLX5_PROFILE_FUNC(ucc_status_t, ucc_tl_mlx5_alltoall_init,
         == a2a->node.asr_rank);
     int n_tasks = is_asr ? 5 : 3;
     int curr_task = 0;
     int ppn = tl_team->a2a->node.sbgp->group_size;
Maybe align assignments and variables?
I left minor comments, but overall looks very solid - thanks!
What
This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gains.
(Figure: before/after performance comparison)
Support rectangular blocks

This is a critical optimization that brings immediate performance benefits, since it gives more flexibility in choosing the block dimensions to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w:
- FORCE_WIDER, imposing h <= w
- FORCE_LONGER, imposing h >= w
Reuse device memory chunks for several blocks

As long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QP and 2) use the same device memory chunks. This allows one dm chunk to post (and offload to the NIC) the processing of a whole batch of blocks, so the algorithm waits less on free device memory chunks. This behavior is controlled by the option NBR_SERIALIZED_BATCHES.

Batch the inter-node RDMA sends
We allow successive results of the transpose WQEs to be batched before being sent to a remote peer. This better saturates the network by aggregating messages. The batch size is controlled by SEND_BATCH_SIZE.
Iterate across nodes before blocks when posting the WQEs

This allows better saturation of the network. It is controlled by NBR_BATCHES_PER_PASSAGE, which sets the number of batches to send to a remote peer before moving to the next one. The old behavior corresponds to large values of this parameter, i.e., NBR_BATCHES_PER_PASSAGE >> 1.
Option to force the regular case

Through TL/MLX5's env variable FORCE_REGULAR, we can force the chosen block dimensions to divide ppn. This option is useful 1) for debugging purposes, and 2) because the regular case is expected to perform better than the irregular case.

All these optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.