TL/MLX5: a2a various optimizations #1067
base: master
Conversation
```c
    size_t t1    = power2(ucc_max(msgsize, 8));
    size_t tsize = height * ucc_max(power2(width) * t1, MAX_MSG_SIZE);

    return tsize <= MAX_TRANSPOSE_SIZE && msgsize <= 128 && height <= 64 &&
```
define literals and not hardcode
Are you talking about the 128 and 64? If so, yes agreed.
@@ -28,6 +28,19 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
     ucc_offsetof(ucc_tl_mlx5_lib_config_t, dm_buf_num),
     UCC_CONFIG_TYPE_ULUNITS},

    {"FORCE_REGULAR", "y",
     "Force the regular case where the block dimensions "
I would word this a bit differently, up to you.
"Enforce the regular case where the block dimensions evenly divide ppn. This option requires BLOCK_SIZE = 0."
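For reference, a full config-table entry with this suggested wording might look as follows; the field name `force_regular` and the `UCC_CONFIG_TYPE_BOOL` type are assumptions based on the surrounding entries, not confirmed by the visible diff:

```c
/* Sketch only: the struct field name is hypothetical. */
{"FORCE_REGULAR", "y",
 "Enforce the regular case where the block dimensions evenly divide ppn. "
 "This option requires BLOCK_SIZE = 0.",
 ucc_offsetof(ucc_tl_mlx5_lib_config_t, force_regular),
 UCC_CONFIG_TYPE_BOOL},
```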
@@ -104,6 +117,24 @@ static ucc_config_field_t ucc_tl_mlx5_lib_config_table[] = {
     ucc_offsetof(ucc_tl_mlx5_lib_config_t, mcast_conf.one_sided_reliability_enable),
     UCC_CONFIG_TYPE_BOOL},

    {"SEND_BATCH_SIZE", "2",
     "number of blocks that are transposed "
We should probably capitalize the start of every option description
     UCC_CONFIG_TYPE_UINT},

    {"NBR_SERIALIZED_BATCHES", "4",
     "number of block batches "
Same here. We should probably capitalize the start of every option description
     UCC_CONFIG_TYPE_UINT},

    {"NBR_BATCHES_PER_PASSAGE", "1",
     "number of batches of blocks sent to one remote node before enqueing",
And here. We should probably capitalize the start of every option description
@@ -786,12 +835,15 @@ UCC_TL_MLX5_PROFILE_FUNC(ucc_status_t, ucc_tl_mlx5_alltoall_init,
         == a2a->node.asr_rank);
     int n_tasks = is_asr ? 5 : 3;
     int curr_task = 0;
     int ppn = tl_team->a2a->node.sbgp->group_size;
Maybe align assignments and variables?
I left minor comments, but overall looks very solid - thanks!
What
This PR contains various optimizations for TL/MLX5/a2a, leading to significant performance gains.
(Figure: before/after performance comparison)
Support rectangular blocks

This is a critical optimization that brings immediate performance benefits, since it gives more flexibility in choosing the block dimensions to better saturate the transpose unit. To complete this feature, we expose two (independent) options for determining the block dimensions h and w:
- FORCE_WIDER, imposing h <= w
- FORCE_LONGER, imposing h >= w
Reuse device memory chunks for several blocks

As long as two blocks need to be sent to the same remote peer, the WQEs dealing with those blocks can 1) be enqueued on the same QP and 2) use the same device memory chunks. This allows one dm chunk to post (and offload to the NIC) the processing of a whole batch of blocks, so the algorithm waits less on free device memory chunks. This behavior is controlled by the option NBR_SERIALIZED_BATCHES.

Batch the inter-node RDMA sends
We allow successive results of the transpose WQEs to be batched before being sent to a remote peer. This better saturates the network by aggregating messages. The batch size is controlled by SEND_BATCH_SIZE.
Iterate across nodes before blocks when posting the WQEs

This allows better saturation of the network. It is controlled by NBR_BATCHES_PER_PASSAGE, which sets the number of batches to send to a remote peer before moving to the next one. The old behavior corresponds to large values of this parameter, i.e., NBR_BATCHES_PER_PASSAGE >> 1.
Option to force the regular case

Through TL/MLX5's env variable FORCE_REGULAR, we can force the chosen block dimensions to divide ppn. This option is useful 1) for debugging purposes, and 2) because the regular case is expected to perform better than the irregular case.

All these optimizations are independent, but we introduce them in a single PR to avoid resolving many conflicts.