
#16391: propagate sub_device_ids to mesh #16410

Open
wants to merge 7 commits into main from snijjar/issue-16391

Conversation

SeanNijjar
Contributor

@SeanNijjar SeanNijjar commented Jan 2, 2025

  • Further updates the all-gather-async tests to pass sub-device ID information.
    • Also modifies the tests to tear down the fabric on exception, to avoid hangs. (Longer term this can hopefully be replaced with something cleaner, like a teardown-callback registration exposed by metal, so we don't need to wrap in try-catch.)
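The teardown-on-exception change above boils down to a try/finally guard around the test body. A minimal sketch, assuming placeholder helpers (`setup_fabric`, `run_body`, `teardown_fabric` are illustrative names, not real tt-metal APIs):

```python
# Hedged sketch of the teardown-on-exception pattern described above. The
# helper names are placeholders, not real tt-metal APIs.

def run_with_fabric_teardown(setup_fabric, run_body, teardown_fabric):
    """Run a test body, guaranteeing fabric teardown even if the body raises."""
    setup_fabric()
    try:
        run_body()
    finally:
        # Tear down the persistent fabric unconditionally so a failing test
        # does not leave the device hung.
        teardown_fabric()

# Usage: even though the body raises, teardown still runs.
events = []

def failing_body():
    events.append("body")
    raise RuntimeError("simulated test failure")

try:
    run_with_fabric_teardown(
        lambda: events.append("setup"),
        failing_body,
        lambda: events.append("teardown"),
    )
except RuntimeError:
    pass

print(events)  # ['setup', 'body', 'teardown']
```

A registered teardown callback, as hoped for above, would replace the explicit try/finally with the same guarantee.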

Ticket

#16391

Problem description

All-gather v2 hangs when running with cluster axis API on persistent fabric

What's changed

In tests:

  • Updated tensor to/from-device calls to take sub-device IDs

Infra:

  • Updated the mesh tensor / mesh composer APIs to accept and properly handle sub-device IDs when copying tensors
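To illustrate what "accept and handle sub-device IDs" means for a copy call, here is a self-contained mock (not the real ttnn API; `copy_to_device` and the `device` dict are stand-ins): the copy stalls only on the listed sub-devices, defaulting to all of them when none are given.

```python
# Illustrative mock of a copy call that threads sub_device_ids through.
# copy_to_device and the device dict are stand-ins, not the real ttnn API.

def copy_to_device(tensor, device, sub_device_ids=None):
    # None mirrors the documented default of waiting on all sub-devices.
    stall_ids = sub_device_ids if sub_device_ids is not None else device["sub_devices"]
    device["last_stalled_on"] = list(stall_ids)  # record which sub-devices we waited on
    device["buffer"] = list(tensor)
    return device["buffer"]

device = {"sub_devices": ["sd0", "sd1"], "last_stalled_on": None, "buffer": None}

copy_to_device([1, 2, 3], device)                          # stalls on sd0 and sd1
copy_to_device([4, 5, 6], device, sub_device_ids=["sd0"])  # stalls only on sd0
```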

Checklist

Closes #16391

@SeanNijjar
Copy link
Contributor Author

FYI @xuncaiTT

@SeanNijjar force-pushed the snijjar/issue-16391 branch from 6254b84 to 3ff6e59 (January 3, 2025 15:56)
Contributor

Do we now also need to update the C++ APIs? See distributed_tensor.hpp

@tt-aho / @SeanNijjar - what is the high-level plan for the APIs involving sub-devices? Passing sub-device IDs to mesh composers is odd, as it is completely unrelated to the mesh distribution functionality. Do we plan to plumb sub-device IDs into all of the APIs that copy tensors under the hood? From the documentation: "The sub-device IDs to wait on. Defaults to all sub-devices." What does this mean exactly: do we wait before copying a tensor, or after? If this is a synchronization primitive, can we make it an explicit API instead, like ttnn.wait_for_subdevices(...)?

@ayerofieiev-tt

Contributor

This is for stalling before reading/writing buffers.
I am currently working on a new API similar to what you proposed, but instead of an explicit stall API, it adjusts the default set of sub-devices to stall on. Adjusting a stored/cached stall list minimizes the burden on the user: they don't have to inject synchronization calls themselves or track their own stall list everywhere. This should also let us remove the need to propagate sub_device_ids to all of these APIs.

Ex below:

What would be coded now:

sub_device_0 = ...
sub_device_1 = ...
manager = create_manager([sub_device_0, sub_device_1])
load_manager(manager)
run_long_running_op_on_sub_device_1()
write_buffer(sub_device_ids=[sub_device_0])
run_op_on_sub_device_0()
read_buffer(sub_device_ids=[sub_device_0])

With the new API (adjust_default_stalls is the new API; not its final name):

sub_device_0 = ...
sub_device_1 = ...
manager = create_manager([sub_device_0, sub_device_1])
load_manager(manager)
run_long_running_op_on_sub_device_1()
adjust_default_stalls([sub_device_0])
write_buffer()
run_op_on_sub_device_0()
read_buffer()
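The cached-default-stalls idea above can be sketched as a small runnable mock. All names here (`StallContext`, `set_default_stalls`, `write_buffer`, `read_buffer`) are illustrative stand-ins, not the real tt-metal API:

```python
# Runnable sketch of the cached-default-stalls pattern. All names are
# illustrative stand-ins, not the real tt-metal API.

class StallContext:
    """Caches the sub-device IDs that buffer reads/writes stall on by default."""

    def __init__(self, sub_device_ids):
        self.default_stalls = list(sub_device_ids)  # stall on all by default
        self.stall_log = []

    def set_default_stalls(self, sub_device_ids):
        # A one-time adjustment replaces per-call sub_device_ids arguments.
        self.default_stalls = list(sub_device_ids)

    def write_buffer(self, data, sub_device_ids=None):
        ids = sub_device_ids if sub_device_ids is not None else self.default_stalls
        self.stall_log.append(tuple(ids))  # record what this write stalled on

    def read_buffer(self, sub_device_ids=None):
        ids = sub_device_ids if sub_device_ids is not None else self.default_stalls
        self.stall_log.append(tuple(ids))  # record what this read stalled on

ctx = StallContext(["sub_device_0", "sub_device_1"])
ctx.set_default_stalls(["sub_device_0"])
ctx.write_buffer("tensor")  # no per-call sub_device_ids needed
ctx.read_buffer()

print(ctx.stall_log)  # [('sub_device_0',), ('sub_device_0',)]
```

The design win is that callers only touch the stall list at the point where it changes, rather than threading it through every read/write call site.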

@SeanNijjar SeanNijjar (Contributor Author) Jan 6, 2025

@tt-aho - based on the discussion above, I take it the recommendation is to abandon the part of this PR that updates the mesh composer, and once your changes are available, rebase and merge (after review, of course). Correct?

Contributor

I think so. This is my current PR, for reference: #16473. I'm planning to add the new API first, then remove the sub_device_ids propagation from the read/write APIs in a subsequent PR.

Contributor

Thanks, this is great! Makes sense; also +1 to using the term "set" instead of "adjust", as per #16473.
