
MeshWorkload: Initial Implementation #16405

Open

tt-asaigal wants to merge 3 commits into main from asaigal/mesh_workload

Conversation

tt-asaigal
Contributor

@tt-asaigal commented Jan 2, 2025

Ticket

#16409

Problem description

MeshWorkload APIs need to be implemented as per the spec presented here. Please see the issue for more details and the scope of this work.

What's changed

TT-Metal Dispatch Changes:

  • Expose finalize as a generic function templated on Program and MeshWorkload to support computing L1 offsets for both data structures through a shared path (sketched after this list)
  • Move write_program_command_sequence out of the EnqueueProgramCommand and expose it as a utility function, since it is used by both MeshWorkload and Program
  • Add an API to query the program dispatch core per CQ per device, since this is needed by MeshCommandQueue
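As a rough illustration of the shared path (not verbatim from this PR): both structures go through the same templated entry point. The header path below is an assumption; the function name and namespace are taken from the explicit instantiations quoted further down in this review.

// Hedged sketch: both Program and MeshWorkload flow through one templated finalize path
// that computes L1 offsets. Header path and namespace qualification are assumptions.
#include "tt_metal/impl/program/program_dispatch_utils.hpp"

using namespace tt::tt_metal;

void finalize_offsets_example(Program& program, distributed::MeshWorkload& workload, Device* device) {
    // Single-device program: compute kernel text / semaphore / CB / RTA offsets in L1.
    program_dispatch::finalize_program_offsets(program, device);
    // MeshWorkload: the same shared path computes offsets for the workload's programs.
    program_dispatch::finalize_program_offsets(workload, device);
}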

TT-Mesh Changes:

  • Add the MeshWorkload class. Currently supports Single-Program-Multi-Device and Multi-Program-Multi-Device use cases. Heterogeneous Runtime Args will be brought up in a separate commit.
  • Add the MeshCommandQueue class. Currently piggybacks off the single-device Command Queues for performing IO. All functionality will eventually be moved into the MeshCommandQueue once we support MeshBuffer reads and writes.
  • The MeshCommandQueue maintains independent accelerator state for dispatching MeshWorkloads. Since buffer reads and writes are still done through the single-device CQs, this state must be kept in sync across all CQ objects. This is done through the experimental::write_program_commands function in mesh_workload_utils.hpp.
  • Expose top-level APIs to create, populate and enqueue a MeshWorkload through a MeshCommandQueue when using Fast Dispatch (see the sketch after this list)
  • Add several sanity, randomized and end-to-end tests for MeshWorkload creation and dispatch
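For orientation, a minimal sketch of how these pieces could compose under Fast Dispatch. CreateMeshWorkload and InsertProgramInMeshWorkload appear in the diff below; the EnqueueMeshWorkload name, the mesh_command_queue() accessor, the argument order, and the header path are assumptions, not confirmed API.

// Hedged sketch only: several names here (EnqueueMeshWorkload, mesh_command_queue(),
// argument order of InsertProgramInMeshWorkload, header path) are assumptions.
#include "tt_metal/distributed/distributed.hpp"

using namespace tt::tt_metal;
using namespace tt::tt_metal::distributed;

void dispatch_workload(MeshDevice* mesh_device, Program&& program) {
    // Create an empty workload and bind the program to a 2x2 logical device range.
    MeshWorkload workload = CreateMeshWorkload();
    LogicalDeviceRange device_range(CoreCoord{0, 0}, CoreCoord{1, 1});
    InsertProgramInMeshWorkload(workload, device_range, std::move(program));

    // Enqueue through the MeshCommandQueue (Fast Dispatch only at this stage).
    MeshCommandQueue& mesh_cq = mesh_device->mesh_command_queue();  // accessor name assumed
    EnqueueMeshWorkload(mesh_cq, workload, /*blocking=*/false);     // API name assumed
}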

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new models tests passes
  • New/Existing tests provide coverage for changes

@tt-asaigal force-pushed the asaigal/mesh_workload branch from 0e77e6a to 6a95e83 on January 2, 2025 21:38
@tt-asaigal marked this pull request as ready for review on January 2, 2025 22:00
@tt-asaigal force-pushed the asaigal/mesh_workload branch 3 times, most recently from 05ef361 to c1610b6 on January 3, 2025 17:15
@tt-asaigal
Contributor Author

tests/tt_metal/distributed/distributed_fixture.hpp (outdated, resolved)
}

void TearDown() override {
    mesh_device_->close_devices();
Contributor

In this PR, @TT-BrianLiu found that mesh_device_ isn't even created if the test is skipped, so TearDown needs a conditional block.

This is duplicated code btw. Can we use the fixture you defined here in place of the one in tests/ttnn/unit_tests/gtests/ttnn_test_fixtures.hpp? https://github.com/tenstorrent/tt-metal/blob/aaf2d7304c138c046efd5c0a94c14da3e6f95ce4/tests/tt_metal/tt_metal/common/multi_device_fixture.hpp would be even better!

Ideally the tests would #include this header directly, but I think for the time being you can add a forwarding include in ttnn_test_fixtures and just delete T3kMultiDeviceFixture.

Collaborator

+1

Contributor Author

I've updated the teardown step for the distributed fixture. Changing the other multi-device fixtures is outside the scope of this PR; I'll clean this up in a separate change.

tt_metal/distributed/distributed.hpp (resolved)
tt_metal/distributed/mesh_command_queue.hpp (outdated, resolved)
tt_metal/impl/program/program_dispatch_utils.cpp (outdated, resolved)
Comment on lines 40 to 41
template <typename T>
void finalize(T& workload_type, Device* device);
Contributor

The friend relation here and in program_base_addr_on_core is non-ideal, especially when the state is mutated externally... Is it possible to keep the member function "finalize" on both mesh workload / program, but then move parts of the implementation here? For example, a shared utility may compute and return all of the necessary offsets / sizes based on the provided core type and some other parameters (e.g. the ones in workload.get_kernels(index), workload.get_kernel_groups(index), etc), then MeshWorkload / Program may use the result to set everything necessary internally?
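To make the suggestion concrete, a hypothetical shape for such a shared utility. All names and fields below are illustrative only, not from the PR; the tt-metal types (Kernel, KernelGroup, HalProgrammableCoreType) are assumed to be available.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative only: a free helper that computes offsets/sizes and hands them back,
// so Program / MeshWorkload keep a member finalize() that applies the result internally,
// avoiding the friend relation and external state mutation.
struct ProgramConfigOffsets {
    uint32_t rta_offset = 0;
    std::vector<uint32_t> crta_offsets;
    uint32_t sem_offset = 0;
    uint32_t cb_offset = 0;
    uint32_t kernel_text_offset = 0;
    uint32_t total_size = 0;
};

// Inputs would come from workload.get_kernels(index), workload.get_kernel_groups(index), etc.
ProgramConfigOffsets compute_program_config_offsets(
    HalProgrammableCoreType core_type,
    const std::vector<std::shared_ptr<Kernel>>& kernels,
    const std::vector<std::shared_ptr<KernelGroup>>& kernel_groups);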

Contributor

For program_base_addr_on_core, you can similarly have a shared implementation that works on a vector of sub-devices and the last used CQ; no need to templatize and no friend relation (unless I missed something?)

tests/tt_metal/distributed/test_mesh_workload.cpp (outdated, resolved)
tests/tt_metal/distributed/test_mesh_workload.cpp (outdated, resolved)
// The LogicalDeviceRange concept is fundamentally identical to the CoreRange concept
// Use this definition for now, since CoreRange contains several utility functions required
// in the MeshWorkload context. CoreRange can eventually be renamed to Range2D.
using LogicalDeviceRange = CoreRange;
Contributor

+1 here

@tt-asaigal force-pushed the asaigal/mesh_workload branch 4 times, most recently from f35feb2 to 118672f on January 6, 2025 21:27
tt_metal/impl/dispatch/command_queue.hpp (outdated, resolved)
tt_metal/distributed/mesh_device.cpp (outdated, resolved)
std::unordered_map<LogicalDeviceRange, Program>& get_programs() { return this->programs_; }
// For testing purposes only
void set_last_used_command_queue_for_testing(MeshCommandQueue* mesh_cq);
MeshCommandQueue* get_last_used_command_queue() const;
Collaborator

Is this API just for testing? What is it used for?

Contributor Author

Yes, this is an API for testing purposes only (for now at least). Each MeshDevice will eventually have multiple command queues. Each CQ has a separate WorkerConfigBufferMgr object that tracks the state of the program config ring buffer in L1/SRAM.
The last used command queue is referenced to query this object and derive global addresses in functions like get_sem_base_addr and get_cb_base_addr.

We have essentially the same function in our tt_metal CQs for the same purpose.
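For context, a hedged sketch of how a test might use this hook. set_last_used_command_queue_for_testing and get_last_used_command_queue are from the diff above; the mesh_command_queue() accessor and the namespaces are assumptions.

// Hedged, test-only sketch; mesh_command_queue() is an assumed accessor name.
using namespace tt::tt_metal;

void record_last_used_cq_for_address_queries(distributed::MeshDevice* mesh_device,
                                              distributed::MeshWorkload& workload) {
    distributed::MeshCommandQueue& mesh_cq = mesh_device->mesh_command_queue();
    // Remember which CQ dispatched the workload, so its WorkerConfigBufferMgr state
    // can be consulted when deriving addresses such as semaphore or CB base addresses.
    workload.set_last_used_command_queue_for_testing(&mesh_cq);

    // Later, a test queries the recorded CQ:
    distributed::MeshCommandQueue* last_cq = workload.get_last_used_command_queue();
    (void)last_cq;
}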

// Main User-Facing API building blocks
MeshWorkload();
void add_program(const LogicalDeviceRange& device_range, Program& program);
std::unordered_map<LogicalDeviceRange, Program>& get_programs() { return this->programs_; }
Collaborator

I don't think we want users to update this map externally. Should this just be a const ref?

Contributor Author

Good catch here. We want users to be able to mutate certain program state (e.g. runtime args) but not the map itself. I made the function return a const ref for the map and added a separate getter for a program.
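A minimal sketch of what the reworked accessors might look like; the per-program getter's name is an assumption, and the surrounding class members are trimmed for brevity.

#include <unordered_map>

// Sketch only: exact getter names may differ in the PR.
class MeshWorkload {
public:
    // Read-only view of the map: callers cannot add or remove programs through it.
    const std::unordered_map<LogicalDeviceRange, Program>& get_programs() const { return programs_; }

    // Mutable access to a single program, e.g. to update runtime args. Name is illustrative.
    Program& get_program_on_device_range(const LogicalDeviceRange& device_range) {
        return programs_.at(device_range);
    }

private:
    std::unordered_map<LogicalDeviceRange, Program> programs_;
};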

Collaborator

great!


void MeshWorkload::add_program(const LogicalDeviceRange& device_range, Program& program) {
    // Add a program to a MeshWorkload and tie it to a specific logical device range
    this->programs_[device_range] = std::move(program);
Collaborator

This is a bug. After the move, the original program parameter will be left in a moved-from state. Should the signature be Program&& program?

Contributor Author

@tt-asaigal Jan 7, 2025

I updated this; not sure why it doesn't show up in the diff here. It's there in the latest commit.
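For reference, the fixed signature presumably looks roughly like this sketch, based on the suggestion above:

// Taking the program by rvalue reference makes the ownership transfer explicit at call sites.
void MeshWorkload::add_program(const LogicalDeviceRange& device_range, Program&& program) {
    // Add a program to a MeshWorkload and tie it to a specific logical device range.
    this->programs_[device_range] = std::move(program);
}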

Collaborator

great!

@@ -278,6 +278,7 @@ class Device {
void load_sub_device_manager(SubDeviceManagerId sub_device_manager_id);
void clear_loaded_sub_device_manager();
LaunchMessageRingBufferState& get_worker_launch_message_buffer_state(SubDeviceId sub_device_id);
CoreCoord virtual_program_dispatch_core(uint8_t cq_id) const;
Collaborator

can you clarify what this API is?

Contributor Author

This API returns the virtual coordinates of the dispatch core responsible for program management on device. TT-Mesh infra needs access to this when generating the FD commands for a MeshWorkload.
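A small hedged example of the intended use when assembling Fast Dispatch commands for a MeshWorkload. get_devices() and virtual_program_dispatch_core() appear in this diff; the wrapper function and everything else is illustrative.

#include <vector>

// Hedged sketch: iterate the physical devices backing the mesh and look up the dispatch
// core that manages program state for the given command queue on each of them.
void collect_dispatch_cores(tt::tt_metal::distributed::MeshDevice* mesh_device, uint8_t cq_id,
                            std::vector<CoreCoord>& dispatch_cores_out) {
    for (Device* device : mesh_device->get_devices()) {
        dispatch_cores_out.push_back(device->virtual_program_dispatch_core(cq_id));
    }
}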

tt_metal/distributed/distributed.hpp (outdated, resolved)
tt_metal/distributed/distributed.cpp (outdated, resolved)
tt_metal/distributed/mesh_command_queue.hpp (outdated, resolved)
tt_metal/distributed/mesh_command_queue.hpp (outdated, resolved)
tt_metal/distributed/mesh_command_queue.hpp (outdated, resolved)
Comment on lines 1762 to 1768
template void finalize_program_offsets<Program>(Program&, Device*);
template void finalize_program_offsets<distributed::MeshWorkload>(distributed::MeshWorkload&, Device*);
template uint32_t program_base_addr_on_core<Program, Device*>(Program&, Device*, HalProgrammableCoreType);
template uint32_t program_base_addr_on_core<distributed::MeshWorkload, std::shared_ptr<distributed::MeshDevice>>(
distributed::MeshWorkload&, std::shared_ptr<distributed::MeshDevice>, HalProgrammableCoreType);
} // namespace program_dispatch
Collaborator

As discussed, let's just get rid of all this templating and just have interface classes.

Contributor Author

I went through a potential change with a Workload interface class here.

We can't remove the template parameters here until MeshDevice and Device inherit from an IDevice object.

This is because MeshWorkload needs to be templated on MeshDevice and Program needs to be templated on Device, due to a disconnect between the two classes and their implementations.

I think it makes sense to revisit this once we have Artem's Device cleanup on main.

Collaborator

Yes, as agreed we'll rework this once Artem merges in the changes on IDevice 👍

@tt-asaigal force-pushed the asaigal/mesh_workload branch 3 times, most recently from 7b0a60c to a40cc50 on January 7, 2025 02:40
@tt-asaigal force-pushed the asaigal/mesh_workload branch 3 times, most recently from ed5d8ec to d4be00f on January 7, 2025 20:18
@tt-asaigal force-pushed the asaigal/mesh_workload branch from d4be00f to f651fa0 on January 7, 2025 20:26

MeshWorkload CreateMeshWorkload() { return MeshWorkload(); }

void InsertProgramInMeshWorkload(
Contributor

Nit and minor (so feel free to ignore): consider renaming to AddProgramToMeshWorkload.

CoreType dispatch_core_type_;

public:
MeshCommandQueue(MeshDevice* mesh_device, uint32_t id);
Contributor

What does a MeshCommandQueue id correspond to? Are there only 2 MeshCommandQueues per MeshDevice?

uint32_t num_workers = 0;
for (auto& device : this->mesh_device_->get_devices()) {
    if (num_workers) {
        TT_FATAL(
Contributor

Will this always be true? If two devices have a different number of harvested rows/cols, then they can't be programmed with the same MeshWorkload, even if the number of workers used by the workload is the same.

}

void TearDown() override {
    if (!::testing::Test::IsSkipped()) {
Collaborator

It's probably safest to just check that mesh_device_ is not null.
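i.e., roughly this sketch:

void TearDown() override {
    // Guard against the skipped-test case where SetUp never created the mesh device.
    if (mesh_device_ != nullptr) {
        mesh_device_->close_devices();
    }
}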

Comment on lines +4 to +5
// dispatch.hpp
#include "dispatch.hpp"
Contributor

Remove comment?

And normally the style in the codebase is to include the full path to the header, right?

Comment on lines +299 to +308
program.get_program_config(index).rta_offset = rta_offset;
program.get_program_config(index).crta_offsets = crta_offsets;
program.get_program_config(index).crta_sizes = crta_sizes;
program.get_program_config(index).sem_offset = sem_offset;
program.get_program_config(index).sem_size = sem_size;
program.get_program_config(index).cb_offset = cb_offset;
program.get_program_config(index).cb_size = cb_size;
program.get_program_config(index).kernel_text_offset = kernel_text_offset;
program.get_program_config(index).kernel_text_size = kernel_text_size;
program.get_program_config_sizes()[index] = offset;
Contributor

Nit: would it be better to just do auto& program_config = program.get_program_config(index); and then just use program_config?
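i.e., roughly:

// Fetch the config once instead of calling get_program_config(index) for every field.
auto& program_config = program.get_program_config(index);
program_config.rta_offset = rta_offset;
program_config.crta_offsets = crta_offsets;
program_config.crta_sizes = crta_sizes;
program_config.sem_offset = sem_offset;
program_config.sem_size = sem_size;
program_config.cb_offset = cb_offset;
program_config.cb_size = cb_size;
program_config.kernel_text_offset = kernel_text_offset;
program_config.kernel_text_size = kernel_text_size;
program.get_program_config_sizes()[index] = offset;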

Comment on lines +308 to +318
// void MeshWorkload::set_runtime_args(const LogicalDeviceRange& device_range, const CoreRangeSet& core_range_set,
// KernelHandle kernel_id, const std::vector<uint32_t> runtime_args) {
// std::size_t intersection_count = 0;

// for (auto& program_on_grid : this->programs_) {
// auto& program_device_range = program_on_grid.first;
// if (device_range.intersects(program_device_range)) {
// program_to_set_rt
// }
// }
// }
Contributor

Remove?
