Backport to 2.8: PTX support for Blackwell #3624

Merged · 14 commits · Jan 31, 2025
16 changes: 16 additions & 0 deletions docs/libcudacxx/ptx/instructions.rst
@@ -7,10 +7,12 @@ PTX Instructions
:maxdepth: 1

instructions/barrier_cluster
instructions/clusterlaunchcontrol
instructions/cp_async_bulk
instructions/cp_async_bulk_commit_group
instructions/cp_async_bulk_wait_group
instructions/cp_async_bulk_tensor
instructions/cp_async_mbarrier_arrive
instructions/cp_reduce_async_bulk
instructions/cp_reduce_async_bulk_tensor
instructions/fence
@@ -21,8 +23,22 @@ PTX Instructions
instructions/mbarrier_expect_tx
instructions/mbarrier_test_wait
instructions/mbarrier_try_wait
instructions/multimem_ld_reduce
instructions/multimem_red
instructions/multimem_st
instructions/red_async
instructions/st_async
instructions/st_bulk
instructions/tcgen05_alloc
instructions/tcgen05_commit
instructions/tcgen05_cp
instructions/tcgen05_fence
instructions/tcgen05_ld
instructions/tcgen05_mma
instructions/tcgen05_mma_ws
instructions/tcgen05_shift
instructions/tcgen05_st
instructions/tcgen05_wait
instructions/tensormap_replace
instructions/tensormap_cp_fenceproxy
instructions/special_registers
11 changes: 11 additions & 0 deletions docs/libcudacxx/ptx/instructions/clusterlaunchcontrol.rst
@@ -0,0 +1,11 @@
.. _libcudacxx-ptx-instructions-clusterlaunchcontrol:

clusterlaunchcontrol
====================

- PTX ISA:
`clusterlaunchcontrol.try_cancel <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-try-cancel>`__
- PTX ISA:
`clusterlaunchcontrol.query_cancel <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-clusterlaunchcontrol-query-cancel>`__

.. include:: generated/clusterlaunchcontrol.rst
5 changes: 5 additions & 0 deletions docs/libcudacxx/ptx/instructions/cp_async_bulk_tensor.rst
@@ -21,3 +21,8 @@ Multicast
---------

.. include:: generated/cp_async_bulk_tensor_multicast.rst

Scatter / Gather
----------------

.. include:: generated/cp_async_bulk_tensor_gather_scatter.rst
10 changes: 10 additions & 0 deletions docs/libcudacxx/ptx/instructions/cp_async_mbarrier_arrive.rst
@@ -0,0 +1,10 @@
.. _libcudacxx-ptx-instructions-cp-async-mbarrier-arrive:

cp.async.mbarrier.arrive
========================

- PTX ISA:
`cp.async.mbarrier.arrive <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive>`__

.. include:: generated/cp_async_mbarrier_arrive.rst
.. include:: generated/cp_async_mbarrier_arrive_noinc.rst
10 changes: 10 additions & 0 deletions docs/libcudacxx/ptx/instructions/fence.rst
@@ -13,6 +13,11 @@ fence

.. include:: generated/fence.rst

fence.sync_restrict
-------------------

.. include:: generated/fence_sync_restrict.rst

fence.mbarrier_init
-------------------

@@ -29,6 +34,11 @@ fence.proxy.async

.. include:: generated/fence_proxy_async.rst

fence.proxy.async.sync_restrict
-------------------------------

.. include:: generated/fence_proxy_async_generic_sync_restrict.rst

fence.proxy.tensormap
---------------------

@@ -0,0 +1,68 @@
..
This file was automatically generated. Do not edit.

clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.b128 [addr], [smem_bar]; // PTX ISA 86, SM_100
template <typename = void>
__device__ static inline void clusterlaunchcontrol_try_cancel(
void* addr,
uint64_t* smem_bar);

clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.multicast::cluster::all.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.try_cancel.async.shared::cta.mbarrier::complete_tx::bytes.multicast::cluster::all.b128 [addr], [smem_bar]; // PTX ISA 86, SM_100a, SM_101a
template <typename = void>
__device__ static inline void clusterlaunchcontrol_try_cancel_multicast(
void* addr,
uint64_t* smem_bar);

clusterlaunchcontrol.query_cancel.is_canceled.pred.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.query_cancel.is_canceled.pred.b128 pred_is_canceled, try_cancel_response; // PTX ISA 86, SM_100
template <typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline bool clusterlaunchcontrol_query_cancel_is_canceled(
B128 try_cancel_response);

clusterlaunchcontrol.query_cancel.get_first_ctaid::x.b32.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.query_cancel.get_first_ctaid::x.b32.b128 ret_dim, try_cancel_response; // PTX ISA 86, SM_100
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true, typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline B32 clusterlaunchcontrol_query_cancel_get_first_ctaid_x(
B128 try_cancel_response);

clusterlaunchcontrol.query_cancel.get_first_ctaid::y.b32.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.query_cancel.get_first_ctaid::y.b32.b128 ret_dim, try_cancel_response; // PTX ISA 86, SM_100
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true, typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline B32 clusterlaunchcontrol_query_cancel_get_first_ctaid_y(
B128 try_cancel_response);

clusterlaunchcontrol.query_cancel.get_first_ctaid::z.b32.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.query_cancel.get_first_ctaid::z.b32.b128 ret_dim, try_cancel_response; // PTX ISA 86, SM_100
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true, typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline B32 clusterlaunchcontrol_query_cancel_get_first_ctaid_z(
B128 try_cancel_response);

clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// clusterlaunchcontrol.query_cancel.get_first_ctaid.v4.b32.b128 block_dim, try_cancel_response; // PTX ISA 86, SM_100
template <typename B32, enable_if_t<sizeof(B32) == 4, bool> = true, typename B128, enable_if_t<sizeof(B128) == 16, bool> = true>
__device__ static inline void clusterlaunchcontrol_query_cancel_get_first_ctaid(
B32 (&block_dim)[4],
B128 try_cancel_response);
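
Usage sketch (not part of this PR): one way these wrappers could be combined in a kernel. It assumes ``longlong2`` as the 16-byte response type and relies on the existing ``cuda::ptx::mbarrier_*`` helpers for synchronization; the single-thread structure and phase handling are illustrative only.

.. code:: cuda

   #include <cuda/ptx>
   #include <cstdint>

   __global__ void try_cancel_kernel()
   {
     // 16-byte response buffer and mbarrier, both in shared memory (assumption: longlong2 as the B128 type).
     __shared__ alignas(16) longlong2 response;
     __shared__ uint64_t bar;

     if (threadIdx.x == 0)
     {
       cuda::ptx::mbarrier_init(&bar, 1);
       // Expect the 16-byte response to be delivered asynchronously.
       cuda::ptx::mbarrier_arrive_expect_tx(
         cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar, 16);
       // Request cancellation of a cluster that has not launched yet.
       cuda::ptx::clusterlaunchcontrol_try_cancel(&response, &bar);
       // Wait for the first mbarrier phase to complete.
       while (!cuda::ptx::mbarrier_try_wait_parity(&bar, 0)) {}

       if (cuda::ptx::clusterlaunchcontrol_query_cancel_is_canceled(response))
       {
         // First CTA id (x component) of the canceled cluster.
         uint32_t x = cuda::ptx::clusterlaunchcontrol_query_cancel_get_first_ctaid_x<uint32_t>(response);
         // ... process the canceled cluster's work starting at ctaid.x == x ...
         (void) x;
       }
     }
   }
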
38 changes: 35 additions & 3 deletions docs/libcudacxx/ptx/instructions/generated/cp_async_bulk.rst
@@ -5,7 +5,7 @@ cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [smem_bar]; // 1a. unicast PTX ISA 80, SM_90
// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [smem_bar]; // PTX ISA 80, SM_90
// .dst = { .shared::cluster }
// .src = { .global }
template <typename = void>
@@ -17,11 +17,27 @@ cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes
const uint32_t& size,
uint64_t* smem_bar);

cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [smem_bar]; // PTX ISA 86, SM_90
// .dst = { .shared::cta }
// .src = { .global }
template <typename = void>
__device__ static inline void cp_async_bulk(
cuda::ptx::space_shared_t,
cuda::ptx::space_global_t,
void* dstMem,
const void* srcMem,
const uint32_t& size,
uint64_t* smem_bar);
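
Usage sketch (not part of this PR): the new ``shared::cta`` destination variant used to stage a tile from global memory into CTA-local shared memory, with completion tracked by an mbarrier. The buffer size and the ``cuda::ptx::mbarrier_*`` helpers are assumptions for illustration.

.. code:: cuda

   #include <cuda/ptx>
   #include <cstdint>

   __global__ void bulk_copy_to_shared(const int* gmem_src)
   {
     __shared__ alignas(16) int smem_dst[256];  // 1024 bytes, a multiple of 16
     __shared__ uint64_t bar;

     if (threadIdx.x == 0)
     {
       cuda::ptx::mbarrier_init(&bar, 1);
       cuda::ptx::mbarrier_arrive_expect_tx(
         cuda::ptx::sem_release, cuda::ptx::scope_cta, cuda::ptx::space_shared, &bar, sizeof(smem_dst));
       // cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes
       cuda::ptx::cp_async_bulk(
         cuda::ptx::space_shared, cuda::ptx::space_global, smem_dst, gmem_src, sizeof(smem_dst), &bar);
       while (!cuda::ptx::mbarrier_try_wait_parity(&bar, 0)) {}
     }
     __syncthreads();
     // smem_dst now holds 256 ints copied from gmem_src.
   }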

cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [rdsmem_bar]; // 2. PTX ISA 80, SM_90
// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes [dstMem], [srcMem], size, [rdsmem_bar]; // PTX ISA 80, SM_90
// .dst = { .shared::cluster }
// .src = { .shared::cta }
template <typename = void>
@@ -37,7 +53,7 @@ cp.async.bulk.global.shared::cta.bulk_group
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk.dst.src.bulk_group [dstMem], [srcMem], size; // 3. PTX ISA 80, SM_90
// cp.async.bulk.dst.src.bulk_group [dstMem], [srcMem], size; // PTX ISA 80, SM_90
// .dst = { .global }
// .src = { .shared::cta }
template <typename = void>
@@ -47,3 +63,19 @@ cp.async.bulk.global.shared::cta.bulk_group
void* dstMem,
const void* srcMem,
const uint32_t& size);
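
Usage sketch (not part of this PR): writing staged results back to global memory with the ``bulk_group`` completion mechanism, drained with the existing ``cp_async_bulk_commit_group`` / ``cp_async_bulk_wait_group`` wrappers. Buffer shapes are illustrative.

.. code:: cuda

   #include <cuda/ptx>
   #include <cstdint>

   __global__ void bulk_copy_to_global(int* gmem_dst)
   {
     __shared__ alignas(16) int smem_src[256];
     // ... produce results into smem_src ...
     __syncthreads();

     if (threadIdx.x == 0)
     {
       // cp.async.bulk.global.shared::cta.bulk_group
       cuda::ptx::cp_async_bulk(
         cuda::ptx::space_global, cuda::ptx::space_shared, gmem_dst, smem_src, sizeof(smem_src));
       cuda::ptx::cp_async_bulk_commit_group();
       // Wait until all committed bulk-async groups have completed (0 groups left pending).
       cuda::ptx::cp_async_bulk_wait_group(cuda::ptx::n32_t<0>{});
     }
   }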

cp.async.bulk.global.shared::cta.bulk_group.cp_mask
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk.dst.src.bulk_group.cp_mask [dstMem], [srcMem], size, byteMask; // PTX ISA 86, SM_100
// .dst = { .global }
// .src = { .shared::cta }
template <typename = void>
__device__ static inline void cp_async_bulk_cp_mask(
cuda::ptx::space_global_t,
cuda::ptx::space_shared_t,
void* dstMem,
const void* srcMem,
const uint32_t& size,
const uint16_t& byteMask);
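
Usage sketch (not part of this PR): the ``cp_mask`` variant follows the same commit/wait pattern as above, with ``byteMask`` selecting which bytes of each 16-byte segment are written (per the PTX ISA description of ``cp_mask``). The mask value is an arbitrary illustration; this variant targets SM_100.

.. code:: cuda

   #include <cuda/ptx>
   #include <cstdint>

   __global__ void masked_bulk_copy(int* gmem_dst)
   {
     __shared__ alignas(16) int smem_src[64];
     // ... produce results into smem_src ...
     __syncthreads();

     if (threadIdx.x == 0)
     {
       const uint16_t byte_mask = 0x0F0F;  // bytes 0-3 and 8-11 of each 16-byte segment
       cuda::ptx::cp_async_bulk_cp_mask(
         cuda::ptx::space_global, cuda::ptx::space_shared, gmem_dst, smem_src, sizeof(smem_src), byte_mask);
       cuda::ptx::cp_async_bulk_commit_group();
       cuda::ptx::cp_async_bulk_wait_group(cuda::ptx::n32_t<0>{});
     }
   }
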
@@ -5,7 +5,7 @@ cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::clu
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: cuda

// cp.async.bulk{.dst}{.src}.mbarrier::complete_tx::bytes.multicast::cluster [dstMem], [srcMem], size, [smem_bar], ctaMask; // 1. PTX ISA 80, SM_90a
// cp.async.bulk.dst.src.mbarrier::complete_tx::bytes.multicast::cluster [dstMem], [srcMem], size, [smem_bar], ctaMask; // PTX ISA 80, SM_90a, SM_100a, SM_101a
// .dst = { .shared::cluster }
// .src = { .global }
template <typename = void>