[Feature Request] Improvements or guidelines are needed for data copying with the bfp8_b type. #15639

Open
hschoi4448 opened this issue Dec 3, 2024 · 4 comments
Labels
feature-request (External feature request), moreh (moreh contribution)

Comments

@hschoi4448
Contributor

Is your feature request related to a problem? Please describe.
Hello,
The issue I'm currently facing is with embedding forward, an operation where values need to be copied from the weight tensor to the output tensor.

For a simple example,

  • weight tensor size: [32, 32]
  • output tensor size: [5, 32]
  • weight[0, :] needs to be copied to output[1, :]

[Image: diagram of copying weight[0, :] to output[1, :]]

In the bfp8_b format, the shared exponents and the sign/mantissa bytes are stored in separate sections of the tile.
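For reference, here is a minimal sketch of the tile layout this issue relies on, assuming the usual 32x32 tile split into four 16x16 faces with one shared-exponent byte per 16-element face row. The names and the face ordering are illustrative assumptions, but the resulting sizes match the 1088-byte figure and the 64/80 offsets discussed below.

```cpp
#include <cstdint>

// Assumed bfp8_b tile layout: a 64-byte shared-exponent section (4 faces x 16
// rows = 64 exponent bytes) followed by 1024 sign+mantissa bytes (one byte
// per element of the 32x32 tile).
constexpr uint32_t kFaceRows      = 16;                 // rows per 16x16 face
constexpr uint32_t kFaces         = 4;                  // faces per 32x32 tile
constexpr uint32_t kExpBytes      = kFaces * kFaceRows; // 64 exponent bytes
constexpr uint32_t kMantissaBytes = 32 * 32;            // 1024 sign+mantissa bytes
constexpr uint32_t kTileBytes     = kExpBytes + kMantissaBytes; // 1088 bytes

// Byte offset of the sign+mantissa data for a given face row (face-major
// ordering assumed).
constexpr uint32_t mantissa_offset(uint32_t face, uint32_t row) {
    return kExpBytes + (face * kFaceRows + row) * kFaceRows; // 16 bytes per face row
}

static_assert(kTileBytes == 1088);
static_assert(mantissa_offset(0, 0) == 64); // weight[0:0] in Problem 2 below
static_assert(mantissa_offset(0, 1) == 80); // output[1:0] in Problem 2 below
```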

Problem 1. Unnecessary CB allocation and copying for the shared exponent

Starting with the exponent, the behavior I want is as follows:

  1. In the Reader, copy weight DRAM[0] to weight SRAM[1].
  2. Copy weight SRAM[1] to output DRAM[1].

However, this is where the problem arises.

Currently, in TT, a NOC read or write between DRAM and SRAM is only possible when the "address % 32" of the source and destination match.
Therefore, in the Reader, reading from weight DRAM[0] to weight SRAM[1] is not possible, because their address % 32 values (0 and 1) are different.
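To make the constraint concrete, here is the rule as stated above, evaluated for the exponent bytes in this example (just a sketch of the check; the actual constraint is enforced by the NOC, not by a helper like this):

```cpp
#include <cstdint>

// DRAM <-> L1 NOC constraint as described above: source and destination
// addresses must have the same offset within a 32-byte window.
constexpr bool noc_offsets_match(uint32_t dram_addr, uint32_t l1_addr) {
    return (dram_addr % 32) == (l1_addr % 32);
}

// Exponent byte of weight row 0 (offset 0) vs. the slot for output row 1
// (offset 1): 0 != 1, so a direct NOC copy is not allowed.
static_assert(!noc_offsets_match(0, 1));
```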

Here’s a workaround for the current issue:

  1. Allocate an additional temporary CB.
  2. In the Reader, copy weight DRAM[0] to temp SRAM[0].
  3. Copy temp SRAM[0] to weight SRAM[1].
  4. Copy weight SRAM[1] to output DRAM[1].

The extra CB and the extra copy presumably cause a performance drop here.
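For concreteness, here is a minimal sketch of that workaround as a tt-metal dataflow (reader) kernel. The CB indices, runtime-arg layout, and byte offsets are illustrative assumptions for this example, not the actual embeddings kernel; noc_async_read_tile, InterleavedAddrGenFast, and the cb_* calls are the standard dataflow helpers, hedged to the version of the API I'm familiar with.

```cpp
#include <cstring>
#include "dataflow_api.h"

void kernel_main() {
    // Illustrative runtime args: base address of the weight buffer in DRAM
    // and the id of the tile that holds weight[0, :].
    uint32_t weight_base_addr = get_arg_val<uint32_t>(0);
    uint32_t weight_tile_id   = get_arg_val<uint32_t>(1);

    constexpr uint32_t cb_out     = 0;     // CB where the output tile is assembled
    constexpr uint32_t cb_temp    = 1;     // the extra temporary CB from step 1
    constexpr uint32_t tile_bytes = 1088;  // one bfp8_b tile (see layout sketch above)

    const InterleavedAddrGenFast<true> weight = {
        .bank_base_address = weight_base_addr,
        .page_size         = tile_bytes,
        .data_format       = DataFormat::Bfp8_b,
    };

    // Step 2: weight DRAM[0] -> temp SRAM[0]. The copy is tile-aligned on
    // both sides, so the address % 32 constraint is satisfied.
    cb_reserve_back(cb_temp, 1);
    uint32_t temp_addr = get_write_ptr(cb_temp);
    noc_async_read_tile(weight_tile_id, weight, temp_addr);
    noc_async_read_barrier();

    // Step 3: temp SRAM[0] -> weight SRAM[1]. A plain L1-to-L1 memcpy has no
    // NOC alignment requirement, which is exactly why the temp CB is needed.
    cb_reserve_back(cb_out, 1);
    uint32_t out_addr = get_write_ptr(cb_out);
    // Exponent byte: row-0 slot (offset 0) -> row-1 slot (offset 1); face 0 shown.
    std::memcpy(reinterpret_cast<void*>(out_addr + 1),
                reinterpret_cast<const void*>(temp_addr + 0), 1);
    // Sign+mantissa face row: offset 64 -> offset 80 (the offsets from Problem 2).
    std::memcpy(reinterpret_cast<void*>(out_addr + 80),
                reinterpret_cast<const void*>(temp_addr + 64), 16);
    cb_push_back(cb_out, 1);

    // Step 4 (weight SRAM[1] -> output DRAM[1]) would be done by the writer
    // kernel with noc_async_write / noc_async_write_barrier.
}
```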

Problem 2. Unnecessary CB allocation and copying for the sign and mantissa

The same issue occurs with the sign and mantissa as well.

In the same example, the byte offsets for the sign and mantissa are as follows:

  • weight[0:0] = 64
  • output[1:0] = 80

In this case, since 64 % 32 = 0 and 80 % 32 = 16 differ, direct copying from weight DRAM to weight SRAM is not possible.

Similarly, in this case, allocating a temporary CB, copying from weight DRAM to temp SRAM, and then using memcpy to copy from temp SRAM to weight SRAM will solve the issue.

Describe the solution you'd like

  1. Please confirm whether creating a temp CB and using memcpy is the best choice.

  2. If this is not an optimal approach, please suggest a better implementation or an improvement that avoids unnecessary operations with bfp8_b, such as removing the address % 32 constraint.

@hschoi4448 hschoi4448 added the feature-request and moreh labels on Dec 3, 2024
@ayerofieiev-tt
Member

@llongTT @TT-BrianLiu @tt-aho can someone please review this issue and help provide some guidance?

@tt-aho
Contributor

tt-aho commented Dec 3, 2024

To clarify, is the main objective here just how to copy single rows of a bfp8_b tile into another bfp8_b tile?

In this example, copying weight[0, :] to output[1, :], i.e. one row of the tile, would require copying 2 exponent bytes, one corresponding to the exponent of face 0 row 0 and one corresponding to face 1 row 0, and then the 16 elements of face 0 row 0 and the 16 elements of face 1 row 0, correct? As in the diagram below?
[Image: diagram of the two exponent bytes and two 16-element face rows involved in the copy]

Trying to copy a single byte, or to issue multiple small reads/writes over the NOC, would be very inefficient. If you want to use this approach, my suggestion would be to read full input tiles and construct the full output tile in SRAM before writing it out, e.g.:

  1. noc_async_read the full weight dram tile containing the corresponding row (if we haven't already read/cached it in L1) to temp weight sram tile
  2. noc_async_read the two face rows to the output sram tile. L1 <-> L1 requires only 16B alignment, and each sub-face row will be 16B aligned
  3. copy the 2 exponents to the output sram tile
  4. Repeat 1-3 until the full output tile is constructed
  5. write the output sram tile to dram
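A minimal sketch of steps 2 and 3 for one tile row, assuming the weight tile from step 1 already sits in L1 and the face-major offsets from the layout sketch earlier (64/80 for face 0 and 320/336 for face 1 are assumptions about face ordering). This fragment would sit inside a dataflow kernel like the reader sketched above; get_noc_addr here is the dataflow helper that forms a NOC address for a local L1 location.

```cpp
// Gather one logical tile row (two 16-byte face rows plus two exponent
// bytes) from an L1-resident weight tile into the output tile under assembly.
uint32_t weight_l1_addr = get_read_ptr(cb_weight);  // step 1 already done
uint32_t out_l1_addr    = get_write_ptr(cb_out);

// Step 2: L1 <-> L1 NOC transfers only require 16B alignment, and each
// 16-byte sub-face row is 16B aligned, so the mantissa rows move directly.
noc_async_read(get_noc_addr(weight_l1_addr + 64),   // face 0, row 0 mantissas
               out_l1_addr + 80, 16);               // -> face 0, row 1 slot
noc_async_read(get_noc_addr(weight_l1_addr + 320),  // face 1, row 0 mantissas
               out_l1_addr + 336, 16);              // -> face 1, row 1 slot
noc_async_read_barrier();

// Step 3: the two shared-exponent bytes are single bytes, so store them
// directly rather than issuing 1-byte NOC transactions.
volatile uint8_t* src = reinterpret_cast<volatile uint8_t*>(weight_l1_addr);
volatile uint8_t* dst = reinterpret_cast<volatile uint8_t*>(out_l1_addr);
dst[1]  = src[0];   // face 0: row-0 exponent -> row-1 slot
dst[17] = src[16];  // face 1: row-0 exponent -> row-1 slot
```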

However, there is an alternative implementation that could yield better performance and is potentially more portable (I do not currently have any data/estimation on how these approaches compare). The above approach constructs one output tile at a time, but we could instead construct an entire row of output tiles as follows:

  1. Find the weight tile rows needed to populate a given output row
  2. Read an entire weight tile row into L1
  3. Untilize the weight tile row to bfloat16
  4. noc_async_read the needed rows to the correct output row location (in bfloat16)
  5. Repeat 1-4 until the full output tile row is populated
  6. Tilize the output tile row to bfp8
  7. write the output sram tile row to dram
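A rough compute-kernel fragment for steps 3 and 6, assuming the standard tilize/untilize compute helpers (their exact signatures vary across tt-metal versions, so treat this as a sketch); the CB indices and the one-tile row width are illustrative.

```cpp
#include "compute_kernel_api/tilize.h"
#include "compute_kernel_api/untilize.h"

namespace NAMESPACE {
void MAIN {
    constexpr uint32_t cb_weight_tiled = 0;  // bfp8_b weight tile row (step 2)
    constexpr uint32_t cb_weight_rm    = 1;  // untilized bfloat16 rows (step 3)
    constexpr uint32_t cb_out_rm       = 2;  // gathered output rows (step 4, via NOC)
    constexpr uint32_t cb_out_tiled    = 3;  // re-tilized bfp8_b output (step 6)
    constexpr uint32_t tiles_per_row   = 1;  // illustrative tile-row width

    // Step 3: untilize one row of weight tiles to row-major bfloat16.
    untilize_init(cb_weight_tiled);
    cb_wait_front(cb_weight_tiled, tiles_per_row);
    cb_reserve_back(cb_weight_rm, tiles_per_row);
    untilize_block(cb_weight_tiled, tiles_per_row, cb_weight_rm);
    cb_push_back(cb_weight_rm, tiles_per_row);
    cb_pop_front(cb_weight_tiled, tiles_per_row);

    // Step 4 (gathering the needed rows over the NOC) happens in the
    // dataflow kernels, not here.

    // Step 6: tilize the assembled output row back to bfp8_b.
    tilize_init(cb_out_rm, tiles_per_row, cb_out_tiled);
    cb_wait_front(cb_out_rm, tiles_per_row);
    cb_reserve_back(cb_out_tiled, tiles_per_row);
    tilize_block(cb_out_rm, tiles_per_row, cb_out_tiled);
    cb_push_back(cb_out_tiled, tiles_per_row);
    cb_pop_front(cb_out_rm, tiles_per_row);
}
}
```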

It might be possible to add a new LLK API that handles step 4 through packing instead of falling back to a NOC copy, but for testing/bringup the NOC copy method should work.

@hschoi4448
Contributor Author

> Trying to copy a single byte, or to issue multiple small reads/writes over the NOC, would be very inefficient. If you want to use this approach, my suggestion would be to read full input tiles and construct the full output tile in SRAM before writing it out.

Thank you for your response. However, I still have a few more questions.

I also believe that performing small read/write operations through NOC is inefficient.

However, considering that the purpose of bfp8_b is to improve performance by reducing the amount of memory read/write in memory-bound models, preloading unnecessary data at the tile level seems contrary to its fundamental objective.

  1. Compared to reading only the necessary rows in bfloat16, reading an entire tile in bfp8_b results in significantly more data being read:
  • bfloat16 row: 64 bytes
  • bfp8_b tile: 1088 bytes

Hardware-specific characteristics aside, reading more data generally results in slower performance. In such cases we cannot use the bfp8_b type, because reading a full bfp8_b tile moves up to 17 times more data than reading a single bfloat16 row (1088 / 64 = 17).

Is the TT NPU efficient enough at reading data in tile units to absorb reading 17 times more data than a single bfloat16 row?

  2. If so, do kernels that are implemented to read only part of the data over the NOC need to be reworked?

@tt-aho

@tt-aho
Contributor

tt-aho commented Dec 4, 2024

Storing data in bfp8 has 2 main benefits: it reduces the space needed to store the data, and it transfers less data. But transferring less data is mainly beneficial if you read or need all/most of the data in the tile. Having to do multiple small transactions to extract the rows, and doing this tile by tile, seems very suboptimal. In the case of embeddings, we don't want full tiles; we want to read rows. The most optimal format from my view is bfloat16 in row-major layout, which is what our current embeddings op supports, not tiled layout: with row major you can read your entire row in 1 transaction, whereas a tiled layout requires you to iterate and extract the row tile by tile even if it weren't bfp8, e.g. tiled bfloat16 would be 2 32B txns per tile.

My comments above were about how to get this working with bfp8, as I thought that was the requirement. But if this is just a question about performance for embeddings or other ops that operate on rows, then using bfloat16 in row-major layout would be the best option for this use case, unless we need the space savings that come with bfp8.
