[Feature Request] Improvements or guidelines are needed for data copying with the bfp8_b type. #15639

Open
hschoi4448 opened this issue Dec 3, 2024 · 4 comments
Labels
feature-request (External feature request), moreh (moreh contribution)

Comments

@hschoi4448
Contributor

Is your feature request related to a problem? Please describe.
Hello,
The issue I'm currently facing is with embedding forward, an operation where values need to be copied from the weight tensor to the output tensor.

For a simple example,

  • weight tensor size: [32, 32]
  • output tensor size: [5, 32]
  • weight[0, :] needs to be copied to output[1, :]

[Image: diagram of copying weight[0, :] to output[1, :]]

In the bfp8_b format, the shared exponents and the sign/mantissa bytes are stored in separate sections of the tile.
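For reference, here is a minimal sketch of the tile layout this issue relies on, assuming the usual 32x32 tile split into four 16x16 faces with one shared-exponent byte per 16-element face row. The names and the face ordering are illustrative assumptions, but the resulting sizes match the 1088-byte figure and the 64/80 offsets discussed below.

```cpp
#include <cstdint>

// Assumed bfp8_b tile layout: a 64-byte shared-exponent section (4 faces x 16
// rows = 64 exponent bytes) followed by 1024 sign+mantissa bytes (one byte
// per element of the 32x32 tile).
constexpr uint32_t kFaceRows      = 16;                 // rows per 16x16 face
constexpr uint32_t kFaces         = 4;                  // faces per 32x32 tile
constexpr uint32_t kExpBytes      = kFaces * kFaceRows; // 64 exponent bytes
constexpr uint32_t kMantissaBytes = 32 * 32;            // 1024 sign+mantissa bytes
constexpr uint32_t kTileBytes     = kExpBytes + kMantissaBytes; // 1088 bytes

// Byte offset of the sign+mantissa data for a given face row (face-major
// ordering assumed).
constexpr uint32_t mantissa_offset(uint32_t face, uint32_t row) {
    return kExpBytes + (face * kFaceRows + row) * kFaceRows; // 16 bytes per face row
}

static_assert(kTileBytes == 1088);
static_assert(mantissa_offset(0, 0) == 64); // weight[0:0] in Problem 2 below
static_assert(mantissa_offset(0, 1) == 80); // output[1:0] in Problem 2 below
```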

Problem 1. Unnecessary CB allocation and copying for the shared exponent

Starting with the exponent, the behavior I want is as follows:

  1. In the Reader, copy weight DRAM[0] to weight SRAM[1].
  2. Copy weight SRAM[1] to output DRAM[1].

However, this is where the problem arises.

Currently, in TT, a NOC read or write between DRAM and SRAM is only possible when the "address % 32" of the source and destination match.
Therefore, in the Reader, reading from weight DRAM[0] to weight SRAM[1] is not possible, because their address % 32 values (0 and 1) are different.
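To make the constraint concrete, here is the rule as stated above, evaluated for the exponent bytes in this example (just a sketch of the check; the actual constraint is enforced by the NOC, not by a helper like this):

```cpp
#include <cstdint>

// DRAM <-> L1 NOC constraint as described above: source and destination
// addresses must have the same offset within a 32-byte window.
constexpr bool noc_offsets_match(uint32_t dram_addr, uint32_t l1_addr) {
    return (dram_addr % 32) == (l1_addr % 32);
}

// Exponent byte of weight row 0 (offset 0) vs. the slot for output row 1
// (offset 1): 0 != 1, so a direct NOC copy is not allowed.
static_assert(!noc_offsets_match(0, 1));
```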

Here’s a workaround for the current issue:

  1. Allocate an additional temporary CB.
  2. In the Reader, copy weight DRAM[0] to temp SRAM[0].
  3. Copy temp SRAM[0] to weight SRAM[1].
  4. Copy weight SRAM[1] to output DRAM[1].

The extra CB and the extra copy presumably cause a performance drop here.
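For concreteness, here is a minimal sketch of that workaround as a tt-metal dataflow (reader) kernel. The CB indices, runtime-arg layout, and byte offsets are illustrative assumptions for this example, not the actual embeddings kernel; noc_async_read_tile, InterleavedAddrGenFast, and the cb_* calls are the standard dataflow helpers, hedged to the version of the API I'm familiar with.

```cpp
#include <cstring>
#include "dataflow_api.h"

void kernel_main() {
    // Illustrative runtime args: base address of the weight buffer in DRAM
    // and the id of the tile that holds weight[0, :].
    uint32_t weight_base_addr = get_arg_val<uint32_t>(0);
    uint32_t weight_tile_id   = get_arg_val<uint32_t>(1);

    constexpr uint32_t cb_out     = 0;     // CB where the output tile is assembled
    constexpr uint32_t cb_temp    = 1;     // the extra temporary CB from step 1
    constexpr uint32_t tile_bytes = 1088;  // one bfp8_b tile (see layout sketch above)

    const InterleavedAddrGenFast<true> weight = {
        .bank_base_address = weight_base_addr,
        .page_size         = tile_bytes,
        .data_format       = DataFormat::Bfp8_b,
    };

    // Step 2: weight DRAM[0] -> temp SRAM[0]. The copy is tile-aligned on
    // both sides, so the address % 32 constraint is satisfied.
    cb_reserve_back(cb_temp, 1);
    uint32_t temp_addr = get_write_ptr(cb_temp);
    noc_async_read_tile(weight_tile_id, weight, temp_addr);
    noc_async_read_barrier();

    // Step 3: temp SRAM[0] -> weight SRAM[1]. A plain L1-to-L1 memcpy has no
    // NOC alignment requirement, which is exactly why the temp CB is needed.
    cb_reserve_back(cb_out, 1);
    uint32_t out_addr = get_write_ptr(cb_out);
    // Exponent byte: row-0 slot (offset 0) -> row-1 slot (offset 1); face 0 shown.
    std::memcpy(reinterpret_cast<void*>(out_addr + 1),
                reinterpret_cast<const void*>(temp_addr + 0), 1);
    // Sign+mantissa face row: offset 64 -> offset 80 (the offsets from Problem 2).
    std::memcpy(reinterpret_cast<void*>(out_addr + 80),
                reinterpret_cast<const void*>(temp_addr + 64), 16);
    cb_push_back(cb_out, 1);

    // Step 4 (weight SRAM[1] -> output DRAM[1]) would be done by the writer
    // kernel with noc_async_write / noc_async_write_barrier.
}
```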

Problem 2. Unnecessary CB allocation and copying for the sign and mantissa

The same issue occurs with the sign and mantissa as well.

In the same example, the byte offsets for the sign and mantissa are as follows:

  • weight[0:0] = 64
  • output[1:0] = 80

In this case, since 64 % 32 = 0 and 80 % 32 = 16 differ, direct copying from weight DRAM to weight SRAM is not possible.

Similarly, in this case, allocating a temporary CB, copying from weight DRAM to temp SRAM, and then using memcpy to copy from temp SRAM to weight SRAM will solve the issue.

Describe the solution you'd like

  1. Please confirm whether creating a temp CB and using memcpy is the best choice.

  2. If this is not an optimal approach, please suggest a better implementation or an improvement that avoids unnecessary operations with bfp8_b, such as removing the address % 32 constraint.

@hschoi4448 hschoi4448 added the feature-request and moreh labels on Dec 3, 2024
@ayerofieiev-tt
Member

@llongTT @TT-BrianLiu @tt-aho can someone please review this issue and help provide some guidance?

@tt-aho
Contributor

tt-aho commented Dec 3, 2024

To clarify, is the main objective here just how to copy single rows of a bfp8_b tile into another bfp8_b tile?

In this example, copying weight[0, :] to output[1, :], i.e. one row of the tile, would require copying 2 exponent bytes, one corresponding to the exponent of face 0 row 0 and one corresponding to face 1 row 0, and then the 16 elements of face 0 row 0 and the 16 elements of face 1 row 0, correct? As in the diagram below?
[Image: diagram of the two exponent bytes and two 16-element face rows involved in the copy]

Trying to copy a single byte, or to issue multiple small reads/writes over the NOC, would be very inefficient. If you want to use this approach, my suggestion would be to read full input tiles and construct the full output tile in SRAM before writing it out, e.g.:

  1. noc_async_read the full weight dram tile containing the corresponding row (if we haven't already read/cached it in L1) to temp weight sram tile
  2. noc_async_read the two face rows to the output sram tile. L1 <-> L1 requires only 16B alignment, and each sub-face row will be 16B aligned
  3. copy the 2 exponents to the output sram tile
  4. Repeat 1-3 until the full output tile is constructed
  5. write the output sram tile to dram
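A minimal sketch of steps 2 and 3 for one tile row, assuming the weight tile from step 1 already sits in L1 and the face-major offsets from the layout sketch earlier (64/80 for face 0 and 320/336 for face 1 are assumptions about face ordering). This fragment would sit inside a dataflow kernel like the reader sketched above; get_noc_addr here is the dataflow helper that forms a NOC address for a local L1 location.

```cpp
// Gather one logical tile row (two 16-byte face rows plus two exponent
// bytes) from an L1-resident weight tile into the output tile under assembly.
uint32_t weight_l1_addr = get_read_ptr(cb_weight);  // step 1 already done
uint32_t out_l1_addr    = get_write_ptr(cb_out);

// Step 2: L1 <-> L1 NOC transfers only require 16B alignment, and each
// 16-byte sub-face row is 16B aligned, so the mantissa rows move directly.
noc_async_read(get_noc_addr(weight_l1_addr + 64),   // face 0, row 0 mantissas
               out_l1_addr + 80, 16);               // -> face 0, row 1 slot
noc_async_read(get_noc_addr(weight_l1_addr + 320),  // face 1, row 0 mantissas
               out_l1_addr + 336, 16);              // -> face 1, row 1 slot
noc_async_read_barrier();

// Step 3: the two shared-exponent bytes are single bytes, so store them
// directly rather than issuing 1-byte NOC transactions.
volatile uint8_t* src = reinterpret_cast<volatile uint8_t*>(weight_l1_addr);
volatile uint8_t* dst = reinterpret_cast<volatile uint8_t*>(out_l1_addr);
dst[1]  = src[0];   // face 0: row-0 exponent -> row-1 slot
dst[17] = src[16];  // face 1: row-0 exponent -> row-1 slot
```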

However, there is an alternative implementation that could yield better performance and is potentially more portable (I do not currently have any data/estimation on how these approaches compare). The above approach constructs one output tile at a time, but we could instead construct an entire row of output tiles as follows:

  1. Find the weight tile rows needed to populate a given output row
  2. Read an entire weight tile row into L1
  3. Untilize the weight tile row to bfloat16
  4. noc_async_read the needed rows to the correct output row location (in bfloat16)
  5. Repeat 1-4 until the full output tile row is populated
  6. Tilize the output tile row to bfp8
  7. write the output sram tile row to dram
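A rough compute-kernel fragment for steps 3 and 6, assuming the standard tilize/untilize compute helpers (their exact signatures vary across tt-metal versions, so treat this as a sketch); the CB indices and the one-tile row width are illustrative.

```cpp
#include "compute_kernel_api/tilize.h"
#include "compute_kernel_api/untilize.h"

namespace NAMESPACE {
void MAIN {
    constexpr uint32_t cb_weight_tiled = 0;  // bfp8_b weight tile row (step 2)
    constexpr uint32_t cb_weight_rm    = 1;  // untilized bfloat16 rows (step 3)
    constexpr uint32_t cb_out_rm       = 2;  // gathered output rows (step 4, via NOC)
    constexpr uint32_t cb_out_tiled    = 3;  // re-tilized bfp8_b output (step 6)
    constexpr uint32_t tiles_per_row   = 1;  // illustrative tile-row width

    // Step 3: untilize one row of weight tiles to row-major bfloat16.
    untilize_init(cb_weight_tiled);
    cb_wait_front(cb_weight_tiled, tiles_per_row);
    cb_reserve_back(cb_weight_rm, tiles_per_row);
    untilize_block(cb_weight_tiled, tiles_per_row, cb_weight_rm);
    cb_push_back(cb_weight_rm, tiles_per_row);
    cb_pop_front(cb_weight_tiled, tiles_per_row);

    // Step 4 (gathering the needed rows over the NOC) happens in the
    // dataflow kernels, not here.

    // Step 6: tilize the assembled output row back to bfp8_b.
    tilize_init(cb_out_rm, tiles_per_row, cb_out_tiled);
    cb_wait_front(cb_out_rm, tiles_per_row);
    cb_reserve_back(cb_out_tiled, tiles_per_row);
    tilize_block(cb_out_rm, tiles_per_row, cb_out_tiled);
    cb_push_back(cb_out_tiled, tiles_per_row);
    cb_pop_front(cb_out_rm, tiles_per_row);
}
}
```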

It might be possible to add a new LLK API that handles step 4 through packing instead of falling back to a NOC copy, but for testing/bringup the NOC copy method should work.

@hschoi4448
Contributor Author

> Trying to copy a single byte, or to issue multiple small reads/writes over the NOC, would be very inefficient. If you want to use this approach, my suggestion would be to read full input tiles and construct the full output tile in SRAM before writing it out.

Thank you for your response. However, I still have a few more questions.

I also believe that performing small read/write operations through NOC is inefficient.

However, considering that the purpose of bfp8_b is to improve performance by reducing the amount of memory read/write in memory-bound models, preloading unnecessary data at the tile level seems contrary to its fundamental objective.

  1. Compared to reading only the necessary rows in bfloat16, reading an entire tile in bfp8_b results in significantly more data being read:
  • bfloat16 row: 64 bytes
  • bfp8_b tile: 1088 bytes

Hardware-specific characteristics aside, reading more data generally results in slower performance. In such cases we cannot use the bfp8_b type, because reading a full bfp8_b tile moves up to 17 times more data than reading a single bfloat16 row (1088 / 64 = 17).

Is the TT NPU efficient enough at reading data in tile units to absorb reading 17 times more data than a single bfloat16 row?

  2. If so, do kernels that are implemented to read only part of the data over the NOC need to be reworked?

@tt-aho

@tt-aho
Contributor

tt-aho commented Dec 4, 2024

Storing data in bfp8 has 2 main benefits: it reduces the space needed to store the data, and it transfers less data. But transferring less data is mainly beneficial if you read or need all/most of the data in the tile. Having to do multiple small transactions to extract the rows, and doing this tile by tile, seems very suboptimal. In the case of embeddings, we don't want full tiles; we want to read rows. The most optimal format from my view is bfloat16 in row-major layout, which is what our current embeddings op supports, not tiled layout: with row major you can read your entire row in 1 transaction, whereas a tiled layout requires you to iterate and extract the row tile by tile even if it weren't bfp8, e.g. tiled bfloat16 would be 2 32B txns per tile.

My comments above were about how to get this working with bfp8, as I thought that was the requirement. But if this is just a question about performance for embeddings or other ops that operate on rows, then using bfloat16 in row-major layout would be the best option for this use case, unless we need the space savings that come with bfp8.
