[Feature Request] Improvements or guidelines are needed for data copying with the **bfp8_b** type. #15639
Comments
@llongTT @TT-BrianLiu @tt-aho, can someone please review this issue and help provide some guidance?
Thank you for your response. However, I still have a few more questions. I also believe that performing small read/write operations through the NOC is inefficient. Setting hardware-specific characteristics aside, reading more data generally results in slower performance. Is the TT NPU efficient enough at reading data in tile units to handle reading 17 times more data than a single row requires?
Storing data in BFP8 has two main benefits: it reduces the space needed to store the data, and it reduces the amount of data transferred. The transfer benefit mainly applies when you read or need all (or most) of the data in a tile; having to do multiple small transactions to extract the rows, tile by tile, seems very suboptimal. In the case of embeddings, we don't want full tiles, we want to read rows. The most optimal format, in my view, is bfloat16 in row-major layout, which is what our current embeddings support, not tiled layout: with row-major you can read your entire row in one transaction, whereas tiled weights require you to iterate and extract the row tile by tile even without bfp8 (tiled bfloat16, for example, would be two 32 B transactions per tile). My comments above were about how to get this working with bfp8, since I thought that was the requirement. But if this is really a question about performance for embeddings or other ops that operate on rows, then bfloat16 in row-major layout would be the best option for this use case, unless we need the space savings that come from BFP8.
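For concreteness, here is a back-of-the-envelope check of the numbers quoted in this thread (the two 32 B transactions per tile and the roughly 17x figure above). It assumes 32x32 tiles and a bfp8_b tile that stores a 64-byte shared-exponent section plus one sign/mantissa byte per element, which is consistent with the offsets quoted in the issue body below:

```cpp
#include <cstdio>

int main() {
    constexpr int tile_h = 32, tile_w = 32;

    // bfloat16, row-major: a logical row is contiguous -> one 64 B read.
    constexpr int bf16_rm_row_bytes = tile_w * 2;              // 64 B, 1 txn

    // bfloat16, tiled: a logical row spans two 16x16 faces, each face-row
    // holding 16 elements * 2 B -> two 32 B transactions per tile.
    constexpr int bf16_tiled_row_bytes = 2 * (16 * 2);         // 64 B, 2 txns

    // bfp8_b, tiled: shared exponents and sign/mantissa bytes live in
    // separate sections, so taking the whole tile reads 64 + 1024 bytes.
    constexpr int bfp8_tile_bytes = 64 + tile_h * tile_w;      // 1088 B

    printf("bf16 row-major row: %d B in 1 txn\n", bf16_rm_row_bytes);
    printf("bf16 tiled row:     %d B in 2 txns\n", bf16_tiled_row_bytes);
    printf("bfp8_b full tile:   %d B (%dx a 64 B row)\n",
           bfp8_tile_bytes, bfp8_tile_bytes / bf16_rm_row_bytes);
}
```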
Is your feature request related to a problem? Please describe.
Hello,
The issue I have been facing recently is with embedding forward, an operation in which values must be copied from the weight tensor to the output tensor.
As a simple example: in the bfp8_b type, the shared exponents and the sign/mantissa bytes are stored separately within each tile.
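For reference, here is a sketch of the tile layout this implies, assuming the usual 32x32 tile split into four 16x16 faces, one shared 8-bit exponent per 16-element face-row (so 64 exponent bytes up front), and one sign/mantissa byte per element. The helper functions are hypothetical, written only to reproduce the offsets quoted later in this issue:

```cpp
#include <cstdint>

constexpr uint32_t kExpSectionBytes = 64;                 // 4 faces * 16 rows
constexpr uint32_t kFaceRowBytes    = 16;                 // 16 elems * 1 B
constexpr uint32_t kFaceBytes       = 16 * kFaceRowBytes; // 256 B per face

// Byte offset of the shared exponent for (face, row) within a tile.
constexpr uint32_t exp_offset(uint32_t face, uint32_t row) {
    return face * 16 + row;                               // 1 B per face-row
}

// Byte offset of the sign/mantissa bytes for (face, row) within a tile.
constexpr uint32_t mantissa_offset(uint32_t face, uint32_t row) {
    return kExpSectionBytes + face * kFaceBytes + row * kFaceRowBytes;
}

static_assert(exp_offset(0, 0) == 0);       // weight exponent, row 0
static_assert(exp_offset(0, 1) == 1);       // output exponent, row 1
static_assert(mantissa_offset(0, 0) == 64); // weight[0:0] = 64 (see below)
static_assert(mantissa_offset(0, 1) == 80); // output[1:0] = 80 (see below)
```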
Problem 1. Unnecessary CB allocation and copying for the shared exponent
Starting with the exponent, the behavior I want is to read the shared exponent for a given weight row in DRAM directly into its destination position in SRAM.
However, this is where the problem arises.
Currently, in TT, a NOC read or write is only possible when the DRAM address and the SRAM (L1) address have the same value modulo 32 (their "address % 32" must match).
Therefore, in the reader kernel, reading from weight DRAM[0] into weight SRAM[1] is not possible, because their address % 32 values (0 and 1) are different.
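In tt-metal dataflow-kernel terms, the direct read this rules out would look roughly like the sketch below; the address computation is elided, and the function and parameter names are illustrative:

```cpp
#include "dataflow_api.h"

// Sketch of the read the issue wants to perform, assuming the two
// addresses below have already been computed (address generation elided).
void read_exponent_direct(uint64_t weight_exp_noc_addr,  // DRAM, % 32 == 0
                          uint32_t output_exp_l1_addr) { // L1,   % 32 == 1
    // Disallowed today: src % 32 (0) != dst % 32 (1), so this 1-byte
    // read violates the NOC alignment constraint described above.
    noc_async_read(weight_exp_noc_addr, output_exp_l1_addr, 1);
    noc_async_read_barrier();
}
```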
Here is a workaround for the current issue, though it may introduce a performance drop: stage the read through a temporary CB whose alignment matches the DRAM source, then memcpy into the intended position, as sketched below.
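A minimal sketch of that staging pattern, written against the tt-metal dataflow API (`cb_reserve_back`, `get_write_ptr`, `noc_async_read`, `noc_async_read_barrier`, `cb_push_back`). The CB id and sizes are illustrative, the temp CB page is assumed large enough to absorb the alignment offset, and the memcpy is the extra copy this issue asks about:

```cpp
#include "dataflow_api.h"

// Read `nbytes` from an arbitrarily aligned DRAM source into an
// arbitrarily aligned L1 destination by staging through a temp CB.
void read_via_temp_cb(uint64_t src_dram_noc_addr,
                      uint32_t dst_l1_addr,
                      uint32_t nbytes,
                      uint32_t temp_cb_id) {
    // 1. Reserve one page of the temp CB; its write pointer is 32 B aligned.
    cb_reserve_back(temp_cb_id, 1);
    uint32_t temp_l1_addr = get_write_ptr(temp_cb_id);

    // 2. Offset into the staging buffer so that dst % 32 matches src % 32,
    //    which makes the NOC read legal.
    uint32_t src_misalign = static_cast<uint32_t>(src_dram_noc_addr & 31);
    uint32_t staging_addr = temp_l1_addr + src_misalign;
    noc_async_read(src_dram_noc_addr, staging_addr, nbytes);
    noc_async_read_barrier();

    // 3. The extra copy: move the bytes from the aligned staging buffer
    //    to the real (unaligned) destination in L1.
    memcpy(reinterpret_cast<void*>(dst_l1_addr),
           reinterpret_cast<const void*>(staging_addr), nbytes);

    cb_push_back(temp_cb_id, 1);
}
```

The same helper covers Problem 2 below: for the sign/mantissa bytes, the DRAM source at tile offset 64 (64 % 32 == 0) would be staged at offset 0 of the temp page and then memcpy'd to the L1 destination at tile offset 80 (80 % 32 == 16).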
Problem 2. Unnecessary CB allocation and copying for the sign and mantissa
The same issue occurs with the sign and mantissa as well.
In the same example, the offsets for the sign and mantissa are as follows:
weight[0:0] = 64
output[1:0] = 80
In this case, since 64 % 32 (= 0) and 80 % 32 (= 16) are different, direct copying from weight DRAM to weight SRAM is not possible.
Similarly, in this case, allocating a temporary CB, copying from weight DRAM into the temp SRAM buffer, and then using memcpy to copy from temp SRAM to the intended destination in SRAM (as in the sketch above) will solve the issue.

Describe the solution you'd like
Please confirm whether creating a temp CB and using memcpy is the best choice. If this is not an optimal approach, please suggest a better implementation, or improvements that would avoid the unnecessary operations for bfp8_b, such as removing the address % 32 constraint.