Support for Large attn_bias via Sparse Tensors or On‑The‑Fly Construction (seq_len ≈ 12 288) #11933
Replies: 1 comment
-
Does each query attend to a randomly selected set of keys, or does each query attend to keys in some kind of block/causal manner? (By random, I mean a non-standard pattern where a query can attend to arbitrary token positions without any specific structure.)
  - If it's a causal pattern, you can look into flash-attention or PyTorch cuDNN attention, which support causal masking directly.
  - If it's a causal-like pattern where each query attends to a prefix of keys, the fastest available implementation is flash-attention. You don't need to materialize the mask; you can just specify cumulative sequence lengths.
  - If it's a block-followed-by-causal pattern, flash-attention/xformers should be able to support it too.
  - If it's an arbitrary pattern, you can look into flex attention; a minimal sketch follows below.
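For the arbitrary-pattern case, here is a minimal sketch of how a large bias can be applied with PyTorch FlexAttention (torch >= 2.5) without ever materializing the 12 288 × 12 288 tensor. The shapes, head count, relative-position bias table, and the causal mask rule below are illustrative assumptions, not part of the original thread; substitute your own mask rule / bias lookup.

```python
# Minimal FlexAttention sketch (assumed setup: 1 sample, 16 heads, head_dim 64,
# seq_len 12288, fp16 on CUDA). The bias table and mask rule are placeholders.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 16, 12288, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

# Hypothetical bias that depends only on relative position: an (H, 2*S - 1)
# table instead of a dense (S, S) attn_bias.
rel_bias = torch.randn(H, 2 * S - 1, device="cuda", dtype=torch.float16)

def bias_mod(score, b, h, q_idx, kv_idx):
    # Bias values are looked up on the fly inside the kernel; the dense
    # 12288 x 12288 tensor is never built.
    return score + rel_bias[h, q_idx - kv_idx + S - 1]

def keep(b, h, q_idx, kv_idx):
    # Boolean sparsity rule (causal here as a placeholder); fully masked-out
    # tiles are skipped entirely by the block mask.
    return q_idx >= kv_idx

block_mask = create_block_mask(keep, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")

flex = torch.compile(flex_attention)  # compiling is what makes this fast
out = flex(q, k, v, score_mod=bias_mod, block_mask=block_mask)  # (B, H, S, D)
```

The key point of this approach is that the score_mod/mask_mod hooks are only ever asked for values per (q_idx, kv_idx) pair, so any compact GPU-resident representation of the bias (a relative-position table, a bucketed lookup, etc.) can stand in for the dense tensor.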
-
I’m working with a Transformer model that routinely processes sequences up to 12 288 tokens.
For the attention bias I currently create a dense attn_bias of shape 12 288 × 12 288.
Right now I am running into memory problems in my multi-head attention because of that large tensor (a rough estimate is sketched below).
I could build smaller blocks on the fly from a sparse attn_bias tensor, but I am not sure whether xformers supports this kind of processing.
I would be grateful for any help. Are there other packages that could help me solve this problem?
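For context, a back-of-the-envelope estimate of why the dense bias is problematic; fp16 storage and 16 heads are assumptions made for illustration, not stated in the thread.

```python
# Rough memory footprint of a dense attn_bias, assuming fp16 (2 bytes/element)
# and 16 attention heads for a single sequence (head count is an assumption).
seq_len, n_heads, bytes_per_el = 12288, 16, 2
per_head = seq_len * seq_len * bytes_per_el   # ~288 MiB per head
total = per_head * n_heads                    # ~4.5 GiB per sample
print(f"{per_head / 2**20:.0f} MiB per head, {total / 2**30:.1f} GiB per sample")
```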
Maciek