Support for Large attn_bias via Sparse Tensors or On‑The‑Fly Construction (seq_len ≈ 12 288) #11933
Replies: 1 comment
-
Does each query attend to a randomly selected set of keys, or does each query attend to keys in some kind of block/causal manner? (By random, I mean a non-standard pattern where a query can attend to arbitrary token positions without any specific structure.)
  - If it's a causal pattern, you can look into flash-attention or PyTorch cuDNN attention, which support causal masking directly.
  - If it's a causal-like pattern where each query attends to a prefix of keys, the fastest available implementation is flash-attention. You don't need to materialize the mask; you can just specify cumulative sequence lengths.
  - If it's a block-followed-by-causal pattern, flash-attention/xformers should be able to support it too.
  - If it's an arbitrary pattern, you can look into flex attention; a minimal sketch follows below.
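For the arbitrary-pattern case, here is a minimal sketch of how a large bias can be applied with PyTorch FlexAttention (torch >= 2.5) without ever materializing the 12 288 × 12 288 tensor. The shapes, head count, relative-position bias table, and the causal mask rule below are illustrative assumptions, not part of the original thread; substitute your own mask rule / bias lookup.

```python
# Minimal FlexAttention sketch (assumed setup: 1 sample, 16 heads, head_dim 64,
# seq_len 12288, fp16 on CUDA). The bias table and mask rule are placeholders.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 16, 12288, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)

# Hypothetical bias that depends only on relative position: an (H, 2*S - 1)
# table instead of a dense (S, S) attn_bias.
rel_bias = torch.randn(H, 2 * S - 1, device="cuda", dtype=torch.float16)

def bias_mod(score, b, h, q_idx, kv_idx):
    # Bias values are looked up on the fly inside the kernel; the dense
    # 12288 x 12288 tensor is never built.
    return score + rel_bias[h, q_idx - kv_idx + S - 1]

def keep(b, h, q_idx, kv_idx):
    # Boolean sparsity rule (causal here as a placeholder); fully masked-out
    # tiles are skipped entirely by the block mask.
    return q_idx >= kv_idx

block_mask = create_block_mask(keep, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")

flex = torch.compile(flex_attention)  # compiling is what makes this fast
out = flex(q, k, v, score_mod=bias_mod, block_mask=block_mask)  # (B, H, S, D)
```

The key point of this approach is that the score_mod/mask_mod hooks are only ever asked for values per (q_idx, kv_idx) pair, so any compact GPU-resident representation of the bias (a relative-position table, a bucketed lookup, etc.) can stand in for the dense tensor.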
-
I’m working with a Transformer model that routinely processes sequences up to 12 288 tokens.
For the attention bias I currently create a dense attn_bias of shape 12 288 × 12 288.
Right now I am running into memory problems in my multi-head attention because of that large tensor (a rough estimate is sketched below).
I could build smaller blocks on the fly from a sparse attn_bias tensor, but I am not sure whether xformers supports this kind of processing.
I would be grateful for any help. Are there other packages that could help me solve this problem?
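For context, a back-of-the-envelope estimate of why the dense bias is problematic; fp16 storage and 16 heads are assumptions made for illustration, not stated in the thread.

```python
# Rough memory footprint of a dense attn_bias, assuming fp16 (2 bytes/element)
# and 16 attention heads for a single sequence (head count is an assumption).
seq_len, n_heads, bytes_per_el = 12288, 16, 2
per_head = seq_len * seq_len * bytes_per_el   # ~288 MiB per head
total = per_head * n_heads                    # ~4.5 GiB per sample
print(f"{per_head / 2**20:.0f} MiB per head, {total / 2**30:.1f} GiB per sample")
```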
Maciek