[Feature]: Support for Diff-Transformer to limit noise in attention calculation @ runtime
🚀 The feature, motivation and pitch
Researchers from Microsoft Research and Tsinghua University have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while cancelling out attention noise. Their paper shows that Diff Transformer outperforms the classic Transformer architecture across a range of settings. Differential attention can be used both when training a model from scratch and when adapting pretrained models, where it improves robustness and accuracy in practical applications such as in-context learning and text summarization. Sources and a minimal sketch of the attention operation are given below. The feature request here is to evaluate the potential for supporting Diff-Transformer models at vLLM runtime.
paper: ArXiv
press coverage (October 16th): VentureBeat
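For reference, here is a minimal PyTorch sketch of the differential attention operation the paper describes: two softmax attention maps are computed from separate query/key projections and subtracted, scaled by a learnable scalar λ, so that common-mode attention noise cancels. The single-head view, the tensor shapes, and the fixed `lam` scalar are simplifying assumptions for illustration, not the authors' exact implementation (which also handles multi-head projection, causal masking, per-head normalization, and a re-parameterized λ):

```python
import math
import torch

def diff_attention(q1, q2, k1, k2, v, lam: float):
    # q1, q2, k1, k2: (batch, seq_len, d); v: (batch, seq_len, 2 * d)
    d = q1.shape[-1]
    scale = 1.0 / math.sqrt(d)
    # Two independent softmax attention maps over the same sequence.
    a1 = torch.softmax((q1 @ k1.transpose(-1, -2)) * scale, dim=-1)
    a2 = torch.softmax((q2 @ k2.transpose(-1, -2)) * scale, dim=-1)
    # Differential attention: the second map serves as a noise estimate and is
    # subtracted (scaled by lambda) before weighting the values.
    return (a1 - lam * a2) @ v

# Example shapes: output is (2, 16, 128), matching the 2*d value head.
q1, q2, k1, k2 = (torch.randn(2, 16, 64) for _ in range(4))
out = diff_attention(q1, q2, k1, k2, torch.randn(2, 16, 128), lam=0.8)
```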
Alternatives
N/A
Additional context
github: Diff-Transformer
"
multihead_diffattn.py contains a naive implementation of multi-head differential attention.
multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers).
multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).
Also refer to microsoft/unilm#1633 for another implementation.
"