
[Feature]: Support for Diff-Transformer to limit noise in attention calculation at runtime #9480

Open

nightflight-dk opened this issue Oct 18, 2024 · 0 comments
🚀 The feature, motivation and pitch

Researchers at Microsoft Research and Tsinghua University have introduced the Differential Transformer (Diff Transformer), a new LLM architecture that improves performance by amplifying attention to relevant context while cancelling out attention noise. Their paper shows that Diff Transformer outperforms the classic Transformer architecture across a range of settings. The architecture can be applied both during training and to pretrained models; in the latter case it can improve robustness and accuracy in practical applications such as in-context learning and text summarization. Sources are linked below. This feature request is to examine the potential of applying it at vLLM runtime.

paper: ArXiv
press coverage (October 16, 2024): VentureBeat
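
For reference, the core operation the paper proposes is a difference of two softmax attention maps, which cancels common-mode attention scores. A minimal single-head sketch in PyTorch follows; the function name, shapes, and the fixed `lam` value are illustrative assumptions (in the paper, lambda is a learnable, reparameterized scalar), not the reference implementation:

```python
# Minimal sketch of differential attention, following the paper's
# DiffAttn(X) = (softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d))) V.
# Names, shapes, and the fixed lambda are assumptions for illustration.
import math
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # q1, k1, q2, k2: (batch, seq, d); v: (batch, seq, d_v)
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / math.sqrt(d), dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / math.sqrt(d), dim=-1)
    # Subtracting the second map cancels attention mass common to both
    # projections, suppressing attention paid to irrelevant context.
    return (a1 - lam * a2) @ v

# Example: out has shape (2, 16, 128)
q1, k1, q2, k2 = (torch.randn(2, 16, 64) for _ in range(4))
out = diff_attention(q1, k1, q2, k2, torch.randn(2, 16, 128))
```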

Alternatives

N/A

Additional context

github: Diff-Transformer

"
multihead_diffattn.py contains a naive implementation of multi-head differential attention.

multihead_flashdiff_1.py contains multi-head differential attention implemented with FlashAttention, for packages that support different qk/v dimensions (e.g., our customized-flash-attention and xformers).

multihead_flashdiff_2.py contains multi-head differential attention implemented with FlashAttention, for packages that do not support different qk/v dimensions (e.g., flash-attention).

Also refer to microsoft/unilm#1633 for another implementation.
"

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.