Here are some resources about Approximated Attention
In addition to heuristic approaches aimed at restricting full attention computation, some research explores the mathematical essence behind attention kernel computations. These studies use estimation methods based on the sparsity or low-rank properties of attention matrices to approximate attention with linear complexity, albeit at the cost of precision. We introduce several of these approximation techniques below.
- Intro
- Definitions
- Low-Rank Approximation
- Nested Attention
- Kernelized Approximation
- Sparse-Kernelized Hybrid
Prior to delving into the literature, we introduce the following definitions to establish a unified representation of the mathematical symbols used in the equations below.
- Definition 1 (Attention Kernel): We define a simplified version of the attention kernel computation below (omitting the mask $M$ and other projection operations), where $P$ denotes the unnormalized relevance matrix for each pair of $\mathbf q, \mathbf k$, $A$ is the row-wise normalized and scaled attention matrix, and $O$ is the output hidden states.
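Concretely, with $Q, K \in \mathbb{R}^{L\times d_k}$ and $V \in \mathbb{R}^{L\times d_v}$, a standard form consistent with these symbols is:

$$ \begin{align} P = QK^{\mathrm{T}}, \quad A = \mathrm{softmax}\left(\frac{P}{\sqrt{d_k}}\right), \quad O = AV \end{align} $$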
- Definition 2 (Causal Mask Function): Note that we have not distinguished whether the methods are employed for BERT-like encoder-only LLMs or GPT-like decoder-only LLMs in the previous sections, since most of them can be trivially transferred from the BERT setting to the GPT setting with a causal attention mask. However, the causal mask is often non-trivial for many approximation strategies. So to facilitate later discussions, we first define a general weighted causal function $\xi_{\mathbf w}$ in the equation below, where $\mathbf w \in \mathbb{R}^L$ represents a weight vector for each row. This function will substitute the causal attention mask operation.
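One plausible instantiation consistent with this description (an assumption for illustration, not necessarily the source's exact definition) weights each row $i$ by $\mathbf w_i$ while zeroing out future positions:

$$ \begin{align} \big[\xi_{\mathbf w}(P)\big]_{ij} := \begin{cases} \mathbf w_i\, P_{ij}, & j \le i \\ 0, & j > i \end{cases} \end{align} $$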
- Definition 3 (Generalized Kernelizable Attention): The standard attention kernel computation can be generalized to a kernelizable form as below, where the kernel function $\mathcal{K}(\cdot,\cdot): \mathbb{R}^{d}\times \mathbb{R}^{d}\rightarrow \mathbb{R}_+$ is applied row-wise to each pair of $\mathbf q_i, \mathbf k_j$ in $Q, K$, and $D$ is the normalization factor. From this view, vanilla softmax attention simply implements the specific kernel function $\mathcal{K}(Q,K) = \exp(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}})$, which explicitly materializes an $L\times L$ attention matrix. But if we carefully choose a kernel function that factorizes as in the condition of the second step of the following equations, then by simply applying the associative property we can compute the matrix multiplications involving $K, V$ and $K, \mathbf 1_L$ first, with lower complexity $O(Ld^2)$.
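To make the kernelizable view concrete, here is a minimal NumPy sketch (our illustration, not code from any of the cited papers) using the $\mathrm{elu}(\cdot)+1$ feature map shown further below for Linear Transformers; the causal variant applies the mask via prefix sums rather than an explicit $L\times L$ matrix.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 (the Linear Transformer choice); always positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal kernelized attention in O(L * d^2), with no L x L matrix.

    Q, K: (L, d_k); V: (L, d_v); returns O: (L, d_v).
    """
    phi_q, phi_k = elu_feature_map(Q), elu_feature_map(K)
    kv = phi_k.T @ V                 # (d_k, d_v): phi(K)^T V computed first
    z = phi_k.sum(axis=0)            # (d_k,):     phi(K)^T 1_L
    numer = phi_q @ kv               # (L, d_v)
    denom = phi_q @ z                # (L,):       normalization factor D
    return numer / denom[:, None]

def causal_linear_attention(Q, K, V):
    """Causal variant: prefix sums over phi(k_j) v_j^T realize the mask."""
    phi_q, phi_k = elu_feature_map(Q), elu_feature_map(K)
    kv = np.cumsum(phi_k[:, :, None] * V[:, None, :], axis=0)  # (L, d_k, d_v)
    z = np.cumsum(phi_k, axis=0)                                # (L, d_k)
    numer = np.einsum('ld,ldv->lv', phi_q, kv)
    denom = np.einsum('ld,ld->l', phi_q, z)
    return numer / denom[:, None]
```

Both routines avoid materializing the $L\times L$ attention matrix, which is exactly where the $O(Ld^2)$ complexity claim above comes from.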
paper link: here
citation:
@article{wang2020linformer,
title={Linformer: Self-attention with linear complexity},
author={Wang, Sinong and Li, Belinda Z and Khabsa, Madian and Fang, Han and Ma, Hao},
journal={arXiv preprint arXiv:2006.04768},
year={2020}
}
paper link: here
citation:
@article{ma2021luna,
title={Luna: Linear unified nested attention},
author={Ma, Xuezhe and Kong, Xiang and Wang, Sinong and Zhou, Chunting and May, Jonathan and Ma, Hao and Zettlemoyer, Luke},
journal={Advances in Neural Information Processing Systems},
volume={34},
pages={2441--2453},
year={2021}
}
paper link: here
citation:
@misc{chen2023primalattention,
title={Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation},
author={Yingyi Chen and Qinghua Tao and Francesco Tonin and Johan A. K. Suykens},
year={2023},
eprint={2305.19798},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
paper link: here
citation:
@article{nguyen2022fourierformer,
title={Fourierformer: Transformer meets generalized fourier integral theorem},
author={Nguyen, Tan and Pham, Minh and Nguyen, Tam and Nguyen, Khai and Osher, Stanley and Ho, Nhat},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={29319--29335},
year={2022}
}
paper link: here
citation:
@article{peng2021random,
title={Random feature attention},
author={Peng, Hao and Pappas, Nikolaos and Yogatama, Dani and Schwartz, Roy and Smith, Noah A and Kong, Lingpeng},
journal={arXiv preprint arXiv:2103.02143},
year={2021}
}
paper link: here
citation:
@article{choromanski2020rethinking,
title={Rethinking attention with performers},
author={Choromanski, Krzysztof and Likhosherstov, Valerii and Dohan, David and Song, Xingyou and Gane, Andreea and Sarlos, Tamas and Hawkins, Peter and Davis, Jared and Mohiuddin, Afroz and Kaiser, Lukasz and others},
journal={arXiv preprint arXiv:2009.14794},
year={2020}
}
$$ \begin{align} \mathcal{K}_{Li}(\mathbf q,\mathbf k) := \varphi_{Li}(\mathbf q)\times \varphi_{Li}(\mathbf k)^\mathrm{T}, \quad \text{where}\quad \varphi_{Li}(\mathbf x) = \mathrm{elu}(\mathbf x) + 1 \end{align} $$
paper link: here
citation:
@inproceedings{katharopoulos2020transformers,
title={Transformers are rnns: Fast autoregressive transformers with linear attention},
author={Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran{\c{c}}ois},
booktitle={International conference on machine learning},
pages={5156--5165},
year={2020},
organization={PMLR}
}
paper link: here
citation:
@article{chen2021scatterbrain,
title={Scatterbrain: Unifying sparse and low-rank attention approximation},
author={Chen, Beidi and Dao, Tri and Winsor, Eric and Song, Zhao and Rudra, Atri and R{\'e}, Christopher},
journal={arXiv preprint arXiv:2110.15343},
year={2021}
}