Here are some resources about Attention Bias
As an alternative to explicitly encoding positional information, attention biases have been explored as a way to capture the sequentiality and temporality of natural language directly within the attention kernel. As shown in the equation below, the attention bias takes the form of a matrix that is added to the attention scores.
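A minimal sketch of this formulation, assuming the conventional scaled dot-product notation and writing the bias matrix as $B$ (these symbols are assumptions, not taken from the original equation):

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + B\right)V$$

Each entry $B_{ij}$ depends only on the relative distance between query position $i$ and key position $j$; ALiBi (Press et al., 2021), for instance, uses $B_{ij} = -m\,(i-j)$ for $j \le i$ with a head-specific slope $m$.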
blog link: here
citation:
@misc{transformer-upgrade-7,
author = {Su, Jianlin},
title = {Transformer Upgrade Roadmap: 7. Length Extrapolation and Local Attention},
year = {2023},
month = {jan},
howpublished = {\url{https://spaces.ac.cn/archives/9431}}
}
illustration:
In his blog post, Su introduced a simple but strong baseline ("super-baseline") that is applied only at inference time, as illustrated in the equation. The method relies on a local causal attention mask, in which each query attends only to keys whose distance from it does not exceed the training length.
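A minimal sketch of such a local causal mask in PyTorch, assuming the window size equals the training length (the function name and the exact boundary convention are illustrative assumptions, not taken from the blog):

import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True marks positions a query may attend to: j <= i (causal) and
    # i - j < window (local), so each query sees at most `window` keys,
    # itself included; the blog's boundary convention may differ by one.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, L)
    return (j <= i) & (i - j < window)      # boolean mask, shape (L, L)

# Usage: set out-of-window scores to -inf before the softmax.
scores = torch.randn(1, 8, 16, 16)              # (batch, heads, L, L) attention logits
mask = local_causal_mask(seq_len=16, window=8)  # window = training length (assumed)
scores = scores.masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)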
paper link: here
citation:
@inproceedings{chi2023dissecting,
title={Dissecting transformer length extrapolation via the lens of receptive field analysis},
author={Chi, Ta-Chung and Fan, Ting-Han and Rudnicky, Alexander and Ramadge, Peter},
booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={13522--13537},
year={2023}
}
paper link: here
citation:
@article{chi2022kerple,
title={{KERPLE}: Kernelized relative positional embedding for length extrapolation},
author={Chi, Ta-Chung and Fan, Ting-Han and Ramadge, Peter J and Rudnicky, Alexander},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={8386--8399},
year={2022}
}
paper link: here
citation:
@article{press2021train,
title={Train short, test long: Attention with linear biases enables input length extrapolation},
author={Press, Ofir and Smith, Noah A and Lewis, Mike},
journal={arXiv preprint arXiv:2108.12409},
year={2021}
}