[Executorch][llm] Enable local global attention in export_llama script #10612
base: gh/kimishpatel/189/base
Conversation
Added a new option, --local_global_attention, that takes in a pattern of sizes to determine which layers use local sliding-window attention. For example, [0, 256, 256, 0, 256, 256] can be used for a 6-layer transformer, or you can use [0, 256, 256] as a pattern to be repeated. Differential Revision: [D73891423](https://our.internmc.facebook.com/intern/diff/D73891423/)
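A minimal sketch of how such a pattern could be expanded across layers (the helper name `expand_attention_pattern` is illustrative, not the actual export_llama code):

```python
# Sketch only: expand a local/global pattern such as [0, 256, 256] across all
# transformer layers. A 0 means the layer keeps full (global) attention; a
# positive value is the sliding-window size for that layer.
from typing import List

def expand_attention_pattern(pattern: List[int], n_layers: int) -> List[int]:
    """Repeat `pattern` until it covers `n_layers` entries."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return [pattern[i % len(pattern)] for i in range(n_layers)]

# [0, 256, 256] repeated over a 6-layer transformer gives
# [0, 256, 256, 0, 256, 256], matching the example above.
print(expand_attention_pattern([0, 256, 256], n_layers=6))
```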
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10612. Note: links to docs will display an error until the docs builds have completed. ❌ 3 new failures as of commit 22472c1 with merge base 1ae8c2c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
From my understanding, local/global attention is tied closely to the model and isn't something you can adjust for a particular checkpoint, unlike other export options such as qmode, enable_kv_cache, etc. I think it would be better to have this in model_args instead, so that we can represent it in the config JSON file and it becomes part of the model configuration rather than the export configuration.
You can also make all the layers do sliding-window attention, for example, so you can view this as a pure optimization as well. I haven't thought a lot about making this a model arg vs. an export arg. Your point definitely makes sense, but I think it can also be configurable if you want to support infinite generation, as many runners do.
Hmm, can a model trained without local (/global) attention run fine with local attention to take advantage of the performance optimization? If so, and we view this as an export optimization, then it might make sense, and we can apply it to any arbitrary transformer. My main concern is that when we have this as an export option, models that require local/global attention become coupled to this export CLI option, instead of just being representable in the config JSON file, as is done in Huggingface.
I think you can, even when the model is not trained that way. Sliding-window attention (e.g. attention sinks, https://arxiv.org/abs/2309.17453) is used to support much larger contexts; for example, you may have a model trained with 128k context, but you can't fully support that with such a large KV cache.
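For context, a rough sketch of the kind of sliding-window mask being discussed (the window size and helper name are illustrative, not taken from this PR; attention-sink variants additionally keep the first few tokens):

```python
# Sketch only: a sliding-window causal mask. Each query position attends to at
# most the previous `window` tokens, so the KV cache can be capped at `window`
# entries even for very long generations.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    q = torch.arange(seq_len).unsqueeze(1)   # query positions, shape (seq_len, 1)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions, shape (1, seq_len)
    causal = k <= q                          # no attention to future tokens
    in_window = (q - k) < window             # only the last `window` keys
    return causal & in_window                # True where attention is allowed

mask = sliding_window_causal_mask(seq_len=8, window=4)
print(mask.int())
```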
Stack from ghstack (oldest at bottom):
Added a new option, --local_global_attention, that takes in a pattern of sizes to determine which layers use local sliding-window attention. For example, [0, 256, 256, 0, 256, 256] can be used for a 6-layer transformer, or you can use [0, 256, 256] as a pattern to be repeated.
Differential Revision: D73891423
cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng