Single head attention, decoupled LR, autoregressive auxiliary loss, and gradient accumulation #191
base: master
…which to place an SHA block
…iments being more stable with more aggressive clipping at 0.5 on 20 million chunks
sha_sandwich_norm = true

[aux_decoder]
loss_weight = 0.25
set this to 0 to turn off the auxiliary AR loss
the protocol should be to start at 0.25 and search higher values, up to 1, for as long as you see continued improvement
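The search protocol in the comment above can be sketched as a simple sweep. This is a minimal illustration, not the PR's code: `evaluate` is a hypothetical callback that trains with a given `loss_weight` and returns a validation loss, and the candidate grid is an assumption. The `combined_loss` helper also assumes the auxiliary AR loss enters the objective as a weighted additive term, which the source does not confirm.

```python
def combined_loss(main_loss, aux_loss, loss_weight):
    # Assumed form: weighted sum, so loss_weight = 0 drops the auxiliary term.
    return main_loss + loss_weight * aux_loss


def search_aux_loss_weight(evaluate, candidates=(0.25, 0.5, 0.75, 1.0)):
    """Start at 0.25 and raise the weight (up to 1) while validation loss improves.

    `evaluate(weight)` is a hypothetical callback returning a validation
    loss (lower is better) for a run trained with that loss_weight.
    """
    best_weight, best_loss = None, float("inf")
    for weight in candidates:
        loss = evaluate(weight)
        if loss >= best_loss:
            # No further improvement: stop searching higher values.
            break
        best_weight, best_loss = weight, loss
    return best_weight, best_loss
```

Stopping at the first non-improving value keeps the number of training runs small; a full grid over the candidates would be the more thorough alternative.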
…ing --batch * --accum
… a command line flag. also add ability to turn off self attention in the AR decoder
…concatting feature dimension across layers
…b_attn flag in configs
… default head dimension to 64
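One of the truncated commit messages above mentions `--batch * --accum`. In gradient accumulation that product is usually the effective batch size per optimizer step; whether the PR uses it exactly this way is an assumption. The toy sketch below (a 1-D regression with hypothetical names and data, not the repository's training loop) shows why averaging the gradients of `accum` micro-batches of size `batch` matches one step on `batch * accum` examples.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, batch, accum):
    """Average the gradients of `accum` micro-batches of size `batch`."""
    total = 0.0
    for step in range(accum):
        lo, hi = step * batch, (step + 1) * batch
        # Scale each micro-batch gradient by 1 / accum before summing.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum
    return total


# Toy data, purely illustrative.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, -2.1, 3.9, 3.2, -0.8, 6.1, 2.2, -4.0]
batch, accum = 2, 4
effective_batch = batch * accum  # examples contributing to one optimizer step

full = grad_mse(0.3, xs[:effective_batch], ys[:effective_batch])
acc = accumulated_grad(0.3, xs, ys, batch, accum)
# `full` and `acc` agree up to floating-point rounding.
```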
@@ -27,6 +27,9 @@ attn_dropout = 0.1
ff_dropout = 0.1
num_attn_heads = 1

use_isab_attn = true
when using ISAB attention, num_attn_heads above should be set to at least 4
@@ -30,6 +30,8 @@ num_attn_heads = 1
use_isab_attn = true
isab_num_latents = 6

weight_tie_attn_blocks = false
this is for saving parameters when using ISAB blocks, which have twice as many attention parameters as S(M)HA blocks
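A rough parameter count illustrates the factor of two mentioned in the comment above. This is a sketch under stated assumptions, not the repository's model code: it counts only the query, key, value, and output projection weights (no biases, norms, or feedforward layers), uses a hypothetical model width of 512, and assumes that tying shares one set of attention weights between the two attention layers of an ISAB block. What `weight_tie_attn_blocks` actually ties is not stated in the source.

```python
def attention_params(dim, heads, dim_head):
    """Weights in one attention layer: q, k, v, and output projections (no biases)."""
    inner = heads * dim_head
    return 4 * dim * inner


def isab_params(dim, heads, dim_head, num_latents, tie_attn=False):
    """ISAB: latents attend to the inputs, then the inputs attend to the latents.

    That is two attention layers plus `num_latents` learned latent vectors.
    `tie_attn=True` models the assumed case where both layers share weights.
    """
    attn = attention_params(dim, heads, dim_head)
    num_attn = 1 if tie_attn else 2
    return num_attn * attn + num_latents * dim


dim = 512  # hypothetical model width, not taken from the PR
standard = attention_params(dim, heads=4, dim_head=64)
untied = isab_params(dim, heads=4, dim_head=64, num_latents=6)
tied = isab_params(dim, heads=4, dim_head=64, num_latents=6, tie_attn=True)
```

Under these assumptions the untied ISAB block carries two attention layers' worth of weights plus the latents, and tying brings it back to roughly the cost of a single attention layer.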
num_attn_heads = 1 # number of attention heads, which should be kept at 1 for single-head attention, but can be increased to > 1 to turn on multi-head attention
dim_attn_head = 64 # dimension per attention head, should just keep at 64, but can be lowered to 32 for further efficiency / perf tradeoff

use_isab_attn = false # whether to use ISAB attention (induced-set attention block from the Set Transformers paper)
if you set this to true, the number of attention heads needs to be increased to 4 or above. a good starting config would be:
num_attn_heads = 4
dim_attn_head = 64
use_isab_attn = true
isab_num_latents = 6
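The constraint in the comment above could be enforced with a small check when the config is loaded. The function below is a hypothetical helper, not part of the PR; it assumes the config has already been parsed into a plain dictionary keyed by the option names shown in the diff.

```python
def validate_attn_config(cfg):
    """Raise ValueError if ISAB attention is enabled with fewer than 4 heads."""
    heads = cfg.get("num_attn_heads", 1)
    if cfg.get("use_isab_attn", False) and heads < 4:
        raise ValueError(
            f"use_isab_attn = true requires num_attn_heads >= 4, got {heads}"
        )
    return cfg


# The suggested starting config from the review comment, as a dict.
starting_config = {
    "num_attn_heads": 4,
    "dim_attn_head": 64,
    "use_isab_attn": True,
    "isab_num_latents": 6,
}
validate_attn_config(starting_config)  # passes without raising
```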
clean PR