Replies: 4 comments
-
All the GPU peer fix does is force Torch to move tensors via the CPU when copying from one GPU to another. Torch should already do this automatically when GPU peer access isn't supported, but there have been cases where it mistakenly thinks peer access is supported when it actually isn't. So if multi-GPU breaks everything completely, the fix is a setting you can try to work around that particular Torch bug. Otherwise you should leave it off to allow direct copies when they are supported. A perplexity drop would be a good thing, so I'm not sure what you mean. But generally, yes, the fused methods change the order of operations slightly, specifically of floating-point additions, which means they're not completely equivalent to the non-fused methods, so you get slightly different results. In my experience the difference is marginal, though, and it can swing either way.
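To make that concrete, here's a rough sketch of what the setting amounts to in plain PyTorch. This is illustrative only, not ExLlama's actual implementation:

```python
import torch

# Sketch of the idea: the peer fix stages inter-GPU copies through system
# memory instead of letting Torch attempt a direct device-to-device copy.
def move_tensor(t: torch.Tensor, dst: str, gpu_peer_fix: bool = False) -> torch.Tensor:
    if gpu_peer_fix:
        return t.to("cpu").to(dst)   # force the copy to go via the CPU
    return t.to(dst)                 # let Torch do a direct copy if it thinks peer access works

if torch.cuda.device_count() > 1:
    # This reports what Torch *believes* about peer access, which is exactly
    # the report that can occasionally be wrong.
    print(torch.cuda.can_device_access_peer(0, 1))
    x = torch.randn(4, 4, device="cuda:0")
    y = move_tensor(x, "cuda:1", gpu_peer_fix=True)
```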
-
Well, I mean a rise in the perplexity number; I know it should be the opposite. I'll leave the peer fix off then, since that does give some tiny gains and I'm on the latest stable Torch anyway. So the MLP threshold only kicks in at longer context? I think the batches are 2048 by default? And then sdp_thd is the threshold for SDP attention? What about:
-
Sequence length in this case refers to the number of tokens sent through each forward pass. In most cases, if, say, you're generating a single sequence from a prompt, you'll send maybe 100 tokens through in the first pass, then one token at a time for all subsequent passes. So the fused modules won't be used for the prompt, but they will be used for the individual tokens afterwards.
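So, conceptually, a threshold like the fused MLP one is compared against the tokens in each pass, something like this (stand-in function name and toy shapes, not ExLlama's real code):

```python
import torch

def mlp_forward(hidden_states: torch.Tensor, fused_mlp_thd: int = 2) -> str:
    # Illustrative only: the threshold is checked against the tokens sent
    # through *this* pass, not against the total context length.
    seq_len = hidden_states.shape[1]
    if fused_mlp_thd > 0 and seq_len < fused_mlp_thd:
        return "fused path"      # single-token generation steps land here
    return "regular path"        # the ~100-token prompt pass lands here

prompt = torch.zeros(1, 100, 4096)   # first pass: the whole prompt at once
token = torch.zeros(1, 1, 4096)      # subsequent passes: one token each
print(mlp_forward(prompt), mlp_forward(token))  # -> regular path fused path
```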
-
So now we have flash attention 2, multi-stream and core affinity? I will try them out. Results: FA2 gives a slight speedup of around 0.10 t/s. It helps especially when you can't use fused attention, like with a LoRA. I'm not sure what it does with memory yet; will have to see if I can squeeze more context out of it.
-
I have been using fused_attn on and off, and setting mlp_thd between 0 and 2 to toggle fused MLP. Not sure if that is the right way. In textgen these things were not added to exllama_hf, but I added them back.
I do notice slightly faster speeds when using them, but also a slight perplexity drop. Is there any more information on what they should be used for? I need to check the GPU peer fix too and whether it is better off or on, especially since I have NVLink, and direct card-to-card tensor moves should theoretically be faster if Torch doesn't mess it up.
Ok, checked the peer fix: it gives only a 0.0x tokens/s speedup. Also, ooba's perplexity measurement doesn't seem fully stable and returns a slightly different value each time.
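For reference, this is roughly how I'd expect these settings to be toggled on the config object; the module path, attribute names and values below are my assumptions about the knobs being discussed, so check them against the ExLlama version you're actually running:

```python
from model import ExLlamaConfig   # module layout as in the exllama repo (assumed)

# All attribute names and values below are assumptions, not confirmed API:
config = ExLlamaConfig("models/llama-13b-4bit/config.json")  # hypothetical path

config.fused_mlp_thd = 2      # 0 disables fused MLP; >0 enables it for short passes
config.sdp_thd = 8            # use SDP attention for passes shorter than this
config.fused_attn = True      # fused attention on/off
config.gpu_peer_fix = False   # leave off unless direct GPU-to-GPU copies are broken
```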