Replies: 4 comments
-
All the GPU peer fix does is force Torch to move tensors via the CPU when copying from one GPU to another. Torch should already do this automatically when GPU peer access isn't supported, but there have been cases where it mistakenly thinks peer access is supported when it actually isn't. So if multi-GPU breaks everything completely, the fix is a setting you can try to work around that particular Torch bug. Otherwise you should leave it off to allow direct copies when they are supported. A perplexity drop would be a good thing, so I'm not sure what you mean. But generally, yes, the fused methods change the order of operations slightly, specifically of floating-point additions, which means they're not completely equivalent to the non-fused methods, so you get slightly different results. In my experience the difference is marginal, though, and it can swing either way.
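To make that concrete, here's a rough sketch of what the setting amounts to in plain PyTorch. This is illustrative only, not ExLlama's actual implementation:

```python
import torch

# Sketch of the idea: the peer fix stages inter-GPU copies through system
# memory instead of letting Torch attempt a direct device-to-device copy.
def move_tensor(t: torch.Tensor, dst: str, gpu_peer_fix: bool = False) -> torch.Tensor:
    if gpu_peer_fix:
        return t.to("cpu").to(dst)   # force the copy to go via the CPU
    return t.to(dst)                 # let Torch do a direct copy if it thinks peer access works

if torch.cuda.device_count() > 1:
    # This reports what Torch *believes* about peer access, which is exactly
    # the report that can occasionally be wrong.
    print(torch.cuda.can_device_access_peer(0, 1))
    x = torch.randn(4, 4, device="cuda:0")
    y = move_tensor(x, "cuda:1", gpu_peer_fix=True)
```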
-
Well, I mean a rise in the perplexity number; I know it should be the opposite. I'll leave the peer fix off then, since that does give some tiny gains and I'm on the latest stable Torch anyway. So the MLP threshold only kicks in at longer context? I think the batches are 2048 by default? And then sdp_thd is the threshold for SDP attention? What about:
-
Sequence length in this case refers to the number of tokens sent through each forward pass. In most cases, if, say, you're generating a single sequence from a prompt, you'll send maybe 100 tokens through in the first pass, then one token at a time for all subsequent passes. So the fused modules won't be used for the prompt, but they will be used for the individual tokens afterwards.
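So, conceptually, a threshold like the fused MLP one is compared against the tokens in each pass, something like this (stand-in function name and toy shapes, not ExLlama's real code):

```python
import torch

def mlp_forward(hidden_states: torch.Tensor, fused_mlp_thd: int = 2) -> str:
    # Illustrative only: the threshold is checked against the tokens sent
    # through *this* pass, not against the total context length.
    seq_len = hidden_states.shape[1]
    if fused_mlp_thd > 0 and seq_len < fused_mlp_thd:
        return "fused path"      # single-token generation steps land here
    return "regular path"        # the ~100-token prompt pass lands here

prompt = torch.zeros(1, 100, 4096)   # first pass: the whole prompt at once
token = torch.zeros(1, 1, 4096)      # subsequent passes: one token each
print(mlp_forward(prompt), mlp_forward(token))  # -> regular path fused path
```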
-
So now we have flash attention 2, multi-stream and core affinity? I will try them out. Results: FA2 gives a slight speedup of around 0.10 t/s. It helps especially when you can't use fused attention, like with a LoRA. I'm not sure what it does with memory yet; will have to see if I can squeeze more context out of it.
-
I have been using fused_attn on and off, and setting mlp_thd between 0 and 2 to toggle fused MLP. Not sure if that is the right way. In textgen these things were not added to exllama_hf, but I added them back.
I do notice slightly faster speeds when using them, but also a slight perplexity drop. Is there any more information on what they should be used for? I need to check the GPU peer fix too and whether it is better off or on, especially since I have NVLink, and direct card-to-card tensor moves should theoretically be faster if Torch doesn't mess it up.
Ok, checked the peer fix: it gives only a 0.0x tokens/s speedup. Also, ooba's perplexity measurement doesn't seem fully stable and returns a slightly different value each time.
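For reference, this is roughly how I'd expect these settings to be toggled on the config object; the module path, attribute names and values below are my assumptions about the knobs being discussed, so check them against the ExLlama version you're actually running:

```python
from model import ExLlamaConfig   # module layout as in the exllama repo (assumed)

# All attribute names and values below are assumptions, not confirmed API:
config = ExLlamaConfig("models/llama-13b-4bit/config.json")  # hypothetical path

config.fused_mlp_thd = 2      # 0 disables fused MLP; >0 enables it for short passes
config.sdp_thd = 8            # use SDP attention for passes shorter than this
config.fused_attn = True      # fused attention on/off
config.gpu_peer_fix = False   # leave off unless direct GPU-to-GPU copies are broken
```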