Replies: 13 comments 31 replies
-
Thanks. That's informative. Are those cloud instances? Do you know anything about the CPUs they're running on? It still seems to be somewhat CPU-bound, at least on the higher-end GPUs.

I have a bunch of ideas for improving performance further, but there is an end in sight, because memory bandwidth ends up being a hard limit on how fast you can perform the forward pass. The 4090 will supposedly go up to 1 TB/s, and with 17 GB of weights that works out to about 59 tokens/second as a theoretical maximum. I'm still hoping to squeeze out a little bit more. Here are some thoughts:

For the CUDA kernels, at the moment I'm trusting the optimizer to inline and completely restructure a bunch of stuff that makes the source code easier for me to follow, like the MatrixView classes. I'm also counting on being more or less up against the bandwidth limit anyway, so a little inefficiency shouldn't make any difference at all. Still, I'll give at least the quantized matmul kernel some love soon. It could also use some tunable parameters so it isn't optimized so strictly for the 4090.

At some point, though, I feel like I'll just be left with a Python wrapper for a PyTorch C++ extension, and then I might ditch PyTorch altogether. There are some ideas I've been considering, like splitting the model vertically (to use multiple GPUs in parallel), that would be much easier without PyTorch managing resources all the time. It could also free up some VRAM.
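As a rough back-of-envelope check of the bandwidth argument above (a sketch only: the spec bandwidths are approximate, and the 17 GB figure is the weight size mentioned here, not a measured value):

```python
# Bandwidth-bound ceiling for token generation: each generated token has to
# stream the full set of weights through the GPU's memory bus once, so
# tokens/s can't exceed (memory bandwidth) / (weight bytes).

def bandwidth_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed if memory bandwidth is the only limit."""
    return bandwidth_gb_s / weights_gb

weights_gb = 17.0  # ~17 GB of quantized weights, as mentioned above

# Approximate spec bandwidths in GB/s; treat these as ballpark figures.
for gpu, bw in [("RTX 4090", 1008), ("RTX 3090", 936)]:
    print(f"{gpu}: ~{bandwidth_ceiling_tokens_per_s(bw, weights_gb):.0f} tokens/s ceiling")

# RTX 4090: ~59 tokens/s ceiling
# RTX 3090: ~55 tokens/s ceiling
```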
-
@turboderp Have you seen this project: https://github.com/mit-han-lab/llm-awq ?
-
Trying to figure out which GPU to buy; it's very hard to decide between a 3090 (or dual 3090s) and a 4090. @turboderp, are you able to post benchmark numbers from your setup using a 3090 (or Ti) instead of the 4090? It would be very helpful to see what the actual performance difference is between the two. Thanks for all your hard work btw, this repo is insane.
-
Of course. Here are the same tests from the readme, run on the 3090:
I think there's definitely work to do on optimizing the kernel for the 3090 Ti. The slower prompt speed is to be expected, since prompt processing just boils down to a regular cuBLAS FP16 matmul, and the 4090 is some 50% faster there, as expected. The token speed is a little disappointing, though: only about 60-80% of the 4090's, even though the bandwidth is supposed to be the same. It's possible it just needs some tuning for the 30-series specifically.
-
CMP 50HX, using Wizard Vicuna 13B Uncensored GPTQ
-
A few benchmarks on OpenLLaMA 7B and CUDA 12.1. I also modified the benchmark script to generate 512 tokens, as that's a bit more realistic than the 128-token default, imo.
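For reference, a minimal timing loop in the spirit of that change (a sketch only: `generate_tokens` is a hypothetical stand-in for whatever generation call the benchmark script actually makes, not exllama's API):

```python
import time

def measure_tokens_per_second(generate_tokens, num_tokens: int = 512, warmup_runs: int = 2) -> float:
    """Time token generation and return tokens/s.

    `generate_tokens(n)` is a hypothetical callable that generates n tokens;
    swap in the real generation call from the benchmark script.
    """
    for _ in range(warmup_runs):          # warm-up passes (CUDA context, caches, etc.)
        generate_tokens(num_tokens)

    start = time.perf_counter()
    generate_tokens(num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Example: tps = measure_tokens_per_second(my_generate, num_tokens=512)
# Longer generations (512 vs. the 128 default) average out per-call overhead
# and better reflect sustained decoding speed.
```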
-
This is a slower, first-gen EPYC, FYI.
-
Tested it on my 1060 with llama 7b 4bit and got about 4 tokens/s (compared to 10+ tokens/s with gptq-for-llama). Just the two warmup passes in the benchmark script took over 10 minutes to run. I know this isn't being optimized for that kind of hardware, but I figured I'd put it here to give an idea of what it's like. On my 3070, however, it's super fast, in the range of 50-70 tokens/s, while gptq-for-llama stays below 15 tokens/s.
-
What sort of PCIe lane setup do you all have on your motherboards for a dual-GPU setup? I'm having trouble finding a consumer mobo that supports two x16 slots at more than x16 + x4. I found a couple that can do x8 + x8, and they're reasonably priced. Which would be the better option: x16 + x4, or x8 + x8? How much does this matter for inference in general (or finetuning), and more specifically for exllama?
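A back-of-envelope comparison of those two layouts (a sketch: per-lane rates are approximate, PCIe 4.0 is assumed, and the actual impact depends on how much traffic the workload pushes over the links):

```python
# Approximate one-directional PCIe bandwidth per lane, in GB/s (rough spec figures).
PCIE_GB_S_PER_LANE = {3: 0.985, 4: 1.97}

def slot_bandwidth(lanes: int, gen: int = 4) -> float:
    """Approximate one-directional bandwidth of a single slot."""
    return lanes * PCIE_GB_S_PER_LANE[gen]

for label, lane_split in [("x16 + x4", (16, 4)), ("x8 + x8", (8, 8))]:
    per_slot = [slot_bandwidth(lanes) for lanes in lane_split]
    print(f"{label}: ~{per_slot[0]:.0f} + ~{per_slot[1]:.0f} GB/s "
          f"(slowest slot ~{min(per_slot):.0f} GB/s)")

# x16 + x4: ~32 + ~8 GB/s (slowest slot ~8 GB/s)
# x8 + x8:  ~16 + ~16 GB/s (slowest slot ~16 GB/s)
# For layer-split inference that only passes small activations between GPUs
# per token, even the x4 link may be far from the bottleneck; link speed
# tends to matter more for loading weights and for training-style traffic.
```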
-
I just wondered: how well would exllama (in theory) work with the Nvidia A16? Loading the model would take quite some time, I suppose, but what about the inference after that?
-
First of all, I would like to thank you for your work. I really like your inference implementation; it seems to be the fastest so far for Nvidia GPUs!
I ran a bunch of tests on various GPUs and wanted to share the results I got.
All tests are average performance over 40 runs, with a 1024-token context and 170 tokens generated each run, using wizardlm-30b.
It would be interesting to know if you have any ideas on how to boost performance further. Looking at the numbers, there clearly might be some room for improvement, specifically on high-end GPUs.