Replies: 26 comments 1 reply
-
What model?
-
exllama is very optimized for consumer GPU architecture, so enterprise GPUs might not perform or scale as well. I'm sure @turboderp has the details of why (FP16 math and whatnot), but that's probably the TL;DR.
-
The A100 should be a lot faster than the A10, yes. I just haven't optimized for it specifically, partly because it's kind of tricky to set up remote CUDA profiling on a cloud server. I've been meaning to get around to it, also to figure out some weird performance degradation that happened a while ago on the H100. There was also a CPU bottleneck at one point, which should be more or less dealt with by now, but in my experience some of these GPU cloud instances have very slow CPU cores, so that could also be part of the explanation.
-
llama-13b and llama-30b
-
Do you think FP8 TE utilization would drastically speed up inference even more?
Is the CPU load easy to distribute? Or will the slow CPU cores on cloud instances always be a bottleneck? Thank you.
-
I had a similar problem even with a local physical machine. I also tested on another machine with an A6000 GPU, and it got better performance than the A800. Compared with the A6000, the A800's prompt speed is much better, but its generation speed is slower.
-
The short answer is no. On prompt evaluation, sure, maybe on batches, but tensor cores don't work for matrix-vector multiplication, which is what you end up doing most of in text generation. There is (as far as I can tell) no way to utilize FP8 types outside of tensor fragments, except for some intrinsics that convert to and from other types.
Well, all the heavy lifting is done on the GPU. All you're doing on the CPU side is building a queue of CUDA operations, and the CUDA runtime will start working on the first item as soon as it's ready. So the only bar you have to meet is being able to define operations faster than the GPU can go through them. On average, even. In principle, this should be possible on even a really slow embedded CPU.

But Python is very slow. Operations that should take nanoseconds (like passing a pointer to a function or whatever) end up taking microseconds instead. And combined with GPTQ being very fast, and kernels often completing in just a few microseconds when the hidden state is small, the CPU can end up becoming a bottleneck after all.

It's all because of Python, though, so the solution is simple: do as much of the work as possible in the C++ extension. Ideally you'd define the whole forward pass in native code and just have a single call to the extension per token. At that point, a 4090 strapped to a potato would perform about as well as a 4090 on a high-end motherboard with a 13900K. ExLlama is still roughly shaped like the HF LlamaModel, and while a bunch of operations do get combined like this, there's still quite a bit of Python code that has to run over the forward pass.

In my case, I end up with about 30% actual CPU utilization for a 7b model (you need a CUDA profiler to measure this), which means that as long as host code runs at least 30% as fast as it does in my case, you won't be bottlenecked. Now, I have a 12900K, and a Xeon Gold 6326 shouldn't be that much slower in single-threaded performance, but there may be more going on. I suspect the CUDA runtime running in a virtualized environment might behave differently, introducing extra synchronization events or something like that. Or maybe it's down to some load balancing mechanism getting tripped up? Idk. I'll have to do some profiling on a server like this, and it's a little tricky.

Overall, though, the CPU doesn't have a whole lot of actual work to do here. So if there is a CPU bottleneck, it's always going to be a solvable problem. Somehow.
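For the curious, here is a minimal sketch of the "race ahead of the GPU" idea in plain PyTorch. It is not ExLlama code, and the tensor sizes are made up for illustration: it just compares how long the host spends merely launching a small matmul with how long the whole thing takes once you wait for the GPU.

```python
# Rough illustration (plain PyTorch, hypothetical sizes) of the point above:
# the host only has to queue CUDA work faster than the GPU can finish it.
import time
import torch

a = torch.randn(1, 4096, device="cuda", dtype=torch.float16)     # "hidden state"
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)  # "weight matrix"
_ = a @ w                    # warm-up so cuBLAS init doesn't skew the numbers
torch.cuda.synchronize()

n = 100

# Host-side cost: how long Python takes just to *launch* the kernels.
t0 = time.perf_counter()
for _ in range(n):
    y = a @ w                # returns immediately; the work is queued on the GPU
launch_us = (time.perf_counter() - t0) / n * 1e6

# End-to-end cost: the same launches, but this time wait for the GPU to finish.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
    y = a @ w
torch.cuda.synchronize()     # drain the queue before stopping the clock
total_us = (time.perf_counter() - t0) / n * 1e6

print(f"launch only: {launch_us:.1f} us/op, launch + GPU: {total_us:.1f} us/op")
# If "launch only" is a large fraction of "launch + GPU", the GPU is spending
# much of its time idle waiting for Python -- the CPU bottleneck described above.
```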
-
That is odd. I might misunderstand your statement, but the FP8 Transformer Engine (just like bfloat16 instead of float16) is "the" way to speed up H100 LLM inference latency 3x to 5x; that's what it's advertised for and what people have actually seen. Regarding the slowness of Python: before I dive into it, how much faster do you expect a pure C++ implementation would be, latency-wise? I really appreciate your detailed response.
-
Tensor cores help you with GEMM, not GEMV. The reason is that they work on fragments down to 4x4, so if one side of your matmul is a matrix with a height of 1, you just can't tile it with 4x4 fragments. You can pad your input and your output, but then you're doing way more work than you need to.

What this means for transformers is that anywhere you're relying on GEMM (training or long sequences), tensor cores will help greatly, but until NVIDIA comes up with some type of "vector core" to use in GEMV, you're not going to see any benefit when generating one token at a time. Also, if they did, it's not clear that it would help, since GEMV is much more easily memory-bound. I.e. it's not actually computing all those dot products that slows you down, it's reading the inputs from memory.

As for that, having access to FP8 outside of fragments would open up some interesting possibilities, but it just doesn't look like it's possible right now. And even if it were, you'd be looking at FP8 activations, not FP8 weights, since those are still much larger and much slower than GPTQ. I could imagine some alternative quantization format that relies on e3m4 precision, storing one 4-bit sign+exponent for every group of weights rather than an FP16 scale, and then reconstructing FP8 weights trivially using bitfield operations. But that would be a whole thing on its own with a new quantizer etc. Without it you've just got an FP8 model which may be a lot smaller and faster than FP16, but still about twice the size of 4-bit GPTQ.
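To put rough numbers on "memory-bound": a GEMV over an FP16 weight matrix does about one FLOP per byte it reads, while a modern GPU can sustain far more math per byte of bandwidth. The sketch below is back-of-the-envelope only; the layer size, bandwidth, and FLOPs figures are assumptions for illustration, not specs of any particular card.

```python
# Back-of-the-envelope arithmetic intensity of a GEMV on one FP16 layer.
# All numbers below are illustrative assumptions, not measurements.
K, N = 4096, 4096                 # e.g. one projection matrix in a ~7B model
flops = 2 * K * N                 # one multiply + one add per weight
bytes_read = 2 * K * N            # FP16 weights dominate traffic (2 bytes each)
intensity = flops / bytes_read    # ~1 FLOP per byte read

mem_bw = 1.0e12                   # assume ~1 TB/s VRAM bandwidth
fp16_flops = 80e12                # assume ~80 TFLOPS of FP16 throughput

t_mem = bytes_read / mem_bw
t_compute = flops / fp16_flops
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"memory time ~{t_mem * 1e6:.1f} us  vs  compute time ~{t_compute * 1e6:.2f} us")
# Reading the weights takes far longer than doing the math, so faster math
# units (tensor cores, FP8) barely move the needle for token-by-token GEMV.
```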
Anywhere from 0% to maybe 300% faster. It depends what you're comparing to. If you're not CPU-bound, it's because whatever host code you're running (Python, C++ or otherwise) is fast enough that it completes before the CUDA queue completes. There's no performance to gain in that case. It's why Python makes sense in the first place, even though it's a very slow interpreted language being used for ultra-high-performance compute tasks like ML.

If you're launching 100 CUDA kernels, and each kernel takes 100 ms to complete, it doesn't matter if it takes 10 ms for the CPU to queue up each kernel launch. By the time the first kernel is done, there are already 10 more kernels ready to go, and by the time those have completed the CPU will have finished all its work and is just sitting there in a busy loop waiting for the GPU. Optimizing on the CPU side makes no difference at all. But if instead each kernel completed in 1 ms, then the CPU would become the bottleneck. The GPU would be idle 90% of the time waiting for its next assignment.

The thing about LLMs is that you end up running a lot of GPU operations over a forward pass. During training, or if you're running inference on a long sequence, those operations easily take long enough that the CPU can keep up, even if it's wasting 99.99% of its potential on Python. But once you're doing token-by-token inference, the GPU operations get a lot smaller. Factor in GPTQ with its very efficient VRAM usage and suddenly Python becomes the bottleneck.

So, one of the things that makes ExLlama fast is that it tries to do enough of the host-side work in C++ that it's trivial to race ahead of the GPU. And once you do enough, there's no further improvement to be made on the host. What I suspect is happening on the H100 with these Xeon servers is that, for one reason or another, the CUDA queue runs out from time to time, which brings back the CPU bottleneck that should have been completely (!) eliminated otherwise.

As for how much performance could be improved by addressing those issues, it's really hard to say without some in-depth profiling on those particular servers. And remote CUDA profiling is, as mentioned, kind of difficult.
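The 100-kernel thought experiment boils down to a single comparison. Here is a toy model of it; gpu_busy_fraction is a hypothetical helper written for this sketch (not part of any library), and the numbers are the illustrative ones from the paragraph above, not profiler measurements.

```python
# Toy model of the queueing argument above (illustrative numbers only).
def gpu_busy_fraction(launch_ms: float, kernel_ms: float, n_kernels: int = 100) -> float:
    """Rough fraction of wall-clock time the GPU spends executing kernels."""
    cpu_total = launch_ms * n_kernels   # time for the host to queue every launch
    gpu_total = kernel_ms * n_kernels   # time for the GPU to execute every kernel
    # Launches overlap execution, so the run ends roughly when the slower side finishes.
    return gpu_total / max(cpu_total, gpu_total)

print(gpu_busy_fraction(10, 100))  # 1.0 -> big kernels: host overhead is invisible
print(gpu_busy_fraction(10, 1))    # 0.1 -> tiny kernels: GPU idle ~90% of the time
```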
-
Alright, I see, thank you very much! Your explanations are very clear and concise whilst being very detailed; I highly appreciate that. Thank you for taking the time and effort, I learned a lot from your comment.
-
@turboderp Performance on the Ada with the 3B openllama model: 128.06 tokens/second and 141.10 tokens/second. I wanted to ask you: is that because of the 6000 Ada's worse specs? Or is it purely because of the CPU? Would you say the 6000 Ada has the capability to run just as fast as the 4090 with the right CPU?
-
Further figures:
Wizard-Vicuna-7B-Uncensored-GPTQ: 111.14 / 150.47
guanaco-33B-GPTQ:
guanaco-65B-GPTQ:
-
It's really hard to say. The A6000 Ada has anywhere from 75% to 95% of the VRAM bandwidth of the 4090, depending on where you're reading the specs. It's rated for slightly more FLOPs, but a much lower TDP... so, yeah, I don't know. (Most?) Xeon CPUs have pretty poor single-threaded performance, and I still suspect virtualization could have an effect on cloud servers, but there still shouldn't be much of a CPU bottleneck left. So again, not really sure. 4090 go brr, though, that much is certain. And consumer CPUs are just faster, it seems. I hope to have some time soon to do more thorough tests across a range of (newer) GPUs and CPUs, to see how much I've managed to narrowly optimize for just my own hardware and what could maybe be done about it.
-
I see. I am also curious about the CPU bottleneck and might be able to run a test soon on local hardware without virtualization. Just for anyone who is curious: I ran a 65B benchmark on a dual-4090 (AMD EPYC 7282) runpod environment and saw 20.25 / 25.00 tps, which is considerably faster than the single 6000 Ada setup (I would argue the CPUs are roughly comparable?). For comparing a potential virtualization/CPU bottleneck, the 7B model runs at 117.13 / 136.92 tps. Thank you very much for taking the time to evaluate :)
-
@turboderp you mentioned CPU-bound computation on the cloud, and I wonder if that's what I am seeing. I am running inference on a 30B model and expecting 35 tokens per second based on benchmarks, but am only seeing about 20 tokens/second. It does vary quite a bit depending on the CPU. AMD EPYC 7513 32-Core Processor :: 0.037 seconds per token
-
Hey @turboderp, I have another question.
-
I'm currently focused on V2, which will have faster options for 3B, specifically with speculative sampling in mind. I'm undecided on how much more effort to devote to optimizing V1 given the fairly promising results for V2 so far.
-
Oh cool!!! Do you have a very rough idea of when you will release V2? :) I also wanted to implement speculative sampling, as mentioned in some other comments; maybe I can work on that with the release of V2 or something, if you haven't already implemented it.
-
I don't know. I still need to write a bunch of code, a lot of it is very experimental, and the experiments take a while to run. So it's really hard to set a timeline. I'd like to have something to show for it within a couple of weeks, but I don't know for sure.
-
That's great news! I am really looking forward to v2 :)
-
Hey @turboderp, I hope you are having a great time. I wanted to ask: what do you think, will v2 be faster on dual 3090s with SLI or on dual 4090s? I'd love an estimate from you :)
-
Probably dual 4090s will be faster. The direction I'm leaning now still uses reordering tricks that make tensor parallelism difficult, so the raw speed of the 4090 would make much more of a difference than inter-GPU bandwidth.
-
Ok great! We are thinking about upgrading in the next week, that's why I'm asking.
-
I think so, yes. 3090s might be a better value, of course. Depending on the task at hand, they can give you like 80% of the performance at less than half the price. But then in other workloads it'll be more like a third of the performance. And certain features like FP8 just aren't supported at all on the 3090. Personally I think I would have been happy enough with two 3090s, but it's not to the point that I regret buying the 4090. For what it's worth, I'm considering a third GPU, and if I do it'll probably end up being a 3090. Not for NVLink though, just for the price/utility ratio.
-
I see, thank you very much! 😊
-
Even adding a 3090 to my slow little A4000s turned out to be an awesome choice, as it gave me about an extra 3 t/s on average!
-
Very good work, but I have a question about inference speed on different machines. I got 43.22 tokens/s on an A10, but only 51.4 tokens/s on an A100; according to my understanding, the difference should be at least twofold. Is there any reasonable explanation for this? Thanks.