Replies: 26 comments 1 reply
-
What model?
-
exllama is very optimized for consumer GPU architecture, so enterprise GPUs might not perform or scale as well. I'm sure @turboderp has the details of why (FP16 math and whatnot), but that's probably the TL;DR.
-
The A100 should be a lot faster than the A10, yes. I just haven't optimized for it specifically, partly because it's kind of tricky to set up remote CUDA profiling on a cloud server. I've been meaning to get around to it, also to figure out some weird performance degradation that happened a while ago on the H100. There was also a CPU bottleneck at one point, which should be more or less dealt with by now, but in my experience some of these GPU cloud instances have very slow CPU cores, so that could also be part of the explanation.
-
llama-13b and llama-30b
-
Do you think FP8 TE utilization would drastically speed up inference even more?
Is the CPU load easy to distribute? Or will the slow CPU cores on cloud instances always be a bottleneck? Thank you.
-
I had a similar problem even with a local physical machine. I also tested on another machine with an A6000 GPU, and it got better performance than the A800. Compared with the A6000, the A800's prompt speed is much better, but its generation speed is slower.
-
The short answer is no. On prompt evaluation, sure, maybe on batches, but tensor cores don't work for matrix-vector multiplication, which is what you end up doing most of in text generation. There is (as far as I can tell) no way to utilize FP8 types outside of tensor fragments, except for some intrinsics that convert to and from other types.
Well, all the heavy lifting is done on the GPU. All you're doing on the CPU side is building a queue of CUDA operations, and the CUDA runtime will start working on the first item as soon as it's ready. So the only bar you have to meet is being able to define operations faster than the GPU can go through them. On average, even. In principle, this should be possible on even a really slow embedded CPU.

But Python is very slow. Operations that should take nanoseconds (like passing a pointer to a function or whatever) end up taking microseconds instead. And combined with GPTQ being very fast, and kernels often completing in just a few microseconds when the hidden state is small, the CPU can end up becoming a bottleneck after all.

It's all because of Python, though, so the solution is simple: do as much of the work as possible in the C++ extension. Ideally you'd define the whole forward pass in native code and just have a single call to the extension per token. At that point, a 4090 strapped to a potato would perform about as well as a 4090 on a high-end motherboard with a 13900K. ExLlama is still roughly shaped like the HF LlamaModel, and while a bunch of operations do get combined like this, there's still quite a bit of Python code that has to run over the forward pass.

In my case, I end up with about 30% actual CPU utilization for a 7b model (you need a CUDA profiler to measure this), which means that as long as host code runs at least 30% as fast as it does in my case, you won't be bottlenecked. Now, I have a 12900K, and a Xeon Gold 6326 shouldn't be that much slower in single-threaded performance, but there may be more going on. I suspect the CUDA runtime running in a virtualized environment might behave differently, introducing extra synchronization events or something like that. Or maybe it's down to some load balancing mechanism getting tripped up? Idk. I'll have to do some profiling on a server like this, and it's a little tricky.

Overall, though, the CPU doesn't have a whole lot of actual work to do here. So if there is a CPU bottleneck, it's always going to be a solvable problem. Somehow.
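For the curious, here is a minimal sketch of the "race ahead of the GPU" idea in plain PyTorch. It is not ExLlama code, and the tensor sizes are made up for illustration: it just compares how long the host spends merely launching a small matmul with how long the whole thing takes once you wait for the GPU.

```python
# Rough illustration (plain PyTorch, hypothetical sizes) of the point above:
# the host only has to queue CUDA work faster than the GPU can finish it.
import time
import torch

a = torch.randn(1, 4096, device="cuda", dtype=torch.float16)     # "hidden state"
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)  # "weight matrix"
_ = a @ w                    # warm-up so cuBLAS init doesn't skew the numbers
torch.cuda.synchronize()

n = 100

# Host-side cost: how long Python takes just to *launch* the kernels.
t0 = time.perf_counter()
for _ in range(n):
    y = a @ w                # returns immediately; the work is queued on the GPU
launch_us = (time.perf_counter() - t0) / n * 1e6

# End-to-end cost: the same launches, but this time wait for the GPU to finish.
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
    y = a @ w
torch.cuda.synchronize()     # drain the queue before stopping the clock
total_us = (time.perf_counter() - t0) / n * 1e6

print(f"launch only: {launch_us:.1f} us/op, launch + GPU: {total_us:.1f} us/op")
# If "launch only" is a large fraction of "launch + GPU", the GPU is spending
# much of its time idle waiting for Python -- the CPU bottleneck described above.
```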
-
That is odd. I might misunderstand your statement, but the FP8 Transformer Engine (just like bfloat16 instead of float16) is "the" way to speed up H100 LLM inference latency 3x to 5x; that's what it's advertised for and what people have actually seen. Regarding the slowness of Python: before I dive into it, how much faster do you expect a pure C++ implementation would be, latency-wise? I really appreciate your detailed response.
-
Tensor cores help you with GEMM, not GEMV. The reason is that they work on fragments down to 4x4, so if one side of your matmul is a matrix with a height of 1, you just can't tile it with 4x4 fragments. You can pad your input and your output, but then you're doing way more work than you need to.

What this means for transformers is that anywhere you're relying on GEMM (training or long sequences), tensor cores will help greatly, but until NVIDIA comes up with some type of "vector core" to use in GEMV, you're not going to see any benefit when generating one token at a time. Also, if they did, it's not clear that it would help, since GEMV is much more easily memory-bound. I.e. it's not actually computing all those dot products that slows you down, it's reading the inputs from memory.

As for that, having access to FP8 outside of fragments would open up some interesting possibilities, but it just doesn't look like it's possible right now. And even if it were, you'd be looking at FP8 activations, not FP8 weights, since those are still much larger and much slower than GPTQ. I could imagine some alternative quantization format that relies on e3m4 precision, storing one 4-bit sign+exponent for every group of weights rather than an FP16 scale, and then reconstructing FP8 weights trivially using bitfield operations. But that would be a whole thing on its own with a new quantizer etc. Without it you've just got an FP8 model which may be a lot smaller and faster than FP16, but still about twice the size of 4-bit GPTQ.
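To put rough numbers on "memory-bound": a GEMV over an FP16 weight matrix does about one FLOP per byte it reads, while a modern GPU can sustain far more math per byte of bandwidth. The sketch below is back-of-the-envelope only; the layer size, bandwidth, and FLOPs figures are assumptions for illustration, not specs of any particular card.

```python
# Back-of-the-envelope arithmetic intensity of a GEMV on one FP16 layer.
# All numbers below are illustrative assumptions, not measurements.
K, N = 4096, 4096                 # e.g. one projection matrix in a ~7B model
flops = 2 * K * N                 # one multiply + one add per weight
bytes_read = 2 * K * N            # FP16 weights dominate traffic (2 bytes each)
intensity = flops / bytes_read    # ~1 FLOP per byte read

mem_bw = 1.0e12                   # assume ~1 TB/s VRAM bandwidth
fp16_flops = 80e12                # assume ~80 TFLOPS of FP16 throughput

t_mem = bytes_read / mem_bw
t_compute = flops / fp16_flops
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"memory time ~{t_mem * 1e6:.1f} us  vs  compute time ~{t_compute * 1e6:.2f} us")
# Reading the weights takes far longer than doing the math, so faster math
# units (tensor cores, FP8) barely move the needle for token-by-token GEMV.
```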
Anywhere from 0% to maybe 300% faster. It depends what you're comparing to. If you're not CPU-bound, it's because whatever host code you're running (Python, C++ or otherwise) is fast enough that it completes before the CUDA queue completes. There's no performance to gain in that case. It's why Python makes sense in the first place, even though it's a very slow interpreted language being used for ultra-high-performance compute tasks like ML.

If you're launching 100 CUDA kernels, and each kernel takes 100 ms to complete, it doesn't matter if it takes 10 ms for the CPU to queue up each kernel launch. By the time the first kernel is done, there are already 10 more kernels ready to go, and by the time those have completed the CPU will have finished all its work and is just sitting there in a busy loop waiting for the GPU. Optimizing on the CPU side makes no difference at all. But if instead each kernel completed in 1 ms, then the CPU would become the bottleneck. The GPU would be idle 90% of the time waiting for its next assignment.

The thing about LLMs is that you end up running a lot of GPU operations over a forward pass. During training, or if you're running inference on a long sequence, those operations easily take long enough that the CPU can keep up, even if it's wasting 99.99% of its potential on Python. But once you're doing token-by-token inference, the GPU operations get a lot smaller. Factor in GPTQ with its very efficient VRAM usage and suddenly Python becomes the bottleneck.

So, one of the things that makes ExLlama fast is that it tries to do enough of the host-side work in C++ that it's trivial to race ahead of the GPU. And once you do enough, there's no further improvement to be made on the host. What I suspect is happening on the H100 with these Xeon servers is that, for one reason or another, the CUDA queue runs out from time to time, which brings back the CPU bottleneck that should have been completely (!) eliminated otherwise.

As for how much performance could be improved by addressing those issues, it's really hard to say without some in-depth profiling on those particular servers. And remote CUDA profiling is, as mentioned, kind of difficult.
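The 100-kernel thought experiment boils down to a single comparison. Here is a toy model of it; gpu_busy_fraction is a hypothetical helper written for this sketch (not part of any library), and the numbers are the illustrative ones from the paragraph above, not profiler measurements.

```python
# Toy model of the queueing argument above (illustrative numbers only).
def gpu_busy_fraction(launch_ms: float, kernel_ms: float, n_kernels: int = 100) -> float:
    """Rough fraction of wall-clock time the GPU spends executing kernels."""
    cpu_total = launch_ms * n_kernels   # time for the host to queue every launch
    gpu_total = kernel_ms * n_kernels   # time for the GPU to execute every kernel
    # Launches overlap execution, so the run ends roughly when the slower side finishes.
    return gpu_total / max(cpu_total, gpu_total)

print(gpu_busy_fraction(10, 100))  # 1.0 -> big kernels: host overhead is invisible
print(gpu_busy_fraction(10, 1))    # 0.1 -> tiny kernels: GPU idle ~90% of the time
```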
-
Alright, I see, thank you very much! Your explanations are very clear and concise whilst being very detailed; I highly appreciate that. Thank you for taking the time and effort, I learned a lot from your comment.
-
@turboderp Performance on the Ada with the 3B openllama model: 128.06 tokens/second and 141.10 tokens/second. I wanted to ask you: is that because of the 6000 Ada's worse specs? Or is it purely because of the CPU? Would you say the 6000 Ada has the capability to run just as fast as the 4090 with the right CPU?
-
Further figures:
Wizard-Vicuna-7B-Uncensored-GPTQ: 111.14 / 150.47
guanaco-33B-GPTQ:
guanaco-65B-GPTQ:
-
It's really hard to say. The A6000 Ada has anywhere from 75% to 95% of the VRAM bandwidth of the 4090, depending on where you're reading the specs. It's rated for slightly more FLOPs, but a much lower TDP... so, yeah, I don't know. (Most?) Xeon CPUs have pretty poor single-threaded performance, and I still suspect virtualization could have an effect on cloud servers, but there still shouldn't be much of a CPU bottleneck left. So again, not really sure. 4090 go brr, though, that much is certain. And consumer CPUs are just faster, it seems. I hope to have some time soon to do more thorough tests across a range of (newer) GPUs and CPUs, to see how much I've managed to narrowly optimize for just my own hardware and what could maybe be done about it.
-
I see. I am also curious about the CPU bottleneck and might be able to run a test soon on local hardware without virtualization. Just for anyone who is curious: I ran a 65B benchmark on a dual-4090 (AMD EPYC 7282) runpod environment and saw 20.25 / 25.00 tps, which is considerably faster than the single 6000 Ada setup (I would argue the CPUs are roughly comparable?). For comparing a potential virtualization/CPU bottleneck, the 7B model runs at 117.13 / 136.92 tps. Thank you very much for taking the time to evaluate :)
-
@turboderp you mentioned CPU-bound computation on the cloud, and I wonder if that's what I am seeing. I am running inference on a 30B model and expecting 35 tokens per second based on benchmarks, but am only seeing about 20 tokens/second. It does vary quite a bit depending on the CPU. AMD EPYC 7513 32-Core Processor :: 0.037 seconds per token
-
Hey @turboderp, I have another question.
-
I'm currently focused on V2, which will have faster options for 3B, specifically with speculative sampling in mind. I'm undecided on how much more effort to devote to optimizing V1 given the fairly promising results for V2 so far.
-
Oh cool!!! Do you have a very rough idea of when you will release V2? :) I also wanted to implement speculative sampling, as mentioned in some other comments; maybe I can work on that with the release of V2 or something, if you haven't already implemented it.
-
I don't know. I still need to write a bunch of code, a lot of it is very experimental, and the experiments take a while to run. So it's really hard to set a timeline. I'd like to have something to show for it within a couple of weeks, but I don't know for sure.
-
That's great news! I am really looking forward to v2 :)
-
Hey @turboderp, I hope you are having a great time. I wanted to ask: what do you think, will v2 be faster on dual 3090s with SLI or on dual 4090s? I'd love an estimate from you :)
-
Probably dual 4090s will be faster. The direction I'm leaning now still uses reordering tricks that make tensor parallelism difficult, so the raw speed of the 4090 would make much more of a difference than inter-GPU bandwidth.
-
Ok great! We are thinking about upgrading in the next week, that's why I'm asking.
-
I think so, yes. 3090s might be a better value, of course. Depending on the task at hand, they can give you like 80% of the performance at less than half the price. But then in other workloads it'll be more like a third of the performance. And certain features like FP8 just aren't supported at all on the 3090. Personally I think I would have been happy enough with two 3090s, but it's not to the point that I regret buying the 4090. For what it's worth, I'm considering a third GPU, and if I do it'll probably end up being a 3090. Not for NVLink though, just for the price/utility ratio.
-
I see, thank you very much! 😊
-
Even adding a 3090 to my slow little A4000s turned out to be an awesome choice, as it gave me about an extra 3 t/s on average!
-
Very good work, but I have a question about inference speed on different machines. I got 43.22 tokens/s on an A10, but only 51.4 tokens/s on an A100; according to my understanding, the difference should be at least twofold. Is there any reasonable explanation for this? Thanks.