Replies: 13 comments 31 replies
-
Thanks. That's informative. Are those cloud instances? Do you know anything about the CPUs they're running on? It still seems to be somewhat CPU-bound, at least on the higher-end GPUs.

I have a bunch of ideas for improving performance further, but there is an end in sight, because memory bandwidth ends up being a hard limit on how fast you can perform the forward pass. The 4090 will supposedly go up to 1 TB/s, and with 17 GB of weights that works out to about 59 tokens/second as a theoretical maximum. I'm still hoping to squeeze out a little bit more. Here are some thoughts:

For the CUDA kernels, at the moment I'm trusting the optimizer to inline and completely restructure a bunch of stuff that makes the source code easier for me to follow, like the MatrixView classes. I'm also counting on being more or less up against the bandwidth limit anyway, so a little inefficiency shouldn't make any difference at all. Still, I'll give at least the quantized matmul kernel some love soon. It could also use some tunable parameters so it isn't optimized so strictly for the 4090.

At some point, though, I feel like I'll just be left with a Python wrapper for a PyTorch C++ extension, and then I might ditch PyTorch altogether. There are some ideas I've been considering, like splitting the model vertically (to use multiple GPUs in parallel), that would be much easier without PyTorch managing resources all the time. It could also free up some VRAM.
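As a rough back-of-envelope check of the bandwidth argument above (a sketch only: the spec bandwidths are approximate, and the 17 GB figure is the weight size mentioned here, not a measured value):

```python
# Bandwidth-bound ceiling for token generation: each generated token has to
# stream the full set of weights through the GPU's memory bus once, so
# tokens/s can't exceed (memory bandwidth) / (weight bytes).

def bandwidth_ceiling_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed if memory bandwidth is the only limit."""
    return bandwidth_gb_s / weights_gb

weights_gb = 17.0  # ~17 GB of quantized weights, as mentioned above

# Approximate spec bandwidths in GB/s; treat these as ballpark figures.
for gpu, bw in [("RTX 4090", 1008), ("RTX 3090", 936)]:
    print(f"{gpu}: ~{bandwidth_ceiling_tokens_per_s(bw, weights_gb):.0f} tokens/s ceiling")

# RTX 4090: ~59 tokens/s ceiling
# RTX 3090: ~55 tokens/s ceiling
```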
-
@turboderp Have you seen this project: https://github.com/mit-han-lab/llm-awq ?
-
Trying to figure out which GPU to buy; it's very hard to decide between a 3090 (or dual 3090s) and a 4090. @turboderp, are you able to post benchmark numbers from your setup using a 3090 (or Ti) instead of the 4090? It would be very helpful to see what the actual performance difference is between the two. Thanks for all your hard work btw, this repo is insane.
-
Of course. Here are the same tests from the readme, run on the 3090:
I think there's definitely work to do on optimizing the kernel for the 3090 Ti. The slower prompt speed is to be expected, since prompt processing just boils down to a regular cuBLAS FP16 matmul, and the 4090 is some 50% faster there, as expected. The token speed is a little disappointing, though: only about 60-80% of the 4090's, even though the bandwidth is supposed to be the same. It's possible it just needs some tuning for the 30-series specifically.
-
CMP 50HX, using Wizard Vicuna 13B Uncensored GPTQ
-
A few benchmarks on OpenLLaMA 7B and CUDA 12.1. I also modified the benchmark script to generate 512 tokens, as that's a bit more realistic than the 128-token default, imo.
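For reference, a minimal timing loop in the spirit of that change (a sketch only: `generate_tokens` is a hypothetical stand-in for whatever generation call the benchmark script actually makes, not exllama's API):

```python
import time

def measure_tokens_per_second(generate_tokens, num_tokens: int = 512, warmup_runs: int = 2) -> float:
    """Time token generation and return tokens/s.

    `generate_tokens(n)` is a hypothetical callable that generates n tokens;
    swap in the real generation call from the benchmark script.
    """
    for _ in range(warmup_runs):          # warm-up passes (CUDA context, caches, etc.)
        generate_tokens(num_tokens)

    start = time.perf_counter()
    generate_tokens(num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Example: tps = measure_tokens_per_second(my_generate, num_tokens=512)
# Longer generations (512 vs. the 128 default) average out per-call overhead
# and better reflect sustained decoding speed.
```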
-
This is a slower, first-gen EPYC, FYI.
-
Tested it on my 1060 with llama 7b 4bit and got about 4 tokens/s (compared to 10+ tokens/s with gptq-for-llama). Just the two warmup passes in the benchmark script took over 10 minutes to run. I know this isn't being optimized for that kind of hardware, but I figured I'd put it here to give an idea of what it's like. On my 3070, however, it's super fast, in the range of 50-70 tokens/s, while gptq-for-llama stays below 15 tokens/s.
-
What sort of PCIe lane setup do you all have on your motherboards for a dual-GPU setup? I'm having trouble finding a consumer mobo that supports two x16 slots at more than x16 + x4. I found a couple that can do x8 + x8, and they're reasonably priced. Which would be the better option: x16 + x4, or x8 + x8? How much does this matter for inference in general (or finetuning), and more specifically for exllama?
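A back-of-envelope comparison of those two layouts (a sketch: per-lane rates are approximate, PCIe 4.0 is assumed, and the actual impact depends on how much traffic the workload pushes over the links):

```python
# Approximate one-directional PCIe bandwidth per lane, in GB/s (rough spec figures).
PCIE_GB_S_PER_LANE = {3: 0.985, 4: 1.97}

def slot_bandwidth(lanes: int, gen: int = 4) -> float:
    """Approximate one-directional bandwidth of a single slot."""
    return lanes * PCIE_GB_S_PER_LANE[gen]

for label, lane_split in [("x16 + x4", (16, 4)), ("x8 + x8", (8, 8))]:
    per_slot = [slot_bandwidth(lanes) for lanes in lane_split]
    print(f"{label}: ~{per_slot[0]:.0f} + ~{per_slot[1]:.0f} GB/s "
          f"(slowest slot ~{min(per_slot):.0f} GB/s)")

# x16 + x4: ~32 + ~8 GB/s (slowest slot ~8 GB/s)
# x8 + x8:  ~16 + ~16 GB/s (slowest slot ~16 GB/s)
# For layer-split inference that only passes small activations between GPUs
# per token, even the x4 link may be far from the bottleneck; link speed
# tends to matter more for loading weights and for training-style traffic.
```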
-
I just wondered: how well would exllama (in theory) work with the Nvidia A16? Loading the model would take quite some time, I suppose, but what about the inference after that?
-
First of all, I would like to thank you for your work. I really like your inference implementation; it seems to be the fastest so far for Nvidia GPUs!
I ran a bunch of tests on various GPUs and wanted to share the results I got.
All tests are average performance over 40 runs, with a 1024-token context and 170 tokens generated each run, using wizardlm-30b.
It would be interesting to know if you have any ideas on how to boost performance further. Looking at the numbers, there clearly might be some room for improvement, specifically on high-end GPUs.