Speed on A100 #266
Hi, thanks for the cool project.

I am testing Llama-2-70B-GPTQ with 1 * A100 40G, and the speed is around 9 t/s.

Is this the expected speed? I noticed in some other issues that the code is only optimized for consumer GPUs, but I just wanted to double-check whether that's the expected speed or I made a mistake somewhere.

Comments

I haven't tested 70B on an A100 before, but the speed is close to what I've seen for 65B on an A100, so I think this is about expected, yes.

To give you another data point, with 70B I get 10-13 t/s per A100 80 GB (SXM4).

I can't believe that the A100 gets the same speed as the 3090. Maybe something can be improved here?

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound, as you end up being here, what matters is that the A100 40G only has about 50-60% more global memory bandwidth than the 3090. So if the implementation is properly optimized and tuned for that architecture (ExLlama isn't, to be clear), then you're looking at 50-60% more tokens per second.

Now, if you're serving large batches, inference becomes compute-bound instead, and the A100 will outperform the 3090 very easily. But to serve large batches you also need a bunch more VRAM dedicated to state and cache. 40 GB won't get you very far, and even 80 GB is questionable.

What use-case are you optimizing for, then? One quantized 70B model serving no more than 8 concurrent users, or something? A small business willing to invest in one A100 but not two, or three? Or, if you're also trying to accommodate multi-A100 setups with tensor parallelism and whatnot, at what point does quantization stop making sense?

But yes, V2 is coming, and it's faster all around, including on the A100. So there's that.
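To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch of the single-batch decode ceiling implied by memory bandwidth alone. The bandwidth figures are published spec-sheet numbers, and the ~36 GiB weight footprint for a 4-bit 70B model is an assumed ballpark, not a measurement from this thread:

```python
# Back-of-the-envelope ceiling for memory-bound, single-batch decoding:
# each generated token streams (roughly) all quantized weights through
# global memory, so tokens/s <= bandwidth / weight_bytes.
# All figures below are approximate spec-sheet / ballpark values.

GIB = 1024**3
weights_gib = 36  # ~4-bit 70B GPTQ weights (assumed ballpark)

bandwidth_gbs = {
    "RTX 3090":      936,   # GDDR6X; a 24 GB card can't hold 70B alone, shown for the ratio only
    "A100 40GB":     1555,  # HBM2
    "A100 80GB SXM": 2039,  # HBM2e
}

for gpu, bw in bandwidth_gbs.items():
    ceiling = bw * 1e9 / (weights_gib * GIB)
    print(f"{gpu:>14}: <= {ceiling:4.1f} tokens/s (theoretical ceiling)")
```

The 9-13 t/s reported in this thread sits well below those ceilings, since attention over the KV cache, dequantization and kernel overhead all cost time as well, but the point stands: at batch size 1 the achievable ratio between GPUs tracks the bandwidth ratio, not the price ratio.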