Speed on A100 #266

Open
Ber666 opened this issue Aug 30, 2023 · 4 comments
Comments

@Ber666

Ber666 commented Aug 30, 2023

Hi, thanks for the cool project.
I am testing Llama-2-70B-GPTQ on a single A100 40 GB, and the speed is around 9 tokens/s.
[screenshot of generation speed]
Is this the expected speed? I noticed in some other issues that the code is only optimized for consumer GPUs, but I wanted to double-check whether this is expected or I made a mistake somewhere.

@turboderp
Owner

I haven't tested 70B on A100 before, but the speed is close to what I've seen for 65B on A100, so I think this is about what's expected, yes.

@jday96314

To give you another data point, with 70B I get 10 - 13 t/s per A100 80 GB (SXM4).

@akaikite

I can't believe the A100 gets the same speed as the 3090. Maybe something can be improved here?

@turboderp
Owner

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound, as you end up being here, what matters is that the A100 40G only has about 50-60% more global memory bandwidth than the 3090. So if the implementation is properly optimized and tuned for that architecture (ExLlama isn't, to be clear) then you're looking at 50-60% more tokens per second.
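As a rough sanity check on that, here's a back-of-envelope estimate of the memory-bound ceiling. The bandwidth figures and the ~35 GB 4-bit weight size below are approximate public numbers, not measurements:

```python
# Rough memory-bound ceiling for single-stream decoding: every generated
# token has to stream (roughly) all of the quantized weights from global
# memory once, so tokens/s is bounded by bandwidth / weight size.
# Bandwidth figures and the 4-bit weight size are approximate.

def max_tokens_per_second(bandwidth_gb_s, weight_gb):
    return bandwidth_gb_s / weight_gb

weights_70b_4bit_gb = 35.0  # ~70B params * ~0.5 bytes/param (4-bit GPTQ)

for gpu, bandwidth in [("RTX 3090", 936), ("A100 40GB", 1555), ("A100 80GB SXM4", 2039)]:
    ceiling = max_tokens_per_second(bandwidth, weights_70b_4bit_gb)
    print(f"{gpu}: <= {ceiling:.0f} tokens/s (theoretical upper bound)")
```

The observed 9-13 tokens/s sits well below those ceilings, consistent with the kernels not being tuned for the A100, but the headroom over a 3090 is still bounded by the bandwidth ratio rather than the price ratio.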

Now, if you're serving large batches, inference becomes compute-bound instead, and the A100 will outperform the 3090 very easily. But to serve large batches you also need a bunch more VRAM dedicated to state and cache. 40 GB won't get you very far, and even 80 GB is questionable. What use-case are you optimizing for, then? One quantized 70B model serving no more than 8 concurrent users, or something? A small business willing to invest in one A100 but not two, or three? Or if you're also trying to accommodate multi-A100 setups with tensor parallelism and whatnot, at what point does quantization stop making sense?
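To put a number on the cache budget, here's a rough estimate of KV-cache size per concurrent user, assuming Llama-2-70B's published shape (80 layers, 8 KV heads with GQA, head dim 128) and an FP16 cache:

```python
# Rough KV-cache footprint per concurrent user for a Llama-2-70B-shaped model.
# Shape assumed from the public Llama 2 config; FP16 cache, no cache quantization.

layers        = 80
kv_heads      = 8      # grouped-query attention
head_dim      = 128
bytes_per_val = 2      # FP16
context_len   = 4096

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
gb_per_user     = bytes_per_token * context_len / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"{gb_per_user:.2f} GB per user at {context_len} tokens of context")
```

That works out to roughly 1.25 GB per user at full context, so 8 concurrent users need on the order of 10 GB of cache on top of ~35 GB of 4-bit weights, which is why 40 GB gets tight and even 80 GB fills up quickly as the batch grows.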

But yes, V2 is coming, and it's faster all around, including on the A100. So there's that.
