-
We are running (and publishing under open licenses) a number of different benchmarks (e.g. memory bandwidth, OpenSSL and compression speed, redis and static web server workloads, Geekbench, PassMark, etc.) on 2000+ cloud server types at sparecores.com, and we are currently working on a new set of benchmarks, to be run on all servers, measuring LLM inference speed with tiny, medium-sized, and larger models in various configs.
Can someone help us understand what limit we are facing here? Ideally, we need a tool that we can run on CPUs, a single GPU, or many GPUs and get an overall tokens/sec score for each model/config on each server -- hopefully without much tweaking, and working even with small models so that we can run it on tiny machines as well. What we have tried:
Example run on a
Any hints are appreciated 🙇
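(For context, the kind of single-command, per-server run we have in mind would look roughly like the sketch below, assuming llama.cpp's `llama-bench`; the model path and parameter values are placeholders rather than our actual setup.)

```sh
# Sketch only: compare a CPU-only run and a fully offloaded GPU run on one machine,
# and emit machine-readable results that can be aggregated across servers.
#   -m   path to a GGUF model (placeholder)
#   -p   prompt tokens, -n generated tokens
#   -ngl 0,99 -> run once with no GPU offload (CPU only) and once fully offloaded
#   -o json   -> structured output that is easy to collect per server
./llama-bench -m models/tiny-model.Q4_K_M.gguf -p 512 -n 128 -ngl 0,99 -o json
```

The per-configuration tokens/sec figures in the output could then be reduced to a single score per server.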
Replies: 2 comments
-
The default `-sm layer` only supports pipeline parallelism when evaluating large prompts. You would probably need to use a larger prompt to observe a significant improvement. You can also use `-sm row` to enable tensor parallelism, but performance is not always better.
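For illustration, a comparison along those lines could be run roughly as follows with `llama-bench` (the model path and sizes are placeholders; `-p` sets the prompt length and `-sm` the split mode):

```sh
# Sketch only: benchmark both split modes with a small and a large prompt,
# so the pipeline-parallel benefit of -sm layer on large prompts becomes visible.
#   -p 512,4096   -> run each configuration with a 512- and a 4096-token prompt
#   -sm layer,row -> compare pipeline (layer) vs. tensor (row) parallelism
#   -ngl 99       -> offload all layers to the GPU(s)
./llama-bench -m models/model.Q4_K_M.gguf -p 512,4096 -n 128 -ngl 99 -sm layer,row -o md
```

Comparing the prompt-processing rows for the two prompt sizes should show whether the multi-GPU setup is actually being exercised.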
-
Thank you for the suggestion and details, much appreciated! 🙇