-
We are running (and publishing under open licenses) a number of different benchmarks (e.g. memory bandwidth, OpenSSL and compression speed, redis and static web server workloads, Geekbench, PassMark, etc.) on 2000+ cloud server types at sparecores.com, and we are currently working on a new set of benchmarks, to be run on all servers, measuring LLM inference speed with tiny, medium-sized, and larger models in various configs.
Can someone help us understand what limit we are facing here? Ideally, we need a tool that we can run on CPUs, a single GPU, or many GPUs and get an overall tokens/sec score for each model/config on each server -- hopefully without much tweaking, and working even with small models so that we can run it on tiny machines as well. What we have tried:
Example run on a
Any hints are appreciated 🙇
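(For context, the kind of single-command, per-server run we have in mind would look roughly like the sketch below, assuming llama.cpp's `llama-bench`; the model path and parameter values are placeholders rather than our actual setup.)

```sh
# Sketch only: compare a CPU-only run and a fully offloaded GPU run on one machine,
# and emit machine-readable results that can be aggregated across servers.
#   -m   path to a GGUF model (placeholder)
#   -p   prompt tokens, -n generated tokens
#   -ngl 0,99 -> run once with no GPU offload (CPU only) and once fully offloaded
#   -o json   -> structured output that is easy to collect per server
./llama-bench -m models/tiny-model.Q4_K_M.gguf -p 512 -n 128 -ngl 0,99 -o json
```

The per-configuration tokens/sec figures in the output could then be reduced to a single score per server.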
Replies: 2 comments
-
The default `-sm layer` only supports pipeline parallelism when evaluating large prompts. You would probably need to use a larger prompt to observe a significant improvement. You can also use `-sm row` to enable tensor parallelism, but performance is not always better.
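For illustration, a comparison along those lines could be run roughly as follows with `llama-bench` (the model path and sizes are placeholders; `-p` sets the prompt length and `-sm` the split mode):

```sh
# Sketch only: benchmark both split modes with a small and a large prompt,
# so the pipeline-parallel benefit of -sm layer on large prompts becomes visible.
#   -p 512,4096   -> run each configuration with a 512- and a 4096-token prompt
#   -sm layer,row -> compare pipeline (layer) vs. tensor (row) parallelism
#   -ngl 99       -> offload all layers to the GPU(s)
./llama-bench -m models/model.Q4_K_M.gguf -p 512,4096 -n 128 -ngl 99 -sm layer,row -o md
```

Comparing the prompt-processing rows for the two prompt sizes should show whether the multi-GPU setup is actually being exercised.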
-
Thank you for the suggestion and details, much appreciated! 🙇