tutorial : parallel inference using Hugging Face dedicated endpoints #9041
ggerganov started this conversation in Show and tell
Replies: 1 comment

That's awesome! Thanks for taking the time to test so many configurations. I look forward to the improvements in the KV cache management, and hopefully the performance will be improved significantly.
Overview
This post demonstrates how to deploy `llama.cpp` as an inference engine in the cloud using a Hugging Face dedicated inference endpoint. We create a sample endpoint serving a LLaMA model on a single-GPU node and run some benchmarks on it. Sample results are presented and possible optimizations are discussed. Feedback and additional ideas for optimization are welcome!

Instructions
Go to https://ui.endpoints.huggingface.co/ and set up the new endpoint like this:
Here we use the `LLAMACPP_ARGS` environment variable as a temporary mechanism to pass custom arguments to the `llama-server` binary. This is possible because the selected Docker container (in this case `ggml/llama-cpp-cuda-default`) supports it:

https://github.com/ggml-org/hf-inference-endpoints/blob/6df3fdeb9528a561582ec60ba3ef3308943b5799/llama.cpp/cuda-default/Dockerfile#L39
After the endpoint initializes successfully, you should see this:
In the "Logs" tab, you can download the generated log from the execution of the selected Docker container running
llama-bench
andllama-server
. For example:To test the connectivity, run a basic
curl
command (here we ignore any prompt templates, to make it simple):Or simply open the URL in the browser and use the built-in chat interface:
https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud
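As a minimal sketch of such a request (the use of a Hugging Face access token in `HF_TOKEN` and of `llama-server`'s raw `/completion` route are assumptions here):

```bash
# Basic connectivity check (sketch): send a raw completion request without a chat template
curl -s https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud/completion \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, my name is", "n_predict": 32}'
```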
Benchmarks
We will run a parallel load test using a `k6` script; a sketch of such a script is given below. After the test is complete, you will see the following stats:
We are mainly interested in the following metrics:
The higher the rates - the better.
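Here is a minimal sketch of what such a `k6` test could look like. The endpoint URL, the `/completion` route, and the `HF_TOKEN` variable are assumptions, and the settings normally kept in `common.js` are inlined for illustration; the parameters mirror the ones described in the next subsection.

```javascript
// Minimal k6 sketch of the parallel load test (assumptions noted above)
import http from 'k6/http';
import { check } from 'k6';

const max_new_tokens = 50; // maximum number of new tokens generated per request

export const options = {
  scenarios: {
    completions: {
      executor: 'shared-iterations',
      vus: 16,             // number of parallel requests
      iterations: 500,     // total number of requests sent
      maxDuration: '480s', // maximum runtime of the test
    },
  },
};

// Format the input prompt for a single request (no chat template applied)
function generate_payload() {
  return JSON.stringify({
    prompt: 'Write a short story about a llama.',
    n_predict: max_new_tokens,
  });
}

export default function () {
  const url = 'https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud/completion';
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.HF_TOKEN}`,
    },
  };

  const res = http.post(url, generate_payload(), params);
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

The script can then be executed with `k6 run`.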
Configuring the test parameters

The `common.js` file can be adjusted in different ways:

- `max_new_tokens = 50` specifies the maximum number of new tokens to be generated for each request
- `vus: 16` sets the number of parallel requests being sent to the endpoint
- `iterations: 500` is the total number of requests sent
- `maxDuration: 480s` is the maximum runtime of the test
- the `generate_payload` function formats the input prompts
Selecting `LLAMACPP_ARGS`
Depending on the use case, it is important to set the `LLAMACPP_ARGS` environment variable of the endpoint properly in order to achieve optimal results. Here is an example:
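This matches the 16-VU configuration used in the benchmarks below:

```
LLAMACPP_ARGS = -fa -c 131072 -np 16 --threads-http 16 --metrics -dt 0.05
```

Here `-c` sets the total context (KV cache) size in tokens, `-np` the number of parallel slots, `-fa` enables Flash Attention, `--threads-http` the number of HTTP server threads, `--metrics` exposes server metrics, and `-dt` the KV cache defragmentation threshold.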
Using this configuration, we expect the endpoint to serve a maximum of 16 requests in parallel with a total KV cache size of 131072 tokens. This means that each request should not exceed 131072 / 16 = 8192 tokens (prompt + completion). Generally, enabling Flash Attention (`-fa`) is recommended for GPU endpoints.

Using these parameters, we can inspect the logs of the endpoint and see that the KV cache alone requires about ~16 GB of VRAM:
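As a rough cross-check of that figure (assuming the default f16 KV cache and the Llama-3.1-8B attention shape of 32 layers, 8 KV heads, and a head dimension of 128):

$$
2 \times 2\,\text{bytes} \times 32 \times 8 \times 128 = 131072\,\text{bytes} = 128\,\text{KiB per token}
$$

$$
131072\,\text{tokens} \times 128\,\text{KiB} = 16\,\text{GiB}
$$

where the leading factor of 2 accounts for the K and V tensors and the 2 bytes per element for f16 storage.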
Performance
Here are sample numbers using the described benchmark and an NVIDIA A10G 24GB GPU endpoint:

- Model: https://huggingface.co/ngxson/Meta-Llama-3.1-8B-Instruct-Q4_K_M-GGUF/tree/main
- GPU: https://www.nvidia.com/en-us/data-center/products/a10-gpu/
- CUDA: 11.7.1
llama-bench
K6 tests:

- VUs = 1, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 8192 -np 1 --threads-http 16 --metrics -dt 0.05`
- VUs = 2, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 16384 -np 2 --threads-http 16 --metrics -dt 0.05`
- VUs = 4, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 32768 -np 4 --threads-http 16 --metrics -dt 0.05`
- VUs = 8, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 65536 -np 8 --threads-http 16 --metrics -dt 0.05`
- VUs = 16, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 131072 -np 16 --threads-http 16 --metrics -dt 0.05`
Measured
It is interesting to analyze how well the inference scales with an increasing number of parallel requests. With an ideal implementation that has perfect scaling for parallel requests, the numbers in the table above would scale linearly with the number of VUs (virtual users). So, if we take the results for 1 VU as a baseline, we can estimate the expected ideal performance for VUs > 1:
Ideal (estimated from VUs = 1)
Measured / Ideal
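In other words, with $\text{rate}_1$ denoting the measured rate at 1 VU:

$$
\text{rate}_{\text{ideal}}(N) = N \cdot \text{rate}_1,
\qquad
\frac{\text{Measured}}{\text{Ideal}} = \frac{\text{rate}_{\text{measured}}(N)}{N \cdot \text{rate}_1}
$$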
Based on this analysis, we can see that `llama.cpp` has a ~40% performance deficit at 8 VUs and ~60% at 16 VUs compared to a perfectly linear-scaling implementation. These numbers are obviously not very precise and can vary based on the test parameters, but they should still give a rough idea of how much further we can improve the performance for parallel generation.

Most likely, the biggest factor degrading the performance at the moment is the unified KV cache (see #4130 (comment)). It does not seem well suited to the parallel use case, and if we want to improve the parallel performance in the future, we should implement it in a more suitable way.
Depending on the selected hardware, the performance could potentially be improved further by tuning some of the advanced build parameters. For simplicity, we have used the default `llama.cpp` build options here, only enabling `GGML_CUDA_FORCE_MMQ=ON` in the Docker container.
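For reference, here is a rough sketch of how such an option would be enabled when building `llama.cpp` manually with CMake (the endpoint itself relies on the prebuilt Docker image instead):

```bash
# Sketch: build llama.cpp with CUDA and force the MMQ kernels on
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
```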
Credits
Thanks to @ngxson for guiding me on how to create the HF endpoints and the Docker containers, and how to run the k6 benchmarks.