tutorial : parallel inference using Hugging Face dedicated endpoints #9041
ggerganov started this conversation in Show and tell
Replies: 1 comment

That's awesome! Thanks for taking the time to test so many configurations. I look forward to the improvements in the KV cache management, and hopefully the performance will be improved significantly.
Overview
This post demonstrates how to deploy `llama.cpp` as an inference engine in the cloud using a Hugging Face dedicated inference endpoint. We create a sample endpoint serving a LLaMA model on a single-GPU node and run some benchmarks on it. Sample results are presented and possible optimizations are discussed. Feedback and additional ideas for optimization are welcome!

Instructions
Go to https://ui.endpoints.huggingface.co/ and set up the new endpoint like this:
Here we use the `LLAMACPP_ARGS` environment variable as a temporary mechanism to pass custom arguments to the `llama-server` binary. This is possible because the selected Docker container (in this case `ggml/llama-cpp-cuda-default`) supports it:

https://github.com/ggml-org/hf-inference-endpoints/blob/6df3fdeb9528a561582ec60ba3ef3308943b5799/llama.cpp/cuda-default/Dockerfile#L39
After the endpoint initializes successfully, you should see this:
In the "Logs" tab, you can download the generated log from the execution of the selected Docker container running
llama-bench
andllama-server
. For example:To test the connectivity, run a basic
curl
command (here we ignore any prompt templates, to make it simple):Or simply open the URL in the browser and use the built-in chat interface:
https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud
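As a minimal sketch of such a request (the use of a Hugging Face access token in `HF_TOKEN` and of `llama-server`'s raw `/completion` route are assumptions here):

```bash
# Basic connectivity check (sketch): send a raw completion request without a chat template
curl -s https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud/completion \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, my name is", "n_predict": 32}'
```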
Benchmarks
We will run a parallel load test using a `k6` script; a sketch of such a script is given below. After the test is complete, you will see the following stats:
We are mainly interested in the following metrics:
The higher the rates - the better.
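Here is a minimal sketch of what such a `k6` test could look like. The endpoint URL, the `/completion` route, and the `HF_TOKEN` variable are assumptions, and the settings normally kept in `common.js` are inlined for illustration; the parameters mirror the ones described in the next subsection.

```javascript
// Minimal k6 sketch of the parallel load test (assumptions noted above)
import http from 'k6/http';
import { check } from 'k6';

const max_new_tokens = 50; // maximum number of new tokens generated per request

export const options = {
  scenarios: {
    completions: {
      executor: 'shared-iterations',
      vus: 16,             // number of parallel requests
      iterations: 500,     // total number of requests sent
      maxDuration: '480s', // maximum runtime of the test
    },
  },
};

// Format the input prompt for a single request (no chat template applied)
function generate_payload() {
  return JSON.stringify({
    prompt: 'Write a short story about a llama.',
    n_predict: max_new_tokens,
  });
}

export default function () {
  const url = 'https://iaa9969mg5il6omy.us-east-1.aws.endpoints.huggingface.cloud/completion';
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.HF_TOKEN}`,
    },
  };

  const res = http.post(url, generate_payload(), params);
  check(res, { 'status is 200': (r) => r.status === 200 });
}
```

The script can then be executed with `k6 run`.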
Configuring the test parameters

The `common.js` file can be adjusted in different ways:

- `max_new_tokens = 50` specifies the maximum number of new tokens to be generated for each request
- `vus: 16` sets the number of parallel requests being sent to the endpoint
- `iterations: 500` is the total number of requests sent
- `maxDuration: 480s` is the maximum runtime of the test
- the `generate_payload` function formats the input prompts
Selecting `LLAMACPP_ARGS`
Depending on the use case, it is important to set the `LLAMACPP_ARGS` environment variable of the endpoint properly in order to achieve optimal results. Here is an example:
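This matches the 16-VU configuration used in the benchmarks below:

```
LLAMACPP_ARGS = -fa -c 131072 -np 16 --threads-http 16 --metrics -dt 0.05
```

Here `-c` sets the total context (KV cache) size in tokens, `-np` the number of parallel slots, `-fa` enables Flash Attention, `--threads-http` the number of HTTP server threads, `--metrics` exposes server metrics, and `-dt` the KV cache defragmentation threshold.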
Using this configuration, we expect the endpoint to serve a maximum of 16 requests in parallel with a total KV cache size of 131072 tokens. This means that each request should not exceed 131072 / 16 = 8192 tokens (prompt + completion). Generally, enabling Flash Attention (`-fa`) is recommended for GPU endpoints.

Using these parameters, we can inspect the logs of the endpoint and see that the KV cache alone requires about ~16 GB of VRAM:
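As a rough cross-check of that figure (assuming the default f16 KV cache and the Llama-3.1-8B attention shape of 32 layers, 8 KV heads, and a head dimension of 128):

$$
2 \times 2\,\text{bytes} \times 32 \times 8 \times 128 = 131072\,\text{bytes} = 128\,\text{KiB per token}
$$

$$
131072\,\text{tokens} \times 128\,\text{KiB} = 16\,\text{GiB}
$$

where the leading factor of 2 accounts for the K and V tensors and the 2 bytes per element for f16 storage.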
Performance
Here are sample numbers using the described benchmark and an NVIDIA A10G 24GB GPU endpoint:

- Model: https://huggingface.co/ngxson/Meta-Llama-3.1-8B-Instruct-Q4_K_M-GGUF/tree/main
- GPU: https://www.nvidia.com/en-us/data-center/products/a10-gpu/
- CUDA: 11.7.1
llama-bench
K6 tests:

- VUs = 1, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 8192 -np 1 --threads-http 16 --metrics -dt 0.05`
- VUs = 2, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 16384 -np 2 --threads-http 16 --metrics -dt 0.05`
- VUs = 4, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 32768 -np 4 --threads-http 16 --metrics -dt 0.05`
- VUs = 8, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 65536 -np 8 --threads-http 16 --metrics -dt 0.05`
- VUs = 16, iters = 500, 180s max, LLAMACPP_ARGS = `-fa -c 131072 -np 16 --threads-http 16 --metrics -dt 0.05`
Measured
It is interesting to analyze how well the inference scales with an increasing number of parallel requests. With an ideal implementation that has perfect scaling for parallel requests, the numbers in the table above would scale linearly with the number of VUs (virtual users). So, if we take the results for 1 VU as a baseline, we can estimate the expected ideal performance for VUs > 1:
Ideal (estimated from VUs = 1)
Measured / Ideal
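In other words, with $\text{rate}_1$ denoting the measured rate at 1 VU:

$$
\text{rate}_{\text{ideal}}(N) = N \cdot \text{rate}_1,
\qquad
\frac{\text{Measured}}{\text{Ideal}} = \frac{\text{rate}_{\text{measured}}(N)}{N \cdot \text{rate}_1}
$$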
Based on this analysis, we can see that `llama.cpp` has a ~40% performance deficit at 8 VUs and ~60% at 16 VUs compared to a perfectly linear-scaling implementation. These numbers are obviously not very precise and can vary based on the test parameters, but they should still give a rough idea of how much further we can improve the performance for parallel generation.

Most likely, the biggest factor degrading the performance at the moment is the unified KV cache (see #4130 (comment)). It does not seem well suited to the parallel use case, and if we want to improve the parallel performance in the future, we should implement it in a more suitable way.
Depending on the selected hardware, the performance could potentially be improved further by tuning some of the advanced build parameters. For simplicity, we have used the default `llama.cpp` build options here, only enabling `GGML_CUDA_FORCE_MMQ=ON` in the Docker container.
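For reference, here is a rough sketch of how such an option would be enabled when building `llama.cpp` manually with CMake (the endpoint itself relies on the prebuilt Docker image instead):

```bash
# Sketch: build llama.cpp with CUDA and force the MMQ kernels on
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
```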
Credits
Thanks to @ngxson for guiding me on how to create the HF endpoints and the Docker containers, and how to run the k6 benchmarks.