[Profiling][Model][Doc] Support Llama3-8B and 70B on A100s (#22)
* Merged PR 1873: Support Llama3 8B and 70B for 32k context length on a100_pairwise_nvlink

# Changelog

* Support Llama3 8B and 70B https://llama.meta.com/llama3/
* Max supported context length is 32k, only on 4xA100.
* Pipeline parallelism is not yet profiled for context lengths beyond 4k.
* Attention profiling enhancements:
  * Reduce the number of input combinations by removing batches that require more KV cache blocks than the available GPU memory can hold.

* Fix llama3-8B and 70B profiling data

* Bring documentation files to top-level docs/ folder

* Add llama3-70b attention profiling data

* format

* minor
nitinkedia7 authored Jul 24, 2024
1 parent 2e72d7a commit 2bb7a08
Showing 22 changed files with 150,223 additions and 21,140 deletions.
73 changes: 43 additions & 30 deletions README.md
@@ -1,12 +1,47 @@
# Vidur: LLM Inference Simulator

Vidur is a high-fidelity and extensible LLM inference simulator. It can help you with:

1. Capacity planning and finding the best deployment configuration for your LLM deployments.
2. Testing new research ideas such as new scheduling algorithms and optimizations like speculative decoding.
3. Studying the system performance of models under different workloads and configurations.

... all without access to GPUs except for a quick initial profiling phase.

Please refer to our [MLSys'24 paper](https://arxiv.org/abs/2405.05465) for more details.
We have a [live demo](https://vidur.westus2.cloudapp.azure.com/) that captures the capabilities of the system.

![Simulator Fidelity](./assets/dynamic_fidelity_v8_request_e2e_time_normalized_85_p95.jpeg)
*Difference in 95th percentile Request E2E Normalized time showing fidelity of Vidur's execution time predictions across four models and three dynamic workload traces, using request load at 85% of the maximum serving capacity for each scenario.*
![Config Search](./assets/llama70b_Chat1M_ttft_tbt_90_99_2.0_0.2.jpeg)
*Capacity per dollar for different deployment configurations vs TTFT-P90 (left) and TBT-P99 (middle) for LLaMA2-70B.*
## Supported Models

| Model / Device | A100 80GB DGX | H100 DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
| --- | --- | --- | --- | --- |
| `meta-llama/Meta-Llama-3-8B` |||||
| `meta-llama/Meta-Llama-3-70B` |||||
| `meta-llama/Llama-2-7b-hf` |||||
| `codellama/CodeLlama-34b-Instruct-hf` |||||
| `meta-llama/Llama-2-70b-hf` |||||
| `internlm/internlm-20b` |||||
| `Qwen/Qwen-72B` |||||

* __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
* All models support a maximum context length of 4k, except `Llama3-8B` and `Llama3-70B`, which support a 16k context length when the following additional CLI params are passed:

```text
--sklearn_execution_time_predictor_prediction_max_prefill_chunk_size 16384 \
--sklearn_execution_time_predictor_prediction_max_batch_size 512 \
--sklearn_execution_time_predictor_prediction_max_tokens_per_request 16384 \
```

* Pipeline parallelism is supported for all models. The PP dimension should divide the number of layers in the model.
* In DGX nodes, there are 8 GPUs, fully connected via NVLink. So TP1, TP2, TP4 and TP8 are supported.
* In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2 and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because (GPU1, GPU2) and (GPU3, GPU4) are each connected via NVLink, but the interconnect between these pairs is slower.
* You can use any combination of TP and PP. For example, you can run LLaMA2-70B with TP2-PP2 on a 4xA100 80GB Pairwise NVLink Node (see the sketch below).
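
As a quick illustration of these rules (a sketch only, not Vidur's actual validation logic), the snippet below enumerates the small power-of-two (TP, PP) combinations that satisfy them for a given model and node type:

```python
# Illustrative sketch, not Vidur's code: list (TP, PP) combinations that
# satisfy the constraints above for a given model and node type.
def valid_tp_pp_combinations(num_layers, num_gpus_per_node):
    # TP is limited by the GPUs available within one node.
    tp_options = [tp for tp in (1, 2, 4, 8) if tp <= num_gpus_per_node]
    combos = []
    for tp in tp_options:
        for pp in (1, 2, 4, 8):
            # PP must evenly divide the number of layers in the model.
            if num_layers % pp == 0:
                combos.append((tp, pp))
    return combos

# LLaMA2-70B has 80 layers; on a 4xA100 pairwise NVLink node the result
# includes (2, 2), i.e. the TP2-PP2 example above.
print(valid_tp_pp_combinations(num_layers=80, num_gpus_per_node=4))
```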

## Chrome Trace

Vidur exports a Chrome trace for each simulation to the `simulator_output` directory. The trace can be viewed by navigating to `chrome://tracing/` or `edge://tracing/` and loading the file.

![Chrome Trace](./assets/chrome_trace.png)
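
Because `chrome://tracing/` consumes the Chrome Trace Event JSON format, the exported trace can also be inspected programmatically. A minimal sketch (the file name below is illustrative; use whichever trace file appears under `simulator_output`):

```python
# Minimal sketch for peeking at a simulation trace programmatically.
# The path is illustrative; point it at the trace file in simulator_output.
import json
from collections import Counter

with open("simulator_output/chrome_trace.json") as f:
    trace = json.load(f)

# Chrome traces are either a bare list of events or an object with a
# "traceEvents" key; handle both.
events = trace["traceEvents"] if isinstance(trace, dict) else trace

# Count events by name for a quick summary of what was recorded.
print(Counter(e.get("name", "<unnamed>") for e in events).most_common(10))
```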

## Setup

@@ -84,35 +119,13 @@ python -m vidur.main \
--vllm_scheduler_max_tokens_in_batch 4096
```

The simulator supports a plethora of parameters for the simulation description which can be found [here](docs/launch_parameters.md).

The metrics will be logged to wandb directly and a copy will be stored in the `simulator_output` directory along with the chrome trace. A description of all the logged metrics can be found [here](docs/metrics.md).

## Formatting Code

To format code, execute the following command:

```sh
make format
```
16 changes: 16 additions & 0 deletions data/model_configs/meta-llama/Meta-Llama-3-70B.yml
@@ -0,0 +1,16 @@
num_layers: 80
num_q_heads: 64
num_kv_heads: 8
embedding_dim: 8192
mlp_hidden_dim: 28672
max_position_embeddings: 8192
use_gated_mlp: true
use_bias: false
use_qkv_bias: false
activation: silu
norm: rms_norm
post_attn_norm: true
rope_theta: 500000.0
rope_scaling: null
vocab_size: 128256
is_neox_style: true
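
One of the attention-profiling changes in this commit skips batch configurations whose KV cache would not fit in GPU memory. The per-token KV cache footprint implied by a config like the one above can be estimated from its fields; a rough sketch, assuming an fp16 KV cache and the usual head_dim = embedding_dim / num_q_heads convention (neither is stated in the file):

```python
# Rough estimate (not Vidur's profiling code) of the per-token KV cache
# footprint implied by the model config above. Assumes an fp16 KV cache and
# head_dim = embedding_dim / num_q_heads.
def kv_cache_bytes_per_token(num_layers, num_q_heads, num_kv_heads,
                             embedding_dim, dtype_bytes=2):
    head_dim = embedding_dim // num_q_heads
    # 2x for keys and values, stored for every layer and every KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Meta-Llama-3-70B: 2 * 80 * 8 * 128 * 2 = 327,680 bytes (~320 KiB) per token.
print(kv_cache_bytes_per_token(num_layers=80, num_q_heads=64,
                               num_kv_heads=8, embedding_dim=8192))
```

Batch and context-length combinations whose total KV cache exceeds the memory left after model weights are the ones the changelog above says are now dropped during attention profiling.
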
16 changes: 16 additions & 0 deletions data/model_configs/meta-llama/Meta-Llama-3-8B.yml
@@ -0,0 +1,16 @@
num_layers: 32
num_q_heads: 32
num_kv_heads: 8
embedding_dim: 4096
mlp_hidden_dim: 14336
max_position_embeddings: 4096
use_gated_mlp: true
use_bias: false
use_qkv_bias: false
activation: silu
norm: rms_norm
post_attn_norm: true
rope_theta: 500000.0
rope_scaling: null
vocab_size: 128256
is_neox_style: true
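
These model configs are plain YAML key-value files, so they can also be loaded outside the simulator. A minimal sketch using PyYAML (the `ModelConfig` dataclass is illustrative, not Vidur's internal config class):

```python
# Illustrative loader for the model config above; ModelConfig is a sketch,
# not Vidur's internal representation.
from dataclasses import dataclass
import yaml

@dataclass
class ModelConfig:
    num_layers: int
    num_q_heads: int
    num_kv_heads: int
    embedding_dim: int
    mlp_hidden_dim: int
    max_position_embeddings: int
    use_gated_mlp: bool
    use_bias: bool
    use_qkv_bias: bool
    activation: str
    norm: str
    post_attn_norm: bool
    rope_theta: float
    rope_scaling: object
    vocab_size: int
    is_neox_style: bool

with open("data/model_configs/meta-llama/Meta-Llama-3-8B.yml") as f:
    cfg = ModelConfig(**yaml.safe_load(f))

print(cfg.num_layers, cfg.num_kv_heads)  # 32 8
```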