Remove deprecated arguments from API and clarify model_name and chat_template_name (#1931)

* make model_name optional

* remove model_name from turbomind engine

* add chat_template_name in turbomind model config

* tell model_name and chat_template_name apart

* test chat.py

* develop get_tm_model

* remove get_hf_config_content

* remove to_file since it is indicated by out_dir

* minor fix

* add test_async_engine.py

* remove tp from class AsyncEngine

* --chat-template can be a string

* remove deprecates

* fix ut

* fix when test chatting

* fix lmdeploy convert tc

* update CLI

* update

* fix tc

* fix

* fix according to reviewer comments

* update

* update

* update

* update

* update

* rollback user guide

* fix

* fix typo

* rm trust_remote_code from cli

* fix typo

* update

* fix linting

* fix linting

* fix lint

* fix profile_generation

* fix docstring
lvhan028 authored Aug 8, 2024
1 parent 061f997 commit fb6c5a1
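
In short, the commit keeps the model and the chat template apart: the model is the HuggingFace repo id or local path passed to `pipeline` or the CLI, while the chat template is deduced from that path when possible and named explicitly otherwise. A minimal sketch of the resulting usage (the local checkpoint path below is a hypothetical placeholder; the `codellama` template is the one documented later in this commit):

```python
from lmdeploy import pipeline, ChatTemplateConfig

# Recognized model: no model_name argument is needed; the chat template
# is matched from the model path.
pipe = pipeline('internlm/internlm2_5-7b-chat')

# Unrecognized checkpoint (e.g. a fine-tuned model in a local directory):
# name the chat template explicitly. The path here is a hypothetical placeholder.
pipe = pipeline('/path/to/finetuned/CodeLlama-7b-Instruct-hf',
                chat_template_config=ChatTemplateConfig(model_name='codellama'))
```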
Showing 41 changed files with 598 additions and 814 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -50,6 +50,7 @@ dist/
examples/cpp/llama/*.csv
*.npy
*.weight
install/

# LMDeploy
workspace/
2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/llm/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports flash-attention2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
2 changes: 1 addition & 1 deletion README_ja.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMindの主要なアップグレード、包括的なPaged Attention、シーケンス長制限のない高速なアテンションカーネル、2倍速いKV8カーネル、Split-Kデコーディング(Flash Decoding)、およびsm_75のW4A16推論
- \[2023/09\] TurboMindはQwen-14Bをサポート
- \[2023/09\] TurboMindはInternLM-20Bをサポート
- \[2023/09\] TurboMindはCode Llamaのすべての機能をサポート:コード補完、インフィリング、チャット/インストラクト、Pythonスペシャリスト。デプロイメントガイドは[こちら](./docs/en/supported_models/codellama.md)をクリックしてください
- \[2023/09\] TurboMindはCode Llamaのすべての機能をサポート:コード補完、インフィリング、チャット/インストラクト、Pythonスペシャリスト。デプロイメントガイドは[こちら](./docs/en/llm/codellama.md)をクリックしてください
- \[2023/09\] TurboMindはBaichuan2-7Bをサポート
- \[2023/08\] TurboMindはflash-attention2をサポート
- \[2023/08\] TurboMindはQwen-7B、動的NTK-RoPEスケーリング、動的logNスケーリングをサポート
2 changes: 1 addition & 1 deletion README_zh-CN.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMind 重磅升级。包括:Paged Attention、更快的且不受序列最大长度限制的 attention kernel、2+倍快的 KV8 kernels、Split-K decoding (Flash Decoding) 和 支持 sm_75 架构的 W4A16
- \[2023/09\] TurboMind 支持 Qwen-14B
- \[2023/09\] TurboMind 支持 InternLM-20B 模型
- \[2023/09\] TurboMind 支持 Code Llama 所有功能:代码续写、填空、对话、Python专项。点击[这里](./docs/zh_cn/supported_models/codellama.md)阅读部署方法
- \[2023/09\] TurboMind 支持 Code Llama 所有功能:代码续写、填空、对话、Python专项。点击[这里](./docs/zh_cn/llm/codellama.md)阅读部署方法
- \[2023/09\] TurboMind 支持 Baichuan2-7B
- \[2023/08\] TurboMind 支持 flash-attention2
- \[2023/08\] TurboMind 支持 Qwen-7B,动态NTK-RoPE缩放,动态logN缩放
4 changes: 2 additions & 2 deletions autotest/tools/convert/test_convert.py
@@ -40,7 +40,7 @@ def convert(config, model_case, cuda_prefix):
or 'awq' in model_case.lower()):
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path, '--model-format awq --group-size 128 --trust-remote-code'
dst_path, '--model-format awq --group-size 128'
]),
config,
model_case,
@@ -49,7 +49,7 @@
else:
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path, '--trust-remote-code'
dst_path
]),
config,
model_case,
4 changes: 1 addition & 3 deletions benchmark/profile_generation.py
@@ -78,14 +78,12 @@ def warmup(model, concurrency: int, input_ids: List[int], warmup_round: int,
return

print('start to warmup ...')
output_seqlen = gen_config.max_new_tokens

def _infer(model, session_id):
chatbot = model.create_instance()
for _ in range(warmup_round):
for _ in chatbot.stream_infer(session_id,
input_ids=input_ids,
request_output_len=output_seqlen,
sequence_start=True,
sequence_end=True,
ignore_eos=True,
@@ -197,7 +195,7 @@ def profile_throughput(model_path: str, concurrency: int, input_seqlen: int,
f'token_latency percentiles(50%,75%,95%,99%)(s): {percentiles}\n'
f'throughput(output): {out_token_throughput} token/s\n'
f'throughput(total): {total_token_throughput} token/s\n{"-" * 50}')
return tm_model.model_name, \
return model_path, \
[first_token_latency_min, first_token_latency_max,
first_token_latency_ave], \
percentiles, out_token_throughput, total_token_throughput, \
4 changes: 2 additions & 2 deletions docs/en/benchmark/evaluate_with_opencompass.md
@@ -141,8 +141,8 @@ models = [internlm_chat_20b]

**Note**

- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#turbomindengineconfig)
and [EngineGenerationConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#generationconfig)
- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L114)
and [EngineGenerationConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L56)
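
As a hedged sketch of what that looks like in practice, extra arguments simply become fields of the `engine_config` and `gen_config` dicts in the model entry defined earlier in this guide. The `TurboMindModel` import path and the concrete values below are illustrative assumptions, not prescribed settings:

```python
from opencompass.models.turbomind import TurboMindModel  # assumed import path

internlm_chat_20b = dict(
    type=TurboMindModel,
    abbr='internlm-chat-20b-turbomind',
    path='internlm/internlm-chat-20b',
    # Any TurbomindEngineConfig field can go here, e.g. cache_max_entry_count.
    engine_config=dict(session_len=2048,
                       max_batch_size=8,
                       cache_max_entry_count=0.7),
    # Any EngineGenerationConfig field can go here.
    gen_config=dict(top_k=1,
                    temperature=1.0,
                    max_new_tokens=100),
    max_out_len=100,
    max_seq_len=2048,
    batch_size=8,
    run_cfg=dict(num_gpus=1, num_procs=1),
)

models = [internlm_chat_20b]
```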

## Execute Evaluation Task

7 changes: 5 additions & 2 deletions docs/en/get_started.md
@@ -48,8 +48,11 @@ pipe = pipeline('internlm/internlm2_5-7b-chat',
```

```{note}
The parameter "cache_max_entry_count" significantly influences the GPU memory usage. It means the proportion of FREE GPU memory occupied by the K/V cache after the model weights are loaded.
The default value is 0.8. Once allocated, the K/V cache memory is reused repeatedly, which is why it is common to observe that the built pipeline and the api_server mentioned later in the next consumes a substantial amount of GPU memory.
The parameter "cache_max_entry_count" significantly influences the GPU memory usage.
It means the proportion of FREE GPU memory occupied by the K/V cache after the model weights are loaded.
The default value is 0.8. The K/V cache memory is allocated once and reused repeatedly, which is why the built pipeline and the "api_server" mentioned later in this guide consume a substantial amount of GPU memory.
If you encounter an Out-of-Memory (OOM) error, you may need to lower the value of "cache_max_entry_count".
```
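
For example, the ratio can be lowered through `TurbomindEngineConfig`. The snippet below is a minimal sketch reusing the model above; the value 0.5 is an arbitrary illustration:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Give the K/V cache 50% of the free GPU memory instead of the default 80%.
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5))
response = pipe(['Hi, please introduce yourself'])
print(response)
```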

165 changes: 165 additions & 0 deletions docs/en/llm/codellama.md
@@ -0,0 +1,165 @@
# codellama

## Introduction

[codellama](https://github.com/facebookresearch/codellama) features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (e.g., “Write me a function that outputs the fibonacci sequence”). It can also be used for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, Bash and more.

There are three sizes (7b, 13b, 34b) as well as three flavours (base model, Python fine-tuned, and instruction tuned) released on [HuggingFace](https://huggingface.co/codellama).

| Base Model | Python | Instruct |
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) | [codellama/CodeLlama-7b-Python-hf](https://huggingface.co/codellama/CodeLlama-7b-Python-hf) | [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) |
| [codellama/CodeLlama-13b-hf](https://huggingface.co/codellama/CodeLlama-13b-hf) | [codellama/CodeLlama-13b-Python-hf](https://huggingface.co/codellama/CodeLlama-13b-Python-hf) | [codellama/CodeLlama-13b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) |
| [codellama/CodeLlama-34b-hf](https://huggingface.co/codellama/CodeLlama-34b-hf) | [codellama/CodeLlama-34b-Python-hf](https://huggingface.co/codellama/CodeLlama-34b-Python-hf) | [codellama/CodeLlama-34b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf) |

The correspondence between the model and capabilities is:

| models | code completion | infilling | instructions / chat | python specialist |
| ---------- | --------------- | ----------------- | ------------------- | ----------------- |
| Base Model | Y | Y(7B,13B), N(34B) | N | N |
| Python | Y | N | N | Y |
| Instruct | Y | Y(7B,13B), N(34B) | Y | N |

## Inference

Based on the above table, this section shows how to utilize CodeLlama's capabilities through examples.

### Completion

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='completion'
))

response = pipe(
'import socket\n\ndef ping_exponential_backoff(host: str):',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

### Infilling

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='infilling'
))

prompt = """
def remove_non_ascii(s: str) -> str:
\"\"\"
<FILL>
\"\"\"
return result
"""
response = pipe(
prompt,
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95,
max_new_tokens=500
)
)
print(response.text)
```

### Chat

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-Instruct-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='chat'
))

response = pipe(
'implement quick sort in C++',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

### Python specialist

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-Python-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='python'
))

response = pipe(
'implement quick sort',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

## Quantization

TBD

## Serving

Prepare a chat template json file, for instance "codellama.json", with the following content:

```json
{
"model_name": "codellama",
"capability": "completion"
}
```

Then launch the service as follows:

```shell
lmdeploy serve api_server meta-llama/CodeLlama-7b-Instruct-hf --chat-template codellama.json
```

After the service is launched successfully, you can access it with the `openai` package:

```python
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[
{"role": "user", "content": "import socket\n\ndef ping_exponential_backoff(host: str):"},
],
temperature=0.1,
top_p=0.95,
max_tokens=500
)
print(response)
```

For detailed information about the api_server, please refer to the [guide](../llm/api_server.md).
14 changes: 2 additions & 12 deletions docs/en/multi_modal/cogvlm.md
@@ -17,17 +17,7 @@ pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https:
pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https://download.pytorch.org/whl/cu121
```

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
# cuda 11.8
# to get the latest version, run: pip index versions lmdeploy
export LMDEPLOY_VERSION=0.5.3
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
# cuda 12.1
pip install lmdeploy
```
Install LMDeploy by following the [installation guide](../installation.md)

### Prepare

@@ -43,7 +33,7 @@ huggingface-cli download lmsys/vicuna-7b-v1.5 special_tokens_map.json tokenizer.

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline
8 changes: 2 additions & 6 deletions docs/en/multi_modal/minicpmv.md
@@ -6,15 +6,11 @@

## Quick Start

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Please install LMDeploy by following the [installation guide](../installation.md)

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline
7 changes: 2 additions & 5 deletions docs/en/multi_modal/xcomposer2d5.md
@@ -8,18 +8,15 @@

### Installation

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.
Please install LMDeploy by following the [installation guide](../installation.md), and install other packages that InternLM-XComposer-2.5 needs

```shell
pip install lmdeploy

# install other packages that InternLM-XComposer-2.5 needs
pip install decord
```

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline