Remove deprecated arguments from API and clarify model_name and chat_template_name (#1931)

* make model_name optional

* remove model_name from turbomind engine

* add chat_template_name in turbomind model config

* tell model_name and chat_template_name apart

* test chat.py

* develop get_tm_model

* remove get_hf_config_content

* remove to_file since it is indicated by out_dir

* minor fix

* add test_async_engine.py

* remove tp from class AsyncEngine

* --chat-template can be a string

* remove deprecates

* fix ut

* fix when test chatting

* fix lmdeploy convert tc

* update CLI

* update

* fix tc

* fix

* fix according to reviewer comments

* update

* update

* update

* update

* update

* rollback user guide

* fix

* fix typo

* rm trust_remote_code from cli

* fix typo

* update

* fix linting

* fix linting

* fix lint

* fix profile_generation

* fix docstring
lvhan028 authored Aug 8, 2024
1 parent 061f997 commit fb6c5a1
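
In short, the commit keeps the model and the chat template apart: the model is the HuggingFace repo id or local path passed to `pipeline` or the CLI, while the chat template is deduced from that path when possible and named explicitly otherwise. A minimal sketch of the resulting usage (the local checkpoint path below is a hypothetical placeholder; the `codellama` template is the one documented later in this commit):

```python
from lmdeploy import pipeline, ChatTemplateConfig

# Recognized model: no model_name argument is needed; the chat template
# is matched from the model path.
pipe = pipeline('internlm/internlm2_5-7b-chat')

# Unrecognized checkpoint (e.g. a fine-tuned model in a local directory):
# name the chat template explicitly. The path here is a hypothetical placeholder.
pipe = pipeline('/path/to/finetuned/CodeLlama-7b-Instruct-hf',
                chat_template_config=ChatTemplateConfig(model_name='codellama'))
```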
Showing 41 changed files with 598 additions and 814 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -50,6 +50,7 @@ dist/
examples/cpp/llama/*.csv
*.npy
*.weight
install/

# LMDeploy
workspace/
2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/llm/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports flash-attention2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
2 changes: 1 addition & 1 deletion README_ja.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMindの主要なアップグレード、包括的なPaged Attention、シーケンス長制限のない高速なアテンションカーネル、2倍速いKV8カーネル、Split-Kデコーディング(Flash Decoding)、およびsm_75のW4A16推論
- \[2023/09\] TurboMindはQwen-14Bをサポート
- \[2023/09\] TurboMindはInternLM-20Bをサポート
- \[2023/09\] TurboMindはCode Llamaのすべての機能をサポート:コード補完、インフィリング、チャット/インストラクト、Pythonスペシャリスト。デプロイメントガイドは[こちら](./docs/en/supported_models/codellama.md)をクリックしてください
- \[2023/09\] TurboMindはCode Llamaのすべての機能をサポート:コード補完、インフィリング、チャット/インストラクト、Pythonスペシャリスト。デプロイメントガイドは[こちら](./docs/en/llm/codellama.md)をクリックしてください
- \[2023/09\] TurboMindはBaichuan2-7Bをサポート
- \[2023/08\] TurboMindはflash-attention2をサポート
- \[2023/08\] TurboMindはQwen-7B、動的NTK-RoPEスケーリング、動的logNスケーリングをサポート
2 changes: 1 addition & 1 deletion README_zh-CN.md
@@ -53,7 +53,7 @@ ______________________________________________________________________
- \[2023/11\] TurboMind 重磅升级。包括:Paged Attention、更快的且不受序列最大长度限制的 attention kernel、2+倍快的 KV8 kernels、Split-K decoding (Flash Decoding) 和 支持 sm_75 架构的 W4A16
- \[2023/09\] TurboMind 支持 Qwen-14B
- \[2023/09\] TurboMind 支持 InternLM-20B 模型
- \[2023/09\] TurboMind 支持 Code Llama 所有功能:代码续写、填空、对话、Python专项。点击[这里](./docs/zh_cn/supported_models/codellama.md)阅读部署方法
- \[2023/09\] TurboMind 支持 Code Llama 所有功能:代码续写、填空、对话、Python专项。点击[这里](./docs/zh_cn/llm/codellama.md)阅读部署方法
- \[2023/09\] TurboMind 支持 Baichuan2-7B
- \[2023/08\] TurboMind 支持 flash-attention2
- \[2023/08\] TurboMind 支持 Qwen-7B,动态NTK-RoPE缩放,动态logN缩放
4 changes: 2 additions & 2 deletions autotest/tools/convert/test_convert.py
@@ -40,7 +40,7 @@ def convert(config, model_case, cuda_prefix):
or 'awq' in model_case.lower()):
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path, '--model-format awq --group-size 128 --trust-remote-code'
dst_path, '--model-format awq --group-size 128'
]),
config,
model_case,
@@ -49,7 +49,7 @@
else:
cmd = get_command_with_extra(' '.join([
'lmdeploy convert', model_name, origin_model_path, '--dst-path',
dst_path, '--trust-remote-code'
dst_path
]),
config,
model_case,
4 changes: 1 addition & 3 deletions benchmark/profile_generation.py
@@ -78,14 +78,12 @@ def warmup(model, concurrency: int, input_ids: List[int], warmup_round: int,
return

print('start to warmup ...')
output_seqlen = gen_config.max_new_tokens

def _infer(model, session_id):
chatbot = model.create_instance()
for _ in range(warmup_round):
for _ in chatbot.stream_infer(session_id,
input_ids=input_ids,
request_output_len=output_seqlen,
sequence_start=True,
sequence_end=True,
ignore_eos=True,
@@ -197,7 +195,7 @@ def profile_throughput(model_path: str, concurrency: int, input_seqlen: int,
f'token_latency percentiles(50%,75%,95%,99%)(s): {percentiles}\n'
f'throughput(output): {out_token_throughput} token/s\n'
f'throughput(total): {total_token_throughput} token/s\n{"-" * 50}')
return tm_model.model_name, \
return model_path, \
[first_token_latency_min, first_token_latency_max,
first_token_latency_ave], \
percentiles, out_token_throughput, total_token_throughput, \
4 changes: 2 additions & 2 deletions docs/en/benchmark/evaluate_with_opencompass.md
@@ -141,8 +141,8 @@ models = [internlm_chat_20b]

**Note**

- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#turbomindengineconfig)
and [EngineGenerationConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#generationconfig)
- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L114)
and [EngineGenerationConfig](https://github.com/InternLM/lmdeploy/blob/061f99736544c8bf574309d47baf574b69ab7eaf/lmdeploy/messages.py#L56)
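
As a hedged sketch of what that looks like in practice, extra arguments simply become fields of the `engine_config` and `gen_config` dicts in the model entry defined earlier in this guide. The `TurboMindModel` import path and the concrete values below are illustrative assumptions, not prescribed settings:

```python
from opencompass.models.turbomind import TurboMindModel  # assumed import path

internlm_chat_20b = dict(
    type=TurboMindModel,
    abbr='internlm-chat-20b-turbomind',
    path='internlm/internlm-chat-20b',
    # Any TurbomindEngineConfig field can go here, e.g. cache_max_entry_count.
    engine_config=dict(session_len=2048,
                       max_batch_size=8,
                       cache_max_entry_count=0.7),
    # Any EngineGenerationConfig field can go here.
    gen_config=dict(top_k=1,
                    temperature=1.0,
                    max_new_tokens=100),
    max_out_len=100,
    max_seq_len=2048,
    batch_size=8,
    run_cfg=dict(num_gpus=1, num_procs=1),
)

models = [internlm_chat_20b]
```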

## Execute Evaluation Task

7 changes: 5 additions & 2 deletions docs/en/get_started.md
@@ -48,8 +48,11 @@ pipe = pipeline('internlm/internlm2_5-7b-chat',
```

```{note}
The parameter "cache_max_entry_count" significantly influences the GPU memory usage. It means the proportion of FREE GPU memory occupied by the K/V cache after the model weights are loaded.
The default value is 0.8. Once allocated, the K/V cache memory is reused repeatedly, which is why it is common to observe that the built pipeline and the api_server mentioned later in the next consumes a substantial amount of GPU memory.
The parameter "cache_max_entry_count" significantly influences the GPU memory usage.
It means the proportion of FREE GPU memory occupied by the K/V cache after the model weights are loaded.
The default value is 0.8. The K/V cache memory is allocated once and reused repeatedly, which is why the built pipeline and the "api_server" mentioned later in this guide consume a substantial amount of GPU memory.
If you encounter an Out-of-Memory (OOM) error, you may need to lower the value of "cache_max_entry_count".
```
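
For example, the ratio can be lowered through `TurbomindEngineConfig`. The snippet below is a minimal sketch reusing the model above; the value 0.5 is an arbitrary illustration:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Give the K/V cache 50% of the free GPU memory instead of the default 80%.
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.5))
response = pipe(['Hi, please introduce yourself'])
print(response)
```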

165 changes: 165 additions & 0 deletions docs/en/llm/codellama.md
@@ -0,0 +1,165 @@
# codellama

## Introduction

[codellama](https://github.com/facebookresearch/codellama) features enhanced coding capabilities. It can generate code and natural language about code, from both code and natural language prompts (e.g., “Write me a function that outputs the fibonacci sequence”). It can also be used for code completion and debugging. It supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, Bash and more.

There are three sizes (7b, 13b, 34b) as well as three flavours (base model, Python fine-tuned, and instruction tuned) released on [HuggingFace](https://huggingface.co/codellama).

| Base Model | Python | Instruct |
| ------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf) | [codellama/CodeLlama-7b-Python-hf](https://huggingface.co/codellama/CodeLlama-7b-Python-hf) | [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) |
| [codellama/CodeLlama-13b-hf](https://huggingface.co/codellama/CodeLlama-13b-hf) | [codellama/CodeLlama-13b-Python-hf](https://huggingface.co/codellama/CodeLlama-13b-Python-hf) | [codellama/CodeLlama-13b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) |
| [codellama/CodeLlama-34b-hf](https://huggingface.co/codellama/CodeLlama-34b-hf) | [codellama/CodeLlama-34b-Python-hf](https://huggingface.co/codellama/CodeLlama-34b-Python-hf) | [codellama/CodeLlama-34b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf) |

The correspondence between the model and capabilities is:

| models | code completion | infilling | instructions / chat | python specialist |
| ---------- | --------------- | ----------------- | ------------------- | ----------------- |
| Base Model | Y | Y(7B,13B), N(34B) | N | N |
| Python | Y | N | N | Y |
| Instruct | Y | Y(7B,13B), N(34B) | Y | N |

## Inference

Based on the above table, this section shows how to utilize CodeLlama's capabilities through examples.

### Completion

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='completion'
))

response = pipe(
'import socket\n\ndef ping_exponential_backoff(host: str):',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

### Infilling

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='infilling'
))

prompt = """
def remove_non_ascii(s: str) -> str:
\"\"\"
<FILL>
\"\"\"
return result
"""
response = pipe(
prompt,
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95,
max_new_tokens=500
)
)
print(response.text)
```

### Chat

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-Instruct-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='chat'
))

response = pipe(
'implement quick sort in C++',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

### Python specialist

```python
from lmdeploy import pipeline, GenerationConfig, ChatTemplateConfig

pipe = pipeline('meta-llama/CodeLlama-7b-Python-hf',
chat_template_config=ChatTemplateConfig(
model_name='codellama',
capability='python'
))

response = pipe(
'implement quick sort',
gen_config=GenerationConfig(
top_k=10,
temperature=0.1,
top_p=0.95
)
)
print(response.text)
```

## Quantization

TBD

## Serving

Prepare a chat template json file, for instance "codellama.json", with the following content:

```json
{
"model_name": "codellama",
"capability": "completion"
}
```

Then launch the service as follows:

```shell
lmdeploy serve api_server meta-llama/CodeLlama-7b-Instruct-hf --chat-template codellama.json
```

After the service is launched successfully, you can access it with the `openai` package:

```python
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[
{"role": "user", "content": "import socket\n\ndef ping_exponential_backoff(host: str):"},
],
temperature=0.1,
top_p=0.95,
max_tokens=500
)
print(response)
```

For detailed information about the api_server, please refer to the [guide](../llm/api_server.md).
14 changes: 2 additions & 12 deletions docs/en/multi_modal/cogvlm.md
@@ -17,17 +17,7 @@ pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https:
pip install torch==2.2.2 torchvision==0.17.2 xformers==0.0.26 --index-url https://download.pytorch.org/whl/cu121
```

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
# cuda 11.8
# to get the latest version, run: pip index versions lmdeploy
export LMDEPLOY_VERSION=0.5.3
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
# cuda 12.1
pip install lmdeploy
```
Install LMDeploy by following the [installation guide](../installation.md)

### Prepare

@@ -43,7 +33,7 @@ huggingface-cli download lmsys/vicuna-7b-v1.5 special_tokens_map.json tokenizer.

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline
8 changes: 2 additions & 6 deletions docs/en/multi_modal/minicpmv.md
@@ -6,15 +6,11 @@

## Quick Start

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Please install LMDeploy by following the [installation guide](../installation.md)

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline
7 changes: 2 additions & 5 deletions docs/en/multi_modal/xcomposer2d5.md
@@ -8,18 +8,15 @@

### Installation

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.
Please install LMDeploy by following the [installation guide](../installation.md), and install other packages that InternLM-XComposer-2.5 needs

```shell
pip install lmdeploy

# install other packages that InternLM-XComposer-2.5 needs
pip install decord
```

### Offline inference pipeline

The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#vlm-offline-inference-pipeline)
The following sample code shows the basic usage of VLM pipeline. For more examples, please refer to [VLM Offline Inference Pipeline](./vl_pipeline.md)

```python
from lmdeploy import pipeline