Releases: vllm-project/vllm
v0.2.1
Major Changes
- PagedAttention V2 kernel: Up to 20% end-to-end latency reduction
- Support log probabilities for prompt tokens
- AWQ support for Mistral 7B
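The snippet below is a minimal, hedged sketch of how the new prompt logprobs and AWQ support are typically used from the offline Python API; the model name is only an example of an AWQ checkpoint, and argument names should be checked against the docs for your installed version.

```python
from vllm import LLM, SamplingParams

# Illustrative AWQ checkpoint of Mistral 7B; any AWQ-quantized model should work.
llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq")

# prompt_logprobs requests log probabilities for the prompt tokens,
# logprobs for the generated tokens.
params = SamplingParams(max_tokens=32, logprobs=1, prompt_logprobs=1)

for out in llm.generate(["The capital of France is"], params):
    print(out.prompt_logprobs)    # per-prompt-token logprobs (None for the first token)
    print(out.outputs[0].text)    # generated continuation
```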
What's Changed
- fixing typo in `tiiuae/falcon-rw-7b` model name by @0ssamaak0 in #1226
- Added `dtype` arg to benchmarks by @kg6-sleipnir in #1228
- fix vulnerable memory modification to gpu shared memory by @soundOfDestiny in #1241
- support sharding llama2-70b on more than 8 GPUs by @zhuohan123 in #1209
- [Minor] Fix type annotations by @WoosukKwon in #1238
- TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic by @zhuohan123 in #1181
- add support for tokenizer revision by @cassanof in #1163
- Use monotonic time where appropriate by @Yard1 in #1249
- API server support ipv4 / ipv6 dualstack by @yunfeng-scale in #1288
- Move bfloat16 check to worker by @Yard1 in #1259
- [FIX] Explain why the finished_reason of ignored sequences are length by @zhuohan123 in #1289
- Update README.md by @zhuohan123 in #1292
- [Minor] Fix comment in mistral.py by @zhuohan123 in #1303
- lock torch version to 2.0.1 when build for #1283 by @yanxiyue in #1290
- minor update by @WrRan in #1311
- change the timing of sorting logits by @yhlskt23 in #1309
- workaround of AWQ for Turing GPUs by @twaka in #1252
- Fix overflow in awq kernel by @chu-tianxiang in #1295
- Update model_loader.py by @AmaleshV in #1278
- Add blacklist for model checkpoint by @WoosukKwon in #1325
- Update README.md Aquila2. by @ftgreat in #1331
- Improve detokenization performance by @Yard1 in #1338
- Bump up transformers version & Remove MistralConfig by @WoosukKwon in #1254
- Fix the issue for AquilaChat2-* models by @lu-wang-dl in #1339
- Fix error message on `TORCH_CUDA_ARCH_LIST` by @WoosukKwon in #1239
- Minor fix on AWQ kernel launch by @WoosukKwon in #1356
- Implement PagedAttention V2 by @WoosukKwon in #1348
- Implement prompt logprobs & Batched topk for computing logprobs by @zhuohan123 in #1328
- Fix PyTorch version to 2.0.1 in workflow by @WoosukKwon in #1377
- Fix PyTorch index URL in workflow by @WoosukKwon in #1378
- Fix sampler test by @WoosukKwon in #1379
- Bump up the version to v0.2.1 by @zhuohan123 in #1355
New Contributors
- @0ssamaak0 made their first contribution in #1226
- @kg6-sleipnir made their first contribution in #1228
- @soundOfDestiny made their first contribution in #1241
- @cassanof made their first contribution in #1163
- @yunfeng-scale made their first contribution in #1288
- @yanxiyue made their first contribution in #1290
- @yhlskt23 made their first contribution in #1309
- @chu-tianxiang made their first contribution in #1295
- @AmaleshV made their first contribution in #1278
- @lu-wang-dl made their first contribution in #1339
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Major changes
- Up to 60% performance improvement by optimizing de-tokenization and the sampler
- Initial support for AWQ quantization (performance not yet optimized)
- Support for RoPE scaling and LongChat (see the loading sketch below)
- Support for Mistral-7B
- Many bug fixes
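The sketch below loads a long-context LongChat checkpoint, relying on RoPE scaling being read from the model config, and notes how initial AWQ support is enabled; the model name and context length are assumptions for illustration, not official usage.

```python
from vllm import LLM, SamplingParams

# RoPE scaling is read from the model config, so long-context checkpoints
# such as LongChat load without extra flags (model name is illustrative).
llm = LLM(model="lmsys/longchat-7b-16k", max_model_len=16384)

# Initial AWQ support: point at an AWQ checkpoint and pass quantization="awq".
# llm = LLM(model="<some-awq-quantized-checkpoint>", quantization="awq")

out = llm.generate("Summarize the history of the printing press.", SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```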
What's Changed
- add option to shorten prompt print in log by @leiwen83 in #991
- Make `max_model_len` configurable by @Yard1 in #972
- Fix typo in README.md by @eltociear in #1033
- Use TGI-like incremental detokenization by @Yard1 in #984
- Add Model Revision Support in #1014
- [FIX] Minor bug fixes by @zhuohan123 in #1035
- Announce paper release by @WoosukKwon in #1036
- Fix detokenization leaving special tokens by @Yard1 in #1044
- Add pandas to requirements.txt by @WoosukKwon in #1047
- OpenAI-Server: Only fail if logit_bias has actual values by @LLukas22 in #1045
- Fix warning message on LLaMA FastTokenizer by @WoosukKwon in #1037
- Abort when coroutine is cancelled by @rucyang in #1020
- Implement AWQ quantization support for LLaMA by @WoosukKwon in #1032
- Remove AsyncLLMEngine busy loop, shield background task by @Yard1 in #1059
- Fix hanging when prompt exceeds limit by @chenxu2048 in #1029
- [FIX] Don't initialize parameter by default by @zhuohan123 in #1067
- added support for quantize on LLM module by @orellavie1212 in #1080
- align llm_engine and async_engine step method. by @esmeetu in #1081
- Fix get_max_num_running_seqs for waiting and swapped seq groups by @zhuohan123 in #1068
- Add safetensors support for quantized models by @WoosukKwon in #1073
- Add minimum capability requirement for AWQ by @WoosukKwon in #1064
- [Community] Add vLLM Discord server by @zhuohan123 in #1086
- Add pyarrow to dependencies & Print warning on Ray import error by @WoosukKwon in #1094
- Add gpu_memory_utilization and swap_space to LLM by @WoosukKwon in #1090
- Add documentation to Triton server tutorial by @tanmayv25 in #983
- rope_theta and max_position_embeddings from config by @Yard1 in #1096
- Replace torch.cuda.DtypeTensor with torch.tensor by @WoosukKwon in #1123
- Add float16 and float32 to dtype choices by @WoosukKwon in #1115
- clean api code, remove redundant background task. by @esmeetu in #1102
- feat: support stop_token_ids parameter. by @gesanqiu in #1097
- Use `--ipc=host` in `docker run` for distributed inference by @WoosukKwon in #1125
- Docs: Fix broken link to openai example by @nkpz in #1145
- Announce the First vLLM Meetup by @WoosukKwon in #1148
- [Sampler] Vectorized sampling (simplified) by @zhuohan123 in #1048
- [FIX] Simplify sampler logic by @zhuohan123 in #1156
- Fix config for Falcon by @WoosukKwon in #1164
- Align `max_tokens` behavior with openai by @HermitSun in #852
- [Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs by @WoosukKwon in #1074
- Add comments on RoPE initialization by @WoosukKwon in #1176
- Allocate more shared memory to attention kernel by @Yard1 in #1154
- Support Longchat by @LiuXiaoxuanPKU in #555
- fix typo (?) by @WrRan in #1184
- fix qwen-14b model by @Sanster in #1173
- Automatically set `max_num_batched_tokens` by @WoosukKwon in #1198
- Use standard extras for `uvicorn` by @danilopeixoto in #1166
- Keep special sampling params by @blahblahasdf in #1186
- qwen add rope_scaling by @Sanster in #1210
- [Mistral] Mistral-7B-v0.1 support by @Bam4d in #1196
- Fix Mistral model by @WoosukKwon in #1220
- [Fix] Remove false assertion by @WoosukKwon in #1222
- Add Mistral to supported model list by @WoosukKwon in #1221
- Fix OOM in attention kernel test by @WoosukKwon in #1223
- Provide default max model length by @WoosukKwon in #1224
- Bump up the version to v0.2.0 by @WoosukKwon in #1212
New Contributors
- @leiwen83 made their first contribution in #991
- @LLukas22 made their first contribution in #1045
- @rucyang made their first contribution in #1020
- @chenxu2048 made their first contribution in #1029
- @orellavie1212 made their first contribution in #1080
- @tanmayv25 made their first contribution in #983
- @nkpz made their first contribution in #1145
- @WrRan made their first contribution in #1184
- @danilopeixoto made their first contribution in #1166
- @blahblahasdf made their first contribution in #1186
- @Bam4d made their first contribution in #1196
Full Changelog: v0.1.7...v0.2.0
v0.1.7
A minor release to fix bugs in ALiBi, Falcon-40B, and Code Llama.
What's Changed
- fix "tansformers_module" ModuleNotFoundError when load model with
trust_remote_code=True
by @Jingru in #871 - Fix wrong dtype in PagedAttentionWithALiBi bias by @Yard1 in #996
- fix: CUDA error when inferencing with Falcon-40B base model by @kyujin-cho in #992
- [Docs] Update installation page by @WoosukKwon in #1005
- Update setup.py by @WoosukKwon in #1006
- Use FP32 in RoPE initialization by @WoosukKwon in #1004
- Bump up the version to v0.1.7 by @WoosukKwon in #1013
New Contributors
- @Jingru made their first contribution in #871
- @kyujin-cho made their first contribution in #992
Full Changelog: v0.1.6...v0.1.7
v0.1.6
Note: This is an emergency release that reverts a breaking API change which would have broken many existing codebases using AsyncLLMServer.
What's Changed
- faster startup of vLLM by @ri938 in #982
- Start background task in `AsyncLLMEngine.generate` by @Yard1 in #988
- Bump up the version to v0.1.6 by @zhuohan123 in #989
Full Changelog: v0.1.5...v0.1.6
v0.1.5
Major Changes
- Align beam search with `hf_model.generate`.
- Stabilize AsyncLLMEngine with a background engine loop (see the sketch below).
- Add support for CodeLLaMA.
- Add many model correctness tests.
- Many other correctness fixes.
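For context, here is a minimal sketch of the AsyncLLMEngine usage that the background-loop work stabilizes; the model name and request id are illustrative, and the exact signatures may differ slightly across versions.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main():
    # The background engine loop is created and supervised by the engine itself.
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=32)

    # generate() is an async generator that yields partial RequestOutputs.
    final = None
    async for output in engine.generate("Hello, my name is", params, request_id="req-0"):
        final = output
    print(final.outputs[0].text)


asyncio.run(main())
```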
What's Changed
- Add support for CodeLlama by @Yard1 in #854
- [Fix] Fix a condition for ignored sequences by @zhuohan123 in #867
- use flash-attn via xformers by @tmm1 in #877
- Enable request body OpenAPI spec for OpenAI endpoints by @Peilun-Li in #865
- Accelerate LLaMA model loading by @JF-D in #234
- Improve _prune_hidden_states micro-benchmark by @tmm1 in #707
- fix: bug fix when penalties are negative by @pfldy2850 in #913
- [Docs] Minor fixes in supported models by @WoosukKwon in #920
- Fix README.md Link by @zhuohan123 in #927
- Add tests for models by @WoosukKwon in #922
- Avoid compiling kernels for double data type by @WoosukKwon in #933
- [BugFix] Fix NaN errors in paged attention kernel by @WoosukKwon in #936
- Refactor AsyncLLMEngine by @Yard1 in #880
- Only emit warning about internal tokenizer if it isn't being used by @nelson-liu in #939
- Align vLLM's beam search implementation with HF generate by @zhuohan123 in #857
- Initialize AsyncLLMEngine bg loop correctly by @Yard1 in #943
- Fix vLLM cannot launch by @HermitSun in #948
- Clean up kernel unit tests by @WoosukKwon in #938
- Use queue for finished requests by @Yard1 in #957
- [BugFix] Implement RoPE for GPT-J by @WoosukKwon in #941
- Set torch default dtype in a context manager by @Yard1 in #971
- Bump up transformers version in requirements.txt by @WoosukKwon in #976
- Make `AsyncLLMEngine` more robust & fix batched abort by @Yard1 in #969
- Enable safetensors loading for all models by @zhuohan123 in #974
- [FIX] Fix Alibi implementation in PagedAttention kernel by @zhuohan123 in #945
- Bump up the version to v0.1.5 by @WoosukKwon in #944
New Contributors
- @tmm1 made their first contribution in #877
- @Peilun-Li made their first contribution in #865
- @JF-D made their first contribution in #234
- @pfldy2850 made their first contribution in #913
- @nelson-liu made their first contribution in #939
Full Changelog: v0.1.4...v0.1.5
vLLM v0.1.4
Major changes
- From now on, vLLM is published with pre-built CUDA binaries, so users no longer need to compile vLLM's CUDA kernels on their machines.
- New models: InternLM, Qwen, Aquila (see the loading sketch below).
- Optimized CUDA kernels for paged attention and GELU.
- Many bug fixes.
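With pre-built binaries, `pip install vllm` is typically all that is needed; loading the newly supported models usually also requires trusting their remote code on the Hugging Face Hub, as in this hedged sketch (the model name is an example):

```python
from vllm import LLM

# Qwen, InternLM, and Aquila ship custom modeling/tokenizer code on the Hub,
# so trust_remote_code=True is generally required to load them.
llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
print(llm.generate("Write a haiku about GPUs.")[0].outputs[0].text)
```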
What's Changed
- Fix gibberish outputs of GPT-BigCode-based models by @HermitSun in #676
- [OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel by @naed90 in #420
- add QWen-7b support by @Sanster in #685
- add internlm model by @gqjia in #528
- Check the max prompt length for the OpenAI completions API by @nicobasile in #472
- [Fix] unwanted bias in InternLM Model by @wangruohui in #740
- Supports tokens and arrays of tokens as inputs to the OpenAI completion API by @wanmok in #715
- Fix baichuan doc style by @UranusSeven in #748
- Fix typo in tokenizer.py by @eltociear in #750
- Align with huggingface Top K sampling by @Abraham-Xu in #753
- explicitly del state by @cauyxy in #784
- Fix typo in sampling_params.py by @wangcx18 in #788
- [Feature | CI] Added a github action to build wheels by @Danielkinz in #746
- set default compute capability according to cuda version by @zxdvd in #773
- Fix mqa is false case in gpt_bigcode by @zhaoyang-star in #806
- Add support for aquila by @shunxing1234 in #663
- Update Supported Model List by @zhuohan123 in #825
- Fix 'GPTBigCodeForCausalLM' object has no attribute 'tensor_model_parallel_world_size' by @HermitSun in #827
- Add compute capability 8.9 to default targets by @WoosukKwon in #829
- Implement approximate GELU kernels by @WoosukKwon in #828
- Fix typo of Aquila in README.md by @ftgreat in #836
- Fix for breaking changes in xformers 0.0.21 by @WoosukKwon in #834
- Clean up code by @wenjun93 in #844
- Set replacement=True in torch.multinomial by @WoosukKwon in #858
- Bump up the version to v0.1.4 by @WoosukKwon in #846
New Contributors
- @naed90 made their first contribution in #420
- @gqjia made their first contribution in #528
- @nicobasile made their first contribution in #472
- @wanmok made their first contribution in #715
- @UranusSeven made their first contribution in #748
- @eltociear made their first contribution in #750
- @Abraham-Xu made their first contribution in #753
- @cauyxy made their first contribution in #784
- @wangcx18 made their first contribution in #788
- @Danielkinz made their first contribution in #746
- @zhaoyang-star made their first contribution in #806
- @shunxing1234 made their first contribution in #663
- @ftgreat made their first contribution in #836
- @wenjun93 made their first contribution in #844
Full Changelog: v0.1.3...v0.1.4
vLLM v0.1.3
What's Changed
Major changes
- More model support: LLaMA 2, Falcon, GPT-J, Baichuan, etc. (see the loading sketch below).
- Efficient support for MQA and GQA.
- Changes in the scheduling algorithm: vLLM now uses TGI-style continuous batching.
- And many bug fixes.
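A hedged sketch of loading one of the newly supported multi-query-attention models and sharding it with tensor parallelism; the model name, GPU count, and the trust_remote_code flag are assumptions that depend on the checkpoint you use.

```python
from vllm import LLM

# Falcon-40B uses multi-query attention; tensor_parallel_size shards the model
# across GPUs. trust_remote_code may be needed depending on the checkpoint.
llm = LLM(model="tiiuae/falcon-40b", tensor_parallel_size=4, trust_remote_code=True)
print(llm.generate("The best thing about open-source LLMs is")[0].outputs[0].text)
```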
All changes
- fix: only response [DONE] once when streaming response. by @gesanqiu in #378
- [Fix] Change /generate response-type to json for non-streaming by @nicolasf in #374
- Add trust-remote-code flag to handle remote tokenizers by @codethazine in #364
- avoid python list copy in sequence initialization by @LiuXiaoxuanPKU in #401
- [Fix] Sort LLM outputs by request ID before return by @WoosukKwon in #402
- Add trust_remote_code arg to get_config by @WoosukKwon in #405
- Don't try to load training_args.bin by @lpfhs in #373
- [Model] Add support for GPT-J by @AndreSlavescu in #226
- fix: freeze pydantic to v1 by @kemingy in #429
- Fix handling of special tokens in decoding. by @xcnick in #418
- add vocab padding for LLama(Support WizardLM) by @esmeetu in #411
- Fix the `KeyError` when loading bloom-based models by @HermitSun in #441
- Optimize MQA Kernel by @zhuohan123 in #452
- Offload port selection to OS by @zhangir-azerbayev in #467
- [Doc] Add doc for running vLLM on the cloud by @Michaelvll in #426
- [Fix] Fix the condition of max_seq_len by @zhuohan123 in #477
- Add support for baichuan by @codethazine in #365
- fix max seq len by @LiuXiaoxuanPKU in #489
- Fixed old name reference for max_seq_len by @MoeedDar in #498
- hotfix attn alibi wo head mapping by @Oliver-ss in #496
- fix(ray_utils): ignore re-init error by @mspronesti in #465
- Support `trust_remote_code` in benchmark by @wangruohui in #518
- fix: enable trust-remote-code in api server & benchmark. by @gesanqiu in #509
- Ray placement group support by @Yard1 in #397
- Fix bad assert in initialize_cluster if PG already exists by @Yard1 in #526
- Add support for LLaMA-2 by @zhuohan123 in #505
- GPTJConfig has no attribute rotary. by @leegohi04517 in #532
- [Fix] Fix GPTBigcoder for distributed execution by @zhuohan123 in #503
- Fix paged attention testing. by @shanshanpt in #495
- fixed tensor parallel is not defined by @MoeedDar in #564
- Add Baichuan-7B to README by @zhuohan123 in #494
- [Fix] Add chat completion Example and simplify dependencies by @zhuohan123 in #576
- [Fix] Add model sequence length into model config by @zhuohan123 in #575
- [Fix] fix import error of RayWorker (#604) by @zxdvd in #605
- fix ModuleNotFoundError by @mklf in #599
- [Doc] Change old max_seq_len to max_model_len in docs by @SiriusNEO in #622
- fix baichuan-7b tp by @Sanster in #598
- [Model] support baichuan-13b based on baichuan-7b by @Oliver-ss in #643
- Fix log message in scheduler by @LiuXiaoxuanPKU in #652
- Add Falcon support (new) by @zhuohan123 in #592
- [BUG FIX] upgrade fschat version to 0.2.23 by @YHPeter in #650
- Refactor scheduler by @WoosukKwon in #658
- [Doc] Add Baichuan 13B to supported models by @zhuohan123 in #656
- Bump up version to 0.1.3 by @zhuohan123 in #657
New Contributors
- @nicolasf made their first contribution in #374
- @codethazine made their first contribution in #364
- @lpfhs made their first contribution in #373
- @AndreSlavescu made their first contribution in #226
- @kemingy made their first contribution in #429
- @xcnick made their first contribution in #418
- @esmeetu made their first contribution in #411
- @HermitSun made their first contribution in #441
- @zhangir-azerbayev made their first contribution in #467
- @MoeedDar made their first contribution in #498
- @Oliver-ss made their first contribution in #496
- @mspronesti made their first contribution in #465
- @wangruohui made their first contribution in #518
- @Yard1 made their first contribution in #397
- @leegohi04517 made their first contribution in #532
- @shanshanpt made their first contribution in #495
- @zxdvd made their first contribution in #605
- @mklf made their first contribution in #599
- @SiriusNEO made their first contribution in #622
- @Sanster made their first contribution in #598
- @YHPeter made their first contribution in #650
Full Changelog: v0.1.2...v0.1.3
vLLM v0.1.2
What's Changed
- Initial support for GPTBigCode
- Support for MPT and BLOOM
- Custom tokenizer
- ChatCompletion endpoint in the OpenAI-compatible demo server (see the example request below)
- Code formatting
- Various bug fixes and improvements
- Documentation improvement
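A rough example of calling the new ChatCompletion endpoint with the pre-1.0 `openai` client, assuming the demo server is running locally on port 8000 and was started with the model named below:

```python
import openai  # openai<1.0 style client

openai.api_key = "EMPTY"                      # placeholder; the demo server typically ignores it
openai.api_base = "http://localhost:8000/v1"  # address of the vLLM OpenAI demo server

response = openai.ChatCompletion.create(
    model="facebook/opt-125m",  # whichever model the server was launched with
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["choices"][0]["message"]["content"])
```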
Contributors
Thanks to the following amazing people who contributed to this release:
@michaelfeil @WoosukKwon @metacryptom @merrymercy @BasicCoder @zhuohan123 @twaka @comaniac @neubig @JRC1995 @LiuXiaoxuanPKU @bm777 @Michaelvll @gesanqiu @ironpinguin @coolcloudcol @akxxsb
Full Changelog: v0.1.1...v0.1.2
vLLM v0.1.1 (Patch)
What's Changed
- Fix Ray node resources error by @zhuohan123 in #193
- [Bugfix] Fix a bug in RequestOutput.finished by @WoosukKwon in #202
- [Fix] Better error message when there is OOM during cache initialization by @zhuohan123 in #203
- Bump up version to 0.1.1 by @zhuohan123 in #204
Full Changelog: v0.1.0...v0.1.1
vLLM v0.1.0
The first official release of vLLM!
See our README for details.
Thanks
Thanks @WoosukKwon @zhuohan123 @suquark for their contributions.