Releases: huggingface/trl
v0.6.0
DDPO for diffusion models
We are excited to welcome DDPO, the first RLHF + diffusion models algorithm in TRL, to refine the generations of diffusion models with a reward signal.
Read more about it directly in the docs; a minimal usage sketch follows the PR reference below.
(Image comparison: generations before vs. after DDPO finetuning.)
- Denoising Diffusion Policy Optimization by @metric-space in #508
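Below is a minimal sketch of how a DDPO fine-tuning run can be wired together, assuming `diffusers` is installed and a GPU is available. The Stable Diffusion checkpoint, the fixed prompt, and the constant reward are placeholders; for a real run you would plug in your own prompt distribution and a real scorer such as the aesthetic scorer from the docs.

```python
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline

def prompt_fn():
    # Return one prompt plus a metadata dict per sample; here a fixed prompt.
    return "a photo of a cat", {}

def reward_fn(images, prompts, metadata):
    # Score the generated images; a constant reward keeps the sketch self-contained.
    return torch.ones(len(images)), {}

pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")
config = DDPOConfig(num_epochs=1, sample_batch_size=2, train_batch_size=1)

trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()
```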
Bug fixes and other enhancements
The release also comes with multiple bug fixes reported and/or contributed by the community; check out the commit history below.
What's Changed
- Release: v0.5.0 by @younesbelkada in #607
- Set dev version by @younesbelkada in #608
- [`Modeling`] Add token support for `hf_hub_download` by @younesbelkada in #604
- Add docs explaining logged metrics by @vwxyzjn in #616
- [DPO] stack-llama-2 training scripts by @kashif in #611
- Use log_with argument in SFT example by @hitorilabs in #620
- Allow already tokenized sequences for `response_template` in `DataCollatorForCompletionOnlyLM` by @ivsanro1 in #622
- Improve docs by @lvwerra in #612
- Move repo by @lvwerra in #628
- Add score scaling/normalization/clipping by @zfang in #560
- Disable dropout in DPO Training by @NouamaneTazi in #639
- Add checks on backward batch size by @vwxyzjn in #651
- Resolve various typos throughout the docs by @tomaarsen in #654
- Update README.md by @Santosh-Gupta in #657
- Allow for ref_model=None in DPOTrainer by @vincentmin in #640
- Add more args to SFT example by @photomz in #642
- Handle potentially long sequences with DataCollatorForCompletionOnlyLM by @tannonk in #644
- [`sft_llama2`] Add check of arguments by @younesbelkada in #660
- Fix DPO blogpost thumbnail by @lvwerra in #673
- propagating eval_batch_size to TrainingArguments by @rahuljha in #675
- [`CI`] Fix unmutable `TrainingArguments` issue by @younesbelkada in #676
- Update sft_llama2.py by @msaad02 in #678
- fix PeftConfig loading from a remote repo. by @w32zhong in #649
- Simplify immutable TrainingArgs fix using `dataclasses.replace` by @tomaarsen in #682
New Contributors
- @hitorilabs made their first contribution in #620
- @ivsanro1 made their first contribution in #622
- @zfang made their first contribution in #560
- @NouamaneTazi made their first contribution in #639
- @Santosh-Gupta made their first contribution in #657
- @vincentmin made their first contribution in #640
- @photomz made their first contribution in #642
- @tannonk made their first contribution in #644
- @rahuljha made their first contribution in #675
- @msaad02 made their first contribution in #678
- @w32zhong made their first contribution in #649
Full Changelog: v0.5.0...v0.6.0
v0.5.0
v0.5.0: `DPOTrainer` and multiple bug fixes on `PPOTrainer` and `SFTTrainer`
This release includes multiple important bug fixes (`SFTTrainer`, `PPOTrainer`) and extends the current `DataCollatorForCompletionOnlyLM` to support chat-like training.
DPO Trainer
The DPO algorithm (Direct Preference Optimization), introduced by Rafailov et al. in this paper, provides a way of optimizing a model on preference data without having to rely on a separately trained reward model. The `DPOTrainer` is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors! A minimal usage sketch follows the PR list below.
- DPO Trainer by @kashif in #416
- [DPO] make sure all the concated batches are on same device by @kashif in #528
- [DPO] remove response/pairs from the DPO side by @kashif in #540
- [DPO] remove unnecessary batch size arg to Collator by @kashif in #554
- [`DPO`] Resolve logging for DPOTrainer by @tomaarsen in #570
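As a quick illustration, here is a minimal `DPOTrainer` sketch on a toy preference dataset. The GPT-2 checkpoint and the single hand-written preference pair are placeholders; a real run would use a proper preference dataset with `prompt`, `chosen` and `rejected` columns as described in the docs.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data: each row has a prompt, a preferred and a rejected completion.
train_dataset = Dataset.from_dict({
    "prompt": ["How are you?"],
    "chosen": [" I am fine, thank you."],
    "rejected": [" None of your business."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the implicit KL penalty towards the reference model
    args=TrainingArguments(
        output_dir="dpo-sketch",
        per_device_train_batch_size=1,
        remove_unused_columns=False,
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_length=512,
)
trainer.train()
```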
What's Changed
- Reward trainer multi-gpu eval bug by @rlindskog in #513
- Use local process index for `_get_current_device()` by @lewtun in #515
Extending the `DataCollatorForCompletionOnlyLM`
You can now mask out the user prompts in the `DataCollatorForCompletionOnlyLM` data collator and train only on chat completions. Check out the PR below or the appropriate section of the documentation to learn more about it; a short usage sketch follows the PR.
- Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456
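Here is a minimal sketch, assuming a chat format with `### Human:` / `### Assistant:` markers. The markers, the GPT-2 tokenizer and the toy example are placeholders; use whatever templates your dataset actually contains.

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Human:",
    response_template="### Assistant:",
    tokenizer=tokenizer,
    mlm=False,
)

example = "### Human: What is TRL?\n### Assistant: A library to train models with RL."
batch = collator([tokenizer(example)])
# Tokens belonging to the human turn are labelled with -100, so the loss is
# only computed on the assistant completion.
print(batch["labels"])
```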
Important bug fixes
Multiple bugs on the supported trainers have been raised by the community and fixed in the PRs below
- [`core`] Fix offline case by @younesbelkada in #538
- Relax reward trainer constraint by @younesbelkada in #539
- ADD: num_proc to SFTTrainer by @BramVanroy in #547
- [`SFTTrainer`] Add warning for wrong padding_side by @younesbelkada in #550
- Minor typo and whitespace fixes by @tmm1 in #559
- [`SFTTrainer`] Add epochs and num steps on CLI by @younesbelkada in #562
- Add `DataCollatorForCompletionOnlyLM` in the docs by @younesbelkada in #565
- Add comment to explain how the sentiment pipeline is used to run the … by @jvhoffbauer in #555
- Fix model output dim in reward trainer example by @liutianlin0121 in #566
- Computes the KL penalty using the entire distribution by @edbeeching in #541
- Add missing max_seq_length arg to example sft_trainer.py by @SharkWipf in #585
- [`PPO`] fix corner cases with PPO batch size and forward_batch_size by @younesbelkada in #563
- Update the example sft_trainer.py by @ZeusFSX in #587
- docs: Replace SFTTrainer with RewardTrainer in comment by @tomaarsen in #589
- Fix comparison in DataCollatorForCompletionOnlyLM (#588) by @RyujiTamaki in #594
- refactor grad accum by @vwxyzjn in #546
Big refactor of examples and documentation
The examples and documentation have been refactored; check the PRs below for more details.
- [`examples`] Big refactor of examples and documentation by @younesbelkada in #509
- [`examples`] Fix sentiment nit by @younesbelkada in #517
- [`examples`] make the sft script more modulable by @younesbelkada in #543
- Add `use_auth_token` arg to sft_trainer example by @corey-lambda in #544
New Contributors
- @rlindskog made their first contribution in #513
- @corey-lambda made their first contribution in #544
- @tmm1 made their first contribution in #559
- @jvhoffbauer made their first contribution in #555
- @liutianlin0121 made their first contribution in #566
- @SharkWipf made their first contribution in #585
- @ZeusFSX made their first contribution in #587
- @gaetanlop made their first contribution in #456
- @RyujiTamaki made their first contribution in #594
Full Changelog: v0.4.7...v0.5.0
v0.4.7
Patch release: `SFTTrainer` and `PPOTrainer` bug fixes
What's Changed
- Make shuffle optional by @lopez-hector in #457
- Pre-commit by @vwxyzjn in #448
- Debug the tortuous logic in `_prepare_dataset` function by @BeibinLi in #464
- [`CI`] Fix CI RM by @younesbelkada in #468
- Update sft_trainer.py by @JulesGM in #474
- Refactor README by @younesbelkada in #460
- add ratio threshold to avoid spikes by @lvwerra in #488
- fix typo in reward_modeling.py by @csyourui in #494
- FIX: contributing guidelines command by @BramVanroy in #493
- Remove padding in batched generation. by @lvwerra in #487
- Adds some options to stabilize the KL penalty by @edbeeching in #486
- correctly implement gradient checkpointing to multi-adapter example by @mnoukhov in #479
- Disable mlm by default in DataCollatorForCompletionOnlyLM, add ignore_index and docstring by @BramVanroy in #476
- Use `float` instead of `double` to avoid issues with MPS device by @younesbelkada in #499
- [`PPOTrainer`] Add prefix tuning support by @younesbelkada in #501
- [`PPOTrainer`] Add prompt tuning support on TRL by @younesbelkada in #500
- [`SFTTrainer`] Fix the sequence length check of `SFTTrainer` by @younesbelkada in #512
New Contributors
- @lopez-hector made their first contribution in #457
- @BeibinLi made their first contribution in #464
- @csyourui made their first contribution in #494
- @BramVanroy made their first contribution in #493
Full Changelog: v0.4.6...v0.4.7
v0.4.6
Patch release
Patch release to fix a bug on Google Colab with `PPOTrainer` & `PPOConfig` + wandb
What's Changed
- Fix google colab issue by @younesbelkada in #459
Full Changelog: v0.4.5...v0.4.6
v0.4.5
Patch release 1 - `SFTTrainer` enhancements and fixes
This patch release adds multiple fixes and enhancements for the `SFTTrainer`. Another patch release is coming to fix an issue with `PPOTrainer` on Google Colab combined with wandb logging.
What's Changed
- Add slurm utility by @vwxyzjn in #412
- Enable autotag feature w/ wandb by @vwxyzjn in #411
- [doc build] Use secrets by @mishig25 in #420
- Update test_reward_trainer.py by @younesbelkada in #421
- best-of-n sampler class by @metric-space in #375
- handle the offline case by @younesbelkada in #431
- Fix correct gradient accumulation by @younesbelkada in #407
- Drop support for Python 3.7 by @younesbelkada in #441
- [`SFTTrainer`] Relax dataset constraints by @younesbelkada in #442
- [`SFTTrainer`] Fix non packed dataset by @younesbelkada in #444
- [`core`] Add stale bot by @younesbelkada in #447
- [`SFTTrainer`] Introducing `DataCollatorForCompletionOnlyLM` by @younesbelkada in #445
- [`ConstantLengthDataset`] Fix packed dataset issue by @younesbelkada in #452
- Update accelerate arg passthrourgh for tensorboard logging to reflect logging_dir deprecation. by @jganitkevitch in #437
- Multi adapter RL (MARL) - a single model for RM & Value Head by @younesbelkada in #373
New Contributors
- @jganitkevitch made their first contribution in #437
Full Changelog: v0.4.4...v0.4.5
v0.4.4
v0.4.3
0.4.3 Patch release
Patch release - pin accelerate version
- Skip flaky test until next transformers release by @younesbelkada in #410
- Pin accelerate version by @younesbelkada in #414
Full Changelog: v0.4.2...v0.4.3
v0.4.2
QLoRA RLHF, SFT Trainer and RewardTrainer
A new version of TRL that includes training larger models using QLoRA (4-bit quantization through bitsandbytes), plus brand new classes `RewardTrainer` and `SFTTrainer` to easily conduct your RLHF projects end-to-end!
Introducing `SFTTrainer` and `RewardTrainer`
Use the brand new trainers to easily train your reward model and supervised fine-tuned (SFT) model with a few lines of code! A minimal sketch follows the PR list below.
- [`core`] officially support SFT (Supervised Finetuning) by @younesbelkada in #323
- [`SFT`] Fix sft issues by @younesbelkada in #336
- [`docs`] fix SFT doc by @younesbelkada in #367
- [`core`] Officially Support Reward Modeling by @younesbelkada in #303
- Resolve broken evaluation/prediction for RewardTrainer by @tomaarsen in #404
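As a quick illustration, here is a minimal `SFTTrainer` sketch along the lines of the documentation quickstart; the `facebook/opt-350m` checkpoint and the `imdb` dataset are just examples. `RewardTrainer` follows the same pattern with a tokenized dataset of chosen/rejected pairs.

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # model name or an already-instantiated model
    train_dataset=dataset,
    dataset_text_field="text",  # column containing the raw text to fine-tune on
    max_seq_length=512,
)
trainer.train()
```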
QLoRA integration
Pass 4-bit models directly into `PPOTrainer` for more memory-efficient training; see the sketch after the PRs below.
- [`core`] Add 4bit QLora by @younesbelkada in #383
- [`bnb`] fix 4 bit SFT by @younesbelkada in #396
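A minimal sketch of the QLoRA path, assuming `bitsandbytes` and `peft` are installed. The OPT checkpoint, the LoRA hyperparameters and the batch sizes are placeholders.

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_id = "facebook/opt-350m"
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_id,
    peft_config=peft_config,  # trainable LoRA adapters on top of...
    load_in_4bit=True,        # ...frozen, 4-bit quantized base weights
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# With a peft model, no separate reference model is needed: the trainer
# computes reference logits by disabling the adapters.
ppo_trainer = PPOTrainer(PPOConfig(batch_size=16, mini_batch_size=4), model, tokenizer=tokenizer)
```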
Updated StackLlama example
Great work by @mnoukhov, who managed to fix the issues related to StackLlama and the new versions of `accelerate`, `peft` and `transformers`. The completely reproducible examples are below:
- StackLLaMA: correctly merge peft model by @mnoukhov in #398
- StackLlama: fixed RL training and added args by @mnoukhov in #400
- Fixed some type annotations of trl.trainer.PPoTrainer by @JulesGM in #392
- StackLLaMA: fix supervised finetuning and reward model training by @mnoukhov in #399
Bug fixes and improvements
- [`core`] refactor peft API by @younesbelkada in #231
- Batched generation by @lvwerra in #228
- Reduce memory consumption in batched_forward_pass by @ohashi56225 in #234
- [`core`] Add warning when negative KL by @younesbelkada in #239
- adds early stopping by @edbeeching in #238
- PPO config init is bloated by @GauravVirmani in #241
- feat(ci): enable `pip` cache by @SauravMaheshkar in #198
- Improve logging for PPO + Docs page by @natolambert in #243
- Fix typo by @heya5 in #253
- Using batched generate in sentiment scripts by @GauravVirmani in #249
- [`core`] Fix DeepSpeed zero-3 issue by @younesbelkada in #182
- [`distributed`] Fix early stopping and DP by @younesbelkada in #254
- [`core`] Fix ds issue by @younesbelkada in #260
- Add LlaMa in tests + `create_reference_model` by @younesbelkada in #261
- Use active model to generate response in example on README (#269) by @rmill040 in #271
- stack-llama by @edbeeching in #273
- Adding pointer back to Meta's LLaMA. by @meg-huggingface in #277
- fix doc string problem in ppo trainer loss function by @thuwyh in #279
- Add LLaMA tutorial to docs by @natolambert in #278
- Fix swapped helper texts by @philipp-classen in #284
- fix typo in gpt2-sentiment.ipynb by @eltociear in #293
- add functionality to push best models to the hub during training by @Bearnardd in #275
- Small improvements / fixes to toxicity example by @natolambert in #266
- Fix arguments description by @lvzii in #298
- [`t5`] Fix negative kl issue by @younesbelkada in #262
- Log Token distribution of Query / Response by @natolambert in #295
- clean examples folder by @natolambert in #294
- fixed typo in error message by @soerenarlt in #312
- fix DS for peft ref_model in ppo trainer by @halfrot in #309
- [`CI`] Fix broken tests by @younesbelkada in #318
- [`Docs`] Add details on multi-GPU / multi-node by @younesbelkada in #320
- Give a key to the wandb PPOConfig config entry by @JulesGM in #315
- added doc for using torch.distributed.launch/run by @oroojlooy in #324
- Fix argument's description by @vinhkhuc in #339
- stack_llama: update instructions in README, fix broken _get_submodules and save tokenizer by @teticio in #358
- stack_llama: add parameter to control max_length (to mitigate OOM errors) by @teticio in #359
- [`PPO`] Relax negative KL constraint by @younesbelkada in #352
- [`PPOTrainer`] Fix tensorboard issue by @younesbelkada in #330
- 140/best n sampling by @metric-space in #326
- Fix bug when loading local peft model by @Opdoop in #342
- add is_trainable in kwargs by @Opdoop in #363
- Remove obsolete layer_norm_names parameter and add peft>=0.3.0 to requirements by @teticio in #366
- Delete test_training.py by @younesbelkada in #371
- [`core`] Fix warning issue by @younesbelkada in #377
- Update customization.mdx by @binganao in #390
- fix dataloader typo in ppo_trainer.py by @LZY-the-boys in #389
- from_pretrain with peft adapter on the hub (# 379) by @glerzing in #380
- keep state_dict kwargs instead of popping it in save_pretrained by @rizar in #393
- Remove unused imports in docs. by @vwxyzjn in #406
New Contributors
- @ohashi56225 made their first contribution in #234
- @GauravVirmani made their first contribution in #241
- @SauravMaheshkar made their first contribution in #198
- @heya5 made their first contribution in #253
- @rmill040 made their first contribution in #271
- @thuwyh made their first contribution in #279
- @philipp-classen made their first contribution in #284
- @Bearnardd made their first contribution in #275
- @lvzii made their first contribution in #298
- @soerenarlt made their first contribution in #312
- @halfrot made their first contribution in #309
- @oroojlooy made their first contribution in #324
- @vinhkhuc made their first contribution in #339
- @teticio made their first contribution in #358
- @metric-space made their first contribution in #326
- @Opdoop made their first contribution in #342
- @binganao made their first contribution in #390
- @LZY-the-boys made their first contribution in #389
- @glerzing made their first contribution in #380
- @rizar made their first contribution in #393
- @mnoukhov made their first contribution in #398
- @tomaarsen made their first contribution in #404
- @vwxyzjn made their first contribution in #406
Full Changelog: v0.4.1...v0.4.2
v0.4.1
Large model training, Naive Pipeline Parallelism, `peft` Data Parallelism support and distributed training bug fixes
This release includes a set of features and bug fixes to scale up your RLHF experiments to much larger models, leveraging `peft` and `bitsandbytes`.
Naive Pipeline Parallelism support
- Let's support naive Pipeline Parallelism by @younesbelkada in #210
We introduce a new paradigm in `trl`, termed Naive Pipeline Parallelism, to fit large-scale models on your training setup and apply RLHF to them. This feature uses `peft` to train adapters and `bitsandbytes` to reduce the memory footprint of your active model; a minimal sketch is shown below.
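A minimal sketch of the idea, assuming `peft` and `bitsandbytes` are installed and multiple GPUs are visible. The `EleutherAI/gpt-neox-20b` checkpoint and the LoRA settings are placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    device_map="auto",  # shards the layers across all visible GPUs (naive pipeline parallelism)
    load_in_8bit=True,  # bitsandbytes quantization shrinks the memory footprint
)
base = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Wrap the adapter-equipped model with a value head so it can be used with PPOTrainer.
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)
```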
`peft` Data Parallelism support
- [`peft`] Fix DP issues by @younesbelkada in #221
- [`core`] fix DP issue by @younesbelkada in #222
There were some bugs with respect to the `peft` integration and DP. This release includes the bug fixes to enable multi-GPU training using `accelerate` + DDP (Distributed Data Parallel).
Memory optimization
Your training runs can now be much more memory efficient thanks to a few tricks and bug fixes:
`PPOConfig` now also supports the flag `optimize_cuda_cache` (set to `False` by default) to avoid increasing CUDA memory issues; see the snippet after the PRs below.
- Grad accumulation and memory bugfix by @edbeeching in #220
- adds a missing detach to the ratio by @edbeeching in #224
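For reference, a small snippet showing where the flag lives; the model name and learning rate are just illustrative values.

```python
from trl import PPOConfig

# optimize_cuda_cache is False by default; enabling it clears the CUDA cache
# during PPO optimization to mitigate memory growth on long runs.
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, optimize_cuda_cache=True)
```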
PyTorch 2.0 fixes
This release also includes minor fixes related to the PyTorch 2.0 release
- [`test`] attempt to fix CI test for PT 2.0 by @younesbelkada in #225
What's Changed
- adds sentiment example for a 20b model by @edbeeching in #208
- Update README.md blog post link by @TeamDman in #212
- spell mistakes by @k-for-code in #213
- spell corrections by @k-for-code in #214
- Small changes when integrating into H4 by @natolambert in #216
New Contributors
Full Changelog: v0.4.0...v0.4.1
v0.4.0
v0.4.0: `peft` integration
Apply RLHF and fine-tune your favorite large model on a consumer GPU using `peft` and `trl`! You can also easily share your trained RLHF adapters on the Hub with a few lines of code, as in the sketch below.
With this integration you can train `gpt-neo-x` (a 20B-parameter model, 40GB in `bfloat16`) on a 24GB consumer GPU!
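A minimal sketch of the workflow, assuming `peft` and `bitsandbytes` are installed; the checkpoint and the Hub repository name are placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    load_in_8bit=True,  # 8-bit base weights stay frozen on the consumer GPU
    device_map="auto",
)
base = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)

# ...run your PPO loop with PPOTrainer as usual, then share only the small
# trained adapter weights (not the 20B base model) on the Hub:
model.push_to_hub("my-username/gpt-neox-20b-rlhf-adapters")
```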
What's Changed
- Allow running evaluate-toxicity with cpu by @jordimas in #195
- [`core`] Fix quality issue by @younesbelkada in #197
- Add 1.12.1 torch compatibility in sum method by @PanchenkoYehor in #190
- `peft` integration by @edbeeching in #163
- [`core`] Update dependency by @younesbelkada in #206
New Contributors
- @PanchenkoYehor made their first contribution in #190
Full Changelog: v0.3.1...v0.4.0