Releases: huggingface/trl

v0.6.0

25 Aug 15:08

DDPO for diffusion models

We are excited to welcome DDPO, the first RLHF algorithm for diffusion models in TRL, which lets you refine the generations of a diffusion model against a reward function.
Read more about it directly in the docs.
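
As a quick orientation, here is a minimal sketch of what wiring up the new trainer can look like. The prompt and reward functions are placeholders, and the class and argument names are our best reading of the new API, so treat the docs as authoritative.

```python
# Minimal DDPO sketch: prompt_fn and reward_fn are placeholders, and the exact
# API may differ slightly -- see the DDPO docs for the full example.
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline

def prompt_fn():
    # Return a (prompt, metadata) pair used to sample images.
    return "a photo of a cute corgi", {}

def reward_fn(images, prompts, metadata):
    # Placeholder reward: score each generated image (e.g. with an aesthetic model).
    return torch.zeros(len(images)), {}

config = DDPOConfig()  # batch sizes, number of epochs, etc. go here
pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")

trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()
```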

[Figure: sample generations before vs. after DDPO fine-tuning]

Bug fixes and other enhancements

The release also comes with multiple bug fixes reported and/or contributed by the community; check out the commit history below.

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0

v0.5.0

02 Aug 09:08

v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer

This release includes multiple important bug fixes (SFTTrainer, PPOTrainer) and extends the current DataCollatorForCompletionOnlyLM to support chat-style training.

DPO Trainer

The DPO algorithm (Direct Preference Optimization), introduced by Rafailov et al. in this paper, is a way of fine-tuning models on preference data without having to rely on a reward model. The DPOTrainer is now part of the TRL library thanks to the amazing contributors! A minimal usage sketch follows the PR list below.

  • DPO Trainer by @kashif in #416
  • [DPO] make sure all the concated batches are on same device by @kashif in #528
  • [DPO] remove response/pairs from the DPO side by @kashif in #540
  • [DPO] remove unnecessary batch size arg to Collator by @kashif in #554
  • [DPO] Resolve logging for DPOTrainer by @tomaarsen in #570
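
Here is that sketch: a minimal, hedged example of the new DPOTrainer. The preference dataset id is a placeholder and must provide prompt, chosen and rejected columns.

```python
# Minimal DPO sketch: the dataset id is a placeholder and must expose
# "prompt", "chosen" and "rejected" columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("your-org/your-preference-dataset", split="train")  # placeholder

training_args = TrainingArguments(
    output_dir="dpo-gpt2",
    per_device_train_batch_size=2,
    remove_unused_columns=False,  # the trainer consumes the preference columns itself
)

trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    beta=0.1,  # strength of the implicit KL penalty against the reference model
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```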

What's Changed

  • Reward trainer multi-gpu eval bug by @rlindskog in #513
  • Use local process index for _get_current_device() by @lewtun in #515

Extending the DataCollatorForCompletionOnlyLM

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on the chat completions. Check out the PR below, the corresponding section of the documentation, or the sketch after this list to learn more about it!

  • Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456
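
Here is that sketch; the "### Human:" / "### Assistant:" templates are purely illustrative and should match whatever format your dataset uses.

```python
# Sketch: mask the user turns and compute the loss only on assistant completions.
# The instruction/response templates are illustrative.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForCompletionOnlyLM(
    response_template="### Assistant:",   # labels are kept after this marker
    instruction_template="### Human:",    # labels are masked after this marker
    tokenizer=tokenizer,
    mlm=False,
)

batch = collator([tokenizer("### Human: Hi!\n### Assistant: Hello there!")])
# Tokens belonging to the human turn now have their labels set to -100.
```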

Important bug fixes

Multiple bugs in the supported trainers were reported by the community and fixed in the PRs below.

Big refactor of examples and documentation

The examples and documentation have been refactored; check the PRs below for more details.

New Contributors

Full Changelog: v0.4.7...v0.5.0

v0.4.7

13 Jul 09:08

Patch release: SFTTrainer and PPOTrainer bug fixes

What's Changed

New Contributors

Full Changelog: v0.4.6...v0.4.7

v0.4.6

23 Jun 09:19

Patch release

Patch release to fix a bug with PPOTrainer & PPOConfig + wandb on Google Colab

What's Changed

Full Changelog: v0.4.5...v0.4.6

v0.4.5

23 Jun 08:40

Patch release 1 - SFTTrainer enhancements and fixes

This patch release adds multiple fixes and enhancements for the SFTTrainer. Another patch release is coming to fix an issue with PPOTrainer on Google Colab combined with wandb logging.

What's Changed

New Contributors

Full Changelog: v0.4.4...v0.4.5

v0.4.4

08 Jun 14:42

Patch release

Full Changelog: v0.4.3...v0.4.4

v0.4.3

08 Jun 08:54

0.4.3 Patch release

Patch release - pin accelerate version

Full Changelog: v0.4.2...v0.4.3

v0.4.2

07 Jun 13:20

QLoRA RLHF, SFT Trainer and RewardTrainer

A new version of TRL that lets you train larger models using QLoRA (4-bit quantization through bitsandbytes) and ships the brand new RewardTrainer and SFTTrainer classes to easily conduct your RLHF projects end-to-end!

Introducing SFTTrainer and RewardTrainer

Use the brand new trainers to easily train your reward model and supervised fine-tuned (SFT) model with just a few lines of code!
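
For example, a supervised fine-tuning run can be as small as the following sketch (the model and dataset ids are illustrative); RewardTrainer follows the same spirit with paired chosen/rejected examples.

```python
# Minimal SFT sketch: the model and dataset ids are illustrative.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",         # model name (or an already-loaded model)
    train_dataset=dataset,
    dataset_text_field="text",   # dataset column containing the raw training text
    max_seq_length=512,
)
trainer.train()
```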

QLoRA integration

Pass 4-bit models directly into PPOTrainer for more memory-efficient training.
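
A hedged sketch of that pattern, with placeholder model id and hyperparameters: the base model is loaded in 4 bits, wrapped with LoRA adapters, and handed to PPOTrainer.

```python
# Sketch: 4-bit base model + LoRA adapters, passed straight to PPOTrainer.
# The model id and hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_id = "your-org/your-base-model"  # placeholder
base = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map={"": 0})
base = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)  # only adapters + value head train

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config=PPOConfig(batch_size=16, mini_batch_size=4),
    model=model,
    tokenizer=tokenizer,  # ref_model omitted: with peft, disabling the adapters recovers the reference policy
)
```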

Updated StackLlama example

Great work by @mnoukhov, who fixed the issues with StackLlama and the new versions of accelerate, peft and transformers. The fully reproducible examples are listed below:

  • StackLLaMA: correctly merge peft model by @mnoukhov in #398
  • StackLlama: fixed RL training and added args by @mnoukhov in #400
  • Fixed some type annotations of trl.trainer.PPoTrainer by @JulesGM in #392
  • StackLLaMA: fix supervised finetuning and reward model training by @mnoukhov in #399

Bug fixes and improvements

New Contributors

Full Changelog: v0.4.1...v0.4.2

v0.4.1

17 Mar 10:39

Large model training, Naive Pipeline Parallelism, peft Data Parallelism support and distributed training bug fixes

This release includes a set of features and bug fixes to scale up your RLHF experiments for much larger models leveraging peft and bitsandbytes.

Naive Pipeline Parallelism support

We introduce a new paradigm in trl, termed Naive Pipeline Parallelism, to fit large-scale models on your training setup and apply RLHF to them. This feature uses peft to train adapters and bitsandbytes to reduce the memory footprint of your active model.
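
A rough sketch of the idea (the model id is a placeholder, and the repo examples may use a more tailored device map than device_map="auto"):

```python
# Sketch: shard an 8-bit base model across the available GPUs and train only LoRA adapters.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-large-model",  # placeholder
    load_in_8bit=True,            # bitsandbytes 8-bit weights to cut the memory footprint
    device_map="auto",            # naively split the layers across GPUs
)
base = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)
```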


peft Data Parallelism support

There were some bugs with respect to the peft integration and data parallelism. This release includes the bug fixes needed to enable multi-GPU training using accelerate + DDP (Distributed Data Parallel).

Memory optimization

Your training runs can now be much more memory efficient thanks to a few tricks and bug fixes:
PPOConfig now supports the flag optimize_cuda_cache (set to False by default) to mitigate growing CUDA memory usage.
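
Opting in is a one-liner:

```python
from trl import PPOConfig

# Periodically free the CUDA cache during PPO training (off by default).
config = PPOConfig(optimize_cuda_cache=True)
```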

Pytorch 2.0 fixes

This release also includes minor fixes related to the PyTorch 2.0 release.

What's Changed

New Contributors

  • @TeamDman made their first contribution in #212
  • @k-for-code made their first contribution in #213

Full Changelog: v0.4.0...v0.4.1

v0.4.0

09 Mar 11:38

v0.4.0: peft integration

Apply RLHF and fine-tune your favorite large model on a consumer GPU using peft and trl! You can also easily share your trained RLHF adapters on the Hub with just a few lines of code.

With this integration you can train gpt-neo-x (20B parameter model - 40GB in bfloat16) on a 24GB consumer GPU!
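
A hedged sketch of that workflow; the Hub repository id is a placeholder, and we assume pushing the wrapped peft model uploads only the small adapter weights.

```python
# Sketch: RLHF-tune gpt-neo-x with 8-bit weights + LoRA adapters, then share the adapters.
# The Hub repo id is a placeholder.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
from trl import AutoModelForCausalLMWithValueHead

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", load_in_8bit=True, device_map="auto")
base = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model = AutoModelForCausalLMWithValueHead.from_pretrained(base)

# ... run your PPO training loop here ...

model.push_to_hub("your-username/gpt-neox-20b-rlhf-adapters")  # uploads the adapter weights
```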

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.4.0