Asynchronous RLHF: Faster and More Efficient Online DPO #2278
base: main
Conversation
What is the primary difference between this PR and #1628?
This is an updated and multi-GPU extension of #1628. It is also joint work between @vwxyzjn and me! Instead of keeping the vllm models on the same GPU, we move them to another one. It also uses the more flexible `vllm_single_gpu_patch`. Finally, this has been tested and verified to match regular Online DPO performance while being faster and more efficient; see our new preprint https://arxiv.org/abs/2410.18252
"""This file basically allows us to place vLLM's driver worker in a specified | ||
GPU. For example. you can try the following. | ||
```python | ||
from transformers import AutoTokenizer | ||
from vllm import SamplingParams | ||
from open_instruct.vllm_utils import SingleGPULLM | ||
tok = AutoTokenizer.from_pretrained("facebook/opt-125m") | ||
tok.chat_template = ( | ||
"{% for message in messages %}" | ||
"{{'\n\n' if not loop.first else ''}}" | ||
"{{message['role']|capitalize + ': ' +message['content']}}" | ||
"{% if loop.last and not add_generation_prompt %}{{ eos_token }}{% endif %}" | ||
"{% endfor %}" | ||
) | ||
prompts = [ | ||
{"role": "user", "content": "Compose a speech about the need for more affordable dental care."}, | ||
] | ||
prompt_ids = tok.apply_chat_template(prompts, add_generation_prompt=True) | ||
sampling_params = SamplingParams(temperature=0.001, top_p=1.0, max_tokens=1024, include_stop_str_in_output=True) | ||
llm = SingleGPULLM(model="facebook/opt-125m", tensor_parallel_size=1, device="cuda:1") | ||
llmp = llm.llm_engine.model_executor.driver_worker.model_runner.model | ||
print(f"🔥🔥🔥 vllm lives in {llmp.lm_head.weight.device}") | ||
print("prepare to generate") | ||
outputs = llm.generate(prompt_token_ids=[prompt_ids], sampling_params=sampling_params) | ||
for output in outputs: | ||
prompt = output.prompt | ||
generated_text = output.outputs[0].text | ||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") | ||
``` | ||
""" |
Could you kindly update the docstring? I think `SingleGPULLM` should be replaced with `vllm_single_gpu_patch` and `LLM`.
This implements a proposed faster and more efficient paradigm: asynchronous RLHF. See the paper: https://arxiv.org/abs/2410.18252
reasoning

- `vllm` inference is generally faster than `hf generate`. We want to separate generation and training so we can use fast generation libraries / utilities.
- this proposes a simple first solution: run training on `n` GPUs and generation on 1 GPU. This can be extended to run generation on more GPUs, but in practice, for >=8 GPU setups with 8B models and smaller, 1 GPU for generation tends to be fine.

setup
We create an asynchronous trainer for Online DPO that uses `vllm` for generation. The generation GPU has vllm started on a separate python thread, and communication between training and generation happens via Queues. The training loop looks something like this:

1. a. training thread sends a batch of prompts and the current model weights to generation
   b. generation thread sends the previous prompts with their generated completions to training
2. a. training thread calculates rewards, then trains on the previous prompts and completions and updates the weights
   b. generation thread generates completions for each prompt
3. ...
4. at the end, the training thread sends `None` for both prompts and parameters so the generation thread closes itself
Example usage for a 4 GPU setup: the generation GPU is by default the `accelerate.num_processes + 1`-th GPU, so GPUs `[1,2,3]` are used for training and GPU `4` runs generation with vllm.
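As a rough illustration of that default placement (a sketch assuming 0-indexed CUDA devices and using `accelerate`'s `Accelerator.num_processes`, not this PR's exact code):

```python
from accelerate import Accelerator

accelerator = Accelerator()
# With `--num_processes 3` on a 4-GPU node, the training processes occupy
# cuda:0..cuda:2, leaving the next device free for the vllm generation thread.
generation_device = f"cuda:{accelerator.num_processes}"   # e.g. "cuda:3"
print(f"generation runs on {generation_device}")
```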
notes

- `online_dpo_trainer` currently extends the regular Hugging Face `Trainer` and is limited to generating one minibatch of samples and training for one step on those samples. As argued in the paper, taking more training steps on the data (`num_ppo_epochs`) or generating more minibatches (`num_mini_batches`) can be useful. For this reason, `AsyncOnlineDPOTrainer` follows the style of `RLOOTrainer` (see the sketch after these notes).
- to test functionality, I've added a `SyncOnlineDPOTrainer` that has the exact same structure as the `Async` one but is synchronous and uses hf generate. I can remove it for the final submission.
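For reference, the two knobs mentioned above already exist on `RLOOConfig`, whose style this trainer follows. A minimal sketch (the values are illustrative, and an equivalent async online-DPO config is an assumption here, not this PR's API):

```python
from trl import RLOOConfig

# Illustrative values only; the async trainer is assumed to expose the same fields.
config = RLOOConfig(
    output_dir="async_online_dpo",
    num_ppo_epochs=2,     # optimization epochs to run over each generated batch
    num_mini_batches=2,   # minibatches each generation batch is split into
)
print(config.num_ppo_epochs, config.num_mini_batches)
```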
Before submitting / To Do

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.

I am working on docs. Do we want tests similar to `RLOOTrainer`?

Who can review?

@qgallouedec @lewtun anyone else! Comments welcome.