
added initial TPO implementation #1965

Open
wants to merge 27 commits into main
Conversation

@sahsaeedi commented Aug 24, 2024

What does this PR do?

This PR adds initial TPO (Triple Preference Optimization) implementation.
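For illustration, a minimal sketch of how the trainer might be used, assuming it follows the usual TRL trainer API; the `TPOConfig` name, the dataset hub path, and the hyperparameters below are assumptions for this sketch, not part of the PR itself:

```python
# Hypothetical usage sketch only; TPOConfig, the dataset path, and the argument
# names are assumptions and may not match the final implementation.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import TPOConfig, TPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with prompt/chosen/rejected-style columns (assumed format,
# hub path assumed for the llama3-ultrafeedback-armorm dataset mentioned below).
dataset = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")

training_args = TPOConfig(output_dir="llama3-8b-tpo", per_device_train_batch_size=2)

trainer = TPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```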

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case. (Please add TPO trainer to the trl #1901)
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@qgallouedec

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sahsaeedi (Author)

Hi @qgallouedec,

I fixed the issues. I would appreciate it if you could review the PR.

@qgallouedec (Member)

Thanks for contributing, @sahsaeedi, and sorry for the delay in answering.
Do you have any reference results to share (e.g. a wandb run) that use this implementation?

@sahsaeedi (Author) commented Sep 18, 2024

Hi @qgallouedec,

Yes, I fine-tuned Llama3-8B-Instruct on the llama3-ultrafeedback-armorm dataset.

[screenshot of training results]

Let me know if you need some specific results.
Thanks!

@kashif (Collaborator) commented Sep 18, 2024

So @sahsaeedi, I've been refactoring DPO's data-processing helpers etc. and was wondering: can one just subclass the DPOTrainer for this method?
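For reference, a rough sketch of what that subclassing idea could look like; the overridden method name below follows the current DPOTrainer layout and is an assumption, not a proposal for the final design:

```python
# Rough sketch of the subclassing idea only; the method name is an assumption
# based on DPOTrainer's current structure, not the final design.
from trl import DPOTrainer


class TPOTrainer(DPOTrainer):
    """Reuse DPOTrainer's collation/logging plumbing and swap in the TPO objective."""

    def get_batch_loss_metrics(self, model, batch, train_eval="train"):
        # Compute the TPO loss here instead of the DPO loss; the surrounding
        # training loop, metrics, and checkpointing stay inherited.
        raise NotImplementedError("TPO loss goes here")
```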

@sahsaeedi (Author)

Hi @kashif,

I am still trying to figure it out, and my main concern is the data processing. TPO's data processing is a little different from DPO's: the dataset needs extra processing, and some conditions have to be met. It seems better to have TPO inherit from Trainer rather than from DPOTrainer. However, if you think we have to subclass DPOTrainer, I will start working on it.
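To illustrate the data-processing difference, a minimal sketch of pre-tokenizing an example that carries a third response field alongside the chosen and rejected ones, as the name "Triple Preference Optimization" suggests; the exact format is defined by this PR, and the column names below are placeholders only:

```python
# Illustrative only: a possible pre-tokenization step for triple-preference data.
# The "gold"/"chosen"/"rejected" column names are placeholders, not the PR's format.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")


def tokenize_triple(example, max_length=1024):
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    features = {"prompt_input_ids": prompt_ids}
    # DPO only handles a chosen/rejected pair; a third response field adds extra
    # truncation and padding decisions that the DPO helpers don't cover.
    for key in ("gold", "chosen", "rejected"):
        completion_ids = tokenizer(example[key], add_special_tokens=False)["input_ids"]
        ids = prompt_ids + completion_ids + [tokenizer.eos_token_id]
        features[f"{key}_input_ids"] = ids[:max_length]
    return features


# dataset = dataset.map(tokenize_triple)  # applied once, before training starts
```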

@sahsaeedi (Author)

Hi @qgallouedec,
Is there any update?

@qgallouedec (Member)

Hi @sahsaeedi, sorry for the delay. Could you please update your branch? We've been doing a lot of work recently to standardize the API through trainers, docs, configurations, etc. This branch should be aligned with recent changes. Feel free to ask if you need help with this.

In addition, we're working on refactoring the data processing in DPO (which I think your code is mainly inspired by) because it's too complex at the moment. I'd like to avoid refactoring two trainers, so I won't merge this one until it's done. You'll probably have to do a second round of update.

@sahsaeedi (Author)

Hi @qgallouedec ,

No worries. Thanks for the response.
Would you like me to update the branch here, or should I update the TPOTrainer to the latest trl version?
