added initial TPO implementation #1965
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi @qgallouedec, I fixed the issues. I would appreciate it if you could review the PR.
Thanks for contributing @sahsaeedi, and sorry for the delay in answering.
Hi @qgallouedec, yes, I fine-tuned Llama3-8B-Instruct on the llama3-ultrafeedback-armorm dataset. Let me know if you need any specific results.
so @sahsaeedi I kinda refactored DPO's data processing helpers etc. and was thinking... can one just subclass the DPOTrainer?
Hi @kashif, I am still trying to figure it out; my main concern is the data processing. TPO's data processing is a little different from DPO's: the dataset needs extra processing, and some conditions need to be met. It would be better to have TPO inherit from Trainer instead of DPOTrainer. However, if you think we have to subclass DPOTrainer, I will start working on it.
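For context, here is a minimal sketch of what the subclassing approach under discussion could look like. This is not the PR's actual code: it assumes the DPOTrainer of this era exposes a per-row `tokenize_row` hook, and the `TPOTrainer` class name and the extra `reference` column are illustrative assumptions only.

```python
# Hypothetical sketch only -- not the PR's implementation. Assumes
# DPOTrainer exposes a per-row `tokenize_row` hook (as in TRL ~0.9)
# and that TPO adds one extra response field per example.
from trl import DPOTrainer


class TPOTrainer(DPOTrainer):
    def tokenize_row(self, feature, model=None):
        # Reuse DPO's prompt/chosen/rejected tokenization...
        batch = super().tokenize_row(feature, model)
        # ...then handle the third preference that TPO needs. The column
        # name `reference` is an assumption for illustration.
        reference = feature["reference"]
        batch["reference_input_ids"] = self.tokenizer(
            reference, add_special_tokens=False
        )["input_ids"]
        return batch
```

Whether this is cleaner than inheriting from Trainer directly depends on how much of DPO's processing TPO can actually reuse, which is the open question in this thread.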
Hi @qgallouedec,
Hi @sahsaeedi, sorry for the delay. Could you please update your branch? We've been doing a lot of work recently to standardize the API across trainers, docs, configurations, etc., and this branch should be aligned with those recent changes. Feel free to ask if you need help with this. In addition, we're working on refactoring the data processing in DPO (which I think your code is mainly inspired by) because it's too complex at the moment. I'd like to avoid refactoring two trainers, so I won't merge this one until that's done. You'll probably have to do a second round of updates.
Hi @qgallouedec, no worries. Thanks for the response.
What does this PR do?
This PR adds an initial TPO (Triple Preference Optimization) implementation.
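For readers of this thread, a hedged usage sketch of how the trainer would presumably be invoked if it follows TRL's usual trainer API. The class names `TPOConfig`/`TPOTrainer`, the import path, and the dataset hub id are assumptions based on this conversation, not taken from the diff.

```python
# Hypothetical usage sketch based on TRL's usual trainer API; class names,
# import path, and dataset id are assumptions, not confirmed from this PR.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import TPOConfig, TPOTrainer  # assumed to exist once this PR is merged

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The thread mentions the llama3-ultrafeedback-armorm dataset; the exact
# hub repo id used here is a guess.
dataset = load_dataset("princeton-nlp/llama3-ultrafeedback-armorm", split="train")

training_args = TPOConfig(output_dir="llama3-8b-tpo", per_device_train_batch_size=2)
trainer = TPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```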
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. (Please add TPO trainer to the trl #1901)
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@qgallouedec