Could you please release the processed pretraining data? #8

Closed
phellonchen opened this issue Feb 14, 2023 · 3 comments

Comments

@phellonchen

No description provided.

@StevenTang1998
Member

You can download them from https://huggingface.co/RUCAIBox. Because some datasets have license restrictions, we cannot merge them into a single dataset. You can merge them on your own.

@phellonchen
Author

Thanks. One more question: where can I find the code for the temperature-scaled mixing strategy (Raffel et al., 2020) with a rate of T = 2, used to mitigate the disparity in tasks and datasets? I have not found it in https://github.com/RUCAIBox/TextBox.

@StevenTang1998
Member

The general pre-training code is still under development. For pre-training MVP, we simply implemented the temperature-scaled mixing strategy by copying instances. You can also use this as a simple alternative.
For example, suppose dataset A has 2 instances and dataset B has 8 instances. We merge them into a unified dataset with the temperature-scaled mixing strategy by doubling the instances in dataset A.
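
The arithmetic behind this example can be sketched as follows. In Raffel et al. (2020), each dataset's mixing rate is raised to the power 1/T and renormalized, so with T = 2 the target share of a dataset is proportional to the square root of its size. For sizes 2 and 8 the raw ratio of 1:4 becomes 1:2, which is reached by keeping B as it is and copying A twice. The snippet below is only an illustration of that idea, not the code used to pre-train MVP: the function names `replication_factors` and `merge_with_copies` are hypothetical, and anchoring on the largest dataset and rounding to whole copies are assumptions made here.

```python
def replication_factors(sizes, temperature=2.0):
    """Return how many times to copy each dataset (hypothetical helper).

    sizes: dict mapping dataset name -> number of instances.
    The target share of each dataset is proportional to size ** (1 / temperature).
    Assumption: the largest dataset is kept once and the others are copied
    a whole number of times, so the result only approximates the target.
    """
    weights = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    largest = max(sizes, key=sizes.get)
    # Scale the weights so that the largest dataset keeps its original size.
    scale = sizes[largest] / weights[largest]
    factors = {}
    for name, n in sizes.items():
        target_count = weights[name] * scale
        factors[name] = max(1, round(target_count / n))
    return factors


def merge_with_copies(datasets, temperature=2.0):
    """Merge datasets into one list, copying instances per replication_factors."""
    sizes = {name: len(items) for name, items in datasets.items()}
    factors = replication_factors(sizes, temperature)
    merged = []
    for name, items in datasets.items():
        merged.extend(list(items) * factors[name])
    return merged


# The example from the comment above: A has 2 instances, B has 8.
datasets = {
    "A": ["a1", "a2"],
    "B": ["b1", "b2", "b3", "b4", "b5", "b6", "b7", "b8"],
}
factors = replication_factors({name: len(d) for name, d in datasets.items()})
merged = merge_with_copies(datasets)
print(factors)      # A is copied twice, B once
print(len(merged))  # 4 instances from A plus 8 from B
```

Shuffling the merged list before training is left to the caller. With dataset sizes that do not yield whole-number ratios, the rounding step means the mixture only approximates the temperature-scaled proportions.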
