Differences in Model Performance When Reproducing Experiment #32
Comments
I'm not the author, but I'm also reproducing this paper, and I'm confused by the default settings in their scripts. If you change nothing in the scripts, step 2 builds a gradient datastore from only 200 samples per dataset, but for a faithful reproduction you should use the full dataset (am I right?). Would you mind sharing your results from steps 2, 3, and 4 (logging info, gradient datastore size, or anything else)? That would help me confirm where the problem lies.
Hi, I used their script for calculating gradients but deleted '--max_samples 200' so that it builds a gradient datastore over all data samples. For the batch size, I used the default setting in their scripts and got checkpoints at steps 422, 845, 1268, and 1688.
I guess you probably ran the script on a single GPU. The author reported bz=128, so the total steps for one epoch over the 4 datasets should be 105, which is the
Yes, I ran it on a single GPU, but I don't think the batch size alone should affect the result that much.
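For reference, the batch-size arithmetic in this exchange can be sketched as follows. This is only a rough estimate: it assumes both runs train on the same number of examples and that the checkpoint numbers are optimizer steps. The function name is hypothetical and not part of the repo.

```python
def implied_effective_batch_size(reference_batch_size, reference_steps, observed_steps):
    """Estimate the effective batch size of a run from its steps per epoch.

    Assumes both runs see the same number of examples per epoch, so
    batch_size * steps_per_epoch is roughly constant between them.
    """
    return reference_batch_size * reference_steps / observed_steps


# Numbers quoted in this thread: bz=128 gives about 105 steps per epoch,
# while the single-GPU run saved its first checkpoint at step 422.
estimate = implied_effective_batch_size(128, 105, 422)
print(f"implied effective batch size: {estimate:.1f}")
```

Under these assumptions the single-GPU run used an effective batch size of about 32, a quarter of the reported 128, so each epoch took roughly four times as many optimizer steps.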
Hi, thank you for your nice work!
I'm reproducing the results in Table 2, using the Mistral-7B model on MMLU and TydiQA and selecting 5% of the data.
I followed the scripts in your repo for the warmup, data selection, and training, and used the evaluation code in your repo to evaluate. I did not change any settings in your scripts, though I used only a single random seed, 3.
Despite following these settings, the performance of my model is worse than the results in Table 2.
For MMLU, Random scores 58.3 (60.0 in your paper) and LESS scores 60.8 (61.8 in your paper).
For TydiQA, the F1 of Random is 44.6 and the F1 of LESS is 55.1.
My environment is: torch 2.4.0, transformers 4.45.2, peft 0.13.1, datasets 3.0.1.
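In case it helps others compare setups, here is a minimal sketch for checking an environment against the versions listed above. It uses only the standard library; the helper name `compare_versions` is hypothetical, and local build suffixes such as `+cu121` are ignored as an assumption about how torch reports its version.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in this issue.
REPORTED = {
    "torch": "2.4.0",
    "transformers": "4.45.2",
    "peft": "0.13.1",
    "datasets": "3.0.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed, matches)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Ignore local build tags such as "+cu121" when comparing.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (want, have, matches)
    return report


for name, (want, have, ok) in compare_versions(REPORTED).items():
    status = "ok" if ok else "differs"
    print(f"{name}: expected {want}, installed {have} ({status})")
```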
Are these differences reasonable? Could you please confirm if the settings in your scripts are fully aligned with those used in your paper?
Thanks.