Evaluation Metric #44
So I've compared two different alpaca 7b models on the SQuAD dataset:
At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.
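For context on what the reported F1 numbers measure, here is a minimal sketch of SQuAD-style token-overlap F1. The `normalize` and `f1_score` helpers below are illustrative only (they are not the repo's actual eval script), but they follow the standard SQuAD convention of comparing normalized token bags:

```python
# Minimal sketch of SQuAD-style token-overlap F1 (illustrative helpers, not the
# repo's eval script). Answers are normalized, then compared as token bags.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap still earns partial credit:
print(f1_score("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.67
```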
I have mentioned a few options previously in this issue: tloen/alpaca-lora#147
Just FYI. I re-ran the SQuAD (Mini) bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an avg F1 score of 55.229.
Just added the PIQA benchmark and also redid the scoring of the SQuAD bench:
Note: the PIQA benchmark has issues. Do not use it yet.
Decided to standardize by using the lm-evaluation-harness from EleutherAI instead. Here are the new results:
Not sure why the model trained on the cleaned dataset scored so high on the MNLI benchmark. I ran the test multiple times to confirm.
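For reference, the harness can be driven from Python as well as the CLI; a rough sketch of scoring a checkpoint is below. The model key ("hf" in recent versions, "hf-causal" in older ones), the task names, and the exact `simple_evaluate` signature vary between harness versions, and the checkpoint path is a placeholder, so treat this as an assumption and check the harness README for the exact invocation:

```python
# Hedged sketch of scoring a checkpoint with EleutherAI's lm-evaluation-harness.
# The model key, task names, and simple_evaluate signature differ between
# harness versions; the checkpoint path is a placeholder for a merged alpaca model.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                         # Hugging Face causal-LM backend
    model_args="pretrained=/path/to/merged-alpaca-7b",  # placeholder path
    tasks=["piqa", "wikitext"],                         # tasks as registered in the harness
    batch_size=8,
)
print(results["results"])  # per-task metrics (accuracy, perplexity, etc.)
```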
May I ask a question? You used 'tloen/alpaca-lora-7b' and got a 49.64 SQuAD (Mini) F1 two weeks ago, but with the same model you got 75.629 last week. Why are the two results so different? I have tried this model and got around 55.07 SQuAD (Mini) F1.
The SQuAD (Mini) score calculations were redone in that time. Anyhow, going forward, we are ditching the benchmark eval.py and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README are taken directly from the lm-evaluation-harness report.
Planning on adding an evaluation metric that can be used to benchmark trained alpaca models.
Going to focus on these two datasets for evaluation:
- Wikitext (perplexity)
- SQuAD (F1)
I'm not so sure the Wikitext perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
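As a rough illustration of the Wikitext metric: perplexity is just the exponentiated mean negative log-likelihood the model assigns to held-out text. The sketch below uses placeholder model and text values and skips the sliding-window evaluation a real Wikitext run would use:

```python
# Rough sketch of the perplexity metric: exp of the mean negative log-likelihood
# over held-out text. Model name and text are placeholders; a real Wikitext run
# would stride over full documents instead of a single short string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

text = "Alpaca is a fine-tuned LLaMA model."  # stand-in for a Wikitext document
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids, the model returns the mean cross-entropy over tokens.
    loss = model(ids, labels=ids).loss

print(torch.exp(loss).item())  # perplexity = exp(mean NLL)
```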