
Evaluation Metric #44

Open · gururise opened this issue Apr 2, 2023 · 8 comments
Labels: enhancement (New feature or request)

Comments

gururise (Owner) commented Apr 2, 2023

Planning on adding evaluation metrics that can be used to benchmark trained Alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD dataset - F1 score
  2. WikiText dataset - perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
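For reference, SQuAD F1 is the token-overlap F1 between the model's answer and the reference answer, averaged over the evaluation questions. A minimal sketch of the metric (illustrative only, not this repo's eval code; normalization is simplified relative to the official SQuAD script):

```python
# Minimal sketch of SQuAD-style token-overlap F1.
import re
import string
from collections import Counter

def normalize(text):
    # lowercase, drop punctuation and articles, then whitespace-tokenize
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def squad_f1(prediction, reference):
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(squad_f1("the Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

When a question has several reference answers, the standard convention is to take the max F1 over the references before averaging.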

gururise added the enhancement label Apr 2, 2023
gururise (Owner, Author) commented Apr 2, 2023

So I've compared two different Alpaca 7B models on the SQuAD dataset:

| Dataset | Model | SQuAD (Mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64 |

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

claysauruswrecks (Contributor) commented

I have mentioned a few options previously in this issue: tloen/alpaca-lora#147

gururise (Owner, Author) commented Apr 2, 2023

Just FYI: I re-ran the SQuAD Mini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an average F1 score of 55.229.

gururise (Owner, Author) commented Apr 3, 2023

Just added the PIQA benchmark and also redid the scoring of the SQuAD bench:

| Dataset | Hugging Face | Parameters | SQuAD Mini (F1) | PIQA (acc) |
| --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b | 74.271 | 50.5 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b | 75.629 | 54.0 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7b | 76.388 | 52.6 |
| GPT4All | nomic-ai/gpt4all-lora | 7b | 72.643 | 49.5 |

Note: PIQA benchmark has issues. Do not use it yet.
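For context, PIQA is normally scored as a two-way multiple-choice task: the model picks whichever candidate solution it assigns the higher log-likelihood. A rough sketch of that scoring approach (illustrative only, not necessarily what this repo's benchmark script does; the model name is a placeholder):

```python
# Sketch of log-likelihood-based multiple-choice scoring for PIQA-style items.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-alpaca-7b-checkpoint"  # hypothetical placeholder; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def completion_logprob(prompt, completion):
    # Sum the log-probabilities of the completion tokens conditioned on the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def piqa_predict(goal, sol1, sol2):
    # Returns 0 or 1, matching PIQA's label convention.
    scores = [completion_logprob(goal + " ", sol) for sol in (sol1, sol2)]
    return int(scores[1] > scores[0])
```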

gururise (Owner, Author) commented Apr 6, 2023

Decided to standardize on the lm-eval-harness from EleutherAI instead. Here are the new results:

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc norm) |
| --- | --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b (lora) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b (lora) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7b (lora) | 10.09 | 38.97 | 78.40 |

Not sure why the model trained on the cleaned dataset scored so high in the MNLI benchmark. I ran the test multiple times to confirm.
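For reproducibility, the harness can be driven from Python roughly as below. This is a hedged sketch: the `simple_evaluate` entry point, backend name, and task identifiers vary between harness versions, and LoRA adapters may need to be merged into the base weights (or passed via whatever adapter support your version has), so treat the argument strings as assumptions to adapt rather than exact values.

```python
# Hedged sketch of calling EleutherAI's lm-evaluation-harness from Python.
# Backend/task names and adapter handling depend on the harness version installed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                  # HuggingFace causal-LM backend
    model_args="pretrained=path/to/merged-alpaca-7b",   # placeholder checkpoint path
    tasks=["wikitext", "piqa", "mnli"],
    num_fewshot=0,
)
print(results["results"])  # per-task metrics, e.g. word_perplexity, acc, acc_norm
```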

YukinoshitaKaren commented

May I ask a question? You got a 49.64 SQuAD (Mini) F1 with 'tloen/alpaca-lora-7b' two weeks ago, and the same model got 75.629 last week. Why are the two results so different? I tried this model and got around 55.07 SQuAD (Mini) F1.

gururise (Owner, Author) commented

> May I ask a question? You got a 49.64 SQuAD (Mini) F1 with 'tloen/alpaca-lora-7b' two weeks ago, and the same model got 75.629 last week. Why are the two results so different? I tried this model and got around 55.07 SQuAD (Mini) F1.

The SQuAD Mini score calculations were redone in that time. Anyhow, going forward we are ditching the benchmark eval.py and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README come directly from the lm-evaluation-harness report.

claysauruswrecks (Contributor) commented

[image attachment]

https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf
