
Evaluation Metric #44

Open · gururise opened this issue Apr 2, 2023 · 8 comments
Labels: enhancement (New feature or request)

Comments

gururise (Owner) commented Apr 2, 2023

Planning on adding evaluation metrics that can be used to benchmark trained Alpaca models.

Going to focus on these two datasets for evaluation:

  1. SQuAD dataset - F1 score
  2. WikiText dataset - perplexity

I'm not so sure the WikiText perplexity score will give us much useful information, but it seems to be a popular metric for these foundation models. I'm more interested in the SQuAD F1 score, which will give us a standard benchmark for a Q/A task. Even though Alpaca is not trained on a SQuAD-style dataset, I think with the right prompt it can be done.
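For reference, SQuAD F1 is the token-overlap F1 between the model's answer and the reference answer, averaged over the evaluation questions. A minimal sketch of the metric (illustrative only, not this repo's eval code; normalization is simplified relative to the official SQuAD script):

```python
# Minimal sketch of SQuAD-style token-overlap F1.
import re
import string
from collections import Counter

def normalize(text):
    # lowercase, drop punctuation and articles, then whitespace-tokenize
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def squad_f1(prediction, reference):
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(squad_f1("the Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

When a question has several reference answers, the standard convention is to take the max F1 over the references before averaging.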

gururise added the enhancement label Apr 2, 2023
gururise (Owner, Author) commented Apr 2, 2023

So I've compared two different Alpaca 7B models on the SQuAD dataset:

| Dataset | Model | SQuAD (Mini) F1 |
| --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 34.63 |
| Cleaned Alpaca | tloen/alpaca-lora-7b | 49.64 |

At least on the surface, it appears the cleaning & curation we've been doing has helped significantly.

claysauruswrecks (Contributor) commented

I have mentioned a few options previously in this issue: tloen/alpaca-lora#147

gururise (Owner, Author) commented Apr 2, 2023

Just FYI: I re-ran the SQuAD Mini bench on a model I fine-tuned on the March 31 release of the cleaned dataset and got an average F1 score of 55.229.

gururise (Owner, Author) commented Apr 3, 2023

Just added the PIQA benchmark and also redid the scoring of the SQuAD bench:

| Dataset | Hugging Face | Parameters | SQuAD Mini (F1) | PIQA (acc) |
| --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b | 74.271 | 50.5 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b | 75.629 | 54.0 |
| Cleaned Alpaca (Mar 31) | yahma/alpaca-7b-lora | 7b | 76.388 | 52.6 |
| GPT4All | nomic-ai/gpt4all-lora | 7b | 72.643 | 49.5 |

Note: PIQA benchmark has issues. Do not use it yet.
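For context, PIQA is normally scored as a two-way multiple-choice task: the model picks whichever candidate solution it assigns the higher log-likelihood. A rough sketch of that scoring approach (illustrative only, not necessarily what this repo's benchmark script does; the model name is a placeholder):

```python
# Sketch of log-likelihood-based multiple-choice scoring for PIQA-style items.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-alpaca-7b-checkpoint"  # hypothetical placeholder; substitute the model under test
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def completion_logprob(prompt, completion):
    # Sum the log-probabilities of the completion tokens conditioned on the prompt.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def piqa_predict(goal, sol1, sol2):
    # Returns 0 or 1, matching PIQA's label convention.
    scores = [completion_logprob(goal + " ", sol) for sol in (sol1, sol2)]
    return int(scores[1] > scores[0])
```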

gururise (Owner, Author) commented Apr 6, 2023

Decided to standardize on the lm-eval-harness from EleutherAI instead. Here are the new results:

| Dataset | Model | Parameters | WikiText (ppl) | MNLI (acc) | PIQA (acc norm) |
| --- | --- | --- | --- | --- | --- |
| Original Alpaca | samwit/alpaca7B-lora | 7b (lora) | 9.5396 | 38.33 | 78.51 |
| Cleaned Alpaca (Mar 26) | tloen/alpaca-lora-7b | 7b (lora) | 9.4885 | 51.6 | 79.33 |
| GPT4All | nomic-ai/gpt4all-lora | 7b (lora) | 10.09 | 38.97 | 78.40 |

Not sure why the model trained on the cleaned dataset scored so high in the MNLI benchmark. I ran the test multiple times to confirm.
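For reproducibility, the harness can be driven from Python roughly as below. This is a hedged sketch: the `simple_evaluate` entry point, backend name, and task identifiers vary between harness versions, and LoRA adapters may need to be merged into the base weights (or passed via whatever adapter support your version has), so treat the argument strings as assumptions to adapt rather than exact values.

```python
# Hedged sketch of calling EleutherAI's lm-evaluation-harness from Python.
# Backend/task names and adapter handling depend on the harness version installed.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                  # HuggingFace causal-LM backend
    model_args="pretrained=path/to/merged-alpaca-7b",   # placeholder checkpoint path
    tasks=["wikitext", "piqa", "mnli"],
    num_fewshot=0,
)
print(results["results"])  # per-task metrics, e.g. word_perplexity, acc, acc_norm
```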

YukinoshitaKaren commented

May I ask a question? You got a 49.64 SQuAD (Mini) F1 with 'tloen/alpaca-lora-7b' two weeks ago, and the same model got 75.629 last week. Why are the two results so different? I tried this model and got around 55.07 SQuAD (Mini) F1.

gururise (Owner, Author) commented

> May I ask a question? You got a 49.64 SQuAD (Mini) F1 with 'tloen/alpaca-lora-7b' two weeks ago, and the same model got 75.629 last week. Why are the two results so different? I tried this model and got around 55.07 SQuAD (Mini) F1.

The SQuAD Mini score calculations were redone in that time. Anyhow, going forward we are ditching the benchmark eval.py and using the lm-evaluation-harness from EleutherAI. The scores reported in the main README come directly from the lm-evaluation-harness report.

claysauruswrecks (Contributor) commented

[image attachment]

https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All-J_Technical_Report_2.pdf
