Evaluation tooling for road-core project
Currently we support two types of evaluation:
- `consistency`: Ability to compare responses against the ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in a specific provider+model response. Currently a combination of similarity distances is used to calculate the final score, and cut-off scores are used to flag any deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a json file -
- `model`: Ability to compare responses against a single ground-truth answer. Here we can evaluate more than one provider+model at a time. This creates a json file as a summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, we can also provide additional QnAs using a parquet file (optional). A sample QnA set (parquet) has 30 queries per OCP documentation title.
Notes:
- QnAs must not be used for model training or tuning. They are created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs. It is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both questions & answers manually. If you find any QnA pair that should be modified or removed, please create a PR.
- The OLS API should be ready/live with all the required provider+model combinations configured.
- We may want to run both consistency and model evaluation together. To avoid multiple API calls for the same query, model evaluation first checks the .csv file generated by the consistency evaluation; only if a response is not present in the csv file do we call the API to get it (see the sketch after these notes).
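As a rough sketch of that reuse logic (the file layout, column names, and API helper below are assumptions for illustration, not the project's actual code):

```python
import csv
from pathlib import Path


def load_cached_responses(csv_path: Path) -> dict[str, str]:
    """Load query -> response pairs from a consistency-evaluation CSV, if it exists."""
    if not csv_path.exists():
        return {}
    with csv_path.open(newline="") as f:
        # Assumed column names; the real CSV layout is defined by the consistency evaluation.
        return {row["query"]: row["response"] for row in csv.DictReader(f)}


def get_response(query: str, cache: dict[str, str]) -> str:
    """Reuse the consistency-run response when available, otherwise call the API."""
    if query in cache:
        return cache[query]
    return call_ols_api(query)


def call_ols_api(query: str) -> str:
    # Hypothetical placeholder for the actual OLS API call.
    raise NotImplementedError("placeholder for the actual OLS API call")
```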
These evaluations are also part of the e2e test cases. Currently the consistency evaluation is primarily used to gate PRs. The final e2e suite will also invoke model evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.
pdm run evaluate
Please refer to the above files for the structure and add new data accordingly.
eval_type: This controls which evaluation we want to do. Currently we have 3 options.
- `consistency` -> Compares model-specific answers against the QnAs provided in the json file.
- `model` -> Compares a set of models based on their responses and generates a summary report. For this we can provide additional QnAs in parquet format, along with the json file.
- `all` -> Both of the above evaluations.
eval_api_url: OLS API url. Default is http://localhost:8080. If deployed in a cluster, then pass the cluster API url.
eval_api_token_file: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.
eval_scenario: This is primarily required to identify which pre-defined answers need to be compared. Values can be `with_rag` or `without_rag`. Currently we always evaluate the API with RAG.
eval_query_ids: Option to give a set of query ids for evaluation. By default all queries are processed.
eval_provider_model_id: We can provide a set of provider/model combinations as ids for comparison.
qna_pool_file: Applicable only for `model` evaluation. Provide the file path to the parquet file having additional QnAs. Default is None.
eval_out_dir: Directory where output csv/json files will be saved.
eval_metrics: By default all scores/metrics are calculated; this option decides which scores are used to create the graph. This is a list of metrics, e.g. cosine distance, euclidean distance, precision/recall/F1 score, answer relevancy score, and LLM-based similarity score.
judge_provider / judge_model: Provider / model for the judge LLM. This is required for LLM-based evaluation (answer relevancy score, LLM-based similarity score) and needs to be configured correctly through the config yaml file.
eval_modes: Apart from the OLS api, we may want to evaluate a vanilla model or a model with just the OLS parameters/prompt/RAG so that we have a baseline score. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).
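A typical invocation might look like the line below. The exact flag names and syntax are assumptions based on the options listed above; check the script's help output for the authoritative usage.

pdm run evaluate --eval_type all --eval_api_url http://localhost:8080 --eval_out_dir eval_output --eval_modes ols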
The evaluation scripts create the files below.
- CSV file with responses for the given provider/model & modes.
- Response evaluation results with scores (for the consistency check).
- Final csv file with all results, json score summary & graph (for model evaluation).
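Purely as an illustration of the final graph step, the sketch below plots per-provider+model scores from a score summary. The provider/model names, metric names, and summary layout are made up for the example; the real json summary produced by model evaluation may differ.

```python
import matplotlib.pyplot as plt

# Hypothetical score summary keyed by provider/model; real output may differ.
summary = {
    "provider_a/model_x": {"f1_score": 0.78, "answer_relevancy": 0.83},
    "provider_b/model_y": {"f1_score": 0.71, "answer_relevancy": 0.76},
}

metrics = ["f1_score", "answer_relevancy"]
x = range(len(summary))
width = 0.35

fig, ax = plt.subplots()
for i, metric in enumerate(metrics):
    scores = [summary[pm][metric] for pm in summary]
    ax.bar([pos + i * width for pos in x], scores, width, label=metric)

ax.set_xticks([pos + width / 2 for pos in x])
ax.set_xticklabels(summary.keys(), rotation=15)
ax.set_ylabel("score")
ax.set_title("Evaluation scores per provider+model")
ax.legend()
fig.tight_layout()
fig.savefig("eval_scores.png")
```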