diff --git a/README.md b/README.md
index f33be597..e1ddfb86 100644
--- a/README.md
+++ b/README.md
@@ -351,7 +351,7 @@ Every array will produce the combinations of flat configurations when the method
    "temperature": "determines the OpenAI temperature. Valid value ranges from 0 to 1."
  },
  "eval": {
-    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are llm_answer_relevance, llm_context_precision and llm_context_recall. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens', 'llm_answer_relevance']",
+    "metric_types": "determines the metrics used for evaluation (end-to-end or component-wise metrics using LLMs). Valid values for end-to-end metrics are lcsstr, lcsseq, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2. Valid values for component-wise LLM-based metrics are ragas_answer_relevance, ragas_context_precision and ragas_context_recall. e.g. ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens', 'ragas_answer_relevance']",
  }
}
```
@@ -516,7 +516,7 @@ During the QnA generation step, you may occasionally encounter errors related to
**End-to-end evaluation metrics:** not all the metrics comparing the generated and ground-truth answers are able to capture differences in semantics. For example, metrics such as `levenshtein` or `jaro_winkler` only measure edit distances. The `cosine` metric doesn't allow the comparison of semantics either: it uses the *textdistance* token-based implementation based on term frequency vectors. To calculate the semantic similarity between the generated answers and the expected responses, consider using embedding-based metrics such as Bert scores (`bert_`).
-**Component-wise evaluation metrics:** evaluation metrics using LLM-as-judges aren't deterministic. The `llm_` metrics included in the accelerator use the model indicated in the `azure_oai_eval_deployment_name` config field. The prompts used for evaluation instruction can be adjusted and are included in the `prompts.py` file (`llm_answer_relevance_instruction`, `llm_context_recall_instruction`, `llm_context_precision_instruction`).
+**Component-wise evaluation metrics:** evaluation metrics using LLM-as-judges aren't deterministic. The `ragas_` metrics included in the accelerator use the model indicated in the `azure_oai_eval_deployment_name` config field. The prompts used as evaluation instructions can be adjusted and are defined in `ragas_prompts.py` (`ragas_answer_relevance_instruction`, `ragas_context_recall_instruction`, `ragas_context_precision_instruction`).
**Retrieval-based metrics:** MAP scores are computed by comparing each retrieved chunk against the question and the chunk used to generate the qna pair. 
To assess whether a retrieved chunk is relevant or not, the similarity between the retrieved chunk and the concatenation of the end user question and the chunk used in the qna step (`02_qa_generation.py`) is computed using the SpacyEvaluator. Spacy similarity defaults to the average of the token vectors, meaning that the computation is insensitive to the order of the words. By default, the similarity threshold is set to 80% (`spacy_evaluator.py`). diff --git a/config.sample.json b/config.sample.json index cfd162ed..33bc3606 100644 --- a/config.sample.json +++ b/config.sample.json @@ -80,8 +80,8 @@ "bert_all_MiniLM_L6_v2", "cosine_ochiai", "bert_distilbert_base_nli_stsb_mean_tokens", - "llm_answer_relevance", - "llm_context_precision" + "ragas_answer_relevance", + "ragas_context_precision" ] } } diff --git a/config.schema.json b/config.schema.json index 8ab02676..4b7268b0 100644 --- a/config.schema.json +++ b/config.schema.json @@ -577,9 +577,14 @@ "bert_large_nli_stsb_mean_tokens", "bert_distilbert_base_nli_stsb_mean_tokens", "bert_paraphrase_multilingual_MiniLM_L12_v2", - "llm_answer_relevance", - "llm_context_precision", - "llm_context_recall" + "ragas_answer_relevance", + "ragas_context_precision", + "ragas_context_recall", + "pf_answer_relevance", + "pf_answer_coherence", + "pf_answer_fluency", + "pf_answer_similarity", + "pf_answer_groundedness" ] }, "description": "Metrics used for evaluation" diff --git a/dev-requirements.txt b/dev-requirements.txt index 9f93179a..2f004a79 100644 --- a/dev-requirements.txt +++ b/dev-requirements.txt @@ -1,3 +1,4 @@ +azure-ai-evaluation==1.0.0b3 promptflow==1.15.0 promptflow-tools==1.4.0 pytest==8.3.3 diff --git a/docs/evaluation-metrics.md b/docs/evaluation-metrics.md index 0b97e22f..13683068 100644 --- a/docs/evaluation-metrics.md +++ b/docs/evaluation-metrics.md @@ -16,7 +16,7 @@ You can choose which metrics should be calculated in your experiment by updating "metric_types": [ "lcsstr", "lcsseq", - "cosine", + "cosine_ochiai", "jaro_winkler", "hamming", "jaccard", @@ -37,9 +37,9 @@ You can choose which metrics should be calculated in your experiment by updating "bert_large_nli_stsb_mean_tokens", "bert_distilbert_base_nli_stsb_mean_tokens", "bert_paraphrase_multilingual_MiniLM_L12_v2", - "llm_answer_relevance", - "llm_context_precision", - "llm_context_recall" + "ragas_answer_relevance", + "ragas_context_precision", + "ragas_context_recall" ] ``` @@ -66,9 +66,9 @@ Computes the longest common subsequence (LCS) similarity score between two input ### Cosine similarity (Ochiai coefficient) -| Configuration Key | Calculation Base | Possible Values | -| ----------------- | -------------------- | ------------------ | -| `cosine` | `actual`, `expected` | Percentage (0-100) | +| Configuration Key | Calculation Base | Possible Values | +| -------------------------| -------------------- | ------------------ | +| `cosine_ochiai` | `actual`, `expected` | Percentage (0-100) | This coefficient is calculated as the intersection of the term-frequency vectors of the generated answer (actual) and the ground-truth answer (expected) divided by the geometric mean of the sizes of these vectors. 
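For readers unfamiliar with the Ochiai formulation described above, here is a minimal illustrative sketch of the coefficient computed from term-frequency vectors; it is not the *textdistance*-backed implementation the accelerator actually uses, and the helper name is hypothetical:

```python
from collections import Counter
import math


def ochiai_coefficient(actual: str, expected: str) -> float:
    """Multiset intersection of the term-frequency vectors, divided by the
    geometric mean of their total sizes, scaled to a 0-100 percentage."""
    a, b = Counter(actual.lower().split()), Counter(expected.lower().split())
    shared = sum((a & b).values())  # terms present in both answers, counted with multiplicity
    denom = math.sqrt(sum(a.values()) * sum(b.values()))
    return 100 * shared / denom if denom else 0.0


# Identical token multisets score 100; answers with no terms in common score 0.
print(ochiai_coefficient("the femur is the largest bone", "the largest bone is the femur"))  # 100.0
```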
@@ -168,7 +168,7 @@ These metrics also require the `chat_model_name` property to be set in the `sear | Configuration Key | Calculation Base | Possible Values | | ------------------ | -------------------- | --------------------------------- | -| `llm_answer_relevance` | `actual`, `expected` | From 0 to 1 with 1 being the best | +| `ragas_answer_relevance` | `actual`, `expected` | From 0 to 1 with 1 being the best | Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information is penalized. @@ -177,7 +177,7 @@ information is penalized. | Configuration Key | Calculation Base | Possible Values | | ------------------- | ------------------- | ----------------------------------------------------------------- | -| `llm_context_precision` | `question`, `retrieved_contexts` | Percentage (0-100) | +| `ragas_context_precision` | `question`, `retrieved_contexts` | Percentage (0-100) | Proportion of retrieved contexts relevant to the question. Evaluates whether or not the context generated by the RAG solution is useful for answering a question. @@ -185,6 +185,6 @@ Proportion of retrieved contexts relevant to the question. Evaluates whether or | Configuration Key | Calculation Base | Possible Values | | ------------------- | ------------------- | ----------------------------------------------------------------- | -| `llm_context_recall` | `question`, `expected`, `retrieved_contexts` | Percentage (0-100) | +| `ragas_context_recall` | `question`, `expected`, `retrieved_contexts` | Percentage (0-100) | Estimates context recall by estimating TP and FN using annotated answer (ground truth) and retrieved contexts. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context. diff --git a/promptflow/rag-experiment-accelerator/README.md b/promptflow/rag-experiment-accelerator/README.md index af52b952..2b34a473 100644 --- a/promptflow/rag-experiment-accelerator/README.md +++ b/promptflow/rag-experiment-accelerator/README.md @@ -117,7 +117,7 @@ az ml environment create --file ./environment.yaml -w $MLWorkSpaceName "cross_encoder_model" :"determines the model used for cross-encoding re-ranking step. Valid value is cross-encoder/stsb-roberta-base", "search_types" : "determines the search types used for experimentation. Valid value are search_for_match_semantic, search_for_match_Hybrid_multi, search_for_match_Hybrid_cross, search_for_match_text, search_for_match_pure_vector, search_for_match_pure_vector_multi, search_for_match_pure_vector_cross, search_for_manual_hybrid. e.g. ['search_for_manual_hybrid', 'search_for_match_Hybrid_multi','search_for_match_semantic' ]", "retrieve_num_of_documents": "determines the number of chunks to retrieve from the search index", - "metric_types" : "determines the metrics used for evaluation purpose. Valid value are lcsstr, lcsseq, cosine, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, llm_context_precision, llm_answer_relevance. e.g ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens']", + "metric_types" : "determines the metrics used for evaluation purpose. 
Valid values are lcsstr, lcsseq, jaro_winkler, hamming, jaccard, levenshtein, fuzzy_score, cosine_ochiai, bert_all_MiniLM_L6_v2, bert_base_nli_mean_tokens, bert_large_nli_mean_tokens, bert_large_nli_stsb_mean_tokens, bert_distilbert_base_nli_stsb_mean_tokens, bert_paraphrase_multilingual_MiniLM_L12_v2, ragas_context_precision, ragas_answer_relevance. e.g. ['fuzzy_score','bert_all_MiniLM_L6_v2','cosine_ochiai','bert_distilbert_base_nli_stsb_mean_tokens']",
"azure_oai_chat_deployment_name": "determines the Azure OpenAI chat deployment name",
"azure_oai_eval_deployment_name": "determines the Azure OpenAI evaluation deployment name",
"embedding_model_name": "embedding model name",
diff --git a/rag_experiment_accelerator/config/eval_config.py b/rag_experiment_accelerator/config/eval_config.py
index b7540d67..31438a6d 100644
--- a/rag_experiment_accelerator/config/eval_config.py
+++ b/rag_experiment_accelerator/config/eval_config.py
@@ -10,7 +10,7 @@ class EvalConfig(BaseConfig):
            "bert_all_MiniLM_L6_v2",
            "cosine_ochiai",
            "bert_distilbert_base_nli_stsb_mean_tokens",
-            "llm_answer_relevance",
-            "llm_context_precision",
+            "ragas_answer_relevance",
+            "ragas_context_precision",
        ]
    )
diff --git a/rag_experiment_accelerator/evaluation/azure_ai_metrics.py b/rag_experiment_accelerator/evaluation/azure_ai_metrics.py
new file mode 100644
index 00000000..65b174b5
--- /dev/null
+++ b/rag_experiment_accelerator/evaluation/azure_ai_metrics.py
@@ -0,0 +1,145 @@
+from azure.ai.evaluation import (
+    CoherenceEvaluator,
+    FluencyEvaluator,
+    GroundednessEvaluator,
+    RelevanceEvaluator,
+    SimilarityEvaluator,
+)
+
+from rag_experiment_accelerator.config.environment import Environment
+
+
+class AzureAIEvals:
+    """Class that leverages the evaluators from the Azure AI Evaluation
+    framework (azure-ai-evaluation) for LLM pipelines"""
+    def __init__(self, environment: Environment, deployment_name: str):
+        self.model_config = {
+            "azure_endpoint": environment.openai_endpoint,
+            "api_key": environment.openai_api_key,
+            "azure_deployment": deployment_name
+        }
+
+    def compute_score(
+        self,
+        metric_name: str,
+        question: str,
+        generated_answer: str,
+        ground_truth_answer: str,
+        retrieved_contexts: list[str],
+    ) -> float:
+        """
+        Compute the LLM-as-a-judge score for the given metric using the Azure AI Evaluation framework. 
+ """ + match metric_name: + case "azai_answer_relevance": + score = self.relevance_evaluator( + question=question, answer=generated_answer + ) + case "azai_answer_coherence": + score = self.coherence_evaluator( + question=question, answer=generated_answer + ) + case "azai_answer_similarity": + score = self.similarity_evaluator( + question=question, + answer=generated_answer, + ground_truth=ground_truth_answer, + ) + case "azai_answer_fluency": + score = self.fluency_evaluator( + question=question, answer=generated_answer + ) + case "azai_answer_groundedness": + score = self.groundedness_evaluator( + answer=generated_answer, retrieved_contexts=retrieved_contexts + ) + case _: + raise KeyError(f"Invalid metric type: {metric_name}") + + return score + + def relevance_evaluator(self, question: str, answer: str) -> float: + eval_fn = RelevanceEvaluator(model_config=self.model_config) + score = eval_fn(question=question, answer=answer) + return score + + def coherence_evaluator(self, question: str, answer: str) -> float: + eval_fn = CoherenceEvaluator(model_config=self.model_config) + score = eval_fn(question=question, answer=answer) + return score + + def similarity_evaluator( + self, question: str, answer: str, ground_truth: str + ) -> float: + """ + Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. + If the information and content in the predicted answer is similar or equivalent to the correct answer, + then the value of the Equivalence metric should be high, else it should be low. Given the question, + correct answer, and predicted answer, determine the value of Equivalence metric using the following + rating scale: + One star: the predicted answer is not at all similar to the correct answer + Two stars: the predicted answer is mostly not similar to the correct answer + Three stars: the predicted answer is somewhat similar to the correct answer + Four stars: the predicted answer is mostly similar to the correct answer + Five stars: the predicted answer is completely similar to the correct answer + + This rating value should always be an integer between 1 and 5. + """ + eval_fn = SimilarityEvaluator(model_config=self.model_config) + score = eval_fn(question=question, answer=answer, ground_truth=ground_truth) + return score + + def fluency_evaluator(self, question: str, answer: str) -> float: + """ + Fluency measures the quality of individual sentences in the answer, + and whether they are well-written and grammatically correct. Consider + the quality of individual sentences when evaluating fluency. Given the + question and answer, score the fluency of the answer between one to + five stars using the following rating scale: + One star: the answer completely lacks fluency + Two stars: the answer mostly lacks fluency + Three stars: the answer is partially fluent + Four stars: the answer is mostly fluent + Five stars: the answer has perfect fluency + + This rating value should always be an integer between 1 and 5. + """ + eval_fn = FluencyEvaluator(model_config=self.model_config) + score = eval_fn(question=question, answer=answer) + return score + + def groundedness_evaluator( + self, answer: str, retrieved_contexts: list[str] + ) -> float: + """ + Groundedness is measured the following way: + Given a CONTEXT and an ANSWER about that CONTEXT, rate the following way if the ANSWER is + entailed by the CONTEXT, + 1. 5: The ANSWER follows logically from the information contained in the CONTEXT. + 2. 
1: The ANSWER is logically false from the information contained in the CONTEXT. + 3. an integer score between 1 and 5 and if such integer score does not exist, use 1: + It is not possible to determine whether the ANSWER is true or false without + further information. Read the passage of information thoroughly and select the + correct answer from the three answer labels. Read the CONTEXT thoroughly to + ensure you know what the CONTEXT entails. + + This rating value should always be an integer between 1 and 5. + + Here we have a list of contexts and an answer. We return the best (max) groundedness score + when comparing the answer with each context in the list. + + Args: + answer (str): The answer generated by the model. + retrieved_contexts (list[str]): The list of retrieved contexts for the query. + + Returns: + float: The groundedness score generated between the answer and the list of contexts + """ + eval_fn = GroundednessEvaluator(model_config=self.model_config) + + best_score = 0 + for context in retrieved_contexts: + score = eval_fn(context=context, answer=answer) + best_score = max(best_score, score) + + return best_score diff --git a/rag_experiment_accelerator/evaluation/eval.py b/rag_experiment_accelerator/evaluation/eval.py index 6244c064..edcf4d18 100644 --- a/rag_experiment_accelerator/evaluation/eval.py +++ b/rag_experiment_accelerator/evaluation/eval.py @@ -14,10 +14,10 @@ ) from rag_experiment_accelerator.config.config import Config from rag_experiment_accelerator.config.index_config import IndexConfig +from promptflow.core import AzureOpenAIModelConfiguration from rag_experiment_accelerator.evaluation import plain_metrics from rag_experiment_accelerator.evaluation.llm_based_metrics import ( compute_llm_based_score, - lower_and_strip, ) from rag_experiment_accelerator.evaluation.plot_metrics import ( draw_hist_df, @@ -32,6 +32,8 @@ ) from rag_experiment_accelerator.llm.response_generator import ResponseGenerator +from rag_experiment_accelerator.evaluation.ragas_metrics import RagasEvals +from rag_experiment_accelerator.evaluation.azure_ai_quality_metrics import PromptFlowEvals from rag_experiment_accelerator.utils.logging import get_logger from rag_experiment_accelerator.config.environment import Environment @@ -41,13 +43,18 @@ warnings.filterwarnings("ignore") +def lower_and_strip(text: str) -> str: + return text.lower().strip() + + def compute_metrics( metric_type, question, actual, expected, - response_generator: ResponseGenerator, retrieved_contexts, + ragas_evals: RagasEvals, + pf_evals: PromptFlowEvals ): """ Computes a score for the similarity between two strings using a specified metric. @@ -87,14 +94,20 @@ def compute_metrics( - "bert_large_nli_stsb_mean_tokens": BERT-based semantic similarity (large model, STS-B, mean tokens) - "bert_distilbert_base_nli_stsb_mean_tokens": BERT-based semantic similarity (DistilBERT base model, STS-B, mean tokens) - "bert_paraphrase_multilingual_MiniLM_L12_v2": BERT-based semantic similarity (multilingual paraphrase model, MiniLM L12 v2) - - "llm_context_precision": Verifies whether or not a given context is useful for answering a question. - - "llm_answer_relevance": Scores the relevancy of the answer according to the given question. - - "llm_context_recall": Scores context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context. + - "ragas_context_precision": Verifies whether or not a given context is useful for answering a question. 
+ - "ragas_answer_relevance": Scores the relevancy of the answer according to the given question. + - "ragas_context_recall": Scores context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context. + - "pf_answer_relevance": Scores the relevancy of the answer according to the given question. + - "pf_answer_coherence": Scores the coherence of the answer according to the given question. + - "pf_answer_similarity": Scores the similarity of the answer to the ground truth answer. + - "pf_answer_fluency": Scores the fluency of the answer according to the given question. + - "pf_answer_groundedness": Scores the groundedness of the answer according to the retrieved contexts. question (str): question text actual (str): The first string to compare. expected (str): The second string to compare. - response_generator (ResponseGenerator): The response generator to use for generating responses. retrieved_contexts (list[str]): The list of retrieved contexts for the query. + ragas_evals (RagasEvals): The Ragas evaluators to use for scoring. + pf_evals (PromptFlowEvals): The PromptFlow evaluators to use for scoring. Returns: @@ -117,8 +130,9 @@ def compute_metrics( question, actual, expected, - response_generator, retrieved_contexts, + ragas_evals, + pf_evals ) except KeyError: logger.error(f"Unsupported metric type: {metric_type}") @@ -128,7 +142,8 @@ def compute_metrics( def evaluate_single_prompt( data, - response_generator, + ragas_evals, + pf_evals, metric_types, data_list, total_precision_scores_by_search_type, @@ -146,8 +161,9 @@ def evaluate_single_prompt( data.question, actual, expected, - response_generator, data.retrieved_contexts, + ragas_evals, + pf_evals ) metric_dic[metric_type] = score @@ -207,9 +223,19 @@ def evaluate_prompts( handler = QueryOutputHandler(config.path.query_data_dir) + # Ragas and PromptFlow evaluators response_generator = ResponseGenerator( environment, config, config.openai.azure_oai_eval_deployment_name ) + ragas_evals = RagasEvals(response_generator) + + az_openai_model_config = AzureOpenAIModelConfiguration( + azure_endpoint=environment.openai_endpoint, + api_key=environment.openai_api_key, + azure_deployment=config.openai.azure_oai_eval_deployment_name + ) + + pf_evals = PromptFlowEvals(az_openai_model_config) query_data_load = handler.load( index_config.index_name(), config.experiment_name, config.job_name @@ -222,7 +248,8 @@ def evaluate_prompts( executor.submit( evaluate_single_prompt, data, - response_generator, + ragas_evals, + pf_evals, metric_types, data_list, total_precision_scores_by_search_type, diff --git a/rag_experiment_accelerator/evaluation/llm_based_metrics.py b/rag_experiment_accelerator/evaluation/llm_based_metrics.py index 4d722f2e..39c3de04 100644 --- a/rag_experiment_accelerator/evaluation/llm_based_metrics.py +++ b/rag_experiment_accelerator/evaluation/llm_based_metrics.py @@ -1,174 +1,52 @@ -from sentence_transformers import SentenceTransformer -from sklearn.metrics.pairwise import cosine_similarity - -from rag_experiment_accelerator.llm.prompt import ( - llm_answer_relevance_instruction, - llm_context_recall_instruction, - llm_context_precision_instruction, -) -from rag_experiment_accelerator.llm.response_generator import ResponseGenerator +from rag_experiment_accelerator.evaluation.ragas_metrics import RagasEvals +from rag_experiment_accelerator.evaluation.azure_ai_metrics import AzureAIEvals from rag_experiment_accelerator.utils.logging import get_logger logger = get_logger(__name__) -def 
lower_and_strip(text): - """ - Converts the input to lowercase without spaces or empty string if None. - - Args: - text (str): The string to format. - - Returns: - str: The formatted input string. - """ - if text is None: - return "" - else: - return text.lower().strip() - - -def llm_answer_relevance( - response_generator: ResponseGenerator, question, answer -) -> float: - """ - Scores the relevancy of the answer according to the given question. - Answers with incomplete, redundant or unnecessary information is penalized. - Score can range from 0 to 1 with 1 being the best. - - Args: - question (str): The question being asked. - answer (str): The generated answer. - - Returns: - double: The relevancy score generated between the question and answer. - - """ - result = response_generator.generate_response( - llm_answer_relevance_instruction, text=answer - ) - if result is None: - logger.warning("Unable to generate answer relevance score") - return 0.0 - - model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") - - embedding1 = model.encode([str(question)]) - embedding2 = model.encode([str(result)]) - similarity_score = cosine_similarity(embedding1, embedding2) - - return float(similarity_score[0][0] * 100) - - -def llm_context_precision( - response_generator: ResponseGenerator, question, retrieved_contexts +def compute_llm_based_score( + metric_type: str, + question: str, + generated_answer: str, + ground_truth_answer: str, + retrieved_contexts: list[str], + ragas_evals: RagasEvals, + azai_evals: AzureAIEvals, ) -> float: """ - Computes precision by assessing whether each retrieved context is useful for answering a question. - Only considers the presence of relevant chunks in the retrieved contexts, but doesn't take into - account their ranking order. + Compute the LLM-as-a-judge score for the given metric type. Args: - question (str): The question being asked. - retrieved_contexts (list[str]): The list of retrieved contexts for the query. + metric_type (str): The metric type to compute the score for. + question (str): The question. + generated_answer (str): The generated answer. + ground_truth_answer (str): The ground truth answer. + retrieved_contexts (List[str]): The retrieved contexts. + response_generator (ResponseGenerator): The response generator. + ragas_evals (RagasEvals): The ragas evaluators. + azai_evals (AzureAIEvals): The Azure AI studio evaluators. Returns: - double: proportion of relevant chunks retrieved for the question + float: The computed LLM-as-a-judge score. 
""" - relevancy_scores = [] - - for context in retrieved_contexts: - result: str | None = response_generator.generate_response( - llm_context_precision_instruction, - context=context, + if metric_type.startswith("ragas_"): + score = ragas_evals.compute_score( + metric_type=metric_type, question=question, + generated_answer=generated_answer, + ground_truth_answer=ground_truth_answer, + retrieved_contexts=retrieved_contexts, + ) + elif metric_type.startswith("azai_"): + score = azai_evals.compute_score( + metric_name=metric_type, + question=question, + generated_answer=generated_answer, + ground_truth_answer=ground_truth_answer, + retrieved_contexts=retrieved_contexts, ) - llm_judge_response = lower_and_strip(result) - # Since we're only asking for one response, the result is always a boolean 1 or 0 - if llm_judge_response == "yes": - relevancy_scores.append(1) - elif llm_judge_response == "no": - relevancy_scores.append(0) - else: - logger.warning("Unable to generate context precision score") - - logger.debug(relevancy_scores) - - if not relevancy_scores: - logger.warning("Unable to compute average context precision") - return -1 else: - return (sum(relevancy_scores) / len(relevancy_scores)) * 100 - - -def llm_context_recall( - response_generator: ResponseGenerator, - question, - groundtruth_answer, - retrieved_contexts, -): - """ - Estimates context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context. - Context_recall values range between 0 and 1, with higher values indicating better performance. - To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine - whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer - should be attributable to the retrieved context. The formula for calculating context recall is as follows: - context_recall = GT sentences that can be attributed to context / nr sentences in GT - - Code adapted from https://github.com/explodinggradients/ragas - Copyright [2023] [Exploding Gradients] - under the Apache License (see evaluation folder) - - Args: - question (str): The question being asked - groundtruth_answer (str): The ground truth ("output_prompt") - retrieved_contexts (list[str]): The list of retrieved contexts for the query - - Returns: - double: The context recall score generated between the ground truth (expected) and context. 
- """ - context = "\n".join(retrieved_contexts) - prompt = ( - "\nquestion: " - + question - + "\ncontext: " - + context - + "\nanswer: " - + groundtruth_answer - ) - result = response_generator.generate_response( - sys_message=llm_context_recall_instruction, - prompt=prompt, - ) - good_response = '"Attributed": "1"' - bad_response = '"Attributed": "0"' - - return ( - result.count(good_response) - / (result.count(good_response) + result.count(bad_response)) - ) * 100 - - -def compute_llm_based_score( - metric_type, - question, - actual, - expected, - response_generator: ResponseGenerator, - retrieved_contexts, -): - match metric_type: - case "llm_answer_relevance": - score = llm_answer_relevance(response_generator, question, actual) - case "llm_context_precision": - score = llm_context_precision( - response_generator, question, retrieved_contexts - ) - case "llm_context_recall": - score = llm_context_recall( - response_generator, question, expected, retrieved_contexts - ) - case _: - raise KeyError(f"Invalid metric type: {metric_type}") + raise KeyError(f"Invalid metric type: {metric_type}") return score diff --git a/rag_experiment_accelerator/evaluation/ragas_metrics.py b/rag_experiment_accelerator/evaluation/ragas_metrics.py new file mode 100644 index 00000000..2ca1a996 --- /dev/null +++ b/rag_experiment_accelerator/evaluation/ragas_metrics.py @@ -0,0 +1,171 @@ +from sentence_transformers import SentenceTransformer +from sklearn.metrics.pairwise import cosine_similarity + +from rag_experiment_accelerator.llm.prompt import ( + ragas_answer_relevance_instruction, + ragas_context_recall_instruction, + ragas_context_precision_instruction, +) +from rag_experiment_accelerator.llm.response_generator import ResponseGenerator +from rag_experiment_accelerator.utils.logging import get_logger + +logger = get_logger(__name__) + + +class RagasEvals: + """Class that leverages the evaluators from the ragas evaluation framework + for RAG pipelines: https://github.com/explodinggradients/ragas + """ + def __init__(self, response_generator: ResponseGenerator): + self.response_generator = response_generator + + def compute_score( + self, + metric_type: str, + question: str, + generated_answer: str, + ground_truth_answer: str, + retrieved_contexts: list[str]) -> float: + """ + Compute the LLM-as-a-judge score for the given metric type from the RAGAS framework. + + Args: + metric_type (str): The metric type to compute the score for. + question (str): The question. + generated_answer (str): The generated answer. + ground_truth_answer (str): The ground truth answer. + retrieved_contexts (List[str]): The retrieved contexts. + response_generator (ResponseGenerator): The response generator. + + Returns: + float: The computed LLM-as-a-judge score. + """ + match metric_type: + case "ragas_answer_relevance": + score = self.ragas_answer_relevance(question=question, answer=generated_answer) + case "ragas_context_precision": + score = self.ragas_context_precision(question=question, retrieved_contexts=retrieved_contexts) + case "ragas_context_recall": + score = self.ragas_context_recall(question=question, + groundtruth_answer=ground_truth_answer, + retrieved_contexts=retrieved_contexts) + case _: + raise KeyError(f"Invalid metric type: {metric_type}") + + return score + + def lower_and_strip(self, text: str) -> str: + return text.lower().strip() + + def ragas_answer_relevance(self, question, answer) -> float: + """ + Scores the relevancy of the answer according to the given question. 
+ Answers with incomplete, redundant or unnecessary information is penalized. + Score can range from 0 to 1 with 1 being the best. + + Args: + question (str): The question being asked. + answer (str): The generated answer. + + Returns: + double: The relevancy score generated between the question and answer. + + """ + result = self.response_generator.generate_response( + ragas_answer_relevance_instruction, text=answer + ) + if result is None: + logger.warning("Unable to generate answer relevance score") + return 0.0 + + model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") + + embedding1 = model.encode([str(question)]) + embedding2 = model.encode([str(result)]) + similarity_score = cosine_similarity(embedding1, embedding2) + + return float(similarity_score[0][0] * 100) + + def ragas_context_precision( + self, question: str, retrieved_contexts: list[str] + ) -> float: + """ + Computes precision by assessing whether each retrieved context is useful for answering a question. + Only considers the presence of relevant chunks in the retrieved contexts, but doesn't take into + account their ranking order. + + Args: + question (str): The question being asked. + retrieved_contexts (list[str]): The list of retrieved contexts for the query. + + Returns: + double: proportion of relevant chunks retrieved for the question + """ + relevancy_scores = [] + + for context in retrieved_contexts: + result: str | None = self.response_generator.generate_response( + ragas_context_precision_instruction, + context=context, + question=question, + ) + llm_judge_response = self.lower_and_strip(result) + # Since we're only asking for one response, the result is always a boolean 1 or 0 + if llm_judge_response == "yes": + relevancy_scores.append(1) + elif llm_judge_response == "no": + relevancy_scores.append(0) + else: + logger.warning("Unable to generate context precision score") + + logger.debug(relevancy_scores) + + if not relevancy_scores: + logger.warning("Unable to compute average context precision") + return -1 + else: + return (sum(relevancy_scores) / len(relevancy_scores)) * 100 + + def ragas_context_recall( + self, question: str, groundtruth_answer: str, retrieved_contexts: list[str] + ) -> float: + """ + Estimates context recall by estimating TP and FN using annotated answer (ground truth) and retrieved context. + Context_recall values range between 0 and 1, with higher values indicating better performance. + To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine + whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer + should be attributable to the retrieved context. The formula for calculating context recall is as follows: + context_recall = GT sentences that can be attributed to context / nr sentences in GT + + Code adapted from https://github.com/explodinggradients/ragas + Copyright [2023] [Exploding Gradients] + under the Apache License (see evaluation folder) + + Args: + question (str): The question being asked + groundtruth_answer (str): The ground truth ("output_prompt") + retrieved_contexts (list[str]): The list of retrieved contexts for the query + + Returns: + double: The context recall score generated between the ground truth (expected) and context. 
+ """ + context = "\n".join(retrieved_contexts) + prompt = ( + "\nquestion: " + + question + + "\ncontext: " + + context + + "\nanswer: " + + groundtruth_answer + ) + result = self.response_generator.generate_response( + sys_message=ragas_context_recall_instruction, + prompt=prompt, + ) + good_response = '"Attributed": "1"' + bad_response = '"Attributed": "0"' + + return ( + result.count(good_response) + / (result.count(good_response) + result.count(bad_response)) + ) * 100 diff --git a/rag_experiment_accelerator/evaluation/search_eval.py b/rag_experiment_accelerator/evaluation/search_eval.py index 49db5ffe..dfba4a3e 100644 --- a/rag_experiment_accelerator/evaluation/search_eval.py +++ b/rag_experiment_accelerator/evaluation/search_eval.py @@ -30,9 +30,7 @@ def evaluate_search_result( logger.info(f"Search Score: {doc['@search.score']}") precision_score = round( - metrics.precision_score( - is_relevant_results[:k], precision_predictions[:k] - ), + metrics.precision_score(is_relevant_results[:k], precision_predictions[:k]), 2, ) precision_scores.append(precision_score) diff --git a/rag_experiment_accelerator/evaluation/tests/test_llm_based_metrics.py b/rag_experiment_accelerator/evaluation/tests/test_llm_based_metrics.py index cd7eb3b9..f747afd9 100644 --- a/rag_experiment_accelerator/evaluation/tests/test_llm_based_metrics.py +++ b/rag_experiment_accelerator/evaluation/tests/test_llm_based_metrics.py @@ -1,15 +1,12 @@ from unittest.mock import patch -from rag_experiment_accelerator.evaluation.llm_based_metrics import ( - llm_answer_relevance, - llm_context_precision, - llm_context_recall, -) +from rag_experiment_accelerator.evaluation.ragas_metrics import RagasEvals +from rag_experiment_accelerator.evaluation.azure_ai_quality_metrics import PromptFlowEvals @patch("rag_experiment_accelerator.evaluation.eval.ResponseGenerator") -@patch("rag_experiment_accelerator.evaluation.llm_based_metrics.SentenceTransformer") -def test_llm_answer_relevance(mock_st, mock_generate_response): +@patch("rag_experiment_accelerator.evaluation.ragas_metrics.SentenceTransformer") +def test_ragas_answer_relevance(mock_st, mock_generate_response): mock_generate_response.return_value.generate_response.return_value = ( "What is the name of the largest bone in the human body?" ) @@ -24,17 +21,19 @@ def test_llm_answer_relevance(mock_st, mock_generate_response): " body." ), ) - score = llm_answer_relevance(mock_generate_response, question, answer) + r_eval = RagasEvals(mock_generate_response) + score = r_eval.ragas_answer_relevance(question, answer) assert round(score) == 100 @patch("rag_experiment_accelerator.evaluation.eval.ResponseGenerator") -def test_llm_context_precision(mock_generate_response): +def test_ragas_context_precision(mock_generate_response): question = "What is the name of the largest bone in the human body?" 
retrieved_contexts = ["Retrieved context 1", "Retrieved context 2"] mock_generate_response.generate_response.side_effect = ["Yes", "No", "Yes", "No"] - score = llm_context_precision(mock_generate_response, question, retrieved_contexts) + r_eval = RagasEvals(mock_generate_response) + score = r_eval.ragas_context_precision(question, retrieved_contexts) expected_relevancy_scores = [1, 0, 1, 0] expected_precision = ( @@ -45,7 +44,7 @@ def test_llm_context_precision(mock_generate_response): @patch("rag_experiment_accelerator.evaluation.eval.ResponseGenerator") -def test_llm_context_recall(mock_generate_response): +def test_ragas_context_recall(mock_generate_response): mock_generate_response.generate_response.return_value = ( '"Attributed": "1" "Attributed": "1" "Attributed": "1" "Attributed": "0"' ) @@ -53,5 +52,84 @@ def test_llm_context_recall(mock_generate_response): context = 'According to the Cleveland Clinic, "The femur is the largest and strongest bone in the human body. It can support as much as 30 times the weight of your body. The average adult male femur is 48 cm (18.9 in) in length and 2.34 cm (0.92 in) in diameter. The average weight among adult males in the United States is 196 lbs (872 N). Therefore, the adult male femur can support roughly 6,000 lbs of compressive force."' answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." - score = llm_context_recall(mock_generate_response, question, answer, context) + r_eval = RagasEvals(mock_generate_response) + score = r_eval.ragas_context_recall(question, answer, context) assert score == 75 + + +@patch("rag_experiment_accelerator.evaluation.promptflow_quality_metrics.AzureOpenAIModelConfiguration") +def test_promptflow_fluency_evaluator(mock_model_config): + mock_model_config.return_value = "model_config" + p_eval = PromptFlowEvals(mock_model_config) + + question = "What is the name of the largest bone in the human body?" + good_answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." + bad_answer = "The bone human largest body femur not." + + good_score = p_eval.fluency_evaluator(question, good_answer) + bad_score = p_eval.fluency_evaluator(question, bad_answer) + + assert good_score == 5 + assert bad_score == 1 + + +@patch("rag_experiment_accelerator.evaluation.promptflow_quality_metrics.AzureOpenAIModelConfiguration") +def test_promptflow_groundedness_evaluator(mock_model_config): + mock_model_config.return_value = "model_config" + p_eval = PromptFlowEvals(mock_model_config) + + answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." + ungrounded_contexts = ["Retrieved context 1", "Retrieved context 2"] + true_context = 'According to the Cleveland Clinic, "The femur is the largest and strongest bone in the human body. It can support as much as 30 times the weight of your body. The average adult male femur is 48 cm (18.9 in) in length and 2.34 cm (0.92 in) in diameter. The average weight among adult males in the United States is 196 lbs (872 N). 
Therefore, the adult male femur can support roughly 6,000 lbs of compressive force."' + grounded_contexts = ungrounded_contexts + [true_context] + + low_score = p_eval.groundedness_evaluator(answer, ungrounded_contexts) + high_score = p_eval.groundedness_evaluator(answer, grounded_contexts) + assert low_score == 1 + assert high_score == 5 + + +@patch("rag_experiment_accelerator.evaluation.promptflow_quality_metrics.AzureOpenAIModelConfiguration") +def test_promptflow_similarity_evaluator(mock_model_config): + mock_model_config.return_value = "model_config" + p_eval = PromptFlowEvals(mock_model_config) + + question = "What is the name of the largest bone in the human body?" + ground_truth = "The femur is the largest and strongest bone in the human body. It can support as much as 30 times the weight of your body. The average length of the femur is 49.5 cm (19.4 inches)." + good_answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." + bad_answer = "The largest bone in the human body is the nasal bone." + + good_score = p_eval.similarity_evaluator(question, good_answer, ground_truth) + bad_score = p_eval.similarity_evaluator(question, bad_answer, ground_truth) + assert good_score == 5 + assert bad_score == 1 + + +@patch("rag_experiment_accelerator.evaluation.promptflow_quality_metrics.AzureOpenAIModelConfiguration") +def test_promptflow_coherence_evaluator(mock_model_config): + mock_model_config.return_value = "model_config" + p_eval = PromptFlowEvals(mock_model_config) + + question = "What is the name of the largest bone in the human body?" + coherent_answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." + incoherent_answer = "The largest bile in the human racquet is the tennis ball, also known as the thigh bone." + + good_score = p_eval.coherence_evaluator(question, coherent_answer) + bad_score = p_eval.coherence_evaluator(question, incoherent_answer) + assert good_score == 5 + assert bad_score == 1 + + +@patch("rag_experiment_accelerator.evaluation.promptflow_quality_metrics.AzureOpenAIModelConfiguration") +def test_promptflow_relevance_evaluator(mock_model_config): + mock_model_config.return_value = "model_config" + p_eval = PromptFlowEvals(mock_model_config) + + question = "What is the name of the largest bone in the human body?" + relevant_answer = "The largest bone in the human body is the femur, also known as the thigh bone. It is about 19.4 inches (49.5 cm) long on average and can support up to 30 times the weight of a person’s body." + irrelevant_answer = "Roger Federer is one of the greatest tennis players of all time." 
+ + good_score = p_eval.relevance_evaluator(question, relevant_answer) + bad_score = p_eval.relevance_evaluator(question, irrelevant_answer) + assert good_score == 5 + assert bad_score == 1 diff --git a/rag_experiment_accelerator/llm/prompt/__init__.py b/rag_experiment_accelerator/llm/prompt/__init__.py index f12fb2cd..aa05ef5f 100644 --- a/rag_experiment_accelerator/llm/prompt/__init__.py +++ b/rag_experiment_accelerator/llm/prompt/__init__.py @@ -39,9 +39,9 @@ ) from rag_experiment_accelerator.llm.prompt.ragas_prompts import ( - llm_answer_relevance_instruction, - llm_context_precision_instruction, - llm_context_recall_instruction, + ragas_answer_relevance_instruction, + ragas_context_precision_instruction, + ragas_context_recall_instruction, ) from rag_experiment_accelerator.llm.prompt.rerank_prompts import ( diff --git a/rag_experiment_accelerator/llm/prompt/ragas_prompts.py b/rag_experiment_accelerator/llm/prompt/ragas_prompts.py index 0bf64a18..ceca6203 100644 --- a/rag_experiment_accelerator/llm/prompt/ragas_prompts.py +++ b/rag_experiment_accelerator/llm/prompt/ragas_prompts.py @@ -41,21 +41,21 @@ def is_valid_entry(entry): answer: ${answer} """ -llm_answer_relevance_instruction = Prompt( - system_message="llm_answer_relevance_instruction.txt", +ragas_answer_relevance_instruction = Prompt( + system_message="ragas_answer_relevance_instruction.txt", user_template="${text}", tags={PromptTag.NonStrict}, ) -llm_context_precision_instruction = StructuredPrompt( - system_message="llm_context_precision_instruction.txt", +ragas_context_precision_instruction = StructuredPrompt( + system_message="ragas_context_precision_instruction.txt", user_template=_context_precision_input, validator=validate_context_precision, tags={PromptTag.NonStrict}, ) -llm_context_recall_instruction = StructuredPrompt( - system_message="llm_context_recall_instruction.txt", +ragas_context_recall_instruction = StructuredPrompt( + system_message="ragas_context_recall_instruction.txt", user_template=_context_recall_input, validator=validate_context_recall, tags={PromptTag.JSON, PromptTag.NonStrict}, diff --git a/rag_experiment_accelerator/llm/prompts_text/llm_answer_relevance_instruction.txt b/rag_experiment_accelerator/llm/prompts_text/ragas_answer_relevance_instruction.txt similarity index 100% rename from rag_experiment_accelerator/llm/prompts_text/llm_answer_relevance_instruction.txt rename to rag_experiment_accelerator/llm/prompts_text/ragas_answer_relevance_instruction.txt diff --git a/rag_experiment_accelerator/llm/prompts_text/llm_context_precision_instruction.txt b/rag_experiment_accelerator/llm/prompts_text/ragas_context_precision_instruction.txt similarity index 100% rename from rag_experiment_accelerator/llm/prompts_text/llm_context_precision_instruction.txt rename to rag_experiment_accelerator/llm/prompts_text/ragas_context_precision_instruction.txt diff --git a/rag_experiment_accelerator/llm/prompts_text/llm_context_recall_instruction.txt b/rag_experiment_accelerator/llm/prompts_text/ragas_context_recall_instruction.txt similarity index 100% rename from rag_experiment_accelerator/llm/prompts_text/llm_context_recall_instruction.txt rename to rag_experiment_accelerator/llm/prompts_text/ragas_context_recall_instruction.txt diff --git a/requirements.txt b/requirements.txt index cb1fcb0c..fd7419a4 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,3 +1,4 @@ +azure-ai-evaluation==1.0.0b3 azure-ai-ml==1.20.0 azure-ai-textanalytics==5.3.0 azure-core==1.31.0
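For reviewers who want to exercise the renamed metrics directly, below is a minimal usage sketch of the new `RagasEvals` wrapper, mirroring the wiring in `evaluate_prompts()` above. It assumes `environment` and `config` objects have already been loaded the way `eval.py` does, and the question/answer/context strings are placeholders:

```python
from rag_experiment_accelerator.evaluation.ragas_metrics import RagasEvals
from rag_experiment_accelerator.llm.response_generator import ResponseGenerator

# Same construction as evaluate_prompts(): the judge model is the eval deployment.
response_generator = ResponseGenerator(
    environment, config, config.openai.azure_oai_eval_deployment_name
)
ragas_evals = RagasEvals(response_generator)

# ragas_context_precision returns the percentage of retrieved chunks the judge
# considers useful for answering the question (or -1 if no verdicts could be parsed).
score = ragas_evals.compute_score(
    metric_type="ragas_context_precision",
    question="What is the largest bone in the human body?",
    generated_answer="The femur.",
    ground_truth_answer="The femur, or thigh bone.",
    retrieved_contexts=[
        "The femur is the largest and strongest bone in the human body.",
        "Roger Federer is one of the greatest tennis players of all time.",
    ],
)
```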