fix evaluations #165

Merged 4 commits on Sep 12, 2024
4 changes: 2 additions & 2 deletions .github/workflows/evaluations.yaml
@@ -66,13 +66,13 @@ jobs:
- name: evaluate
working-directory: ./src/api
run: |
-python -m evaluators.evaluate
+python -m evaluate

- name: Upload eval results as build artifact
uses: actions/upload-artifact@v4
with:
name: eval_result
-path: ./src/api/evaluators/eval_results.jsonl
+path: ./src/api/eval_results.jsonl

- name: GitHub Summary Step
if: ${{ success() }}
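The renamed step points at the new top-level evaluation module added later in this diff (src/api/evaluate.py). A minimal sketch of reproducing that CI step locally, assuming src/api is the working directory and the evaluator dependencies are installed:

```python
# Sketch only: mirrors the CI step `python -m evaluate`, assuming the layout
# introduced in this PR (evaluate.py at the root of src/api).
import pathlib
import runpy

# Run the evaluate module as if invoked with `python -m evaluate`.
runpy.run_module("evaluate", run_name="__main__")

# The upload-artifact step now expects the results file next to evaluate.py.
assert pathlib.Path("eval_results.jsonl").exists()
```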
4 changes: 4 additions & 0 deletions .gitignore
@@ -28,3 +28,7 @@ src/api/evaluators/result.jsonl
src/api/evaluators/eval_results.jsonl
src/api/evaluators/eval_results.md
src/api/evaluators/.runs/*
+src/api/result_evaluated.jsonl
+src/api/result.jsonl
+src/api/eval_results.jsonl
+src/api/eval_results.md
32 changes: 31 additions & 1 deletion azure.yaml
@@ -14,4 +14,34 @@ hooks:
interactive: true
run: infra/hooks/postprovision.ps1
infra:
provider: "bicep"
provider: "bicep"

pipeline:
variables:
- APPINSIGHTS_CONNECTIONSTRING
- AZURE_CONTAINER_ENVIRONMENT_NAME
- AZURE_CONTAINER_REGISTRY_ENDPOINT
- AZURE_CONTAINER_REGISTRY_NAME
- AZURE_COSMOS_NAME
- AZURE_EMBEDDING_NAME
- AZURE_ENV_NAME
- AZURE_LOCATION
- AZURE_OPENAI_API_VERSION
- AZURE_OPENAI_CHAT_DEPLOYMENT
- AZURE_OPENAI_ENDPOINT
- AZURE_OPENAI_NAME
- AZURE_OPENAI_RESOURCE_GROUP_LOCATION
- AZURE_OPENAI_RESOURCE_GROUP_LOCATION
- AZURE_RESOURCE_GROUP
- AZURE_SEARCH_ENDPOINT
- AZURE_SEARCH_NAME
- AZURE_SUBSCRIPTION_ID
- COSMOS_CONTAINER
- COSMOS_ENDPOINT
- OPENAI_TYPE
- SERVICE_ACA_IMAGE_NAME
- SERVICE_ACA_NAME
- SERVICE_ACA_URI

secrets:
- BING_SEARCH_KEY
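These pipeline entries list the environment variables and secrets that `azd pipeline config` is expected to surface to CI. A small illustrative sketch of how they would typically be consumed at runtime (the specific consumers are an assumption, not part of this diff):

```python
# Illustrative only: variable names are taken from the list above; whether and
# where each one is read is an assumption.
import os

azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
cosmos_endpoint = os.environ["COSMOS_ENDPOINT"]

# Secrets such as BING_SEARCH_KEY are provided the same way, just stored as
# CI secrets rather than plain variables.
bing_search_key = os.environ.get("BING_SEARCH_KEY", "")
```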
2 changes: 1 addition & 1 deletion src/.dockerignore
@@ -1,3 +1,3 @@
.git*
.venv/
-**/*.pyc
+**/*.pyc
1 change: 0 additions & 1 deletion src/api/contoso_chat/chat.prompty
@@ -8,7 +8,6 @@ model:
configuration:
type: azure_openai
azure_deployment: gpt-35-turbo
-azure_endpoint: ${ENV:AZURE_OPENAI_ENDPOINT}
api_version: 2023-07-01-preview
parameters:
max_tokens: 128
178 changes: 178 additions & 0 deletions src/api/evaluate-chat-flow.ipynb
@@ -0,0 +1,178 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import prompty\n",
"from evaluators.custom_evals.coherence import coherence_evaluation\n",
"from evaluators.custom_evals.relevance import relevance_evaluation\n",
"from evaluators.custom_evals.fluency import fluency_evaluation\n",
"from evaluators.custom_evals.groundedness import groundedness_evaluation\n",
"import jsonlines\n",
"import pandas as pd\n",
"from contoso_chat.chat_request import get_response"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get output from data and save to results jsonl file"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def load_data():\n",
" data_path = \"./evaluators/data.jsonl\"\n",
"\n",
" df = pd.read_json(data_path, lines=True)\n",
" df.head()\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def create_response_data(df):\n",
" results = []\n",
"\n",
" for index, row in df.iterrows():\n",
" customerId = row['customerId']\n",
" question = row['question']\n",
" \n",
" # Run contoso-chat/chat_request flow to get response\n",
" response = get_response(customerId=customerId, question=question, chat_history=[])\n",
" print(response)\n",
" \n",
" # Add results to list\n",
" result = {\n",
" 'question': question,\n",
" 'context': response[\"context\"],\n",
" 'answer': response[\"answer\"]\n",
" }\n",
" results.append(result)\n",
"\n",
" # Save results to a JSONL file\n",
" with open('result.jsonl', 'w') as file:\n",
" for result in results:\n",
" file.write(json.dumps(result) + '\\n')\n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def evaluate():\n",
" # Evaluate results from results file\n",
" results_path = 'result.jsonl'\n",
" results = []\n",
" with open(results_path, 'r') as file:\n",
" for line in file:\n",
" print(line)\n",
" results.append(json.loads(line))\n",
"\n",
" for result in results:\n",
" question = result['question']\n",
" context = result['context']\n",
" answer = result['answer']\n",
" \n",
" groundedness_score = groundedness_evaluation(question=question, answer=answer, context=context)\n",
" fluency_score = fluency_evaluation(question=question, answer=answer, context=context)\n",
" coherence_score = coherence_evaluation(question=question, answer=answer, context=context)\n",
" relevance_score = relevance_evaluation(question=question, answer=answer, context=context)\n",
" \n",
" result['groundedness'] = groundedness_score\n",
" result['fluency'] = fluency_score\n",
" result['coherence'] = coherence_score\n",
" result['relevance'] = relevance_score\n",
"\n",
" # Save results to a JSONL file\n",
" with open('result_evaluated.jsonl', 'w') as file:\n",
" for result in results:\n",
" file.write(json.dumps(result) + '\\n')\n",
"\n",
" with jsonlines.open('eval_results.jsonl', 'w') as writer:\n",
" writer.write(results)\n",
" # Print results\n",
"\n",
" df = pd.read_json('result_evaluated.jsonl', lines=True)\n",
" df.head()\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def create_summary(df):\n",
" print(\"Evaluation summary:\\n\")\n",
" print(df)\n",
" # drop question, context and answer\n",
" mean_df = df.drop([\"question\", \"context\", \"answer\"], axis=1).mean()\n",
" print(\"\\nAverage scores:\")\n",
" print(mean_df)\n",
" df.to_markdown('eval_results.md')\n",
" with open('eval_results.md', 'a') as file:\n",
" file.write(\"\\n\\nAverages scores:\\n\\n\")\n",
" mean_df.to_markdown('eval_results.md', 'a')\n",
"\n",
" print(\"Results saved to result_evaluated.jsonl\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create main funciton for python script\n",
"if __name__ == \"__main__\":\n",
"\n",
" test_data_df = load_data()\n",
" response_results = create_response_data(test_data_df)\n",
" result_evaluated = evaluate()\n",
" create_summary(result_evaluated)\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pf-prompty",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
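The notebook mirrors evaluate.py cell for cell, so it can also be executed headlessly; a sketch using nbconvert's programmatic API, assuming jupyter/nbconvert are installed and src/api is the working directory (CI runs evaluate.py instead):

```python
# Sketch: run the evaluation notebook non-interactively.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("evaluate-chat-flow.ipynb", as_version=4)
executor = ExecutePreprocessor(timeout=600, kernel_name="python3")
executor.preprocess(nb, {"metadata": {"path": "."}})

# Persist the executed notebook alongside the original.
nbformat.write(nb, "evaluate-chat-flow.executed.ipynb")
```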
115 changes: 115 additions & 0 deletions src/api/evaluate.py
@@ -0,0 +1,115 @@
# %%
import os
import json
import prompty
from evaluators.custom_evals.coherence import coherence_evaluation
from evaluators.custom_evals.relevance import relevance_evaluation
from evaluators.custom_evals.fluency import fluency_evaluation
from evaluators.custom_evals.groundedness import groundedness_evaluation
import jsonlines
import pandas as pd
from contoso_chat.chat_request import get_response

# %% [markdown]
# ## Get output from data and save to results jsonl file

# %%
def load_data():
data_path = "./evaluators/data.jsonl"

df = pd.read_json(data_path, lines=True)
df.head()
return df

# %%

def create_response_data(df):
results = []

for index, row in df.iterrows():
customerId = row['customerId']
question = row['question']

# Run contoso-chat/chat_request flow to get response
response = get_response(customerId=customerId, question=question, chat_history=[])
print(response)

# Add results to list
result = {
'question': question,
'context': response["context"],
'answer': response["answer"]
}
results.append(result)

# Save results to a JSONL file
with open('result.jsonl', 'w') as file:
for result in results:
file.write(json.dumps(result) + '\n')
return results

# %%
def evaluate():
# Evaluate results from results file
results_path = 'result.jsonl'
results = []
with open(results_path, 'r') as file:
for line in file:
print(line)
results.append(json.loads(line))

for result in results:
question = result['question']
context = result['context']
answer = result['answer']

groundedness_score = groundedness_evaluation(question=question, answer=answer, context=context)
fluency_score = fluency_evaluation(question=question, answer=answer, context=context)
coherence_score = coherence_evaluation(question=question, answer=answer, context=context)
relevance_score = relevance_evaluation(question=question, answer=answer, context=context)

result['groundedness'] = groundedness_score
result['fluency'] = fluency_score
result['coherence'] = coherence_score
result['relevance'] = relevance_score

# Save results to a JSONL file
with open('result_evaluated.jsonl', 'w') as file:
for result in results:
file.write(json.dumps(result) + '\n')

with jsonlines.open('eval_results.jsonl', 'w') as writer:
writer.write(results)
# Print results

df = pd.read_json('result_evaluated.jsonl', lines=True)
df.head()

return df

# %%
def create_summary(df):
print("Evaluation summary:\n")
print(df)
# drop question, context and answer
mean_df = df.drop(["question", "context", "answer"], axis=1).mean()
print("\nAverage scores:")
print(mean_df)
df.to_markdown('eval_results.md')
with open('eval_results.md', 'a') as file:
file.write("\n\nAverages scores:\n\n")
mean_df.to_markdown('eval_results.md', 'a')

print("Results saved to result_evaluated.jsonl")

# %%
# create main function for python script
if __name__ == "__main__":

test_data_df = load_data()
response_results = create_response_data(test_data_df)
result_evaluated = evaluate()
create_summary(result_evaluated)
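One note on the `jsonlines` usage above: `writer.write(results)` writes the entire list as a single JSON array on one line. If one record per line is the intent (as the `.jsonl` extension suggests), `write_all` produces that; a small sketch with placeholder records:

```python
import jsonlines

# Placeholder records standing in for the evaluated results produced above.
results = [
    {"question": "What tent do you recommend?", "groundedness": 4},
    {"question": "Which sleeping bag is warmest?", "groundedness": 5},
]

# write_all() emits one JSON object per line; write(results) would instead
# serialize the whole list onto a single line as a JSON array.
with jsonlines.open("eval_results.jsonl", mode="w") as writer:
    writer.write_all(results)
```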



8 changes: 4 additions & 4 deletions src/api/evaluators/custom_evals/coherence.prompty
@@ -1,6 +1,6 @@
---
name: QnA Coherence Evaluation
-description: Compute the coherence of the answer base on the question using llm.
+description: Evaluates coherence score for QA scenario
model:
api: chat
configuration:
@@ -22,10 +22,10 @@ sample:
context: Track lighting, invented by Lightolier, was popular at one period of time because it was much easier to install than recessed lighting, and individual fixtures are decorative and can be easily aimed at a wall. It has regained some popularity recently in low-voltage tracks, which often look nothing like their predecessors because they do not have the safety issues that line-voltage systems have, and are therefore less bulky and more ornamental in themselves. A master transformer feeds all of the fixtures on the track or rod with 12 or 24 volts, instead of each light fixture having its own line-to-low voltage transformer. There are traditional spots and floods, as well as other small hanging fixtures. A modified version of this is cable lighting, where lights are hung from or clipped to bare metal cables under tension
answer: The main transformer is the object that feeds all the fixtures in low voltage tracks.
---
-System:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+system:
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.

-User:
+user:
Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:
One star: the answer completely lacks coherence
Two stars: the answer mostly lacks coherence
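The reworded system prompt asks the model to return a bare integer from 1 to 5, which keeps the evaluator output machine-readable. A hypothetical helper (not part of this PR) showing how such a reply could be validated:

```python
# Hypothetical helper, not from this PR: coerce an evaluator's raw reply into
# the 1-5 integer the updated system prompt requests.
def parse_score(raw: str, default: int = 1) -> int:
    try:
        score = int(raw.strip())
    except ValueError:
        return default  # fall back when the model returns extra text
    return min(max(score, 1), 5)  # clamp to the rating scale


print(parse_score("4"))        # 4
print(parse_score(" 7 "))      # 5 (clamped)
print(parse_score("unclear"))  # 1 (fallback)
```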