Add automatic evaluation with LLM-as-a-Judge, LangSmith export, and SGI evaluation #174

FelixTJDietrich · 2023-11-05T21:01:01Z

Motivation and Context

We want an option to systematically evaluate the modules without manual labor.

closes #84
closes #36

Description

Add /evaluation endpoint to Athena for returning arbitrary evaluation results
Add an example of what this /evaluation endpoint would look like to module_example (for programming exercises)
Add this new module request to the Playground
Run evaluation for all healthy modules of the same type that support evaluation in evaluation mode
Display progress of automatic evaluation (and move "Start Generating" button)
Add LLM-as-a-judge for text modules
Allow disabling LLM-as-a-judge via an environment variable
Add token usage and response times in evaluation response (coming from LangSmith)
Add eval code for SGI evaluation

Note: I omitted the "module_text_llm estimated Accepted with Minor modifications" text from the UI since it is awkward to add and provides no immediate benefit. In the future, automatic results could be made available in the inline feedback view, but this also depends on future experiment designs.
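
To illustrate the evaluation flow listed in the description, here is a minimal sketch of an evaluation function that always returns feedback statistics and only adds an LLM-as-a-judge result when the environment variable allows it. Everything in it is an assumption made for illustration: the `Feedback` record, the `evaluate` function, the `llm_as_a_judge` result key, and the variable name `LLM_ENABLE_LLM_AS_A_JUDGE` are hypothetical and may differ from what this PR implements. The judge is passed in as a plain callable so that no LLM call is needed to run the sketch.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Hypothetical variable name; the PR only states that an environment variable exists.
JUDGE_ENV_VAR = "LLM_ENABLE_LLM_AS_A_JUDGE"


@dataclass
class Feedback:
    """Simplified, hypothetical feedback record (not Athena's real schema)."""
    description: str
    credits: float


# A judge receives the true and predicted feedback and returns arbitrary results.
Judge = Callable[[List[Feedback], List[Feedback]], Dict[str, Any]]


def llm_judge_enabled() -> bool:
    """Treat the judge as enabled unless the variable is explicitly set to "0"."""
    return os.environ.get(JUDGE_ENV_VAR, "1") != "0"


def evaluate(
    true_feedbacks: List[Feedback],
    predicted_feedbacks: List[Feedback],
    judge: Optional[Judge] = None,
) -> Dict[str, Any]:
    """Return evaluation results as a plain dictionary."""
    result: Dict[str, Any] = {
        "feedback_statistics": {
            "true_count": len(true_feedbacks),
            "predicted_count": len(predicted_feedbacks),
            "true_credits": sum(f.credits for f in true_feedbacks),
            "predicted_credits": sum(f.credits for f in predicted_feedbacks),
        }
    }
    # Only call the (potentially costly) judge when one is given and not disabled.
    if judge is not None and llm_judge_enabled():
        result["llm_as_a_judge"] = judge(true_feedbacks, predicted_feedbacks)
    return result
```

Keeping the judge optional means the cheap statistics are still available when the LLM-based part is switched off, which matches the idea of disabling it per environment.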

Steps for Testing

Run an evaluation with module_text_llm
After everything is done, click export; it should now export the automatic_evaluation as a separate file
The file should contain LLM-as-a-judge, llm_statistics (from LangSmith; I added my key to the env), and feedback_statistics
Cancel experiment
Import results and automatic evaluation and check if it works
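
The check on the exported file could also be scripted. The sketch below assumes the export is a JSON document and makes no assumption about its nesting: it collects every key found anywhere in the document and reports which expected sections are missing. Only `llm_statistics` and `feedback_statistics` are taken from the steps above; the helper names are hypothetical, and the key used for the LLM-as-a-judge results is not stated here, so it would have to be added to the expected list once known.

```python
import json
from typing import Any, List, Sequence, Set

# Section names taken from the testing steps; extend once the judge key is known.
EXPECTED_SECTIONS = ("llm_statistics", "feedback_statistics")


def collect_keys(node: Any) -> Set[str]:
    """Recursively collect all dictionary keys in a parsed JSON value."""
    keys: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            keys.add(key)
            keys |= collect_keys(value)
    elif isinstance(node, list):
        for item in node:
            keys |= collect_keys(item)
    return keys


def missing_sections(
    export_json: str, expected: Sequence[str] = EXPECTED_SECTIONS
) -> List[str]:
    """Return the expected section names that appear nowhere in the export."""
    found = collect_keys(json.loads(export_json))
    return [name for name in expected if name not in found]
```

An empty list from `missing_sections` means every expected section appears somewhere in the file; it does not validate the contents of those sections.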

Screenshots

Docs:

Not Started

Sent Training Feedback

Generated Suggestions

Finished

FelixTJDietrich · 2023-11-09T10:33:12Z

It seems I did not validate the structured_grading_instruction_id; it should be fine now.

FelixTJDietrich · 2023-11-09T10:40:08Z

Works for me on the test server now :)

FelixTJDietrich · 2023-11-11T11:30:44Z

I have now also added some docs.

pal03377

The exported files look as expected. I also could not find any issues (neither in the code nor in my tests) ✅

add endpoint

3e336dc

FelixTJDietrich mentioned this pull request Nov 5, 2023

Initial Evaluation Mode Roadmap #36

Closed

16 tasks

FelixTJDietrich added 5 commits November 5, 2023 22:09

add evaluation_provider

2fb28f0

add new line

38f6055

add evaluation_provider to export

92ca4ed

add example evaluation endpoint

433cd7f

add playground ui

9b4e2c9

FelixTJDietrich self-assigned this Nov 5, 2023

FelixTJDietrich added 14 commits November 6, 2023 16:22

add automatic evaluation

5f39b8c

add automatic evaluation

2667cab

add UI changes

7dbb316

fix color

5053a4a

add evaluation model

e9bcb26

add llm as a judge

cabed57

fix ui issue and some var naming

2595d5c

fix line break

2739757

add langsmith logging

6b383e3

inline statistics

e46df2c

add sgi evaluation

8d50922

refactor

a753b8a

remove unused

fa0bde5

Merge branch 'develop' of https://github.com/ls1intum/Athena into feature/automatic-evaluation

daff54a

FelixTJDietrich marked this pull request as ready for review November 7, 2023 20:15

FelixTJDietrich added dependencies deploy:athena-test1 enhancement python playground athena package javascript labels Nov 7, 2023

FelixTJDietrich requested a review from pal03377 November 9, 2023 10:32

FelixTJDietrich added deploy:athena-test1 and removed lock:athena-test1 labels Nov 9, 2023

github-actions bot added lock:athena-test1 and removed deploy:athena-test1 labels Nov 9, 2023

FelixTJDietrich temporarily deployed to Athena Test Server 3 November 9, 2023 10:34 — with GitHub Actions Inactive

fix index

4eac295

add docs

0fe6e6d

FelixTJDietrich added documentation and removed lock:athena-test1 labels Nov 11, 2023

FelixTJDietrich added 2 commits November 11, 2023 13:12

Merge branch 'develop' of https://github.com/ls1intum/Athena into feature/automatic-evaluation

c6e08df

Merge branch 'develop' into feature/automatic-evaluation

c384232

pal03377 added the deploy:athena-test1 label Nov 12, 2023

github-actions bot removed the deploy:athena-test1 label Nov 12, 2023

This comment was marked as outdated.

github-actions bot added the deployment-error label Nov 12, 2023

pal03377 removed the deployment-error label Nov 12, 2023

FelixTJDietrich added 2 commits November 12, 2023 19:32

fix text module

2b7f212

Merge branch 'feature/automatic-evaluation' of https://github.com/ls1intum/Athena into feature/automatic-evaluation

fe10281

pal03377 added the deploy:athena-test1 label Nov 12, 2023

github-actions bot added lock:athena-test1 and removed deploy:athena-test1 labels Nov 12, 2023

pal03377 temporarily deployed to Athena Test Server 3 November 12, 2023 19:59 — with GitHub Actions Inactive

pal03377 approved these changes Nov 12, 2023

pal03377 removed the lock:athena-test1 label Nov 12, 2023

FelixTJDietrich merged commit 89e7047 into develop Nov 12, 2023

FelixTJDietrich deleted the feature/automatic-evaluation branch November 12, 2023 22:35
