
Merge pull request #96 from alan-turing-institute/automatic-judge
Automatic judge
rchan26 authored Aug 30, 2024
2 parents a05506f + 59d8f4b commit aaa19e8
Showing 27 changed files with 4,142 additions and 406 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -26,6 +26,8 @@

`prompto` derives from the Italian word "_pronto_", which means "_ready_". It could also mean "_I prompt_" in Italian (if "_promptare_" were a verb meaning "_to prompt_").

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847). If you use this library, please see the [citation](#citation) below. For the experiments in the pre-print, see the [system demonstration examples](./examples/system-demo/README.md).

## Why `prompto`?

The benefit of _asynchronous querying_ is that multiple requests can be sent to an API _without_ waiting for the LLM's response to each one, which makes it possible to fully utilise the rate limits of an API. This is especially valuable when an experiment file contains a large number of prompts and/or has several models to query. [_Asynchronous programming_](https://docs.python.org/3/library/asyncio.html) is simply a way for programs to avoid getting stuck on long tasks (like waiting for an LLM response from an API) and instead keep running other things at the same time (such as sending other queries).
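
To make this concrete, here is a minimal, self-contained sketch (using a mocked API call rather than `prompto`'s own code) of how asynchronous querying overlaps waiting time: ten simulated requests finish in roughly the time of the slowest single response rather than the sum of all ten.

```
import asyncio
import random


async def mock_query(prompt: str) -> str:
    # stand-in for an API call: the "model" takes 1-2 seconds to respond
    await asyncio.sleep(random.uniform(1, 2))
    return f"response to: {prompt}"


async def main() -> None:
    prompts = [f"prompt {i}" for i in range(10)]
    # all ten requests are in flight at once, so total wall-clock time is
    # roughly the slowest single response, not the sum of all ten
    responses = await asyncio.gather(*(mock_query(p) for p in prompts))
    print(f"received {len(responses)} responses")


asyncio.run(main())
```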
@@ -201,3 +203,14 @@ The library has a few key classes:
* [`AsyncAPI`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py): this is the base class for querying all APIs. Each API/model should inherit from this class and implement the `query` method, which will (asynchronously) query the model's API and return the response. When running an experiment, the `Experiment` class will call this method for each prompt to send requests asynchronously.

When a new model is added, you must add it to the [`API`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py) dictionary which is in the `apis` module. This dictionary should map the model name to the class of the model. For details on how to add a new model, see the [guide on adding new APIs and models](./docs/add_new_api.md).
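
For illustration, a new API class might look roughly like the following. This is a hedged sketch only: the `EchoAPI` class, the `"echo"` key and the exact import path are assumptions made for the example; the real interface (including any constructor arguments) is described in the guide linked above.

```
# sketch only: verify the import path and base-class interface against the guide
from prompto.apis.base import AsyncAPI


class EchoAPI(AsyncAPI):
    """Toy API that simply echoes the prompt back as the response."""

    async def query(self, prompt_dict: dict, index: int | str) -> dict:
        # a real implementation would asynchronously call the model's endpoint here
        prompt_dict["response"] = f"echo: {prompt_dict.get('prompt', '')}"
        return prompt_dict


# the new class must then be registered in the API dictionary in the apis module,
# mapping a model/API name to the class, e.g. API["echo"] = EchoAPI
```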

## Citation

```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
1 change: 1 addition & 0 deletions docs/README.md
@@ -15,6 +15,7 @@ To view this documentation in a more readable format, visit the [prompto documen
* [Configuring environment variables](./environment_variables.md)
* [prompto commands](./commands.md)
* [Specifying rate limits](./rate_limits.md)
* [Using prompto for evaluation](./evaluation.md)

## Reference

14 changes: 14 additions & 0 deletions docs/about.md
@@ -3,3 +3,17 @@
`prompto` is a Python library written by the [Research Engineering Team (REG)](https://www.turing.ac.uk/work-turing/research/research-engineering-group) at the [Alan Turing Institute](https://www.turing.ac.uk/). It was originally written by [Ryan Chan](https://github.com/rchan26), [Federico Nanni](https://github.com/fedenanni) and [Evelina Gabasova](https://github.com/evelinag).

The library is designed to facilitate the running of language model experiments stored as jsonl files. It automates querying API endpoints and logs progress asynchronously. The library is designed to be extensible and can be used to query different models.
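
For example, each line of an experiment file is a JSON object describing a single query. A hypothetical line might look like the following (the model name and parameter values are purely illustrative; the recognised keys are described in the experiment file documentation):

```
{"id": 0, "api": "openai", "model_name": "gpt-4o", "prompt": "What is the capital of France?", "parameters": {"temperature": 0.0}}
```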

## Citation

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847).

Please cite the library as:
```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
15 changes: 14 additions & 1 deletion docs/commands.md
@@ -29,6 +29,19 @@ prompto_run_experiment \

Note that if the experiment file is already in the input folder, we do not make a copy of the file but instead process it in place.

### Automatic evaluation using an LLM-as-judge

It is possible to automatically run an LLM-as-judge evaluation of the responses by using the `--judge-location` and `--judge` arguments of the CLI. See the [Create judge file](#create-judge-file) section for more details on these arguments.

For instance, to run an experiment file with automatic evaluation using a judge, you can use the following command:
```
prompto_run_experiment \
--file path/to/experiment.jsonl \
--data-folder data \
--judge-location judge \
--judge gemini-1.0-pro
```

## Running the pipeline

As detailed in the [pipeline documentation](pipeline.md), you can run the pipeline using the `prompto_run_pipeline` command. To see all arguments of this command, run `prompto_run_pipeline --help`.
@@ -77,7 +90,7 @@ In `judge`, you must have two files:
* `template.txt`: this is the template file for the judge prompt. The placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}` mark where the original prompt and the response to be scored are inserted.
* `settings.json`: this is the settings json file which contains the settings for the judge(s). The keys are judge identifiers and the values are dictionaries with "api", "model_name" and "parameters" keys specifying the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).

See for example [this judge example](./../examples/data/data/judge) which contains example template and settings files.
See, for example, [this judge example](./../examples/evaluation/judge/), which contains sample template and settings files.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge location. You can create judge files using different LLMs as the judge by specifying a different judge identifier from the keys in the `settings.json` file.
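
For illustration only, a hypothetical `settings.json` with a single judge entry might look like the following (the "gemini" api name and the parameters shown are assumptions for the example):

```
{
    "gemini-1.0-pro": {
        "api": "gemini",
        "model_name": "gemini-1.0-pro",
        "parameters": {"temperature": 0}
    }
}
```

and a minimal `template.txt` might read:

```
Given the following prompt and response, return a score from 1 to 5.
Prompt: {INPUT_PROMPT}
Response: {OUTPUT_RESPONSE}
```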

8 changes: 8 additions & 0 deletions docs/evaluation.md
@@ -0,0 +1,8 @@
# Evaluation

A common use case for `prompto` is to evaluate the performance of different models on a given task, which first requires obtaining a large number of responses.
In `prompto`, we provide functionality to automate the querying of different models and endpoints to obtain responses to a set of prompts and _then evaluate_ these responses.

## Automatic evaluation using an LLM-as-a-judge

## Automatic evaluation using a scoring function
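
As an illustrative sketch only (based on how evaluation functions are applied in `Experiment.evaluate_responses` in this pull request, where each function receives a completed prompt dictionary and returns it), a simple scoring function might add a score field to the response dictionary; the `expected_response` and `exact_match` keys here are assumptions for the example:

```
def exact_match(prompt_dict: dict) -> dict:
    # compare the model's response against an expected answer stored alongside the prompt
    expected = prompt_dict.get("expected_response", "")
    response = prompt_dict.get("response", "")
    prompt_dict["exact_match"] = response.strip() == expected.strip()
    return prompt_dict
```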
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion examples/system-demo/README.md
@@ -1,6 +1,6 @@
# System Demonstration examples

We provide some illustrative examples of how to use `prompto` and compare it against a traditional synchronous approach to querying LLM endpoints.
We provide some illustrative examples of how to use `prompto` and compare it against a traditional synchronous approach to querying LLM endpoints. These experiments are analysed in our system demonstration paper, currently available as a pre-print on [arXiv](https://arxiv.org/abs/2408.11847).

We sample prompts from instruction-following data generated with the Self-Instruct approach of [1] and [2]. We take a sample of 100 prompts from the [instruction-following data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) of [2] and apply the same prompt template. We then use these as prompt inputs to different models using `prompto`. See the [Generating the prompts for experiments](./alpaca_sample_generation.ipynb) notebook for more details.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -41,6 +41,7 @@ nav:
- Running experiments and the pipeline: docs/pipeline.md
- prompto commands: docs/commands.md
- Specifying rate limits: docs/rate_limits.md
- Using prompto for evaluation: docs/evaluation.md
- Implemented APIs:
- APIs overview: docs/models.md
- Azure OpenAI: docs/azure_openai.md
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -29,12 +29,13 @@ mkdocs-literate-nav = { version = "^0.6.1", optional = true }
mkdocs-section-index = { version = "^0.3.9", optional = true }
mkdocs-same-dir = { version = "^0.1.3", optional = true }
mkdocs-jupyter = { version = "^0.24.7", optional = true }
cli-test-helpers = { version = "^4.0.0", optional = true }
vertexai = { version = "^1.49.0", optional = true }
google-cloud-aiplatform = { version = "^1.49.0", optional = true }
google-generativeai = { version = "^0.7.0", optional = true }
openai = { version = "^1.35.3", optional = true }
pillow = { version = "^10.3.0", optional = true }
ollama = { version = "^0.2.1", optional = true }
ollama = { version = "^0.3.1", optional = true }
huggingface-hub = { version = "^0.23.4", optional = true }
quart = { version = "^0.19.6", optional = true }
transformers = { version = "^4.41.2", optional = true }
@@ -60,6 +61,7 @@ all = [
"mkdocs-section-index",
"mkdocs-same-dir",
"mkdocs-jupyter",
"cli-test-helpers",
"vertexai",
"google-cloud-aiplatform",
"google-generativeai",
@@ -90,6 +92,7 @@ dev = [
"mkdocs-section-index",
"mkdocs-same-dir",
"mkdocs-jupyter",
"cli-test-helpers",
]
gemini = ["vertexai", "google-cloud-aiplatform", "google-generativeai", "pillow"]
vertexai = ["vertexai", "google-cloud-aiplatform", "google-generativeai", "pillow"]
6 changes: 5 additions & 1 deletion src/prompto/apis/testing/testing_api.py
@@ -31,6 +31,7 @@ async def query(self, prompt_dict: dict, index: int | str) -> dict:
# if not either "True" or "False", we error 1/5 times
generation_config = prompt_dict.get("parameters", {})
raise_error_option = generation_config.get("raise_error", "")
raise_error_type = generation_config.get("raise_error_type", "")

if raise_error_option == "True":
raise_error = True
@@ -48,7 +49,10 @@ async def query(self, prompt_dict: dict, index: int | str) -> dict:
error_as_string=error_msg,
id=prompt_dict.get("id", "NA"),
)
raise ValueError(error_msg)
if raise_error_type == "Exception":
raise Exception(error_msg)
else:
raise ValueError(error_msg)
else:
await asyncio.sleep(1)

44 changes: 29 additions & 15 deletions src/prompto/experiment.py
@@ -176,7 +176,8 @@ def group_prompts(self) -> dict[str, list[dict]]:
# initialise some keys with the rate limits if provided
if self.settings.max_queries_dict != {}:
logging.info(
f"Grouping prompts using 'settings.max_queries_dict': {self.settings.max_queries_dict}..."
"Grouping prompts using 'settings.max_queries_dict': "
f"{self.settings.max_queries_dict}..."
)
for key, value in self.settings.max_queries_dict.items():
if isinstance(value, int):
@@ -351,7 +352,7 @@ async def process(self, evaluation_funcs: callable = None) -> tuple[dict, float]

# log completion of experiment
log_message = (
f"Completed experiment {self.__str__()}! "
f"Completed experiment: {self.__str__()}! "
f"Experiment processing time: {round(processing_time, 3)} seconds, "
f"Average time per query: {round(avg_query_processing_time, 3)} seconds"
)
@@ -411,13 +412,16 @@ async def send_requests(
"""
request_interval = 60 / rate_limit
tasks = []
for_group_string = f"for group {group} " if group is not None else ""
for_group_string = f"for group '{group}' " if group is not None else ""
attempt_frac = f"{attempt}/{self.settings.max_attempts}"

for index, item in enumerate(
tqdm(
prompt_dicts,
desc=f"Sending {len(prompt_dicts)} queries at {rate_limit} QPM with RI of {request_interval}s {for_group_string} (attempt {attempt_frac})",
desc=(
f"Sending {len(prompt_dicts)} queries at {rate_limit} QPM with RI of "
f"{request_interval}s {for_group_string}(attempt {attempt_frac})"
),
unit="query",
)
):
@@ -438,7 +442,7 @@ async def send_requests(
# wait for all tasks to complete before returning
responses = await tqdm_asyncio.gather(
*tasks,
desc=f"Waiting for responses {for_group_string} (attempt {attempt_frac})",
desc=f"Waiting for responses {for_group_string}(attempt {attempt_frac})",
unit="query",
)

@@ -470,6 +474,7 @@ async def send_requests_retry(
Group name, by default None. If None, then the group is
not specified in the logs
"""
for_group_string = f" for group '{group}'" if group is not None else ""
# initialise the number of attempts
attempt = 1

@@ -496,8 +501,8 @@
# if we still have failed queries, we will retry them
if len(remaining_prompt_dicts) > 0:
logging.info(
f"Retrying {len(remaining_prompt_dicts)} failed queries - attempt {attempt} of "
f"{self.settings.max_attempts}..."
f"Retrying {len(remaining_prompt_dicts)} failed queries{for_group_string} - "
f"attempt {attempt} of {self.settings.max_attempts}..."
)

# send off the failed queries
@@ -510,9 +515,11 @@
)
else:
# if there are no failed queries, break out of the loop
logging.info(f"No remaining failed queries{for_group_string}!")
break
else:
# if the maximum number of attempts has been reached, break out of the loop
logging.info(f"Maximum attempts reached{for_group_string}. Exiting...")
break

async def query_model_and_record_response(
@@ -553,7 +560,8 @@ async def query_model_and_record_response(
"""
if attempt > self.settings.max_attempts:
raise ValueError(
f"Number of attempts ({attempt}) cannot be greater than max_attempts ({self.settings.max_attempts})"
f"Attempt number ({attempt}) cannot be greater than "
f"settings.max_attempts ({self.settings.max_attempts})"
)
if index is None:
index = "NA"
@@ -572,7 +580,7 @@
except (NotImplementedError, KeyError, ValueError, TypeError) as err:
# don't retry for selected errors, log the error and save an error response
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}). "
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}): "
f"{type(err).__name__} - {err}"
)
async with FILE_WRITE_LOCK:
@@ -586,7 +594,8 @@
if attempt == self.settings.max_attempts:
# we've already tried max_attempts times, so log the error and save an error response
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) after maximum {self.settings.max_attempts} attempts: "
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) "
f"after maximum {self.settings.max_attempts} attempts: "
f"{type(err).__name__} - {err}"
)
async with FILE_WRITE_LOCK:
@@ -596,14 +605,16 @@
# fill in response with error message and note that we've tried max_attempts times
completed_prompt_dict = prompt_dict
completed_prompt_dict["response"] = (
f"An unexpected error occurred when querying the API: {type(err).__name__} - {err} "
"An unexpected error occurred when querying the API: "
f"({type(err).__name__} - {err}) "
f"after maximum {self.settings.max_attempts} attempts"
)
else:
# we haven't tried max_attempts times yet, so log the error and return an Exception
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) on attempt {attempt} of {self.settings.max_attempts}: "
f"{type(err).__name__} - {err} - adding to the queue to try again later"
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) on attempt "
f"{attempt} of {self.settings.max_attempts}: "
f"{type(err).__name__} - {err}. Adding to the queue to try again later..."
)
async with FILE_WRITE_LOCK:
write_log_message(
@@ -669,9 +680,11 @@ async def generate_text(
# query the model
response = await api.query(prompt_dict=prompt_dict, index=index)

# Perform Evaluation if evaluation function is provided
# perform Evaluation if evaluation function is provided
if evaluation_funcs is not None:
response = await self.evaluate_responses(response, evaluation_funcs)
response = await self.evaluate_responses(
prompt_dict=response, evaluation_funcs=evaluation_funcs
)

return response

Expand All @@ -698,4 +711,5 @@ async def evaluate_responses(

for func in evaluation_funcs:
prompt_dict = func(prompt_dict)

return prompt_dict
