
Merge pull request #96 from alan-turing-institute/automatic-judge
Automatic judge
rchan26 authored Aug 30, 2024
2 parents a05506f + 59d8f4b commit aaa19e8
Showing 27 changed files with 4,142 additions and 406 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -26,6 +26,8 @@

`prompto` derives from the Italian word "_pronto_", which means "_ready_". It could also mean "_I prompt_" in Italian (if "_promptare_" were a verb meaning "_to prompt_").

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847). If you use this library, please see the [citation](#citation) below. For the experiments in the pre-print, see the [system demonstration examples](./examples/system-demo/README.md).

## Why `prompto`?

The benefit of _asynchronous querying_ is that multiple requests can be sent to an API _without_ waiting for the LLM's response to each one, which makes it possible to fully utilise the rate limits of an API. This is especially valuable when an experiment file contains a large number of prompts and/or has several models to query. [_Asynchronous programming_](https://docs.python.org/3/library/asyncio.html) is simply a way for programs to avoid getting stuck on long tasks (like waiting for an LLM response from an API) and instead keep running other things at the same time (such as sending other queries).
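
To make this concrete, here is a minimal, self-contained sketch (using a mocked API call rather than `prompto`'s own code) of how asynchronous querying overlaps waiting time: ten simulated requests finish in roughly the time of the slowest single response rather than the sum of all ten.

```
import asyncio
import random


async def mock_query(prompt: str) -> str:
    # stand-in for an API call: the "model" takes 1-2 seconds to respond
    await asyncio.sleep(random.uniform(1, 2))
    return f"response to: {prompt}"


async def main() -> None:
    prompts = [f"prompt {i}" for i in range(10)]
    # all ten requests are in flight at once, so total wall-clock time is
    # roughly the slowest single response, not the sum of all ten
    responses = await asyncio.gather(*(mock_query(p) for p in prompts))
    print(f"received {len(responses)} responses")


asyncio.run(main())
```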
@@ -201,3 +203,14 @@ The library has a few key classes:
* [`AsyncAPI`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py): this is the base class for querying all APIs. Each API/model should inherit from this class and implement the `query` method, which will (asynchronously) query the model's API and return the response. When running an experiment, the `Experiment` class will call this method for each prompt to send requests asynchronously.

When a new model is added, you must add it to the [`API`](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/apis/base.py) dictionary which is in the `apis` module. This dictionary should map the model name to the class of the model. For details on how to add a new model, see the [guide on adding new APIs and models](./docs/add_new_api.md).
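
For illustration, a new API class might look roughly like the following. This is a hedged sketch only: the `EchoAPI` class, the `"echo"` key and the exact import path are assumptions made for the example; the real interface (including any constructor arguments) is described in the guide linked above.

```
# sketch only: verify the import path and base-class interface against the guide
from prompto.apis.base import AsyncAPI


class EchoAPI(AsyncAPI):
    """Toy API that simply echoes the prompt back as the response."""

    async def query(self, prompt_dict: dict, index: int | str) -> dict:
        # a real implementation would asynchronously call the model's endpoint here
        prompt_dict["response"] = f"echo: {prompt_dict.get('prompt', '')}"
        return prompt_dict


# the new class must then be registered in the API dictionary in the apis module,
# mapping a model/API name to the class, e.g. API["echo"] = EchoAPI
```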

## Citation

```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
1 change: 1 addition & 0 deletions docs/README.md
@@ -15,6 +15,7 @@ To view this documentation in a more readable format, visit the [prompto documen
* [Configuring environment variables](./environment_variables.md)
* [prompto commands](./commands.md)
* [Specifying rate limits](./rate_limits.md)
* [Using prompto for evaluation](./evaluation.md)

## Reference

14 changes: 14 additions & 0 deletions docs/about.md
@@ -3,3 +3,17 @@
`prompto` is a Python library written by the [Research Engineering Team (REG)](https://www.turing.ac.uk/work-turing/research/research-engineering-group) at the [Alan Turing Institute](https://www.turing.ac.uk/). It was originally written by [Ryan Chan](https://github.com/rchan26), [Federico Nanni](https://github.com/fedenanni) and [Evelina Gabasova](https://github.com/evelinag).

The library is designed to facilitate the running of language model experiments stored as jsonl files. It automates querying API endpoints and logs progress asynchronously. The library is designed to be extensible and can be used to query different models.
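
For example, each line of an experiment file is a JSON object describing a single query. A hypothetical line might look like the following (the model name and parameter values are purely illustrative; the recognised keys are described in the experiment file documentation):

```
{"id": 0, "api": "openai", "model_name": "gpt-4o", "prompt": "What is the capital of France?", "parameters": {"temperature": 0.0}}
```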

## Citation

A pre-print for this work is available on [arXiv](https://arxiv.org/abs/2408.11847).

Please cite the library as:
```
@article{chan2024prompto,
title={Prompto: An open source library for asynchronous querying of LLM endpoints},
author={Chan, Ryan Sze-Yin and Nanni, Federico and Brown, Edwin and Chapman, Ed and Williams, Angus R and Bright, Jonathan and Gabasova, Evelina},
journal={arXiv preprint arXiv:2408.11847},
year={2024}
}
```
15 changes: 14 additions & 1 deletion docs/commands.md
@@ -29,6 +29,19 @@ prompto_run_experiment \

Note that if the experiment file is already in the input folder, we do not make a copy of the file but instead process it in place.

### Automatic evaluation using an LLM-as-judge

It is possible to automatically run an LLM-as-judge evaluation of the responses by using the `--judge-location` and `--judge` arguments of the CLI. See the [Create judge file](#create-judge-file) section for more details on these arguments.

For instance, to run an experiment file with automatic evaluation using a judge, you can use the following command:
```
prompto_run_experiment \
--file path/to/experiment.jsonl \
--data-folder data \
--judge-location judge \
--judge gemini-1.0-pro
```

## Running the pipeline

As detailed in the [pipeline documentation](pipeline.md), you can run the pipeline using the `prompto_run_pipeline` command. To see all arguments of this command, run `prompto_run_pipeline --help`.
@@ -77,7 +90,7 @@ In `judge`, you must have two files:
* `template.txt`: this is the template file for the judge prompt. The placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}` mark where the original prompt and the response to be scored are inserted.
* `settings.json`: this is the settings json file which contains the settings for the judge(s). The keys are judge identifiers and the values are dictionaries with "api", "model_name" and "parameters" keys specifying the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).

See for example [this judge example](./../examples/data/data/judge) which contains example template and settings files.
See, for example, [this judge example](./../examples/evaluation/judge/), which contains sample template and settings files.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge location. You can create judge files using different LLMs as the judge by specifying a different judge identifier from the keys in the `settings.json` file.
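
For illustration only, a hypothetical `settings.json` with a single judge entry might look like the following (the "gemini" api name and the parameters shown are assumptions for the example):

```
{
    "gemini-1.0-pro": {
        "api": "gemini",
        "model_name": "gemini-1.0-pro",
        "parameters": {"temperature": 0}
    }
}
```

and a minimal `template.txt` might read:

```
Given the following prompt and response, return a score from 1 to 5.
Prompt: {INPUT_PROMPT}
Response: {OUTPUT_RESPONSE}
```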

8 changes: 8 additions & 0 deletions docs/evaluation.md
@@ -0,0 +1,8 @@
# Evaluation

A common use case for `prompto` is to evaluate the performance of different models on a given task, which first requires obtaining a large number of responses.
In `prompto`, we provide functionality to automate the querying of different models and endpoints to obtain responses to a set of prompts and _then evaluate_ these responses.

## Automatic evaluation using an LLM-as-a-judge

## Automatic evaluation using a scoring function
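
As an illustrative sketch only (based on how evaluation functions are applied in `Experiment.evaluate_responses` in this pull request, where each function receives a completed prompt dictionary and returns it), a simple scoring function might add a score field to the response dictionary; the `expected_response` and `exact_match` keys here are assumptions for the example:

```
def exact_match(prompt_dict: dict) -> dict:
    # compare the model's response against an expected answer stored alongside the prompt
    expected = prompt_dict.get("expected_response", "")
    response = prompt_dict.get("response", "")
    prompt_dict["exact_match"] = response.strip() == expected.strip()
    return prompt_dict
```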
File renamed without changes.
File renamed without changes.
2 changes: 1 addition & 1 deletion examples/system-demo/README.md
@@ -1,6 +1,6 @@
# System Demonstration examples

We provide some illustrative examples of how to use `prompto` and compare it against a traditional synchronous approach to querying LLM endpoints.
We provide some illustrative examples of how to use `prompto` and compare it against a traditional synchronous approach to querying LLM endpoints. These experiments are analysed in our system demonstration paper, currently available as a pre-print on [arXiv](https://arxiv.org/abs/2408.11847).

We sample prompts from instruction-following data generated with the Self-Instruct approach of [1] and [2]. We take a sample of 100 prompts from the [instruction-following data](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) of [2] and apply the same prompt template. We then use these as prompt inputs to different models using `prompto`. See the [Generating the prompts for experiments](./alpaca_sample_generation.ipynb) notebook for more details.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -41,6 +41,7 @@ nav:
- Running experiments and the pipeline: docs/pipeline.md
- prompto commands: docs/commands.md
- Specifying rate limits: docs/rate_limits.md
- Using prompto for evaluation: docs/evaluation.md
- Implemented APIs:
- APIs overview: docs/models.md
- Azure OpenAI: docs/azure_openai.md
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -29,12 +29,13 @@ mkdocs-literate-nav = { version = "^0.6.1", optional = true }
mkdocs-section-index = { version = "^0.3.9", optional = true }
mkdocs-same-dir = { version = "^0.1.3", optional = true }
mkdocs-jupyter = { version = "^0.24.7", optional = true }
cli-test-helpers = { version = "^4.0.0", optional = true }
vertexai = { version = "^1.49.0", optional = true }
google-cloud-aiplatform = { version = "^1.49.0", optional = true }
google-generativeai = { version = "^0.7.0", optional = true }
openai = { version = "^1.35.3", optional = true }
pillow = { version = "^10.3.0", optional = true }
ollama = { version = "^0.2.1", optional = true }
ollama = { version = "^0.3.1", optional = true }
huggingface-hub = { version = "^0.23.4", optional = true }
quart = { version = "^0.19.6", optional = true }
transformers = { version = "^4.41.2", optional = true }
@@ -60,6 +61,7 @@ all = [
"mkdocs-section-index",
"mkdocs-same-dir",
"mkdocs-jupyter",
"cli-test-helpers",
"vertexai",
"google-cloud-aiplatform",
"google-generativeai",
@@ -90,6 +92,7 @@ dev = [
"mkdocs-section-index",
"mkdocs-same-dir",
"mkdocs-jupyter",
"cli-test-helpers",
]
gemini = ["vertexai", "google-cloud-aiplatform", "google-generativeai", "pillow"]
vertexai = ["vertexai", "google-cloud-aiplatform", "google-generativeai", "pillow"]
6 changes: 5 additions & 1 deletion src/prompto/apis/testing/testing_api.py
@@ -31,6 +31,7 @@ async def query(self, prompt_dict: dict, index: int | str) -> dict:
# if not either "True" or "False", we error 1/5 times
generation_config = prompt_dict.get("parameters", {})
raise_error_option = generation_config.get("raise_error", "")
raise_error_type = generation_config.get("raise_error_type", "")

if raise_error_option == "True":
raise_error = True
@@ -48,7 +49,10 @@ async def query(self, prompt_dict: dict, index: int | str) -> dict:
error_as_string=error_msg,
id=prompt_dict.get("id", "NA"),
)
raise ValueError(error_msg)
if raise_error_type == "Exception":
raise Exception(error_msg)
else:
raise ValueError(error_msg)
else:
await asyncio.sleep(1)

44 changes: 29 additions & 15 deletions src/prompto/experiment.py
@@ -176,7 +176,8 @@ def group_prompts(self) -> dict[str, list[dict]]:
# initialise some keys with the rate limits if provided
if self.settings.max_queries_dict != {}:
logging.info(
f"Grouping prompts using 'settings.max_queries_dict': {self.settings.max_queries_dict}..."
"Grouping prompts using 'settings.max_queries_dict': "
f"{self.settings.max_queries_dict}..."
)
for key, value in self.settings.max_queries_dict.items():
if isinstance(value, int):
@@ -351,7 +352,7 @@ async def process(self, evaluation_funcs: callable = None) -> tuple[dict, float]

# log completion of experiment
log_message = (
f"Completed experiment {self.__str__()}! "
f"Completed experiment: {self.__str__()}! "
f"Experiment processing time: {round(processing_time, 3)} seconds, "
f"Average time per query: {round(avg_query_processing_time, 3)} seconds"
)
@@ -411,13 +412,16 @@ async def send_requests(
"""
request_interval = 60 / rate_limit
tasks = []
for_group_string = f"for group {group} " if group is not None else ""
for_group_string = f"for group '{group}' " if group is not None else ""
attempt_frac = f"{attempt}/{self.settings.max_attempts}"

for index, item in enumerate(
tqdm(
prompt_dicts,
desc=f"Sending {len(prompt_dicts)} queries at {rate_limit} QPM with RI of {request_interval}s {for_group_string} (attempt {attempt_frac})",
desc=(
f"Sending {len(prompt_dicts)} queries at {rate_limit} QPM with RI of "
f"{request_interval}s {for_group_string}(attempt {attempt_frac})"
),
unit="query",
)
):
@@ -438,7 +442,7 @@ async def send_requests(
# wait for all tasks to complete before returning
responses = await tqdm_asyncio.gather(
*tasks,
desc=f"Waiting for responses {for_group_string} (attempt {attempt_frac})",
desc=f"Waiting for responses {for_group_string}(attempt {attempt_frac})",
unit="query",
)

@@ -470,6 +474,7 @@ async def send_requests_retry(
Group name, by default None. If None, then the group is
not specified in the logs
"""
for_group_string = f" for group '{group}'" if group is not None else ""
# initialise the number of attempts
attempt = 1

@@ -496,8 +501,8 @@
# if we still have failed queries, we will retry them
if len(remaining_prompt_dicts) > 0:
logging.info(
f"Retrying {len(remaining_prompt_dicts)} failed queries - attempt {attempt} of "
f"{self.settings.max_attempts}..."
f"Retrying {len(remaining_prompt_dicts)} failed queries{for_group_string} - "
f"attempt {attempt} of {self.settings.max_attempts}..."
)

# send off the failed queries
@@ -510,9 +515,11 @@
)
else:
# if there are no failed queries, break out of the loop
logging.info(f"No remaining failed queries{for_group_string}!")
break
else:
# if the maximum number of attempts has been reached, break out of the loop
logging.info(f"Maximum attempts reached{for_group_string}. Exiting...")
break

async def query_model_and_record_response(
@@ -553,7 +560,8 @@ async def query_model_and_record_response(
"""
if attempt > self.settings.max_attempts:
raise ValueError(
f"Number of attempts ({attempt}) cannot be greater than max_attempts ({self.settings.max_attempts})"
f"Attempt number ({attempt}) cannot be greater than "
f"settings.max_attempts ({self.settings.max_attempts})"
)
if index is None:
index = "NA"
@@ -572,7 +580,7 @@
except (NotImplementedError, KeyError, ValueError, TypeError) as err:
# don't retry for selected errors, log the error and save an error response
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}). "
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}): "
f"{type(err).__name__} - {err}"
)
async with FILE_WRITE_LOCK:
@@ -586,7 +594,8 @@
if attempt == self.settings.max_attempts:
# we've already tried max_attempts times, so log the error and save an error response
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) after maximum {self.settings.max_attempts} attempts: "
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) "
f"after maximum {self.settings.max_attempts} attempts: "
f"{type(err).__name__} - {err}"
)
async with FILE_WRITE_LOCK:
@@ -596,14 +605,16 @@
# fill in response with error message and note that we've tried max_attempts times
completed_prompt_dict = prompt_dict
completed_prompt_dict["response"] = (
f"An unexpected error occurred when querying the API: {type(err).__name__} - {err} "
"An unexpected error occurred when querying the API: "
f"({type(err).__name__} - {err}) "
f"after maximum {self.settings.max_attempts} attempts"
)
else:
# we haven't tried max_attempts times yet, so log the error and return an Exception
log_message = (
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) on attempt {attempt} of {self.settings.max_attempts}: "
f"{type(err).__name__} - {err} - adding to the queue to try again later"
f"Error (i={index}, id={prompt_dict.get('id', 'NA')}) on attempt "
f"{attempt} of {self.settings.max_attempts}: "
f"{type(err).__name__} - {err}. Adding to the queue to try again later..."
)
async with FILE_WRITE_LOCK:
write_log_message(
@@ -669,9 +680,11 @@ async def generate_text(
# query the model
response = await api.query(prompt_dict=prompt_dict, index=index)

# Perform Evaluation if evaluation function is provided
# perform Evaluation if evaluation function is provided
if evaluation_funcs is not None:
response = await self.evaluate_responses(response, evaluation_funcs)
response = await self.evaluate_responses(
prompt_dict=response, evaluation_funcs=evaluation_funcs
)

return response

Expand All @@ -698,4 +711,5 @@ async def evaluate_responses(

for func in evaluation_funcs:
prompt_dict = func(prompt_dict)

return prompt_dict
