Commit

add mkdocs

rchan26 committed Jun 28, 2024
1 parent c4687ca commit 2babbcc
Showing 39 changed files with 480 additions and 343 deletions.
52 changes: 26 additions & 26 deletions .github/workflows/ci.yaml
@@ -58,29 +58,29 @@ jobs:
- name: Upload coverage report
uses: codecov/codecov-action@v4  # version pin assumed; it was obfuscated in the page

# dist:
# name: Distribution build
# runs-on: ubuntu-latest
# needs: [pre-commit]

# steps:
# - uses: actions/checkout@v4
# with:
# fetch-depth: 0

# - name: Build sdist and wheel
# run: pipx run build

# - uses: actions/upload-artifact@v4
# with:
# path: dist

# - name: Check products
# run: pipx run twine check dist/*

# - uses: pypa/gh-action-pypi-publish@release/v1  # ref assumed; it was obfuscated in the page
# if: github.event_name == 'release' && github.event.action == 'published'
# with:
# # Remember to generate this and set it in "GitHub Secrets"
# user: __token__
# password: ${{ secrets.PYPI_API_TOKEN }}
docs:
  needs: [pre-commit, pytest]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    # Set a weekly cache id: without this step env.cache_id is empty and
    # the cache key below never rotates (this follows the mkdocs-material docs)
    - name: Set cache id
      run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV

    - name: Apply mkdocs cache
      uses: actions/cache@v3
      with:
        key: mkdocs-material-${{ env.cache_id }}
        path: .cache
        restore-keys: |
          mkdocs-material-

    - name: Install doc dependencies via poetry
      run: |
        pip install poetry
        poetry install --with dev

    - name: Build docs with gh-deploy --force
      run: |
        poetry run mkdocs gh-deploy --force
45 changes: 22 additions & 23 deletions README.md
@@ -1,26 +1,22 @@
# prompto

[![Actions Status][actions-badge]][actions-link]
[![Codecov Status][codecov-badge]][codecov-link]
[![PyPI version][pypi-version]][pypi-link]
[![PyPI platforms][pypi-platforms]][pypi-link]

`prompto` derives from the Italian word "_pronto_" which means "_ready_" and could also mean "_I prompt_" in Italian (if "_promptare_" were a verb meaning "_to prompt_").

`prompto` is a Python library which facilitates the running of LLM experiments stored as jsonl files. It automates querying API endpoints and logs progress asynchronously. The library is designed to be extensible and can be used to query different models.

## Available APIs and Models

The library supports querying several APIs and models. The following APIs are currently supported:
- [OpenAI](docs/models.md#openai) (`"openai"`)
- [Azure OpenAI](docs/models.md#azure-openai) (`"azure-openai"`)
- [Gemini](docs/models.md#gemini) (`"gemini"`)
- [Vertex AI](docs/models.md#vertex-ai) (`"vertexai"`)
- [Huggingface text-generation-inference](docs/models.md#huggingface-text-generation-inference) (`"huggingface-tgi"`)
- [Ollama](docs/models.md#ollama) (`"ollama"`)
- [A simple Quart API](docs/models.md#quart-api) for running models from [`transformers`](https://github.com/huggingface/transformers) locally (`"quart"`)

Our aim for `prompto` is to support more APIs and models in the future and to make it easy to add new ones to the library. We welcome contributions: we have a [contribution guide](docs/contribution.md) and a [guide on how to add new APIs and models](docs/add_new_api.md) in the [docs](docs/).
* [OpenAI](./docs/openai.md) (`"openai"`)
* [Azure OpenAI](./docs/azure_openai.md) (`"azure-openai"`)
* [Gemini](./docs/gemini.md) (`"gemini"`)
* [Vertex AI](./docs/vertexai.md) (`"vertexai"`)
* [Huggingface text-generation-inference](./docs/huggingface_tgi.md) (`"huggingface-tgi"`)
* [Ollama](./docs/ollama.md) (`"ollama"`)
* [A simple Quart API](./docs/quart.md) for running models from [`transformers`](https://github.com/huggingface/transformers) locally (`"quart"`)

Our aim for `prompto` is to support more APIs and models in the future and to make it easy to add new ones to the library. We welcome contributions: we have a [contribution guide](./docs/contribution.md) and a [guide on how to add new APIs and models](./docs/add_new_api.md) in the [docs](./docs/README.md).

## Installation

@@ -44,9 +40,10 @@ You might also want to set up a development environment for the library.
## Getting Started

The library has functionality to process experiments and to run a pipeline which continually looks for new experiment jsonl files in the input folder. Everything starts with defining a **pipeline data folder** which contains:
- `input` folder: contains the jsonl files with the experiments
- `output` folder: where the results of the experiments will be stored. When an experiment is run, a folder is created within the output folder with the experiment name (as defined in the jsonl file but removing the `.jsonl` extension) and the results and logs for the experiment are stored there
- `media` folder: which contains the media files for the experiments. These files must be within folders of the same experiment name (as defined in the jsonl file but removing the `.jsonl` extension)

* `input` folder: contains the jsonl files with the experiments
* `output` folder: where the results of the experiments will be stored. When an experiment is run, a folder is created within the output folder with the experiment name (as defined in the jsonl file but removing the `.jsonl` extension) and the results and logs for the experiment are stored there
* `media` folder: which contains the media files for the experiments. These files must be within folders of the same experiment name (as defined in the jsonl file but removing the `.jsonl` extension)

When using the library, you simply pass in the folder you would like to use as the pipeline data folder and the library will take care of the rest.
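
For illustration, a minimal sketch of creating such a folder with `pathlib` (the top-level name `data` is just an assumption matching the example commands below):

```python
from pathlib import Path

# create the pipeline data folder and its three subfolders;
# "data" is an arbitrary choice of name
data_folder = Path("data")
for subfolder in ("input", "output", "media"):
    (data_folder / subfolder).mkdir(parents=True, exist_ok=True)
```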

@@ -82,10 +79,11 @@

```
prompto_run_experiment --file data/input/openai.jsonl --max-queries 30
```

This will:

1. Create subfolders in the `data` folder (in particular, it will create `media` (`data/media`) and `output` (`data/output`) folders)
2. Create a folder in the `output` folder with the name of the experiment (the file name without the `.jsonl` extension - in this case, `openai`)
3. Move the `openai.jsonl` file to the `output/openai` folder (and add a timestamp of when the input file was created to that file)
4. Start running the experiment, sending requests to the OpenAI API asynchronously at the rate specified in this command (30 queries a minute, so requests are sent every 2 seconds) - the default is 10 queries per minute
5. Results will be stored in a "completed" jsonl file in the output folder (which is also timestamped)
6. Logs will be printed out to the console and also stored in a log file (which is also timestamped)
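
For illustration, here is a hypothetical line of `data/input/openai.jsonl`, written out via Python. The field names are assumptions based on the `"api"`, `"model_name"` and `"parameters"` keys described elsewhere in these docs; the exact schema is documented in the experiment file documentation.

```python
import json

# a hypothetical experiment line -- field names are assumptions based on
# the "api", "model_name" and "parameters" keys described in these docs
line = {
    "prompt": "What is the capital of France?",
    "api": "openai",
    "model_name": "gpt-3.5-turbo",
    "parameters": {"temperature": 0.7},
}

# each line of the jsonl file is one JSON object like this
print(json.dumps(line))
```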

@@ -146,9 +144,10 @@ The completed experiment file will contain the responses from the Gemini API
## Using the Library in Python

The library has a few key classes:
- [`Settings`](src/prompto/settings.py): this defines the settings of the experiment pipeline which stores the paths to the relevant data folders and the parameters for the pipeline.
- [`Experiment`](src/prompto/experiment.py): this defines all the variables related to a _single_ experiment. An 'experiment' here is defined by a particular JSONL file which contains the data/prompts for each experiment. Each line in this file is a particular input to the LLM which we will obtain a response for. An experiment can be processed by calling the `Experiment.process()` method which will query the model and store the results in the output folder.
- [`ExperimentPipeline`](src/prompto/experiment_pipeline.py): this is the main class for running the full pipeline. The pipeline can be run using the `ExperimentPipeline.run()` method which will continually check the input folder for new experiments to process.
- [`AsyncBaseAPI`](src/prompto/base.py): this is the base class for querying all APIs. Each API/model should inherit from this class and implement the `async_query` method which will (asynchronously) query the model's API and return the response. When running an experiment, the `Experiment` class will call this method for each prompt to send requests asynchronously.

When a new model is added, you must add it to the [`API`](src/prompto/apis/__init__.py) dictionary which is in the `apis` module. This dictionary should map the model name to the class of the model.
* [`Settings`](./src/prompto/settings.py): this defines the settings of the experiment pipeline which stores the paths to the relevant data folders and the parameters for the pipeline.
* [`Experiment`](./src/prompto/experiment.py): this defines all the variables related to a _single_ experiment. An 'experiment' here is defined by a particular JSONL file which contains the data/prompts for each experiment. Each line in this file is a particular input to the LLM which we will obtain a response for. An experiment can be processed by calling the `Experiment.process()` method which will query the model and store the results in the output folder.
* [`ExperimentPipeline`](./src/prompto/experiment_pipeline.py): this is the main class for running the full pipeline. The pipeline can be run using the `ExperimentPipeline.run()` method which will continually check the input folder for new experiments to process.
* [`AsyncBaseAPI`](./src/prompto/apis/base.py): this is the base class for querying all APIs. Each API/model should inherit from this class and implement the `async_query` method which will (asynchronously) query the model's API and return the response. When running an experiment, the `Experiment` class will call this method for each prompt to send requests asynchronously.

When a new model is added, you must add it to the [`API`](./src/prompto/apis/__init__.py) dictionary which is in the `apis` module. This dictionary should map the model name to the class of the model.
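
To make the flow concrete, here is a minimal sketch of driving these classes from Python, assuming constructor arguments named `data_folder` and `max_queries` (hypothetical; check the class docstrings for the actual signatures):

```python
from prompto.settings import Settings
from prompto.experiment import Experiment

# hypothetical argument names -- check the Settings docstring for the
# real signature
settings = Settings(data_folder="data", max_queries=30)

# the experiment file is data/input/openai.jsonl
experiment = Experiment("openai.jsonl", settings=settings)

# query the model and write results/logs to data/output/openai/
experiment.process()
```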
7 changes: 4 additions & 3 deletions docs/README.md
@@ -2,13 +2,14 @@

## Getting Started

* [Quickstart](../README.md#getting-started)
* [Installation](../README.md#installation)
* [Examples](../examples)
* [Quickstart](./../README.md#getting-started)
* [Installation](./../README.md#installation)
* [Examples](./../examples/README.md)

## Using `prompto`

* [Setting up an experiment file](./experiment_file.md)
* [Configuring environment variables](./environment_variables.md)
* [prompto Pipeline and running experiments](./pipeline.md)
* [prompto commands](./commands.md)
* [Specifying rate limits](./rate_limits.md)
5 changes: 5 additions & 0 deletions docs/about.md
@@ -0,0 +1,5 @@
# About

`prompto` is a Python library written by the [Research Engineering Team (REG)](https://www.turing.ac.uk/work-turing/research/research-engineering-group) at the [Alan Turing Institute](https://www.turing.ac.uk/). It was originally written by [Ryan Chan](https://github.com/rchan26), [Federico Nanni](https://github.com/fedenanni) and [Evelina Gabasova](https://github.com/evelinag).

The library is designed to facilitate the running of language model experiments stored as jsonl files. It automates querying API endpoints and logs progress asynchronously. It is extensible and can be used to query different models.
14 changes: 14 additions & 0 deletions docs/add_new_api.md
@@ -1 +1,15 @@
# Instructions to add new API/model

The `prompto` library supports querying multiple LLM API endpoints asynchronously (see [available APIs](./../README.md#available-apis-and-models) and the [model docs](./models.md)). However, the list of available APIs is far from complete! As we don't have access to every API available, we need your help to implement them and we welcome contributions to the library! It might also be the case that an API has been implemented but needs to be updated or improved.

In this document, we aim to capture some key steps to add a new API/model to the library. We hope that this will develop into a helpful guide.

For a guide to contributing to the library in general, see our [contribution guide](./contribution.md). If you have any suggestions or corrections, please feel free to contribute!

## The `prompto` library structure

## Asynchronous querying

## The `AsyncBaseAPI` class
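
A minimal sketch of the pattern described in the README, with an assumed `async_query` signature (the real one may differ):

```python
from prompto.apis.base import AsyncBaseAPI


class MyNewAPI(AsyncBaseAPI):
    # Hypothetical sketch: the real signature of async_query may differ.
    # The README only requires that it (asynchronously) queries the
    # model's API and returns the response.
    async def async_query(self, prompt_dict: dict) -> dict:
        prompt = prompt_dict["prompt"]
        response = f"echo: {prompt}"  # replace with a real API call
        prompt_dict["response"] = response
        return prompt_dict
```

The new class then needs to be registered in the `API` dictionary in the `apis` module so the pipeline can route lines with the new `"api"` identifier to it.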

## Implementing 'checks'
22 changes: 22 additions & 0 deletions docs/azure_openai.md
@@ -0,0 +1,22 @@
## Azure OpenAI

**Environment variables**:

* `AZURE_OPENAI_API_KEY`: the API key for the Azure OpenAI API
* `AZURE_OPENAI_API_ENDPOINT`: the endpoint for the Azure OpenAI API
* `AZURE_OPENAI_API_VERSION`: the version of the Azure OpenAI API

**Model-specific environment variables**:

As described in the [model-specific environment variables](./environment_variables.md#model-specific-environment-variables) section, you can set model-specific environment variables for different models in Azure OpenAI by appending the model name to the environment variable name. For example, if `"model_name": "prompto_model"` is specified in the `prompt_dict`, the following model-specific environment variables can be used:

* `AZURE_OPENAI_API_KEY_prompto_model`
* `AZURE_OPENAI_API_ENDPOINT_prompto_model`
* `AZURE_OPENAI_API_VERSION_prompto_model`

**Required environment variables**:

For any given `prompt_dict`, the following environment variables are required:

* One of `AZURE_OPENAI_API_KEY` or `AZURE_OPENAI_API_KEY_model_name`
* One of `AZURE_OPENAI_API_ENDPOINT` or `AZURE_OPENAI_API_ENDPOINT_model_name`
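
For illustration, setting these from Python (the values below are placeholders, and the API version shown is just an example):

```python
import os

# placeholder values -- substitute your own key, endpoint and version
os.environ["AZURE_OPENAI_API_KEY"] = "<your-api-key>"
os.environ["AZURE_OPENAI_API_ENDPOINT"] = "https://<your-resource>.openai.azure.com/"
os.environ["AZURE_OPENAI_API_VERSION"] = "2023-05-15"  # example version

# model-specific overrides for "model_name": "prompto_model"
os.environ["AZURE_OPENAI_API_KEY_prompto_model"] = "<key-for-this-model>"
os.environ["AZURE_OPENAI_API_ENDPOINT_prompto_model"] = "https://<other-resource>.openai.azure.com/"
```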
23 changes: 12 additions & 11 deletions docs/commands.md
@@ -1,12 +1,12 @@
# Commands

- [Running an experiment file](#running-an-experiment-file)
- [Running the pipeline](#running-the-pipeline)
- [Run checks on an experiment file](#run-checks-on-an-experiment-file)
- [Create judge file](#create-judge-file)
- [Obtain missing results jsonl file](#obtain-missing-results-jsonl-file)
- [Convert images to correct form](#convert-images-to-correct-form)
- [Start up Quart server](#start-up-quart-server)
* [Running an experiment file](#running-an-experiment-file)
* [Running the pipeline](#running-the-pipeline)
* [Run checks on an experiment file](#run-checks-on-an-experiment-file)
* [Create judge file](#create-judge-file)
* [Obtain missing results jsonl file](#obtain-missing-results-jsonl-file)
* [Convert images to correct form](#convert-images-to-correct-form)
* [Start up Quart server](#start-up-quart-server)

## Running an experiment file

@@ -73,10 +73,11 @@

```
prompto_create_judge \
```

In `judge`, you must have two files:
- `template.txt`: this is the template file which contains the prompts and the responses to be scored. The responses should be replaced with the placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}`.
- `settings.json`: this is the settings JSON file which contains the settings for the judge(s). The keys are judge identifiers and the values contain the "api", "model_name" and "parameters" keys which specify the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).

See for example [this judge example](../examples/data/data/judge) which contains example template and settings files.
* `template.txt`: this is the template file which contains the prompts and the responses to be scored. The responses should be replaced with the placeholders `{INPUT_PROMPT}` and `{OUTPUT_RESPONSE}`.
* `settings.json`: this is the settings JSON file which contains the settings for the judge(s). The keys are judge identifiers and the values contain the "api", "model_name" and "parameters" keys which specify the LLM to use as a judge (see the [experiment file documentation](experiment_file.md) for more details on these keys).

See for example [this judge example](./../examples/data/data/judge) which contains example template and settings files.

The judge specified with the `--judge` flag should be a key in the `settings.json` file in the judge location. You can create different judge files using different LLMs as judge by specifying a different judge identifier from the keys in the `settings.json` file.
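
As a concrete (hypothetical) illustration of the `settings.json` structure, written out as a Python dict: the judge identifier and model name below are made up, and only the `"api"`, `"model_name"` and `"parameters"` keys come from the description above.

```python
# a hypothetical judge/settings.json, shown as a Python dict
judge_settings = {
    "my-gemini-judge": {             # judge identifier, passed via --judge
        "api": "gemini",             # which API to use for judging
        "model_name": "gemini-pro",  # made-up example model name
        "parameters": {"temperature": 0.0},
    }
}
```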

@@ -103,7 +104,7 @@ prompto_convert_images --folder images

## Start up Quart server

As described in the [Quart API model documentation](models.md#quart-api), we have implemented a simple [Quart API](../src/prompto/apis/quart/quart_api.py) that can be used to query a text-generation model from the [Huggingface model hub](https://huggingface.co/models) using the Huggingface `transformers` library. To start up the Quart server, you can use the `prompto_start_quart_server` command along with the Huggingface model name. To see all arguments of this command, run `prompto_start_quart_server --help`.
As described in the [Quart API model documentation](./quart.md), we have implemented a simple [Quart API](./../src/prompto/apis/quart/quart_api.py) that can be used to query a text-generation model from the [Huggingface model hub](https://huggingface.co/models) using the Huggingface `transformers` library. To start up the Quart server, you can use the `prompto_start_quart_server` command along with the Huggingface model name. To see all arguments of this command, run `prompto_start_quart_server --help`.

To start up the Quart server with [`vicgalle/gpt2-open-instruct-v1`](https://huggingface.co/vicgalle/gpt2-open-instruct-v1) at `"http://localhost:8000"`, pass the model name to the `prompto_start_quart_server` command (run `prompto_start_quart_server --help` for the full set of arguments).