This repository was archived by the owner on Dec 10, 2024. It is now read-only.

Commit ff5c919: Initial commit
pipliggins committed Nov 20, 2024 (0 parents)
Showing 30 changed files with 4,006 additions and 0 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,37 @@
name: Unit and integration tests
on:
  workflow_dispatch:
  push:
    branches:
      - main
  pull_request:

jobs:
  build:
    runs-on: ${{ matrix.os }}

    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: |
            3.11
            3.12
      - name: Install the latest version of uv
        uses: astral-sh/setup-uv@v2
      - name: Install project
        run: uv sync --dev
      - name: Run tests
        run: uv run pytest --cov

      # - name: Upload coverage reports to Codecov
      #   uses: codecov/[email protected]
      #   if: ${{ matrix.os == 'ubuntu-latest' }}
      #   with:
      #     token: ${{ secrets.CODECOV_TOKEN }}
164 changes: 164 additions & 0 deletions .gitignore
@@ -0,0 +1,164 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

*.DS_Store
16 changes: 16 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,16 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
      - id: ruff-format
        args: ['--exclude', 'src/autoparser/toml_writer.py']
99 changes: 99 additions & 0 deletions README.md
@@ -0,0 +1,99 @@
# autoparser

[![](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![tests](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml/badge.svg)](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml)
<!-- [![codecov](https://codecov.io/gh/globaldothealth/autoparser/graph/badge.svg?token=AINU8PNJE3)](https://codecov.io/gh/globaldothealth/autoparser) -->
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

autoparser helps generate ADTL parsers as TOML files, which can then be
processed by [adtl](https://github.com/globaldothealth/adtl) to transform data
files from their source schema into a specified target schema.

autoparser contains functionality to:
1. Create a basic data dictionary from a raw data file (`create-dict`)
2. Use an LLM (currently only ChatGPT via the OpenAI API) to add descriptions to the
data dictionary, to enable better parser auto-generation (`add-descriptions`)
3. Create a mapping CSV file linking source to target data fields and value mappings
using the LLM, which can be edited by a user (`create-mapping`)
4. Create a TOML parser file ready for use with ADTL, based on a JSON schema
(generated rule-based from the mapping file; `create-parser`).

All four functions have both a command-line interface and an associated Python function.
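The four commands are typically run in sequence. The sketch below is schematic only:
the placeholder arguments follow the steps described in the next section, the
`add-descriptions` argument list in particular is an assumption, and the exact flags
for each command should be checked with `autoparser <command> --help`.

```shell
# Schematic CLI pipeline (placeholder arguments; not a verbatim transcript)
autoparser create-dict <raw data> -o <dictionary name>
autoparser add-descriptions <data dictionary> <language> <api key> -o <dictionary name>  # argument order assumed
autoparser create-mapping <data dictionary> <path to schema> <language> <api key> -o <mapping name>
autoparser create-toml <mapping csv> <path to schema> -n <parser name>
```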

## Parser construction process (CLI)

1. **Data**: Get the data as CSV or Excel and the data dictionary if available.

2. **Creating autoparser config**: An optional step, needed only if the data is not in
REDCap (English) format. The autoparser config ([example](src/autoparser/config/redcap-en.toml))
specifies most of the configuration settings autoparser uses.

3. **Preparing the data dictionary**: If the data dictionary is not in CSV format, or is
split across multiple Excel sheets, it needs to be combined into a single
CSV. If a data dictionary does not already exist, one can be created using

```shell
autoparser create-dict <path to data> -o <parser-name>
```

Here, `-o` sets the output name and will create
`<parser-name>.csv`. For optional arguments (such as using a custom configuration
created in step 2), see `autoparser create-dict --help`.

4. **Generate intermediate mappings (CSV)**: Run autoparser with the config and data
dictionary to generate the intermediate mappings:

```shell
autoparser create-mapping <path to data dictionary> <path to schema> <language> <api key> -o <parser-name>
```

Here `language` refers to the language of the original data, e.g. "fr" for French-language
data. `autoparser` defaults to using OpenAI as the LLM API, so the API key
provided should be for the OpenAI platform. In future, alternative APIs and/or a
self-hosted LLM are planned to be offered as options.

5. **Curate mappings**: The intermediate mappings must be manually curated, as
the LLM may have generated false matches, or missed certain fields or value mappings.

6. **Generate TOML**: This step is automated and should produce a TOML file that
conforms to the parser schema.

For example:

```shell
autoparser create-toml parser.csv <path to schema> -n parser
```

will create `parser.toml` (the name is set with the `-n` flag) from the
intermediate mappings file `parser.csv`.

7. **Review TOML**: The TOML file may contain errors, so it is recommended to
check it and alter it as necessary.

8. **Run adtl**: Run adtl on the TOML file and the data source. This process
will report validation errors, which can be fixed by reviewing the TOML file
and examining the source data that fails validation.
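
This final step uses adtl itself rather than autoparser. The invocation below is
illustrative only, assuming the `parser.toml` generated in step 6 and a CSV data source;
check `adtl --help` for the exact arguments your version of adtl accepts.

```shell
# Illustrative only: run adtl with the generated parser against the source data
adtl parser.toml <path to source data>
```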

## Parser construction process (Python)

An [example notebook](example.ipynb) using the test data demonstrates
how to construct a parser with the Python functions of `autoparser`.

## Troubleshooting autogenerated parsers

1. "I get validation errors like "'x' must be date":
ADTL expects dates to be provided in ISO format (i.e. YYY-MM-DD). If your dates are
formatted differently, e.g. "dd/mm/yyyy", you can add a line in the header
of the TOML file (e.g. underneath the line "returnUnmatched=True") like this:

```TOML
defaultDateFormat = "%d/%m/%Y"
```
which should automatically convert the dates for you.

2. "ADTL can't find my schema" (error: No such file or directory ..../x.schema.json):
autoparser puts the path to the schema at the top of the TOML file, relative to the
*current location of the parser* (i.e., where you ran the autoparser command from).
If you have since moved the parser file, you will need to update the schema path at the
top of the TOML parser.