This project is part of the Software Development and Software Engineering course at ITU. The original project description can be found here.
In this project we were tasked with restructuring a Python monolith using the concepts we have learned throughout the course. This project contains a Dagger workflow and a GitHub workflow.
├── README.md <- Project description and how to run the code
│
├── .github/workflows <- GitHub Action workflows
│ │
│ ├── tag_version.yml <- Workflow for creating version tags
│ │
│ └── log_and_test_action.yml <- Workflow that automatically trains and tests the model
│
├── pipeline_deps
│ │
│ └── requirements.txt <- Dependencies for the pipeline
│
├── CODEOWNERS <- Defines codeowners for the repository
│
├── go.mod <- Go file that defines the module and required dependencies
│
├── go.sum <- Go file that ensures consistency and integrity of dependencies
│
├── pipeline.go <- Dagger workflow written in Go (sketched below)
│
├── pyproject.toml <- Project metadata and configuration
│
├── .pre-commit-config.yaml <- Checks quality of code before commits
│
├── Makefile.venv <- Library for managing venv via makefile
│
├── Makefile <- Project related scripts
│
├── references <- Documentation and extra resources
│
├── requirements.txt <- Python dependencies needed for the project
│
├── tests
│ │
│ └── verify_artifacts.py <- Tests to check if all artifacts are copied correctly
│
└── github_dagger_workflow_project <- Source code for the project
│
├── __init__.py <- Marks the directory as a Python package
│
├── 01_data_transformations.py <- Script for data preprocessing and transformation
│
├── 02_model_training.py <- Script for training the models
│
├── 03_model_selection.py <- Script for selecting the best performing model
│
├── 04_prod_model.py <- Script for comparing new best model and production model
│
├── 05_model_deployment.py <- Script for deploying model
│
├── config.py <- Constants and paths used in the pipeline's scripts
│
├── pipeline_utils.py <- Encapsulated code from the .py monolith.
│
├── artifacts
│ │
│ └── raw_data.csv.dvc <- Metadata tracked by DVC for data file
│
└── utils.py <- Helper functions extracted from the .py monolith
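The pipeline.go entry above is the Dagger workflow that drives the numbered scripts. As a rough, hedged sketch (assuming the Go SDK dagger.io/dagger, a python:3.11-slim image, and the paths shown in the tree; the real pipeline.go may use different images, steps, and options), such a pipeline can look like this:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	// Connect to the Dagger engine (requires a running Docker daemon).
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Mount the repository into a Python container and install dependencies.
	src := client.Host().Directory(".")
	pipeline := client.Container().
		From("python:3.11-slim").
		WithDirectory("/app", src).
		WithWorkdir("/app").
		WithExec([]string{"pip", "install", "-r", "requirements.txt"})

	// Run the numbered pipeline scripts sequentially in the same container.
	scripts := []string{
		"01_data_transformations.py",
		"02_model_training.py",
		"03_model_selection.py",
		"04_prod_model.py",
		"05_model_deployment.py",
	}
	for _, s := range scripts {
		pipeline = pipeline.WithExec([]string{"python", "github_dagger_workflow_project/" + s})
	}

	// Force evaluation of the chained steps.
	if _, err := pipeline.Stdout(ctx); err != nil {
		panic(err)
	}
	fmt.Println("pipeline finished")
}
```

With Docker available, `go run pipeline.go` would execute a sketch like this one; the make targets described below wrap the project's actual pipeline in a similar way.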
The workflow can be triggered either on pull requests to main or manually. It can be triggered manually here by pressing Run workflow on the main branch; refresh the page and the triggered workflow will appear. After all the jobs have run, the model artifact can be found on the summary page of the first job's run. We also store other artifacts for convenience.
Testing runs automatically afterwards so the user can check whether the model is of sufficient quality.
Artifacts are stored for 90 days.
For local running you need:
- docker (Server) >= 4.36
- dagger >= 0.14

For local development you also need:
- go - 1.23.3 is currently used
- git >= 2.39
- python >= 3.11
- make >= 3.81 (lower should work too)
Then run:
make setup
.venv\Scripts\activate # for windows
source .venv/bin/activate # for linux/macos
Additionally, make setup installs pre-commit, which takes care of formatting and linting for Go and Python before commits.
For that you can run the scripts sequentially in github_dagger_workflow_project.
Beware: all artifacts will be appended to your repo dir!
The command will run the dagger pipeline; in the end, only the final artifacts are appended (see the export sketch below).
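Continuing the Go sketch above, the export of the final artifacts back to the host could look roughly like this; both the container path and the host path are assumptions, not necessarily the project's actual locations:

```go
// exportArtifacts copies the artifact directory produced inside the pipeline
// container back to the host repository. Both paths are assumptions; the real
// pipeline.go may write and export different locations.
func exportArtifacts(ctx context.Context, pipeline *dagger.Container) error {
	_, err := pipeline.
		Directory("/app/github_dagger_workflow_project/artifacts").
		Export(ctx, "github_dagger_workflow_project/artifacts")
	return err
}
```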
make container_run
Perhaps the most useful target. It will not append any of the container-produced files to the host machine, but it will run a test script that ensures all important artifacts are indeed logged (a sketch of this idea follows below).
make test
Beware: it will not test the model on the inference test!
The same workflow that generates artifacts automatically runs the inference testing. The artifact tests and the inference test are also carried out after every PR (and subsequent commits) to main.
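For the container-only checks mentioned under make container_run, one way to verify artifacts without writing anything to the host is to run the test script inside the same container and let a non-zero exit code fail the pipeline. Again continuing the Go sketch above (tests/verify_artifacts.py is taken from the tree; how the Makefile actually wires this up may differ):

```go
// verifyArtifacts runs the artifact checks inside the pipeline container, so
// nothing is written to the host machine. A failing test script surfaces as a
// non-zero exit code, which Dagger returns as an error.
func verifyArtifacts(ctx context.Context, pipeline *dagger.Container) error {
	out, err := pipeline.
		WithExec([]string{"python", "tests/verify_artifacts.py"}).
		Stdout(ctx)
	if err != nil {
		return fmt.Errorf("artifact verification failed: %w", err)
	}
	fmt.Println(out)
	return nil
}
```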
- We used pre-commit to lint and format, as stated above. We use ruff, ruff format, gofmt and go vet, and we check for PEP8 warnings and errors.
- main branch protection (with GitHub repo settings):
  - A PR is required before merging.
  - At least one approval is needed. We automatically assign reviewers with the CODEOWNERS file.
  - We required status checks to pass for both of our jobs, i.e. Train and Upload Model and Unit Test Model Artifacts. The test checks explicitly whether all artifacts have been generated and whether the model passes the inference test. Jobs are automatically triggered on merge.
- We maintained clear goals via Issues and often quite verbose reviews.
- We used semantic commits about 90% of the time.
On every push to main, a new tag is released with the time at which it was published. See current tags: Tags
This is not part of the documentation: you can read about a few (hard) decisions we have made in Reflections