This repo implements a toy ML project running on simulated data with a full suite of MLOps tools to demonstrate how to build automatic documentation and proofing to support responsible AI systems.
The system implemented is summarised by the following architecture:
A data science team is tasked with creating a regression model on input data from the feature store shown in the diagram above. Their goal is to build a performant model for which all the potential RAI risks are well measured, tracked over time, and documented.
In this fake project, we use the infamous Boston housing dataset as the feature store. This dataset contains two potentially discriminatory columns: "B" and "LSTAT". We have created multiple snapshots of that dataset at different points in time, and at some points in the timeline we simulate data drift on a few randomly selected features.
We assume that the data engineering and governance team has maintained up-to-date data catalogues (in data/) including risk flags for problematic data. This will allow automatic detection of problematic data usage.
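As a purely illustrative sketch (the catalogue file name, layout, and column names below are assumptions, not the repo's actual schema), such a catalogue could be a flat CSV listing each feature together with a risk flag, which training code can check before using a column:

```python
import pandas as pd

# Hypothetical catalogue layout: one row per feature, with a boolean risk flag,
# e.g. data/catalogue.csv with columns: feature,description,risk_flag
catalogue = pd.read_csv("data/catalogue.csv")
flagged = set(catalogue.loc[catalogue["risk_flag"], "feature"])

# Features the data science team intends to use for training
requested_features = ["CRIM", "RM", "LSTAT", "B"]

risky = [f for f in requested_features if f in flagged]
if risky:
    print(f"WARNING: potentially discriminatory features requested: {risky}")
```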
For the sake of simplicity, in this simulation data stores are simple flat files. In a more "industrial" setup, these stores would typically be regular queryable databases on your (cloud) infrastructure.
- MLFlow: we use MLFlow to store our trained regression models, ensure they are reviewed and documented, and maintain a clear separation between production-grade and development-grade models (see the training/registration sketch after this list).
- Evidently: we use Evidently to measure input data drift when performing inference with production-stage models stored in MLFlow (see the drift-report sketch after this list).
- Code Carbon: we use Code Carbon to track the energy consumption of our training and inference tasks on the user's machine (see the tracking sketch after this list). NOTE: as of October 2024, the Code Carbon UI seems to be down, so we won't show carbon-tracking metrics on CC's dashboard.
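To make the MLFlow part concrete, here is a minimal sketch of training, logging, and registering a regression model. The tracking URI, experiment name, snapshot file, and registered model name are illustrative assumptions, not necessarily what this repo uses:

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5000")   # local MLFlow UI (assumed address)
mlflow.set_experiment("boston-regression")         # hypothetical experiment name

df = pd.read_csv("data/snapshot_month_0.csv")      # hypothetical snapshot file
X, y = df.drop(columns=["MEDV"]), df["MEDV"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    model = LinearRegression().fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("test_mse", mse)
    # Registering the model makes it appear in the Model Registry, where it can
    # later be reviewed, documented, and promoted to the "Production" stage.
    mlflow.sklearn.log_model(model, "model", registered_model_name="boston-regressor")
```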
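For Evidently, a drift report can be produced by comparing the training-time reference snapshot with the current inference batch. The sketch below uses Evidently's `Report` / `DataDriftPreset` API (as in Evidently 0.3–0.4; newer releases changed the API), and the file names are hypothetical:

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_csv("data/snapshot_month_0.csv")  # data the model was trained on
current = pd.read_csv("data/snapshot_month_5.csv")    # incoming inference batch

# Compare feature distributions between the two snapshots
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report_month_5.html")
```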
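Code Carbon can be wrapped around any training or inference call. A minimal, self-contained sketch using the `EmissionsTracker` context manager (which by default writes its estimates to a local `emissions.csv` file; the project name is made up):

```python
from codecarbon import EmissionsTracker
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Everything executed inside the context manager is measured; the estimated
# energy use and emissions are appended to emissions.csv when it exits.
with EmissionsTracker(project_name="boston-regression-training"):
    LinearRegression().fit(X, y)
```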
For the sake of simplicity, in this simulation the tools are deployed locally, on the user's computer. In a more "industrial" setup, they would typically be deployed on your organisation's intranet and be accessible (programmatically!) by all your Data Science / Data Engineering teams.
- Create Python environment
- Start MLFlow UI
- Start Evidently
- Generate the fake data (see the data-generation sketch below)
- Train model
- Check model performance and risk metrics on MLFlow and document them manually
- Promote the model to production (to be done manually from the MLFlow UI)
- Run inference multiple times (see the inference-loop sketch below)
- Observe drift appearing at month 5 in Evidently
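To picture the data-generation step, the sketch below writes one snapshot per month from a base table and, from month 5 onward, shifts a few randomly chosen feature columns to create drift. This is only an assumption about how such a simulation could look; the repo's actual generation script, file names, and drift mechanism may differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = pd.read_csv("data/boston.csv")  # hypothetical local copy of the Boston dataset
feature_cols = [c for c in base.columns if c != "MEDV"]

for month in range(12):
    snapshot = base.copy()
    if month >= 5:  # simulate drift appearing at month 5
        drifted = rng.choice(feature_cols, size=3, replace=False)
        for col in drifted:
            # Shift and add noise so the column's distribution visibly changes
            snapshot[col] = snapshot[col] * 1.5 + rng.normal(0.0, snapshot[col].std(), len(snapshot))
    snapshot.to_csv(f"data/snapshot_month_{month}.csv", index=False)
```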
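The last three steps can be pictured as a monthly inference loop that loads whichever model version is currently in the "Production" stage and runs an Evidently drift report against the training-time reference. The model and file names below are the hypothetical ones used in the sketches above, not necessarily the repo's own:

```python
import mlflow
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

mlflow.set_tracking_uri("http://localhost:5000")

# Load whatever model version is currently marked "Production" in the registry
model = mlflow.pyfunc.load_model("models:/boston-regressor/Production")

reference = pd.read_csv("data/snapshot_month_0.csv")

for month in range(1, 12):
    current = pd.read_csv(f"data/snapshot_month_{month}.csv")
    predictions = model.predict(current.drop(columns=["MEDV"]))

    # Drift should become visible in the reports from month 5 onward
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html(f"drift_report_month_{month}.html")
```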