Machine Learning Engineering

This repository contains the code and documentation for developing and deploying machine learning models while adhering to engineering best practices.

Environment Setup

Virtual Environment

Navigate to the project directory:

cd <base>/ml-engineering

Create and activate the conda environment:

conda env create --file deploy/conda/linux_py312.yml
conda activate mle

Manage dependencies:
- Install additional dependencies using conda or pip as needed.
- Update environment file: conda env export --name mle > deploy/conda/linux_py312.yml
- Deactivate environment: conda deactivate
- Remove environment (if necessary): conda remove --name mle --all

Development Workflow

Research & Development

Reference code: <base>/ml-engineering/reference/nonstandardcode
Working notebooks: <base>/ml-engineering/notebooks/working

Script Development

Scripts are derived from working notebooks in <base>/ml-engineering/notebooks/working.

Setting PYTHONPATH

Ensure the directory containing housing_value is in PYTHONPATH:

conda env config vars set PYTHONPATH="$(pwd)/src"
conda deactivate
conda activate mle
echo $PYTHONPATH
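
As a quick check that the variable took effect, the sketch below asks Python whether a package can be located on the current import path. The helper name `is_importable` is hypothetical; only `housing_value` comes from this project.

```python
import importlib.util


def is_importable(package):
    """Return True if `package` can be found on the current import path."""
    return importlib.util.find_spec(package) is not None


# Run inside the activated environment; a False result suggests
# that PYTHONPATH does not yet include the src directory.
print("housing_value importable:", is_importable("housing_value"))
```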

Integrated Features in Scripts

Argument Parsing: Uses argparse for command-line arguments.
Configuration Management: Implements configparser with setup.cfg.
Logging: Incorporates logging for execution tracking and debugging.
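
A minimal sketch of how these three pieces might fit together in one script. The section name `housing_value`, the options `data_dir` and `log_level`, and all function names are hypothetical and are not taken from the repository's scripts.

```python
import argparse
import configparser
import logging


def load_defaults(cfg_path="setup.cfg", section="housing_value"):
    """Read default option values from an INI-style file (hypothetical section)."""
    config = configparser.ConfigParser()
    config.read(cfg_path)  # a missing file is skipped without raising
    return dict(config[section]) if config.has_section(section) else {}


def build_parser(defaults):
    """Build the command-line interface; config values act as fallbacks."""
    parser = argparse.ArgumentParser(description="Example script skeleton")
    parser.add_argument("--data-dir", default=defaults.get("data_dir", "data"))
    parser.add_argument("--log-level", default=defaults.get("log_level", "INFO"))
    return parser


def main(argv=None):
    """Parse arguments, configure logging, and report the chosen settings."""
    args = build_parser(load_defaults()).parse_args(argv)
    logging.basicConfig(
        level=args.log_level.upper(),
        format="%(asctime)s %(levelname)s %(message)s",
    )
    logging.getLogger(__name__).info("Using data directory %s", args.data_dir)
    return args
```

Calling `main(["--log-level", "DEBUG"])` would override the configured log level for a single run, while values in setup.cfg remain the defaults.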

Code Quality Tools

Install required tools:

sudo apt install black isort flake8

Tool	Description	Usage
Black	Code formatter	`black <script.py>`
isort	Import sorter	`isort <script.py>`
flake8	Linter	`flake8 <script.py>`

Note: Configurations are specified in setup.cfg and .vscode/settings.json (for VS Code users).

Maintaining Code Quality

chmod +x shell/src_quality.sh
./shell/src_quality.sh

Script Execution

View available options for each script using the --help flag:

python src/housing_value/ingest_data.py --help
python src/housing_value/train.py --help
python src/housing_value/score.py --help

Testing

Install pytest:

sudo apt install python3-pytest

Note: Configurations are specified in setup.cfg.

Maintain test code quality:

chmod +x shell/tests_quality.sh
./shell/tests_quality.sh

Run tests:

pytest
pytest <test_directory>/<test.py>

Documentation

Documentation is generated with Sphinx.

Prerequisites

Install the package:
- Option 1: Editable mode (requires pyproject.toml); produces an egg-info folder.

pip install -e .

- Option 2: Build and install; produces an egg-info folder as well as a dist folder containing the tar.gz and whl files.

python3 -m pip install --upgrade build
python3 -m build
pip install dist/housing_value-0.0.0-py3-none-any.whl

Install Sphinx and the packages needed to build the documentation:

sudo apt install python3-sphinx
pip install sphinx sphinx-rtd-theme matplotlib
pip install sphinxcontrib-napoleon

Generating Documentation

Navigate to the docs directory:

cd docs

Check configuration files:
- Make sure a Makefile exists in the docs directory; sphinx-quickstart can generate one.

Generate Sphinx project:

sphinx-quickstart

Update configuration files:
- Modify source/conf.py and source/index.rst as needed.
- Reference files are available in the reference directory.
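
For orientation, a source/conf.py along these lines is a common starting point. This is a sketch under assumptions, not the repository's actual file: it assumes conf.py lives in docs/source, that the package sits under src, and that the project is named `housing_value`. It also uses the built-in `sphinx.ext.napoleon` extension, which recent Sphinx releases ship in place of the separate sphinxcontrib-napoleon package.

```python
# docs/source/conf.py (sketch; the values below are assumptions)
import os
import sys

# Make the package importable so autodoc can find it.
# Assumes conf.py lives in docs/source and the code in src/.
sys.path.insert(0, os.path.abspath(os.path.join("..", "..", "src")))

project = "housing_value"  # assumed project name

extensions = [
    "sphinx.ext.autodoc",   # pull API documentation from docstrings
    "sphinx.ext.napoleon",  # parse Google and NumPy style docstrings
]

html_theme = "sphinx_rtd_theme"  # provided by the sphinx-rtd-theme package
```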

Generate API documentation:

sphinx-apidoc -o ./source ../src/housing_value

Update the generated files:
- Modify source/housing_value.rst and source/index.rst as needed.
- Reference files are available in the reference directory.

Build HTML documentation:

make clean
make html

Return to the project root:

cd ..

Note: The documentation file hierarchy in the source directory is: index.rst > modules.rst > housing_value.rst.

Application Packaging with MLflow

Note: The file hierarchy for MLflow is structured as follows: MLproject > app.py.

Maintaining Code Quality

chmod +x shell/app_quality.sh
./shell/app_quality.sh

Tracking UI: Launch the MLflow tracking server with the following command.

mlflow server --backend-store-uri mlruns/ --default-artifact-root mlruns/ --host 127.0.0.1 --port 5000

Run Experiment: Execute an experiment to generate a model artifact with the following command.

mlflow run . -P <parameters>

The optional parameter split_size defaults to 0.2.
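
Each parameter is passed as its own `-P name=value` pair. The sketch below builds such a command from a dictionary; the helper name `mlflow_run_command` and the value 0.3 are hypothetical, chosen only for illustration.

```python
import subprocess


def mlflow_run_command(params, project_dir="."):
    """Build an `mlflow run` argument list with one -P flag per parameter."""
    command = ["mlflow", "run", project_dir]
    for name, value in params.items():
        command += ["-P", f"{name}={value}"]
    return command


def run_experiment(params, project_dir="."):
    """Execute the command; raises CalledProcessError if the run fails."""
    return subprocess.run(mlflow_run_command(params, project_dir), check=True)
```

For example, `run_experiment({"split_size": 0.3})` would be equivalent to typing `mlflow run . -P split_size=0.3` in the project root.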

Python Version Management: Install pyenv to select a specific Python version for the project, which helps keep runs reproducible.

chmod +x shell/pyenv.sh
./shell/pyenv.sh

Activate Conda Environment: Activate the conda environment that was created when the experiment ran.
Dependency Installation: Install the required dependency in the activated environment.

pip install virtualenv

API Endpoint Generation: Create an API endpoint to serve the model:

mlflow models serve -m mlruns/<experiment_id>/<run_id>/artifacts/model/ -h 127.0.0.1 -p 1234

Testing API Endpoint: Test the API endpoint from another terminal using either of the following request formats.

dataframe_split Format:

curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "ocean_proximity"], "data": [[-118.39, 34.12, 29.0, 6447.0, 1012.0, 2184.0, 960.0, 8.2816, "<1H OCEAN"]]}}' http://127.0.0.1:1234/invocations

Inputs/Instances Format:

curl -X POST -H "Content-Type: application/json" --data '{"inputs": [{"longitude": -118.39, "latitude": 34.12, "housing_median_age": 29.0, "total_rooms": 6447.0, "total_bedrooms": 1012.0, "population": 2184.0, "households": 960.0, "median_income": 8.2816, "ocean_proximity": "<1H OCEAN"}]}' http://127.0.0.1:1234/invocations
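
The same two request bodies can be built and sent from Python. The sketch below uses only the standard library; the function names are hypothetical, and the URL assumes the server started with the `mlflow models serve` command above is listening on 127.0.0.1:1234.

```python
import json
import urllib.request

COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "ocean_proximity",
]
ROW = [-118.39, 34.12, 29.0, 6447.0, 1012.0, 2184.0, 960.0, 8.2816, "<1H OCEAN"]


def dataframe_split_payload(columns, rows):
    """Column names once, followed by one list of values per row."""
    return {"dataframe_split": {"columns": list(columns),
                                "data": [list(row) for row in rows]}}


def inputs_payload(columns, rows):
    """One {column: value} record per row."""
    return {"inputs": [dict(zip(columns, row)) for row in rows]}


def invoke(payload, url="http://127.0.0.1:1234/invocations"):
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `invoke(dataframe_split_payload(COLUMNS, [ROW]))` sends the same request as the first curl command.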

Deployment Readiness

To facilitate deployment, Docker images are created by aggregating necessary artifacts and configurations.

Artifact Aggregation:

- Copy model artifacts (MLmodel and model.pkl) from mlruns/<experiment_id>/<run_id>/artifacts/model to <base>/ml-engineering/deploy/docker/mlruns. Ensure unnecessary metadata is cleaned from the MLmodel.
- Transfer the requirements.txt file from mlruns/<experiment_id>/<run_id>/artifacts/model to <base>/ml-engineering/deploy/docker.
- Move the wheel file (housing_value-0.0.0-py3-none-any.whl) from the dist directory to <base>/ml-engineering/deploy/docker.
- Copy the setup.cfg from the project root to <base>/ml-engineering/deploy/docker, ensuring it contains only data required for inference.
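
The aggregation steps above can be sketched as a small helper. The function name `stage_artifacts` and its arguments are hypothetical. Note two deliberate simplifications: the wheel is copied rather than moved, and the manual cleanup of MLmodel and setup.cfg is not automated.

```python
import shutil
from pathlib import Path


def stage_artifacts(model_dir, dist_dir, project_root, docker_dir):
    """Copy everything the Docker build needs into docker_dir.

    model_dir is the run's artifacts/model directory; trimming MLmodel
    metadata and setup.cfg remains a manual step afterwards.
    """
    model_dir, docker_dir = Path(model_dir), Path(docker_dir)
    target_model_dir = docker_dir / "mlruns"
    target_model_dir.mkdir(parents=True, exist_ok=True)

    # Model artifacts go under deploy/docker/mlruns.
    for name in ("MLmodel", "model.pkl"):
        shutil.copy2(model_dir / name, target_model_dir / name)

    # Dependencies, wheel(s), and configuration sit next to the Dockerfile.
    shutil.copy2(model_dir / "requirements.txt", docker_dir / "requirements.txt")
    for wheel in Path(dist_dir).glob("*.whl"):
        shutil.copy2(wheel, docker_dir / wheel.name)
    shutil.copy2(Path(project_root) / "setup.cfg", docker_dir / "setup.cfg")

    # Report what was staged, relative to docker_dir.
    return sorted(
        path.relative_to(docker_dir).as_posix()
        for path in docker_dir.rglob("*")
        if path.is_file()
    )
```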

Script and Configuration Creation:

- Write a run.sh script that executes the mlflow models serve command.
- Create a .dockerignore file so that unneeded files are not copied into the image's WORKDIR.
- Write a Dockerfile that packages all components into a Docker image.

Image Development:

cd deploy/docker

Build With Root User:

docker build . -t <dockerhub_username>/mle:rootuser -f Dockerfile.rootuser

Build Without Root User: Enhance security by building an image that does not run as the root user.

docker build . -t <dockerhub_username>/mle:nonrootuser -f Dockerfile.nonrootuser

Use BuildKit for Multistage Builds: Reduce image size and build time by using Docker BuildKit for multistage builds.

DOCKER_BUILDKIT=1 docker build . -t <dockerhub_username>/mle:multistage -f Dockerfile.multistage

Container Management

This section covers running the Docker image as a container, testing its endpoint, and managing images and containers.

Starting and Testing a Container

Start the Container: Use the following command to start a Docker container named rootuser and map port 8080 on your host to port 5000 in the container.

docker run -dit -p 8080:5000 --name rootuser <dockerhub_username>/mle:rootuser

Test the Endpoint: Verify that your application is running correctly by sending a POST request to the endpoint using curl.

curl -X POST -H "Content-Type: application/json" --data '{"dataframe_split": {"columns": ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "ocean_proximity"], "data": [[-118.39, 34.12, 29.0, 6447.0, 1012.0, 2184.0, 960.0, 8.2816, "<1H OCEAN"]]}}' http://127.0.0.1:8080/invocations

Managing Docker Images

Push Image to Docker Hub: First, log in to Docker Hub and then push images.

docker login -u <dockerhub_username>
docker push <dockerhub_username>/mle:rootuser
docker push <dockerhub_username>/mle:nonrootuser
docker push <dockerhub_username>/mle:multistage

List Images and Containers: View all Docker images and containers on the system.

Images:

docker image ls

Containers:

docker ps --all

View Logs: Access the logs of a running container.

docker logs <container_name>

Delete Containers and Images: Remove a specific container or image using these commands:

Containers:

docker rm -f <container_name>

Images:

docker rmi <image_name>

Retesting in a New Environment

To test your application in a new environment:

Pull Image from Docker Hub:

docker pull <dockerhub_username>/mle:rootuser

Start the Container Again:

docker run -dit -p 8080:5000 --name rootuser <dockerhub_username>/mle:rootuser

Re-test the Endpoint: Use the same curl command as before to verify functionality.
