Qarik-Group/vertex-pipelines-end-to-end-samples

 
 

Vertex Pipelines End-to-End Samples

AKA "Vertex AI Turbo Templates"

Introduction

This repository provides a reference implementation of Vertex Pipelines for creating a production-ready MLOps solution on Google Cloud. You can take this repository as a starting point for your own ML use cases. The implementation includes:

  • Infrastructure-as-Code using Terraform for a typical dev/test/prod setup of Vertex AI and other relevant services
  • ML training and prediction pipelines using Kubeflow Pipelines
  • Reusable Kubeflow components that can be used in common ML pipelines
  • CI/CD using Google Cloud Build for linting, testing, and deploying ML pipelines
  • Developer scripts (Makefile, Python scripts etc.)

Get started today by following this step-by-step notebook tutorial! 🚀 In this three-part notebook series you'll deploy a Google Cloud project and run production-ready ML pipelines using Vertex AI without writing a single line of code.

Cloud Architecture

The diagram below shows the cloud architecture for this repository.

Cloud Architecture diagram

There are four different Google Cloud projects in use:

  • dev - a shared sandbox environment for use during development
  • test - environment for testing new changes before they are promoted to production. This environment should be treated as much as possible like a production environment.
  • prod - production environment
  • admin - separate Google Cloud project for setting up CI/CD in Cloud Build (since the CI/CD pipelines operate across the different environments)

Vertex Pipelines are scheduled using Google Cloud Scheduler. Cloud Scheduler emits a Pub/Sub message that triggers a Cloud Function, which in turn triggers the Vertex Pipeline to run. In future, this will be replaced with the Vertex Pipelines Scheduler (once there is a Terraform resource for it).
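
The scheduling chain described above can be sketched as a Pub/Sub-triggered function. This is a minimal illustration, not the function shipped in this repository: the attribute names `template_path` and `enable_caching`, the display name, and the convention that the message body holds the pipeline parameters as JSON are all assumptions. The Vertex AI SDK is imported lazily so that the message parsing can be exercised without it.

```python
import base64
import json
from typing import Any, Callable, Dict, Optional


def parse_pubsub_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a Pub/Sub-triggered function event into a pipeline request.

    Assumption: message attributes carry the compiled pipeline location and
    the base64-encoded message body is a JSON object of pipeline parameters.
    """
    attributes = event.get("attributes") or {}
    raw_data = event.get("data")
    parameters = json.loads(base64.b64decode(raw_data)) if raw_data else {}
    return {
        "template_path": attributes["template_path"],
        "enable_caching": attributes.get("enable_caching", "true").lower() == "true",
        "parameter_values": parameters,
    }


def submit_to_vertex(request: Dict[str, Any]) -> Any:
    """Submit the pipeline run to Vertex AI without waiting for it to finish."""
    # Imported here so the parsing code has no hard dependency on the SDK.
    from google.cloud import aiplatform

    job = aiplatform.PipelineJob(
        display_name="scheduled-pipeline",  # hypothetical name
        template_path=request["template_path"],
        parameter_values=request["parameter_values"],
        enable_caching=request["enable_caching"],
    )
    job.submit()
    return job


def handle_event(
    event: Dict[str, Any],
    context: Any = None,
    submit: Optional[Callable[[Dict[str, Any]], Any]] = None,
) -> Any:
    """Entry point: Cloud Scheduler -> Pub/Sub -> this function -> Vertex."""
    request = parse_pubsub_event(event)
    return (submit or submit_to_vertex)(request)
```

Passing a stand-in for `submit` (for example `lambda request: request`) lets you check the parsing locally before deploying anything.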

Setup

Prerequisites:

Deploy infrastructure:

You will need four Google Cloud projects: dev, test, prod, and admin. The Cloud Build pipelines will run in the admin project, and deploy resources into the dev/test/prod projects. Before your CI/CD pipelines can deploy the infrastructure, you will need to set up a Terraform state bucket for each environment (the example below shows dev):

export DEV_PROJECT_ID=my-dev-gcp-project
export DEV_LOCATION=europe-west2
gsutil mb -l $DEV_LOCATION -p $DEV_PROJECT_ID --pap=enforced gs://$DEV_PROJECT_ID-tfstate && \
  gsutil ubla set on gs://$DEV_PROJECT_ID-tfstate
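
Since the same two commands are needed for every environment, a small helper can generate them. The sketch below only prints the commands and does not run them. The test and prod project IDs and the shared location are placeholders assumed for illustration, and the `-tfstate` bucket naming follows the dev example above.

```python
import shlex
from typing import Dict, List

# Placeholder project IDs: replace with your own (assumed values).
ENVIRONMENTS: Dict[str, str] = {
    "dev": "my-dev-gcp-project",
    "test": "my-test-gcp-project",
    "prod": "my-prod-gcp-project",
}
LOCATION = "europe-west2"  # assumed to be the same for all environments


def tfstate_bucket_commands(project_id: str, location: str) -> List[List[str]]:
    """Return the gsutil commands that create one Terraform state bucket."""
    bucket = f"gs://{project_id}-tfstate"
    return [
        # Create the bucket with public access prevention enforced.
        ["gsutil", "mb", "-l", location, "-p", project_id, "--pap=enforced", bucket],
        # Turn on uniform bucket-level access.
        ["gsutil", "ubla", "set", "on", bucket],
    ]


if __name__ == "__main__":
    for env, project_id in ENVIRONMENTS.items():
        print(f"# {env}")
        for command in tfstate_bucket_commands(project_id, LOCATION):
            print(shlex.join(command))
```

Review the printed commands, then paste them into a shell authenticated against the relevant project.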

Enable APIs in admin project:

export ADMIN_PROJECT_ID=my-admin-gcp-project
gcloud services enable cloudresourcemanager.googleapis.com serviceusage.googleapis.com --project=$ADMIN_PROJECT_ID
make deploy env=dev

More details about the infrastructure are explained in this guide, which also describes the scheduling of pipelines and how to tear down the infrastructure.

Install dependencies:

pyenv install --skip-existing                          # install Python
poetry config virtualenvs.prefer-active-python true   # configure Poetry
make install                                          # install Python dependencies
cd pipelines && poetry run pre-commit install         # install pre-commit hooks
cp env.sh.example env.sh

Update the environment variables for your dev environment in env.sh.

Authenticate to Google Cloud:

gcloud auth login
gcloud auth application-default login

Run

This repository contains example ML training and prediction pipelines which are explained in this guide.

Build containers: The model/ directory contains the code for custom training and prediction container images, including the model training script at model/training/train.py. You can modify this to suit your own use case. Build the training and prediction container images and push them to Artifact Registry with:

make build [ images="training prediction" ]

Optionally specify the images variable to only build one of the images.

Execute pipelines: Vertex AI Pipelines uses Kubeflow to orchestrate your training steps, so you'll need to:

  1. Compile the pipeline
  2. Build dependent Docker containers
  3. Run the pipeline in Vertex AI

Execute the following command to run through steps 1-3:

make run pipeline=training [ build=<true|false> ] [ compile=<true|false> ] [ cache=<true|false> ] [ wait=<true|false> ] 

The command has the following true/false flags:

  • build - re-build containers for training & prediction code (limit by setting images=training to build only one of the containers)
  • compile - re-compile the pipeline to YAML
  • cache - enable caching of pipeline steps
  • wait - wait for the pipeline run to finish (synchronous) rather than returning once it is submitted (asynchronous)

Shortcuts: The following commands run the training or prediction pipeline and support the same options as run:

make training
make prediction

Test

Unit tests are written with pytest and run on each pull request. To run them locally, execute the following command, optionally setting packages to choose whether the pipelines, the components, or both are tested:

make test [ packages=<pipelines components> ]

Automation

For details on setting up CI/CD, see this guide.

Issues with Vertex AI Custom Code Service Agent

If you run custom training code to train a custom-trained model, the Vertex AI Custom Code Service Agent is used. This agent is created only when you first try to run custom training code, so you can't grant it permissions, such as Artifact Registry Reader, up front. To work around this, you can use this guide. This repo uses a curl command to create a simple custom training job, which triggers the creation of the service agent. You can also edit that code to create the job with gcloud ai custom-jobs create if you prefer.
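
To make the idea of a throwaway custom training job concrete, here is a rough sketch of the request such a job needs. It is not the script used in this repo: the display name, machine type, and image URI are hypothetical, and the URL and body follow the public Vertex AI REST API for creating custom jobs as far as we understand it.

```python
import json
from typing import Any, Dict, Tuple


def build_custom_job_request(
    project_id: str, location: str, image_uri: str
) -> Tuple[str, Dict[str, Any]]:
    """Build the URL and JSON body for a minimal custom training job."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/customJobs"
    )
    body = {
        "displayName": "create-service-agent",  # hypothetical name
        "jobSpec": {
            "workerPoolSpecs": [
                {
                    "machineSpec": {"machineType": "n1-standard-4"},  # assumed
                    "replicaCount": 1,
                    "containerSpec": {"imageUri": image_uri},
                }
            ]
        },
    }
    return url, body


if __name__ == "__main__":
    # Placeholder values: substitute your own project, region, and image.
    url, body = build_custom_job_request(
        "my-dev-gcp-project", "europe-west2", "IMAGE_URI"
    )
    print(url)
    print(json.dumps(body, indent=2))
```

The printed body can be sent as an authenticated POST to the printed URL, for example with curl and an access token from gcloud auth print-access-token.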

Alternatively, Google has another method of triggering the creation of service agents that is currently in pre-GA that can be used instead of the above solution. You can read more about it here.

Putting it all together

For a full walkthrough of the journey from changing the ML pipeline code to having it scheduled and running in production, please see the guide here.

We value your contributions; see this guide for how to contribute to this project.

About

Use the "develop" branch
