In this example, we implement a pipeline to automate execution of the simple noop transform
In this tutorial, we will show the following:
- How to write the
noop
transform automation pipeline, leveraging KFP components. - How to compile a pipeline and deploy it to KFP
- How to execute pipeline and view execution results
Note: the project and the explanation below are based on KFPv1
Overall implementation roughly contains 5 major sections:
- Imports
- Components definition - definition of the main steps of our pipeline
- Input parameters definition
- Pipeline wiring - definition of the sequence of invocation (with parameter passing) of participating components
- Additional configuration
import kfp.compiler as compiler
import kfp.components as comp
import kfp.dsl as dsl
from kfp_support.workflow_support.utils import (
ONE_HOUR_SEC,
ONE_WEEK_SEC,
ComponentUtils,
)
from kubernetes import client as k8s_client
Our pipeline includes 4 steps - compute execution parameters, create Ray cluster, submit and watch Ray job, clean up Ray cluster. For each step we have to define a component that will execute them:
# components
base_kfp_image = "quay.io/dataprep1/data-prep-kit/kfp-data-processing:0.0.2"
# compute execution parameters. Here different tranforms might need different implementations. As
# a result, instead of creating a component we are creating it in place here.
compute_exec_params_op = comp.func_to_container_op(
func=ComponentUtils.default_compute_execution_params, base_image=base_kfp_image
)
# create Ray cluster
create_ray_op = comp.load_component_from_file("../../../kfp_ray_components/createRayComponent.yaml")
# execute job
execute_ray_jobs_op = comp.load_component_from_file("../../../kfp_ray_components/executeRayJobComponent.yaml")
# clean up Ray
cleanup_ray_op = comp.load_component_from_file("../../../kfp_ray_components/cleanupRayComponent.yaml")
# Task name is part of the pipeline name, the ray cluster name and the job name in DMF.
TASK_NAME: str = "noop"
Note: here we are using shared components described in this document for create_ray_op
,
execute_ray_jobs_op
and cleanup_ray_op
, while compute_exec_params_op
component is built inline, because it might
differ significantly. For "simple" pipeline cases we can use the
default implementation,
while, for example for exact dedup, we are using a very specialized one.
The input parameters section defines all the parameters required for the pipeline execution:
# Ray cluster
ray_name: str = "noop-kfp-ray", # name of Ray cluster
ray_head_options: str = '{"cpu": 1, "memory": 4, "image_pull_secret": "",\
"image": "' + task_image + '" }',
ray_worker_options: str = '{"replicas": 2, "max_replicas": 2, "min_replicas": 2, "cpu": 2, "memory": 4, "image_pull_secret": "",\
"image": "' + task_image + '" }',
server_url: str = "http://kuberay-apiserver-service.kuberay.svc.cluster.local:8888",
# data access
data_s3_config: str = "{'input_folder': 'test/noop/input/', 'output_folder': 'test/noop/output/'}",
data_s3_access_secret: str = "s3-secret",
data_max_files: int = -1,
data_num_samples: int = -1,
# orchestrator
actor_options: str = "{'num_cpus': 0.8}",
pipeline_id: str = "pipeline_id",
code_location: str = "{'github': 'github', 'commit_hash': '12345', 'path': 'path'}",
# noop parameters
noop_sleep_sec: int = 10,
# additional parameters
additional_params: str = '{"wait_interval": 2, "wait_cluster_ready_tmout": 400, "wait_cluster_up_tmout": 300, "wait_job_ready_tmout": 400, "wait_print_tmout": 30, "http_retries": 5}',
The parameters used here are as follows:
- ray_name: name of the Ray cluster
- ray_head_options: head node options, containing the following:
- cpu - number of cpus
- memory - memory
- image - image to use
- image_pull_secret - image pull secret
- ray_worker_options: worker node options (we here are using only 1 worker pool), containing the following:
- replicas - number of replicas to create
- max_replicas - max number of replicas
- min_replicas - min number of replicas
- cpu - number of cpus
- memory - memory
- image - image to use
- image_pull_secret - image pull secret
- server_url - server url
- additional_params: additional (support) parameters, containing the following:
- wait_interval - wait interval for API server, sec
- wait_cluster_ready_tmout - time to wait for cluster ready, sec
- wait_cluster_up_tmout - time to wait for cluster up, sec
- wait_job_ready_tmout - time to wait for job ready, sec
- wait_print_tmout - time between prints, sec
- http_retries - http retries for API server calls
- data_s3_access_secret - s3 access secret
- data_s3_config - s3 configuration
- data_max_files - max files to process
- data_num_samples - num samples to process
- actor_options - actor options
- pipeline_id - pipeline id
- code_location - code location
- noop_sleep_sec - noop sleep time
Note that here we are specifying initial values for all parameters that will be propagated to the workflow UI (see below)
Note Parameters are defining both S3 and Lakehouse configuration, but only one at a time can be used.
Now, when all components and input parameters are defined, we can implement pipeline wiring defining sequence of component execution and parameters submitted to every component.
# create clean_up task
clean_up_task = cleanup_ray_op(ray_name=ray_name, run_id=dsl.RUN_ID_PLACEHOLDER, server_url=server_url)
ComponentUtils.add_settings_to_component(clean_up_task, 60)
# pipeline definition
with dsl.ExitHandler(clean_up_task):
# compute execution params
compute_exec_params = compute_exec_params_op(
worker_options=ray_worker_options,
actor_options=actor_options,
)
ComponentUtils.add_settings_to_component(compute_exec_params, ONE_HOUR_SEC * 2)
# start Ray cluster
ray_cluster = create_ray_op(
ray_name=ray_name,
run_id=dsl.RUN_ID_PLACEHOLDER,
ray_head_options=ray_head_options,
ray_worker_options=ray_worker_options,
server_url=server_url,
additional_params=additional_params,
)
ComponentUtils.add_settings_to_component(ray_cluster, ONE_HOUR_SEC * 2)
ray_cluster.after(compute_exec_params)
# Execute job
execute_job = execute_ray_jobs_op(
ray_name=ray_name,
run_id=dsl.RUN_ID_PLACEHOLDER,
additional_params=additional_params,
# note that the parameters below are specific for NOOP transform
exec_params={
"data_s3_config": data_s3_config,
"data_max_files": data_max_files,
"data_num_samples": data_num_samples,
"num_workers": compute_exec_params.output,
"worker_options": actor_options,
"pipeline_id": pipeline_id,
"job_id": dsl.RUN_ID_PLACEHOLDER,
"code_location": code_location,
"noop_sleep_sec": noop_sleep_sec,
},
exec_script_name=EXEC_SCRIPT_NAME,
server_url=server_url,
)
ComponentUtils.add_settings_to_component(execute_job, ONE_WEEK_SEC)
ComponentUtils.set_s3_env_vars_to_component(execute_job, data_s3_access_secret)
execute_job.after(ray_cluster)
Here we first create cleanup_task
and the use it as an
exit handler which will be
invoked either the steps into it succeeded or failed.
Then we create each individual component passing it required parameters and specify execution sequence, for example
(ray_cluster.after(compute_exec_params)
).
The final thing that we need to do is set some pipeline global configuration:
# Configure the pipeline level to one week (in seconds)
dsl.get_pipeline_conf().set_timeout(ONE_WEEK_SEC)
To compile pipeline execute make build
command in the same directory where your pipeline is
The project provides instructions and deployment automation to run all components in an all-inclusive fashion on a single machine using a Kind cluster. However, this topology is not suitable for processing medium and large datasets, and deployment should be carried out on a real Kubernetes or OpenShift cluster. Therefore, we recommend using Kind cluster for only for local testing and debugging, not production loads. For production loads use a real Kubernetes cluster.
You can create a Kind cluster with all required software installed using the following command:
make setup
We tested Kind cluster installation on multiple platforms, including Intel Mac, AMD Mac (see this), WSL on Windows, RHEL and Ubuntu. Additional platform can be used, but might require additional configuration and testing.
Alternatively you can deploy pipeline to the existing Kubernetes cluster.
Deployment on an existing cluster requires less pre-installed software Only the following programs should be manually installed:
- Helm 3.10.0 or greater must be installed and configured on your machine.
- Kubectl 1.26 or newer must be installed on your machine, and be able to connect to the external cluster. For OpenShift clusters OpenShift CLI oc can be used instead.
In order to execute data transformers on the remote cluster, the following packages should be installed on the Kubernetes cluster:
- KubeFlow Pipelines (KFP). Currently, we use upstream Argo-based KFP v1.
- KubeRay controller and KubeRay API Server
You can install the software from their repositories, or you can use our installation scripts.
If your local kubectl is configured to connect to the external cluster do the following:
export EXTERNAL_CLUSTER=1
make setup
-
In addition, you should configure external access to the KFP UI (
svc/ml-pipeline-ui
in thekubeflow
ns) and the Ray Server API (svc/kuberay-apiserver-service
in thekuberay
ns). Depends on your cluster and its deployment it can be LoadBalancer services, Ingresses or Routes. -
Optionally, you can upload the test data into the MinIO Object Store, deployed as part of KFP. In order to do this, please provide external access to the Minio (
svc/minio-service
in thekubeflow
ns) and execute the following commands:
export MINIO_SERVER=<Minio external URL>
kubectl apply -f kind/hack/s3_secret.yaml
kind/hack/populate_minio.sh
Once the cluster is up, go to the kfp endpoint (localhost:8080/kfp/
for Kind cluster), which will bring up
KFP UI, see below:
Click on the Upload pipeline
link and follow instructions on the screen to upload your file (noop_wf.yaml
) and
name pipeline noop. Once this is done, you should see something as follows:
Before we can run the pipeline we need to create required secrets (one for image loading in case of secured
registry and one for S3 access). As KFP is deployed in kubeflow
namespace, workflow execution will happen
there as well, which means that secrets have to be created there as well.
Once this is done we can execute the workflow.
On the pipeline page (above) click on the create run
button. You will see the list of the parameters, that you
can redefine or use the default values that we specified above. After that,
go to the bottom of the page and click the start
button
This will start workflow execution. Once it completes you will see something similar to below
Note that the log (on the left) has the complete execution log.
Additionally, the log is saved to S3 (location is denoted but the last line in the log)
If you are using Kind cluster, you can delete it running the following command:
make clean
Note that this command has to run from the project kind subdirectory