Use k8s jobs for cuda build chains deployment (opendatahub-io#472)
* Use k8s job for cuda-11.0.3 build chain deployment

Add cuda-version=11.0.3 labels to the buildconfig and imagestream

Signed-off-by: Landon LaSmith <[email protected]>

* Restore default serviceaccount group namespace to image-pullers RoleBinding

Signed-off-by: Landon LaSmith <[email protected]>
LaVLaS authored Oct 29, 2021
1 parent 1fb9f35 commit 9ada6be
Showing 12 changed files with 542 additions and 339 deletions.
@@ -3,6 +3,12 @@ apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: 'system:image-pullers'
subjects:
  # This group is always created by default when a new project is created but we have to explicitly include it to
  # prevent image pull authorization denials when images are built from source in the JH server namespace
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: 'system:serviceaccounts:$(namespace)'
  # This will produce a dead group when notebook_destination is not specified
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: 'system:serviceaccounts:$(notebook_destination)'
36 changes: 19 additions & 17 deletions jupyterhub/notebook-images/overlays/cuda-11.0.3/README.md
@@ -1,40 +1,42 @@
# CUDA Build Chain

This overlay contains CUDA build chain to produce CUDA based images for TensorFlow, PyTorch, and Minimal jupyter notebooks.
This overlay contains CUDA build chain to produce CUDA 11.0.3 based images for TensorFlow, PyTorch, and Minimal jupyter notebooks.

## Version Details:
When this overlay is applied, an OpenShift [job](./cuda-build-job.yaml) creates the BuildConfigs and ImageStreams required for the build chain. Once these objects are deployed, the Open Data Hub operator no longer reconciles or redeploys them unless you manually delete the job and the build chain objects. Any changes you make to them therefore persist until you delete the job and all associated build chain objects.
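
To confirm that the job ran and created the build chain, a check along the following lines could be used. This is a minimal sketch and not part of the commit: the function name `check_cuda_build_job` and the `DRY_RUN` variable are hypothetical, the job name is derived from the `cuda-11-0-3-build` name in the job manifest, and it assumes `oc` is already logged in to the project where the overlay was applied.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): show the build chain job,
# its log, and the BuildConfigs and ImageStreams labelled by the job.
# Set DRY_RUN=1 to print the commands instead of running them.
check_cuda_build_job() {
  local version="${1:-11.0.3}"
  # The job name uses dashes instead of dots, e.g. cuda-11-0-3-build.
  local job
  job="cuda-$(echo "$version" | tr . -)-build"
  local run=""
  if [ -n "${DRY_RUN:-}" ]; then
    run="echo"
  fi
  $run oc get job "$job"
  $run oc logs "job/$job"
  $run oc get bc,is -l "cuda-version=${version}"
}
```

Running `DRY_RUN=1 check_cuda_build_job 11.0.3` prints the three `oc` commands without contacting a cluster.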

```
notebook = ">=6.0.2"
jupyterhub = ">=1.3"
jupyterlab = ">=3.0.0"
TensorFlow: v2.4.1
PyTorch: v1.8.0
CUDA: 11.0.3
```
## Build Details:

- [CUDA-ubi8-build-chain](./cuda-ubi8-build-chain.yaml): This yaml contains CUDA build chain which creates the base image which is used by the jupyter notebook images.
The CUDA build chain is stored in the cuda-build-chain [configMap](./cuda-buildchain.configmap.yaml). This configmap contains the yaml files for deploying the BuildConfigs and Imagestreams for the CUDA build chain and GPU notebooks.
- `cuda-ubi8-build-chain.yaml`: This yaml contains CUDA build chain which creates the base image which is used by the jupyter notebook images.

- [gpu-notebook](./gpu-notebook.yaml): This yaml contains CUDA build chain which creates the GPU supported jupyter notebook images like s2i-minimal-gpu-notebook, s2i-tensorflow-gpu-notebook, and s2i-pytorch-gpu-notebook.
- `gpu-notebook.yaml`: This yaml contains CUDA build chain which creates the GPU supported jupyter notebook images like s2i-minimal-gpu-notebook, s2i-tensorflow-gpu-notebook, and s2i-pytorch-gpu-notebook.

## Resource Requirements:

**_NOTE:_** If users don't have quota restrictions then they can remove the resource requirements from the [gpu-notebook](./gpu-notebook.yaml)

### Minimal GPU Notebook

The Minimal notebook requires at least **3GB** of memory at build time because it installs the `jupyterhub`, `jupyterlab`, and `jupyter notebook` packages along with the supported extensions.
We have set **4GB** to leave some headroom.

### TensorFlow GPU Notebook

The TensorFlow notebook requires at least **6GB** of memory at build time because it installs the supported `jupyterlab` and `jupyter notebook` extensions, and `jupyterlab build` needs this much memory.
We have set **6GB** to avoid issues.

### PyTorch GPU Notebook

The PyTorch notebook requires at least **6GB** of memory at build time because it installs the supported `jupyterlab` and `jupyter notebook` extensions, and `jupyterlab build` needs this much memory.
We have set **6GB** to avoid issues.

## Deleting CUDA build objects
The job and all objects it creates carry the `cuda-version=11.0.3` label. Use this label to purge all of the CUDA objects so that the operator can restore the original CUDA build chain:

```
oc delete build -l cuda-version=11.0.3
oc delete bc -l cuda-version=11.0.3
oc delete is -l cuda-version=11.0.3
oc delete cm -l cuda-version=11.0.3
oc delete job -l cuda-version=11.0.3
```
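
The commands above differ only in the resource kind, so the purge can also be sketched as a loop. This is a minimal sketch and not part of the commit: the function name `purge_cuda_objects` and the `DRY_RUN` variable are hypothetical, and it assumes `oc` is already logged in to the project that holds the build chain.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): delete every object that
# carries the cuda-version label, one resource kind at a time.
# Set DRY_RUN=1 to print the commands instead of running them.
purge_cuda_objects() {
  local version="${1:-11.0.3}"
  local kind
  # Same kinds, in the same order, as the README commands.
  for kind in build bc is cm job; do
    if [ -n "${DRY_RUN:-}" ]; then
      echo "oc delete ${kind} -l cuda-version=${version}"
    else
      oc delete "${kind}" -l "cuda-version=${version}"
    fi
  done
}
```

Running `DRY_RUN=1 purge_cuda_objects 11.0.3` first is a cheap way to review what would be deleted before running it for real.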
@@ -0,0 +1,51 @@
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
  name: cuda-11-0-3-build
  labels:
    cuda-version: "$(cuda_version)"
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
        - image: registry.redhat.io/openshift4/ose-cli:v4.7
          volumeMounts:
            - name: cuda-ubi8-build-chain
              mountPath: /tmp/

          # work around unwriteable HOME dir / for unprivileged pods causing OC commands to be slow in pods
          env:
            - name: HOME
              value: /tmp
            - name: BUILD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace

          command:
            - /bin/bash
            - -c
            - |
              set -x
              echo "PWD: $PWD"
              oc create -n ${BUILD_NAMESPACE} -f /tmp/gpu-notebook.yaml
              oc create -n ${BUILD_NAMESPACE} -f /tmp/cuda-ubi8-build-chain.yaml
          imagePullPolicy: IfNotPresent
          name: cuda-11-0-3-build
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      serviceAccount: cuda-11.0.3-build-job
      serviceAccountName: cuda-11.0.3-build-job
      terminationGracePeriodSeconds: 30
      volumes:
        - name: cuda-ubi8-build-chain
          configMap:
            name: cuda-build-chain
            items:
              - key: gpu-notebook.yaml
                path: gpu-notebook.yaml
              - key: cuda-ubi8-build-chain.yaml
                path: cuda-ubi8-build-chain.yaml
@@ -0,0 +1,80 @@
apiVersion: authorization.openshift.io/v1
kind: Role
metadata:
  labels:
    app: cuda-11.0.3-build-job
  name: cuda-11.0.3-build-job
rules:
  - apiGroups:
      - ""
      - build.openshift.io
    resources:
      - builds
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
      - image.openshift.io
    resources:
      - imagestreams
    verbs:
      - create
      - patch
      - update
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
      - secrets
      - events
      - persistentvolumeclaims
      - pods
      - services
      - endpoints
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
      - template.openshift.io
    resources:
      - processedtemplates
      - templateconfigs
      - templateinstances
      - templates
    verbs:
      - create
      - delete
      - deletecollection
      - patch
      - update
  - apiGroups:
      - ""
      - template.openshift.io
    resources:
      - processedtemplates
      - templateconfigs
      - templateinstances
      - templates
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - build.openshift.io
    resources:
      - builds
      - buildconfigs
    verbs:
      - create
      - patch
      - update
      - get
      - list
      - watch
@@ -0,0 +1,12 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cuda-11.0.3-build-job
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cuda-11.0.3-build-job
subjects:
  - kind: ServiceAccount
    name: cuda-11.0.3-build-job
    namespace: $(namespace)
@@ -0,0 +1,4 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cuda-11.0.3-build-job