This document assumes that you have either:
- OpenTofu installed on your system.
- Kind (Kubernetes In-Docker-Environment) installed and configured on your system.
- Docker and Podman installed on your system, along with NVIDIA drivers to run GPU-enabled containers.
- Ubuntu/Linux system with NVIDIA GPU hardware
- NVIDIA drivers properly installed and configured
To follow the guidelines outlined in this document, ensure that you have:
- Follow the official installation instructions for OpenTofu.
- Follow the official installation instructions for Kind.
- Follow the official installation instructions for nvKind.
- This will provide a way to containerize and deploy Kubernetes environments without having to set up a physical cluster.
- Install kind with gpus kind-with-gpus.
- Install Docker using the package manager of your operating system (e.g.
apt
for Ubuntu-based systems,brew
for macOS, etc.). - OR, install Podman, which allows you to run containers in isolated environments without needing root privileges.
- For NVIDIA drivers, ensure that they are properly installed and configured on your system.
This setup supports the following environments:
- Uses Kind as the container runtime environment for running Kubernetes clusters.
- Mounts NVIDIA drivers to the containers requiring a GPU.
- Provides an efficient way to develop, test, and deploy applications that rely on GPUs.
- Uses Docker or Podman as the container runtime environment for running containers with NVIDIA drivers.
- Supports running containers that require GPU acceleration without relying on Kind's cluster setup.
To use this setup, follow these steps:
- clone this repository to your local machine.
- define the necessary variables in an
AWS Secret Manager
file (e.g.,cluster_name
,http_ingress_port
... - run the following command to install Kind and create a Kubernetes cluster:
make local-deploy
This will deploy the Kind cluster with the specified configuration, and ArgoCD will be installed on the cluster with the preconfigured github applications. 4. Access the ArgoCD web interface
- Navigate to
http://<your-cluster-ip>:<http_ingress_port>/argocd
in your browser.
- Log in using the default credentials.
First, install Go 1.24.4:
# Download Go
wget https://go.dev/dl/go1.24.4.linux-amd64.tar.gz
# Install Go to local directory (avoiding system-wide installation)
mkdir -p ~/.local
tar -C ~/.local -xzf go1.24.4.linux-amd64.tar.gz
# Add Go to PATH
export PATH=$PATH:$HOME/.local/go/bin
# Verify installation
go version
Since some tools are easier to install via Homebrew, set up Homebrew for Linux:
# Create Homebrew directory
mkdir -p $HOME/.homebrew
# Clone Homebrew
git clone https://github.com/Homebrew/brew $HOME/.homebrew
# Clone homebrew-core tap
mkdir -p ~/.homebrew/Homebrew/Library/Taps/homebrew
git clone https://github.com/Homebrew/homebrew-core ~/.homebrew/Homebrew/Library/Taps/homebrew/homebrew-core
# Set up Homebrew environment variables
export HOMEBREW_PREFIX="$HOME/.homebrew"
export HOMEBREW_REPOSITORY="$HOME/.homebrew"
export HOMEBREW_CELLAR="$HOME/.homebrew/Cellar"
export PATH="$HOMEBREW_PREFIX/bin:$HOMEBREW_PREFIX/sbin:$PATH"
# Verify installation
brew --version
Install essential Kubernetes tools:
# Install kind (Kubernetes in Docker)
brew install kind
# Install kubectl and helm
brew install kubectl helm
# Create kubectl alias for convenience
alias k=kubectl
Set up Docker to work with NVIDIA GPUs:
# Test GPU availability
nvidia-smi -L
# Test Docker with NVIDIA runtime
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu:20.04 nvidia-smi -L
# Configure NVIDIA container toolkit
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
# Restart Docker services
sudo systemctl restart docker
sudo systemctl restart containerd
# Verify services are running
sudo systemctl status docker
sudo systemctl status containerd
Install NVIDIA's nvkind tool for GPU-enabled Kubernetes clusters:
# Install nvkind
go install github.com/NVIDIA/nvkind/cmd/nvkind@latest
# Add Go bin directory to PATH
export PATH=$PATH:$HOME/go/bin
# Create GPU-enabled Kubernetes cluster
nvkind cluster create
Deploy the NVIDIA GPU Operator to manage GPU resources in Kubernetes:
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
# Install GPU Operator (with host drivers disabled since we're using nvkind)
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator --set driver.enabled=false
# Verify GPU operator pods
k get pods -n gpu-operator
Test that GPU resources are available in the cluster:
# Run interactive GPU test container
k run -it --rm gpu-test --image=nvidia/cuda:12.0.1-runtime-ubuntu22.04 -- /bin/bash
# Inside the container, you can run:
# nvidia-smi -L
Clone and deploy the NASA R2O project:
# Clone the repository
git clone https://github.com/NASA-IMPACT/r2o-deploy.git
cd r2o-deploy
# Switch to GPU support branch
git checkout feature/support-cluster-with-gpu
# Set up environment configuration
cp .env.example .env
nano .env # Edit configuration as needed
# Generate SSH key if needed
ssh-keygen -t rsa
# Source environment variables
set -a; source .env; set +a
# Navigate to local setup directory
cd local-setup
# Generate SSL certificates
openssl req -x509 -newkey rsa:2048 -nodes -keyout key.pem -out cert.pem -days 365
# Install OpenTofu (Terraform alternative)
brew install opentofu
# Initialize and deploy infrastructure
make init
make plan
make apply
make deploy
Verify the deployment:
# Check all pods
k get pods
# Check GPU operator pods specifically
k get pods -n gpu-operator
# Test GPU access again
k run -it --rm gpu-test --image=nvidia/cuda:12.0.1-runtime-ubuntu22.04 -- /bin/bash
To clean up the environment:
# List clusters
kind get clusters
# Delete the cluster (replace with actual cluster name)
kind delete cluster --name nvkind-rqbcv
- nvkind is NVIDIA's tool for creating GPU-enabled Kubernetes clusters using kind
- The GPU Operator manages GPU resources and drivers within Kubernetes
- The driver.enabled=false setting is important when using nvkind since it manages drivers differently
- The NASA R2O project appears to be a research-to-operations deployment system with GPU support
- OpenTofu is used as a Terraform alternative for infrastructure management
- Ensure NVIDIA drivers are properly installed on the host
- Verify Docker can access GPUs before creating the cluster
- Check that all services (Docker, containerd) are running after configuration changes
- Make sure environment variables are properly sourced when working with the R2O project