
Draft of additional content for the CIRRUS compute system #262

Draft: wants to merge 10 commits into base: main
35 changes: 35 additions & 0 deletions Dockerfile
@@ -0,0 +1,35 @@
FROM docker.io/mambaorg/micromamba:latest

USER root

# Install git and other necessary packages
RUN apt-get update && apt-get install -y git && apt-get clean && rm -rf /var/lib/apt/lists/*

# Clone your repository
ARG REPO_URL=https://github.com/NicholasCote/HPC-Docs-ncote.git
ARG REPO_BRANCH=ncote-cirrus-dev1
WORKDIR /app
RUN git clone --recursive --depth 1 --branch ${REPO_BRANCH} ${REPO_URL} .

# Create environment from conda.yaml
ENV ENV_NAME=mkdocs
RUN micromamba env create -f conda.yaml

# Fix ownership of the repository
RUN chown -R mambauser:mambauser /app

# Switch to non-root user
USER mambauser

# Configure git for mambauser
RUN git config --global --add safe.directory /app
RUN git config --global --add safe.directory '*'

# Configure the shell to run in the conda environment
SHELL ["/usr/local/bin/_entrypoint.sh", "micromamba", "run", "-n", "mkdocs", "/bin/bash", "-c"]

EXPOSE 8000

# Set the entrypoint to run in the conda environment
ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "micromamba", "run", "-n", "mkdocs"]
CMD ["mkdocs", "serve", "--dev-addr=0.0.0.0:8000"]
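
For local testing, the image above can be built and served with Docker (or any compatible engine). This is a minimal sketch; the `hpc-docs-preview` tag is an arbitrary example, and the build argument simply overrides the default already declared in the Dockerfile.

```
# Build the image; the tag name is illustrative, and REPO_BRANCH falls back
# to ncote-cirrus-dev1 if the --build-arg is omitted.
docker build -t hpc-docs-preview --build-arg REPO_BRANCH=ncote-cirrus-dev1 .

# Serve the documentation and browse to http://localhost:8000
docker run --rm -p 8000:8000 hpc-docs-preview
```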
79 changes: 79 additions & 0 deletions docs/compute-systems/cirrus/index.md
@@ -0,0 +1,79 @@
---
short_title: CIRRUS Overview
---

![CIRRUS](media/CIRRUS_Logo_NCARBlue.png#only-light){width="640"}
![CIRRUS](media/CIRRUS_Logo_NCAR_DarkTheme.png#only-dark){width="640"}

# CIRRUS

CIRRUS (Cloud Infrastructure for Remote Research, Universities, and Scientists) is a Kubernetes-based, cloud-native platform hosted at NSF NCAR that provides flexible compute options along with access to other High Performance Computing resources. CIRRUS supplements computing needs that aren't fulfilled by the HPC offering, the public cloud, or what is available to you locally.

-----

## Kubernetes

Kubernetes, often referred to as K8s (with the 8 standing for the number of letters it replaces), is a container orchestration platform. It was originally developed internally at Google and was released as open source in 2014. It has become the industry standard for container orchestration, and the major commercial cloud providers all offer K8s to help provide high availability and load balancing. For example, when a node or pod fails, K8s automatically reschedules the affected pods and spins up new pods to replace what failed. It does all of this without manual intervention and, more often than not, without any interruption to the services hosted on the cluster.

New workloads can also take advantage of services already running on the cluster. Deploying a web application on virtual machines requires a lot of configuration to expose the application on the web. This can be sped up with Infrastructure as Code (IaC) tools like Ansible or Terraform, but in K8s, once a reverse proxy or certificate manager is set up, new workloads can use those services without much custom configuration. This resilience, flexibility, and the open source nature of K8s make it the ideal platform to host most CISL on-premises cloud workloads.
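
As a small illustration of that self-healing behavior, deleting a pod that belongs to a Deployment causes K8s to schedule a replacement on its own; the Deployment label and pod name below are hypothetical placeholders, not workloads on the CIRRUS cluster.

```
# List the pods managed by a hypothetical Deployment labeled app=web
kubectl get pods -l app=web

# Delete one of them (substitute a real pod name from the output above)
kubectl delete pod web-5d89c7f6b9-abcde

# The Deployment controller notices the missing replica and starts a new pod
kubectl get pods -l app=web
```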

-----

## Containers

Containers strip away the frills of a typical operating system and provide a place for an application to run with only the packages that application requires. This creates an efficient deployment that is typically faster and easier to share with others, since dependencies are baked into the image. Containers are platform and hardware agnostic, so they can run wherever a container runtime or engine is available to launch them. For open science and open source software development, containers go a long way toward facilitating both.
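
As a quick sketch of that portability, the same image runs identically on a laptop, a CI runner, or a CIRRUS node because everything it needs ships inside the image; the public `python:3.11-slim` image from Docker Hub is used here purely for illustration.

```
# Run a Python one-liner in a container; no Python installation is required on the host
docker run --rm python:3.11-slim python -c "print('dependencies travel with the image')"
```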

-----

## Quick Start

### Continuous Integration and Continuous Delivery (CI/CD)

CIRRUS utilizes a CI/CD methodology known as GitOps to deploy and manage applications. A collection of YAML files, known as a Helm chart, creates a package that Kubernetes uses to deploy applications based on the desired state set in Git. A Git repository is then connected to Argo CD, the CIRRUS Continuous Delivery tool. Argo CD watches the Git repo for changes and automatically updates the application state to match the contents of the repository. This allows changes to be handled directly via Git and provides version control, history, and an easier means of collaboration.
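
As an illustrative sketch of that workflow, the `argocd` CLI (or the Argo CD web UI) can register a Git repository containing a Helm chart as an application; the application name, repository URL, chart path, and namespace below are hypothetical placeholders rather than real CIRRUS applications.

```
# Hypothetical example: register a Helm chart stored in a Git repo as an Argo CD application
argocd app create my-web-app \
  --repo https://github.com/example-org/my-web-app.git \
  --path chart \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace my-web-app \
  --sync-policy automated

# Check whether the live state in the cluster matches what is declared in Git
argocd app get my-web-app
```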

-----

### Container Registry
CIRRUS utilizes Harbor to provide an open source container registry that sits close to the infrastructure running the containers. A local registry lets us take advantage of the network infrastructure and available bandwidth between hardware for faster pushes and pulls of images locally. Harbor also includes an image scanner that reports any vulnerabilities an image contains, so security concerns can be addressed in the images directly.

The Harbor web interface can be accessed at the following URL: https://hub.k8s.ucar.edu/. Credentials will be your CIT SSO username and password. There is no need to specify a domain.

-----

### JupyterHub
NSF NCAR runs a JupyterHub instance hosted on CIRRUS. It has GPU capabilities, access to datasets stored on GLADE, and parallel computing capabilities via Dask Gateway. Authentication is handled via a GitHub team in the NCAR organization; in order to be added to that team you must be a member of the NCAR GitHub organization.

CIRRUS also provides the ability to host dedicated JupyterHub instances for specialized tutorials. Administrators can be configured in the JupyterHub's Helm chart, and those admins can approve or deny requests to join the JupyterHub instance via the Native Authenticator.

-----

### Binder
Binder is a tool that enables sharing of custom computing environments from the contents of a code repository. For instance, if a repository contains Jupyter Notebooks that explain how to do something, Binder can launch a compute instance that is automatically configured to run the contents of that repository. Refer to the official Binder documentation to learn more about how the service runs or about advanced use cases.

-----

## CIRRUS Hardware (5 Nodes)

| System Information | Node Specifications |
|---|---|
| Manufacturer | Supermicro |
| Model | SYS-120U-TNR |
| CPU Type | Intel Xeon Gold 6326 |
| CPU Speed | 2.90 GHz |
| CPU Cores | 16 |
| RAM (GB) | 512 |
| GPU Model | NVIDIA A2 Tensor Core |
| GPU Cores | 1280 |
| GPU Memory | 16 GB |
| NICs | 2x10G & 4x25G |
| Storage | 2x100GB & 6x1.6TB NVMe |

### Totals

| CPU Cores | RAM | GPU Cores | GPU Mem | Local Storage |
|---|---|---|---|---|
| 80 | 2.5 TB | 6400 | 80 GB | 48 TB|

---

## Status

The cluster is undergoing a rebuild to provide a more robust architecture.
63 changes: 63 additions & 0 deletions docs/compute-systems/cirrus/slas.md
@@ -0,0 +1,63 @@
# Service Level Agreements (SLAs)

This Service Level Agreement (SLA) defines the relationship between the CIRRUS team, which provides the on-premises cloud infrastructure, and its customers, which include UCAR employees, visitors, and external collaborators authorized to use the on-premises cloud resources. NSF NCAR | CISL runs compute, storage, and network hardware in robust data centers at multiple organizational facilities. The on-premises cloud offers users the ability to utilize those highly available, organizationally supported compute resources for approved use cases. This includes access to routable network space and UCAR Domain Name System (DNS) services. These resources supplement computing needs that aren't fulfilled by the HPC offering, public cloud, or what is available locally.

**Primary Services:** Kubernetes Cluster, Argo CD, Harbor, & JupyterHub/Binder

**Service Dependencies:** Server nodes, Networking, GLADE mount

**Audience:** Service Technical Staff, System Administrators, On & Off Site Personnel, and Authorized Affiliates

**Recognized Customers:** On & Off Site Personnel, and Authorized Affiliates

The service is designed to be available 24/7, but support is currently provided only during normal business hours.

## Response Level and Service Levels

### Definitions

**Critical:** Complete loss of a critical service or functionality due to failure of a device, network, or other issue, or intermittent service disruptions that severely degrade the productivity of the user community. No workaround is available that will restore service reliably within one (1) hour. This may include a site-wide security incident.

**Urgent:** Partial or reduced availability of, or accessibility to, a critical service with a significant impact on the productivity of the user community; also complete or intermittent failures of a non-critical service that have a significant impact on the productivity of users. In some cases, service or functionality may be restored reliably by activation of a backup or workaround.

**Regular:** Basic functionality is present; there is a failure of extended functionality, but a workaround is available. This level also covers requests for new functionality, features, or upgrades.

### Agreements

!!! important
There is currently no after-hours support. All response types that occur after hours will be addressed at the start of business.

| Response Level | Business Hours <br> 8:00 - 17:00 M-F | After Hours |
|---|---|---|
| Critical | 2 hours | Start of Business |
| Urgent | 4 hours | Start of Business |
| Regular | Reviewed | Reviewed |

## Backup & Disaster Recovery Policy

Infrastructure as Code is utilized to maintain all the applications hosted on the on-premises cloud. Applications will not be backed up; instead they will be redeployed from code repositories. Argo CD handles the deployment of all applications and is backed up to allow restoration of all projects.

Projects in Argo CD will be backed up when changes are made. These backups can be used to restore configured projects in the event of data loss.

Images stored in Harbor will be backed up to object storage and can be restored from those backups.

## Change Management

Any changes must be requested via a Jira ticket. The process is described on the [ticket creation](../how-to/create-tickets) page. The team's Product Owner will review all new tickets and prioritize them accordingly.

Critical and Urgent issues will be addressed within the agreements defined above. Regular requests will typically be addressed during the team's biweekly standup meetings.

## Contact Information

**Business Hours:** 8:00 - 17:00 MST Monday - Friday

**Primary Contact:** [Nick Cote](mailto:[email protected])

**Secondary Contact:** [Jira Request](https://jira.ucar.edu/secure/CreateIssueDetails!init.jspa?pid=18470&amp;issuetype=10903&amp;summary=User%20Request:)

**Off Hours Contact:** [Nick Cote](mailto:[email protected]) & [Jira Request](https://jira.ucar.edu/secure/CreateIssueDetails!init.jspa?pid=18470&amp;issuetype=10903&amp;summary=User%20Request:)

## Monitoring & Reporting

The infrastructure utilizes Prometheus, Grafana, and Loki to monitor the system, collect logs, and alert the team to any issues that occur.
67 changes: 67 additions & 0 deletions docs/compute-systems/cirrus/users/container-registry/image-mgmt.md
@@ -0,0 +1,67 @@
# Image Management

The Harbor web interface can be accessed at the following URL: [https://hub.k8s.ucar.edu/](https://hub.k8s.ucar.edu/). A link is also present under the Resources tab at the top of this documentation. Credentials will be your CIT SSO username and password. There is no need to specify a domain.

!!! info
Image pulling and pushing uses Docker via the command line. If Docker is not installed on your machine you can find information on how to install it [here](https://docs.docker.com/engine/install/).


## Harbor Login with Docker

If the project that an image belongs to is private, you need to log in to Harbor before you can pull the image. If an image is in a public project, you do not need to be logged in to pull it.

!!! info
If your CIT login does not work, you most likely do not have permission to access Harbor. Please submit a request for Harbor access [here](https://jira.ucar.edu/secure/CreateIssueDetails!init.jspa?pid=18470&issuetype=10905).


The command line syntax to use Docker for logging in to Harbor is as follows:

```
docker login https://hub.k8s.ucar.edu/
```

You will be prompted for a username and password. Use your CIT username and password; no domain needs to be specified. You should get a message that your login was successful. You are now able to pull images from private projects and push images directly to Harbor.

## Pull an image from Harbor

An image is pulled from Harbor using the `docker pull` command with the Harbor URL included in the image name. An example of this command is as follows:

```
docker image pull hub.k8s.ucar.edu/cislcloudpilot/cisl-cloud-base:v1-stable
```

## Push an image to Harbor

!!! info
Before you can push an image to Harbor, a corresponding project must be created in the Harbor interface. The default page on login is Projects and there is a button above the list of existing projects to create a new project. In order to push to an existing project you must be a member of that project with a Role of Developer, Maintainer, or Project Admin. If you need assistance creating projects or with permissions please open a ticket via the link [here](https://jira.ucar.edu/secure/CreateIssueDetails!init.jspa?pid=18470&issuetype=10905).


!!! info
This assumes you have an image built locally that you want to push to Harbor. Information on how to create container images locally can be found in this documentation [here](../K8s/Hosting/web-intro).

To push a built image to Harbor, first tag the image you want to push with your Harbor project information. Once the image is tagged, it can be pushed to Harbor. An example of how to do this can be seen below:

```
docker tag cislcloudpilot/cisl-cloud-base:v1-stable hub.k8s.ucar.edu/cislcloudpilot/cisl-cloud-base:v1-stable
docker push hub.k8s.ucar.edu/cislcloudpilot/cisl-cloud-base:v1-stable
```

The exact syntax used in each of these commands, as it relates to Harbor projects and repositories, can be seen in the code block below:

```
docker tag SOURCE_IMAGE[:TAG] hub.k8s.ucar.edu/PROJECT/REPOSITORY[:TAG]
docker push hub.k8s.ucar.edu/PROJECT/REPOSITORY[:TAG]
```

Your image should now be in the Harbor project, cislcloudpilot in the example above. Harbor contains a vulnerability scanner that will scan your images for any known vulnerabilities and provide a lot of relevant information on what it finds. More information on how to use the scanner can be found at this [link](scanner.ipynb).

If your project is public it can be pulled by anyone on the network. This is a great way to share your work with others in a reproducible fashion.

## Using Robot Accounts

A project in Harbor can have multiple robot accounts with configurable permissions to perform various tasks. When logging in to Harbor from code or automation, a robot account should be used instead of a personal account.

!!! info
In order to add a robot account you must have the role of Project Admin for the associated project.

After logging in to the Harbor UI, [https://hub.k8s.ucar.edu/](https://hub.k8s.ucar.edu/), with CIT credentials, click on the name of the project where the robot account should be added. Once in the project, there are several tabs for project settings, including one named Robot Accounts. Select this tab to navigate to the Robot Accounts settings and click the + NEW ROBOT ACCOUNT button. This opens a new window where the robot account name (the final account name will be robot-{Name}+{Name}), expiration time, description, and permissions can be set. Best practice is to set an expiration time rather than use the never-expire option. Once the account is created, the password, or secret, is displayed in a pop-up; this is the only time it will be shown. Copy it now and store it in a safe location where it is not exposed publicly in plain text. A new secret can be generated by selecting the robot account and using the ACTION dropdown and the REFRESH SECRET option.
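
For non-interactive logins from a script or CI pipeline, the robot account credentials can be passed to `docker login`; the project and account names below are placeholders, and the secret is read from an environment variable rather than typed on the command line.

```
# Hypothetical robot account login; ROBOT_SECRET holds the secret copied at creation time
echo "$ROBOT_SECRET" | docker login hub.k8s.ucar.edu \
  --username 'robot-myproject+ci-push' \
  --password-stdin

# The robot account can now push images to the project it was created in
docker push hub.k8s.ucar.edu/myproject/my-image:v1
```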
8 changes: 8 additions & 0 deletions docs/compute-systems/cirrus/users/container-registry/index.md
@@ -0,0 +1,8 @@
# Container Registry

CIRRUS utilizes [Harbor](https://goharbor.io/) to provide an open source container registry that sits close to the infrastructure running the containers. A local registry lets us take advantage of the network infrastructure and available bandwidth between hardware for faster pushes and pulls of images locally. Harbor also includes an image scanner that reports any vulnerabilities an image contains, so security concerns can be addressed in the images directly.

The Harbor web interface can be accessed at the following URL: [https://hub.k8s.ucar.edu/](https://hub.k8s.ucar.edu/). Credentials will be your CIT SSO username and password. There is no need to specify a domain.

!!! info
Access to Harbor is provided to all UCAR staff who are in Active Directory. It is limited to read-only access to public projects. Only administrators are allowed to create new projects. Please submit a Jira request, using the link at the top of the documentation, if you require access to push images, and include a descriptive name for the project you would like created. The public repositories allow pulling images without being authenticated.
@@ -0,0 +1,15 @@
# Vulnerability Scanner

Harbor has [Trivy](https://trivy.dev/), a vulnerability scanner, built in. Trivy can be used to scan images hosted on Harbor for any vulnerabilities. It provides a very robust list of results with valuable information for patching vulnerable images.

!!! info
This document assumes you have access to a project in Harbor with privileges to scan the images in that project. Documentation on how to accomplish that can be found at this [link](push-pull).


To run the scanner, log in to the [Harbor Web UI](https://hub.k8s.ucar.edu/), navigate to your project by clicking on its name, and then click the name of the image you'd like to scan. This should automatically open the Artifacts tab, which allows you to select the image and scan it. Check the box next to the artifact you want and click the scan button above it. It should look like the image below:

<img src="https://ncar.github.io/cisl-cloud/_static/harbor/harbor-scan.png"/>

The box in the Vulnerabilities column should change to Queued, then Scanning, and finally display a report of the vulnerabilities found. Hovering over the report will show a few more details, but the full report can be viewed by clicking on the artifact name. The full report contains a lot of different information. An example of what the output looks like can be seen in the image below:

<img src="https://ncar.github.io/cisl-cloud/_static/harbor/harbor-scan-results.png"/>