Merge pull request #228 from jnywong/dask-gateway
Add end-user documentation for dask-gateway
jnywong authored May 23, 2024
2 parents 7740c16 + f035611 commit fa5a52e
Showing 11 changed files with 241 additions and 1 deletion.
Binary file added _static/videos/dask-compute-kmeans.mp4
2 changes: 1 addition & 1 deletion index.md
@@ -36,7 +36,7 @@ Covers end-user workflows that are common for cloud-native workflows with intera
user/topics/policy/index
user/topics/data/index.md
user/howto/specify-unlisted-image.md
user/howto/index.md
```

## Administer the hub
9 changes: 9 additions & 0 deletions user/howto/index.md
@@ -0,0 +1,9 @@
# Features

This section describes how to use some of the features available on 2i2c hubs. Note that features are available on a custom basis, since they can incur extra cloud costs and require extra engineering work to configure and maintain.

:::{toctree}
specify-unlisted-image
launch-dask-gateway-cluster
:::

229 changes: 229 additions & 0 deletions user/howto/launch-dask-gateway-cluster.md
@@ -0,0 +1,229 @@
# Launch a dask-gateway cluster

This guide shows you how to launch a Dask Gateway cluster for parallel and distributed computing.

:::{contents}
:depth: 2
:local:
:::

## What is Dask Gateway?

Dask Gateway allows users to launch clusters for scaling computations efficiently with more CPU and memory on cloud resources, without requiring direct access to the underlying Kubernetes backend of the 2i2c hub. Configuration (such as efficient cluster resourcing, authentication, and security settings) is handled automatically, providing a consistent user experience across the hub.

:::{admonition} What's the difference between Dask and Dask Gateway?
:class: note
When using Dask on a local machine, you will often instantiate a cluster using
```python
from dask.distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()
```
For a 2i2c Dask hub, we instead instantiate a cluster through Dask Gateway, which requests cloud resources via Kubernetes from a shared, centrally managed cluster environment.
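A minimal sketch of the equivalent pattern with Dask Gateway (each step is covered in detail below):
```python
from dask_gateway import Gateway

gateway = Gateway()              # connects using the hub's default configuration
cluster = gateway.new_cluster()  # workers run on cloud machines, not locally
client = cluster.get_client()
```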
:::

## Usage

This section roughly follows the [Dask Gateway Docs – Usage](https://gateway.dask.org/usage.html) together with the [Training on Large Datasets](https://examples.dask.org/machine-learning/training-on-large-datasets.html) example.

:::{warning}
Avoid large, long-running, idle clusters, which consume cloud resources and incur costs. Only use a cluster when you need it and shut it down when not in use.
:::

### Connect to a dask-gateway server

Create a gateway client to communicate with the dask-gateway server.

```python
from dask_gateway import Gateway
gateway = Gateway() # Uses values configured for the 2i2c Dask hub (recommended)
```

:::{note}
Leave the argument of the `Gateway()` constructor empty to use the default configuration for your 2i2c Dask hub (parameters such as `address`, `proxy_address`, `auth`, etc. will be automatically supplied to the `Gateway()` object via environment variables).
:::
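To verify the connection, you can list any clusters you already have running (this returns an empty list if none exist):

```python
gateway.list_clusters()  # e.g. [] when no clusters are running yet
```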

### Configure cluster options

Specify options for your gateway cluster with the options widget.

```python
options = gateway.cluster_options()
options
```

:::{image} media/dask-options.png
:alt: Screenshot of an interactive options widget to configure gateway cluster options.
:align: center
:::

#### Cluster options

Instance type running worker containers
: This defaults to the machine type `n2-highmem-16` for Google Cloud and `r5.4xlarge` for AWS, with a maximum of 16 CPUs available.

Resources per worker container
: Select 1/2/4/8/16 CPUs and corresponding memory requests from a dropdown menu.

Image
: This defaults to the user image deployed on the Dask hub.

Environment variables (YAML)
: Set environment variables for both the workers and schedulers using YAML, e.g. `ENV_VAR: my_environment_variable`.

Idle cluster terminated after (minutes)
: This defaults to 30 minutes. Consider cloud computing resources and costs for your hub when setting this value.
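
The same options can also be set programmatically instead of through the widget. The exact field names depend on how your hub is configured, so treat the names below as illustrative assumptions and check the widget for the ones your hub exposes:

```python
options = gateway.cluster_options()

# Field names are hub-specific; these are hypothetical examples.
options.worker_cores = 4                            # hypothetical: CPUs per worker
options.environment = {"MY_ENV_VAR": "my_value"}    # hypothetical: env vars for workers/schedulers
```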

### Create and scale gateway cluster

Pass the cluster options to a new gateway cluster.

```python
cluster = gateway.new_cluster(options)
cluster
```

:::{note}
If a new gateway server needs to be started, then this process can take around 5 minutes to complete.
:::

#### Manual scaling

Manually scale the cluster size to a fixed number of workers.

1. Expand the *Manual scaling* dropdown in the cluster widget.
1. Select the number of workers.
1. Click {guilabel}`Scale` to confirm.

:::{image} media/dask-scaling-manual.png
:alt: Screenshot of an interactive cluster widget to configure manual worker scaling.
:align: center
:::
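
You can also set a fixed number of workers programmatically:

```python
cluster.scale(4)  # request exactly 4 workers
```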

#### Adaptive scaling

Adapt the cluster size dynamically based on the current load. This scales up the number of workers when needed and scales down to save resources when the cluster is not actively computing.

1. Expand the *Adaptive scaling* dropdown in the cluster widget.
1. Select the minimum and maximum number of workers.
1. Click {guilabel}`Adapt` to confirm.

:::{image} media/dask-scaling-adaptive.png
:alt: Screenshot of an interactive cluster widget to configure adaptive worker scaling.
:align: center
:::
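
The programmatic equivalent is:

```python
cluster.adapt(minimum=1, maximum=10)  # let the cluster resize between 1 and 10 workers
```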

(dask:connect-gateway)=
### Connect to the gateway cluster

Connect to the gateway cluster to start submitting work to your workers.

```python
client = cluster.get_client()
client
```

Note the dashboard address, of the form `/services/dask-gateway/clusters/...`; you will need it to connect to the Dask dashboard later.

### Connect Dask dashboard to Dask JupyterLab extension

Connect to a [Dask dashboard](https://docs.dask.org/en/latest/dashboard.html) to monitor computations with the JupyterLab extension.

1. Copy the dashboard address from {ref}`Connect to the gateway cluster<dask:connect-gateway>` or from running the command `client.dashboard_link`.
1. Click the ![Dask icon](media/dask-icon.png) Dask icon in the left sidebar.
1. In the search box at the top of the panel, paste in the full dashboard URL of the form
```bash
https://<hub-name>.<community-name>.2i2c.cloud/<dashboard-address>
```
1. Select the diagnostic plots to visualize, e.g. worker memory, CPU, task stream.

### Run computations on the cluster

Import the Python packages for the [Train Models on Large Datasets](https://examples.dask.org/machine-learning/training-on-large-datasets.html) example.

```python
import dask_ml.datasets
import dask_ml.cluster
import matplotlib.pyplot as plt
```

Generate random datasets for k-means clustering analysis.

```python
X, y = dask_ml.datasets.make_blobs(
    n_samples=10_000_000,
    chunks=1_000_000,
    random_state=0,
    centers=3,
)
X = X.persist()
X
```

Cluster the data with the k-means algorithm.

```python
km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
km.fit(X)
```

<video width="100%" autoplay loop muted>
<source src="../../../_static/videos/dask-compute-kmeans.mp4" type="video/mp4" />
<p>Video showing the Dask dashboard while computing k-means clustering.</p>
</video>

<br>

Plot the results.

```python
fig, ax = plt.subplots()
ax.scatter(
    X[::10000, 0], X[::10000, 1],
    marker='.', c=km.labels_[::10000],
    cmap='viridis', alpha=0.25,
);
```

:::{image} media/dask-plot-result.png
:alt: Plot of the k-means clustering results.
:align: center
:width: 75%
:::

### Shut down the cluster

Shut down the cluster when not in use to avoid wasting computational resources and incurring unnecessary costs.

```python
cluster.close()
```
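
If you have lost the handle to a running cluster (for example, after restarting your notebook kernel), you can reconnect to it by name and shut it down; a sketch:

```python
for report in gateway.list_clusters():      # reports for all of your running clusters
    cluster = gateway.connect(report.name)  # re-attach by cluster name
    cluster.shutdown()                      # stop its workers and scheduler
```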

## FAQs

- *What are the best practices for using Dask to scale computations?*

Take a look at the [Dask - Best Practices](https://docs.dask.org/en/latest/best-practices.html) documentation for a guide to using Dask efficiently.

- *Why can't I choose the instance type for my cluster?*

The default is `n2-highmem-16` for Google Cloud and `r5.4xlarge` for AWS, with a maximum of 16 CPUs available on each machine. These instance types are chosen to provide a consistent node pool across all 2i2c Dask hubs and more efficient cluster resourcing.

- *Should I use manual or adaptive scaling?*

We suggest starting with adaptive scaling, which dynamically scales up the number of workers when necessary and scales down to save resources when not actively computing. Once you are confident about your code's performance, you can use manual scaling for fine-grained control over the number of workers.

- *How do I choose the number of workers/resources per worker?*

Like any other scaling problem, the answer to this question is specific to the application code. As suggested in the [Dask - Best Practices](https://docs.dask.org/en/latest/best-practices.html), start with a small subset of your data and evaluate performance using the Dask dashboard before scaling resources to the full dataset.

- *Why are there limited options for CPU and memory resources per worker?*

The limited options of 1/2/4/8/16 CPUs and associated memory requests are chosen to make the most efficient use of the available node pool. Otherwise, a user could, for example, accidentally request 51% of a machine's resources per worker, so that only one worker fits on an entire machine. If two such workers were needed for a computation, this would require two machines, increasing cloud costs while potentially leaving 49% of each machine idle.

- *Why did my kernel die?*

There can be many reasons for this, but the most common one with Dask is an out-of-memory error caused by loading more data into memory than the available RAM. Try using Dask to scale computations on smaller datasets and write intermediate results to disk in the {ref}`/tmp folder<filesystem:tmp>`.

- *Which image should I use for the software environment?*

A Dask-supported user image that includes `dask-gateway` and `dask-labextension` is available by default on a 2i2c Dask hub; this ensures compatibility between the Dask user server, scheduler and workers. The `dask-gateway-server` component is not required in the user image; it is installed in the backend by 2i2c and kept up to date.

- *How do I install software packages to my Dask hub environment?*

During code development, use the [Dask distributed built-in scheduler plugin](https://distributed.dask.org/en/latest/plugins.html?highlight=pipinstall#built-in-scheduler-plugins) `PipInstall` to pip install a set of packages on all Dask workers (see the [Pangeo docs](https://pangeo.io/cloud.html#dask-software-environment) for an example, and the sketch below). Note that this increases the time to start up each worker. For production, we advise adding the required software packages to the [Pangeo Docker image](https://pangeo-docker-images.readthedocs.io/en/latest/) for best performance, ensuring that the key package `dask-gateway` is included (and `dask-labextension` if you require the Dask dashboard); see [Customize a community-maintained upstream image](../../admin/howto/environment/customize-image) for guidance. Deploy your custom image to the hub by either contacting your hub admin to [Configure the hub with the Configurator](../../admin/howto/configurator) or by using [The "Other" choice](./specify-unlisted-image.md#the-other-choice) in the image selection dropdown, if available.
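
A minimal sketch of the `PipInstall` plugin approach (the package names here are placeholders):

```python
from dask.distributed import PipInstall

# Install extra packages on every worker as it starts (placeholder package names).
plugin = PipInstall(packages=["xarray", "s3fs"], pip_options=["--upgrade"])
client.register_worker_plugin(plugin)
```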
Binary file added user/howto/media/dask-icon.png
Binary file added user/howto/media/dask-options.png
Binary file added user/howto/media/dask-plot-result.png
Binary file added user/howto/media/dask-scaling-adaptive.png
Binary file added user/howto/media/dask-scaling-manual.png
1 change: 1 addition & 0 deletions user/topics/data/filesystem.md
@@ -37,6 +37,7 @@ This is a *readonly* directory - anybody on the hub can *access* and *read from*
The hub administrator may choose to distribute shared materials via this directory.
The `shared` directory is not intended as a way for hub users to share data with each other.

(filesystem:tmp)=
## The `/tmp` Directory

Any directory outside of ``/home/jovyan`` is ephemeral on Cloud-hosted JupyterHubs. This means if you
1 change: 1 addition & 0 deletions user/topics/data/index.md
@@ -15,6 +15,7 @@ Some quick recommendations for how to handle data:
For more information, see the sections below.

```{toctree}
:maxdepth: 2
filesystem
git
sharing
