Add examples to docs (#630)
tomwhite authored Nov 29, 2024
1 parent 8781b5f commit 23ddc75
Showing 15 changed files with 163 additions and 105 deletions.
4 changes: 2 additions & 2 deletions docs/configuration.md
Original file line number Diff line number Diff line change
@@ -44,7 +44,7 @@ All arrays in any given computation must share the same `spec` instance.

A YAML file is a good way to encapsulate the configuration in a single file that lives outside the Python program.
It's a useful way to package up the settings for running using a particular executor, so it can be reused.
The Cubed [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md) use YAML files for this reason.
The Cubed [examples](examples/index.md) use YAML files for this reason.

```yaml
spec:
@@ -166,7 +166,7 @@ Note that there is currently no way to set a timeout for the Dask executor.
| Property | Default | Description |
|------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `retries` | 2 | The number of times to retry a task if it fails. |
| `timeout` | `None` | Tasks that take longer than the timeout will be automatically killed and retried. Defaults to the timeout specified when [deploying the lithops runtime image](https://lithops-cloud.github.io/docs/source/cli.html#lithops-runtime-deploy-runtime-name). This is 180 seconds in the [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md). |
| `timeout` | `None` | Tasks that take longer than the timeout will be automatically killed and retried. Defaults to the timeout specified when [deploying the lithops runtime image](https://lithops-cloud.github.io/docs/source/cli.html#lithops-runtime-deploy-runtime-name). This is 180 seconds in the [examples](https://github.com/cubed-dev/cubed/blob/main/examples/lithops/aws/README.md). |
| `use_backups` | | Whether to use backup tasks for mitigating stragglers. Defaults to `True` only if `work_dir` is a filesystem supporting atomic writes (currently a cloud store like S3 or GCS). |
| `compute_arrays_in_parallel` | `False` | Whether arrays are computed one at a time or in parallel. |
| Other properties | N/A | Other properties will be passed as keyword arguments to the [`lithops.executors.FunctionExecutor`](https://lithops-cloud.github.io/docs/source/api_futures.html#lithops.executors.FunctionExecutor) constructor. |
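For instance, some of these properties might be set in the `executor_options` section of a YAML configuration file, along the following lines (the values shown are illustrative, not recommendations):

```yaml
spec:
  executor_name: "lithops"
  executor_options:
    retries: 2
    timeout: 180
    use_backups: true
```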
85 changes: 85 additions & 0 deletions docs/examples/basic-array-ops.md
@@ -0,0 +1,85 @@
# Basic array operations

The following examples show how to run a few basic Array API operations on Cubed arrays.

## Adding two small arrays

The first example adds two small 4x4 arrays together, and is useful for checking that the runtime is working.

```{eval-rst}
.. literalinclude:: ../../examples/add-asarray.py
```

Paste the code into a file called `add-asarray.py`, or [download](https://github.com/cubed-dev/cubed/blob/main/examples/add-asarray.py) from GitHub, then run with:

```shell
python add-asarray.py
```

If successful, it will print a 4x4 array:

```
[[ 2 4 6 8]
[10 12 14 16]
[18 20 22 24]
[26 28 30 32]]
```

## Adding two larger arrays

The next example generates two random 20GB arrays and then adds them together.

```{eval-rst}
.. literalinclude:: ../../examples/add-random.py
```

Paste the code into a file called `add-random.py`, or [download](https://github.com/cubed-dev/cubed/blob/main/examples/add-random.py) from GitHub, then run with:

```shell
python add-random.py
```

This example demonstrates how we can use callbacks to gather information about the computation.

- `RichProgressBar` shows a progress bar for the computation as it is running.
- `TimelineVisualizationCallback` produces a plot (after the computation has completed) showing the timeline of events in the task lifecycle.
- `HistoryCallback` produces various stats about the computation once it has completed.

The plots and stats are written to a timestamped subdirectory of the `history` directory. You can open the latest plot with

```shell
open $(ls -d history/compute-* | tail -1)/timeline.svg
```
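Note that `open` is macOS-specific; on Linux, `xdg-open` plays a similar role. As a portable alternative, a small Python sketch (assuming the timestamped `history/compute-*` layout described above) can locate the latest plot:

```python
from pathlib import Path

# Timestamped run directories sort chronologically by name,
# so the lexicographically last one is the most recent run.
runs = sorted(Path("history").glob("compute-*"))
if runs:
    print(runs[-1] / "timeline.svg")
else:
    print("no runs found under history/")
```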

## Matmul

The next example generates two random 5GB arrays and then multiplies them together. This is a more intensive computation than addition, and will take a few minutes to run locally.

```{eval-rst}
.. literalinclude:: ../../examples/matmul-random.py
```

Paste the code into a file called `matmul-random.py`, or [download](https://github.com/cubed-dev/cubed/blob/main/examples/matmul-random.py) from GitHub, then run with:

```shell
python matmul-random.py
```

## Trying different executors

You can run these scripts using different executors by setting environment variables to control the Cubed configuration.

For example, this will use the `processes` executor to run the example:

```shell
CUBED_SPEC__EXECUTOR_NAME=processes python add-random.py
```
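The double underscore denotes nesting: `CUBED_SPEC__EXECUTOR_NAME` maps to the `spec.executor_name` configuration key. As a rough illustration of this convention (a hypothetical sketch, not Cubed's actual implementation, which delegates the parsing to its configuration library):

```python
import os

def env_to_nested(prefix="CUBED"):
    """Collect prefix_* environment variables into a nested dict,
    treating '__' as a separator between nesting levels."""
    config = {}
    for key, value in os.environ.items():
        if not key.startswith(prefix + "_"):
            continue
        parts = key[len(prefix) + 1:].lower().split("__")
        node = config
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return config

# Example: the same setting as the shell command above
os.environ["CUBED_SPEC__EXECUTOR_NAME"] = "processes"
print(env_to_nested())
```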

For cloud executors, it's usually best to put all of the configuration in one YAML file, and set the `CUBED_CONFIG` environment variable to point to it:

```shell
export CUBED_CONFIG=/path/to/lithops/aws/cubed.yaml
python add-random.py
```
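For example, a minimal `cubed.yaml` for Lithops on AWS might look like this (the bucket name and memory value are placeholders; see the configuration docs for the full set of options):

```yaml
spec:
  work_dir: "s3://cubed-example-tmp"  # placeholder bucket
  allowed_mem: "2GB"
  executor_name: "lithops"
```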

You can read more about how [configuration](../configuration.md) works in Cubed in general, and find detailed steps for running on a particular cloud service in the [cloud set up instructions](#cloud-set-up).
28 changes: 28 additions & 0 deletions docs/examples/how-to-run.md
@@ -0,0 +1,28 @@
# How to run

## Local machine

All the examples can be run on your laptop, so you can try them out in a familiar environment before moving to the cloud.
No extra set up is necessary in this case.

(cloud-set-up)=
## Cloud set up

If you want to run using a cloud executor, first read <project:#which-cloud-service>.

Then follow the instructions for your chosen executor runtime from the table below. They assume that you have cloned the Cubed GitHub repository locally so that you have access to files needed for setting up the cloud executor.

```shell
git clone https://github.com/cubed-dev/cubed
cd cubed/examples
cd lithops/aws # or whichever executor/cloud combination you are using
```

| Executor | Cloud | Set up instructions |
|-----------|--------|------------------------------------------------|
| Lithops | AWS | [lithops/aws/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/lithops/aws/README.md) |
| | Google | [lithops/gcp/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/lithops/gcp/README.md) |
| Modal | AWS | [modal/aws/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/modal/aws/README.md) |
| | Google | [modal/gcp/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/modal/gcp/README.md) |
| Coiled | AWS | [coiled/aws/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/coiled/aws/README.md) |
| Beam | Google | [dataflow/README.md](https://github.com/cubed-dev/cubed/blob/main/examples/dataflow/README.md) |
12 changes: 12 additions & 0 deletions docs/examples/index.md
@@ -0,0 +1,12 @@
# Examples

Various examples demonstrating what you can do with Cubed.

```{toctree}
---
maxdepth: 2
---
how-to-run
basic-array-ops
pangeo
```
21 changes: 21 additions & 0 deletions docs/examples/pangeo.md
@@ -0,0 +1,21 @@
# Pangeo

## Notebooks

The following example notebooks demonstrate the use of Cubed with Xarray to tackle some challenging Pangeo workloads:

1. [Pangeo Vorticity Workload](https://github.com/cubed-dev/cubed/blob/main/examples/pangeo-1-vorticity.ipynb)
2. [Pangeo Quadratic Means Workload](https://github.com/cubed-dev/cubed/blob/main/examples/pangeo-2-quadratic-means.ipynb)
3. [Pangeo Transformed Eulerian Mean Workload](https://github.com/cubed-dev/cubed/blob/main/examples/pangeo-3-tem.ipynb)
4. [Pangeo Climatological Anomalies Workload](https://github.com/cubed-dev/cubed/blob/main/examples/pangeo-4-climatological-anomalies.ipynb)

## Running the notebook examples

Before running these notebook examples, you will need to install some additional dependencies (besides Cubed).

```shell
conda install rich pydot flox cubed-xarray
```

- `cubed-xarray` is necessary to wrap Cubed arrays as Xarray DataArrays or Xarray Datasets.
- `flox` supports efficient groupby operations in Xarray.
- `pydot` allows plotting of the Cubed execution plan.
- `rich` shows the progress of array operations within callbacks applied to Cubed plan operations.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -31,7 +31,7 @@ Cubed is horizontally scalable and stateless, and can scale to multi-TB datasets
:caption: For users
getting-started
user-guide/index
Examples <https://github.com/tomwhite/cubed/tree/main/examples/README.md>
examples/index
api
array-api
configuration
2 changes: 1 addition & 1 deletion docs/user-guide/diagnostics.md
@@ -91,7 +91,7 @@ The timeline callback will write a graphic `timeline.svg` to a directory with th
```

### Examples in use
See the [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md) for more information about how to use them.
See the [examples](../examples/index.md) for more information about how to use them.

## Memray

3 changes: 2 additions & 1 deletion docs/user-guide/executors.md
@@ -12,6 +12,7 @@ The `processes` executor also runs on a single machine, and uses all the cores o

There is a third local executor called `single-threaded` that runs tasks sequentially in a single thread, and is intended for testing on small amounts of data.

(which-cloud-service)=
## Which cloud service executor should I use?

When it comes to scaling out, there are a number of executors that work in the cloud.
@@ -39,4 +40,4 @@ spec = cubed.Spec(
)
```

A default spec may also be configured using a YAML file. The [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md) show this in more detail for all of the executors described above.
A default spec may also be configured using a YAML file. The [examples](#cloud-set-up) show this in more detail for all of the executors described above.
91 changes: 2 additions & 89 deletions examples/README.md
@@ -1,92 +1,5 @@
# Examples

## Running on a local machine
This directory contains Cubed examples in the form of Python scripts and Jupyter notebooks. There are also instructions for setting up Cubed executors to run on various cloud services.

The `processes` executor is the recommended executor for running on a single machine, since it can use all the cores on the machine.

## Which cloud service executor should I use?

When it comes to scaling out, there are a number of executors that work in the cloud.

[**Lithops**](https://lithops-cloud.github.io/) is the executor we recommend for most users, since it has had the most testing so far (~1000 workers).
If your data is in Amazon S3 then use Lithops with AWS Lambda, and if it's in GCS use Lithops with Google Cloud Functions. You have to build a runtime environment as a part of the setting up process.

[**Modal**](https://modal.com/) is very easy to get started with because it handles building a runtime environment for you automatically (note that it requires that you [sign up](https://modal.com/signup) for a free account). **At the time of writing, Modal does not guarantee that functions run in any particular cloud region, so it is not currently recommended that you run large computations since excessive data transfer fees are likely.**

[**Coiled**](https://www.coiled.io/) is also easy to get started with ([sign up](https://cloud.coiled.io/signup)). It uses [Coiled Functions](https://docs.coiled.io/user_guide/usage/functions/index.html) and has a 1-2 minute overhead to start a cluster.

[**Google Cloud Dataflow**](https://cloud.google.com/dataflow) is relatively straightforward to get started with. It has the highest overhead for worker startup (minutes compared to seconds for Modal or Lithops), and although it has only been tested with ~20 workers, it is a mature service and therefore should be reliable for much larger computations.

## Set up

Follow the instructions for setting up Cubed to run on your executor runtime:

| Executor | Cloud | Set up instructions |
|-----------|--------|------------------------------------------------|
| Processes | N/A | `pip install 'cubed[diagnostics]'` |
| Lithops | AWS | [lithops/aws/README.md](lithops/aws/README.md) |
| | Google | [lithops/gcp/README.md](lithops/gcp/README.md) |
| Modal | AWS | [modal/aws/README.md](modal/aws/README.md) |
| | Google | [modal/gcp/README.md](modal/gcp/README.md) |
| Coiled | AWS | [coiled/aws/README.md](coiled/aws/README.md) |
| Beam | Google | [dataflow/README.md](dataflow/README.md) |

## Examples

The `add-asarray.py` script is a small example that adds two small 4x4 arrays together, and is useful for checking that the runtime is working.
Export `CUBED_CONFIG` as described in the set up instructions, then run the script. This is for running on the local machine using the `processes` executor:

```shell
export CUBED_CONFIG=$(pwd)/processes/cubed.yaml
python add-asarray.py
```

This is for Lithops on AWS:

```shell
export CUBED_CONFIG=$(pwd)/lithops/aws/cubed.yaml
python add-asarray.py
```

If successful it should print a 4x4 array.

The other examples are run in a similar way:

```shell
export CUBED_CONFIG=...
python add-random.py
```

and

```shell
export CUBED_CONFIG=...
python matmul-random.py
```

These will take longer to run as they operate on more data.

The last two examples use `TimelineVisualizationCallback` which produce a plot showing the timeline of events in the task lifecycle, and `HistoryCallback` to produce stats about memory usage.
The plots are SVG files and are written in the `history` directory in a directory with a timestamp. Open the latest one with

```shell
open $(ls -d history/compute-* | tail -1)/timeline.svg
```

The memory usage stats are in a CSV file which you can view with


```shell
open $(ls -d history/compute-* | tail -1)/stats.csv
```

## Running the notebook examples

Before running these notebook examples, you will need to install some additional dependencies (besides Cubed).

`mamba install rich pydot flox cubed-xarray`

`cubed-xarray` is necessary to wrap Cubed arrays as Xarray DataArrays or Xarray Datasets.
`flox` is for supporting efficient groupby operations in Xarray.
`pydot` allows plotting the Cubed execution plan.
`rich` is for showing progress of array operations within callbacks applied to Cubed plan operations.
See the [documentation](https://cubed-dev.github.io/cubed/examples/index.html) for details.
2 changes: 1 addition & 1 deletion examples/coiled/aws/README.md
@@ -25,4 +25,4 @@ Before running the examples, first change to the top-level examples directory (`
export CUBED_CONFIG=$(pwd)/coiled/aws
```

Then you can run the examples described [there](../../README.md).
Then you can run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).
2 changes: 1 addition & 1 deletion examples/lithops/aws/README.md
@@ -41,7 +41,7 @@ Before running the examples, first change to the top-level examples directory (`
export CUBED_CONFIG=$(pwd)/lithops/aws
```

Then you can run the examples described [there](../../README.md).
Then you can run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).

## Cleaning up

2 changes: 1 addition & 1 deletion examples/lithops/gcp/README.md
@@ -41,7 +41,7 @@ Before running the examples, first change to the top-level examples directory (`
export CUBED_CONFIG=$(pwd)/lithops/gcp
```

Then you can run the examples described [there](../../README.md).
Then you can run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).

## Cleaning up

10 changes: 4 additions & 6 deletions examples/matmul-random.py
@@ -12,18 +12,16 @@

if __name__ == "__main__":
# 200MB chunks
a = cubed.random.random((50000, 50000), chunks=(5000, 5000))
b = cubed.random.random((50000, 50000), chunks=(5000, 5000))
c = xp.astype(a, xp.float32)
d = xp.astype(b, xp.float32)
e = xp.matmul(c, d)
a = cubed.random.random((25000, 25000), chunks=(5000, 5000))
b = cubed.random.random((25000, 25000), chunks=(5000, 5000))
c = xp.matmul(a, b)

progress = RichProgressBar()
hist = HistoryCallback()
timeline_viz = TimelineVisualizationCallback()
# use store=None to write to temporary zarr
cubed.to_zarr(
e,
c,
store=None,
callbacks=[progress, hist, timeline_viz],
)
2 changes: 1 addition & 1 deletion examples/modal/aws/README.md
@@ -28,4 +28,4 @@ Before running the examples, first change to the top-level examples directory (`
export CUBED_CONFIG=$(pwd)/modal/aws
```

Then you can run the examples described [there](../../README.md).
Then you can run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).
2 changes: 1 addition & 1 deletion examples/modal/gcp/README.md
@@ -28,4 +28,4 @@ Before running the examples, first change to the top-level examples directory (`
export CUBED_CONFIG=$(pwd)/modal/gcp
```

Then you can run the examples described [there](../../README.md).
Then you can run the examples in the [docs](https://cubed-dev.github.io/cubed/examples/index.html).
