Various minor doc improvements (#625)
* Remove unused link

* Remove optional dependency include as it needs re-organising/simplifying

* Update storage page to take account of change in #524 (deleting local intermediate data)

* Update memory page to clarify example about setting allowed_mem for local executors

* Improve documentation for setting Zarr compression for intermediate data following #572

* Add separate articles link
tomwhite authored Nov 24, 2024
1 parent eb01803 commit 4d79a26
Showing 7 changed files with 22 additions and 29 deletions.
5 changes: 5 additions & 0 deletions docs/articles.md
@@ -0,0 +1,5 @@
# Articles

[Cubed: Bounded-memory serverless array processing in xarray](https://xarray.dev/blog/cubed-xarray)

[Optimizing Cubed](https://medium.com/pangeo/optimizing-cubed-7a0b8f65f5b7)
10 changes: 5 additions & 5 deletions docs/configuration.md
@@ -95,11 +95,11 @@ These properties can be passed directly to the {py:class}`Spec <cubed.Spec>` con
| Property | Default | Description |
|--------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| `work_dir` | `None` | The directory path (specified as an fsspec URL) used for storing intermediate data. If not set, the user's temporary directory is used. |
| `allowed_mem` | `2GB` | The total memory available to a worker for running a task. This includes any `reserved_mem` that has been set. |
| `reserved_mem` | `100MB` | The memory reserved on a worker for non-data use when running a task |
| `executor_name` | `single-threaded` | The executor for running computations. One of `single-threaded`, `threads`, `processes`, `beam`, `coiled`, `dask`, `lithops`, `modal`. |
| `allowed_mem` | `"2GB"` | The total memory available to a worker for running a task. This includes any `reserved_mem` that has been set. |
| `reserved_mem`     | `"100MB"`         | The memory reserved on a worker for non-data use when running a task.                                                                  |
| `executor_name` | `"single-threaded"` | The executor for running computations. One of `"single-threaded"`, `"threads"`, `"processes"`, `"beam"`, `"coiled"`, `"dask"`, `"lithops"`, `"modal"`. |
| `executor_options` | `None` | Options to pass to the executor on construction. See below for possible options for each executor. |
| `zarr_compressor` | `"default"` | The compressor used by Zarr for intermediate data. If not specified, or set to `"default"`, Zarr will use the default Blosc compressor. If set to `None`, compression is disabled, which can be a good option when using local storage. Use a dictionary (or nested YAML) to configure arbitrary compression using Numcodecs. |
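For example, configuring an arbitrary Numcodecs compressor via nested YAML might look like the following sketch (the `spec` top-level key follows the YAML configuration convention used for the other properties above; the `id`/`level` keys follow the Numcodecs codec-config convention, with Zstd chosen purely for illustration):

```yaml
spec:
  # Set to null instead to disable compression entirely
  # (a good option when using local storage).
  zarr_compressor:
    id: "zstd"
    level: 1
```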

### Executor options

@@ -177,7 +177,7 @@ Note that `batch_size` is not currently supported for Lithops.

| Property | Default | Description |
|------------------------------|---------|-------------------------------------------------------------------------------------|
| `cloud` | `aws` | The cloud to run on. One of `aws` or `gcp`. |
| `cloud` | `"aws"` | The cloud to run on. One of `"aws"` or `"gcp"`. |
| `use_backups` | `True` | Whether to use backup tasks for mitigating stragglers. |
| `batch_size` | `None` | Number of input tasks to submit to be run in parallel. The default is not to batch. |
| `compute_arrays_in_parallel` | `False` | Whether arrays are computed one at a time or in parallel. |
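As a sketch, options like these are passed to the executor via `executor_options` in nested YAML (the `modal` executor name and the option values shown are illustrative assumptions, not recommendations):

```yaml
spec:
  executor_name: "modal"
  executor_options:
    cloud: "aws"
    use_backups: true
    compute_arrays_in_parallel: false
```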
1 change: 0 additions & 1 deletion docs/getting-started/index.md
@@ -8,7 +8,6 @@ Have a look at [Cubed: an introduction](https://cubed-dev.github.io/cubed/cubed-
---
maxdepth: 2
---
why-cubed
installation
demo
```
8 changes: 1 addition & 7 deletions docs/getting-started/installation.md
@@ -25,10 +25,4 @@ Cubed has many optional dependencies, which can be installed in sets for different
$ python -m pip install "cubed[lithops]" # Install optional dependencies for the lithops executor
$ python -m pip install "cubed[modal]" # Install optional dependencies for the modal executor

To see the full list of which packages are installed with which options see `[project.optional_dependencies]` in `pyproject.toml`:
```{eval-rst}
.. literalinclude:: ../../pyproject.toml
:language: ini
:start-at: [project.optional-dependencies]
:end-before: [project.urls]
```
See the [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md) for details on installing Cubed to run on different executors.
7 changes: 1 addition & 6 deletions docs/index.md
@@ -27,6 +27,7 @@ array-api
configuration
why-cubed
related-projects
articles
```

```{toctree}
@@ -39,9 +40,3 @@ operations
computation
contributing
```

## Articles

[Cubed: Bounded-memory serverless array processing in xarray](https://xarray.dev/blog/cubed-xarray)

[Optimizing Cubed](https://medium.com/pangeo/optimizing-cubed-7a0b8f65f5b7)
8 changes: 5 additions & 3 deletions docs/user-guide/memory.md
@@ -8,14 +8,16 @@ The various memory settings and values are illustrated in the following diagram

## Allowed memory

You should set ``allowed_mem`` to the maximum amount of memory that is available to the Cubed runtime. When running locally this should be no more than the amount of memory you have available on your machine. For cloud services it should be the amount of memory that the container runtime is configured to use.
You should set `allowed_mem` to the maximum amount of memory that is available to each executor running tasks. When running locally this is typically the amount of memory you have available on your machine divided by the number of cores. For example, on an 8 core machine with 16GB of memory, setting `allowed_mem` to 2GB (which is actually the default) would be appropriate.

In this example we set the allowed memory to 2GB:
For cloud services `allowed_mem` should be the amount of memory that the container runtime is configured to use.

In this example we increase the allowed memory to 4GB from the default of 2GB:

```python
import cubed

spec = cubed.Spec(allowed_mem="2GB")
spec = cubed.Spec(allowed_mem="4GB")
```
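The per-core arithmetic described above can be sketched as follows (the helper name and the even split across cores are illustrative assumptions, not part of Cubed's API):

```python
def per_core_allowed_mem(total_mem_bytes: int, cores: int) -> str:
    """Split a machine's memory evenly across cores to pick an allowed_mem value."""
    gb = total_mem_bytes // cores // 1_000_000_000
    return f"{gb}GB"

# 16GB machine with 8 cores -> "2GB" per worker, matching the default
print(per_core_allowed_mem(16_000_000_000, 8))
```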

## Projected memory
12 changes: 5 additions & 7 deletions docs/user-guide/storage.md
Expand Up @@ -2,6 +2,10 @@

Cubed uses a filesystem working directory to store intermediate data (in the form of Zarr arrays) when running a computation. By default this is a local temporary directory, which is appropriate for the default local executor.

## Local storage

Cubed deletes intermediate data only when the main Python process running the computation exits. If you run many computations in one process (in a Jupyter notebook, for example), you risk running out of local storage. By default, Cubed stores intermediate data in directories named `$TMPDIR/cubed-*`; these can be removed manually with regular file commands like `rm`.
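A minimal sketch of such a manual clean-up from Python (the function name is illustrative, and it assumes no computation is currently writing to these directories):

```python
import glob
import os
import shutil
import tempfile


def remove_cubed_intermediates(tmp_root=None):
    """Remove leftover Cubed intermediate directories ($TMPDIR/cubed-*)."""
    root = tmp_root or tempfile.gettempdir()
    removed = []
    for path in glob.glob(os.path.join(root, "cubed-*")):
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path)
    return removed
```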

## Cloud storage

When using a cloud service, the working directory should be set to a cloud storage directory in the same cloud region that the executor runtimes are in. In this case the directory is specified as a [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) URL, such as `s3://cubed-tomwhite-temp`. This is how you would set it using a {py:class}`Spec <cubed.Spec>` object:
@@ -14,13 +18,7 @@ spec = cubed.Spec(work_dir="s3://cubed-tomwhite-temp")

Note that you need to create the bucket before running a computation.

## Deleting intermediate data

Cubed does not delete any intermediate data that it writes, so you should ensure it is cleared out so you don't run out of space or incur unnecessary cloud storage costs.

For a local temporary directory, the operating system will typically remove old files, but if you are running a lot of jobs in a short period of time you may need to manually clean them up. The directories that Cubed creates by default are named `$TMPDIR/cubed-*`; these can be removed with regular file commands like `rm`.

On cloud object stores the data does not get removed automatically. Rather than removing the old data manually, it's convenient to use a dedicated bucket for intermediate data with a lifecycle rule that deletes data after a certain time.
On cloud object stores, intermediate data does *not* get removed automatically, so you should clear it out to avoid unnecessary storage costs. Rather than removing old data manually, it's convenient to use a dedicated bucket for intermediate data with a lifecycle rule that deletes data after a certain time.

To set up a lifecycle rule:

