From 4d79a2654e104706baf9779ef348c79e3fad895c Mon Sep 17 00:00:00 2001
From: Tom White
Date: Sun, 24 Nov 2024 11:32:14 +0000
Subject: [PATCH] Various minor doc improvements (#625)

* Remove unused link
* Remove optional dependency include as it needs re-organising/simplifying
* Update storage page to take account of change in #524 (deleting local intermediate data)
* Update memory page to clarify example about setting allowed_mem for local executors
* Improve documentation for setting Zarr compression for intermediate data following #572
* Add separate articles link
---
 docs/articles.md                     |  5 +++++
 docs/configuration.md                | 10 +++++-----
 docs/getting-started/index.md        |  1 -
 docs/getting-started/installation.md |  8 +-------
 docs/index.md                        |  7 +------
 docs/user-guide/memory.md            |  8 +++++---
 docs/user-guide/storage.md           | 12 +++++-------
 7 files changed, 22 insertions(+), 29 deletions(-)
 create mode 100644 docs/articles.md

diff --git a/docs/articles.md b/docs/articles.md
new file mode 100644
index 00000000..b33e7999
--- /dev/null
+++ b/docs/articles.md
@@ -0,0 +1,5 @@
+# Articles
+
+[Cubed: Bounded-memory serverless array processing in xarray](https://xarray.dev/blog/cubed-xarray)
+
+[Optimizing Cubed](https://medium.com/pangeo/optimizing-cubed-7a0b8f65f5b7)
diff --git a/docs/configuration.md b/docs/configuration.md
index 7b4087d2..6a7b5df7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -95,11 +95,11 @@ These properties can be passed directly to the {py:class}`Spec <cubed.Spec>` con
 
 | Property | Default | Description |
 |--------------------|-------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
 | `work_dir` | `None` | The directory path (specified as an fsspec URL) used for storing intermediate data. If not set, the user's temporary directory is used. |
-| `allowed_mem` | `2GB` | The total memory available to a worker for running a task. This includes any `reserved_mem` that has been set. |
-| `reserved_mem` | `100MB` | The memory reserved on a worker for non-data use when running a task |
-| `executor_name` | `single-threaded` | The executor for running computations. One of `single-threaded`, `threads`, `processes`, `beam`, `coiled`, `dask`, `lithops`, `modal`. |
+| `allowed_mem` | `"2GB"` | The total memory available to a worker for running a task. This includes any `reserved_mem` that has been set. |
+| `reserved_mem` | `"100MB"` | The memory reserved on a worker for non-data use when running a task |
+| `executor_name` | `"single-threaded"` | The executor for running computations. One of `"single-threaded"`, `"threads"`, `"processes"`, `"beam"`, `"coiled"`, `"dask"`, `"lithops"`, `"modal"`. |
 | `executor_options` | `None` | Options to pass to the executor on construction. See below for possible options for each executor. |
-
+| `zarr_compressor` | `"default"` | The compressor used by Zarr for intermediate data. If not specified, or set to `"default"`, Zarr will use the default Blosc compressor. If set to `None`, compression is disabled, which can be a good option when using local storage. Use a dictionary (or nested YAML) to configure arbitrary compression using Numcodecs. |
 
 ### Executor options
@@ -177,7 +177,7 @@ Note that `batch_size` is not currently supported for Lithops.
 
 | Property | Default | Description |
 |------------------------------|---------|-------------------------------------------------------------------------------------|
-| `cloud` | `aws` | The cloud to run on. One of `aws` or `gcp`. |
+| `cloud` | `"aws"` | The cloud to run on. One of `"aws"` or `"gcp"`. |
 | `use_backups` | `True` | Whether to use backup tasks for mitigating stragglers. |
 | `batch_size` | `None` | Number of input tasks to submit to be run in parallel. The default is not to batch. |
 | `compute_arrays_in_parallel` | `False` | Whether arrays are computed one at a time or in parallel. |
diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md
index efae7bf2..6776b7b8 100644
--- a/docs/getting-started/index.md
+++ b/docs/getting-started/index.md
@@ -8,7 +8,6 @@ Have a look at [Cubed: an introduction](https://cubed-dev.github.io/cubed/cubed-
 ---
 maxdepth: 2
 ---
-why-cubed
 installation
 demo
 ```
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 7882db1d..30ece875 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -25,10 +25,4 @@ Cubed has many optional dependencies, which can be installed in sets for differe
     $ python -m pip install "cubed[lithops]"  # Install optional dependencies for the lithops executor
     $ python -m pip install "cubed[modal]"    # Install optional dependencies for the modal executor
 
-To see the full list of which packages are installed with which options see `[project.optional_dependencies]` in `pyproject.toml`:
-```{eval-rst}
-.. literalinclude:: ../../pyproject.toml
-   :language: ini
-   :start-at: [project.optional-dependencies]
-   :end-before: [project.urls]
-```
+See the [examples](https://github.com/cubed-dev/cubed/blob/main/examples/README.md) for details on installing Cubed to run on different executors.
diff --git a/docs/index.md b/docs/index.md
index e6514426..2ca39d72 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,6 +27,7 @@ array-api
 configuration
 why-cubed
 related-projects
+articles
 ```
 
 ```{toctree}
@@ -39,9 +40,3 @@ operations
 computation
 contributing
 ```
-
-## Articles
-
-[Cubed: Bounded-memory serverless array processing in xarray](https://xarray.dev/blog/cubed-xarray)
-
-[Optimizing Cubed](https://medium.com/pangeo/optimizing-cubed-7a0b8f65f5b7)
diff --git a/docs/user-guide/memory.md b/docs/user-guide/memory.md
index 06dd4b9f..1e78f4e5 100644
--- a/docs/user-guide/memory.md
+++ b/docs/user-guide/memory.md
@@ -8,14 +8,16 @@ The various memory settings and values are illustrated in the following diagram
 
 ## Allowed memory
 
-You should set ``allowed_mem`` to the maximum amount of memory that is available to the Cubed runtime. When running locally this should be no more than the amount of memory you have available on your machine. For cloud services it should be the amount of memory that the container runtime is configured to use.
+You should set `allowed_mem` to the maximum amount of memory that is available to each executor running tasks. When running locally this is typically the amount of memory you have available on your machine divided by the number of cores. For example, on an 8 core machine with 16GB of memory, setting `allowed_mem` to 2GB (which is actually the default) would be appropriate.
 
-In this example we set the allowed memory to 2GB:
+For cloud services `allowed_mem` should be the amount of memory that the container runtime is configured to use.
+
+In this example we increase the allowed memory to 4GB from the default of 2GB:
 
 ```python
 import cubed
 
-spec = cubed.Spec(allowed_mem="2GB")
+spec = cubed.Spec(allowed_mem="4GB")
 ```
 
 ## Projected memory
diff --git a/docs/user-guide/storage.md b/docs/user-guide/storage.md
index 2d36faee..2a6cd5d2 100644
--- a/docs/user-guide/storage.md
+++ b/docs/user-guide/storage.md
@@ -2,6 +2,10 @@
 
 Cubed uses a filesystem working directory to store intermediate data (in the form of Zarr arrays) when running a computation. By default this is a local temporary directory, which is appropriate for the default local executor.
 
+## Local storage
+
+Cubed will delete intermediate data only when the main Python process running the computation exits. If you run many computations in one process (in a Jupyter Notebook, for example), then you could risk running out of local storage. The directories where intermediate data is stored that Cubed creates by default are named `$TMPDIR/cubed-*`; these can be removed manually with regular file commands like `rm`.
+
 ## Cloud storage
 
 When using a cloud service, the working directory should be set to a cloud storage directory in the same cloud region that the executor runtimes are in. In this case the directory is specified as a [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/) URL, such as `s3://cubed-tomwhite-temp`. This is how you would set it using a {py:class}`Spec <cubed.Spec>` object:
@@ -14,13 +18,7 @@ spec = cubed.Spec(work_dir="s3://cubed-tomwhite-temp")
 ```
 
 Note that you need to create the bucket before running a computation.
 
-## Deleting intermediate data
-
-Cubed does not delete any intermediate data that it writes, so you should ensure it is cleared out so you don't run out of space or incur unnecessary cloud storage costs.
-
-For a local temporary directory, the operating system will typically remove old files, but if you are running a lot of jobs in a short period of time you may need to manually clean them up. The directories that Cubed creates by default are named `$TMPDIR/cubed-*`; these can be removed with regular file commands like `rm`.
-
-On cloud object stores the data does not get removed automatically. Rather than removing the old data manually, it's convenient to use a dedicated bucket for intermediate data with a lifecycle rule that deletes data after a certain time.
+On cloud object stores intermediate data does *not* get removed automatically, so you should ensure it is cleared out so you don't incur unnecessary cloud storage costs. Rather than removing the old data manually, it's convenient to use a dedicated bucket for intermediate data with a lifecycle rule that deletes data after a certain time.
 
 To set up a lifecycle rule:
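The storage.md hunk above ends by introducing a lifecycle rule. A minimal sketch of what such a rule can look like for an S3 bucket, assuming boto3 is used (any S3 client or the cloud console works equally well); the bucket name `cubed-tomwhite-temp` is taken from the `work_dir` example above, and the one-day expiry is an illustrative assumption, not part of the patch:

```python
# Illustrative sketch: an S3 lifecycle configuration that expires every object
# after a number of days, suitable for a dedicated intermediate-data bucket.
# The rule ID, bucket name, and expiry period are assumptions for the example.


def lifecycle_configuration(days: int = 1) -> dict:
    """Return an S3 lifecycle configuration expiring all objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": "delete-cubed-intermediate-data",
                "Filter": {"Prefix": ""},  # match every object in the bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},  # S3 expires objects in whole days
            }
        ]
    }


if __name__ == "__main__":
    import boto3  # assumption: boto3 is installed and AWS credentials are configured

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket="cubed-tomwhite-temp",  # the dedicated work_dir bucket
        LifecycleConfiguration=lifecycle_configuration(days=1),
    )
```

The rule only needs to be set once per bucket; a GCS bucket can be given an equivalent rule with an age condition and a Delete action.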