From b60b9a2a1809c89eccda8c0054e3642520d39f5d Mon Sep 17 00:00:00 2001
From: Bouwe Andela <b.andela@esciencecenter.nl>
Date: Mon, 9 Mar 2026 14:10:45 +0100
Subject: [PATCH] Add docs on configuring Dask

---
 changelog/591.docs.md                    |   1 +
 docs/how-to-guides/control-memory-use.md | 100 +++++++++++++++++++++++
 docs/how-to-guides/executors.md          |   7 ++
 mkdocs.yml                               |   1 +
 4 files changed, 109 insertions(+)
 create mode 100644 changelog/591.docs.md
 create mode 100644 docs/how-to-guides/control-memory-use.md

diff --git a/changelog/591.docs.md b/changelog/591.docs.md
new file mode 100644
index 000000000..49c5afbd7
--- /dev/null
+++ b/changelog/591.docs.md
@@ -0,0 +1 @@
+Added a how-to guide on controlling memory use and parallism during diagnostic execution.
diff --git a/docs/how-to-guides/control-memory-use.md b/docs/how-to-guides/control-memory-use.md
new file mode 100644
index 000000000..72f81182e
--- /dev/null
+++ b/docs/how-to-guides/control-memory-use.md
@@ -0,0 +1,100 @@
+# How to control memory use and parallism
+
+The diagnostics packages used by the REF all use [Dask](https://dask.org/) to
+process data that is larger than memory in parallel. By default, Dask
+uses its threaded [scheduler](https://docs.dask.org/en/stable/scheduling.html),
+which may not be optimal for more complicated computations, and it configures
+this threaded scheduler to use as many worker threads as there are CPU cores on
+the machine. Because the REF typically runs multiple executors in parallel
+(see [Executors](executors.md)), and if unconfigured, each executor uses as many
+threads as there are CPU cores, this can lead to excessive memory use and too
+much parallism, which can cause the system to run out of memory or become slow
+because of excessive context switching. Inefficient scheduling by the threaded
+scheduler can also lead to excessive memory use and/or slow computations.
+Therefore, it is highly recommended that you take a moment to configure Dask
+for your system. For an in-depth introduction to these topics, see the Dask documentation on
+[configuration](https://docs.dask.org/en/latest/configuration.html) and
+[scheduling](https://docs.dask.org/en/latest/scheduling.html).
+
+## Configuring ESMValTool
+
+[ESMValCore](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/), the
+framework powering [ESMValTool](https://docs.esmvaltool.org), works best with
+the Dask Distributed scheduler. It is recommended to set `max_parallel_tasks`
+(an ESMValCore setting), to a low number, e.g. 1, 2 or 3, because only one
+[ESMValCore preprocessing task](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/recipe/overview.html#the-diagnostics-section-defines-tasks) will submit jobs to the Distributed
+scheduler at a time to avoid overloading the workers. Therefore, the following
+settings are recommended for ESMValCore:
+
+```yaml
+max_parallel_tasks: 2
+dask:
+  use: local_distributed
+  profiles:
+    local_distributed:
+      cluster:
+        type: distributed.LocalCluster
+        n_workers: 2
+        threads_per_worker: 2
+        memory_limit: 4GiB
+```
+
+These settings should be put in a file with the extension `.yaml` in the
+directory `~/.config/esmvaltool`, for example: `~/.config/esmvaltool/dask.yaml`.
+
+With the settings above, the total memory per REF diagnostic execution will be
+`n_workers * memory_limit` = 8GB.
+It is recommended to use at least 4GB of RAM per
+[Dask Distributed worker](https://distributed.dask.org/en/latest/worker.html).
+Some diagnostics may be able to run with 2GB per worker, but probably not all of them.
+You can tune the total memory / CPU use by specifying the number of workers.
+The number of CPU cores used will be `n_workers * threads_per_worker`.
+
+Note that the REF may run multiple executors in parallel, and each executor
+running an ESMValTool diagnostic will use the resources specified above.
+
+More information on how to configure ESMValCore is available in its
+[documentation](https://docs.esmvaltool.org/projects/ESMValCore/en/latest/quickstart/configure.html).
+
+/// tip | ESMValTool users
+If you are using ESMValTool outside of the REF on the same computer, it is highly
+recommended that you create a separate ESMValTool configuration directory for
+the version of ESMValTool used by the REF to avoid conflicts, e.g. accidentally
+using input data that is not managed by the REF. You can do this by setting the
+`ESMVALTOOL_CONFIG_DIR` environment variable to a different directory,
+e.g. `~/.config/esmvaltool-ref`, and then creating the Dask configuration file
+in that directory, e.g. `~/.config/esmvaltool-ref/dask.yaml` with the settings
+described above.
+///
+
+# Configuring PMP and ILAMB/IOMB
+
+Both [ILAMB/IOMB](https://ilamb3.readthedocs.io) and
+[PMP](https://pcmdi.github.io/pcmdi_metrics/) use Dask through Xarray, but they
+do not expose their own Dask configuration options. Therefore, you need to
+configure Dask globally for these packages by creating a Dask configuration
+file, e.g. at `~/.config/dask/config.yaml`.
+
+While testing the REF, we have seen occasional crashes when running ILAMB/IOMB
+and PMP diagnostics with the threaded scheduler, so we recommend using the
+synchronous scheduler for these diagnostics by adding the following content
+to the global Dask configuration file:
+
+```yaml
+scheduler: synchronous
+```
+
+For faster processing, you can can use the threaded scheduler with a limited
+number of worker threads (adjust to your system's resources):
+
+```yaml
+scheduler: threads
+num_workers: 4
+```
+
+Note that the REF may run multiple executors in parallel, and each executor
+running a PMP diagnostic will use the resources specified above.
+
+/// note | ILAMB/IOMB diagnostics
+The ILAMB/IOMB diagnostics are currently [restricted to the synchronous scheduler](https://github.com/Climate-REF/climate-ref/blob/24ada711091f6821793c5b3034e615c50dad68df/packages/climate-ref-ilamb/src/climate_ref_ilamb/standard.py#L651), so they will not respect the global Dask settings.
+///
diff --git a/docs/how-to-guides/executors.md b/docs/how-to-guides/executors.md
index 20c584fe9..b1d69fbef 100644
--- a/docs/how-to-guides/executors.md
+++ b/docs/how-to-guides/executors.md
@@ -60,3 +60,10 @@ executor = "climate_ref.executor.SynchronousExecutor"
 - **CeleryExecutor** suits distributed deployments in containerized or cloud setups.
 
 Once configured, run `ref solve` as usual and the REF will use your chosen executor to schedule and execute diagnostics.
+
+/// note | Configuring Dask
+It is recommended to configure Dask for the system running your executors, to
+avoid excessive memory use or context switching. Go to
+[How to control memory use and parallism guide](control-memory-use.md) for
+detailed instructions.
+///
diff --git a/mkdocs.yml b/mkdocs.yml
index 78c44e530..de60b644d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -29,6 +29,7 @@ nav:
       - how-to-guides/using-pre-computed-results.py
       - how-to-guides/executors.md
       - how-to-guides/hpc_executor.md
+      - how-to-guides/control-memory-use.md
       - how-to-guides/dataset-selection.py
       - how-to-guides/ingest-datasets.md
       - how-to-guides/running-diagnostics-locally.py