📝 Add scheduler documentation #241

Draft: wants to merge 2 commits into base: source
1 change: 1 addition & 0 deletions docs/_sources/_static/flowcharts/scheduler.svg
37 changes: 37 additions & 0 deletions docs/_sources/user/dashboard.rst
@@ -0,0 +1,37 @@
Dashboard
=========

The C-PAC dashboard allows users to schedule and run C-PAC through a graphical interface in a browser.

To launch a local dashboard, run « Details about how to access dashboard »

Technical details
=================
The implementation relies heavily on the ``asyncio`` API to simplify concurrency. However, ``asyncio`` is not a parallel API: everything runs in a single thread (so there are no race conditions), and the tasks being executed concurrently *must not* block the event loop (e.g. a task may ``await asyncio.sleep(...)`` or an asynchronous IO call, but must not call blocking functions).
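
For illustration, a minimal sketch (not C-PAC code) of this constraint: two tasks make progress concurrently in a single thread because each one awaits instead of blocking.

.. code-block:: Python

    import asyncio

    async def poll_container_status(name):
        # Cooperative: awaiting asyncio.sleep yields control to the event loop,
        # so the other task can run while this one waits.
        for _ in range(3):
            await asyncio.sleep(1)
            print(f"{name}: still running")

    async def main():
        await asyncio.gather(
            poll_container_status("task-a"),
            poll_container_status("task-b"),
        )

    asyncio.run(main())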

This implementation has six main parts: Scheduler, Backend, Schedule (and its children), Message, Result, and the API.

Beginning with the Schedule: a Schedule is an abstraction of a task to be executed. For C-PAC, we have three tasks:

* DataSettings: a task to generate data configs from a provided data settings file;
* DataConfig: a task to schedule a pipeline for the subjects in a data config, spawning a new task for each participant;
* ParticipantPipeline: a task to execute a pipeline for a single subject.

More technical aspects, such as running containers, are handled by a specialization of the Schedule class: BackendSchedule. BackendSchedules are specific to a Backend, an interface between Python and the software of a specific backend (e.g. the Singularity binaries). The Backend must contain the parameters required for its BackendSchedules to communicate with the underlying software, such as the Docker image to use or the SSH connection for accessing a SLURM cluster.
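
As an illustration only (the class and attribute names below are hypothetical, not necessarily the actual C-PAC API), the relationship between Schedules and Backends might look like this:

.. code-block:: Python

    class Schedule:
        """Abstract description of a task to be executed."""

    class ParticipantPipelineSchedule(Schedule):
        """Run a pipeline for a single participant."""
        def __init__(self, subject, pipeline_config):
            self.subject = subject
            self.pipeline_config = pipeline_config

    class Backend:
        """Holds the parameters needed to talk to the underlying software."""

    class DockerBackend(Backend):
        def __init__(self, image="fcpindi/c-pac:latest"):
            self.image = image  # which Docker image to run

    class SLURMBackend(Backend):
        def __init__(self, ssh_host, ssh_user):
            # SSH connection details used to reach the cluster
            self.ssh_host = ssh_host
            self.ssh_user = ssh_user

    class BackendSchedule(Schedule):
        """A Schedule specialized to run on a particular Backend."""
        def __init__(self, backend):
            self.backend = backend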

The Scheduler is the central part of this implementation, and perhaps the simplest. It stores Schedules in a tree-like structure (Schedules can spawn new Schedules) and manages the Messages received from each Schedule, together with the callbacks associated with each Schedule Message type. When a Schedule is scheduled, the Scheduler sends it to its Backend, and the Backend specializes this "naïve" Schedule into a BackendSchedule for that Backend:

.. code-block:: Python

    ParticipantPipelineSchedule + DockerBackend = DockerParticipantPipelineSchedule

This "backend-aware" Schedule (from the superclass BackendSchedule) will then be executed by the Scheduler. The BackendSchedule behave as a Python generator, so the Scheduler simply iterates this object, and the items of this iteration are defined as Messages. The Messages are data classes (i.e. only store data, not methods), to give information for the Scheduler about the execution. The Messages are relayed to Scheduler watchers, which are external agents that provide a callback function for the Scheduler to call when it receives a specific type of Message. For the Spawn Message, the Scheduler schedules a new Schedule, with the parameters contained in the Spawn message.

The Docker and Singularity backends are in fact largely the same: they share the same base code for container execution and differ only in how the container is created.

When the container is created, three tasks run concurrently for this Schedule: container status, log listener, and file listener. The first yields Messages of type Status, as a ping, so we know the container is running fine. The second connects to the WebSocket (WS) server running in the container to capture which nodes it has run so far, and yields Messages of type Log. The last one looks in the output directory for logs and crashes, storing the files as Results in the Schedule and yielding Messages of type Result.
Only the ParticipantPipeline has the log and file listeners; the other Schedules have just the container status Messages.
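
A rough sketch of how such concurrent listeners can feed a single stream of Messages (the generators below are stand-ins, not the real listeners):

.. code-block:: Python

    import asyncio

    async def container_status():
        for _ in range(2):
            await asyncio.sleep(0.5)
            yield ("Status", "container alive")

    async def log_listener():
        # In C-PAC this would read node events from the WS server in the container.
        yield ("Log", "a node finished")

    async def file_listener():
        # In C-PAC this would watch the output directory for logs and crashes.
        yield ("Result", "a new log file")

    async def pump(generator, queue):
        async for message in generator:
            await queue.put(message)

    async def main():
        queue = asyncio.Queue()
        listeners = (container_status(), log_listener(), file_listener())
        await asyncio.gather(*(pump(g, queue) for g in listeners))
        while not queue.empty():
            print(await queue.get())

    asyncio.run(main())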

For SLURM, the Backend starts by connecting to the cluster via SSH. It uses SSH connection multiplexing, so the authentication process happens only once, which is helpful for connections that have a multi-factor authentication layer. After connecting to the cluster, the Backend allocates nodes to execute the Schedules and installs Miniconda and cpac on them. Using the cpac.api module, the local cpac communicates with the cpac instance on the node via HTTP and WS to run the Schedules. It uses the same API to gather the results and keep the local Schedule state updated. By default, the node's cpac uses the Singularity Backend to run the Schedules.
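
SSH connection multiplexing is a standard OpenSSH feature; the options below illustrate the general idea (the host name and control-socket path are placeholders, and this is not necessarily how cpac invokes SSH):

.. code-block:: Python

    import subprocess

    # With ControlMaster, the first connection authenticates (possibly including
    # multi-factor authentication) and keeps a control socket open; subsequent
    # commands reuse that socket without authenticating again.
    MUX = [
        "-o", "ControlMaster=auto",
        "-o", "ControlPath=/tmp/cpac-ssh-%r@%h-%p",
        "-o", "ControlPersist=600",
    ]
    host = "user@slurm.example.edu"  # placeholder

    subprocess.run(["ssh", *MUX, host, "echo connected"], check=True)   # authenticates once
    subprocess.run(["ssh", *MUX, host, "squeue -u $USER"], check=True)  # reuses the socket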

The Results are essentially files that would be too large to transfer via WS. The API for gathering Results allows slicing the content using HTTP headers (``Content-Range``). This is essential for results that grow during execution (e.g. logs): with slicing, a client does not need to request the whole file again, only the part it does not yet have.
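
For example, a client that already has part of a log can request only the bytes it is missing; the endpoint below is hypothetical, and the request-side header for byte slicing is ``Range`` (the server describes the returned slice with ``Content-Range``):

.. code-block:: Python

    import requests

    def fetch_new_bytes(url, already_have):
        # Ask only for the bytes starting at the offset we already hold.
        response = requests.get(url, headers={"Range": f"bytes={already_have}-"})
        response.raise_for_status()
        # A 206 (Partial Content) response carries a Content-Range header such
        # as "bytes 1024-2047/2048" describing the slice that was returned.
        return response.content

    # Hypothetical result endpoint; the real URL depends on the cpac API.
    new_part = fetch_new_bytes("http://127.0.0.1:8080/result/log", already_have=1024)
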
23 changes: 18 additions & 5 deletions docs/_sources/user/help.rst
@@ -9,7 +9,7 @@ If you have a question that is not answered in the User Guide or encounter an is

View Crash Files
^^^^^^^^^^^^^^^^
If you have the :ref:`cpac Python package <cpac-python-package>`, you can simply run

.. code-block:: console

@@ -43,9 +43,9 @@ Common Issues
#. My run is crashing with the crash files indicating a memory error. What is going on?

This issue often occurs when scans are resampled to a higher resolution. This is because functional images, the template that you are resampling to, and the resulting generated image are all loaded into memory at various points in the C-PAC pipeline. Typically these images are compressed, but they will be uncompressed when loaded. If the amount of memory required to load the images exceeds the amount of memory available, C-PAC will crash.

For instance, suppose you are resampling a 50 MB (100 MB uncompressed) resting-state scan with 3 mm isotropic voxels to a 3 MB (15 MB uncompressed) 1 mm template. By dividing the voxel volume for the original resting state scan by the voxel volume for the template, you can determine that the voxel count of the resampled resting-state scan will be 27 times greater. Therefore, if you multiply the uncompressed file size by 27 you can estimate that the resampled scan alone will take up 2.7 GB of space. You will need at least this much RAM for C-PAC to load the scan. If you are running multiple subjects simultaneously, you would need to multiply this estimate by the number of subjects. Note that the template, original image, and any other open applications will also have their own memory requirements, so the estimate would be closer to 2.8 GB per subject plus however much RAM is used before C-PAC is started (assuming no new applications are started mid-run).

To avoid this error, you will need to either get more RAM, run fewer subjects at once, or consider downsampling your template to a lower resolution.


@@ -54,10 +54,23 @@ Common Issues
#. I'm re-running a pipeline, but I am receiving many crashes. Most of these crashes tell me that a file that has been moved or no longer exists is being used as an input for a step in the C-PAC pipeline. What is going on and how can I tell C-PAC to use the correct inputs?

One of the features of Nipype (which C-PAC is built upon) is that steps that have been run before are not re-run when you re-start a pipeline. Nipype accomplishes this by associating a value with a step based on the properties of that step (i.e., hashing). Nipype has two potential values that it can associate with a step: a value based on the size and date of the files created by the step, and a value based upon the data present within the files themselves. The first value is what C-PAC uses with its pipelines, since it is much more computationally practical. Since this value only looks at the size and date of files to determine whether or not a step has been run, it will not see that the file's path has changed, and it will assume that all paths are consistent with the path structure from when the pipeline was run before.

To work around this error, you will need to delete the working directory associated with the previous run and create a new directory to replace it for the new run.

#. I'm trying to run a Singularity image with ``--monitoring`` but I'm getting

.. code-block:: BASH

    ERROR : Failed to create container namespaces

How can I get around this error?

.. unprivileged_userns_clone_start

See `hpcng/singularity#4361 <https://github.com/hpcng/singularity/issues/4361>`_ for potential guidance.

.. unprivileged_userns_clone_end

#. How should I cite C-PAC in my paper?

Please cite the abstract located `here <http://www.frontiersin.org/10.3389/conf.fninf.2013.09.00042/event_abstract>`__.

67 changes: 67 additions & 0 deletions docs/_sources/user/scheduler.rst
@@ -0,0 +1,67 @@
Scheduling and Progress Tracking
================================

.. raw:: html

    <div class="flowchart-container"><object data="../_static/flowcharts/scheduler.svg" type="image/svg+xml"></object></div>

During pipeline execution, we want to monitor each node's execution. WebSocket (WS) allows the server to push information to connected clients, which is essential in an asynchronous setup.

When starting a run with the ``--monitoring`` flag, execution hangs until the WebSocket connects. This is intentional, so that no data is lost (i.e. so that C-PAC does not start running before the WS connects).

If running the :ref:`cpac Python package <cpac-python-package>`, cpac will automatically find an available port on which to connect. If running a Docker container directly, you must expose any ports used for monitoring. The default WebSocket monitoring port is ``8080``.

For example, you can run

.. code-block:: BASH

    cpac run /path/to/BIDS_directory /path/to/outputs participant --monitoring

or

.. code-block:: BASH

    docker run \
        -it --rm -p 8080:8080 \
        -v /path/to/BIDS_directory:/bids_dir \
        -v /path/to/outputs:/outputs \
        fcpindi/c-pac:latest \
        /bids_dir /outputs participant \
        --monitoring

or

.. code-block:: BASH

    singularity run \
        -B /path/to/BIDS_directory:/bids_dir \
        -B /path/to/outputs:/outputs \
        fcpindi_c-pac.simg \
        /bids_dir /outputs participant \
        --monitoring

.. note::

    Singularity requires the ``--fakeroot`` option to use ``network-args.portmap``. If ``--fakeroot`` gives an error like

    .. code-block:: BASH

        ERROR : Failed to create container namespaces

    .. include:: /user/help.rst
        :start-after: .. unprivileged_userns_clone_start
        :end-before: .. unprivileged_userns_clone_end

Once your container starts running, C-PAC should log some setup information and then pause on the message

.. code-block:: BASH

    [Waiting for monitoring websocket to connect]

One WebSocket monitoring tool is `WebSocat <https://github.com/vi/websocat#installation>`_. To use WebSocat to monitor, run

.. code-block:: BASH

    websocat ws://127.0.0.1:8080/log

(replacing ``8080`` with whatever port you're using) in a terminal. C-PAC will start running and WebSocat will display real-time monitoring messages.
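
Alternatively, any WebSocket client can be used. For instance, a minimal Python client using the third-party ``websockets`` package (assuming the default port and the ``/log`` endpoint shown above):

.. code-block:: Python

    import asyncio

    import websockets  # third-party package: pip install websockets

    async def monitor(uri="ws://127.0.0.1:8080/log"):
        async with websockets.connect(uri) as ws:
            # Print each monitoring message as the server pushes it.
            async for message in ws:
                print(message)

    asyncio.run(monitor())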