
🐛 Processing freezes when multiple cores are assigned #2135

Open

Pneumaethylamine opened this issue Jul 22, 2024 · 23 comments

@Pneumaethylamine

Describe the bug

Hi,
I have managed to run the default pipeline on a single subject using 1 GB of RAM and 1 core. Now I'd like to speed things up: my server has 800 GB of RAM and 38 cores, but when I try to run the pipeline with 2 or more cores it runs for a few hours and then just stops at some point. There is no crash message and nothing in the log; it just stops doing anything.

This is how I initiate it

apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

and my yml is below.

I'm sorry I can't share my data, but the pipeline works fine when I change just the following options to all be 1:

max_cores_per_participant: 16

num_ants_threads: 8

num_OMP_threads: 8
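
For context, these options normally sit together in the pipeline config roughly like this (a sketch assuming the standard pipeline_setup / system_config layout from the C-PAC default config; the values are the ones quoted above):

pipeline_setup:
  system_config:
    # upper bound on how many nipype nodes run in parallel for one participant
    max_cores_per_participant: 16
    # threads allocated to ANTs-based registration
    num_ants_threads: 8
    # OpenMP threads for tools that honour OMP_NUM_THREADS
    num_OMP_threads: 8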

The last few lines of the callback.log offer no clues; it seems to be well within its RAM limits, and when I check my server status it is not even close to having its limits pushed. There is a crash early in the process, something to do with not being able to lock pypeline.log (pypeline.log.lock), but this seems unrelated.

Occasionally it says "No resources available, potential deadlock" (see screenshot 1), but then it starts running again, presumably once other processes finish. As you can see in screenshot 2, there is no error message but the process has been frozen for 4 hours.

Thank you very much for the help!

crash-20240722-124119-drummondmcculloch-resampled_T1w-brain-template-20e94696-d2d1-4d41-9a1a-8948c82559fd.txt
callback.log
pypeline.log

To reproduce

No response

Preconfig

  • default
  • abcd-options
  • anat-only
  • blank
  • ccs-options
  • fmriprep-options
  • fx-options
  • monkey
  • monkey-ABCD
  • ndmg
  • nhp-macaque
  • preproc
  • rbc-options
  • rodent

Custom pipeline configuration

pipeline_config.docx

Run command

apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

Expected behavior

The expected outputs, and for the pipeline to run without stopping.

Acceptance criteria

The pipeline runs to completion without stalling.

Screenshots

Screenshot 2024-07-22 125410
Screenshot 2024-07-22 174927

C-PAC version

v1.8.7.dev1

Container platform

No response

Docker and/or Singularity version(s)

Apptainer 1.1.6-bp155.2.18

Additional context

No response

@Pneumaethylamine

Looking further into the error, it seems to occur during the nipype draw_gantt_chart function, which returns an error when comparing start times, indicating that "Two processes started at exactly the same time"; this then freezes all other processes. I think this might be happening because of the permission error preventing the program from locking the pypeline log file. Not sure.

PermissionError: [Errno 13] Permission denied: '/output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock'

"https://nipype.readthedocs.io/en/latest/api/generated/nipype.utils.draw_gantt_chart.html"

@tamsinrogers

Hi @Pneumaethylamine, thanks for reaching out.

We ran a similar pipeline config on our data and couldn’t replicate this problem. Could you share some additional details about your data so we can continue troubleshooting? Knowing the size of the files you are working with, and what kinds of files (T1, T2, etc.), would be helpful - something like the result of tree --du -h in your terminal, if you are free to share that level of detail about the data.

@sgiavasis sgiavasis moved this to 🏗 In progress in C-PAC Development Jul 23, 2024
@Pneumaethylamine commented Jul 24, 2024

Thanks for the fast response. I am just running on a single test subject at the moment; the files look like this:

(base) drummondmcculloch@bohr:/mrdata/cpac_test/NeuroPharm2-P2/sub-56061> tree -h
.
└── [ 111] ses-01
    ├── [4.0K] anat
    │   ├── [1.9K] sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.json
    │   ├── [ 13M] sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.nii.gz
    │   ├── [2.0K] sub-56061_ses-01_acq-mprage_rec-norm_T1w.json
    │   ├── [ 13M] sub-56061_ses-01_acq-mprage_rec-norm_T1w.nii.gz
    │   ├── [1.9K] sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.json
    │   ├── [ 10M] sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.nii.gz
    │   ├── [2.0K] sub-56061_ses-01_acq-space_rec-norm_T2w.json
    │   └── [ 11M] sub-56061_ses-01_acq-space_rec-norm_T2w.nii.gz
    ├── [4.0K] fmap
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_magnitude1.json
    │   ├── [159K] sub-56061_ses-01_acq-greLR_magnitude1.nii.gz
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_magnitude2.json
    │   ├── [158K] sub-56061_ses-01_acq-greLR_magnitude2.nii.gz
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_phasediff.json
    │   └── [201K] sub-56061_ses-01_acq-greLR_phasediff.nii.gz
    ├── [ 126] func
    │   ├── [2.5K] sub-56061_ses-01_task-rest_acq-ep2d_bold.json
    │   └── [ 37M] sub-56061_ses-01_task-rest_acq-ep2d_bold.nii.gz
    ├── [ 110] pet
    │   ├── [7.2K] sub-56061_ses-01_trc-CIMBI36_pet.json
    │   └── [162M] sub-56061_ses-01_trc-CIMBI36_pet.nii.gz
    └── [ 672] sub-56061_ses-01_scans.tsv

I ran it again last night with 1 core and it went totally smoothly. However, it does also crash if I try to run with the FreeSurfer options on, even with only a single core, i.e.,

surface_analysis:

  # Run freesurfer_abcd_preproc to obtain preprocessed T1w for reconall
  abcd_prefreesurfer_prep:
    run: On

  # Will run Freesurfer for surface-based analysis. Will output traditional Freesurfer derivatives.
  # If you wish to employ Freesurfer outputs for brain masking or tissue segmentation in the voxel-based pipeline,
  # select those 'Freesurfer-' labeled options further below in anatomical_preproc.
  freesurfer:

    run_reconall: On

    # Ingress freesurfer recon-all folder
    ingress_reconall: Off

    # Add extra arguments to recon-all command
    reconall_args:

  # Run ABCD-HCP post FreeSurfer and fMRISurface pipeline
  post_freesurfer:

    run: On

    subcortical_gray_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferSubcorticalLabelTableLut.txt

    freesurfer_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferAllLut.txt

    surf_atlas_dir: /opt/dcan-tools/pipeline/global/templates/standard_mesh_atlases

    gray_ordinates_dir: /opt/dcan-tools/pipeline/global/templates/Greyordinates

    gray_ordinates_res: 2

    high_res_mesh: 164

    low_res_mesh: 32

    fmri_res: 2

    smooth_fwhm: 2

  surface_connectivity:
    run: On
    surface_parcellation_template: /cpac_templates/Schaefer2018_200Parcels_17Networks_order.dlabel.nii

The error message is attached. Perhaps it is a totally separate issue (in which case I can open another thread), but a bit of googling suggested it might be a resource issue.

crash-20240722-192830-drummondmcculloch-post_freesurfer_303.a0-af16faed-d224-4e2a-89ae-65c5feea3d3f.txt

Thank you very much!!
Drummond

@Pneumaethylamine

Small update: I tried with an open dataset (https://openneuro.org/datasets/ds000030/versions/00016/download) and got the same error.

@tamsinrogers

Hi Drummond,

Thank you for providing the additional info. As next steps, can you try the following so we can better investigate what might be going on?

  1. Starting a new run with a clean working and output directory, using some other arbitrary number of threads.
  2. Turning off writing to the log for now (avoiding the lock):
pipeline_setup:
  log_directory:
    run_logging: Off

@Pneumaethylamine commented Jul 29, 2024

pipeline_config_single_base_fast_nolog.txt

apptainer run --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_nolog:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml

I made a new output directory so there was nothing in the log or working dirs.

It crashed again with this yml, with run_logging off.

Curiously, it still produced the log.lock errors (see below).

crash-20240729-115205-c-pac_user-resampled_T1w-template-symmetric-d777a902-1478-42c9-a59e-ac92d0f17aad.txt
crash-20240729-115205-c-pac_user-resampled_T1w-brain-template-deriv-10c15264-b286-4bde-83b1-423daf02a80a.txt

Here is the output in my terminal from when it died; it had been idle for almost an hour when I pressed Ctrl+C to kill it.

crash_outputs.txt

@Pneumaethylamine

I am working with my server IT team to try to diagnose the issue. Could something about how MultiProc is set up on the server be causing this? Thanks again.

@Pneumaethylamine commented Aug 2, 2024

Hi, is there anything else I can provide to help debug this? Thank you. Possibly unrelated, but when I set num_participant_to_run_at_once to >1 it still only runs one at a time.

@tamsinrogers

Hi @Pneumaethylamine, could you tell us which job scheduler you're using?

@Pneumaethylamine

Hi Tamsin, we don't have a separate job scheduler; we just use the Linux kernel task scheduler. Thanks :)

@tamsinrogers

Hi @Pneumaethylamine,

We've determined that this is the root of the crash. It looks like you may not have write permission for that file. If you still have the outputs from the run whose crash log you attached, could you check the permissions in /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/ and see if anything looks strange?

Something else we can look into: is your system using SELinux, AppArmor, or something similar? A review of your Apptainer settings may also reveal some helpful information.
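
If it helps, a few host-side commands that usually answer those questions (a sketch; exact tooling depends on the distribution):

sudo aa-status        # is AppArmor loaded, and which profiles are in enforce mode?
getenforce            # prints the SELinux mode (Enforcing/Permissive/Disabled) if SELinux is installed
apptainer --version   # container runtime version
apptainer remote list # configured cloud service endpoints and keyservers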

@Pneumaethylamine commented Aug 9, 2024

Hi Tamsin,

Thanks for getting to the root of it. Is there something I should be setting up regarding permissions before running Apptainer? Below is my permissions overview for the log folder. I thought the permissions issue with the pypeline.log file was due to several processes trying to write to it at the same time, resulting in the log.lock error from before.

(base) drummondmcculloch@hevesy:/mrdata/cpac_test/outputs/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01> ls -l
total 9036
-rw-r--r-- 1 drummondmcculloch users 513438 Jul 22 13:08 callback.log
-r--r--r-- 1 drummondmcculloch users 14903 Jul 22 12:40 cpac_data_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users 14903 Jul 22 12:42 cpac_data_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-r--r--r-- 1 drummondmcculloch users 12714 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z_min.yml
-r--r--r-- 1 drummondmcculloch users 66311 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users 12650 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z_min.yml
-r--r--r-- 1 drummondmcculloch users 66312 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-rw-r--r-- 1 drummondmcculloch users 1206393 Jul 22 13:52 pypeline.log
-rw-r--r-- 1 drummondmcculloch users 7552 Jul 22 12:42 sub-56061_ses-01_expectedOutputs.yml
-rw-r--r-- 1 drummondmcculloch users 7552 Jul 22 12:41 sub-56061_ses-01_missingOutputs.yml
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:41 workflow_pre_2024-07-22_12-40-59_pipeline_cpac-test_custom_pipeline.json
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:42 workflow_pre_2024-07-22_12-42-17_pipeline_cpac-test_custom_pipeline.json

As for Apptainer, I am using version 1.1.6-bp155.2.18 with no plugins installed and the following cloud service endpoints:

Cloud Services Endpoints

NAME           URI                  ACTIVE  GLOBAL  EXCLUSIVE  INSECURE
DefaultRemote  cloud.apptainer.org  YES     YES     NO         NO

Keyservers

URI                       GLOBAL  INSECURE  ORDER
https://keys.openpgp.org  YES     NO        1*

  * Active cloud services keyserver

Are there other specific settings you would like me to check?

Thank you very much for your time helping with this,
Drummond

@Pneumaethylamine

We are running AppArmor and something called "lockdown". I know nothing about how these are set up, but I've messaged IT to try to learn more.

@Pneumaethylamine commented Aug 13, 2024

I tried disabling AppArmor and running it again, and it made no difference. We aren't running SELinux. Thanks again for your help.

@tamsinrogers commented Aug 13, 2024

Hi @Pneumaethylamine,

The lock is designed to “simply watch[…] the existence of the lock file” while the log is being written, to avoid multiple processes writing at the same time.

It seems that a process in the container is creating log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock, and the whole tree of log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01, but somehow doesn’t then have write access to that file, despite you having write access to pypeline.log.

Since you’ve already tried disabling AppArmor, as a next step, can you try running C-PAC as root and then updating the output file permissions at the end? Another thing you can try in the meantime is running some subjects single-threaded.
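
Concretely, that could look something like the following (a sketch: --fakeroot is Apptainer's unprivileged root emulation; the out_fakeroot output directory is illustrative; the chown at the end is only needed if any outputs end up owned by another UID, with the user and group taken from the ls -l output above):

apptainer run --fakeroot --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_fakeroot:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml
# if anything in the output tree is owned by root (or another UID), hand it back:
sudo chown -R drummondmcculloch:users /mrdata/cpac_test/outputs/out_fakeroot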

@Pneumaethylamine

Hi Tamsin,

I tried to run the command with "--fakeroot", i.e.,

apptainer run --fakeroot --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_fakeroot:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_single_base_fast2.yml

I no longer got the crash about pypeline.log.lock, but the processing still stalled in the same place, as you can see in the pypeline.log file.

(screenshot attached)

I have been running subjects single-threaded, which works fine but is very slow. I have written a shell script to open multiple instances simultaneously, but this limits group-level analyses. I am also still unable to run surface-level analyses, and I wonder if this problem is related. Please let me know what other steps I can take to try to resolve this.

Thanks,
Drummond

@tamsinrogers

Hi @Pneumaethylamine ,

Can you try:

apptainer shell \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif

to shell into a container? Then, in the container, you can run the command that’s crashing:

bash /code/CPAC/surface/fMRISurface/run.sh --post_freesurfer_folder /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_303 --subject sub-56061_ses-01 --fmri /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz --scout /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz --lowresmesh 32 --grayordinatesres 2 --fmrires 2 --smoothingFWHM 2

Hopefully that way, we can see more specifically what’s going wrong.

@Pneumaethylamine

Hi Tamsin,

Strange thing: after I posted that screenshot of it hanging at 12:12:35, it briefly restarted around 9 hours later, ran for 90 seconds, and then hung again in a different place. It has now been stuck for 36 hours, so it probably won't restart again.

(screenshot attached)

We have checked the server logs and there's nothing to suggest any resource limitations or that any process was stopped/started at this time.

When I run multi-threaded with surface analyses on, it hangs at a different point.

(screenshot attached)

Finally, when I run surface analyses single-threaded, it dies in the same place highlighted above, and when I go in and run the appropriate command I get the attached error.

surface_single_error.txt

There seem to be two problems:

1. The atlas file and the functional data have different voxel dimensions, as you can see in this image, which explains this error; also, the FoV of the functional image doesn't extend down to the brainstem.

While running:
wb_command -cifti-create-dense-timeseries /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01_temp_subject.dtseries.nii -volume /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01.nii.gz /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/ROIs/Atlas_ROIs.2.nii.gz

ERROR: label volume has a different volume space than data volume

(screenshot attached)
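
One way to confirm the grid mismatch the error is complaining about is to compare the two headers directly (a sketch; fslhd is part of FSL, which is available inside the C-PAC container, and the paths are the ones from the wb_command call above):

fslhd /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01.nii.gz
fslhd /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/ROIs/Atlas_ROIs.2.nii.gz
# the dim1-3 / pixdim1-3 fields and the sform matrix need to match for
# wb_command -cifti-create-dense-timeseries to accept the volume/label pair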

2. The file "task-rest01_AtlasSubcortical_s2.nii.gz" doesn't exist, presumably because it wasn't created in the previous step.

I have attached the .yml I am using for the single-threaded surface analyses.
surface_full_keep_working_single_txt.txt

I hope this is somehow clarifying! Thanks again for the help.

@Pneumaethylamine

When I kill the hanging processes with Ctrl+C they output the following text, in case that is useful. Thanks again.

ctrlC_base_multi.txt
ctrlC_surface_multi.txt

@Pneumaethylamine

Hi, any update on this? Thanks.

@tamsinrogers

Hi @Pneumaethylamine, thank you for the additional information! We are replicating on our end and will hopefully have some more clarification for you soon.

Not sure how helpful this will be as you are mid-run, but for the future, using pre-computed FreeSurfer directories, as opposed to running FreeSurfer in C-PAC, will usually result in fewer debugging issues.

Will get back to you as soon as possible!

@Pneumaethylamine

Hi Tamsin, any luck with the replication? I'm not too concerned with the surface problem for now (that can be done later), but it would be great to resolve the multiple-cores issue. I appreciate your and the team's efforts!

@tamsinrogers

Hi @Pneumaethylamine, we are still exploring potential solutions in our replication of this issue, but hopefully some of the following information will be helpful for now:

  • Because the processing pipelines are largely sequential, there is not much to gain from n_cpus > 1. You can disable nipype graph parallelism by setting --n_cpus 1. This prevents nipype from trying to schedule multiple processing nodes at once, which reduces the resources required for one run, makes it easier to avoid out-of-memory errors, and makes estimating the required resources much easier.
  • It is much more efficient to parallelize at the level of subjects/sessions. This might help with your runs if parallelizing at the subject-session level proves to be faster/more efficient than running nodes within a subject-session in parallel (see the sketch after this list).
  • When trying out different analysis pipelines, it is also a good practice to run as few subjects/sessions as possible. The biggest use of resources comes from re-analyzing large datasets many times.
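
As a concrete illustration of the second point, subject-level parallelism can be as simple as launching several single-threaded runs at once (a sketch; the out_parallel output directory and the extra participant labels are hypothetical, while the bind mounts and the --n_cpus flag are the ones already used in this thread; -P 4 caps how many participants run at the same time):

printf '%s\n' sub-56061 sub-56062 sub-56063 sub-56064 |
  xargs -P 4 -I{} \
    apptainer run --containall \
      -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
      -B /mrdata/cpac_test/outputs/out_parallel:/output \
      -B /mrdata/cpac_test/configs:/pipeline_dir \
      c-pac_latest.sif /bidsdata /output participant \
      --participant_label {} --n_cpus 1 \
      --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml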

Will be in touch soon with more! Thanks for your patience.
