
🐛 Processing freezes when multiple cores are assigned #2135

Open

Pneumaethylamine opened this issue Jul 22, 2024 · 23 comments

@Pneumaethylamine

Describe the bug

Hi,
I have managed to run the default pipeline on a single subject using 1 GB of RAM and 1 core. Now I'd like to speed things up: my server has 800 GB of RAM and 38 cores, but when I try to run the pipeline with 2 or more cores it runs for a few hours and then just stops at some point. There is no crash message and nothing in the log; it just stops doing anything.

This is how I initiate it

apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

and my yml is below.

I'm sorry I can't share my data, but the pipeline works fine when I change just the following options to all be 1:

max_cores_per_participant: 16

num_ants_threads: 8

num_OMP_threads: 8
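
For context, these options normally sit together in the pipeline config roughly like this (a sketch assuming the standard pipeline_setup / system_config layout from the C-PAC default config; the values are the ones quoted above):

pipeline_setup:
  system_config:
    # upper bound on how many nipype nodes run in parallel for one participant
    max_cores_per_participant: 16
    # threads allocated to ANTs-based registration
    num_ants_threads: 8
    # OpenMP threads for tools that honour OMP_NUM_THREADS
    num_OMP_threads: 8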

The last few lines of the callback.log offer no clues; it seems to be well within its RAM limits, and when I check my server status it is not even close to having its limits pushed. There is a crash early in the process, something to do with not being able to lock pypeline.log (pypeline.log.lock), but this seems unrelated.

Occasionally it says "No resources available, potential deadlock" (see screenshot 1), but then it starts running again, presumably once other processes finish. As you can see in screenshot 2, there is no error message but the process has been frozen for 4 hours.

Thank you very much for the help!

crash-20240722-124119-drummondmcculloch-resampled_T1w-brain-template-20e94696-d2d1-4d41-9a1a-8948c82559fd.txt
callback.log
pypeline.log

To reproduce

No response

Preconfig

  • default
  • abcd-options
  • anat-only
  • blank
  • ccs-options
  • fmriprep-options
  • fx-options
  • monkey
  • monkey-ABCD
  • ndmg
  • nhp-macaque
  • preproc
  • rbc-options
  • rodent

Custom pipeline configuration

pipeline_config.docx

Run command

apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

Expected behavior

The expected outputs, and for the pipeline to run without stopping.

Acceptance criteria

The pipeline runs to completion without stalling.

Screenshots

Screenshot 2024-07-22 125410
Screenshot 2024-07-22 174927

C-PAC version

v1.8.7.dev1

Container platform

No response

Docker and/or Singularity version(s)

Apptainer 1.1.6-bp155.2.18

Additional context

No response

@Pneumaethylamine

Looking further into the error, it seems to occur during the nipype draw_gantt_chart function, which returns an error when comparing start times, indicating that "Two processes started at exactly the same time"; this then freezes all other processes. I think this might be happening because of the permission error preventing the program from locking the pypeline log file. Not sure.

PermissionError: [Errno 13] Permission denied: '/output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock'

"https://nipype.readthedocs.io/en/latest/api/generated/nipype.utils.draw_gantt_chart.html"

@tamsinrogers

Hi @Pneumaethylamine, thanks for reaching out.

We ran a similar pipeline config on our data and couldn’t replicate this problem. Could you share some additional details about your data so we can continue troubleshooting? Knowing the size of the files you are working with, and what kinds of files (T1, T2, etc.), would be helpful - something like the result of tree --du -h in your terminal, if you are free to share that level of detail about the data.

@sgiavasis sgiavasis moved this to 🏗 In progress in C-PAC Development Jul 23, 2024
@Pneumaethylamine commented Jul 24, 2024

Thanks for the fast response. I am just running on a single test subject at the moment; the files look like this:

(base) drummondmcculloch@bohr:/mrdata/cpac_test/NeuroPharm2-P2/sub-56061> tree -h
.
└── [ 111] ses-01
    ├── [4.0K] anat
    │   ├── [1.9K] sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.json
    │   ├── [ 13M] sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.nii.gz
    │   ├── [2.0K] sub-56061_ses-01_acq-mprage_rec-norm_T1w.json
    │   ├── [ 13M] sub-56061_ses-01_acq-mprage_rec-norm_T1w.nii.gz
    │   ├── [1.9K] sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.json
    │   ├── [ 10M] sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.nii.gz
    │   ├── [2.0K] sub-56061_ses-01_acq-space_rec-norm_T2w.json
    │   └── [ 11M] sub-56061_ses-01_acq-space_rec-norm_T2w.nii.gz
    ├── [4.0K] fmap
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_magnitude1.json
    │   ├── [159K] sub-56061_ses-01_acq-greLR_magnitude1.nii.gz
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_magnitude2.json
    │   ├── [158K] sub-56061_ses-01_acq-greLR_magnitude2.nii.gz
    │   ├── [2.3K] sub-56061_ses-01_acq-greLR_phasediff.json
    │   └── [201K] sub-56061_ses-01_acq-greLR_phasediff.nii.gz
    ├── [ 126] func
    │   ├── [2.5K] sub-56061_ses-01_task-rest_acq-ep2d_bold.json
    │   └── [ 37M] sub-56061_ses-01_task-rest_acq-ep2d_bold.nii.gz
    ├── [ 110] pet
    │   ├── [7.2K] sub-56061_ses-01_trc-CIMBI36_pet.json
    │   └── [162M] sub-56061_ses-01_trc-CIMBI36_pet.nii.gz
    └── [ 672] sub-56061_ses-01_scans.tsv

I ran it again last night with 1 core and it went totally smoothly. However, it does also crash if I try to run with the FreeSurfer options on, even with only a single core, i.e.,

surface_analysis:

  # Run freesurfer_abcd_preproc to obtain preprocessed T1w for reconall
  abcd_prefreesurfer_prep:
    run: On

  # Will run Freesurfer for surface-based analysis. Will output traditional Freesurfer derivatives.
  # If you wish to employ Freesurfer outputs for brain masking or tissue segmentation in the voxel-based pipeline,
  # select those 'Freesurfer-' labeled options further below in anatomical_preproc.
  freesurfer:

    run_reconall: On

    # Ingress freesurfer recon-all folder
    ingress_reconall: Off

    # Add extra arguments to recon-all command
    reconall_args:

  # Run ABCD-HCP post FreeSurfer and fMRISurface pipeline
  post_freesurfer:

    run: On

    subcortical_gray_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferSubcorticalLabelTableLut.txt

    freesurfer_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferAllLut.txt

    surf_atlas_dir: /opt/dcan-tools/pipeline/global/templates/standard_mesh_atlases

    gray_ordinates_dir: /opt/dcan-tools/pipeline/global/templates/Greyordinates

    gray_ordinates_res: 2

    high_res_mesh: 164

    low_res_mesh: 32

    fmri_res: 2

    smooth_fwhm: 2

  surface_connectivity:
    run: On
    surface_parcellation_template: /cpac_templates/Schaefer2018_200Parcels_17Networks_order.dlabel.nii

The error message is attached. Perhaps it is a totally separate issue (in which case I can open another thread), but a bit of googling suggested it might be a resource issue.

crash-20240722-192830-drummondmcculloch-post_freesurfer_303.a0-af16faed-d224-4e2a-89ae-65c5feea3d3f.txt

Thank you very much!!
Drummond

@Pneumaethylamine

Small update: I tried with an open dataset (https://openneuro.org/datasets/ds000030/versions/00016/download) and got the same error.

@tamsinrogers

Hi Drummond,

Thank you for providing the additional info. As next steps, can you try the following so we can better investigate what might be going on?

  1. Starting a new run with a clean working and output directory, using some other arbitrary number of threads.
  2. Turning off writing to the log for now (avoiding the lock):
pipeline_setup:
  log_directory:
    run_logging: Off

@Pneumaethylamine commented Jul 29, 2024

pipeline_config_single_base_fast_nolog.txt

apptainer run --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_nolog:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml

I made a new output directory so there was nothing in the log or working dirs.

It crashed again with this yml, with run_logging off.

Curiously, it still produced the log.lock errors (see below).

crash-20240729-115205-c-pac_user-resampled_T1w-template-symmetric-d777a902-1478-42c9-a59e-ac92d0f17aad.txt
crash-20240729-115205-c-pac_user-resampled_T1w-brain-template-deriv-10c15264-b286-4bde-83b1-423daf02a80a.txt

Here is the output in my terminal from when it died; it had been idle for almost an hour when I pressed Ctrl+C to kill it.

crash_outputs.txt

@Pneumaethylamine

I am working with my server IT team to try to diagnose the issue. Could something about how MultiProc is set up on the server be causing this? Thanks again.

@Pneumaethylamine commented Aug 2, 2024

Hi, is there anything else I can provide to help debug this? Thank you. Possibly unrelated, but when I set num_participant_to_run_at_once to >1 it still only runs one at a time.

@tamsinrogers

Hi @Pneumaethylamine, could you tell us which job scheduler you're using?

@Pneumaethylamine

Hi Tamsin, we don't have a separate job scheduler; we just use the Linux kernel task scheduler. Thanks :)

@tamsinrogers

Hi @Pneumaethylamine,

We've determined that this is the root of the crash. It looks like you may not have write permission for that file. If you still have the outputs from the run whose crash log you attached, could you check the permissions in /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/ and see if anything looks strange?

Something else we can look into: is your system using SELinux, AppArmor, or something similar? A review of your Apptainer settings may also reveal some helpful information.
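
If it helps, a few host-side commands that usually answer those questions (a sketch; exact tooling depends on the distribution):

sudo aa-status        # is AppArmor loaded, and which profiles are in enforce mode?
getenforce            # prints the SELinux mode (Enforcing/Permissive/Disabled) if SELinux is installed
apptainer --version   # container runtime version
apptainer remote list # configured cloud service endpoints and keyservers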

@Pneumaethylamine commented Aug 9, 2024

Hi Tamsin,

Thanks for getting to the root of it. Is there something I should be setting up regarding permissions before running Apptainer? Below is my permissions overview for the log folder. I thought the permissions issue with the pypeline.log file was due to several processes trying to write to it at the same time, resulting in the log.lock error from before.

(base) drummondmcculloch@hevesy:/mrdata/cpac_test/outputs/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01> ls -l
total 9036
-rw-r--r-- 1 drummondmcculloch users 513438 Jul 22 13:08 callback.log
-r--r--r-- 1 drummondmcculloch users 14903 Jul 22 12:40 cpac_data_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users 14903 Jul 22 12:42 cpac_data_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-r--r--r-- 1 drummondmcculloch users 12714 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z_min.yml
-r--r--r-- 1 drummondmcculloch users 66311 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users 12650 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z_min.yml
-r--r--r-- 1 drummondmcculloch users 66312 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-rw-r--r-- 1 drummondmcculloch users 1206393 Jul 22 13:52 pypeline.log
-rw-r--r-- 1 drummondmcculloch users 7552 Jul 22 12:42 sub-56061_ses-01_expectedOutputs.yml
-rw-r--r-- 1 drummondmcculloch users 7552 Jul 22 12:41 sub-56061_ses-01_missingOutputs.yml
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:41 workflow_pre_2024-07-22_12-40-59_pipeline_cpac-test_custom_pipeline.json
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:42 workflow_pre_2024-07-22_12-42-17_pipeline_cpac-test_custom_pipeline.json

As for Apptainer, I am using version 1.1.6-bp155.2.18 with no plugins installed and the following cloud service endpoints:

Cloud Services Endpoints

NAME           URI                  ACTIVE  GLOBAL  EXCLUSIVE  INSECURE
DefaultRemote  cloud.apptainer.org  YES     YES     NO         NO

Keyservers

URI                       GLOBAL  INSECURE  ORDER
https://keys.openpgp.org  YES     NO        1*

  * Active cloud services keyserver

Are there other specific settings you would like me to check?

Thank you very much for your time helping with this,
Drummond

@Pneumaethylamine

We are running AppArmor and something called "lockdown". I know nothing about how these are set up, but I've messaged IT to try to learn more.

@Pneumaethylamine commented Aug 13, 2024

I tried disabling AppArmor and running it again, and it made no difference. We aren't running SELinux. Thanks again for your help.

@tamsinrogers commented Aug 13, 2024

Hi @Pneumaethylamine,

The lock is designed to “simply watch[…] the existence of the lock file” while the log is being written, to avoid multiple processes writing at the same time.

It seems that a process in the container is creating log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock, and the whole tree of log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01, but somehow doesn’t then have write access to that file, despite you having write access to pypeline.log.

Since you’ve already tried disabling AppArmor, as a next step, can you try running C-PAC as root and then updating the output file permissions at the end? Another thing you can try in the meantime is running some subjects single-threaded.
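
Concretely, that could look something like the following (a sketch: --fakeroot is Apptainer's unprivileged root emulation; the out_fakeroot output directory is illustrative; the chown at the end is only needed if any outputs end up owned by another UID, with the user and group taken from the ls -l output above):

apptainer run --fakeroot --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_fakeroot:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml
# if anything in the output tree is owned by root (or another UID), hand it back:
sudo chown -R drummondmcculloch:users /mrdata/cpac_test/outputs/out_fakeroot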

@Pneumaethylamine

Hi Tamsin,

I tried to run the command with "--fakeroot", i.e.,

apptainer run --fakeroot --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_fakeroot:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_single_base_fast2.yml

I no longer got the crash about pypeline.log.lock, but the processing still stalled in the same place, as you can see in the pypeline.log file.

(screenshot attached)

I have been running subjects single-threaded, which works fine but is very slow. I have written a shell script to open multiple instances simultaneously, but this limits group-level analyses. I am also still unable to run surface-level analyses, and I wonder if this problem is related. Please let me know what other steps I can take to try to resolve this.

Thanks,
Drummond

@tamsinrogers

Hi @Pneumaethylamine ,

Can you try:

apptainer shell \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif

to shell into a container? Then, in the container, you can run the command that’s crashing:

bash /code/CPAC/surface/fMRISurface/run.sh --post_freesurfer_folder /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_303 --subject sub-56061_ses-01 --fmri /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz --scout /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz --lowresmesh 32 --grayordinatesres 2 --fmrires 2 --smoothingFWHM 2

Hopefully that way, we can see more specifically what’s going wrong.

@Pneumaethylamine

Hi Tamsin,

Strange thing: after I posted that screenshot of it hanging at 12:12:35, it briefly restarted around 9 hours later, ran for 90 seconds, and then hung again in a different place. It has now been stuck for 36 hours, so it probably won't restart again.

(screenshot attached)

We have checked the server logs and there's nothing to suggest any resource limitations or that any process was stopped/started at this time.

When I run multi-threaded with surface analyses on, it hangs at a different point.

(screenshot attached)

Finally, when I run surface analyses single-threaded, it dies in the same place highlighted above, and when I go in and run the appropriate command I get the attached error.

surface_single_error.txt

There seem to be two problems:

1. The atlas file and the functional data have different voxel dimensions, as you can see in this image, which explains this error; also, the FoV of the functional image doesn't extend down to the brainstem.

While running:
wb_command -cifti-create-dense-timeseries /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01_temp_subject.dtseries.nii -volume /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01.nii.gz /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/ROIs/Atlas_ROIs.2.nii.gz

ERROR: label volume has a different volume space than data volume

(screenshot attached)
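
One way to confirm the grid mismatch the error is complaining about is to compare the two headers directly (a sketch; fslhd is part of FSL, which is available inside the C-PAC container, and the paths are the ones from the wb_command call above):

fslhd /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01.nii.gz
fslhd /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/ROIs/Atlas_ROIs.2.nii.gz
# the dim1-3 / pixdim1-3 fields and the sform matrix need to match for
# wb_command -cifti-create-dense-timeseries to accept the volume/label pair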

2. The file "task-rest01_AtlasSubcortical_s2.nii.gz" doesn't exist, presumably because it wasn't created in the previous step.

I have attached the .yml I am using for the single-threaded surface analyses.
surface_full_keep_working_single_txt.txt

I hope this is somehow clarifying! Thanks again for the help.

@Pneumaethylamine

When I kill the hanging processes with Ctrl+C they output the following text, in case that is useful. Thanks again.

ctrlC_base_multi.txt
ctrlC_surface_multi.txt

@Pneumaethylamine

Hi, any update on this? Thanks.

@tamsinrogers

Hi @Pneumaethylamine, thank you for the additional information! We are replicating on our end and will hopefully have some more clarification for you soon.

Not sure how helpful this will be as you are mid-run, but for the future, using pre-computed FreeSurfer directories, as opposed to running FreeSurfer in C-PAC, will usually result in fewer debugging issues.

Will get back to you as soon as possible!

@Pneumaethylamine

Hi Tamsin, any luck with the replication? I'm not too concerned with the surface problem for now (that can be done later), but it would be great to resolve the multiple-cores issue. I appreciate your and the team's efforts!

@tamsinrogers

Hi @Pneumaethylamine, we are still exploring potential solutions in our replication of this issue, but hopefully some of the following information will be helpful for now:

  • Because the processing pipelines are largely sequential, there is not much to gain from n_cpus > 1. You can disable nipype graph parallelism by setting --n_cpus 1. This prevents nipype from trying to schedule multiple processing nodes at once, which reduces the resources required for one run, makes it easier to avoid out-of-memory errors, and makes estimating the required resources much easier.
  • It is much more efficient to parallelize at the level of subjects/sessions. This might help with your runs if parallelizing at the subject-session level proves to be faster/more efficient than running nodes within a subject-session in parallel (see the sketch after this list).
  • When trying out different analysis pipelines, it is also a good practice to run as few subjects/sessions as possible. The biggest use of resources comes from re-analyzing large datasets many times.
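
As a concrete illustration of the second point, subject-level parallelism can be as simple as launching several single-threaded runs at once (a sketch; the out_parallel output directory and the extra participant labels are hypothetical, while the bind mounts and the --n_cpus flag are the ones already used in this thread; -P 4 caps how many participants run at the same time):

printf '%s\n' sub-56061 sub-56062 sub-56063 sub-56064 |
  xargs -P 4 -I{} \
    apptainer run --containall \
      -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
      -B /mrdata/cpac_test/outputs/out_parallel:/output \
      -B /mrdata/cpac_test/configs:/pipeline_dir \
      c-pac_latest.sif /bidsdata /output participant \
      --participant_label {} --n_cpus 1 \
      --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml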

Will be in touch soon with more! Thanks for your patience.
