🐛 Processing freezes when multiple cores are assigned #2135
Looking further into the error, it seems to occur in nipype's draw_gantt_chart function, which raises an error when start-time is TRUE, indicating that "Two processes started at exactly the same time"; this then freezes all other processes. I think this might be happening because the permission error prevents the program from locking the pypeline log file, but I'm not sure. PermissionError: [Errno 13] Permission denied: '/output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock' https://nipype.readthedocs.io/en/latest/api/generated/nipype.utils.draw_gantt_chart.html
Hi @Pneumaethylamine, thanks for reaching out. We ran a similar pipeline config on our data and couldn't replicate this problem. Could you share some additional details about your data so we can continue troubleshooting? Knowing the size of the files you are working with, and what kinds of files (…)
Thanks for the fast response. I am just running on a single test subject at the moment; the files look like this:

(base) drummondmcculloch@bohr:/mrdata/cpac_test/NeuroPharm2-P2/sub-56061> tree -h

I ran it again last night with 1 core and it went totally smoothly, though it also crashes if I try to run with the FreeSurfer options on and only a single core, i.e. the surface_analysis settings (the ones commented "Run freesurfer_abcd_preproc to obtain preprocessed T1w for reconall", "Will run Freesurfer for surface-based analysis. Will output traditional Freesurfer derivatives. Select those 'Freesurfer-' labeled options further below in anatomical_preproc", "Ingress freesurfer recon-all folder", and surface_connectivity). The error message is attached; perhaps it is a totally separate issue (in which case I can open another thread), but a bit of googling suggested it might be a resources issue. Thank you very much!!
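For reference, those options live in the surface_analysis section of the C-PAC pipeline config. A minimal sketch of that section, with key names recalled from the 1.8.x default config (they may differ slightly in other versions), just to show which switches are being turned on:

surface_analysis:
  freesurfer:
    run_reconall: On        # run FreeSurfer recon-all inside C-PAC
    ingress_reconall: Off   # or ingress a pre-computed recon-all folder instead
  post_freesurfer:
    run: On                 # post-FreeSurfer surface processing
  surface_connectivity:
    run: On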
Small update: I tried with an open dataset, https://openneuro.org/datasets/ds000030/versions/00016/download
Hi Drummond, Thank you for providing the additional info. As next steps, can you try the following so we can better investigate what might be going on?
pipeline_config_single_base_fast_nolog.txt

apptainer run --containall …

I made a new output directory so there was nothing in the log or working dirs. It crashed again with this yml with run_logging off. Curiously, it still produced the log.lock errors (see below): crash-20240729-115205-c-pac_user-resampled_T1w-template-symmetric-d777a902-1478-42c9-a59e-ac92d0f17aad.txt. Here is the output in my terminal from when it died; it had been still for almost an hour when I ctrl+C'd to kill it.
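A quick check that may be relevant here (a sketch, not from the thread): whether the container can actually write inside the bound output directory when --containall is used, since --containall drops the usual home/tmp binds. The image name and bind paths below are the ones used elsewhere in this issue:

apptainer exec --containall \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  c-pac_latest.sif \
  sh -c 'touch /output/log/write_test && echo "container can write to /output/log"'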
I am working with my server IT team to try to diagnose this. Could there be something in how MultiProc is set up on the server that could be the issue? Thanks again.
Hi, is there anything else I can provide to help debug this? Thank you. Possibly unrelated, but when I set num_participant_to_run_at_once to >1 it still only runs 1 at a time.
Hi @Pneumaethylamine, could you tell us which job scheduler you're using?
Hi Tamsin, we don't have a separate job scheduler; we just use the Linux kernel task scheduler. Thanks :)
We've determined that this is the root of the crash. It looks like you may not have write permission for that file. If you still have the outputs from the run you attached the crashlog from, could you check the permissions in /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/ and see if anything looks strange? Something else we can look into: is your system using SELinux or AppArmor or something similar? A review of your Apptainer settings may also reveal some helpful information.
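A concrete sketch of the kind of checks being asked for here (paths taken from earlier in the thread; aa-status and getenforce assume the AppArmor/SELinux user-space tools are installed on the host):

# ownership and permissions of the log directory and the lock file
ls -la /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/
stat /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock

# is a mandatory access control system active on the host?
sudo aa-status   # AppArmor
getenforce       # SELinux (prints Enforcing/Permissive/Disabled)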
Hi Tamsin, thanks for getting to the root of it. Is there something I should be setting up re permissions before running apptainer? Below is my permissions overview for the log folder. I thought the permissions issue with the pipeline.log file was due to several processes trying to write to it at the same time, resulting in the log.lock error from before.

(base) drummondmcculloch@hevesy:/mrdata/cpac_test/outputs/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01> ls -l

As for apptainer, I am using version 1.1.6-bp155.2.18 with no plugins installed and the following cloud service endpoints:

Cloud Services Endpoints
NAME  URI  ACTIVE  GLOBAL  EXCLUSIVE  INSECURE

Keyservers
URI  GLOBAL  INSECURE  ORDER
Are there other specific settings you would like me to check? Thank you very much for your time helping with this.
We are running AppArmor and something called "lockdown". I know nothing about how these are set up, but I've messaged IT to try to learn more.
I tried disabling AppArmor and running it again, and it made no difference. We aren't running SELinux. Thanks again for your help.
The lock is designed to "simply watch[…] the existence of the lock file" while the log is being written, to avoid multiple processes writing at the same time. It seems that a process in the container is creating […]. Since you've already tried disabling AppArmor, as a next step, can you try running C-PAC as root and then updating the output file permissions at the end? Another thing you can try in the meantime is running some subjects single-threaded.
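For context, the quoted wording matches the SoftFileLock docstring from the Python filelock package, so assuming that is the locking mechanism behind pypeline.log.lock, the lock is nothing more than a file whose existence other writers wait on. A minimal sketch of that behaviour (illustrative only, not C-PAC code; the path is taken from the crash message earlier in the thread):

from filelock import SoftFileLock, Timeout

lock_path = "/output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock"
lock = SoftFileLock(lock_path)

try:
    # Acquiring simply creates the lock file; if the directory is not writable,
    # this raises PermissionError (as in the crash log) instead of waiting.
    with lock.acquire(timeout=10):
        with open(lock_path[: -len(".lock")], "a") as log:
            log.write("writing to pypeline.log while holding the lock\n")
except Timeout:
    # A stale or foreign lock file makes every other writer sit here,
    # which is one way a multi-process run can appear to freeze.
    print("lock file still exists; another process holds it or left it behind")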
Hi Tamsin, I tried to run the command with "fakeroot", i.e.:

apptainer run --fakeroot --containall …

I no longer got the crash about pipeline.log.lock, but the processing still stalled in the same place, as you can see in the pipeline.log file. I have been running subjects single-threaded, which works fine but is just very slow. I have written a shell script to open multiple instances simultaneously, but this limits group-level analyses, and I am also still unable to run surface-level analyses; I wonder if this problem is related. Please let me know what other steps I can take to try to resolve this. Thanks,
Hi @Pneumaethylamine, can you try:
to shell into a container? Then, in the container, you can run the command that’s crashing:
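(The two command blocks from this comment weren't captured here; a plausible sketch of the first, assuming the same binds as the run command quoted later in this issue, would be:)

apptainer shell --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif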
Hopefully that way, we can see more specifically what's going wrong.
Hi Tamsin, strange thing: after I posted that screenshot of it hanging at 12:12:35, it briefly restarted around 9 hours later, ran for 90 seconds and then hung again, now in a different place. It has now been stuck for 36 hours, so it probably won't restart again. We have checked the server logs and there's nothing to suggest any resource limitations or that any process was stopped/started at this time. When I run multi-threaded with surface analyses on, it hangs at a different point. Finally, when I run surface analyses single-threaded it dies in the same place highlighted above, and when I go in and run the appropriate command I get the attached error. There seem to be two problems. First, the atlas file and the functional data have different voxel dims, as you can see in this image, which explains this error; also, the FoV of the functional image doesn't extend down to the brainstem. While running: ERROR: label volume has a different volume space than data volume. Secondly, the file "task-rest01_AtlasSubcortical_s2.nii.gz" doesn't exist, presumably because it wasn't created in the previous step. I have attached the .yml I am using for the surface analyses single-threaded. I hope this is somehow clarifying! Thanks again for the help.
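As a side note, a quick way to compare the voxel grids of the atlas and the functional image is with standard FSL/FreeSurfer tools (the filenames below are placeholders for the two files in question):

# voxel counts and voxel sizes for each volume
fslhd functional.nii.gz | grep -E '^(dim|pixdim)[1-3]'
fslhd atlas.nii.gz      | grep -E '^(dim|pixdim)[1-3]'

# mri_info also prints the volume geometry that FreeSurfer compares
mri_info functional.nii.gz
mri_info atlas.nii.gz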
When I kill the hanging processes with ctrl+C, they output the following text, in case that is useful. Thanks again.
Hi, any update on this? Thanks.
Hi @Pneumaethylamine, thank you for the additional information! We are replicating on our end and will hopefully have some more clarification for you soon. Not sure how helpful this will be as you are mid-run, but for the future, using pre-computed freesurfer directories as opposed to running freesurfer in cpac will usually result in fewer debugging issues. Will get back to you as soon as possible!
Hi Tamsin, any luck with the replication? I'm not too concerned with the surface problem for now (that can be done later), but the multiple cores issue would be great to resolve. I appreciate your and the team's efforts!
Hi @Pneumaethylamine, we are still exploring potential solutions in our replication of this issue, but hopefully some of the following information will be helpful for now:
Will be in touch soon with more! Thanks for your patience.
Describe the bug
Hi,
I have managed to run the default pipeline on a single subject using 1 GB of RAM and 1 core. Now I'd like to speed things up; my server has 800 GB of RAM and 38 cores, but when I try to run the pipeline with 2 or more cores it runs for a few hours and then just stops at some point. There is no crash message and nothing in the log; it just stops doing anything.
This is how I initiate it:
apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml
and my yml is below.
I'm sorry I can't share my data, but the pipeline works fine when I change just the following options to all be 1
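(The list of options isn't reproduced here; presumably it is the resource settings in the pipeline config's system_config block. A sketch from memory of the C-PAC 1.8.x layout, so exact key names may differ in your version:)

pipeline_setup:
  system_config:
    maximum_memory_per_participant: 1   # GB of RAM per participant
    max_cores_per_participant: 1        # cores/threads per participant
    num_ants_threads: 1                 # threads used by ANTs registration
    num_participants_at_once: 1         # participants run concurrently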
The last few lines of the callback.log offer no clues; it seems to be well within its RAM limits, and when I check my server status it is not even close to having its limits pushed. There is a crash early in the process, something to do with not being able to create pypeline.log.lock, but this seems unrelated.
Occasionally it says "No resources available, potential deadlock" (see screenshot 1), but then it starts running again, presumably once other processes finish. As you can see in screenshot 2, there is no error message, but the process has been frozen for 4 hours.
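For context, that message comes, as far as I can tell, from nipype's MultiProc scheduler, which C-PAC runs on: when no queued node fits inside the configured core/memory budget it logs a potential deadlock and waits for running nodes to free resources. A minimal plain-nipype sketch (not the C-PAC invocation itself) showing where those limits are set:

from nipype import Node, Workflow
from nipype.interfaces.utility import Function

def add_one(x):
    return x + 1

node = Node(Function(input_names=["x"], output_names=["y"], function=add_one),
            name="add_one")
node.inputs.x = 1

wf = Workflow(name="demo", base_dir="/tmp/nipype_demo")
wf.add_nodes([node])

# The MultiProc plugin schedules nodes within these limits; when nothing queued
# fits, it logs "No resources available, potential deadlock" and waits.
wf.run(plugin="MultiProc", plugin_args={"n_procs": 2, "memory_gb": 8})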
Thank you very much for the help!
crash-20240722-124119-drummondmcculloch-resampled_T1w-brain-template-20e94696-d2d1-4d41-9a1a-8948c82559fd.txt
callback.log
pypeline.log
To reproduce
No response
Preconfig
Custom pipeline configuration
pipeline_config.docx
Run command
apptainer run --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant --participant_label sub-56061 --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml
Expected behavior
The expected outputs, and for it to run without stopping
Acceptance criteria
For it to run without stopping.
Screenshots
C-PAC version
v1.8.7.dev1
Container platform
No response
Docker and/or Singularity version(s)
Apptainer 1.1.6-bp155.2.18
Additional context
No response