Multithreading bug when using log-normal sampling #164

Open
2tony2 opened this issue Apr 18, 2022 · 3 comments

2tony2 commented Apr 18, 2022

Hi there!

I was running some simulations today with NanoSim and stumbled upon an issue that I thought was worth pointing out here.

First of all, I was using a custom-trained model with the min, max, med, and sd options in simulator.py genome mode, as follows:

simulator.py genome -rg test.fa -c "nanosim_model/testmodel" -n 228 -b "guppy" -s "0.5" -dna_type "linear" -t "8" --fastq -k 6 -o simulated/out.fastq --perfect -min $sequence_min -max $sequence_max -med $sequence_length -sd 0.1

From my understanding of the source code, this samples fragment sizes from a log-normal distribution rather than from the kernel learned from the training data; the log-normal had more desirable properties for my task at hand.

It worked at 1 and 100 reads but not at 1,000. After some testing, the limit seemed to be 228 reads; the following error pops up when trying 229 reads on 8 threads:

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/miniconda3/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/miniconda3/bin/simulator.py", line 1169, in simulation_aligned_genome
np.random.lognormal(np.log(median_l), sd_l, remaining_segments)
File "mtrand.pyx", line 3015, in numpy.random.mtrand.RandomState.lognormal
File "_common.pyx", line 598, in numpy.random._common.cont
ValueError: maximum supported dimension for an ndarray is 32, found 33

My interpretation of this is as follows: when sampling from the log-normal distribution, each thread builds an ndarray, and the error points to a limit of 32. NanoSim uses the specified thread count minus one for this step, so 7 threads here, which means 228 reads divided over 7 threads gives roughly 32 per thread, right at that limit. I tested with a few different thread counts and this hypothesis seems to hold up.
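For what it's worth, here is a minimal snippet that reproduces the same error outside NanoSim, under the assumption that remaining_segments on line 1169 reaches NumPy as a sequence rather than an int (NumPy reads a sequence passed as the size argument as an array shape, and shapes are capped at 32 dimensions):

import numpy as np

# Minimal reproduction sketch (not NanoSim's code): a 33-element sequence
# as `size` requests a 33-dimensional array, which exceeds NumPy's cap
np.random.lognormal(np.log(1500), 0.1, [1] * 33)
# ValueError: maximum supported dimension for an ndarray is 32, found 33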

I'm no expert at programming multithreaded applications with NumPy, so I don't know whether this has a straightforward solution, but I just wanted to point it out so you are aware. Maybe this could help?
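In case it's useful, here is a possible fix sketch with assumed names (not the actual NanoSim source): if remaining_segments really is a per-read list of segment counts, drawing a flat 1-D array of the total count sidesteps the shape limit entirely:

import numpy as np

# Assumed example input: one entry per read, as hypothesized above
remaining_segments = [1] * 33
total = int(np.sum(remaining_segments))  # total number of samples needed
# A plain int `size` yields a 1-D array of any length, so no 32-dim cap
lengths = np.random.lognormal(np.log(1500), 0.1, total)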

SaberHQ (Collaborator) commented Apr 19, 2022

Hey Tony @2tony2 ,

Thanks a lot for reporting this; I really appreciate it. I will definitely look into it with my colleagues and we will get back to you. Let me see if I can reproduce the issue. I'll keep you updated.

By the way, have you ever tried running without setting the min, max, med, and sd parameters? What about without the --perfect option? I wonder whether it works in those cases.

Pinging @cheny19 for her thoughts on this.

2tony2 (Author) commented Apr 19, 2022

Yes, I did run it without the min, max, med, and sd options, but for this specific dataset it hangs for some of the references used.

I think this is because I trained the model on data whose read length distribution falls outside the range of some of the references being tested. From looking at the code, my hunch is that the while loop keeps running forever because the main for loop inside it is exhausted without ever satisfying the condition. Ideally you'd want to raise some sort of exception when the for loop is exhausted to point this out; a sketch of what I mean is below.
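Something along these lines, with assumed names (not NanoSim's actual source):

import numpy as np

# If the learned length distribution never produces a length inside
# [min_len, max_len], an unguarded while loop would spin forever.
# Raising when an entire batch yields nothing makes the mismatch visible.
rng = np.random.default_rng()

def sample_lengths(n_reads, min_len, max_len, batch=1000):
    accepted = []
    while len(accepted) < n_reads:
        draws = rng.lognormal(np.log(10_000), 1.0, batch)  # assumed length model
        hits = draws[(draws >= min_len) & (draws <= max_len)]
        if hits.size == 0:
            raise RuntimeError(
                f"no sampled length fell inside [{min_len}, {max_len}] "
                f"after {batch} draws; model and reference may be mismatched"
            )
        accepted.extend(hits.tolist())
    return accepted[:n_reads]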

This is why the log-normal sampling should work fine (and it does, as long as you don't simulate too many reads), since it doesn't depend on the read length distribution of the original training dataset.

I'm not sure whether I tried this specific configuration without the --perfect option. I can give it a try, but this does seem specifically like a threading issue.

lauradunphy commented May 12, 2022

Hi!

I was wondering if there had been any updates to this issue?

In case it helps, I ran into the same error with the command below:

simulator.py genome -dna_type linear -rg ref.fasta -c models/ecoli_R9_2D/ecoli_R9_2D -o simulations/output -s 1 -n 10000 -med 1500 -sd 1 --perfect

Essentially, I am trying to simulate reads with a much shorter median read length than the model I trained on (aiming for a median of 1,500 bp instead of the model median of ~10 kbp). The error only occurs if the number of reads (-n) is set greater than 32. I think the error has something to do with setting the median and sd, because without these flags the simulations run as expected.
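The 32 boundary would also match NumPy's cap on array dimensions, if (as an assumption about the internals) the failing call passes a per-read sequence as NumPy's size argument:

import numpy as np

# 32 entries -> a 32-dimensional shape, which NumPy still allows
np.random.lognormal(np.log(1500), 1.0, [1] * 32)
# 33 entries -> ValueError: maximum supported dimension for an ndarray is 32, found 33
np.random.lognormal(np.log(1500), 1.0, [1] * 33)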

Additional tests:

  • Simulations work when I decrease the number of reads to less than or equal to 32 (-n 30)

simulator.py genome -dna_type linear -rg ref.fasta -c models/ecoli_R9_2D/ecoli_R9_2D -o simulations/output -s 1 -n 30 -med 1500 -sd 1 --perfect

  • Removing the --perfect flag but keeping the -med 1500 and -sd 1 causes the code to run for an indeterminate length of time.

simulator.py genome -dna_type linear -rg ref.fasta -c models/ecoli_R9_2D/ecoli_R9_2D -o simulations/output -s 1 -n 10000 -med 1500 -sd 1

  • Removing -med and -sd, the simulation works as expected both with and without the --perfect option

simulator.py genome -dna_type linear -rg ref.fasta -c models/ecoli_R9_2D/ecoli_R9_2D -o simulations/output -s 1 -n 10000 --perfect

This was done using NanoSim v3.1.0 (reinstalled yesterday from conda).
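As a stopgap while this is open, a batched run from Python should stay under the ceiling; this is only a sketch assuming the ~32-read limit above, with per-batch output prefixes that would need merging afterwards:

import subprocess

# Split a large run into batches of 30 reads (below the observed limit),
# reusing the flags from the commands above
total, batch = 10_000, 30
for i in range(0, total, batch):
    subprocess.run(
        ["simulator.py", "genome",
         "-dna_type", "linear",
         "-rg", "ref.fasta",
         "-c", "models/ecoli_R9_2D/ecoli_R9_2D",
         "-o", f"simulations/output_batch{i // batch}",  # per-batch prefix
         "-s", "1",
         "-n", str(min(batch, total - i)),
         "-med", "1500", "-sd", "1",
         "--perfect"],
        check=True,
    )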

Many thanks,
Laura
