simulator.py taking very long to run and RAM usage above 768 GB #76

Closed
andrese52 opened this issue Dec 29, 2019 · 8 comments

@andrese52

I did the characterization with E. coli and an SRA run from NCBI, then used the generated profile in simulator.py. Everything works well when -med and -sd are not used. However, when I request a median of 8000 and an sd of 200, the simulation gets stuck and runs for a very long time. After a few hours it consumes all available RAM and the job is killed by our HPC scheduler.

See below the code being used:

simulator.py genome -n 2700 -med 8000 -sd 200 -r test-10kb.fasta -o genome-10kb -c nanosim_profile_new/ecoli --seed 974839895 -t 32

Any advice is greatly appreciated.

@cheny19
Collaborator

cheny19 commented Dec 30, 2019

Hi Andres,

The problem is with -sd. It is the standard deviation of the underlying log-normal distribution (i.e. on the log scale), not of the read-length distribution itself, so you need to convert your desired value accordingly (see the Wikipedia article on the log-normal distribution). If -sd is too large, the simulator generates some extremely long or short sequences, which are then discarded because they are longer than the genome or shorter than the minimum threshold.

Let me know if you have further questions.

Chen
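
A minimal sketch of the conversion Chen refers to, assuming -med is the target median read length and -sd is the sigma of the underlying normal (log-scale) distribution, as described above. It uses the standard log-normal identities median = exp(mu) and variance = (exp(sigma^2) - 1) * exp(2*mu + sigma^2); the helper name below is hypothetical, not part of NanoSim.

import math

def lognormal_sigma(median, target_sd):
    """Convert a target read-length median and standard deviation
    into the sigma of the underlying normal (log-scale) distribution."""
    # With mu = ln(median) and t = exp(sigma^2), the log-normal variance
    # identity gives (target_sd / median)^2 = t^2 - t, a quadratic in t.
    ratio2 = (target_sd / median) ** 2
    t = (1.0 + math.sqrt(1.0 + 4.0 * ratio2)) / 2.0
    return math.sqrt(math.log(t))

# Example: a median of 8000 bp with a read-length sd of 4000 bp
# corresponds to a log-scale sigma of roughly 0.43.
print(lognormal_sigma(8000, 4000))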

@andrese52
Author

Hi Chen,
Yes, could you please provide a working example for such cases? The default examples in the README.md do not include -med or -sd.

Say we want a median of 8000; what -sd would you suggest for simulating a 10 kb genome?

Thank you
Andres

@cheny19
Collaborator

cheny19 commented Jan 22, 2020

Sorry for the late reply. The standard deviation is independent of genome size; it depends purely on how much you want the read lengths to spread. I'd suggest starting with an -sd of 1.05 or 1.1.
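
For illustration only, the original command from this thread with the suggested starting value plugged in (everything else unchanged from the earlier comment):

simulator.py genome -n 2700 -med 8000 -sd 1.05 -r test-10kb.fasta -o genome-10kb -c nanosim_profile_new/ecoli --seed 974839895 -t 32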

cheny19 closed this as completed Jun 9, 2020
@HLHsieh

HLHsieh commented Mar 30, 2023

Hi @cheny19,

I had a similar issue. Compared to the default settings, simulator.py takes very long to run with -med 20000 -sd 4. I am trying to simulate reads with median = 20 kb and std = 10 kb. I would appreciate any advice.

Many thanks,
Hsin

@kmnip
Collaborator

kmnip commented Mar 31, 2023

@HLHsieh
Can you please report your exact command?

@HLHsieh

HLHsieh commented Mar 31, 2023

@kmnip

I executed the following

~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x -med 20000 -sd 4 --seed 100 -b guppy

My goal is to simulate reads with a length distribution of median = 20 kb and std = 10 kb.

I also tried the same command with the default median and std, and it ran smoothly.

~/bin/NanoSim/src/simulator.py genome -rg ~/mock_genome/D4Z4_p1.fasta -c ~/bin/NanoSim/pre-trained_models/human_NA12878_DNA_FAB49712_guppy/training -t 20 -n 2000000 -o D4Z4_p1_NanoSim_100x --seed 100 -b guppy

Please advise. Thanks!
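
Applying the log-normal conversion sketched earlier in the thread (an assumption based on Chen's explanation, not a confirmed NanoSim recipe), a median of 20 kb with a read-length standard deviation of 10 kb corresponds to a log-scale sigma of roughly 0.43, not 4; with -sd 4 the distribution yields many extreme lengths that get discarded, which is consistent with the long runtime reported here. A self-contained check:

import math

median, target_sd = 20000, 10000
t = (1.0 + math.sqrt(1.0 + 4.0 * (target_sd / median) ** 2)) / 2.0  # t = exp(sigma^2)
sigma = math.sqrt(math.log(t))
print(round(sigma, 2))  # ~0.43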

@HLHsieh

HLHsieh commented Jul 10, 2024

Hi @kmnip,

I would like to follow up on this issue. Any suggestions would be appreciated.

P.S. My NanoSim version is 3.1.0.

Best,
Hsin

@kmnip
Collaborator

kmnip commented Jul 11, 2024

@HLHsieh Let's continue in your other thread:
#210
