simulating nanopore samples with a fixed mean read length (to achieve a wanted coverage value) #241
Comments
Hello @dlaehnemann, thank you so much for your interest in NanoSim. We really appreciate your explanation and code to work around the number of reads to achieve the aimed coverage. Your code is particularly useful for those who want to use a customized mean, maximum, or minimum length with a specified standard deviation and fit a mixture model on the given parameters. Thank you for your contribution. I would also like to inform you that we are currently understaffed and, because of that, were unfortunately not able to integrate your code into NanoSim yet. However, as another approach, we were able to approximate the mean coverage based on the kernel density estimation (KDE) functions returned by

Again, we deeply appreciate your invaluable comments and feedback on NanoSim. We hope to improve our coverage estimation algorithms with the addition of mean length specifications. However, due to our limited bandwidth and having an implementation for the empirical data which produces robust coverage output for different cases including different mean lengths, I am closing this issue. Thank you so much for your support and contributions to NanoSim.
Many thanks for checking in here, and no need to include any of this into your own code. It was mainly for documenting what I found out, for future-me and possibly others who might find this useful. But great that you already have a standard solution for this implemented right at this time, what a coincidence time-wise! And then just two questions to make sure I understand usage of the new
If I fully understand this, I'll probably go and edit my original issue to include this as the default solution at the start of the post. Seems much easier to use... 😅
Thank you so much for your comments and questions @dlaehnemann. I apologize for not being clear about those!
Many thanks for your interest in NanoSim and your clarifications on the issue. Much appreciated.
No need to apologize, just wanted to make sure I understood the implications correctly. So maybe I was being a bit pedantic. But I'll add a quick edit to the start of my issue, so people immediately see the new
EDIT:
If you came here to find out how to simulate samples with a specific target coverage, NanoSim has this functionality as of version v3.2.3. Just look for the new --coverage command-line argument that @berkeucar pointed out right below. This will use the distributions from the (pre-trained) model internally.

Some interesting background on how it works is in the pull request introducing the --coverage command-line argument. And @berkeucar also nicely explained the possible interactions with other command-line arguments further below.

=== original issue before the EDIT based on the below discussion ===
It took me a while to figure out how to do this, so I'm documenting it here.
Basically, I want to specify a mean_coverage across the given reference sequence and the mean_read_length. But nanosim expects a read --number, a --median_len of reads and a --sd_len on a lognormal scale. Thanks @cheny19 for pointing out the lognormal distribution behind all this, which was essential to understand in order to get from my given values to the command-line arguments.

And here's the graphic that helped me understand how the normal and the lognormal distribution fit together:
https://en.wikipedia.org/wiki/Log-normal_distribution#/media/File:Lognormal_Distribution.svg
The most important takeaways for me were that:
1. mu corresponds to the median of the lognormal distribution, which is e^mu.
2. The mean_read_length I want to specify is instead E[X] = e^(mu+1/2*sigma^2).

To get from the specified values to the command-line arguments, we thus need to take several steps. Let's assume we deal with a reference sequence that is 20,000 bp long and set the following:

We can then already determine the --number of reads required with the following function:

In this case, this will yield:
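A minimal sketch of this step in Python, assuming the read number is simply the total number of bases needed (mean coverage times reference length) divided by the mean read length, rounded up. The function name determine_read_number() is the one referred to further below; the values mean_coverage = 30 and mean_read_length = 5000 are illustrative assumptions, picked so that the result matches the 120 reads mentioned later on.

```python
import math


def determine_read_number(reference_length, mean_coverage, mean_read_length):
    """Number of reads needed so that their summed length covers the
    reference mean_coverage times, on average."""
    total_bases = mean_coverage * reference_length
    return math.ceil(total_bases / mean_read_length)


# Illustrative values (assumptions, not taken from the original post).
reference_length = 20_000
mean_coverage = 30
mean_read_length = 5_000

number = determine_read_number(reference_length, mean_coverage, mean_read_length)
print(number)  # 120
```

The result is what would be passed to nanosim as --number.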
However, the trickier part is getting to values for --sd_len and --median_len. To be honest, I did not get an intuitive understanding of what useful values of --sd_len are, apart from the fact that bigger values mean a wider / flatter distribution. So I decided to fix this to a value in the range recommended by @cheny19 in a different issue:

I had previously tried it with larger values, and this very quickly leads to excessive runtimes and memory usage by nanosim. There are at least two issues that were probably ultimately filed due to higher values of --sd_len: #76 #210

With this fixed, we can then calculate the --median_len required by nanosim by working with the equation from 2. above to get to the value e^mu in 1. above, which we determined to be that --median_len:

In our example case, this would lead to a value of:
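Spelled out: from E[X] = e^(mu+1/2*sigma^2) it follows that e^mu = E[X] / e^(1/2*sigma^2), with sigma being the --sd_len. A minimal Python sketch of median_len() along these lines, where sd_len = 1.0 and mean_read_length = 5000 are purely illustrative assumptions and not necessarily the values used in the original post:

```python
import math


def median_len(mean_read_length, sd_len):
    """Median e^mu of a lognormal distribution with the given mean E[X]
    and sigma = sd_len, using E[X] = e^(mu + 1/2 * sigma^2)."""
    return mean_read_length / math.exp(0.5 * sd_len ** 2)


# Illustrative values (assumptions, not taken from the original post).
mean_read_length = 5_000
sd_len = 1.0

print(round(median_len(mean_read_length, sd_len), 2))  # 3032.65
```

Note that the median is always below the mean here, and the gap grows quickly with larger sd_len values.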
As a control, one can draw a large number of reads from that distribution with the formula used for genome-based simulation in the nanosim code and the length-filtering based on the longest chromosome in a reference fasta file and see that this comes out at roughly the wanted coverage:
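A sketch of such a control in Python. It does not reproduce the nanosim code; it assumes that read lengths are drawn from a lognormal distribution with mu = ln(median_len) and sigma = sd_len, and that drawn lengths longer than the reference are simply dropped. The function names simulate_read_lengths() and summarize() and all concrete values are hypothetical and only serve the illustration.

```python
import math
import random


def simulate_read_lengths(median_len, sd_len, number, reference_length, seed=None):
    """Draw `number` lognormal read lengths and keep those that fit
    into the reference (assumed filtering: longer draws are dropped)."""
    rng = random.Random(seed)
    mu = math.log(median_len)  # the median of a lognormal is e^mu
    drawn = (rng.lognormvariate(mu, sd_len) for _ in range(number))
    return [length for length in drawn if length <= reference_length]


def summarize(lengths, reference_length):
    """Return (mean read length, mean coverage) of the kept reads."""
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    coverage = sum(lengths) / reference_length
    return mean_length, coverage


# Illustrative values (assumptions, not taken from the original post).
reference_length = 20_000
sd_len = 1.0
median = 5_000 / math.exp(0.5 * sd_len ** 2)

lengths = simulate_read_lengths(median, sd_len, 120, reference_length)
mean_length, coverage = summarize(lengths, reference_length)
print(f"kept {len(lengths)} reads, mean length {mean_length:.0f}, coverage {coverage:.1f}x")
```

No seed is set in the example call, so repeated runs give different numbers.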
In the example given here, you will see that repeated calls will give mean read length values that (i) vary quite a lot, and that (ii) never really come close to the wanted mean_read_length we started out with. The variation of (i) is due to only generating a total of 120 (potential) reads, so we are not doing that many draws from the underlying distribution. And problem (ii) is due to the length filtering of generated read lengths to the maximum reference length, which even happens with -dna_type circular specified. Thus, to reliably get the intended coverages, you will have to make sure that the mean_read_length you specify is always a lot shorter than the reference_length. I'd say at least an order of magnitude, but the more the better.

Also, for cross-reference, these are the places where I implemented this in a simulation workflow:
- --number of reads required, given reference length, mean coverage and mean read length (determine_read_number() above).
- --median_len value from a given --sd_len value and the mean read length (median_len() above).

In the workflow, I then do mapping of the resulting reads and coverage calculations for quality control.
I hope this is helpful for others.