
simulating nanopore samples with a fixed mean read length (to achieve a wanted coverage value) #241

Closed
dlaehnemann opened this issue Feb 3, 2025 · 4 comments



dlaehnemann commented Feb 3, 2025

EDIT:
If you came here to find out how to simulate samples with a specific target coverage, NanoSim has this functionality as of version v3.2.3. Just look for the new --coverage command-line argument that @berkeucar pointed out right below. It will use the distributions from the (pre-trained) model internally.

Some interesting background on how it works is in the pull request introducing the --coverage command-line argument. And @berkeucar also nicely explained the possible interactions with other command-line arguments further below.


=== original issue before the EDIT based on the below discussion ===
It took me a while to figure out how to do this, so I'm documenting it here.

Basically, I want to specify a mean_coverage across the given reference sequence and the mean_read_length. But NanoSim expects a read --number, a --median_len of reads, and an --sd_len on a lognormal scale. Thanks @cheny19 for pointing out the lognormal distribution behind all this, which was essential to understand in order to get from my given values to the command-line arguments.

And here's the graphic that helped me understand how the normal and the lognormal distribution fit together:
https://en.wikipedia.org/wiki/Log-normal_distribution#/media/File:Lognormal_Distribution.svg

The most important takeaways for me were that:

  1. The mean mu of the underlying normal distribution corresponds to the median of the lognormal distribution, which is e^mu.
  2. The mean of the lognormal distribution, which is the mean_read_length I want to specify, is instead E[X] = e^(mu + 1/2 * sigma^2).
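These two relationships can be checked numerically by drawing from a lognormal distribution (a quick sanity-check sketch, not NanoSim code; mu and sigma here are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 8.0, 1.1
samples = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

# 1. the median of the lognormal distribution is e^mu
print(np.median(samples), np.exp(mu))
# 2. the mean of the lognormal distribution is e^(mu + 1/2 * sigma^2)
print(samples.mean(), np.exp(mu + sigma**2 / 2))
```

With a million draws, both empirical values land within a few percent of the closed-form expressions.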

To get from the specified values to the command-line arguments, we thus need to take several steps. Let's assume we deal with a reference sequence that is 20,000 bp long and set the following:

mean_coverage=30
mean_read_length=5000
reference_length=20000

We can then already determine the --number of reads required with the following function:

import numpy as np
def determine_read_number(reference_length, mean_coverage, mean_read_length):
    # round up, so we never fall short of the target coverage
    return int( np.ceil( reference_length * mean_coverage / mean_read_length ) )

In this case, this will yield:

determine_read_number(20000, 30, 5000)
120

However, the trickier part is getting to values for --sd_len and --median_len. To be honest, I did not develop an intuitive understanding of what useful values of --sd_len are, apart from the fact that bigger values mean a wider / flatter distribution. So I decided to fix it to a value in the range recommended by @cheny19 in a different issue:

sigma=1.1

I had previously tried larger values, and this very quickly leads to excessive runtimes and memory usage in NanoSim. There are at least two issues that were probably ultimately filed due to higher values of --sd_len: #76 #210

With sigma fixed, we can then calculate the --median_len required by NanoSim by rearranging the equation from takeaway 2 above to solve for the value e^mu from takeaway 1, which we determined to be exactly that --median_len:

E[X] =  e^( mu + 1/2 * sigma^2 ) = e^mu * e^( 1/2 * sigma^2 )
e^mu = E[X] / e^( 1/2 * sigma^2 )
def median_len(mean_read_length, sigma):
    return int(mean_read_length/np.exp(1/2 * sigma**2))

In our example case, this leads to a value of:

median_len(5000, 1.1)
2730
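As a quick round-trip check (my own sketch, not part of NanoSim): multiplying the computed --median_len back by e^(1/2 * sigma^2) should recover roughly the intended mean read length, off only by the int() truncation.

```python
import numpy as np

def median_len(mean_read_length, sigma):
    return int(mean_read_length / np.exp(1/2 * sigma**2))

m = median_len(5000, 1.1)               # 2730
recovered = m * np.exp(1/2 * 1.1**2)
print(m, recovered)                     # recovered is close to 5000 again
```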

As a control, one can draw a large number of read lengths from that distribution, using the formula from NanoSim's genome-based simulation code, apply the length filtering based on the longest chromosome in the reference FASTA file, and check that this comes out at roughly the wanted mean:

def mean_read_length_control(median_len, sigma, read_number, reference_length):
    # draw read lengths as in NanoSim's genome-based simulation, then
    # discard any read longer than the reference (NanoSim's length filter)
    reads = [ x for x in np.random.lognormal(np.log(median_len + sigma ** 2 / 2), sigma, read_number) if x < reference_length ]
    return sum( reads ) / len(reads)

In the example given here, you will see that repeated calls give mean read length values that (i) vary quite a lot, and that (ii) never really come close to the wanted mean_read_length we started out with. The variation in (i) comes from generating only 120 (potential) reads in total, so we are not doing that many draws from the underlying distribution. Problem (ii) is due to the filtering of generated read lengths to the maximum reference length, which happens even with --dna_type circular specified. Thus, to reliably get the intended coverages, you have to make sure that the mean_read_length you specify is always a lot shorter than the reference_length; I'd say at least an order of magnitude, but the more the better.
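Problem (ii) is easy to see with many draws. The sketch below uses the example's parameters and, for simplicity, parameterises the lognormal directly with log(median_len) (NanoSim's code adds sigma ** 2 / 2 inside the log, which barely changes the result at these values):

```python
import numpy as np

rng = np.random.default_rng(0)
median_len_value, sigma, reference_length = 2730, 1.1, 20_000

# one million draws instead of 120, to remove the sampling noise of (i)
reads = rng.lognormal(np.log(median_len_value), sigma, size=1_000_000)
print(reads.mean())                     # close to the requested mean of 5000

# NanoSim's length filter discards reads longer than the reference,
# which cuts off the long right tail of the lognormal distribution
kept = reads[reads < reference_length]
print(kept.mean())                      # noticeably shorter than 5000
```

Because the filter removes only the longest reads, the realised mean always ends up below the requested one; the effect shrinks as the reference grows relative to the mean read length.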

Also, for cross-reference, these are the places where I implemented this in a simulation workflow:

  1. Calculating the --number of reads required, given reference length, mean coverage, and mean read length (determine_read_number() above).
  2. Calculating the --median_len value from a given --sd_len value and the mean read length (median_len() above).

In the workflow, I then do mapping of the resulting reads and coverage calculations for quality control.

I hope this is helpful for others.

@berkeucar (Contributor) commented:

Hello @dlaehnemann,

Thank you so much for your interest in NanoSim. We really appreciate your explanation and code for working out the number of reads needed to achieve the aimed coverage. Your code is particularly useful for those who want to use a customized mean, maximum, or minimum length with a specified standard deviation and fit a mixture model to the given parameters. Thank you for your contribution. I would also like to let you know that we are currently understaffed, and because of that we unfortunately have not been able to integrate your code into NanoSim yet.

However, as another approach, we are able to approximate the mean coverage based on the kernel density estimation (KDE) functions returned by the read_analysis.py script. NanoSim uses these KDE functions, which are saved in the output directory of read_analysis.py and passed to simulator.py via the --model_prefix flag, unless the user specifies a --mean_len. So, if you have empirical data for the ONT sequencer and sequencing chemistry of your choice, you can now use the --coverage flag, coupled with the --model_prefix flag pointing to your training output. Please check our latest version, 3.2.3, for more details.

Again, we deeply appreciate your invaluable comments and feedback on NanoSim. We hope to improve our coverage estimation algorithms with the addition of mean length specifications. However, given our limited bandwidth, and since the implementation for empirical data produces robust coverage output for different cases, including different mean lengths, I am closing this issue.

Thank you so much for your support and contributions on NanoSim.
Berke

@dlaehnemann (Author) commented:

Many thanks for checking in here, and no need to include any of this in your own code. It was mainly for documenting what I found out, for future-me and possibly others who might find this useful. But it's great that you already have a standard solution implemented right at this time, what a coincidence time-wise!

And then just two questions to make sure I understand usage of the new --coverage flag:

  1. Will this also work with models trained using older versions of NanoSim? And/or is there a (rough) minimum version of NanoSim that should have been used for the training, if I want the --coverage estimate to work?
  2. Otherwise, the only requirement is to NOT use any of the --median_len, --min_len, --max_len, or --sd_len options that could otherwise be used to modify the underlying distribution, right?

If I fully understand this, I'll probably go and edit my original issue to include this as the default solution at the start of the post. Seems much easier to use... 😅

@berkeucar (Contributor) commented:

Thank you so much for your comments and questions @dlaehnemann. I apologize for not being clear about those!

  1. I believe it should work with models from the older versions as well. If you are interested in using the --coverage flag for transcriptomic reads, the minimum version is v2.5. Other than that, I do not think there is a minimum version for genomic reads.

  2. The new --coverage flag should work with all the remaining parameters except the -n parameter, which would be overridden by the calculated number of reads. However, if you specify --median_len, --min_len, --max_len, or --sd_len with a value that is drastically different from the experimental values, the --coverage flag may not estimate the number of reads as accurately as it normally would.

Many thanks for your interest in NanoSim and your clarifications on the issue. Much appreciated.
Berke

@dlaehnemann (Author) commented:

No need to apologize, just wanted to make sure I understood the implications correctly. So maybe I was being a bit pedantic. But I'll add a quick edit to the start of my issue, so people immediately see the new --coverage functionality.
