-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deciding when to exclude primers #99
Comments
The logic for identifying primers before cylon is run is different than the later primer id logic that we use to mask primer positions in the consensus. Unfortunately, if a fragment ends between two alternative primers and the adapter sequence happens to be similar enough to the reference not to be soft-clipped we have no way to be sure if it's legitimately a fragment that ends in that region, or a fragment with adapter noise that ends there. Luckily we have a couple advantages over the data here:
In summary, there are going to be annoying sequencing artefacts that we need to catch so we have to use additional polishing and QC filters. Simulated test cases are very helpful for identifying these kinds of artefacts and testing at the component level. Since there isn't much that we can do in this situation, the best we can do is tweak our QC model to reduce the impact. To adjust the QC thresholds we should make sure that we look at real world examples of this. Do we have a way to identify this problem in real data? |
This issue is solely about how reads are allocated to amplicons/primers, and then deciding which primers to reject. It's not about QC. The reads unambiguously are from
The depth of the remainder of the amplicon, and indeed past it, is zero. Therefore The only noise here is the depth of 2, 1, 1, 1 at positions 23058-23061. The expected behaviour is that primer |
Before running cylon we detect which primers are present for each amplicon. We do this because we want to include cases where both alternate primers are present in the sequencing data. We construct a histogram of observed primers for each amplicon and use this to decide which ones are actually there. Unfortunately, it appears that it is possible for adapter sequences to trick this stage. Luckily, we have downstream QC and tighter primer id to compensate for these cases. |
Firstly, the code should work on these simulated reads and exclude primer Secondly, there's a minimal example of simulated nanopore reads attached that reproduces this error. Perfect reads from amplicon 75 and from amplicon 76. All reads span exactly from the start of a left primer to end of a right primer. Reads do not have any adapter and they are literally identical to the reference genome. They include each primer to its end, and have zero extra bases past the primer ends. The reads are at 200X, with 100X from each strand. Amplicon 75 reads are from primer Expected behaviour: the primer Observed behaviour: the amplicon coords given to cylon are:
which means that the primer |
Thank you for the example. Please note that there is a bit more logic behind identifying the target primers: https://github.com/jeff-k/viridian_workflow/blob/6ea9cdbf7660508b41cf0def51b17c3ae4626054/viridian_workflow/readstore.py#L320 In short, we reason about the "primer region" during this first amplicon id pass, that being the interval spanned by When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. This is to safely accommodate data which may be fragmented. Please note that these preliminary primer bounds are not related to the consensus masking or qc processes. So the primers are not really "kept". Is this behaviour causing problems for cylon? |
@jeff-k - referring to this comment of yours (my bold) "When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. " i'm confused. One of the primers has coverage of 200x and the alt has zero. So, the alt should be excluded. What do you mean neither of the primers is above the threshold of 100? |
No, i'm doubly confused.
|
Whoops, I interpreted the attached file as a fastq based on the extension and only counted half the records. Either way, yes, for point 2 this code appears to be working as described. |
I don't know if this is agreeing that there is a bug or disagreeing that there is bug and that the code does not need changing: The right primer is not being excluded. As described below. Amplicon SARS-CoV-2_76 has two right primers:
Cylon is given this:
The Expected behaviour is: Months ago it was agreed that this would be part of the method. When there is:
then all other alt primers there with no support would be excluded from that point onwards. |
OK that answers my question here I looked at that json and checked the right primer start coord which seemed to correctly be at 23028. I missed your main point was that the end was at 23140, which is the bug. That end cpord needs to be at the right hand end of 76-right, not 76-right-alt1 |
So, I withdraw my question 2, and now I see the bug again, sorry for confusion |
Ah, I see the confusion. The Note that it would be somewhat redundant if all four fields represented the same data (+/- primer lengths). Since the data required by cylon is available in the fields that are specifically labelled as the primer coords I believe that this issue can be closed. |
It can't be closed because:
|
I have double checked the primer data source and I can confirm that the right-most bound of the amplicon (that is, considering all possible amplicons) is correctly propagated to the end field. In addition to this the coordinates of the observed primers are correctly identified in the primer start and end fields. In this case that means excluding the coords of alt1. This has always been the specified behaviour of these fields and it would not be appropriate to relabel the JSON output that we have been generating already. |
ok i'd like to have a conversation about this; i'm not feeling great today, so i suggest the conversation happens on friday. i'd like this to stay open til we have had that conversation. |
We have examples where there is a choice of two primers for the end of a given amplicon. Only one primer of the two primers has support from the reads.
Observed behaviour: Both primers are kept, instead of excluding the primer that isn't supported by reads. Cylon is given amplicon coords as if both primers are present (and presumably everything downstream also thinks both primers there)
Expected behaviour: The primer not supported by reads is excluded.
Details are below for one sample.
Path on codon:
/nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/vwf.20220926.ccf5ef4676.ont.frs0.6/
.Current working directory was
/nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/
when running it.Version of viridian workflow: ccf5ef4
The primer scheme is ARTIC 4.1. The run is using simulated Nanopore reads.
Amplicon
SARS-CoV-2_76
is present at around 380X depth.The next amplicon
SARS-CoV-2_77
is not present - it has zero reads.Amplicon
SARS-CoV-2_76
has two right primers:SARS-CoV-2_76_RIGHT
at coords 23029-23057.SARS-CoV-2_76_RIGHT_alt1
at coords 23121-23141.Read mapping shows that:
SARS-CoV-2_76
all come from the primerSARS-CoV-2_76_RIGHT
.SARS-CoV-2_76_RIGHT_alt1
.Excerpt from
samtools depth -aa
on the BAM made by viridian (after coordinate sorting it):and continues at zero depth until around the start of amplicon 78:
cylon is given this in the file
ampicons.json
forSARS-CoV-2_76
:This means that primers
SARS-CoV-2_76_RIGHT
andSARS-CoV-2_76_RIGHT_alt1
have been counted as present in the reads. Expected behaviour is that onlySARS-CoV-2_76_RIGHT
is present.Screenshot attached showing the mapped reads in this region. Amplicon SARS-CoV-2_76 is highlighted (it also has two left primers. The highlighted part is using the "inner" of the two primers at each end of the amplicon).
The text was updated successfully, but these errors were encountered: