Symbolic alleles in VCF output #40
Comments
Hi @bartcharbon, thanks for checking out the VCF output.
Hi @readmanchiu, attached is a zip file with an example VCF with symbolic alleles. I added the Straglr TSV for the same data as well.
Thanks @bartcharbon for the example. Is the motivation for outputting symbolics mostly to avoid the messiness of outputting the sequence? 'Cuz the information it conveys is already reported in the genotype column.
I think that some of the older tools around (mostly short-read tools, for example ExpansionHunter) report it like this; as a result, some of our tooling for downstream analysis expects this kind of output, and the analysts in the lab are used to this notation. But you also make a valid case for using the actual sequence. I'll take a new look with that in mind, to see if this might actually fit in our pipeline.
I'm close to finishing the implementation of this option to output symbolic alleles. My assumption is that this will only be feasible for cases where the detected alleles have the same motif as the reference. But for cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does that sound alright, or is there a symbolic-allele way to deal with this?
Great that you are implementing this feature, thanks! Looking at the VCF 4.2 spec, symbolic alleles are meant for "imprecise structural variants". Based on that, I think symbolic alleles may be used even with a non-reference motif: since the ALT repeat motif is also in the output FORMAT fields, all the information is still available in the VCF file. Disclaimer: I'm no expert on STRs in VCF; this issue is based on differences we noticed compared to some other tools we use or used before.
OK, I guess then the copy number reported in symbolic alleles will refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated, with interruptions and whatnot; these will be handled in a later release. I will update the documentation accordingly.
Is there anywhere in the output where the actual sequence is shown? With the new VCF format in Straglr 1.5.2, the VCF file suggests that both alleles have the GAA pattern, which is not true.
Allele1 (random read)
Allele2 (random read)
Is there a workaround? The VCF files now feel a bit 'broken' to me, since I don't know the actual variant.
Thanks @ljohansson for showing me an actual example, really appreciate it. The CNV:TR notation should be able to take care of complex motifs like this, and its purpose is to not report the sequence, making the file more readable.
Hi @readmanchiu. Thank you for your quick answer and suggestions. Do I understand correctly that in my example the sequence is not displayed correctly, but that it could be? For instance, if RUS:GAA,GAA were RUS:GAA,GAAGGA, with the necessary adaptations in the associated numbers. Could that also account for the GAAGAA sequences at the start and end of the sequence and for sequence interruptions?
Ideally, for the second allele a (GAA)n(GGAGAA)n(GAA)n genotype would be called.
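To make the (GAA)n(GGAGAA)n(GAA)n idea concrete, here is a minimal Python sketch of how a read sequence could be broken into consecutive motif blocks. It is not how Straglr does it; the function names `decompose` and `format_blocks`, the greedy longest-run rule, and the toy sequence are all assumptions made for illustration. Note that GGAGAA and GAAGGA are the same repeat in a different phase, so which one gets reported depends on where the block is considered to start.

```python
from typing import List, Tuple


def decompose(seq: str, motifs: List[str]) -> List[Tuple[str, int]]:
    """Greedy left-to-right split of seq into (unit, count) blocks.

    At each position, the motif giving the longest uninterrupted run
    (in bp) wins. A base that matches no motif is emitted as a
    single-base block, so interruptions stay visible.
    """
    blocks: List[Tuple[str, int]] = []
    i = 0
    while i < len(seq):
        best_motif, best_count, best_len = "", 0, 0
        for motif in motifs:
            count, j = 0, i
            while seq.startswith(motif, j):
                count += 1
                j += len(motif)
            if count * len(motif) > best_len:
                best_motif, best_count = motif, count
                best_len = count * len(motif)
        if best_count == 0:
            # No motif matches here: record the base as an interruption.
            best_motif, best_count, best_len = seq[i], 1, 1
        if blocks and blocks[-1][0] == best_motif:
            # Merge with the previous block when the unit is identical.
            blocks[-1] = (best_motif, blocks[-1][1] + best_count)
        else:
            blocks.append((best_motif, best_count))
        i += best_len
    return blocks


def format_blocks(blocks: List[Tuple[str, int]]) -> str:
    """Render blocks in the (UNIT)count notation used in this thread."""
    return "".join(f"({unit}){count}" for unit, count in blocks)


# Toy allele, not real data: 2 x GAA, 3 x GGAGAA, 2 x GAA.
toy = "GAA" * 2 + "GGAGAA" * 3 + "GAA" * 2
print(format_blocks(decompose(toy, ["GAA", "GGAGAA"])))
```

A greedy split like this is only a starting point; real reads contain sequencing errors, so an actual caller would need some tolerance for mismatches inside a block.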
Hi @readmanchiu, and a different question. The new 1.5.2 format shows CIRUC (confidence intervals), while the 1.5.1 format showed ACR (allelic copy ranges). How do these compare? Is it RUC ± CIRUC = ACR? In addition, is there an option to get AL and ALR in the output? Then I can check whether the output is correct. For instance, in the example where the GAAGGA pattern is called as a GAA pattern and RUC is 300, I guess this means that there are in fact 150 GAAGGA repeat units? If AL were 900 I could confirm this (in contrast to an AL of 1800).
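The arithmetic behind that check can be written out as a short Python sketch. It assumes that AL is the allele length in bp and that it equals repeat unit length times copy number, with no interruptions; the helper names `allele_length` and `copies_for_length` are hypothetical and not part of Straglr.

```python
def allele_length(motif: str, copies: int) -> int:
    """Allele length in bp for an uninterrupted repeat (assumed model)."""
    return len(motif) * copies


def copies_for_length(motif: str, length_bp: int) -> float:
    """Copy number implied by an allele length for a given motif."""
    return length_bp / len(motif)


# RUC = 300 reported against the 3 bp GAA unit.
print(allele_length("GAA", 300))        # 300 * 3 bp
# The same span expressed in 6 bp GAAGGA units.
print(copies_for_length("GAAGGA", 900))
# If RUC = 300 really counted 6 bp units, the allele would be twice as long.
print(allele_length("GAAGGA", 300))
```

Under these assumptions, an AL of 900 is consistent with 300 GAA units or 150 GAAGGA units, while an AL of 1800 would mean that 300 counts the 6 bp unit.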
Thank you so much for your insights/input. I will definitely let you know when the code is ready for your testing.
Thank you for all your efforts and your great tool.
Hi @readmanchiu
For VCF output of STRs I believe symbolic alleles are commonly used, e.g. for a variant of 20 repeat units.
Is there a reason Straglr is outputting the full sequence as the ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead?
That would greatly help us with integrating the tool in our pipeline.