Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbolic alleles in VCF output #40

Open
bartcharbon opened this issue Jun 5, 2024 · 14 comments
Open

Symbolic alleles in VCF output #40

bartcharbon opened this issue Jun 5, 2024 · 14 comments

Comments

@bartcharbon
Copy link

Hi @readmanchiu

For VCF output of STR's I believe symbolic alleles are commonly used, e.g.: for a variant of 20 repeat units.

Is there a reason Straglr is outputing the full sequence as ALT allele in the VCF? And would it be possible to add an option to get symbolic alleles instead?
That would greatly help us with integrating the tool in our pipeline.

@readmanchiu
Copy link
Collaborator

Hi @bartcharbon

Thanks for checking out the VCF output.
Could you provide an example of how symbolic alleles are specified so that I can implement it correctly?

@bartcharbon
Copy link
Author

Hi @readmanchiu

attached is a zip file with an example vcf with symbolics, I added the straglr tsv for the same data as well.
example.zip

@readmanchiu
Copy link
Collaborator

Thanks @bartcharbon for the example. Is the motivation for outputting symbolics mostly to avoid the messiness of outputting the sequence? 'cuz the information it conveys is already reported in the genotype column.
I guess the number reported in the symbolic allele will just echo the number reported in the GT column. Interruptions will be ignored.
Originally my concern is there will be lots of <ALT>s reported, but I guess it's OK
Anyways, I will include such an options for the next release.

@bartcharbon
Copy link
Author

bartcharbon commented Jul 4, 2024

I think that some of the older tools around (mostly short read like for example expansion hunter) report it like this, as a result some of our tooling used for downstream analysis expects this kind of output.

As well as the analysts in the lab being used to this notation.

But you also make a valid case for the use of the actual sequence, I'll take a new look with that in mind, to see if this might actually fit in our pipeline.

@readmanchiu
Copy link
Collaborator

I'm close to finish implementing this option to output symbolic alleles. My assumption is that this will only be plausible for cases where the alleles detected have the same motif as the reference. But for cases like RFC1, where the expanded alleles may have a different motif from the reference, I will still need to output the actual sequence. Does it sound alright or is there a symbolic-allele way to deal with this?

@bartcharbon
Copy link
Author

Great that you are implementing this feature, thanks!

Looking at the vcf 4.2 spec symbolics are meant for "imprecise structural variants", based on that I think that even with a non reference motif symbolics are allowed to be used, since the ALT repeat motif is also in the output FORMAT fields all the information is still available in the VCF file.

Disclaimer: I'm no expert on STR's in VCF, this issue is based on differences we noticed compared to some other tools we use or used before.
The cases where the motif differs from the reference are the ones we haven't seen much, due to the fact that they will not be present in control samples (either healthy or with a repeat that has the same motif as the reference) we use for testing, therefor I'm not 100% sure how other tools adress these cases.

@readmanchiu
Copy link
Collaborator

ok, I guess then the copy number reported in symbolic alleles will refer to the actual repeat unit reported, whether it's the same as the reference or not. Some loci are more complicated with interruptions and what not, which will be handled in a later release. I will update the documentation accordingly.

@ljohansson
Copy link

ljohansson commented Nov 11, 2024

Is there anywhere that the actual sequence is shown in the output. WIth the new vcf format in Straglr 1.5.2 I now see
CNV:TR,CNV:TR where previously the sequences were shown.
I have a case (FGF14) where one allele has a GAAGAA sequence and the other a GAAGGA sequence (the first is pathogenic if too long, the second not).

The vcf file suggests that both alleles have the GAA pattern, which is not true.

chr13 102161576 . A CNV:TR,CNV:TR . PASS LOCUS:FGF14;RUS_REF:GAA;SVLEN:150,150;RN:1,1;RUS:GAA,GAA;RUC:302.0,351.3;CIRUC:-29.3,12.7,-5.3,12.4 GT:DP:AD 1/2:40:18/14

Allele1 (random read)

ecc85cf8-957c-4a16-b13d-894c41d9f008 > CTTTCTGTGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAAAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA

Allele2 (random read)

62b4b3f2-ec9d-4af8-b7c7-a46a396d700f > CTTTCTGTGAAGAAGAAGAAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGAAGGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAACGAGAAAGGAGAAGGAGAAGGAGAAGGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGGAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAA

Is there a workaround? The vcf files now feel a bit 'broken' to me not knowing the actual variant.

@readmanchiu
Copy link
Collaborator

Thanks @ljohansson for showing me an actual example, really appreciate it. The CNV:TR notation should be able to take care of complex motifs like this, and its purpose is to not report the sequence, making the file more readable.
I understand that right now Straglr does not handle complex motifs well, and probably the only way to let the user know about the existence of a complex locus is by reporting the actual sequence.
I'm currently working on reporting complex motifs in cases exactly like this (a release can be hopefully made before the end of the year), but before this happens maybe I should provide an option to report the actual sequence in the VCF if the user prefers?

@ljohansson
Copy link

ljohansson commented Nov 12, 2024

Hi @readmanchiu. Thank you for your quick answer and suggestions. Do I understand correctly that in my example the sequence is not displayed correctly, but that is could be? For instance if * RUS:GAA,GAA * would be * RUS:GAA,GAAGGA * with the necessary adaptations in the associated numbers. Could that also account for the GAAGAA sequences at that start and end of the sequence and for sequence interruptions?
I am not sure if the old method of showing the sequence in the vcf is not according to standard that it is a good idea to use that method (I'm a big fan of standards). Perhaps it is an idea to (optionally) show the sequences in the tsv file as an extra column. This would also allow to see the separate reads instead of only the sequence in the read with the median repeat length.

@readmanchiu
Copy link
Collaborator

Ideally for the second allele a (GAA)n(GGAGAA)n(GAA)n genotype would be called.
Right now my code is reporting:
(GAA)5(GGAGAA)44(gaa)1(GGAGAA)67(cga)1(gaa)1(GGAGAA)27(GAA)21
it's already able to identify the "major" motifs, but it would be perfect to treat the interruptions (lowercased) as noise and not real interruption within the GGAGAA tract (which I think the second one can be re-classified as an erroneous copy, but not the first one with just a GAA)
The standard does allow sequence to be reported in the VCF record, I just need to conceive some logic to decide when to do it

@ljohansson
Copy link

ljohansson commented Nov 15, 2024

Hi @readmanchiu,
This seems very nice already. Beware of polishing away interruptions. I have no examples, but I heard that for some repeats interruptions may affect the classification of a repeat. E.g. if you have a CGA in the middle it is not pathogenic, but if it is not it is. I am not sure about this, but it might be the case.
Is the sequence shown by your code the sequence of the median read, or is it a consensus betweeen all reads in the cluster? If it would be some kind of consensus read it may be possible to distinguish sequencing errors from real interruptions.

And a different question. In the new 1.5.2 format shows the CIRUC (confidence intervals), while the 1.5.1 format showed the (ACR allelic copy ranges). How do these compare? Is it RUC +- CIRUS = ACR?

In addition, is there an option to get the AL and ALR in the output? Then I can check if the output is correct. For instance in the example when the the GAAGGA pattern is called as a GAA pattern and RUC is 300, I guess this means that there are in fact 150 GAAGGA RU? If AL would be 900 I could confirm this (in contrast to an AL of 1800).

@readmanchiu
Copy link
Collaborator

  • yes, I am aware of interruptions. Right now a sequence would be considered a "sequencing error" of a motif if there is are only indels of the mofit bases, like a "GAAAA" or a "GA" within a (GAA)n tract would be considered a sequencing error and counted towards another copy of a GAA. A "CGA" within base substitutions would not be changed. Hope that is safe for keeping interruptions
  • the motif architecture I gave corresponded to the (GGAGAA)n read sequence you provided. Yes, the next challenge would be to come up with the consensus sequence that most likely represent the locus. The idea is that, for example, if the other reads don't have a CGA then it would be considered a one-off error and not included in the consensus repeat
  • your interpretation of CIRUC is correct
  • the VCF documentation says giving both the RUC and the RB (equivalent to ALR) are redundant, but I appreciate your point and would probably report both in the next release

Thank you so much for your insights/input. I will definitely let you know when the code is ready for your testing.

@ljohansson
Copy link

Thank you for all your efforts and your great tool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants