-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter wham-only DELs and scramble-only SVAs in CleanVcf & docs updates #740
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is great! I have a few suggestions. I don't love the inline python but it's semi-temporary so will let it go.
@@ -0,0 +1,157 @@ | |||
entity:sample_id bam_or_cram_file | |||
NA10847 gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/CCDG_13607/Project_CCDG_13607_B01_GRM_WGS.cram.2019-02-06/Sample_NA10847/analysis/NA10847.final.cram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you sort these? So they are the same order as the ref panel jsons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. This one matches the order of the 312-sample one - I could sort that too, although then it would mix the reference panel samples with the rest of the samples, and wouldn't match the batch assignment order
website/docs/gs/calling_modes.md
Outdated
as joint calling with larger cohorts. Additionally, SV quality will be best when the case sample closely resembles the samples | ||
in the reference panel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as joint calling with larger cohorts. Additionally, SV quality will be best when the case sample closely resembles the samples | |
in the reference panel. | |
as joint calling with larger cohorts. Additionally, input sequencing data should of similar quality to that of the reference panel, which was generated in [Byrska-Bishop et al. 2022](https://doi.org/10.1016/j.cell.2022.08.004) and is hosted in [this AnVIL workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I worded this without referring to our provided reference panel is because we do also give information on how users can build their own reference panel. I recognize most won't, but I felt like it was sufficient to state that samples should resemble the reference panel, and the page describes the provided sample above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok if this is redundant don't worry about it, but I think the word "resemble" is a bit vague so just make it clear you're referring to sample quality and sequencing methods.
wdl/CleanVcfChromosome.wdl
Outdated
|
||
python <<CODE | ||
import pysam | ||
fin = pysam.VariantFile("~{vcf}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fin = pysam.VariantFile("~{vcf}") | |
with pysam.VariantFile("~{vcf}") as fin: |
wdl/CleanVcfChromosome.wdl
Outdated
fin = pysam.VariantFile("~{vcf}") | ||
header = fin.header | ||
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">") | ||
fo = pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fo = pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header) | |
with pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header) as fo: |
wdl/CleanVcfChromosome.wdl
Outdated
import pysam | ||
fin = pysam.VariantFile("~{vcf}") | ||
header = fin.header | ||
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">") | |
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FDR,Description=\"Categories of variants with low precision including Wham-only deletions and certain Scramble SVAs\">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize I probably suggested this name but FDR is more appropriate
wdl/CleanVcfChromosome.wdl
Outdated
for record in fin: | ||
if (record.info['ALGORITHMS'] == ('wham',) and record.info['SVTYPE'] == 'DEL') or \ | ||
(record.info['ALGORITHMS'] == ('scramble',) and record.info['HIGH_SR_BACKGROUND'] and record.alts == ('<INS:ME:SVA>',)): | ||
record.filter.add('HIGH_ALGORITHM_FP_RATE') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record.filter.add('HIGH_ALGORITHM_FP_RATE') | |
record.filter.add('HIGH_ALGORITHM_FDR') |
wdl/CleanVcfChromosome.wdl
Outdated
|
||
Float input_size = size(vcf, "GiB") | ||
RuntimeAttr runtime_default = object { | ||
mem_gb: 3.75 + input_size * 1.5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this copy+paste from somewhere? 3.75 should be plenty I think.
mem_gb: 3.75 + input_size * 1.5, | |
mem_gb: 3.75, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah most of the things you pointed out were from copy/paste, I'll do a better job of cleanup next time. Thanks for catching
wdl/CleanVcfChromosome.wdl
Outdated
Float input_size = size(vcf, "GiB") | ||
RuntimeAttr runtime_default = object { | ||
mem_gb: 3.75 + input_size * 1.5, | ||
disk_gb: ceil(100.0 + input_size * 3.0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disk_gb: ceil(100.0 + input_size * 3.0), | |
disk_gb: ceil(10 + input_size * 2.0), |
…l similarity text
Updates
HIGH_ALGORITHM_FP_RATE
Testing
HIGH_ALGORITHM_FP_RATE
filter on valid variant sites. No scramble sites are present in the ref panel input VCF to CleanVcf so I also tested this task locally on a VCF with Scramble SVAs to ensure they were flagged as needed