Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter wham-only DELs and scramble-only SVAs in CleanVcf & docs updates #740

Merged
merged 8 commits into from
Oct 31, 2024

Conversation

epiercehoffman
Copy link
Collaborator

@epiercehoffman epiercehoffman commented Oct 29, 2024

Updates

  1. Soft-filter wham-only deletions and scramble-only SVA insertions with HIGH_SR_BACKGROUND in INFO at the end of CleanVcf with the filter status HIGH_ALGORITHM_FP_RATE
  2. Updates to documentation, including: add notebooks to joint calling documentation, delete joint calling Terra dashboard and just link to workspace, remove ApplyManualVariantFilter from canonical pipeline documentation (replaced by changes to CleanVcf in this PR), fix a few links, small tweaks here and there
  3. Add a 156-sample single-batch Terra sample table in addition to the existing 2-batch 312-sample one
  4. Add VisualizeCnvs Terra JSON and add workflow to dockstore.yml

Testing

  • Validated all WDLs and JSONs with womtool and Terra validation script
  • Tested CleanVcf on reference panel and checked for presence of HIGH_ALGORITHM_FP_RATE filter on valid variant sites. No scramble sites are present in the ref panel input VCF to CleanVcf so I also tested this task locally on a VCF with Scramble SVAs to ensure they were flagged as needed
  • Ran the website on my branch and spot-checked formatting and links

Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is great! I have a few suggestions. I don't love the inline python but it's semi-temporary so will let it go.

@@ -0,0 +1,157 @@
entity:sample_id bam_or_cram_file
NA10847 gs://fc-56ac46ea-efc4-4683-b6d5-6d95bed41c5e/CCDG_13607/Project_CCDG_13607_B01_GRM_WGS.cram.2019-02-06/Sample_NA10847/analysis/NA10847.final.cram
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you sort these? So they are the same order as the ref panel jsons

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. This one matches the order of the 312-sample one - I could sort that too, although then it would mix the reference panel samples with the rest of the samples, and wouldn't match the batch assignment order

Comment on lines 23 to 24
as joint calling with larger cohorts. Additionally, SV quality will be best when the case sample closely resembles the samples
in the reference panel.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
as joint calling with larger cohorts. Additionally, SV quality will be best when the case sample closely resembles the samples
in the reference panel.
as joint calling with larger cohorts. Additionally, input sequencing data should of similar quality to that of the reference panel, which was generated in [Byrska-Bishop et al. 2022](https://doi.org/10.1016/j.cell.2022.08.004) and is hosted in [this AnVIL workspace](https://app.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I worded this without referring to our provided reference panel is because we do also give information on how users can build their own reference panel. I recognize most won't, but I felt like it was sufficient to state that samples should resemble the reference panel, and the page describes the provided sample above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if this is redundant don't worry about it, but I think the word "resemble" is a bit vague so just make it clear you're referring to sample quality and sequencing methods.


python <<CODE
import pysam
fin = pysam.VariantFile("~{vcf}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fin = pysam.VariantFile("~{vcf}")
with pysam.VariantFile("~{vcf}") as fin:

fin = pysam.VariantFile("~{vcf}")
header = fin.header
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">")
fo = pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fo = pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header)
with pysam.VariantFile("~{prefix}.vcf.gz", 'w', header=header) as fo:

import pysam
fin = pysam.VariantFile("~{vcf}")
header = fin.header
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FP_RATE,Description=\"Categories of variants with low specificity including Wham-only deletions and certain Scramble SVAs\">")
header.add_line("##FILTER=<ID=HIGH_ALGORITHM_FDR,Description=\"Categories of variants with low precision including Wham-only deletions and certain Scramble SVAs\">")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I probably suggested this name but FDR is more appropriate

for record in fin:
if (record.info['ALGORITHMS'] == ('wham',) and record.info['SVTYPE'] == 'DEL') or \
(record.info['ALGORITHMS'] == ('scramble',) and record.info['HIGH_SR_BACKGROUND'] and record.alts == ('<INS:ME:SVA>',)):
record.filter.add('HIGH_ALGORITHM_FP_RATE')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
record.filter.add('HIGH_ALGORITHM_FP_RATE')
record.filter.add('HIGH_ALGORITHM_FDR')


Float input_size = size(vcf, "GiB")
RuntimeAttr runtime_default = object {
mem_gb: 3.75 + input_size * 1.5,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this copy+paste from somewhere? 3.75 should be plenty I think.

Suggested change
mem_gb: 3.75 + input_size * 1.5,
mem_gb: 3.75,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah most of the things you pointed out were from copy/paste, I'll do a better job of cleanup next time. Thanks for catching

Float input_size = size(vcf, "GiB")
RuntimeAttr runtime_default = object {
mem_gb: 3.75 + input_size * 1.5,
disk_gb: ceil(100.0 + input_size * 3.0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
disk_gb: ceil(100.0 + input_size * 3.0),
disk_gb: ceil(10 + input_size * 2.0),

@epiercehoffman epiercehoffman merged commit aef6ac9 into main Oct 31, 2024
8 checks passed
@epiercehoffman epiercehoffman deleted the eph_high_alg_fp_filt branch October 31, 2024 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants