Update CombineBatches workflow #732
Conversation
Commits in this push:

- Add reformatting to GenotypeBatch
- Expose reformat_script
- Start ripping stuff out
- Finish rewriting wdl and template
- Add TODO and delete unused task
- Don't assign genotypes to CNVs in add_genotypes.py
- Set mixed_breakend_window to 1mb
- Fixes
- Start implementing ContextAwareClustering
- Fix wdl
- Update wdl and gatk docker
- Comment
- Set context overlap parameters
- Update docker
- Fix size in ReformatGenotypedVcf
- Update to reformat_genotyped_vcf.py
- Update legacy reformatter, fix GenotypeBatchMetrics wdl
- Remove reformat step from GenotypeBatch
- Use RD_CN if CN is unavailable for mCNVs in svtk vcf2bed
- Add additional_args to svcluster and groupedsvcluster
- Add join step
- Update runtime attr
- Filter legacy records with invalid coords (needs testing)
- Fix record dropping; add --fix-end to wdl call
- Representative breakpoint summary strategy
- Update gatk docker
- Integerate SR flags into VCF
- Update dockers
- Parse last column in SR flags lists
- Gatk to svtk formatting
- Fix CNV strands and overlap breakpoint filter bothside pass file parsing
- Breakpoint overlap filter now sorts by BOTHSIDES_SUPPORT status rather than fraction
- Set empty FILTER statuses to PASS
- Use safer get() methods instead of brackets for accessing FORMAT fields
- Delete unused VcfClusterSingleChromsome.wdl
- Remove other unused wdls
- Do not require CN or RD_CN to be defined for all samples for CNVs in get_called_samples()
- Fix multi-allelic ALTs and genotype parsing
- Fix multi-allelic formatting in cleanvcf5
- Clean vcf 5 script override
- Add SR1POS and SR2POS to gatk format to recover INS END coordinate
- Reset dockers to main
- Fix mCNV alts again
- Update gatk docker context to track
- Integrate into top-level WDLs and update json templates
- Update terra config
- Fix MakeCohortVcf inputs
- Remove MakeCohortVcf json templates
- Remove duplicate runtime_override_clean_background_fail input in GATKSVPipelineSingleSample
- Update yaml for testing
- Remove duplicate runtime_override_breakpoint_overlap_filter input
Looks very good! Just some very minor comments and a few questions for my own understanding
```diff
@@ -30,6 +30,7 @@
     "GATKSVPipelineSingleSample.linux_docker" : "${workspace.linux_docker}",
     "GATKSVPipelineSingleSample.manta_docker": "${workspace.manta_docker}",
+    "GATKSVPipelineSingleSample.scramble_docker": "${workspace.scramble_docker}",
```
I noticed the cromshell test JSON has both scramble and melt dockers - may want to drop melt
Yeah, that's true. I've deleted it, although it's overridden by the `use_melt` parameter anyway.
```python
                continue
            else:
                new_header_lines.append(line)
    # Incorporate SR evidence flags
```
Is this comment outdated? I don't see how it relates to the code
Looks like a remnant from copy+paste
```python
    new_record.id = record.id
    # copy INFO fields
    for key in record.info:
        if key not in remove_infos:
            new_record.info[key] = record.info[key]
    if svtype == 'CNV':
        # Prior to CleanVcf, all mCNVs are DUP type
        new_record.info['SVTYPE'] = 'DUP'
```
Do you envision this script ever getting used after CleanVcf? If no, might be worth documenting that it shouldn't be since the name is pretty generic. I see the help text says it's for GenerateBatchMetrics but it appears that's out of date.
Good catch. I've updated the program description to note that it shouldn't be used after CleanVcf.
```diff
@@ -174,6 +212,10 @@ def __parse_arguments(argv: List[Text]) -> argparse.Namespace:
                         help="Comma-delimited list of FORMAT fields to remove")
     parser.add_argument("--remove-infos", type=str,
                         help="Comma-delimited list of INFO fields to remove")
+    parser.add_argument("--intervaled-ins", default=False, action='store_true',
```
Unused argument? Not referred to in script or in WDLs
Deleted
```diff
@@ -148,6 +172,11 @@ def is_null(val):
             val = record.info[k]
             if is_null(val) or (isinstance(val, tuple) and len(val) == 1 and is_null(val[0])):
                 del record.info[k]
+        # Add SR flag to INFO field
+        if record.id in bothside_pass_vid_set:
```
This functionality does seem a little bit out of place in this script, but I suppose it's more efficient to add it to an existing task than create a new one
Technically we do now expect these INFO fields in GATK per broadinstitute/gatk#8990
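The approach under discussion can be sketched in plain Python (dicts stand in for pysam `VariantRecord` objects, and `annotate_sr_flags` is an illustrative name, not the actual script's function):

```python
# Sketch: annotate SR evidence as Flag-type INFO fields instead of passing
# separate variant-ID list files between tasks. Plain dicts stand in for
# pysam VariantRecord objects; all names here are illustrative only.

def annotate_sr_flags(records, bothside_pass_vids, high_sr_background_vids):
    for record in records:
        # Flag-type INFO fields: presence alone conveys the annotation
        if record["id"] in bothside_pass_vids:
            record["info"]["BOTHSIDES_SUPPORT"] = True
        if record["id"] in high_sr_background_vids:
            record["info"]["HIGH_SR_BACKGROUND"] = True
    return records

records = [{"id": "sv1", "info": {}}, {"id": "sv2", "info": {}}]
annotate_sr_flags(records, bothside_pass_vids={"sv1"}, high_sr_background_vids=set())
```

Carrying the annotation inside the VCF keeps it attached to each record through sorting, merging, and subsetting, which is exactly where separate side-car lists tend to drift out of sync.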
```wdl
    }

    call HarmonizeHeaders.HarmonizeHeaders {
    # Third round of clustering
```
Is clustering in several stages needed or could it be accomplished in one step?
It's required. The first round permits clustering across track boundaries - e.g. clustering two very similar variants where one might fall just outside a track interval and the other just inside. The second and third steps cluster groups that aren't mutually exclusive - e.g. INS in simple repeats in the second and INS in any context in the third will contain some of the same variants but have different criteria applied.
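The staged design can be illustrated with a toy single-linkage pass (a schematic only; the real rounds are GATK `SVCluster`/`GroupedSVCluster` runs driven by the clustering config files, and `can_link` stands in for each round's membership criteria):

```python
# Toy single-linkage clustering over an arbitrary pairwise predicate.
# Running it repeatedly with different predicates mimics the staged
# rounds: each round can group variants a previous round could not,
# because the membership criteria differ between rounds.

def cluster(items, can_link):
    clusters = []
    for item in items:
        # Merge every existing cluster that links to this item
        linked = [c for c in clusters if any(can_link(item, other) for other in c)]
        rest = [c for c in clusters if c not in linked]
        clusters = rest + [[x for c in linked for x in c] + [item]]
    return clusters

# Example with breakend positions: a tight-window round and a
# loose-window round produce different groupings, so the rounds
# are not interchangeable and cannot be collapsed into one pass.
positions = [100, 150, 5000]
assert cluster(positions, lambda a, b: abs(a - b) <= 200) == [[100, 150], [5000]]
assert cluster(positions, lambda a, b: abs(a - b) <= 10000) == [[100, 150, 5000]]
```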
```diff
@@ -105,7 +105,7 @@ workflow ClusterDepth {
             algorithm=clustering_algorithm,
             depth_sample_overlap=0,
             depth_interval_overlap=depth_interval_overlap,
-            depth_breakend_window=0,
+            depth_breakend_window=10000000,
```
This looks like a big change...?
Just reflects an inversion of some logic in GATK. We now require both breakend window and reciprocal overlap instead of either/or. broadinstitute/gatk#8962
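The inversion can be sketched as follows (simplified `(start, end)` intervals on one contig; not GATK's actual implementation). With `depth_breakend_window=10000000` the window test is almost always satisfied, so requiring both conditions behaves much like the old reciprocal-overlap-only check:

```python
# Sketch: old behavior accepted EITHER the breakend-window test OR the
# reciprocal-overlap test; new behavior requires BOTH.

def within_window(a, b, window):
    # Both breakends must fall within the window
    return abs(a[0] - b[0]) <= window and abs(a[1] - b[1]) <= window

def reciprocal_overlap(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def cluster_together_old(a, b, window, min_ro):
    return within_window(a, b, window) or reciprocal_overlap(a, b) >= min_ro

def cluster_together_new(a, b, window, min_ro):
    return within_window(a, b, window) and reciprocal_overlap(a, b) >= min_ro

# With a huge window, the new AND reduces to the reciprocal-overlap test:
a, b = (1000, 101000), (2000, 102000)
assert cluster_together_new(a, b, window=10_000_000, min_ro=0.8) == (reciprocal_overlap(a, b) >= 0.8)
```

Under the old either/or semantics, `depth_breakend_window=0` effectively deferred to the overlap test; under the new both-required semantics, a near-infinite window is needed to get the same effect, which is why the large constant is not a behavioral change.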
wdl/GenotypeDepthPart2.wdl (Outdated)
```diff
@@ -20,6 +20,8 @@ workflow GenotypeDepthPart2 {
         File coveragefile
         File? coveragefile_index

+        File? reformat_script
```
Looks like this input was added but is not passed to any task - here and in GenotypePESRPart2 and GenotypeBatch
Oops, I had those for debugging but forgot to delete them.
```python
        else:
            alleles = ('N', alleles[1])
        contig = record.contig
        new_record = vcf_out.new_record(contig=contig, start=record.start, stop=record.stop, alleles=alleles)
        # Change filter to PASS if requested
        if set_pass and len(record.filter) == 0:
```
This is done in FilterGenotypes as of #736 - fine if we want to do it earlier but not necessary to do twice (though obviously lightweight enough that it doesn't really matter if we do)
I can't recall if having `.` instead of `PASS` breaks any specific task between CombineBatches and filtering, but this matches the current formatting. It's probably not good practice to be doing this, but I wouldn't be surprised if leaving it as `.` would break something.
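For reference, the normalization being discussed amounts to the following (a list stands in for pysam's `record.filter` collection, and `LOW_QUAL` is just a placeholder filter name):

```python
# Sketch: an empty FILTER column (rendered as "." in VCF) is rewritten
# to an explicit PASS; records that already carry filters are untouched.

def normalize_filter(filters, set_pass=True):
    if set_pass and len(filters) == 0:
        return ["PASS"]
    return filters

assert normalize_filter([]) == ["PASS"]
assert normalize_filter(["LOW_QUAL"]) == ["LOW_QUAL"]
```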
```diff
@@ -7,6 +7,14 @@
     "autosome_file" : "gs://gcp-public-data--broad-references/hg38/v0/sv-resources/resources/v1/autosome.fai",
     "bin_exclude" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/bin_exclude.hg38.gatkcov.bed.gz",
     "bin_exclude_index" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/bin_exclude.hg38.gatkcov.bed.gz.tbi",
+    "clustering_config_part1" : "gs://gatk-sv-resources-public/hg38/v0/sv-resources/resources/v1/clustering_config.part_one.tsv",
```
To what extent are these parameters optimized?
They are based on the first round of re-clustering from AoU. The only new parameter is size similarity which is set to 0.5.
Replaces most methods in the CombineBatches workflow with a greatly simplified set of tasks that utilize GATK `SVCluster` and the new `GroupedSVCluster` tool (see PR). `SVCluster` replaces most of the current functionality including VCF joining and clustering, while `GroupedSVCluster` introduces refined clustering (a.k.a. "reclustering") that has become a best practice for larger call sets. Clustering refinement is critical for consolidating redundant variants in repetitive sequence contexts such as simple repeats and segmental duplications. This also addresses an issue with duplicate insertions that share coordinates but have slightly different split read signatures (i.e. different END positions).

Genotype merging was improved slightly in this PR as well.

In addition, this PR makes some minor improvements to VCF formatting and parsing:

- In `GenotypeBatch`, CNVs are now formatted the same way as in `CleanVcf`, i.e. no genotypes, a `<CNV>` ALT allele, and `SVTYPE=CNV`, rather than using alt alleles `<CN0>,<CN1>,…` and having `SVTYPE=DUP`. In case a user needs to run on a VCF with the old format, there is a `legacy_vcfs` flag in `CombineBatches` that will update to the new format prior to processing.
- In `CombineBatches`, records are annotated with `HIGH_SR_BACKGROUND` and `BOTHSIDE_PASS` INFO field flags rather than passing around separate lists, which is cumbersome.
- Use `.get()` for accessing FORMAT fields rather than brackets. This was required in some cases because GATK omits a FORMAT field if it is null for all samples in a given record. Pysam then throws an error since the requested key does not exist, whereas `.get()` returns `None`.
- `CombineBatches` to minimize risk of bugs in downstream workflows.
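The `.get()` point above can be demonstrated directly (a plain dict stands in for a pysam per-sample FORMAT mapping, and `RD_CN` here is just an example of a field GATK may drop):

```python
# GATK omits a FORMAT field entirely when it is null for every sample in
# a record. Bracket access then raises KeyError, while .get() returns None.

sample_format = {"GT": (0, 1)}  # e.g. RD_CN absent because null for all samples

# Safe access: missing field yields None instead of an exception
assert sample_format.get("RD_CN") is None

# Bracket access on the same missing field raises
try:
    _ = sample_format["RD_CN"]
    raised = False
except KeyError:
    raised = True
assert raised
```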