Commit f6d0c40

Support gnomADe AFs; Updated tests; Abandon Travis
1 parent dd5af77 commit f6d0c40

11 files changed

+123
-114
lines changed

.travis.yml

-21
This file was deleted.

Dockerfile

+14-12
@@ -1,26 +1,28 @@
 FROM clearlinux:latest AS builder
 
 # Install a minimal versioned OS into /install_root, and bundled tools if any
-ENV CLEAR_VERSION=33980
+ENV CLEAR_VERSION=41780
 RUN swupd os-install --no-progress --no-boot-update --no-scripts \
 --version ${CLEAR_VERSION} \
 --path /install_root \
 --statedir /swupd-state \
 --bundles os-core-update,which
 
 # Download and install conda into /usr/bin
-ENV MINICONDA_VERSION=py37_4.9.2
-RUN swupd bundle-add --no-progress curl && \
-curl -sL https://repo.anaconda.com/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -o /tmp/miniconda.sh && \
-sh /tmp/miniconda.sh -bfp /usr
+ENV MINICONDA_VERSION=py312_24.4.0-0
+RUN curl -sL https://repo.anaconda.com/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -o /tmp/miniconda.sh && \
+bash /tmp/miniconda.sh -bup /usr && \
+rm -f /tmp/miniconda.sh && \
+conda config --set solver libmamba
 
-# Use conda to install remaining tools/dependencies into /usr/local
-ENV VEP_VERSION=102.0 \
-HTSLIB_VERSION=1.10.2 \
-BCFTOOLS_VERSION=1.10.2 \
-SAMTOOLS_VERSION=1.10 \
-LIFTOVER_VERSION=377
-RUN conda create -qy -p /usr/local \
+# Use mamba to install remaining tools/dependencies into /usr/local
+ENV VEP_VERSION=112.0 \
+HTSLIB_VERSION=1.20 \
+BCFTOOLS_VERSION=1.20 \
+SAMTOOLS_VERSION=1.20 \
+LIFTOVER_VERSION=447
+RUN conda create -y -p /usr/local && \
+conda install -y -p /usr/local \
 -c conda-forge \
 -c bioconda \
 -c defaults \

LICENSE

+1-1
@@ -1,4 +1,4 @@
-Copyright 2021 Memorial Sloan Kettering Cancer Center
+Copyright 2024 Memorial Sloan Kettering Cancer Center
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

README.md

+35-6
@@ -1,21 +1,19 @@
 vcf<img src="https://i.giphy.com/R6X7GehJWQYms.gif" width="28">maf
 =======
 
-To convert a [VCF](http://samtools.github.io/hts-specs/) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild.
-
-[![Build Status](https://travis-ci.com/mskcc/vcf2maf.svg?branch=master)](https://travis-ci.com/mskcc/vcf2maf)
+To convert a [VCF](https://samtools.github.io/hts-specs//) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild.
 
 Quick start
 -----------
 
-Find the [latest stable release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`:
+Find the [latest release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`:
 
 export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
 curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-*
 perl vcf2maf.pl --man
 perl maf2maf.pl --man
 
-If you don't have [VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html) installed, then [follow this gist](https://gist.github.com/ckandoth/61c65ba96b011f286220fa4832ad2bc0). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this:
+If you don't have VEP installed, then [follow this gist](https://gist.github.com/ckandoth/4bccadcacd58aad055ed369a78bf2e7c). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this:
 
 perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf
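
To see what the `grep -m1 tarball_url | cut -d\" -f4` step of the quick-start pipeline extracts, here is a minimal sketch that runs the same two commands on a made-up excerpt of the releases JSON. The `sample_json` content and the `v0.0.0` tag are placeholders for illustration, not real API output.

```shell
# Hypothetical excerpt of the releases JSON; the tag v0.0.0 is a placeholder.
sample_json='[
  {
    "tag_name": "v0.0.0",
    "tarball_url": "https://api.github.com/repos/mskcc/vcf2maf/tarball/v0.0.0",
    "zipball_url": "https://api.github.com/repos/mskcc/vcf2maf/zipball/v0.0.0"
  }
]'

# grep -m1 keeps only the first line that mentions tarball_url.
# cut then splits that line on double quotes; field 4 is the URL itself.
extracted_url=$(printf '%s\n' "$sample_json" | grep -m1 tarball_url | cut -d\" -f4)
echo "$extracted_url"
```

Because `grep -m1` stops at the first match, the pipeline picks whichever release the API lists first, and it relies on the JSON being pretty-printed with one key per line.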

@@ -49,6 +47,37 @@ After tests on variant lists from many sources, `maf2vcf` and `maf2maf` are quit
 
 See `data/minimalist_test_maf.tsv` for a sampler. Addition of `Tumor_Seq_Allele1` will be used to determine zygosity. Otherwise, it will try to determine zygosity from variant allele fractions, assuming that arguments `--tum-vad-col` and `--tum-depth-col` are set correctly to the names of columns containing those read counts. Specifying the `Matched_Norm_Sample_Barcode` with its respective columns containing read-counts, is also strongly recommended. Columns containing normal allele read counts can be specified using argument `--nrm-vad-col` and `--nrm-depth-col`.
 
+Docker
+------
+
+Assuming you have a recent version of docker, clone the main branch and build an image as follows:
+
+git clone git@github.com:mskcc/vcf2maf.git
+cd vcf2maf
+docker build -t vcf2maf:main .
+docker builder prune -f
+
+Now you run the scripts in docker as follows:
+
+docker run --rm vcf2maf:main perl vcf2maf.pl --help
+docker run --rm vcf2maf:main perl maf2maf.pl --help
+
+Testing
+-------
+
+A small standalone test dataset was created by restricting VEP v112 cache/fasta to chr21 in GRCh38 and hosting that on a private server for download by CI services. We can manually fetch those as follows:
+
+wget -P tests https://data.cyri.ac/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
+gzip -d tests/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
+wget -P tests https://data.cyri.ac/homo_sapiens_vep_112_GRCh38_chr21.tar.gz
+tar -zxf tests/homo_sapiens_vep_112_GRCh38_chr21.tar.gz -C tests
+
+And the following scripts test the docker image on predefined inputs and compare outputs against expected outputs:
+
+perl tests/vcf2maf.t
+perl tests/vcf2vcf.t
+perl tests/maf2vcf.t
+
 License
 -------
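
The README context paragraph in this hunk says that zygosity falls back to variant allele fractions derived from the columns named by `--tum-vad-col` and `--tum-depth-col`. As a rough sketch of that fraction only, assuming the default column names `t_alt_count` and `t_depth` and a made-up three-row table, the arithmetic could look like the following. How the scripts turn the fraction into a zygosity call is not shown in this diff, so no threshold is applied here.

```shell
# Hypothetical tab-separated MAF excerpt; gene names and counts are illustrative only.
maf_rows=$(printf 'Hugo_Symbol\tt_depth\tt_alt_count\nGENE_A\t100\t48\nGENE_B\t80\t80\nGENE_C\t0\t0\n')

# Find the depth and variant-allele-depth columns by header name,
# then print alt/depth for each row. Rows with zero depth get "NA"
# instead of triggering a division by zero.
vafs=$(printf '%s\n' "$maf_rows" | awk -F'\t' -v depth_col=t_depth -v vad_col=t_alt_count '
  NR == 1 {
    for (i = 1; i <= NF; i++) {
      if ($i == depth_col) d = i
      if ($i == vad_col) a = i
    }
    next
  }
  { if ($d > 0) printf "%s\t%.2f\n", $1, $a / $d; else printf "%s\tNA\n", $1 }
')
echo "$vafs"
```

Looking columns up by header name mirrors the way the options take column names rather than positions, so the same sketch works if a MAF uses differently named read-count columns.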

@@ -57,4 +86,4 @@ License
 Citation
 --------
 
-Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6.19. (2020). doi:10.5281/zenodo.593251
+Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6. (2020). doi:10.5281/zenodo.593251

maf2maf.pl

+5-5
@@ -16,7 +16,7 @@
 my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count );
 my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count );
 my ( $vep_path, $vep_data, $vep_forks, $buffer_size, $any_allele ) = ( "$ENV{HOME}/miniconda3/bin", "$ENV{HOME}/.vep", 4, 5000, 0 );
-my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" );
+my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" );
 my ( $species, $ncbi_build, $cache_version, $maf_center, $max_subpop_af ) = ( "homo_sapiens", "GRCh37", "", ".", 0.0004 );
 my $perl_bin = $Config{perlpath};
 
@@ -41,8 +41,9 @@
 MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH
 ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj
 ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE
-ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF
-gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF );
+ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomADe_AF gnomADe_AFR_AF gnomADe_AMR_AF
+gnomADe_ASJ_AF gnomADe_EAS_AF gnomADe_FIN_AF gnomADe_NFE_AF gnomADe_OTH_AF gnomADe_SAS_AF
+);
 
 # Check for missing or crappy arguments
 unless( @ARGV and $ARGV[0]=~m/^-/ ) {
@@ -382,7 +383,7 @@ =head1 OPTIONS
 --species Ensembl-friendly name of species (e.g. mus_musculus for mouse) [homo_sapiens]
 --ncbi-build NCBI reference assembly of variants in MAF (e.g. GRCm38 for mouse) [GRCh37]
 --cache-version Version of offline cache to use with VEP (e.g. 75, 84, 91) [Default: Installed version]
---ref-fasta Reference FASTA file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
+--ref-fasta Reference FASTA file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
 --help Print a brief help message and quit
 --man Print the detailed manual
 
@@ -401,7 +402,6 @@ =head2 Relevant links:
 =head1 AUTHORS
 
 Cyriac Kandoth ([email protected])
-Qingguo Wang ([email protected])
 
 =head1 LICENSE

maf2vcf.pl

+3-4
@@ -9,7 +9,7 @@
 use Pod::Usage qw( pod2usage );
 
 # Set any default paths and constants
-my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz";
+my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz";
 my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count );
 my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count );
 
@@ -357,7 +357,7 @@ =head1 OPTIONS
 --input-maf Path to input file in MAF format
 --output-dir Path to output directory where VCFs will be stored, one per TN-pair
 --output-vcf Path to output multi-sample VCF containing all TN-pairs [<output-dir>/<input-maf-name>.vcf]
---ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
+--ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
 --per-tn-vcfs Specify this to generate VCFs per-TN pair, in addition to the multi-sample VCF
 --tum-depth-col Name of MAF column for read depth in tumor BAM [t_depth]
 --tum-rad-col Name of MAF column for reference allele depth in tumor BAM [t_ref_count]
@@ -376,12 +376,11 @@ =head2 Relevant links:
 
 Homepage: https://github.com/ckandoth/vcf2maf
 VCF format: http://samtools.github.io/hts-specs/
-MAF format: https://wiki.nci.nih.gov/x/eJaPAQ
+MAF format: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format
 
 =head1 AUTHORS
 
 Cyriac Kandoth ([email protected])
-Qingguo Wang ([email protected])
 
 =head1 LICENSE
