Commit

Support gnomADe AFs; Updated tests; Abandon Travis
ckandoth committed May 31, 2024
1 parent dd5af77 commit f6d0c40
Showing 11 changed files with 123 additions and 114 deletions.
21 changes: 0 additions & 21 deletions .travis.yml

This file was deleted.

26 changes: 14 additions & 12 deletions Dockerfile
@@ -1,26 +1,28 @@
FROM clearlinux:latest AS builder

# Install a minimal versioned OS into /install_root, and bundled tools if any
ENV CLEAR_VERSION=33980
ENV CLEAR_VERSION=41780
RUN swupd os-install --no-progress --no-boot-update --no-scripts \
--version ${CLEAR_VERSION} \
--path /install_root \
--statedir /swupd-state \
--bundles os-core-update,which

# Download and install conda into /usr/bin
ENV MINICONDA_VERSION=py37_4.9.2
RUN swupd bundle-add --no-progress curl && \
curl -sL https://repo.anaconda.com/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -o /tmp/miniconda.sh && \
sh /tmp/miniconda.sh -bfp /usr
ENV MINICONDA_VERSION=py312_24.4.0-0
RUN curl -sL https://repo.anaconda.com/miniconda/Miniconda3-${MINICONDA_VERSION}-Linux-x86_64.sh -o /tmp/miniconda.sh && \
bash /tmp/miniconda.sh -bup /usr && \
rm -f /tmp/miniconda.sh && \
conda config --set solver libmamba

# Use conda to install remaining tools/dependencies into /usr/local
ENV VEP_VERSION=102.0 \
HTSLIB_VERSION=1.10.2 \
BCFTOOLS_VERSION=1.10.2 \
SAMTOOLS_VERSION=1.10 \
LIFTOVER_VERSION=377
RUN conda create -qy -p /usr/local \
# Use mamba to install remaining tools/dependencies into /usr/local
ENV VEP_VERSION=112.0 \
HTSLIB_VERSION=1.20 \
BCFTOOLS_VERSION=1.20 \
SAMTOOLS_VERSION=1.20 \
LIFTOVER_VERSION=447
RUN conda create -y -p /usr/local && \
conda install -y -p /usr/local \
-c conda-forge \
-c bioconda \
-c defaults \
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright 2021 Memorial Sloan Kettering Cancer Center
Copyright 2024 Memorial Sloan Kettering Cancer Center

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
41 changes: 35 additions & 6 deletions README.md
@@ -1,21 +1,19 @@
vcf<img src="https://i.giphy.com/R6X7GehJWQYms.gif" width="28">maf
=======

To convert a [VCF](http://samtools.github.io/hts-specs/) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild.

[![Build Status](https://travis-ci.com/mskcc/vcf2maf.svg?branch=master)](https://travis-ci.com/mskcc/vcf2maf)
To convert a [VCF](https://samtools.github.io/hts-specs//) into a [MAF](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format), each variant must be mapped to only one of all possible gene transcripts/isoforms that it might affect. But even within a single isoform, a `Missense_Mutation` close enough to a `Splice_Site`, can be labeled as either in MAF format, but not as both. **This selection of a single effect per variant, is often subjective. And that's what this project attempts to standardize.** The `vcf2maf` and `maf2maf` scripts leave most of that responsibility to [Ensembl's VEP](http://ensembl.org/info/docs/tools/vep/index.html), but allows you to override their "canonical" isoforms, or use a custom ExAC VCF for annotation. Though the most useful feature is the **extensive support in parsing a wide range of crappy MAF-like or VCF-like formats** we've seen out in the wild.

Quick start
-----------

Find the [latest stable release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`:
Find the [latest release](https://github.com/mskcc/vcf2maf/releases), download it, and view the detailed usage manuals for `vcf2maf` and `maf2maf`:

export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4`
curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-*
perl vcf2maf.pl --man
perl maf2maf.pl --man

If you don't have [VEP](http://useast.ensembl.org/info/docs/tools/vep/index.html) installed, then [follow this gist](https://gist.github.com/ckandoth/61c65ba96b011f286220fa4832ad2bc0). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this:
If you don't have VEP installed, then [follow this gist](https://gist.github.com/ckandoth/4bccadcacd58aad055ed369a78bf2e7c). Of the many annotators out there, VEP is preferred for its large team of active coders, and its CLIA-compliant [HGVS formats](http://www.hgvs.org/mutnomen/recs.html). After installing VEP, test out `vcf2maf` like this:

perl vcf2maf.pl --input-vcf tests/test.vcf --output-maf tests/test.vep.maf

@@ -49,6 +47,37 @@ After tests on variant lists from many sources, `maf2vcf` and `maf2maf` are quit

See `data/minimalist_test_maf.tsv` for a sampler. Addition of `Tumor_Seq_Allele1` will be used to determine zygosity. Otherwise, it will try to determine zygosity from variant allele fractions, assuming that arguments `--tum-vad-col` and `--tum-depth-col` are set correctly to the names of columns containing those read counts. Specifying the `Matched_Norm_Sample_Barcode` with its respective columns containing read-counts, is also strongly recommended. Columns containing normal allele read counts can be specified using argument `--nrm-vad-col` and `--nrm-depth-col`.

Docker
------

Assuming you have a recent version of docker, clone the main branch and build an image as follows:

git clone git@github.com:mskcc/vcf2maf.git
cd vcf2maf
docker build -t vcf2maf:main .
docker builder prune -f

Now you run the scripts in docker as follows:

docker run --rm vcf2maf:main perl vcf2maf.pl --help
docker run --rm vcf2maf:main perl maf2maf.pl --help

Testing
-------

A small standalone test dataset was created by restricting VEP v112 cache/fasta to chr21 in GRCh38 and hosting that on a private server for download by CI services. We can manually fetch those as follows:

wget -P tests https://data.cyri.ac/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
gzip -d tests/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
wget -P tests https://data.cyri.ac/homo_sapiens_vep_112_GRCh38_chr21.tar.gz
tar -zxf tests/homo_sapiens_vep_112_GRCh38_chr21.tar.gz -C tests

And the following scripts test the docker image on predefined inputs and compare outputs against expected outputs:

perl tests/vcf2maf.t
perl tests/vcf2vcf.t
perl tests/maf2vcf.t

License
-------

@@ -57,4 +86,4 @@ License
Citation
--------

Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6.19. (2020). doi:10.5281/zenodo.593251
Cyriac Kandoth. mskcc/vcf2maf: vcf2maf v1.6. (2020). doi:10.5281/zenodo.593251
10 changes: 5 additions & 5 deletions maf2maf.pl
@@ -16,7 +16,7 @@
my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count );
my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count );
my ( $vep_path, $vep_data, $vep_forks, $buffer_size, $any_allele ) = ( "$ENV{HOME}/miniconda3/bin", "$ENV{HOME}/.vep", 4, 5000, 0 );
my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" );
my ( $ref_fasta, $filter_vcf ) = ( "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz", "" );
my ( $species, $ncbi_build, $cache_version, $maf_center, $max_subpop_af ) = ( "homo_sapiens", "GRCh37", "", ".", 0.0004 );
my $perl_bin = $Config{perlpath};

@@ -41,8 +41,9 @@
MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH
ExAC_AF_SAS GENE_PHENO FILTER flanking_bps variant_id variant_qual ExAC_AF_Adj ExAC_AC_AN_Adj
ExAC_AC_AN ExAC_AC_AN_AFR ExAC_AC_AN_AMR ExAC_AC_AN_EAS ExAC_AC_AN_FIN ExAC_AC_AN_NFE
ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF
gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF );
ExAC_AC_AN_OTH ExAC_AC_AN_SAS ExAC_FILTER gnomADe_AF gnomADe_AFR_AF gnomADe_AMR_AF
gnomADe_ASJ_AF gnomADe_EAS_AF gnomADe_FIN_AF gnomADe_NFE_AF gnomADe_OTH_AF gnomADe_SAS_AF
);

# Check for missing or crappy arguments
unless( @ARGV and $ARGV[0]=~m/^-/ ) {
@@ -382,7 +383,7 @@ =head1 OPTIONS
--species Ensembl-friendly name of species (e.g. mus_musculus for mouse) [homo_sapiens]
--ncbi-build NCBI reference assembly of variants in MAF (e.g. GRCm38 for mouse) [GRCh37]
--cache-version Version of offline cache to use with VEP (e.g. 75, 84, 91) [Default: Installed version]
--ref-fasta Reference FASTA file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
--ref-fasta Reference FASTA file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
--help Print a brief help message and quit
--man Print the detailed manual
@@ -401,7 +402,6 @@ =head2 Relevant links:
=head1 AUTHORS
Cyriac Kandoth (ckandoth@gmail.com)
Qingguo Wang (josephw10000@gmail.com)
=head1 LICENSE
7 changes: 3 additions & 4 deletions maf2vcf.pl
@@ -9,7 +9,7 @@
use Pod::Usage qw( pod2usage );

# Set any default paths and constants
my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz";
my $ref_fasta = "$ENV{HOME}/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz";
my ( $tum_depth_col, $tum_rad_col, $tum_vad_col ) = qw( t_depth t_ref_count t_alt_count );
my ( $nrm_depth_col, $nrm_rad_col, $nrm_vad_col ) = qw( n_depth n_ref_count n_alt_count );

@@ -357,7 +357,7 @@ =head1 OPTIONS
--input-maf Path to input file in MAF format
--output-dir Path to output directory where VCFs will be stored, one per TN-pair
--output-vcf Path to output multi-sample VCF containing all TN-pairs [<output-dir>/<input-maf-name>.vcf]
--ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/102_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
--ref-fasta Path to reference Fasta file [~/.vep/homo_sapiens/112_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz]
--per-tn-vcfs Specify this to generate VCFs per-TN pair, in addition to the multi-sample VCF
--tum-depth-col Name of MAF column for read depth in tumor BAM [t_depth]
--tum-rad-col Name of MAF column for reference allele depth in tumor BAM [t_ref_count]
@@ -376,12 +376,11 @@ =head2 Relevant links:
Homepage: https://github.com/ckandoth/vcf2maf
VCF format: http://samtools.github.io/hts-specs/
MAF format: https://wiki.nci.nih.gov/x/eJaPAQ
MAF format: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format
=head1 AUTHORS
Cyriac Kandoth (ckandoth@gmail.com)
Qingguo Wang (josephw10000@gmail.com)
=head1 LICENSE

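Note on the gnomADe rename: the column list in `maf2maf.pl` in this commit now uses `gnomADe_*` names where it previously used `gnomAD_*`. For a MAF written before this change, relabeling the header may be enough to keep column-name-based downstream scripts working. The following is a minimal sketch and not part of the commit. The function name `rename_gnomad_cols` and the file names in the usage comment are hypothetical, and it assumes a tab-delimited MAF whose header is the first line that does not start with `#`.

```shell
# Hypothetical helper (not in the vcf2maf repo): relabel legacy gnomAD_* header
# columns as gnomADe_* and print the result to stdout.
# Assumptions: tab-delimited input; header is the first non-'#' line.
rename_gnomad_cols() {
  awk 'BEGIN { FS = OFS = "\t"; done = 0 }
       /^#/  { print; next }            # pass comment lines through untouched
       !done {                          # first non-comment line is the header
         for (i = 1; i <= NF; i++) sub(/^gnomAD_/, "gnomADe_", $i)
         done = 1
       }
       { print }' "$1"
}

# Usage (hypothetical file names):
#   rename_gnomad_cols old.maf > renamed.maf
```

The pattern is anchored to `gnomAD_`, so columns already named `gnomADe_*` are left alone and the function can be re-run safely. It only relabels columns; the allele frequencies themselves are not re-annotated, so re-running `maf2maf` remains the more reliable route when the values need to reflect the newer VEP cache.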