Commit

Merge pull request #1 from mapo9/querynator_module
Querynator functionality added
mapo9 authored Jun 26, 2023
2 parents 1277e9b + 3902517 commit 949570f
Showing 48 changed files with 1,229 additions and 539 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@ results/
testing/
testing*
*.pyc
dev_test
workflows/test_module
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,20 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.1.0](https://github.com/qbic-pipelines/variantmtb/releases/tag/0.1.0) - Paris-Roubaix

### `Added`

- [#1](https://github.com/qbic-pipelines/variantmtb/pull/1) - Query to CGI & CIViC. Creation of a comprehensive HTML report.


### `Fixed`

### `Dependencies`

### `Deprecated`


## v1.0dev - [date]

Initial release of nf-core/variantmtb, created with the [nf-core](https://nf-co.re/) template.
13 changes: 9 additions & 4 deletions CITATIONS.md
@@ -10,10 +10,15 @@
## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
- [Tabix](http://www.htslib.org/doc/tabix.html)
- [bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm)

- [CGI](https://www.cancergenomeinterpreter.org/home)
> Tamborero, D., Rubio-Perez, C., Deu-Pons, J., Schroeder, M. P., Vivancos, A., Rovira, A., ... & Lopez-Bigas, N. (2018). Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome medicine, 10, 1-8.
- [CIViC](https://civicdb.org/welcome)
> Griffith, M., Spies, N. C., Krysiak, K., McMichael, J. F., Coffman, A. C., Danos, A. M., ... & Griffith, O. L. (2017). CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nature genetics, 49(2), 170-174.
- [CIViCpy](https://docs.civicpy.org/en/latest/)
> Wagner, A. H., Kiwala, S., Coffman, A. C., McMichael, J. F., Cotto, K. C., Mooney, T. B., ... & Griffith, M. (2020). CIViCpy: a python software development and analysis toolkit for the CIViC knowledgebase. JCO Clinical Cancer Informatics, 4, 245-253.
## Software packaging/containerisation tools

24 changes: 15 additions & 9 deletions README.md
@@ -1,4 +1,5 @@
# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only) ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only)
# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only)
<!-- ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only) -->

[![GitHub Actions CI Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+CI%22)
[![GitHub Actions Linting Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+linting%22)
@@ -19,22 +20,27 @@

<!-- TODO nf-core: Write a 1-2 sentence summary of what data the pipeline is for and what it does -->

**nf-core/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the biological and predictive relevance of tumor variants.
**qbic-pipelines/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the diagnostic, prognostic and predictive relevance of tumor variants.

The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

<!-- TODO nf-core: Add full-sized test dataset and amend the paragraph below if applicable -->

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results).
<!-- On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results). -->

<p align="center">
<img title="variantMTB workflow" src="docs/images/variantMTB_workflow.png" width=70%>
</p>

## Pipeline summary

<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

1. Filter for variants that [PASS](http://samtools.github.io/bcftools/bcftools.html)
2. Query [Clinvar](https://www.ncbi.nlm.nih.gov/clinvar/)
3. Query [Oncokb](https://www.oncokb.org/)
4. Query [Civic](https://civicdb.org/variants/home)
1. Normalize variants [bcftools norm](https://www.htslib.org/doc/1.0/bcftools.html#norm)
2. Index VCF file [tabix](http://www.htslib.org/doc/tabix.html)
3. Query [CGI](https://www.cancergenomeinterpreter.org/home)
4. Query [CIViC](https://civicdb.org/variants/home)
5. Categorize variants and create a comprehensive HTML report

## Quick Start

@@ -60,7 +66,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
<!-- TODO nf-core: Update the example "typical command" below used to run the pipeline -->

```console
nextflow run nf-core/variantmtb --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
nextflow run qbic-pipelines/variantmtb -r dev --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
```

## Documentation
@@ -69,7 +75,7 @@ The nf-core/variantmtb pipeline comes with documentation about the pipeline [usa

## Credits

nf-core/variantmtb was originally written by SusiJo.
nf-core/variantmtb was originally started by SusiJo and mainly developed by mapo9.

We thank the following people for their extensive assistance in the development of this pipeline:

7 changes: 4 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,4 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
sample,filename,genome,filetype
sample_1,path/to/file_1.vcf,GRCh38,mutations
sample_2,path/to/file_2.vcf,GRCh38,mutations
sample_3,path/to/file_3.vcf,GRCh38,mutations
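
The new four-column layout can be read with the standard `csv` module. The snippet below is a minimal sketch, not the pipeline's own code: `read_samplesheet` and `REQUIRED_COLUMNS` are hypothetical names chosen for illustration, and the example rows are copied from the samplesheet above.

```python
import csv
import io

# Column names taken from the updated assets/samplesheet.csv header.
REQUIRED_COLUMNS = {"sample", "filename", "genome", "filetype"}


def read_samplesheet(handle):
    """Return the samplesheet rows as dicts after checking the header.

    Hypothetical helper for illustration; the pipeline's real check lives in
    bin/check_samplesheet.py.
    """
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(
            f"Samplesheet is missing columns: {', '.join(sorted(missing))}"
        )
    return list(reader)


example = io.StringIO(
    "sample,filename,genome,filetype\n"
    "sample_1,path/to/file_1.vcf,GRCh38,mutations\n"
    "sample_2,path/to/file_2.vcf,GRCh38,mutations\n"
)
rows = read_samplesheet(example)
print(rows[0]["filename"])
```

A samplesheet in the old `sample,fastq_1,fastq_2` layout would fail the header check in this sketch, since `filename`, `genome`, and `filetype` are absent.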
77 changes: 61 additions & 16 deletions bin/check_samplesheet.py
@@ -28,12 +28,30 @@ class RowChecker:
VALID_FORMATS = (
".vcf",
".vcf.gz",
".tsv",
".ext"
)

VALID_GENOMES = (
"hg19",
"GRCh37",
"hg38",
"GRCh38"
)

VALID_FILETYPES = (
"mutations",
"cnas",
"translocations"
)

def __init__(
self,
sample_col="sample",
first_col="vcf",
filename_col="filename",
genome_col="genome",
filetype_col="filetype",

**kwargs,
):
"""
@@ -42,13 +60,20 @@ def __init__(
Args:
sample_col (str): The name of the column that contains the sample name
(default "sample").
first_col (str): The name of the column that contains the first (or only)
VCF file path (default "vcf").
filename_col (str): The name of the column that contains the input file path.
genome_col (str): The name of the column that contains the reference genome.
(default "GRCh37")
filetype_col (str): The name of the column that contains the type of the input file
(default "mutations")
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._first_col = first_col
self._filename_col = filename_col
self._genome_col = genome_col
self._filetype_col = filetype_col
self._seen = set()
self.modified = []

@@ -62,8 +87,8 @@ def validate_and_transform(self, row):
"""
self._validate_sample(row)
self._validate_first(row)
self._seen.add((row[self._sample_col], row[self._first_col]))
self._validate_entries(row)
self._seen.add((row[self._sample_col], row[self._filename_col]))
self.modified.append(row)

def _validate_sample(self, row):
@@ -72,18 +97,38 @@ def _validate_sample(self, row):
# Sanitize samples slightly.
row[self._sample_col] = row[self._sample_col].replace(" ", "_")

def _validate_first(self, row):
"""Assert that the first VCF entry is non-empty and has the right format."""
assert len(row[self._first_col]) > 0, "At least the first VCF file is required."
self._validate_vcf_format(row[self._first_col])
def _validate_entries(self, row):
"""
Assert that the first VCF entry is non-empty and has the right format.
Assert that supported reference genome is given
Assert that supported filetype is provided
"""
assert len(row[self._filename_col]) > 0, "At least the first VCF file is required."
self._validate_file_format(row[self._filename_col])
self._validate_genome(row[self._genome_col])
self._validate_filetype(row[self._filetype_col])

def _validate_vcf_format(self, filename):
def _validate_file_format(self, filename):
"""Assert that a given filename has one of the expected VCF extensions."""
assert any(filename.endswith(extension) for extension in self.VALID_FORMATS), (
f"The VCF file has an unrecognized extension: {filename}\n"
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def _validate_genome(self, genome_name):
"""Assert that the given reference genome is compatible with the pipeline."""
assert any(genome_name == genome for genome in self.VALID_GENOMES), (
f"The provided reference genome is not supported: {genome_name}\n"
f"It should be one of: {', '.join(self.VALID_GENOMES)}"
)

def _validate_filetype(self, file_type):
"""Assert that the given filetype is supported by the pipeline."""
assert any(file_type == f_t for f_t in self.VALID_FILETYPES), (
f"The provided filetype is not supported: {file_type}\n"
f"It should be one of: {', '.join(self.VALID_FILETYPES)}"
)

def validate_unique_samples(self):
"""
Assert that the combination of sample name and VCF filename is unique.
@@ -155,16 +200,16 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `viral recon samplesheet`_::
sample,vcf
SAMPLE1,SAMPLE1.vcf.gz
SAMPLE2,SAMPLE2.vcf.gz
SAMPLE3,SAMPLE3.vcf.gz
sample,filename,genome,filetype
SAMPLE1,SAMPLE1.vcf.gz,hg19,mutations
SAMPLE2,SAMPLE2.tsv,GRCh37,translocations
SAMPLE3,SAMPLE3.vcf,hg19,mutations
.. _viral recon samplesheet:
https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
"""
required_columns = {"sample", "vcf"}
required_columns = {"sample", "filename", "genome", "filetype"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
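
The per-row checks introduced in this file (file extension, reference genome, filetype) can be summarized in a standalone form. This is a sketch under assumptions: `validate_row` is a hypothetical function that collects error messages instead of using `assert` as `RowChecker` does, and the three tuples are copied from the diff above, including the `.ext` entry.

```python
# Allowed values copied from the RowChecker class attributes in the diff.
VALID_FORMATS = (".vcf", ".vcf.gz", ".tsv", ".ext")
VALID_GENOMES = ("hg19", "GRCh37", "hg38", "GRCh38")
VALID_FILETYPES = ("mutations", "cnas", "translocations")


def validate_row(row):
    """Return a list of problems found in one samplesheet row (empty if none).

    Hypothetical helper: it mirrors the three checks of
    RowChecker._validate_entries but reports errors as strings.
    """
    errors = []
    filename = row.get("filename", "")
    if not filename:
        errors.append("An input file path is required.")
    elif not filename.endswith(VALID_FORMATS):  # str.endswith accepts a tuple
        errors.append(f"Unrecognized file extension: {filename}")
    if row.get("genome") not in VALID_GENOMES:
        errors.append(f"Unsupported reference genome: {row.get('genome')}")
    if row.get("filetype") not in VALID_FILETYPES:
        errors.append(f"Unsupported filetype: {row.get('filetype')}")
    return errors


good = {"sample": "SAMPLE1", "filename": "SAMPLE1.vcf.gz",
        "genome": "hg19", "filetype": "mutations"}
bad = {"sample": "SAMPLE2", "filename": "SAMPLE2.bam",
       "genome": "mm10", "filetype": "mutations"}
print(validate_row(good))
print(validate_row(bad))
```

Collecting messages rather than asserting lets a caller report every problem in a row at once; the script in the diff instead stops at the first failed assertion.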
2 changes: 0 additions & 2 deletions conf/base.config
@@ -10,7 +10,6 @@

process {

// TODO nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
@@ -24,7 +23,6 @@ process {
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_low {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
59 changes: 40 additions & 19 deletions conf/modules.config
@@ -18,26 +18,11 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]


withName: 'BCFTOOLS_VIEW' {
ext.args = "-f PASS"
ext.prefix = { "${meta.id}.pass" }
publishDir = [
path: { "${params.outdir}/bcftools/pass" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: 'BCFTOOLS_SPLITVEP' {
// [%AF] pastes allele frequencies of all samples contained in a vcf without quotes
// Normal sample AF: 0.01 Tumor sample AF: 0.019 is printed as 0.010.019
ext.args = "-f '%CHROM %POS %ID %REF %ALT [%AF] %IMPACT %Gene %SYMBOL %Consequence %SIFT %PolyPhen %HGVSc %HGVSp %RefSeq %Existing_variation %CLIN_SIG\n' --duplicate"
ext.prefix = { "${meta.id}.split_vep" }
withName: 'BCFTOOLS_NORM' {
ext.args = "--output-type z -a --atom-overlaps ."
ext.prefix = { "${meta.id}.normalized" }
publishDir = [
path: { "${params.outdir}/bcftools/split_vep" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
enabled: false
]
}

@@ -56,5 +41,41 @@ process {
pattern: '*_versions.yml'
]
}

withName: QUERYNATOR_CGIAPI {
publishDir = [
path: { "${params.outdir}/${meta.id}" },
mode: params.publish_dir_mode,
pattern: '*'
]
}

withName: QUERYNATOR_CIVICAPI {
publishDir = [
path: { "${params.outdir}/${meta.id}" },
mode: params.publish_dir_mode,
pattern: '*'
]
}

withName: QUERYNATOR_CREATEREPORT {
publishDir = [
path: { "${params.outdir}/${meta.id}" },
mode: params.publish_dir_mode,
pattern: '*'
]
}

withName: TABIX_TABIX {
publishDir = [
enabled: false
]
}

withName: TABIX_BGZIPTABIX {
publishDir = [
enabled: false
]
}

}
6 changes: 5 additions & 1 deletion conf/test.config
@@ -23,5 +23,9 @@ params {
input = "${projectDir}/tests/csv/input.csv"

// Genome references
genome = 'hg38'
genome = 'GRCh37'
fasta = "s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"

// mandatory flags
databases = 'civic'
}
24 changes: 0 additions & 24 deletions conf/test_full.config

This file was deleted.

Binary file removed docs/images/mqc_fastqc_adapter.png
Binary file not shown.
Binary file removed docs/images/mqc_fastqc_counts.png
Binary file not shown.
Binary file removed docs/images/mqc_fastqc_quality.png
Binary file not shown.
Binary file added docs/images/variantMTB_workflow.png