Merge pull request #1 from mapo9/querynator_module

Querynator functionality added
qbic-pipelines · Jun 26, 2023 · 949570f
2 parents 1277e9b + 3902517
Showing 48 changed files with 1,229 additions and 539 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,3 +6,5 @@ results/
 testing/
 testing*
 *.pyc
+dev_test
+workflows/test_module
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,20 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.1.0](https://github.com/qbic-pipelines/variantmtb/releases/tag/0.1.0) - Paris-Roubaix
+
+### `Added`
+
+- [#1](https://github.com/qbic-pipelines/variantmtb/pull/1) - Query to CGI & CIViC. Creation of a comprehensive HTML report. 
+
+
+### `Fixed`
+
+### `Dependencies`
+
+### `Deprecated`
+
+
 ## v1.0dev - [date]
 
 Initial release of nf-core/variantmtb, created with the [nf-core](https://nf-co.re/) template.

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,10 +10,15 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
-
-- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
-  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
+- [Tabix](http://www.htslib.org/doc/tabix.html)
+- [bcftools norm](https://samtools.github.io/bcftools/bcftools.html#norm)
+
+- [CGI](https://www.cancergenomeinterpreter.org/home)
+  > Tamborero, D., Rubio-Perez, C., Deu-Pons, J., Schroeder, M. P., Vivancos, A., Rovira, A., ... & Lopez-Bigas, N. (2018). Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations. Genome medicine, 10, 1-8.
+- [CIViC](https://civicdb.org/welcome)
+  > Griffith, M., Spies, N. C., Krysiak, K., McMichael, J. F., Coffman, A. C., Danos, A. M., ... & Griffith, O. L. (2017). CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Nature genetics, 49(2), 170-174.
+- [CIViCpy](https://docs.civicpy.org/en/latest/)
+  > Wagner, A. H., Kiwala, S., Coffman, A. C., McMichael, J. F., Cotto, K. C., Mooney, T. B., ... & Griffith, M. (2020). CIViCpy: a python software development and analysis toolkit for the CIViC knowledgebase. JCO Clinical Cancer Informatics, 4, 245-253.
 
 ## Software packaging/containerisation tools
 

diff --git a/README.md b/README.md
@@ -1,4 +1,5 @@
-# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only) ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only)
+# ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_light.png#gh-light-mode-only) 
+<!-- ![nf-core/variantmtb](docs/images/nf-core-variantmtb_logo_dark.png#gh-dark-mode-only) -->
 
 [![GitHub Actions CI Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20CI/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+CI%22)
 [![GitHub Actions Linting Status](https://github.com/nf-core/variantmtb/workflows/nf-core%20linting/badge.svg)](https://github.com/nf-core/variantmtb/actions?query=workflow%3A%22nf-core+linting%22)
@@ -19,22 +20,27 @@
 
 <!-- TODO nf-core: Write a 1-2 sentence summary of what data the pipeline is for and what it does -->
 
-**nf-core/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the biological and predictive relevance of tumor variants.
+**qbic-pipelines/variantmtb** is a bioinformatics best-practice analysis pipeline for querying variant databases to investigate the diagnostic, prognostic and predictive relevance of tumor variants.
 
 The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The [Nextflow DSL2](https://www.nextflow.io/docs/latest/dsl2.html) implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from [nf-core/modules](https://github.com/nf-core/modules) in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
 
 <!-- TODO nf-core: Add full-sized test dataset and amend the paragraph below if applicable -->
 
-On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results).
+<!-- On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the [nf-core website](https://nf-co.re/variantmtb/results). -->
+
+<p align="center">
+    <img title="variantMTB workflow" src="docs/images/variantMTB_workflow.png" width=70%>
+</p>
 
 ## Pipeline summary
 
 <!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
 
-1. Filter for variants that [PASS](http://samtools.github.io/bcftools/bcftools.html)
-2. Query [Clinvar](https://www.ncbi.nlm.nih.gov/clinvar/)
-3. Query [Oncokb](https://www.oncokb.org/)
-4. Query [Civic](https://civicdb.org/variants/home)
+1. Normalize variants [bcftools norm](https://www.htslib.org/doc/1.0/bcftools.html#norm)
+2. Index VCF file [tabix](http://www.htslib.org/doc/tabix.html)
+3. Query [CGI](https://www.cancergenomeinterpreter.org/home)
+4. Query [CIViC](https://civicdb.org/variants/home)
+5. Categorize variants and create a comprehensive HTML report
 
 ## Quick Start
 
@@ -60,7 +66,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
    <!-- TODO nf-core: Update the example "typical command" below used to run the pipeline -->
 
    ```console
-   nextflow run nf-core/variantmtb --input samplesheet.csv --outdir <OUTDIR> --genome GRCh37 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
+   nextflow run qbic-pipelines/variantmtb -r dev --input samplesheet.csv --outdir <OUTDIR> --genome GRCh38 -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
    ```
 
 ## Documentation
@@ -69,7 +75,7 @@ The nf-core/variantmtb pipeline comes with documentation about the pipeline [usa
 
 ## Credits
 
-nf-core/variantmtb was originally written by SusiJo.
+nf-core/variantmtb was originally started by SusiJo and mainly developed by mapo9.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,4 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,filename,genome,filetype
+sample_1,path/to/file_1.vcf,GRCh38,mutations
+sample_2,path/to/file_2.vcf,GRCh38,mutations
+sample_3,path/to/file_3.vcf,GRCh38,mutations
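
The new samplesheet layout above replaces the FASTQ columns with `sample,filename,genome,filetype`. A minimal sketch of reading that layout and rejecting a sheet that still uses the old header could look like the following. The helper name `read_samplesheet` and the in-memory example sheet are illustrative assumptions, not part of the pipeline.

```python
import csv
import io

# Column names taken from the new assets/samplesheet.csv header.
REQUIRED_COLUMNS = {"sample", "filename", "genome", "filetype"}

# Illustrative in-memory sheet; real runs would open a CSV file instead.
SAMPLESHEET = """\
sample,filename,genome,filetype
sample_1,path/to/file_1.vcf,GRCh38,mutations
sample_2,path/to/file_2.vcf,GRCh38,mutations
"""


def read_samplesheet(handle):
    """Return the rows as dicts, or raise if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(
            f"Samplesheet is missing columns: {', '.join(sorted(missing))}"
        )
    return list(reader)


rows = read_samplesheet(io.StringIO(SAMPLESHEET))
for row in rows:
    print(row["sample"], row["filename"], row["genome"], row["filetype"])
```

A sheet with the previous `sample,fastq_1,fastq_2` header raises `ValueError` here, which mirrors the `required_columns` check in `bin/check_samplesheet.py`.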
diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
@@ -28,12 +28,30 @@ class RowChecker:
     VALID_FORMATS = (
         ".vcf",
         ".vcf.gz",
+        ".tsv",
+        ".ext"
+    )
+
+    VALID_GENOMES = (
+        "hg19",
+        "GRCh37",
+        "hg38",
+        "GRCh38"
+    )
+
+    VALID_FILETYPES = (
+        "mutations",
+        "cnas",
+        "translocations"
     )
 
     def __init__(
         self,
         sample_col="sample",
-        first_col="vcf",
+        filename_col="filename",
+        genome_col="genome",
+        filetype_col="filetype",
+
         **kwargs,
     ):
         """
@@ -42,13 +60,20 @@ def __init__(
         Args:
             sample_col (str): The name of the column that contains the sample name
                 (default "sample").
-            first_col (str): The name of the column that contains the first (or only)
-                VCF file path (default "vcf").
+            filename_col (str): The name of the column that contains the input file path (default "filename").
+            genome_col (str): The name of the column that contains the reference genome
+                (default "genome").
+            filetype_col (str): The name of the column that contains the type of the input file
+                (default "filetype").
+            
+            
 
         """
         super().__init__(**kwargs)
         self._sample_col = sample_col
-        self._first_col = first_col
+        self._filename_col = filename_col
+        self._genome_col = genome_col
+        self._filetype_col = filetype_col
         self._seen = set()
         self.modified = []
 
@@ -62,8 +87,8 @@ def validate_and_transform(self, row):
 
         """
         self._validate_sample(row)
-        self._validate_first(row)
-        self._seen.add((row[self._sample_col], row[self._first_col]))
+        self._validate_entries(row)
+        self._seen.add((row[self._sample_col], row[self._filename_col]))
         self.modified.append(row)
 
     def _validate_sample(self, row):
@@ -72,18 +97,38 @@ def _validate_sample(self, row):
         # Sanitize samples slightly.
         row[self._sample_col] = row[self._sample_col].replace(" ", "_")
 
-    def _validate_first(self, row):
-        """Assert that the first VCF entry is non-empty and has the right format."""
-        assert len(row[self._first_col]) > 0, "At least the first VCF file is required."
-        self._validate_vcf_format(row[self._first_col])
+    def _validate_entries(self, row):
+        """
+        Assert that the input file entry is non-empty and has a supported extension.
+        Assert that a supported reference genome is given.
+        Assert that a supported file type is provided.
+        """
+        assert len(row[self._filename_col]) > 0, "At least the first VCF file is required."
+        self._validate_file_format(row[self._filename_col])
+        self._validate_genome(row[self._genome_col])
+        self._validate_filetype(row[self._filetype_col])
 
-    def _validate_vcf_format(self, filename):
+    def _validate_file_format(self, filename):
         """Assert that a given filename has one of the expected VCF extensions."""
         assert any(filename.endswith(extension) for extension in self.VALID_FORMATS), (
             f"The VCF file has an unrecognized extension: {filename}\n"
             f"It should be one of: {', '.join(self.VALID_FORMATS)}"
         )
 
+    def _validate_genome(self, genome_name):
+        """Assert that the given reference genome is compatible with the pipeline."""
+        assert any(genome_name == genome for genome in self.VALID_GENOMES), (
+            f"The provided reference genome is not supported: {genome_name}\n"
+            f"It should be one of: {', '.join(self.VALID_GENOMES)}"
+        )
+
+    def _validate_filetype(self, file_type):
+        """Assert that the given file type is supported by the pipeline."""
+        assert any(file_type == f_t for f_t in self.VALID_FILETYPES), (
+            f"The provided filetype is not supported: {file_type}\n"
+            f"It should be one of: {', '.join(self.VALID_FILETYPES)}"
+        )
+
     def validate_unique_samples(self):
         """
         Assert that the combination of sample name and VCF filename is unique.
@@ -155,16 +200,16 @@ def check_samplesheet(file_in, file_out):
         This function checks that the samplesheet follows the following structure,
         see also the `viral recon samplesheet`_::
 
-            sample,vcf
-            SAMPLE1,SAMPLE1.vcf.gz
-            SAMPLE2,SAMPLE2.vcf.gz
-            SAMPLE3,SAMPLE3.vcf.gz
+            sample,filename,genome,filetype
+            SAMPLE1,SAMPLE1.vcf.gz,hg19,mutations
+            SAMPLE2,SAMPLE2.tsv,GRCh37,translocations
+            SAMPLE3,SAMPLE3.vcf,hg19,mutations
 
     .. _viral recon samplesheet:
         https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv
 
     """
-    required_columns = {"sample", "vcf"}
+    required_columns = {"sample", "filename", "genome", "filetype"}
     # See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
     with file_in.open(newline="") as in_handle:
         reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
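
The per-row checks added to `RowChecker` in this file can be sketched in isolation as below. The allowed values are copied from `VALID_FORMATS`, `VALID_GENOMES` and `VALID_FILETYPES` in the diff. The function `row_errors` is a hypothetical stand-in: it collects messages in a list instead of using `assert`, which is a deliberate swap because assertions are skipped when Python runs with `-O`.

```python
# Allowed values copied from the RowChecker class in the diff above.
VALID_FORMATS = (".vcf", ".vcf.gz", ".tsv", ".ext")
VALID_GENOMES = ("hg19", "GRCh37", "hg38", "GRCh38")
VALID_FILETYPES = ("mutations", "cnas", "translocations")


def row_errors(row):
    """Return a list of problems found in one samplesheet row (empty if none)."""
    errors = []
    filename = row.get("filename", "")
    if not filename:
        errors.append("An input file is required.")
    elif not filename.endswith(VALID_FORMATS):  # str.endswith accepts a tuple
        errors.append(f"Unrecognized extension: {filename}")
    if row.get("genome") not in VALID_GENOMES:
        errors.append(f"Unsupported reference genome: {row.get('genome')}")
    if row.get("filetype") not in VALID_FILETYPES:
        errors.append(f"Unsupported filetype: {row.get('filetype')}")
    return errors


good = {"sample": "SAMPLE1", "filename": "SAMPLE1.vcf.gz",
        "genome": "hg19", "filetype": "mutations"}
bad = {"sample": "SAMPLE2", "filename": "SAMPLE2.bam",
       "genome": "mm10", "filetype": "mutations"}

print(row_errors(good))
print(row_errors(bad))
```

Collecting every problem for a row, rather than stopping at the first failed check, is a design choice of this sketch only; the script in the diff raises on the first failing assertion.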

diff --git a/conf/base.config b/conf/base.config
@@ -10,7 +10,6 @@
 
 process {
 
-    // TODO nf-core: Check the defaults for all processes
     cpus   = { check_max( 1    * task.attempt, 'cpus'   ) }
     memory = { check_max( 6.GB * task.attempt, 'memory' ) }
     time   = { check_max( 4.h  * task.attempt, 'time'   ) }
@@ -24,7 +23,6 @@ process {
     //        These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
     //        If possible, it would be nice to keep the same label naming convention when
     //        adding in your local modules too.
-    // TODO nf-core: Customise requirements for specific processes.
     // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
     withLabel:process_low {
         cpus   = { check_max( 2     * task.attempt, 'cpus'    ) }

diff --git a/conf/modules.config b/conf/modules.config
@@ -18,26 +18,11 @@ process {
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
     ]
 
-
-    withName: 'BCFTOOLS_VIEW' {
-        ext.args   = "-f PASS"
-        ext.prefix = { "${meta.id}.pass" }
-        publishDir = [
-        path: { "${params.outdir}/bcftools/pass" },
-        mode: params.publish_dir_mode,
-        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
-        ]
-    }
-
-    withName: 'BCFTOOLS_SPLITVEP' {
-        // [%AF] pastes allele frequencies of all samples contained in a vcf without quotes
-        // Normal sample AF: 0.01 Tumor sample AF: 0.019 is printed as 0.010.019
-        ext.args   = "-f '%CHROM %POS %ID %REF %ALT [%AF] %IMPACT %Gene %SYMBOL %Consequence %SIFT %PolyPhen %HGVSc %HGVSp %RefSeq %Existing_variation %CLIN_SIG\n' --duplicate"
-        ext.prefix = { "${meta.id}.split_vep" }
+    withName: 'BCFTOOLS_NORM' {
+        ext.args   = "--output-type z -a --atom-overlaps ."
+        ext.prefix = { "${meta.id}.normalized" }
         publishDir = [
-        path: { "${params.outdir}/bcftools/split_vep" },
-        mode: params.publish_dir_mode,
-        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
+            enabled: false
         ]
     }
 
@@ -56,5 +41,41 @@ process {
             pattern: '*_versions.yml'
         ]
     }
+
+    withName: QUERYNATOR_CGIAPI {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: QUERYNATOR_CIVICAPI {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: QUERYNATOR_CREATEREPORT {
+        publishDir = [
+            path: { "${params.outdir}/${meta.id}" },
+            mode: params.publish_dir_mode,
+            pattern: '*'
+        ]
+    }
+
+    withName: TABIX_TABIX {
+        publishDir = [
+            enabled: false
+        ]
+    }
+
+    withName: TABIX_BGZIPTABIX {
+        publishDir = [
+            enabled: false
+        ]
+    }
 
 }
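
To make the `BCFTOOLS_NORM` settings above concrete, here is a minimal sketch of the command lines they roughly correspond to. The `ext.args` string and the `.normalized` prefix come from the config; how the module actually assembles its command (argument order, output naming, extra flags such as threads) is an assumption, and the helper names are hypothetical. The sketch only builds argument lists and does not run anything.

```python
import shlex

# ext.args value for BCFTOOLS_NORM, copied from conf/modules.config.
NORM_EXT_ARGS = "--output-type z -a --atom-overlaps ."


def bcftools_norm_cmd(vcf, fasta, sample_id):
    """Build an argv list for normalizing one VCF (assumed command layout)."""
    prefix = f"{sample_id}.normalized"  # mirrors ext.prefix in the config
    return [
        "bcftools", "norm",
        "--fasta-ref", fasta,
        "--output", f"{prefix}.vcf.gz",
        *shlex.split(NORM_EXT_ARGS),
        vcf,
    ]


def tabix_cmd(vcf_gz):
    """Build an argv list for indexing a bgzipped VCF with tabix."""
    return ["tabix", "-p", "vcf", vcf_gz]


norm = bcftools_norm_cmd("sample_1.vcf", "genome.fa", "sample_1")
print(shlex.join(norm))
print(shlex.join(tabix_cmd("sample_1.normalized.vcf.gz")))
```

Because both normalization and indexing have `publishDir` disabled in this config, only the outputs of the `QUERYNATOR_*` processes would end up under `${params.outdir}/${meta.id}`.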
diff --git a/conf/test.config b/conf/test.config
@@ -23,5 +23,9 @@ params {
     input  = "${projectDir}/tests/csv/input.csv"
 
     // Genome references
-    genome = 'hg38'
+    genome = 'GRCh37'
+    fasta = "s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
+
+    // mandatory flags
+    databases = 'civic'
 }
diff --git a/conf/test_full.config b/conf/test_full.config
diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png
diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png
diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png
diff --git a/docs/images/variantMTB_workflow.png b/docs/images/variantMTB_workflow.png