Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up smrnaseq PR; container update; heatmap sort; more precise batch effect documentation #214

Merged
merged 8 commits into from
Aug 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- [#213](https://github.com/qbic-pipelines/rnadeseq/pull/21) Added smrnaseq input support
- [#214](https://github.com/qbic-pipelines/rnadeseq/pull/214) Added smrnaseq input support -->follow-up to check that the CI test works. Modified smrnaseq testdata to be like QBiC datasets. Updated container env and fixed a resulting bug with include_graphics. Heatmaps are now equally ordered in zip and report
- [#213](https://github.com/qbic-pipelines/rnadeseq/pull/213) Added smrnaseq input support
- [#212](https://github.com/qbic-pipelines/rnadeseq/pull/212) Added computational methods if no --software_versions
- [#206](https://github.com/qbic-pipelines/rnadeseq/pull/206) Added logic to decide between rlog and vst, added tryCatch for heatmap saving because this only works unreliably
- [#202](https://github.com/qbic-pipelines/rnadeseq/pull/202) Added background list to pathway analysis
Expand Down
82 changes: 43 additions & 39 deletions assets/RNAseq_report.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -101,16 +101,17 @@ build_smrnaseq_df <- function(files) {
if (is.data.frame(merged_mat)) {

# Check if rownames are the same
if (rownames(file_tab) != rownames(merged_mat)) {
if (!identical(rownames(file_tab), rownames(merged_mat))) {
stop(paste("Rownames of smrnaseq file", file_tab, "differ from the rownames of the other files; exiting..."))
}
merged_mat[[basename(f)]] <- file_tab[,2]
# Create column from QBiC code of current file name (substring of length 10)
merged_mat[[substr(basename(f), 1, 10)]] <- file_tab[,2]
} else {

# Otherwise initiate it from third matrix column and rownames
# Otherwise initiate it from third matrix column and rownames; as above, only use QBiC code, not full file name
merged_mat <- data.frame(file_tab[,2])
rownames(merged_mat) <- rownames(file_tab)
colnames(merged_mat) <- c(basename(f))
colnames(merged_mat) <- c(substr(basename(f), 1, 10))
}
}
merged_mat
Expand Down Expand Up @@ -941,7 +942,9 @@ The read mapping statistics to the reference genome for each sample are shown be
```

```{r STAR_percentages_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1800, fig.cap="STAR: Mapping Statistics", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_star_alignment_plot_1_pc.svg"))
# Disable auto-conversion of paths to relative paths as we provide include_graphics() with absolute paths. Also, error=F is necessary in every call to not fail with some stupid error
knitr.graphics.rel_path = F
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_star_alignment_plot_1_pc.svg"), error=F)
```

```{r STAR_readnums,echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand All @@ -953,7 +956,7 @@ cat(paste0("***
```

```{r STAR_readnums_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1800, fig.cap="STAR: Mapping Statistics", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_star_alignment_plot_1.svg"))
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_star_alignment_plot_1.svg"), error=F)
```

```{r FC_read_assignment, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand All @@ -979,7 +982,7 @@ The statistics of read assignment to genes are shown below. Most reads should be
```

```{r FC_percentages_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1200, fig.cap="featureCounts: Assignments", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_featureCounts_assignment_plot_1_pc.svg"))
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_featureCounts_assignment_plot_1_pc.svg"), error=F)
```

```{r FC_readnums, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand All @@ -993,7 +996,7 @@ cat(paste0("***
```

```{r FC_readnums_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1200, fig.cap="featureCounts: Assignments", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_featureCounts_assignment_plot_1.svg"))
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_featureCounts_assignment_plot_1.svg"), error=F)
```

```{r read_feat_distrib, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand All @@ -1013,7 +1016,7 @@ In RNAseq experiments, the majority of the reads should be assigned to CDS exons
```

```{r RSeqC_percentages_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1200, fig.cap="RSeQC: Read Distribution", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_rseqc_read_distribution_plot_1_pc.svg"))
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_rseqc_read_distribution_plot_1_pc.svg"), error=F)
```

```{r RSeqC_readnums, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand All @@ -1027,7 +1030,7 @@ cat(paste0("***
```

```{r RSeqC_readnums_plot, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/")), out.width="160%", dpi=1200, fig.cap="RSeQC: Read Distribution", fig.align='center'}
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_rseqc_read_distribution_plot_1.svg"))
knitr::include_graphics(paste0(wd, "/QC/multiqc_plots/svg/mqc_rseqc_read_distribution_plot_1.svg"), error=F)
```

```{r MQC_end, echo=FALSE, message=FALSE, warning=FALSE, results='asis', eval=dir.exists(paste0(wd,"/QC/multiqc_data/"))}
Expand Down Expand Up @@ -1143,10 +1146,11 @@ The original plot can be downloaded [here](./differential_gene_expression/plots/
################## SAMPLE DISTANCES HEATMAP ##################
# Sample distances, this chunk is hidden as otherwise, the heatmap is shown multiple times
sampleDists <- dist(t(assay(if (run_rlog) rld else vsd)))
sampleDistClust <- hclust(sampleDists) # Do clustering as input for heatmaps so that their column orders are the same
sampleDistMatrix <- as.matrix(sampleDists)
colors = colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Blues")))(255)
par(oma=c(3,3,3,3))
dist_heatmap <- pheatmap(mat = sampleDistMatrix, color=colors)
dist_heatmap <- pheatmap(mat = sampleDistMatrix, color=colors, cluster_rows = sampleDistClust, cluster_cols = sampleDistClust)

svg("differential_gene_expression/plots/Heatmaps_of_distances.svg")
dist_heatmap
Expand All @@ -1159,7 +1163,7 @@ dist_heatmap
dev.off()

# Recompute heatmap with heatmaply for interactive plot in report
dist_heatmap <- heatmaply(sampleDistMatrix, dendrogram = "both", col=colors, grid_color = "grey")
dist_heatmap <- heatmaply(sampleDistMatrix, dendrogram = "both", col=colors, grid_color = "grey", Rowv = sampleDistClust, Colv = "Rowv")
```

```{r heatmap_show, echo=FALSE, message=FALSE, warning=FALSE, out.width="90%", out.height="130%", fig.asp=1, dpi=1000, fig.align='center'}
Expand Down Expand Up @@ -1644,7 +1648,6 @@ cat("Plots for the [genes of interest](./differential_gene_expression/metadata/g

<!-- In case KEGG analysis was also included; differentiating the cases with and without contrasts -->


<!-- PA start -->
```{r load_PA_libs, eval=params$pathway_analysis, echo=FALSE, message=FALSE, warning=FALSE, results = 'hide'}

Expand All @@ -1668,7 +1671,7 @@ short_organism_name <- substr(organism,1,3)
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(params$species_library, lib=species_dir, version="3.14", force=T)
BiocManager::install(params$species_library, lib=species_dir, version="3.17", force=T)
library(params$species_library, lib.loc=species_dir, character.only=T)
species_library_installed <- get(params$species_library)

Expand Down Expand Up @@ -2256,30 +2259,30 @@ cat("No software versions file was provided for primary analysis.")
```{r software_versions_RNASeq, echo=FALSE, results='asis', eval=isProvided(params$path_software_versions)}

# In the following part, I use newlines (in code) where possible whenever the paste0 function switches from a string to a variable call to better distinguish between the two so it is clearer in which lines quotes are needed and in which not --> Cannot always use newlines as this sometimes translates into the report.html
# TODO remove F!
if (params$input_type == "featurecounts" & F){
cat(paste0(
"## RNAseq data analysis

The Nextflow-based nf-core pipeline `rnaseq ", as.character(ifelse(software_versions_csv, version["nf-core/rnaseq"], version["Workflow"][[1]]["nf-core/rnaseq"])),
"` [^1] was used for the RNAseq Bioinfomatics analysis. `FASTQC ",
as.character(ifelse(software_versions_csv, version["FastQC"], version["FASTQC"][[1]]["fastqc"])),
"` [^3] [@andrews2010fastqc] was used to determine quality of the FASTQ files.
Subsequently, adapter trimming was conducted with `Trim Galore ",
as.character(ifelse(software_versions_csv, version["Trim Galore!"], version["TRIMGALORE"][[1]]["trimgalore"])),
"` [^4] [@krueger2012trim]. `STAR v",
as.character(ifelse(software_versions_csv, substring((version["STAR"]), 7, ), version["MAKE_TRANSCRIPTS_FASTA"][[1]]["star"])),
"` [@Dobin2013] aligner was used to map the reads that passed the quality control to the reference genome.
The RNA-seq data quality control was performed with `RSeQC ",
as.character(ifelse(software_versions_csv, version["RSeQC"], version["RSEQC_BAMSTAT"][[1]]["rseqc"])),
"` [@wang2012rseqc] and read quantification of the features (e.g. genes) with `",
quant_tool,
" ",
quant_version,
"` ",
quant_cite,
"."
))

if (params$input_type == "featurecounts"){
cat(paste0(
"## RNAseq data analysis

The Nextflow-based nf-core pipeline `rnaseq ", as.character(ifelse(software_versions_csv, version["nf-core/rnaseq"], version["Workflow"][[1]]["nf-core/rnaseq"])),
"` [^1] was used for the RNAseq Bioinfomatics analysis. `FASTQC ",
as.character(ifelse(software_versions_csv, version["FastQC"], version["FASTQC"][[1]]["fastqc"])),
"` [^3] [@andrews2010fastqc] was used to determine quality of the FASTQ files.
Subsequently, adapter trimming was conducted with `Trim Galore ",
as.character(ifelse(software_versions_csv, version["Trim Galore!"], version["TRIMGALORE"][[1]]["trimgalore"])),
"` [^4] [@krueger2012trim]. `STAR v",
as.character(ifelse(software_versions_csv, substring((version["STAR"]), 7, ), version["MAKE_TRANSCRIPTS_FASTA"][[1]]["star"])),
"` [@Dobin2013] aligner was used to map the reads that passed the quality control to the reference genome.
The RNA-seq data quality control was performed with `RSeQC ",
as.character(ifelse(software_versions_csv, version["RSeQC"], version["RSEQC_BAMSTAT"][[1]]["rseqc"])),
"` [@wang2012rseqc] and read quantification of the features (e.g. genes) with `",
quant_tool,
" ",
quant_version,
"` ",
quant_cite,
"."
))
} else if (params$input_type == "smrnaseq"){
cat(paste0(
"## smRNA-seq data analysis
Expand All @@ -2304,7 +2307,8 @@ quant_tool,
quant_version,
"` ",
quant_cite,
"."
".",
"\nThe output from `SAMtools`, more specifically the `<QBiC-Code>_mature_hairpin.sorted.idxstats` and `<QBiC-Code>_mature.sorted.idxstats` files, were then used to create a counts matrix. This was done by combining all of these files into a single table, using only the rownames and the third column (second column after the rownames) of each file which contains the counts. The last row of each file (which always has an asterisk `*` as rowname) was excluded. The resulting counts matrix was used as input for the `rnadeseq` pipeline."
))
}
```
Expand Down
1 change: 1 addition & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -351,6 +351,7 @@ Option needed to account for batch effects in the data. To control for batch eff

Then the DESeq2 script calculates the contrasts as usual, the batch effect just needs to be considered during the design definition.
For more information, please check the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
Please note: Setting the `--batch_effect` option will NOT remove such effects from your data. The parameter is only used to visualize the effects by generating two sets of PCA and boxplots instead of only one; one set of plots will be generated from the input data, the other from the results of the batch correction in order to illustrate how the batches affect data.

### `--min_DEG_pathway`

Expand Down
62 changes: 31 additions & 31 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,40 +8,40 @@ channels:
- defaults
- r
dependencies:
- bioconda::bioconductor-annotationdbi=1.56.2
- bioconda::bioconductor-deseq2=1.34.0
- bioconda::bioconductor-genefilter=1.76.0
- bioconda::bioconductor-genomeinfodbdata=1.2.7
- bioconda::bioconductor-impute=1.68.0
- bioconda::bioconductor-limma=3.50.3
- bioconda::bioconductor-pathview=1.34.0
- bioconda::bioconductor-rtracklayer=1.54.0
- bioconda::bioconductor-summarizedexperiment=1.24.0
- bioconda::bioconductor-tximeta=1.12.0
- bioconda::bioconductor-tximport=1.22.0
- bioconda::bioconductor-vsn=3.62.0
- conda-forge::pandoc=2.17.1.1
- conda-forge::r-base=4.1.2
- bioconda::bioconductor-annotationdbi=1.62.2
- bioconda::bioconductor-deseq2=1.40.2
- bioconda::bioconductor-genefilter=1.82.1
- bioconda::bioconductor-genomeinfodbdata=1.2.10
- bioconda::bioconductor-impute=1.74.1
- bioconda::bioconductor-limma=3.56.2
- bioconda::bioconductor-pathview=1.40.0
- bioconda::bioconductor-rtracklayer=1.60.0
- bioconda::bioconductor-summarizedexperiment=1.30.2
- bioconda::bioconductor-tximeta=1.18.0
- bioconda::bioconductor-tximport=1.28.0
- bioconda::bioconductor-vsn=3.68.0
- conda-forge::pandoc=3.1.3
- conda-forge::r-base=4.3.1
- conda-forge::r-details=0.3.0
- conda-forge::r-dplyr=1.0.10
- conda-forge::r-dt=0.20
- conda-forge::r-extrafont=0.17
- conda-forge::r-dplyr=1.1.2
- conda-forge::r-dt=0.28
- conda-forge::r-extrafont=0.19
- conda-forge::r-formattable=0.2.1
- conda-forge::r-ggplot2=3.3.5
- conda-forge::r-ggrepel=0.9.2
- conda-forge::r-ggvenn=0.1.9
- conda-forge::r-gplots=3.1.1
- conda-forge::r-gprofiler2=0.2.1
- conda-forge::r-heatmaply=1.3.0
- conda-forge::r-ggplot2=3.4.2
- conda-forge::r-ggrepel=0.9.3
- conda-forge::r-ggvenn=0.1.10
- conda-forge::r-gplots=3.1.3
- conda-forge::r-gprofiler2=0.2.2
- conda-forge::r-heatmaply=1.4.2
- conda-forge::r-kableextra=1.3.4
- conda-forge::r-knitr=1.37
- conda-forge::r-optparse=1.7.1
- conda-forge::r-knitr=1.43
- conda-forge::r-optparse=1.7.3
- conda-forge::r-pheatmap=1.0.12
- conda-forge::r-plyr=1.8.6
- conda-forge::r-rcolorbrewer=1.1_2
- conda-forge::r-plyr=1.8.8
- conda-forge::r-rcolorbrewer=1.1_3
- conda-forge::r-reshape2=1.4.4
- conda-forge::r-rmarkdown=2.11
- conda-forge::r-rmarkdown=2.23
- conda-forge::r-sessioninfo=1.2.2
- conda-forge::r-svglite=2.1.0
- conda-forge::r-tidyverse=1.3.1
- conda-forge::r-yaml=2.2.2
- conda-forge::r-svglite=2.1.1
- conda-forge::r-tidyverse=2.0.0
- conda-forge::r-yaml=2.3.7
16 changes: 8 additions & 8 deletions testdata/smrnaseq/samplesheet.tsv
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
QBiC Code Condition: treatment batch Secondary Name
Clone1_NN1 clone1 1 1
Clone1_NN3 clone1 3 2
Clone9_NN1 clone9 1 3
Clone9_NN2 clone9 2 4
Clone9_NN3 clone9 3 5
Control_N1 control 1 6
Control_N2 control 2 7
Control_N3 control 3 8
QABCD00001 clone1 1 1
QABCD00002 clone1 3 2
QABCD00003 clone9 1 3
QABCD00004 clone9 2 4
QABCD00005 clone9 3 5
QABCD00006 control 1 6
QABCD00007 control 2 7
QABCD00008 control 3 8
14 changes: 7 additions & 7 deletions tests/test_smrnaseq.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@
- test_smrnaseq
files:
- path: results_test/differential_gene_expression/metadata/smrnaseq_files.txt
md5sum: ce3cc117fc953a8ff4d17d0118123faf
md5sum: 80efe9419bec686191344c6ce9a4ece7
- path: results_test/differential_gene_expression/DE_genes_tables/DE_contrast_condition_treatment_clone9_vs_clone1.tsv
md5sum: 46ca6c2ec1046e088853c96b28cd30e3
md5sum: 9526e0b2be54b49a8d1694b6cdd939b7
- path: results_test/differential_gene_expression/DE_genes_tables/DE_contrast_condition_treatment_control_vs_clone1.tsv
md5sum: 74214d33037aeded1bd0c839973a9e4c
md5sum: e5e4bfab55e82a0d3d80b219d60f2df0
- path: results_test/differential_gene_expression/final_gene_table/final_DE_gene_list.tsv
md5sum: 9f4950203e56a94235b00a1d554fa78c
- path: results_test/differential_gene_expression/gene_counts_tables/deseq2_library_scaled_gene_counts.tsv
Expand Down Expand Up @@ -76,14 +76,14 @@
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_clone9_vs_clone1/DE_contrast_condition_treatment_clone9_vs_clone1_KEGG_pathway_enrichment_plot.png
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_clone9_vs_clone1/DE_contrast_condition_treatment_clone9_vs_clone1_KEGG_pathway_enrichment_plot.svg
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_clone9_vs_clone1/DE_contrast_condition_treatment_clone9_vs_clone1_KEGG_pathway_enrichment_results.tsv
md5sum: b5c4c68a304f08a54b2b9d8969334e8c
md5sum: 2637f5f6f4e74e7723280449e3ac647b
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_clone9_vs_clone1/DE_contrast_condition_treatment_clone9_vs_clone1_pathway_enrichment_results.tsv
md5sum: ab088ff01a4e3991be8a424e676afcdf
md5sum: a091e2ae9a678599d9e994082907d646
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_control_vs_clone1/DE_contrast_condition_treatment_control_vs_clone1_KEGG_pathway_enrichment_plot.pdf
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_control_vs_clone1/DE_contrast_condition_treatment_control_vs_clone1_KEGG_pathway_enrichment_plot.png
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_control_vs_clone1/DE_contrast_condition_treatment_control_vs_clone1_KEGG_pathway_enrichment_plot.svg
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_control_vs_clone1/DE_contrast_condition_treatment_control_vs_clone1_KEGG_pathway_enrichment_results.tsv
md5sum: 4a2ed390ae3e9fea628bd9c014e88cc6
md5sum: 497cbe026880c9ea78c5591c7c4d07f7
- path: results_test/pathway_analysis/DE_contrast_condition_treatment_control_vs_clone1/DE_contrast_condition_treatment_control_vs_clone1_pathway_enrichment_results.tsv
md5sum: db2149ba5f2dbb6d4aed3f6284072f39
md5sum: 63a2adf7c4a971dc7443503cf968e455
- path: results_test/RNAseq_report.html
Loading