Commit

Move description of output files in github page outputs.md
fgypas committed Oct 12, 2024
1 parent 21c571d commit 25d1291
Showing 15 changed files with 133 additions and 135 deletions.
135 changes: 0 additions & 135 deletions README.md
@@ -268,141 +268,6 @@ your run.
bash run.sh
```
## Output files
After running the ZARP workflow, you will find several output files in the specified output directory. The output directory is defined in the `config.yaml` file and it is normally called `results`. Here are some of the key output files:
- **Quality control**: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.
- **Quantification files**: These files contain the gene and transcript level expression values for each sample. They provide information about the abundance of each gene / transcript in the RNA-seq data.
- **Alignment files**: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.
After a run you will find the following structure within the `results` directory:
```bash
.
├── multiqc_config.yaml
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca
```
A description of the different directories is shown below:
- `results`: The main output directory for the ZARP workflow.
- `mus_musculus`: A subdirectory for the organism-specific results.
- `multiqc_summary`: Summary files generated by MultiQC.
- `samples`: Sample specific outputs. A directory is created for each sample.
- `summary_kallisto`: Summary files for Kallisto quantifications.
- `summary_salmon`: Summary files for Salmon quantifications.
- `zpca`: Output files for ZARP's principal component analysis.
### Quality Control (QC) outputs
Within the `multiqc_summary` directory, you will find an interactive HTML file (`multiqc_report.html`) with various QC metrics that can help you interpret your results. An example report is shown below:
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc.png>
</div>
On the left you can find a navigation bar that takes you into different sections and subsections of the tools.
- The `General Statistics` section summarizes the output of most tools, including statistics on mapped reads, the percentage of duplicate reads, and the percentage of adapters trimmed.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_general_statistics.png>
</div>
- The `FastQC: raw reads` section contains plots and quality statistics of the fastq files. Some examples are shown below like the number of duplicate reads in an experiment, the average quality of the fastq files per position, or the percent of GC content.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_fastqc_sequence_counts_plot.png>
</div>
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_fastqc_per_base_sequence_quality_plot.png>
</div>
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_fastqc_per_sequence_gc_content_plot.png>
</div>
- The `Cutadapt: adapter removal` and `Cutadapt: polyA tails removal` sections show the number or the percentage of reads trimmed.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_cutadapt_filtered_reads_plot.png>
</div>
- The `FastQC: trimmed reads` section contains plots and quality statistics of the fastq files after adapter trimming. The plots are similar to the section `FastQC: raw reads`.
- The `STAR` section shows the number and percentage of reads that are mapped using the STAR aligner.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_star_alignment_plot.png>
</div>
- The `ALFA` section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_alfa_categories.png>
</div>
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_alfa_biotypes.png>
</div>
- The `TIN` section shows the Transcript Integrity Number of the samples.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_tin_score.png>
</div>
- The `Salmon` section shows the fragment length distribution of the reads.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_salmon_plot.png>
</div>
- The `Kallisto` section shows the number of reads that were aligned.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_kallisto_alignment.png>
</div>
- Finally, the `zpca` Salmon and Kallisto sections show PCA plots for the expression levels of genes and transcripts.
<div align="center">
<img width="80%" src=images/output_files/zarp_multiqc_zpca.png>
</div>
### Quantification (Gene and transcript estimate) outputs
Within the `summary_kallisto` directory, you can find the following files:
- `genes_counts.tsv`: Matrix with the gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
- `genes_tpm.tsv`: Matrix with the gene TPM estimates.
- `transcripts_counts.tsv`: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
- `transcripts_tpm.tsv`: Matrix with the transcript TPM estimates.
- `tx2geneID.tsv`: A table mapping transcript IDs to gene IDs.
Within the `summary_salmon/quantmerge` directory, you can find the following files:
- `genes_numreads.tsv`: Matrix with the gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
- `genes_tpm.tsv`: Matrix with the gene TPM estimates.
- `transcripts_numreads.tsv`: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
- `transcripts_tpm.tsv`: Matrix with the transcript TPM estimates.
### Alignment outputs
Within the `samples` directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can be opened directly in a genome browser or used for other downstream analysis:
- In the `map_genome` directory you can find a file with the suffix `.Aligned.sortedByCoord.out.bam` and the corresponding index (`.bai`) file. This is the output of the STAR aligner.
- In the `bigWig` directory you can find two folders: `UniqueMappers` and `MultimappersIncluded`. Within these folders you find the bigWig files for the plus and minus strand. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
# Sample downloads from SRA
An independent Snakemake workflow `workflow/rules/sra_download.smk` is included
133 changes: 133 additions & 0 deletions docs/guides/outputs.md
@@ -0,0 +1,133 @@
# Output files

After running the ZARP workflow, you will find several output files in the output directory. This directory is defined in the `config.yaml` file and is normally called `results`. Here are some of the key output files:

- **Quality control**: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.

- **Quantification files**: These files contain the gene and transcript level expression values for each sample. They provide information about the abundance of each gene / transcript in the RNA-seq data.

- **Alignment files**: These files contain the aligned reads for each sample in BAM format. They provide information about the mapping of the reads to the reference genome.


After a run you will find the following structure within the `results` directory:

```bash
.
├── multiqc_config.yaml
└── mus_musculus
    ├── multiqc_summary
    ├── samples
    ├── summary_kallisto
    ├── summary_salmon
    └── zpca
```

A description of the different directories is shown below:

- `results`: The main output directory for the ZARP workflow.
- `mus_musculus`: A subdirectory for the organism-specific results.
- `multiqc_summary`: Summary files generated by MultiQC.
- `samples`: Sample specific outputs. A directory is created for each sample.
- `summary_kallisto`: Summary files for Kallisto quantifications.
- `summary_salmon`: Summary files for Salmon quantifications.
- `zpca`: Output files for ZARP's principal component analysis.
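
The layout above can be checked programmatically after a run. The following is a minimal sketch, not part of ZARP itself: the helper name `missing_outputs` is hypothetical, and it assumes the organism directory (`mus_musculus` in this example) sits directly under the output directory, as in the tree shown above.

```python
from pathlib import Path
import tempfile

# Subdirectories listed in the description above.
EXPECTED = ("multiqc_summary", "samples", "summary_kallisto", "summary_salmon", "zpca")


def missing_outputs(results_dir, organism):
    """Return the expected subdirectories that are absent for one organism."""
    base = Path(results_dir) / organism
    return [name for name in EXPECTED if not (base / name).is_dir()]


# Demonstration on a throwaway directory that lacks the `zpca` folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED[:-1]:
        (Path(tmp) / "mus_musculus" / name).mkdir(parents=True)
    missing = missing_outputs(tmp, "mus_musculus")

print(missing)  # ['zpca']
```

On a real run you would pass the output directory configured in `config.yaml` instead of the temporary directory; an empty list means all listed directories are present.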

## Quality Control (QC) outputs

Within the `multiqc_summary` directory, you will find an interactive HTML file (`multiqc_report.html`) with various QC metrics that can help you interpret your results. An example report is shown below:

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc.png>
</div>

On the left you can find a navigation bar that takes you into different sections and subsections of the tools.

- The `General Statistics` section summarizes the output of most tools, including statistics on mapped reads, the percentage of duplicate reads, and the percentage of adapters trimmed.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_general_statistics.png>
</div>

- The `FastQC: raw reads` section contains plots and quality statistics of the fastq files. Some examples are shown below, such as the number of duplicate reads in an experiment, the average quality of the fastq files per position, and the percentage of GC content.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_fastqc_sequence_counts_plot.png>
</div>

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_fastqc_per_base_sequence_quality_plot.png>
</div>

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_fastqc_per_sequence_gc_content_plot.png>
</div>

- The `Cutadapt: adapter removal` and `Cutadapt: polyA tails removal` sections show the number or the percentage of reads trimmed.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_cutadapt_filtered_reads_plot.png>
</div>


- The `FastQC: trimmed reads` section contains plots and quality statistics of the fastq files after adapter trimming. The plots are similar to the section `FastQC: raw reads`.

- The `STAR` section shows the number and percentage of reads that are mapped using the STAR aligner.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_star_alignment_plot.png>
</div>

- The `ALFA` section shows the number of reads mapped to genomic categories (stop codon, 5'-UTR, CDS, intergenic, etc.) and gene biotypes (protein coding genes, miRNA, tRNA, etc.) for unique reads and multimappers.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_alfa_categories.png>
</div>

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_alfa_biotypes.png>
</div>

- The `TIN` section shows the Transcript Integrity Number of the samples.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_tin_score.png>
</div>

- The `Salmon` section shows the fragment length distribution of the reads.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_salmon_plot.png>
</div>

- The `Kallisto` section shows the number of reads that were aligned.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_kallisto_alignment.png>
</div>

- Finally, the `zpca` Salmon and Kallisto sections show PCA plots for the expression levels of genes and transcripts.

<div align="center">
<img width="80%" src=../images/output_files/zarp_multiqc_zpca.png>
</div>

## Quantification (Gene and transcript estimate) outputs

Within the `summary_kallisto` directory, you can find the following files:
- `genes_counts.tsv`: Matrix with the gene counts. The first column (index) contains the gene names and the first row (header) contains the sample names. This file can later be used for downstream differential expression analysis.
- `genes_tpm.tsv`: Matrix with the gene TPM estimates.
- `transcripts_counts.tsv`: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (header) contains the sample names. This file can later be used for downstream differential transcript analysis.
- `transcripts_tpm.tsv`: Matrix with the transcript TPM estimates.
- `tx2geneID.tsv`: A table mapping transcript IDs to gene IDs.

Within the `summary_salmon/quantmerge` directory, you can find the following files:
- `genes_numreads.tsv`: Matrix with the gene counts. The first column (index) contains the gene names and the first row (header) contains the sample names. This file can later be used for downstream differential expression analysis.
- `genes_tpm.tsv`: Matrix with the gene TPM estimates.
- `transcripts_numreads.tsv`: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (header) contains the sample names. This file can later be used for downstream differential transcript analysis.
- `transcripts_tpm.tsv`: Matrix with the transcript TPM estimates.
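
The count and TPM matrices described above can be read with a few lines of Python. This is a minimal sketch using only the standard library, not an official ZARP utility. The function name `read_matrix` is hypothetical, the inline data is made up, and the handling of the header row is an assumption: the sketch accepts both a header that starts with a cell for the index column and one that lists only the sample names.

```python
import csv
import io


def read_matrix(handle):
    """Parse a tab-separated matrix into {feature: {sample: value}}."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, body = rows[0], rows[1:]
    # Assumption: the header either starts with a cell for the index column
    # or lists only the sample names (one field fewer than the data rows).
    if body and len(header) == len(body[0]) - 1:
        samples = header
    else:
        samples = header[1:]
    return {row[0]: dict(zip(samples, map(float, row[1:]))) for row in body}


# Tiny made-up matrix standing in for a file such as genes_counts.tsv.
example = io.StringIO(
    "\tsample_A\tsample_B\n"
    "gene_1\t10\t0\n"
    "gene_2\t3.5\t7\n"
)
matrix = read_matrix(example)
print(matrix["gene_2"]["sample_A"])  # 3.5
```

To use it on real output, open one of the files listed above (for example `genes_counts.tsv` in the `summary_kallisto` directory) with `open(path, newline="")` and pass the file handle to `read_matrix`.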

## Alignment outputs

Within the `samples` directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can be opened directly in a genome browser or used for other downstream analysis:
- In the `map_genome` directory you can find a file with the suffix `.Aligned.sortedByCoord.out.bam` and the corresponding index (`.bai`) file. This is the output of the STAR aligner.
- In the `bigWig` directory you can find two folders: `UniqueMappers` and `MultimappersIncluded`. Within these folders you find the bigWig files for the plus and minus strand. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
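
To collect the alignment files for all samples, for example before loading them into a genome browser, a short script can walk the `samples` directory. The sketch below is illustrative only: `find_alignments` is a hypothetical helper, the sample names and the file name prefix in the demonstration are made up, and the search is recursive because the exact nesting below each sample directory is assumed rather than documented here.

```python
from pathlib import Path
import tempfile


def find_alignments(samples_dir):
    """Map each sample directory name to its coordinate-sorted BAM files."""
    found = {}
    for sample in sorted(Path(samples_dir).iterdir()):
        if not sample.is_dir():
            continue
        # Recursive search: the nesting below the sample directory is an
        # assumption, only the file suffix is taken from the text above.
        found[sample.name] = sorted(sample.rglob("*.Aligned.sortedByCoord.out.bam"))
    return found


# Demonstration on a throwaway directory with hypothetical sample names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_A", "sample_B"):
        target = Path(tmp) / name / "map_genome"
        target.mkdir(parents=True)
        (target / f"{name}.Aligned.sortedByCoord.out.bam").touch()
    alignments = find_alignments(tmp)
    summary = {s: [p.name for p in paths] for s, paths in alignments.items()}

print(summary)
```

The same pattern should work for the bigWig files by changing the glob pattern to match the files under `bigWig/UniqueMappers` or `bigWig/MultimappersIncluded`.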
File renamed without changes
File renamed without changes
