
Update usage and outputs
fgypas committed Oct 13, 2024
1 parent 0b22ff3 commit acef1cf
Showing 2 changed files with 63 additions and 17 deletions.
57 changes: 54 additions & 3 deletions docs/guides/outputs.md
# Output files

Here you can find an overview of the output files for the different workflows.

## Outputs of ZARP

After running the ZARP workflow, you will find several output files in the specified output directory. The output directory is defined in the `config.yaml` file and is typically called `results`. Here are some of the key output files:

- **Quality control**: ZARP generates comprehensive quality control reports that provide insights into the quality and composition of the sequencing experiments. These reports include metrics such as read quality, alignment statistics, and gene expression summaries.
A description of the different directories is shown below:
- `summary_salmon`: Summary files for Salmon quantifications.
- `zpca`: Output files for ZARP's principal component analysis.

### Quality Control (QC) outputs

Within the `multiqc_summary` directory, you will find an interactive HTML file (`multiqc_report.html`) with various QC metrics that can help you interpret your results. An example file is shown below.

On the left you can find a navigation bar that takes you to the different sections of the report.
<img width="80%" src="../images/zarp_multiqc_zpca.png">
</div>

### Quantification (Gene and transcript estimate) outputs

Within the `summary_kallisto` directory, you can find the following files:
- `genes_counts.tsv`: Matrix with the gene counts. The first column (index) contains the gene names and the first row (column) contains the sample names. This file can later be used for downstream differential expression analysis.
Within the `summary_salmon/quantmerge` directory, you can find the following files:
- `transcripts_numreads.tsv`: Matrix with the transcript counts. The first column (index) contains the transcript names and the first row (column) contains the sample names. This file can later be used for downstream differential transcript analysis.
- `transcripts_tpm.tsv`: Matrix with the transcript TPM estimates.
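
These matrices are plain tab-separated files, so you can inspect them directly before running a dedicated differential expression tool. Below is a minimal sketch using pandas; the paths assume the default `results` output directory and may differ in your setup.

```python
# Minimal sketch (assumed paths): load the quantification matrices with pandas.
import pandas as pd

# Gene-level counts from the kallisto summaries; genes are rows, samples are columns.
gene_counts = pd.read_csv(
    "results/summary_kallisto/genes_counts.tsv", sep="\t", index_col=0
)

# Transcript-level TPM estimates merged from the Salmon quantifications.
transcript_tpm = pd.read_csv(
    "results/summary_salmon/quantmerge/transcripts_tpm.tsv", sep="\t", index_col=0
)

print(gene_counts.shape)         # (number of genes, number of samples)
print(gene_counts.iloc[:5, :3])  # peek at the first few genes and samples
```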

### Alignment outputs

Within the `samples` directory, you can find a directory for each sample, and within these directories you can find the output files of the individual steps. Some alignment files can easily be opened in a genome browser for downstream analysis:
- In the `map_genome` directory you can find a file with the suffix `.Aligned.sortedByCoord.out.bam` and the corresponding index (`.bai`) file. This is the output of the STAR aligner.
- In the `bigWig` directory you can find two folders: `UniqueMappers` and `MultimappersIncluded`. Within these folders you can find the bigWig files for the plus and minus strands. These files are convenient to load in a genome browser (like IGV) to view the genome coverage of the mappings.
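
If you want to check an alignment programmatically rather than in a genome browser, a minimal sketch with pysam is shown below. The sample name and the contig used for fetching are assumptions; only the `.Aligned.sortedByCoord.out.bam` suffix and the accompanying `.bai` index are taken from the description above.

```python
# Minimal sketch (assumed sample name and contig): inspect the STAR BAM with pysam.
import pysam

# Hypothetical path; replace "my_sample" with one of your sample names.
bam_path = "results/samples/my_sample/map_genome/my_sample.Aligned.sortedByCoord.out.bam"
bam = pysam.AlignmentFile(bam_path, "rb")

# Mapped/unmapped totals are read from the .bai index that sits next to the BAM.
print("mapped reads:", bam.mapped)
print("unmapped reads:", bam.unmapped)

# Fetch alignments overlapping a region; the contig name depends on your annotation.
for read in bam.fetch("chr1", 1_000_000, 1_001_000):
    print(read.query_name, read.reference_start)
    break

bam.close()
```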


## Outputs of SRA data download

Once you run the pipeline that downloads data from the Sequence Read Archive (SRA), you can find the following file structure:

```
results/
`-- sra_downloads
|-- compress
| |-- ERR2248142
| | |-- ERR2248142.fastq.gz
| | `-- ERR2248142.se.tsv
| |-- SRR18549672
| | |-- SRR18549672.pe.tsv
| | |-- SRR18549672_1.fastq.gz
| | `-- SRR18549672_2.fastq.gz
| `-- SRR18552868
| |-- SRR18552868.fastq.gz
| `-- SRR18552868.se.tsv
|-- fasterq_dump
| `-- tmpdir
|-- get_layout
| |-- ERR2248142
| | `-- SINGLE.info
| |-- SRR18549672
| | `-- PAIRED.info
| `-- SRR18552868
| `-- SINGLE.info
|-- prefetch
| |-- ERR2248142
| | `-- ERR2248142.sra
| |-- SRR18549672
| | `-- SRR18549672.sra
| `-- SRR18552868
| `-- SRR18552868.sra
`-- sra_samples.out.tsv
```

All results are stored under the output directory you have specified in your `config.yaml` file (`results` in this case). The `sra_samples.out.tsv` file summarizes all the experiments that were fetched from SRA: it contains the run identifier and the path(s) to the corresponding FASTQ file(s). An example output file looks like the following:
```tsv
sample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
```
Some of the filenames indicate whether the experiment was sequenced in single-end (`SINGLE`, `se`) or paired-end (`PAIRED`, `pe`) mode.
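
Because the table lists `fq2` only for paired-end runs, the library layout can be derived from it directly. The sketch below assumes the default `results` output directory shown in the tree above.

```python
# Minimal sketch: read the sample table written by the SRA download workflow
# and derive the library layout (single- vs paired-end) from the fq2 column.
import pandas as pd

samples = pd.read_csv("results/sra_downloads/sra_samples.out.tsv", sep="\t")

# Paired-end runs have a path in fq2; single-end runs leave it empty (read as NaN).
samples["layout"] = samples["fq2"].notna().map({True: "paired-end", False: "single-end"})
print(samples[["sample", "layout"]])
```
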
23 changes: 9 additions & 14 deletions docs/guides/usage.md
create a directory for your workflow run and move into it with:
values. Have a look at the examples in the `tests/` directory to see what the
files should look like, specifically:

- [samples.tsv](https://github.com/zavolanlab/zarp/blob/dev/tests/input_files/samples.tsv)
- [config.yaml](https://github.com/zavolanlab/zarp/blob/dev/tests/input_files/config.yaml)


4. Create a runner script. Pick one of the following choices for either local
your run.
```bash
bash run.sh
```
6. For more information on the output files, please go to the [output files](https://zavolanlab.github.io/zarp/guides/outputs/) section.
## How to download data from SRA?
conversion into FASTQ.
The workflow expects the following parameters in the configuration file:
* `samples`, a sample table (tsv) with column *sample* containing *SRR*
  identifiers (ERR and DRR are also supported), as in this example
  [samples.tsv](https://github.com/zavolanlab/zarp/blob/dev/tests/input_files/sra_samples.tsv) file
  (a minimal sketch for creating such a table follows this list).
* `outdir`, an output directory
* `samples_out`, a pointer to a modified sample table with the locations of
the corresponding FASTQ files
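
A minimal sketch for creating such a sample table with pandas is shown below; the output filename is only an example and should match whatever path you reference under `samples` in your configuration.

```python
# Minimal sketch: write a samples table for the SRA download workflow.
# Only the "sample" column needs to be filled with SRR/ERR/DRR run identifiers;
# the FASTQ paths are added later to the table referenced by `samples_out`.
import pandas as pd

table = pd.DataFrame(
    {
        "sample": ["SRR18552868", "SRR18549672", "ERR2248142"],
        "fq1": ["", "", ""],
        "fq2": ["", "", ""],
    }
)
table.to_csv("sra_samples.tsv", sep="\t", index=False)
```
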
## How to determine sample information?
An independent Snakemake workflow, `workflow/rules/htsinfer.smk`, populates the `samples.tsv` required by ZARP with the sample-specific parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size`. Those parameters are inferred from the provided `fastq.gz` files by [HTSinfer](https://github.com/zavolanlab/htsinfer).
> Note: The workflow uses the implicit temporary directory
> from Snakemake, which is called with `resources.tmpdir`.
The workflow expects the following config:
* `samples`, a sample table (tsv) with column *sample* containing sample identifiers, as well as columns *fq1* and *fq2* containing the paths to the input fastq files
see example [here](https://github.com/zavolanlab/zarp/blob/dev/tests/input_files/samples_htsinfer.tsv). If the table contains further ZARP-compatible columns (see [pipeline documentation](https://github.com/zavolanlab/zarp/blob/dev/pipeline_documentation.md)), the values specified there by the user are given priority over HTSinfer's results.
* `outdir`, an output directory
* `samples_out`, path to a modified sample table with inferred parameters
* `records`, set to 100000 by default
The workflow is started with a regular `snakemake` call. However, on the example files this call will exit with an error, as not all parameters can be inferred from them. The Snakemake argument `--keep-incomplete` makes sure the `samples_htsinfer.tsv` file can nevertheless be inspected.
After successful execution, and provided all parameters could either be inferred or were specified by the user, `[OUTDIR]/[SAMPLES_OUT]` should contain a populated table with the parameters `seqmode`, `f1_3p`, `f2_3p`, `organism`, `libtype` and `index_size` for all input samples.
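
As a quick check that the inference worked, you can load the populated table and look for missing values in the inferred columns. The path below is a placeholder for your actual `[OUTDIR]/[SAMPLES_OUT]` setting.

```python
# Minimal sketch (placeholder path): verify that the inferred columns are populated.
import pandas as pd

# Replace with the actual [OUTDIR]/[SAMPLES_OUT] path from your configuration.
populated = pd.read_csv("results_htsinfer/samples_htsinfer.tsv", sep="\t")

inferred_columns = ["seqmode", "f1_3p", "f2_3p", "organism", "libtype", "index_size"]
missing = populated[inferred_columns].isna().any()
print(missing)  # True for every column that could not be inferred for at least one sample
```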
