Quality Control #2

Open
16 tasks
BarryDigby opened this issue Nov 29, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@BarryDigby
Owner

BarryDigby commented Nov 29, 2021

We need to perform quality control on the sequencing reads, on the BAM files generated by aligners, and on the alignment statistics the aligners output.

Use the small BAM file provided at the following link to test out each of the quality control tools in your container locally.

You need to understand which inputs each quality control tool requires and which outputs it produces.

FastQC

Requires sequencing reads as input.

  • Raw sequencing reads
  • Trimmed sequencing reads
  • Alignment log files
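
A minimal sketch of a FastQC run on one pair of read files. The FASTQ file names and the fastqc_raw output directory are placeholders I made up, not files from this repository. -o (output directory) and -t (threads) are standard fastqc options. Expect one .html report and one .zip archive per input file in the output directory.

```shell
# Hypothetical read files; substitute your own raw or trimmed FASTQ files.
reads_1="sample_R1.fastq.gz"
reads_2="sample_R2.fastq.gz"
outdir="fastqc_raw"

# FastQC does not create the output directory itself.
mkdir -p "$outdir"

if command -v fastqc >/dev/null 2>&1 && [ -f "$reads_1" ] && [ -f "$reads_2" ]; then
    # One HTML report and one zip archive are written per input file.
    fastqc -o "$outdir" -t 2 "$reads_1" "$reads_2"
else
    echo "fastqc or the input reads were not found; nothing was run" >&2
fi
```

The same command should work for trimmed reads; only the input files and the output directory change.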

Samtools

Requires a BAM file generated by an aligner as input. The BAM file must be coordinate-sorted and have an index file. Refer to samtools --help for documentation on the stats we want to include in our report:

  • depth
  • flagstat
  • idxstats
  • stats
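
A rough sketch of the four samtools commands run against the test BAM file. The BAM file name is the one used in the RSeQC examples in this issue. The output file names and extensions are my own choice (an assumption), so adjust them to whatever the reporting step expects.

```shell
# Test BAM file name taken from the RSeQC examples in this issue.
bam="RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam"
prefix="${bam%.bam}"   # strip the .bam extension to build output names

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # idxstats needs the index (.bai) next to the BAM file.
    samtools index "$bam"

    # Each subcommand prints to stdout, so redirect into a file.
    samtools flagstat "$bam" > "${prefix}.flagstat"
    samtools idxstats "$bam" > "${prefix}.idxstats"
    samtools stats    "$bam" > "${prefix}.stats"
    # depth prints one line per covered position, so this file can be large.
    samtools depth    "$bam" > "${prefix}.depth.txt"
else
    echo "samtools or $bam not found; nothing was run" >&2
fi
```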

QualiMap

Requires a BAM file generated by an aligner as input. The BAM file must be coordinate-sorted and have an index file. Refer to qualimap --help for documentation on the bamqc and rnaseq modules.

The rnaseq module requires additional input files, such as a GTF file. You can use the GTF file linked here, which corresponds to the test BAM file.

Remove the first line of the genes.gtf file, because its chromosome name is nonsensical:

tail -n +2 genes.gtf > tmp.gtf && mv tmp.gtf genes.gtf
  • bamqc
  • rnaseq
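
A minimal sketch of both QualiMap modules, assuming the test BAM file and the cleaned genes.gtf sit in the current directory. The output directory names are placeholders of mine. -bam, -gtf and -outdir are the QualiMap options I am relying on, so check them against qualimap bamqc and qualimap rnaseq help output for your version.

```shell
bam="RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam"
gtf="genes.gtf"
prefix="${bam%.bam}"   # used only to name the output directories

if command -v qualimap >/dev/null 2>&1 && [ -f "$bam" ] && [ -f "$gtf" ]; then
    # bamqc: general alignment QC; needs only the BAM file.
    qualimap bamqc -bam "$bam" -outdir "${prefix}_bamqc"

    # rnaseq: RNA-seq specific QC; additionally needs the GTF annotation.
    qualimap rnaseq -bam "$bam" -gtf "$gtf" -outdir "${prefix}_rnaseq"
else
    echo "qualimap, $bam or $gtf not found; nothing was run" >&2
fi
```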

RSeQC

RSeQC requires annotations in BED format for the -r flag, so convert the GTF file to a BED file. I had serious issues using gtf2bed, so as a workaround, use gffread:

gffread -F --keep-exon-attrs genes.gtf --bed > genes.bed
  • infer_experiment.py (A)
  • bam_stat.py (A)
  • inner_distance.py (paired-end only; single-end data yields empty files) (B)
  • read_distribution.py (A)
  • read_duplication.py (B)
  • junction_annotation.py (C)
  • junction_saturation.py (B)

The meaning behind A/B/C:

A: The Python script does not have an -o flag; redirect stdout to a .txt file using >:

infer_experiment.py -i RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam -r genes.bed > RAP1_UNINDUCED_REP2.Aligned.out_infer_experiment.txt

B: Has the -o flag; pass the file basename as its argument:

read_duplication.py -i RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam -o RAP1_UNINDUCED_REP2.Aligned.out

C: Has the -o flag, but stderr must also be redirected to a .txt file using 2> for MultiQC compatibility:

junction_annotation.py -i RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam -r genes.bed -o RAP1_UNINDUCED_REP2.Aligned.out 2> RAP1_UNINDUCED_REP2.Aligned.out_junctions.txt
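
The three patterns can be combined into one script. This is only a sketch: the output prefix is derived from the BAM file name so every script writes consistently named files. The _bam_stat.txt and _read_distribution.txt suffixes are my own naming, following the pattern above. junction_saturation.py is run with -o here because it takes an output prefix in the RSeQC releases I have checked; confirm with junction_saturation.py --help.

```shell
bam="RAP1_UNINDUCED_REP2.Aligned.out_sorted.bam"
bed="genes.bed"
prefix="${bam%_sorted.bam}"   # RAP1_UNINDUCED_REP2.Aligned.out

if command -v infer_experiment.py >/dev/null 2>&1 && [ -f "$bam" ] && [ -f "$bed" ]; then
    # A: no -o flag, so capture stdout in a .txt file.
    infer_experiment.py  -i "$bam" -r "$bed" > "${prefix}_infer_experiment.txt"
    bam_stat.py          -i "$bam"           > "${prefix}_bam_stat.txt"
    read_distribution.py -i "$bam" -r "$bed" > "${prefix}_read_distribution.txt"

    # B: -o takes the output prefix; the script names its own files.
    read_duplication.py    -i "$bam" -o "$prefix"
    inner_distance.py      -i "$bam" -r "$bed" -o "$prefix"   # paired-end only
    junction_saturation.py -i "$bam" -r "$bed" -o "$prefix"

    # C: -o for the output prefix, plus stderr captured for MultiQC.
    junction_annotation.py -i "$bam" -r "$bed" -o "$prefix" 2> "${prefix}_junctions.txt"
else
    echo "RSeQC scripts, $bam or $bed not found; nothing was run" >&2
fi
```
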
BarryDigby added the enhancement (New feature or request) label Nov 29, 2021
@BarryDigby
Owner Author

BarryDigby commented Nov 29, 2021

Quality control is arguably the most difficult step, as it requires intimate knowledge of the entire workflow.

Work in collaboration with the groups focusing on mapping RNA-Seq reads; they will be able to tell you which output files the aligners generate.

MultiQC goes at the very end of the workflow, collecting all of the files produced by FastQC, RSeQC, QualiMap and Samtools. This will create a comprehensive HTML report at the end of our workflow 😎
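
A minimal sketch of that final step, assuming all QC outputs were written somewhere below the current directory. The multiqc_report directory name is a placeholder of mine. MultiQC searches the given directory recursively for files it recognises and writes the HTML report into the -o directory.

```shell
results_dir="."               # directory holding the QC output files
report_dir="multiqc_report"   # hypothetical output directory name

if command -v multiqc >/dev/null 2>&1; then
    # Scan results_dir for recognised QC files and build one combined report.
    multiqc "$results_dir" -o "$report_dir"
else
    echo "multiqc not found; nothing was run" >&2
fi
```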
