diff --git a/2023.11/day3/igv_visualisation/index.html b/2023.11/day3/igv_visualisation/index.html index 8f6de49..0de8c68 100644 --- a/2023.11/day3/igv_visualisation/index.html +++ b/2023.11/day3/igv_visualisation/index.html @@ -688,7 +688,7 @@

A first glance: the E. coli

cd ~/project/results/alignments
 samtools index SRR519926.sorted.region.bam
 
-Download it together with it’s index file (SRR519926.sorted.region.bam.bai) and the reference genome (ecoli-strK12-MG1655.fasta) to your desktop.

+Download the alignment (SRR519926.sorted.region.bam) together with it’s index file (SRR519926.sorted.region.bam.bai) and the reference genome (ecoli-strK12-MG1655.fasta) to your desktop.

If working with Docker

If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v option). So if you have used -v C:\Users\myusername\ngs-course:/root/project, your files will be in C:\Users\myusername\ngs-course.

@@ -768,10 +768,10 @@

HCC1143 data set

-

Exercise: What is the sequence coverage for that base? And the percentage T?

+

Exercise: What is the sequence sequencing depth for that base? And the percentage T?

Answer -

The coverage is 62, and 25 reads (40%) T.

+

The depth is 62, and 25 reads (40%) T.

Navigate to region chr21:19,800,320-19,818,162

Load repeat tracks by selecting File > Load from Server… from the main menu and then select Annotations > Variation and Repeats > Repeat Masker

diff --git a/2023.11/search/search_index.json b/2023.11/search/search_index.json index d7cbb38..5af8ce7 100644 --- a/2023.11/search/search_index.json +++ b/2023.11/search/search_index.json @@ -1 +1 @@ -{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses ) or in your own time. After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} License & copyright License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Enrolled to the course Independently Material This website Zoom meeting (through mail) Google doc (through mail) Slack channel Learning outcomes After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualise the output Learning experiences This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack You can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository . Note It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web! Learning outcomes After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output Exercises Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Home"},{"location":"#_1","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses ) or in your own time. After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output","title":""},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#license-copyright","text":"License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Enrolled to the course Independently","title":"License & copyright"},{"location":"#material","text":"This website Zoom meeting (through mail) Google doc (through mail) Slack channel","title":"Material"},{"location":"#learning-outcomes","text":"After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualise the output","title":"Learning outcomes"},{"location":"#learning-experiences","text":"This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack You can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository . Note It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!","title":"Asking questions"},{"location":"#learning-outcomes_1","text":"After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output","title":"Learning outcomes"},{"location":"#exercises_1","text":"Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"course_schedule/","text":"Note Apart from the starting time the time schedule is indicative . Because we can not plan a course by the minute, in practice the time points will deviate. Day 1 block start end subject introduction 9:15 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work Day 2 block start end subject block 1 9:15 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work Day 3 block start end subject block 1 9:15 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject introduction 9:15 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work","title":"Day 2"},{"location":"course_schedule/#day-3","text":"block start end subject block 1 9:15 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations","title":"Day 3"},{"location":"group_work/","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working with Docker If you are working with Docker, I assume you are working independently and therefore can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 50G of harddisk. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually. Material Download the presentation Roles & organisation Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group. Working directories Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~/project ln -s /group_work/GROUP_NAME/ ./ # replace [GROUP_NAME] with your group directory Now you can find your group directory at ~/ . Use this to share files. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here . Project 1: Variant analysis of human data Aim : Find variants on chromosome 20 from three samples In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them. You can get the data by running these commands: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Tasks Important! Stick to the principles for reproducible analysis described here Download the required data Do a QC on the data with fastqc Trim adapters and low quality bases with fastp . Make sure to include the option --detect_adapter_for_pe . To prevent overwriting fastp.html , specify a report filename for each sample with the option --html . After trimming the adapters, run fastqc again to see whether all adapters are gone. Create an index for bowtie2. At the same time create a fasta index ( .fai file) with samtools faidx . Check which options to use, and align with bowtie2 . At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. Mark duplicates on the individual bam files with gatk MarkDuplicates (see hints below). Merge the three bam files with samtools merge . Index the bam file afterwards. Run freebayes to call variants. Only call variants on the region chr20:10018000-10220000 by specifying the -r option. Load your alignments together with the vcf containing the variants in IGV. Check out e.g. chr20:10,026,397-10,026,638 . Run multiqc to get an overall quality report. Questions Have a look at the quality of the reads. Are there any adapters in there? Did adapter trimming change that? How is the base quality? Could you improve that? How many duplicates were in the different samples (hint: use samtools flagstat )? Why is it important to remove them for variant analysis? Why did you add read groups to the bam files? Where is this information added in the bam file? Are there variants that look spurious? What could be the cause of that? What information in the vcf can you use to evaluate variant quality? There are two high quality variants in chr20:10,026,397-10,026,638 . What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. Hints You can add readgroups to the alignment file with bowtie2 with the options --rg-id and --rg , e.g. ( $SAMPLE is a variable containing a sample identifier): bowtie2 \\ -x ref.fa \\ -1 r1.fastq.gz \\ -2 r2.fastq.gz \\ --rg-id $SAMPLE \\ --rg SM: $SAMPLE \\ To run gatk MarkDuplicates you will only need to specify --INPUT and --OUTPUT , e.g.: gatk MarkDuplicates \\ --INPUT sample.bam \\ --OUTPUT sample.md.bam \\ --METRICS_FILE sample.metrics.txt Project 2: Long-read genome sequencing Aim : Align long reads from RNA-seq data to a reference genome. In this project, you will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . Here you can find the BioProject page . It is Oxford Nanopore Technology sequencing data of amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. In this project, we will align a few of the samples to the reference genome, and assess the quality of reads and the alignment. Download the human reference genome like this: wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Tasks Important! Stick to the principles for reproducible analysis described here Check out the BioProject, and download two samples that interest you. Perform QC with fastqc Perform QC with NanoPlot Align with minimap2 with default parameters Figure how you should set parameters -x and -G Evaluate the alignment quality (e.g. alignment rates, mapping quality) Compare different samples in read quality, alignment rates, depth, etc. Questions Have a look at the quality report. What are the average read lengths? Is that expected? What is the average read quality? What kind of accuracy would you expect? Note any differences between fastqc and NanoPlot ? How is that compared to the publication? Check out the options -x and -G of minimap2 . Are the defaults appropriate? You might consider using -x map-ont or -x splice . Do you see differences in the alignment in e.g. IGV? How are spliced alignments stored in the SAM file with the different settings of -x ? How deep is the gene sequenced? Do you already see evidence for splice variants in the alignments? Downloading from SRA prefetch [ SRR number ] fastq-dump --gzip [ SRR number ] Accuracy from quality scores Find the equation to calculate error probability from quality score on Wikipedia . Comparing fastqc and Nanoplot For comparing fastqc and NanoPlot , check out this blog of the author of NanoPlot, and this thread . Running minimap2 Here\u2019s an example command for minimap2 : minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ [ REFERENCE ] .fa \\ [ FASTQFILE ] .fastq.gz \\ | samtools sort \\ | samtools view -bh > [ OUTPUT ] .bam Intron sizes Check out the the intron sizes of CACNA1C in e.g. IGV or UCSC genome browser. How does that relate to the parameter -G ? More resources Need e.g. a gtf file? Here\u2019s the ensembl page Project 3: Short-read RNA-seq of mice. Aim: Generate a count matrix to estimate differential gene expression. In this project you will be working with data from: Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases . Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6 Here\u2019s the BioProject page . Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Tasks Important! Stick to the principles for reproducible analysis described here Download the tar file, and find out what\u2019s in the data folder Do a QC on the fastq files with fastqc Trim adapters and low quality bases with fastp After trimming the adapters, run fastqc again to see whether all adapters are gone. Check which options to use, and align with hisat2 Evaluate the alignment quality (e.g. alignment rates, mapping quality) Have a look at the alignments in IGV, e.g. check out Sparcl1 . For this, you can use the built-in genome ( Mouse (mm10) ). Do you see any evidence for differential splicing? Run featureCounts on both alignments. Have a look at the option -Q . For further suggestions, see the hints below. Compare the count matrices in R (find a script to get started here ; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio ) Questions Check the description at the SRA sample page. What kind of sample is this? How does the quality of the reads look? Anything special about the overrepresented sequences? (Hint: blast some overrepresented sequences, and see what they are) Did trimming improve the QC results? What could be the cause of the warnings/errors in the fastqc reports? What are the alignment rates? How are spliced alignments stored in the SAM file? Are there any differences between the treatments in the percentage of assigned alignments by featureCounts ? What is the cause of this? Can you find any genes that seem to be differentially expressed? What is the effect of setting the option -Q in featureCounts ? Hints We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing. command option bowtie2-build --threads hisat2-build --threads fastqc --threads cutadapt --cores bowtie2 --threads hisat2 --threads featureCounts -T Here\u2019s some example code for hisat2 and featureCounts . Everything in between <> should be replaced with specific arguments. Here\u2019s an example for hisat2 : hisat2-build hisat2 \\ -x \\ -1 \\ -2 \\ -p \\ | samtools sort \\ | samtools view -bh \\ > Example code featureCounts : featureCounts \\ -p \\ -T 2 \\ -a \\ -o \\ *.bam","title":"Group work"},{"location":"group_work/#material","text":"Download the presentation","title":"Material"},{"location":"group_work/#roles-organisation","text":"Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"group_work/#working-directories","text":"Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~/project ln -s /group_work/GROUP_NAME/ ./ # replace [GROUP_NAME] with your group directory Now you can find your group directory at ~/ . Use this to share files. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Working directories"},{"location":"group_work/#project-1-variant-analysis-of-human-data","text":"Aim : Find variants on chromosome 20 from three samples In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them. You can get the data by running these commands: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz","title":"Project 1: Variant analysis of human data"},{"location":"group_work/#tasks","text":"Important! Stick to the principles for reproducible analysis described here Download the required data Do a QC on the data with fastqc Trim adapters and low quality bases with fastp . Make sure to include the option --detect_adapter_for_pe . To prevent overwriting fastp.html , specify a report filename for each sample with the option --html . After trimming the adapters, run fastqc again to see whether all adapters are gone. Create an index for bowtie2. At the same time create a fasta index ( .fai file) with samtools faidx . Check which options to use, and align with bowtie2 . At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. Mark duplicates on the individual bam files with gatk MarkDuplicates (see hints below). Merge the three bam files with samtools merge . Index the bam file afterwards. Run freebayes to call variants. Only call variants on the region chr20:10018000-10220000 by specifying the -r option. Load your alignments together with the vcf containing the variants in IGV. Check out e.g. chr20:10,026,397-10,026,638 . Run multiqc to get an overall quality report.","title":"Tasks"},{"location":"group_work/#questions","text":"Have a look at the quality of the reads. Are there any adapters in there? Did adapter trimming change that? How is the base quality? Could you improve that? How many duplicates were in the different samples (hint: use samtools flagstat )? Why is it important to remove them for variant analysis? Why did you add read groups to the bam files? Where is this information added in the bam file? Are there variants that look spurious? What could be the cause of that? What information in the vcf can you use to evaluate variant quality? There are two high quality variants in chr20:10,026,397-10,026,638 . What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV.","title":"Questions"},{"location":"group_work/#hints","text":"You can add readgroups to the alignment file with bowtie2 with the options --rg-id and --rg , e.g. ( $SAMPLE is a variable containing a sample identifier): bowtie2 \\ -x ref.fa \\ -1 r1.fastq.gz \\ -2 r2.fastq.gz \\ --rg-id $SAMPLE \\ --rg SM: $SAMPLE \\ To run gatk MarkDuplicates you will only need to specify --INPUT and --OUTPUT , e.g.: gatk MarkDuplicates \\ --INPUT sample.bam \\ --OUTPUT sample.md.bam \\ --METRICS_FILE sample.metrics.txt","title":"Hints"},{"location":"group_work/#project-2-long-read-genome-sequencing","text":"Aim : Align long reads from RNA-seq data to a reference genome. In this project, you will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . Here you can find the BioProject page . It is Oxford Nanopore Technology sequencing data of amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. In this project, we will align a few of the samples to the reference genome, and assess the quality of reads and the alignment. Download the human reference genome like this: wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz","title":" Project 2: Long-read genome sequencing"},{"location":"group_work/#tasks_1","text":"Important! Stick to the principles for reproducible analysis described here Check out the BioProject, and download two samples that interest you. Perform QC with fastqc Perform QC with NanoPlot Align with minimap2 with default parameters Figure how you should set parameters -x and -G Evaluate the alignment quality (e.g. alignment rates, mapping quality) Compare different samples in read quality, alignment rates, depth, etc.","title":"Tasks"},{"location":"group_work/#questions_1","text":"Have a look at the quality report. What are the average read lengths? Is that expected? What is the average read quality? What kind of accuracy would you expect? Note any differences between fastqc and NanoPlot ? How is that compared to the publication? Check out the options -x and -G of minimap2 . Are the defaults appropriate? You might consider using -x map-ont or -x splice . Do you see differences in the alignment in e.g. IGV? How are spliced alignments stored in the SAM file with the different settings of -x ? How deep is the gene sequenced? Do you already see evidence for splice variants in the alignments? Downloading from SRA prefetch [ SRR number ] fastq-dump --gzip [ SRR number ] Accuracy from quality scores Find the equation to calculate error probability from quality score on Wikipedia . Comparing fastqc and Nanoplot For comparing fastqc and NanoPlot , check out this blog of the author of NanoPlot, and this thread . Running minimap2 Here\u2019s an example command for minimap2 : minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ [ REFERENCE ] .fa \\ [ FASTQFILE ] .fastq.gz \\ | samtools sort \\ | samtools view -bh > [ OUTPUT ] .bam Intron sizes Check out the the intron sizes of CACNA1C in e.g. IGV or UCSC genome browser. How does that relate to the parameter -G ? More resources Need e.g. a gtf file? Here\u2019s the ensembl page","title":"Questions"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","text":"Aim: Generate a count matrix to estimate differential gene expression. In this project you will be working with data from: Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases . Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6 Here\u2019s the BioProject page . Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz","title":" Project 3: Short-read RNA-seq of mice."},{"location":"group_work/#tasks_2","text":"Important! Stick to the principles for reproducible analysis described here Download the tar file, and find out what\u2019s in the data folder Do a QC on the fastq files with fastqc Trim adapters and low quality bases with fastp After trimming the adapters, run fastqc again to see whether all adapters are gone. Check which options to use, and align with hisat2 Evaluate the alignment quality (e.g. alignment rates, mapping quality) Have a look at the alignments in IGV, e.g. check out Sparcl1 . For this, you can use the built-in genome ( Mouse (mm10) ). Do you see any evidence for differential splicing? Run featureCounts on both alignments. Have a look at the option -Q . For further suggestions, see the hints below. Compare the count matrices in R (find a script to get started here ; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio )","title":"Tasks"},{"location":"group_work/#questions_2","text":"Check the description at the SRA sample page. What kind of sample is this? How does the quality of the reads look? Anything special about the overrepresented sequences? (Hint: blast some overrepresented sequences, and see what they are) Did trimming improve the QC results? What could be the cause of the warnings/errors in the fastqc reports? What are the alignment rates? How are spliced alignments stored in the SAM file? Are there any differences between the treatments in the percentage of assigned alignments by featureCounts ? What is the cause of this? Can you find any genes that seem to be differentially expressed? What is the effect of setting the option -Q in featureCounts ?","title":"Questions"},{"location":"group_work/#hints_1","text":"We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing. command option bowtie2-build --threads hisat2-build --threads fastqc --threads cutadapt --cores bowtie2 --threads hisat2 --threads featureCounts -T Here\u2019s some example code for hisat2 and featureCounts . Everything in between <> should be replaced with specific arguments. Here\u2019s an example for hisat2 : hisat2-build hisat2 \\ -x \\ -1 \\ -2 \\ -p \\ | samtools sort \\ | samtools view -bh \\ > Example code featureCounts : featureCounts \\ -p \\ -T 2 \\ -a \\ -o \\ *.bam","title":"Hints"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"day1/intro/","text":"If working independently If you are doing this course independently, you can skip this part. Material Download the presentation","title":"Introduction"},{"location":"day1/intro/#material","text":"Download the presentation","title":"Material"},{"location":"day1/quality_control/","text":"Learning outcomes After having completed this chapter you will be able to: Find information about a sequence run on the Sequence Read Archive Run fastqc on sequence reads and interpret the results Trim adapters and low quality bases using fastp Material Download the presentation fastqc command line documentation cutadapt manual Unix command line E-utilities documentation Exercises Download and evaluate an E. coli dataset Check out the dataset at SRA . Exercise: Browse around the SRA entry and answer these questions: A. Is the dataset paired-end or single end? B. Which instrument was used for sequencing? C. What is the read length? D. How many reads do we have? Answers A. paired-end B. Illumina MiSeq C. 2 x 251 bp D. 400596 Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools . Every time you open a new terminal, you will have to load the environment: conda activate ngs-tools Make a directory reads in ~/project and download the reads from the SRA database using prefetch and fastq-dump from SRA-Tools into the reads directory. Use the code snippet below to create a scripts called 01_download_reads.sh . Store it in ~/project/scripts/ , and run it. 01_download_reads.sh #!/usr/bin/env bash cd ~/project mkdir reads cd reads prefetch SRR519926 fastq-dump --split-files SRR519926 Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry. Tip A read in a fastq file consists of four lines (more on that at file types ). Use Google to figure out how to count the number of reads in a fastq file. Answer e.g. from this thread on Biostars: ## forward read echo $( cat SRR519926_1.fastq | wc -l ) /4 | bc ## reverse read echo $( cat SRR519926_2.fastq | wc -l ) /4 | bc Run fastqc Exercise: Create a script to run fastqc and call it 02_run_fastqc.sh . After that, run it. Tip fastqc accepts multiple files as input, so you can use a wildcard to run fastqc on all the files in one line of code. Use it like this: *.fastq . Answer Your script ~/project/scripts/02_run_fastqc.sh should look like: 02_run_fastqc.sh #!/usr/bin/env bash cd ~/project/reads fastqc *.fastq Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems? Downloading files You can download files by right-click the file and after that select Download : Answer There seems to be: Low quality towards the 3\u2019 end (per base sequence quality) Full sequence reads with low quality (per sequence quality scores) Adapters in the sequences (adapter content) We can probably fix most of these issues by trimming. Trim the reads We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp will use those by default. A reference for the adapter sequences can be found here . Exercise: Check out the documentation of fastp , and the option defaults by running fastp --help . What is the default for the minimum base quality for a qualified base? ( option --qualified_quality_phred ) What is the default for the maximum percentage of unqualified bases in a read? (option --unqualified_percent_limit ) What is the default for the minimum required read length? (option --length_required ) What happens if one read in the pair does not meet the required length after trimming? (it can be specified with the options --unpaired1 and --unpaired2 ) Answer The minimum base quality is 15: Default 15 means phred quality >=Q15 is qualified. (int [=15]) The minimum required length is also 15: reads shorter than length_required will be discarded, default is 15. (int [=15]) If one of the reads does not meet the required length, the pair is discarded if --unpaired1 and/or --unpaired2 are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=]) . Exercise: Complete the script below called 03_trim_reads.sh (replace everything in between brackets [] ) to run fastp to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/ is created to write the trimmed reads. 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ [ QUALIFIED BASE THRESHOLD ] \\ [ MINIMUM LENGTH THRESHOLD ] \\ [ UNQUALIFIED PERCENTAGE LIMIT ] \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe Additional options Note that we have set the options --cut_front and --cut_tail that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe is set, which ensures that adapters are detected automatically for both R1 and R2. Answer Your script ( ~/project/scripts/03_trim_reads.sh ) should look like this: 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ --qualified_quality_phred 10 \\ --length_required 25 \\ --unqualified_percent_limit 80 \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe The use of \\ In the script above you see that we\u2019re using \\ at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\ to ignore the newlines when executing. Exercise: Check out the report in fastp.html . Has the quality improved? How many reads do we have left? Bonus : Although there were adapters in R2 according to fastqc , fastp has trouble finding adapters in R2. Also, after running fastp there doesn\u2019t seem to be much adapter left (you can double check by running fastqc on trimmed_SRR519926_2.fastq ). How could that be? Answers Yes, low quality 3\u2019 end, per sequence quality and adapter sequences have improved. Also the percentages >20 and >30 are higher. 624724 reads, so 312362 pairs (78.0%) The 3\u2019 end of R2 has very low quality on average, this means that trimming for low quality removes almost all bases from the original 3\u2019 end, including any adapter.","title":"Quality control"},{"location":"day1/quality_control/#learning-outcomes","text":"After having completed this chapter you will be able to: Find information about a sequence run on the Sequence Read Archive Run fastqc on sequence reads and interpret the results Trim adapters and low quality bases using fastp","title":"Learning outcomes"},{"location":"day1/quality_control/#material","text":"Download the presentation fastqc command line documentation cutadapt manual Unix command line E-utilities documentation","title":"Material"},{"location":"day1/quality_control/#exercises","text":"","title":"Exercises"},{"location":"day1/quality_control/#download-and-evaluate-an-e-coli-dataset","text":"Check out the dataset at SRA . Exercise: Browse around the SRA entry and answer these questions: A. Is the dataset paired-end or single end? B. Which instrument was used for sequencing? C. What is the read length? D. How many reads do we have? Answers A. paired-end B. Illumina MiSeq C. 2 x 251 bp D. 400596 Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools . Every time you open a new terminal, you will have to load the environment: conda activate ngs-tools Make a directory reads in ~/project and download the reads from the SRA database using prefetch and fastq-dump from SRA-Tools into the reads directory. Use the code snippet below to create a scripts called 01_download_reads.sh . Store it in ~/project/scripts/ , and run it. 01_download_reads.sh #!/usr/bin/env bash cd ~/project mkdir reads cd reads prefetch SRR519926 fastq-dump --split-files SRR519926 Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry. Tip A read in a fastq file consists of four lines (more on that at file types ). Use Google to figure out how to count the number of reads in a fastq file. Answer e.g. from this thread on Biostars: ## forward read echo $( cat SRR519926_1.fastq | wc -l ) /4 | bc ## reverse read echo $( cat SRR519926_2.fastq | wc -l ) /4 | bc","title":"Download and evaluate an E. coli dataset"},{"location":"day1/quality_control/#run-fastqc","text":"Exercise: Create a script to run fastqc and call it 02_run_fastqc.sh . After that, run it. Tip fastqc accepts multiple files as input, so you can use a wildcard to run fastqc on all the files in one line of code. Use it like this: *.fastq . Answer Your script ~/project/scripts/02_run_fastqc.sh should look like: 02_run_fastqc.sh #!/usr/bin/env bash cd ~/project/reads fastqc *.fastq Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems? Downloading files You can download files by right-click the file and after that select Download : Answer There seems to be: Low quality towards the 3\u2019 end (per base sequence quality) Full sequence reads with low quality (per sequence quality scores) Adapters in the sequences (adapter content) We can probably fix most of these issues by trimming.","title":"Run fastqc"},{"location":"day1/quality_control/#trim-the-reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp will use those by default. A reference for the adapter sequences can be found here . Exercise: Check out the documentation of fastp , and the option defaults by running fastp --help . What is the default for the minimum base quality for a qualified base? ( option --qualified_quality_phred ) What is the default for the maximum percentage of unqualified bases in a read? (option --unqualified_percent_limit ) What is the default for the minimum required read length? (option --length_required ) What happens if one read in the pair does not meet the required length after trimming? (it can be specified with the options --unpaired1 and --unpaired2 ) Answer The minimum base quality is 15: Default 15 means phred quality >=Q15 is qualified. (int [=15]) The minimum required length is also 15: reads shorter than length_required will be discarded, default is 15. (int [=15]) If one of the reads does not meet the required length, the pair is discarded if --unpaired1 and/or --unpaired2 are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=]) . Exercise: Complete the script below called 03_trim_reads.sh (replace everything in between brackets [] ) to run fastp to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/ is created to write the trimmed reads. 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ [ QUALIFIED BASE THRESHOLD ] \\ [ MINIMUM LENGTH THRESHOLD ] \\ [ UNQUALIFIED PERCENTAGE LIMIT ] \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe Additional options Note that we have set the options --cut_front and --cut_tail that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe is set, which ensures that adapters are detected automatically for both R1 and R2. Answer Your script ( ~/project/scripts/03_trim_reads.sh ) should look like this: 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ --qualified_quality_phred 10 \\ --length_required 25 \\ --unqualified_percent_limit 80 \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe The use of \\ In the script above you see that we\u2019re using \\ at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\ to ignore the newlines when executing. Exercise: Check out the report in fastp.html . Has the quality improved? How many reads do we have left? Bonus : Although there were adapters in R2 according to fastqc , fastp has trouble finding adapters in R2. Also, after running fastp there doesn\u2019t seem to be much adapter left (you can double check by running fastqc on trimmed_SRR519926_2.fastq ). How could that be? Answers Yes, low quality 3\u2019 end, per sequence quality and adapter sequences have improved. Also the percentages >20 and >30 are higher. 624724 reads, so 312362 pairs (78.0%) The 3\u2019 end of R2 has very low quality on average, this means that trimming for low quality removes almost all bases from the original 3\u2019 end, including any adapter.","title":"Trim the reads"},{"location":"day1/reproducibility/","text":"Learning outcomes After having completed this chapter you will be able to: Understand the importance of reproducibility Apply some basic rules to support reproducibilty in computational research Material Download the presentation Some good practices for reproducibility During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility: Execute the commands from a script in order to be able to trace back your steps Number scripts based on their order of execution (e.g. 01_download_reads.sh ) Give your scripts a descriptive and active name , e.g. 06_build_bowtie_index.sh Make your scripts specific , i.e. do not combine many different commands in the same script Refer to directories and variables at the beginning of the script By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research . By the end of day 2 ~/project should look (something) like this: . \u251c\u2500\u2500 alignment_output \u251c\u2500\u2500 reads \u251c\u2500\u2500 ref_genome \u251c\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 01_download_reads.sh \u2502 \u251c\u2500\u2500 02_run_fastqc.sh \u2502 \u251c\u2500\u2500 03_trim_reads.sh \u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh \u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh \u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh \u2502 \u251c\u2500\u2500 07_align_reads.sh \u2502 \u251c\u2500\u2500 08_compress_sort.sh \u2502 \u251c\u2500\u2500 09_extract_unmapped.sh \u2502 \u251c\u2500\u2500 10_extract_region.sh \u2502 \u2514\u2500\u2500 11_align_sort.sh \u2514\u2500\u2500 trimmed_data","title":"Reproducibility"},{"location":"day1/reproducibility/#learning-outcomes","text":"After having completed this chapter you will be able to: Understand the importance of reproducibility Apply some basic rules to support reproducibilty in computational research","title":"Learning outcomes"},{"location":"day1/reproducibility/#material","text":"Download the presentation","title":"Material"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility: Execute the commands from a script in order to be able to trace back your steps Number scripts based on their order of execution (e.g. 01_download_reads.sh ) Give your scripts a descriptive and active name , e.g. 06_build_bowtie_index.sh Make your scripts specific , i.e. do not combine many different commands in the same script Refer to directories and variables at the beginning of the script By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research . By the end of day 2 ~/project should look (something) like this: . \u251c\u2500\u2500 alignment_output \u251c\u2500\u2500 reads \u251c\u2500\u2500 ref_genome \u251c\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 01_download_reads.sh \u2502 \u251c\u2500\u2500 02_run_fastqc.sh \u2502 \u251c\u2500\u2500 03_trim_reads.sh \u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh \u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh \u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh \u2502 \u251c\u2500\u2500 07_align_reads.sh \u2502 \u251c\u2500\u2500 08_compress_sort.sh \u2502 \u251c\u2500\u2500 09_extract_unmapped.sh \u2502 \u251c\u2500\u2500 10_extract_region.sh \u2502 \u2514\u2500\u2500 11_align_sort.sh \u2514\u2500\u2500 trimmed_data","title":"Some good practices for reproducibility"},{"location":"day1/sequencing_technologies/","text":"Learning outcomes After having completed this chapter you will be able to: Describe the major applications of next generation sequencing Reproduce the most frequently used sequencing methods Describe the major steps taken during library preparation for Illumina sequencing Explain why the length of the sequencing reads of Illumina sequencing are limited Describe the general methods used for Oxford Nanopore Sequencing and PacBio sequencing Material Download the presentation Illumina sequencing by synthesis on YouTube NEBnext library preparation poster","title":"Sequencing technologies"},{"location":"day1/sequencing_technologies/#learning-outcomes","text":"After having completed this chapter you will be able to: Describe the major applications of next generation sequencing Reproduce the most frequently used sequencing methods Describe the major steps taken during library preparation for Illumina sequencing Explain why the length of the sequencing reads of Illumina sequencing are limited Describe the general methods used for Oxford Nanopore Sequencing and PacBio sequencing","title":"Learning outcomes"},{"location":"day1/sequencing_technologies/#material","text":"Download the presentation Illumina sequencing by synthesis on YouTube NEBnext library preparation poster","title":"Material"},{"location":"day1/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda Exercises First login If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl + ` . Or by clicking Application menu > Terminal > New Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button: Material Instructions to install docker Instructions to set up to container Exercises First login Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-introduction-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory in your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-introduction-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this: Windows Mac/Linux Get the .exe file here Double click the file Follow the instructions on the screen (defaults are usually fine) Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. Get the installation ( .sh ) script here Run in your terminal: bash Miniconda3-latest-Linux-x86_64.sh Follow the prompts Close and reopen your terminal for changes to have effect Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. After installation, you can install the required software: Windows/MacOS Linux conda create -n ngs-tools conda activate ngs-tools conda install -y -c bioconda \\ samtools \\ bwa \\ fastqc \\ sra-tools \\ bowtie2 = 2 .4.2 \\ hisat2 = 2 .2.1 \\ subread = 2 .0.1 \\ entrez-direct \\ minimap2 \\ gatk4 \\ freebayes \\ multiqc \\ fastp Download ngs-tools.yml , and generate the conda environment like this: conda env create --name ngs-tools -f ngs-tools.yml Note If that did not succeed, follow the instructions for Windows/MacOS. This will create the conda environment ngs-tools Activate it like so: conda activate ngs-tools After successful installation and activating the environment all the software required to do the exercises should be available. If you are doing project 2 (long reads) If you are doing the project 2 as part of the course, you will need to install NanoPlot as well, using pip : pip install NanoPlot A UNIX command line interface (CLI) refresher Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet Make a new directory Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16 :21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l Variables Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"Setup"},{"location":"day1/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda","title":"Learning outcomes"},{"location":"day1/server_login/#exercises","text":"","title":"Exercises"},{"location":"day1/server_login/#first-login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl + ` . Or by clicking Application menu > Terminal > New Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:","title":"First login"},{"location":"day1/server_login/#material","text":"Instructions to install docker Instructions to set up to container","title":"Material"},{"location":"day1/server_login/#exercises_1","text":"","title":"Exercises"},{"location":"day1/server_login/#first-login_1","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-introduction-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory in your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-introduction-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this: Windows Mac/Linux Get the .exe file here Double click the file Follow the instructions on the screen (defaults are usually fine) Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. Get the installation ( .sh ) script here Run in your terminal: bash Miniconda3-latest-Linux-x86_64.sh Follow the prompts Close and reopen your terminal for changes to have effect Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. After installation, you can install the required software: Windows/MacOS Linux conda create -n ngs-tools conda activate ngs-tools conda install -y -c bioconda \\ samtools \\ bwa \\ fastqc \\ sra-tools \\ bowtie2 = 2 .4.2 \\ hisat2 = 2 .2.1 \\ subread = 2 .0.1 \\ entrez-direct \\ minimap2 \\ gatk4 \\ freebayes \\ multiqc \\ fastp Download ngs-tools.yml , and generate the conda environment like this: conda env create --name ngs-tools -f ngs-tools.yml Note If that did not succeed, follow the instructions for Windows/MacOS. This will create the conda environment ngs-tools Activate it like so: conda activate ngs-tools After successful installation and activating the environment all the software required to do the exercises should be available. If you are doing project 2 (long reads) If you are doing the project 2 as part of the course, you will need to install NanoPlot as well, using pip : pip install NanoPlot","title":"First login"},{"location":"day1/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet","title":"A UNIX command line interface (CLI) refresher"},{"location":"day1/server_login/#make-a-new-directory","text":"Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts","title":"Make a new directory"},{"location":"day1/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16 :21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"day1/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"day1/server_login/#variables","text":"Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"day1/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"day2/file_types/","text":"Learning outcomes After having completed this chapter you will be able to: Describe the fasta and fastq file format Describe which information can be stored in a standard Illumina fastq title Reproduce how and why base quality is stored in a fastq file as a single ASCII character Lookup relevant information of an alignment in the header of a sam file Describe what information is stored in each column of a sam file Describe how information is stored in a sam flag Describe the bed and gtf file format Describe vcf file format Material Download the presentation File definition websites: FASTQ (wikipedia) GFF (ensembl) VCF (Wikipedia) SAM: Wikipedia samtools Zhuyi Xue","title":"File types"},{"location":"day2/file_types/#learning-outcomes","text":"After having completed this chapter you will be able to: Describe the fasta and fastq file format Describe which information can be stored in a standard Illumina fastq title Reproduce how and why base quality is stored in a fastq file as a single ASCII character Lookup relevant information of an alignment in the header of a sam file Describe what information is stored in each column of a sam file Describe how information is stored in a sam flag Describe the bed and gtf file format Describe vcf file format","title":"Learning outcomes"},{"location":"day2/file_types/#material","text":"Download the presentation File definition websites: FASTQ (wikipedia) GFF (ensembl) VCF (Wikipedia) SAM: Wikipedia samtools Zhuyi Xue","title":"Material"},{"location":"day2/read_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain what a sequence aligner does Explain why in some cases the aligner needs to be \u2018splice-aware\u2019 Calculate mapping quality out of the probability that a mapping position is wrong Build an index of the reference and perform an alignment of paired-end reads with bowtie2 Material Download the presentation Unix command line E-utilities documentation bowtie2 manual Ben Langmead\u2019s youtube channel for excellent lectures on e.g. BWT, suffix matrixes/trees, and binary search. Exercises Prepare the reference sequence Make a script called 05_download_ecoli_reference.sh , and paste in the code snippet below. Use it to retrieve the reference sequence using esearch and efetch : 05_download_ecoli_reference.sh #!/usr/bin/env bash REFERENCE_DIR = ~/project/ref_genome/ mkdir $REFERENCE_DIR cd $REFERENCE_DIR esearch -db nuccore -query 'U00096' \\ | efetch -format fasta > ecoli-strK12-MG1655.fasta Exercise: Check out the documentation of bowtie2-build , and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh . Answer 06_build_bowtie_index.sh #!/usr/bin/env bash cd ~/project/ref_genome bowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta Align the reads with bowtie2 Exercise: Check out the bowtie2 manual here . We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need? Answer According to the usage of bowtie2 : bowtie2 [ options ] * -x { -1 -2 | -U | --interleaved | --sra-acc | b } We\u2019ll need the options: -x to point to our index -1 and -2 to point to our forward and reverse reads Exercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh , and run it. 07_align_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome/ ALIGNED_DIR = ~/project/results/alignments mkdir -p $ALIGNED_DIR bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ > $ALIGNED_DIR /SRR519926.sam We\u2019ll go deeper into alignment statistics later on, but bowtie2 writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual . Can you guess what the reason could be?","title":"Read alignment"},{"location":"day2/read_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain what a sequence aligner does Explain why in some cases the aligner needs to be \u2018splice-aware\u2019 Calculate mapping quality out of the probability that a mapping position is wrong Build an index of the reference and perform an alignment of paired-end reads with bowtie2","title":"Learning outcomes"},{"location":"day2/read_alignment/#material","text":"Download the presentation Unix command line E-utilities documentation bowtie2 manual Ben Langmead\u2019s youtube channel for excellent lectures on e.g. BWT, suffix matrixes/trees, and binary search.","title":"Material"},{"location":"day2/read_alignment/#exercises","text":"","title":"Exercises"},{"location":"day2/read_alignment/#prepare-the-reference-sequence","text":"Make a script called 05_download_ecoli_reference.sh , and paste in the code snippet below. Use it to retrieve the reference sequence using esearch and efetch : 05_download_ecoli_reference.sh #!/usr/bin/env bash REFERENCE_DIR = ~/project/ref_genome/ mkdir $REFERENCE_DIR cd $REFERENCE_DIR esearch -db nuccore -query 'U00096' \\ | efetch -format fasta > ecoli-strK12-MG1655.fasta Exercise: Check out the documentation of bowtie2-build , and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh . Answer 06_build_bowtie_index.sh #!/usr/bin/env bash cd ~/project/ref_genome bowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta","title":"Prepare the reference sequence"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","text":"Exercise: Check out the bowtie2 manual here . We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need? Answer According to the usage of bowtie2 : bowtie2 [ options ] * -x { -1 -2 | -U | --interleaved | --sra-acc | b } We\u2019ll need the options: -x to point to our index -1 and -2 to point to our forward and reverse reads Exercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh , and run it. 07_align_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome/ ALIGNED_DIR = ~/project/results/alignments mkdir -p $ALIGNED_DIR bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ > $ALIGNED_DIR /SRR519926.sam We\u2019ll go deeper into alignment statistics later on, but bowtie2 writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual . Can you guess what the reason could be?","title":"Align the reads with bowtie2"},{"location":"day2/samtools/","text":"Learning outcomes After having completed this chapter you will be able to: Use samtools flagstat to get general statistics on the flags stored in a sam/bam file Use samtools view to: compress a sam file into a bam file filter on sam flags count alignments filter out a region Use samtools sort to sort an alignment file based on coordinate Use samtools index to create an index of a sorted sam/bam file Use the pipe ( | ) symbol to pipe alignments directly to samtools to perform sorting and filtering Material samtools documentation Explain sam flags tool Exercises Alignment statistics Exercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats by using samtools flagstat . Find the documentation here . Anything that draws your attention? Answer Code: cd ~/project/results/alignments/ samtools flagstat SRR519926.sam > SRR519926.sam.stats resulting in: 624724 + 0 in total (QC-passed reads + QC-failed reads) 624724 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 621624 + 0 mapped (99.50% : N/A) 621624 + 0 primary mapped (99.50% : N/A) 624724 + 0 paired in sequencing 312362 + 0 read1 312362 + 0 read2 300442 + 0 properly paired (48.09% : N/A) 619200 + 0 with itself and mate mapped 2424 + 0 singletons (0.39% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this: samtools stats SRR519926.sam | grep ^SN | cut -f 2,3 Now look at insert size average and insert size standard deviation . You can see the standard deviation is higher than the average, suggesting a wide distribution. Compression, sorting and indexing The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that. Exercise : Create a script called 08_compress_sort.sh . Add a samtools view command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Find the required documentation here . How much was the disk space reduced by compressing the file? Tip: Samtools writes to stdout By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o . Answer 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam By using ls -lh , you can find out that SRR519926.sam has a size of 264 Mb, while SRR519926.bam is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this: samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam Exercise : Add these lines to 08_compress_sort.sh , and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file ( SRR519926.bam ) and the sorted bam file ( SRR519926.sorted.bam ) with samtools view -H . What are the differences? Answer Your script should like like this: 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam samtools view -H SRR519926.bam returns: @HD VN:1.0 SO:unsorted @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam And samtools view -H SRR519926.sorted.bam returns: @HD VN:1.0 SO:coordinate @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam There are two main differences: The SO tag at @HD type code has changed from unsorted to coordinate . A line with the @PG type code for the sorting was added. Note that the command to view the header ( samtools -H ) is also added to the header for both runs. Filtering With samtools view you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads. Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation . Answer You will need the 0x4 flag. Filtering against unmapped reads (leaving only mapped reads) with samtools view would look like this: samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam or: samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam Exercise: Generate a script called 09_extract_unmapped.sh to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat ? Tip Check out the -f and -c options of samtools view Answer Your script 09_extract_unmapped.sh should look like this: 09_extract_unmapped.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam Counting like this: samtools view -c SRR519926.sorted.unmapped.bam This should correspond to the output of samtools flagstat (624724 - 621624 = 3100) samtools view also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files. Exercise: Generate a script called 10_extract_region.sh to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header. Tip: Specifying a region Our E. coli genome has only one chromosome, because only one line starts with > in the fasta file cd ~/project/ref_genome grep \">\" ecoli-strK12-MG1655.fasta gives: >U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END Answer 10_extract_region.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh \\ SRR519926.sorted.bam \\ U00096.3:2000000-2500000 \\ > SRR519926.sorted.region.bam Redirection Samtools is easy to use in a pipe. In this case you can replace the input file with a - . For example, you can sort and compress the output of your alignment software in a pipe like this: my_alignment_command \\ | samtools sort - \\ | samtools view -bh - \\ > alignment.bam The use of - In the modern versions of samtools, the use of - is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry. Exercise: Write a script called 11_align_sort.sh that maps the reads with bowtie2 (see chapter 2 of read alignment ), sorts them, and outputs them as a BAM file with a header. Answer 11_align_sort.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome ALIGNED_DIR = ~/project/results/alignments bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ 2 > $ALIGNED_DIR /bowtie2_SRR519926.log \\ | samtools sort - \\ | samtools view -bh - \\ > $ALIGNED_DIR /SRR519926.sorted.mapped.frompipe.bam Redirecting stderr Notice the line starting with 2> . This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log . This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc . QC summary The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc , fastp , samtools and bowtie2 ). You can specify a directory that contains any log files, and it will automatically search it for you. Exercise : Run the command multiqc . in ~/project and checkout the generated report.","title":"Samtools"},{"location":"day2/samtools/#learning-outcomes","text":"After having completed this chapter you will be able to: Use samtools flagstat to get general statistics on the flags stored in a sam/bam file Use samtools view to: compress a sam file into a bam file filter on sam flags count alignments filter out a region Use samtools sort to sort an alignment file based on coordinate Use samtools index to create an index of a sorted sam/bam file Use the pipe ( | ) symbol to pipe alignments directly to samtools to perform sorting and filtering","title":"Learning outcomes"},{"location":"day2/samtools/#material","text":"samtools documentation Explain sam flags tool","title":"Material"},{"location":"day2/samtools/#exercises","text":"","title":"Exercises"},{"location":"day2/samtools/#alignment-statistics","text":"Exercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats by using samtools flagstat . Find the documentation here . Anything that draws your attention? Answer Code: cd ~/project/results/alignments/ samtools flagstat SRR519926.sam > SRR519926.sam.stats resulting in: 624724 + 0 in total (QC-passed reads + QC-failed reads) 624724 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 621624 + 0 mapped (99.50% : N/A) 621624 + 0 primary mapped (99.50% : N/A) 624724 + 0 paired in sequencing 312362 + 0 read1 312362 + 0 read2 300442 + 0 properly paired (48.09% : N/A) 619200 + 0 with itself and mate mapped 2424 + 0 singletons (0.39% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this: samtools stats SRR519926.sam | grep ^SN | cut -f 2,3 Now look at insert size average and insert size standard deviation . You can see the standard deviation is higher than the average, suggesting a wide distribution.","title":"Alignment statistics"},{"location":"day2/samtools/#compression-sorting-and-indexing","text":"The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that. Exercise : Create a script called 08_compress_sort.sh . Add a samtools view command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Find the required documentation here . How much was the disk space reduced by compressing the file? Tip: Samtools writes to stdout By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o . Answer 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam By using ls -lh , you can find out that SRR519926.sam has a size of 264 Mb, while SRR519926.bam is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this: samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam Exercise : Add these lines to 08_compress_sort.sh , and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file ( SRR519926.bam ) and the sorted bam file ( SRR519926.sorted.bam ) with samtools view -H . What are the differences? Answer Your script should like like this: 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam samtools view -H SRR519926.bam returns: @HD VN:1.0 SO:unsorted @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam And samtools view -H SRR519926.sorted.bam returns: @HD VN:1.0 SO:coordinate @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam There are two main differences: The SO tag at @HD type code has changed from unsorted to coordinate . A line with the @PG type code for the sorting was added. Note that the command to view the header ( samtools -H ) is also added to the header for both runs.","title":"Compression, sorting and indexing"},{"location":"day2/samtools/#filtering","text":"With samtools view you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads. Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation . Answer You will need the 0x4 flag. Filtering against unmapped reads (leaving only mapped reads) with samtools view would look like this: samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam or: samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam Exercise: Generate a script called 09_extract_unmapped.sh to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat ? Tip Check out the -f and -c options of samtools view Answer Your script 09_extract_unmapped.sh should look like this: 09_extract_unmapped.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam Counting like this: samtools view -c SRR519926.sorted.unmapped.bam This should correspond to the output of samtools flagstat (624724 - 621624 = 3100) samtools view also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files. Exercise: Generate a script called 10_extract_region.sh to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header. Tip: Specifying a region Our E. coli genome has only one chromosome, because only one line starts with > in the fasta file cd ~/project/ref_genome grep \">\" ecoli-strK12-MG1655.fasta gives: >U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END Answer 10_extract_region.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh \\ SRR519926.sorted.bam \\ U00096.3:2000000-2500000 \\ > SRR519926.sorted.region.bam","title":"Filtering"},{"location":"day2/samtools/#redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a - . For example, you can sort and compress the output of your alignment software in a pipe like this: my_alignment_command \\ | samtools sort - \\ | samtools view -bh - \\ > alignment.bam The use of - In the modern versions of samtools, the use of - is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry. Exercise: Write a script called 11_align_sort.sh that maps the reads with bowtie2 (see chapter 2 of read alignment ), sorts them, and outputs them as a BAM file with a header. Answer 11_align_sort.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome ALIGNED_DIR = ~/project/results/alignments bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ 2 > $ALIGNED_DIR /bowtie2_SRR519926.log \\ | samtools sort - \\ | samtools view -bh - \\ > $ALIGNED_DIR /SRR519926.sorted.mapped.frompipe.bam Redirecting stderr Notice the line starting with 2> . This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log . This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc .","title":"Redirection"},{"location":"day2/samtools/#qc-summary","text":"The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc , fastp , samtools and bowtie2 ). You can specify a directory that contains any log files, and it will automatically search it for you. Exercise : Run the command multiqc . in ~/project and checkout the generated report.","title":"QC summary"},{"location":"day3/igv_visualisation/","text":"Learning outcomes After having completed this chapter you will be able to: Prepare a bam file for loading it into IGV Use IGV to: Navigate through a reference genome and alignments Retrieve information on a specific alignment Investigate (possible) variants Identify repeats and large INDELs Material The exercises below are partly based on this tutorial from the Griffith lab . Exercises A first glance: the E. coli dataset Index the alignment that was filtered for the region between 2000 and 2500 kb: cd ~/project/results/alignments samtools index SRR519926.sorted.region.bam Download it together with it\u2019s index file ( SRR519926.sorted.region.bam.bai ) and the reference genome ( ecoli-strK12-MG1655.fasta ) to your desktop. If working with Docker If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project , your files will be in C:\\Users\\myusername\\ngs-course . Load the genome ( .fasta ) into IGV: Genomes > Load Genome from File\u2026 Load the alignment file ( .bam ): File > Load from File\u2026 Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways: With the search box Select the region in the location bar View the reads as pairs, by right click on the reads and select View as pairs Exercise: There are lot of reads that are coloured red. Why is that? If you don\u2019t find any red reads.. The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size Answer According to IGV , reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size. Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click : Exercise: Click on one of the reads. What kind of information is there? Answer Most of the information from the SAM file. Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand . HCC1143 data set For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000. The BAM files containing these reads for the cancer cell line and the matched normal are: HCC1143.normal.21.19M-20M.bam HCC1143.normal.21.19M-20M.bam.bai A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu: Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam using the file browser. This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000 . Navigate to the gene CHODL by typing it in the search box. Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7 Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file. Click on the coverage track where the SNP is: Exercise: What is the sequence coverage for that base? And the percentage T? Answer The coverage is 62, and 25 reads (40%) T. Navigate to region chr21:19,800,320-19,818,162 Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker Note This might take a while to load. Right click in the alignment track and select Color alignments by > insert size and pair orientation Exercise: Why are some reads coloured white? What can be the cause of that? Answer The white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3. Navigate to region chr21:19,324,500-19,331,500 Right click in the main alignment track and select: Expanded View as pairs Color alignments by > insert size and pair orientation Sort alignments by > insert size Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion? Answer The insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.","title":"IGV and visualisation"},{"location":"day3/igv_visualisation/#learning-outcomes","text":"After having completed this chapter you will be able to: Prepare a bam file for loading it into IGV Use IGV to: Navigate through a reference genome and alignments Retrieve information on a specific alignment Investigate (possible) variants Identify repeats and large INDELs","title":"Learning outcomes"},{"location":"day3/igv_visualisation/#material","text":"The exercises below are partly based on this tutorial from the Griffith lab .","title":"Material"},{"location":"day3/igv_visualisation/#exercises","text":"","title":"Exercises"},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb: cd ~/project/results/alignments samtools index SRR519926.sorted.region.bam Download it together with it\u2019s index file ( SRR519926.sorted.region.bam.bai ) and the reference genome ( ecoli-strK12-MG1655.fasta ) to your desktop. If working with Docker If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project , your files will be in C:\\Users\\myusername\\ngs-course . Load the genome ( .fasta ) into IGV: Genomes > Load Genome from File\u2026 Load the alignment file ( .bam ): File > Load from File\u2026 Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways: With the search box Select the region in the location bar View the reads as pairs, by right click on the reads and select View as pairs Exercise: There are lot of reads that are coloured red. Why is that? If you don\u2019t find any red reads.. The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size Answer According to IGV , reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size. Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click : Exercise: Click on one of the reads. What kind of information is there? Answer Most of the information from the SAM file. Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand .","title":"A first glance: the E. coli dataset"},{"location":"day3/igv_visualisation/#hcc1143-data-set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000. The BAM files containing these reads for the cancer cell line and the matched normal are: HCC1143.normal.21.19M-20M.bam HCC1143.normal.21.19M-20M.bam.bai A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu: Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam using the file browser. This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000 . Navigate to the gene CHODL by typing it in the search box. Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7 Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file. Click on the coverage track where the SNP is: Exercise: What is the sequence coverage for that base? And the percentage T? Answer The coverage is 62, and 25 reads (40%) T. Navigate to region chr21:19,800,320-19,818,162 Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker Note This might take a while to load. Right click in the alignment track and select Color alignments by > insert size and pair orientation Exercise: Why are some reads coloured white? What can be the cause of that? Answer The white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3. Navigate to region chr21:19,324,500-19,331,500 Right click in the main alignment track and select: Expanded View as pairs Color alignments by > insert size and pair orientation Sort alignments by > insert size Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion? Answer The insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.","title":"HCC1143 data set"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses ) or in your own time. After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output Teachers Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Authors Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} License & copyright License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Enrolled to the course Independently Material This website Zoom meeting (through mail) Google doc (through mail) Slack channel Learning outcomes After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualise the output Learning experiences This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only. Exercises Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different. Asking questions During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack You can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository . Note It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web! Learning outcomes After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output Exercises Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Home"},{"location":"#_1","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses ) or in your own time. After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output","title":""},{"location":"#teachers","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Teachers"},{"location":"#authors","text":"Geert van Geest .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;} Patricia Palagi .st0{fill:#A6CE39;} .st1{fill:#FFFFFF;}","title":"Authors"},{"location":"#license-copyright","text":"License: CC BY-SA 4.0 Copyright: SIB Swiss Institute of Bioinformatics Enrolled to the course Independently","title":"License & copyright"},{"location":"#material","text":"This website Zoom meeting (through mail) Google doc (through mail) Slack channel","title":"Material"},{"location":"#learning-outcomes","text":"After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualise the output","title":"Learning outcomes"},{"location":"#learning-experiences","text":"This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.","title":"Learning experiences"},{"location":"#exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"#asking-questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button: A main source of communication will be our slack channel . Ask background questions that interest you personally at #background . During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a . This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option: The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally. To summarise: During lectures: raise hand/zoom functionality Personal interest questions: #background During exercises: #q-and-a on slack You can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository . Note It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!","title":"Asking questions"},{"location":"#learning-outcomes_1","text":"After this course, you will be able to: Understand the basics of the different NGS technologies Perform quality control for better downstream analysis Align reads to a reference genome Visualize the output","title":"Learning outcomes"},{"location":"#exercises_1","text":"Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.","title":"Exercises"},{"location":"course_schedule/","text":"Note Apart from the starting time the time schedule is indicative . Because we can not plan a course by the minute, in practice the time points will deviate. Day 1 block start end subject introduction 9:15 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work Day 2 block start end subject block 1 9:15 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work Day 3 block start end subject block 1 9:15 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations","title":"Course schedule"},{"location":"course_schedule/#day-1","text":"block start end subject introduction 9:15 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work","title":"Day 1"},{"location":"course_schedule/#day-2","text":"block start end subject block 1 9:15 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work","title":"Day 2"},{"location":"course_schedule/#day-3","text":"block start end subject block 1 9:15 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations","title":"Day 3"},{"location":"group_work/","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people. If working with Docker If you are working with Docker, I assume you are working independently and therefore can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 50G of harddisk. If online If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually. Material Download the presentation Roles & organisation Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group. Working directories Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~/project ln -s /group_work/GROUP_NAME/ ./ # replace [GROUP_NAME] with your group directory Now you can find your group directory at ~/ . Use this to share files. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here . Project 1: Variant analysis of human data Aim : Find variants on chromosome 20 from three samples In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them. You can get the data by running these commands: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz Tasks Important! Stick to the principles for reproducible analysis described here Download the required data Do a QC on the data with fastqc Trim adapters and low quality bases with fastp . Make sure to include the option --detect_adapter_for_pe . To prevent overwriting fastp.html , specify a report filename for each sample with the option --html . After trimming the adapters, run fastqc again to see whether all adapters are gone. Create an index for bowtie2. At the same time create a fasta index ( .fai file) with samtools faidx . Check which options to use, and align with bowtie2 . At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. Mark duplicates on the individual bam files with gatk MarkDuplicates (see hints below). Merge the three bam files with samtools merge . Index the bam file afterwards. Run freebayes to call variants. Only call variants on the region chr20:10018000-10220000 by specifying the -r option. Load your alignments together with the vcf containing the variants in IGV. Check out e.g. chr20:10,026,397-10,026,638 . Run multiqc to get an overall quality report. Questions Have a look at the quality of the reads. Are there any adapters in there? Did adapter trimming change that? How is the base quality? Could you improve that? How many duplicates were in the different samples (hint: use samtools flagstat )? Why is it important to remove them for variant analysis? Why did you add read groups to the bam files? Where is this information added in the bam file? Are there variants that look spurious? What could be the cause of that? What information in the vcf can you use to evaluate variant quality? There are two high quality variants in chr20:10,026,397-10,026,638 . What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. Hints You can add readgroups to the alignment file with bowtie2 with the options --rg-id and --rg , e.g. ( $SAMPLE is a variable containing a sample identifier): bowtie2 \\ -x ref.fa \\ -1 r1.fastq.gz \\ -2 r2.fastq.gz \\ --rg-id $SAMPLE \\ --rg SM: $SAMPLE \\ To run gatk MarkDuplicates you will only need to specify --INPUT and --OUTPUT , e.g.: gatk MarkDuplicates \\ --INPUT sample.bam \\ --OUTPUT sample.md.bam \\ --METRICS_FILE sample.metrics.txt Project 2: Long-read genome sequencing Aim : Align long reads from RNA-seq data to a reference genome. In this project, you will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . Here you can find the BioProject page . It is Oxford Nanopore Technology sequencing data of amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. In this project, we will align a few of the samples to the reference genome, and assess the quality of reads and the alignment. Download the human reference genome like this: wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz Tasks Important! Stick to the principles for reproducible analysis described here Check out the BioProject, and download two samples that interest you. Perform QC with fastqc Perform QC with NanoPlot Align with minimap2 with default parameters Figure how you should set parameters -x and -G Evaluate the alignment quality (e.g. alignment rates, mapping quality) Compare different samples in read quality, alignment rates, depth, etc. Questions Have a look at the quality report. What are the average read lengths? Is that expected? What is the average read quality? What kind of accuracy would you expect? Note any differences between fastqc and NanoPlot ? How is that compared to the publication? Check out the options -x and -G of minimap2 . Are the defaults appropriate? You might consider using -x map-ont or -x splice . Do you see differences in the alignment in e.g. IGV? How are spliced alignments stored in the SAM file with the different settings of -x ? How deep is the gene sequenced? Do you already see evidence for splice variants in the alignments? Downloading from SRA prefetch [ SRR number ] fastq-dump --gzip [ SRR number ] Accuracy from quality scores Find the equation to calculate error probability from quality score on Wikipedia . Comparing fastqc and Nanoplot For comparing fastqc and NanoPlot , check out this blog of the author of NanoPlot, and this thread . Running minimap2 Here\u2019s an example command for minimap2 : minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ [ REFERENCE ] .fa \\ [ FASTQFILE ] .fastq.gz \\ | samtools sort \\ | samtools view -bh > [ OUTPUT ] .bam Intron sizes Check out the the intron sizes of CACNA1C in e.g. IGV or UCSC genome browser. How does that relate to the parameter -G ? More resources Need e.g. a gtf file? Here\u2019s the ensembl page Project 3: Short-read RNA-seq of mice. Aim: Generate a count matrix to estimate differential gene expression. In this project you will be working with data from: Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases . Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6 Here\u2019s the BioProject page . Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz Tasks Important! Stick to the principles for reproducible analysis described here Download the tar file, and find out what\u2019s in the data folder Do a QC on the fastq files with fastqc Trim adapters and low quality bases with fastp After trimming the adapters, run fastqc again to see whether all adapters are gone. Check which options to use, and align with hisat2 Evaluate the alignment quality (e.g. alignment rates, mapping quality) Have a look at the alignments in IGV, e.g. check out Sparcl1 . For this, you can use the built-in genome ( Mouse (mm10) ). Do you see any evidence for differential splicing? Run featureCounts on both alignments. Have a look at the option -Q . For further suggestions, see the hints below. Compare the count matrices in R (find a script to get started here ; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio ) Questions Check the description at the SRA sample page. What kind of sample is this? How does the quality of the reads look? Anything special about the overrepresented sequences? (Hint: blast some overrepresented sequences, and see what they are) Did trimming improve the QC results? What could be the cause of the warnings/errors in the fastqc reports? What are the alignment rates? How are spliced alignments stored in the SAM file? Are there any differences between the treatments in the percentage of assigned alignments by featureCounts ? What is the cause of this? Can you find any genes that seem to be differentially expressed? What is the effect of setting the option -Q in featureCounts ? Hints We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing. command option bowtie2-build --threads hisat2-build --threads fastqc --threads cutadapt --cores bowtie2 --threads hisat2 --threads featureCounts -T Here\u2019s some example code for hisat2 and featureCounts . Everything in between <> should be replaced with specific arguments. Here\u2019s an example for hisat2 : hisat2-build hisat2 \\ -x \\ -1 \\ -2 \\ -p \\ | samtools sort \\ | samtools view -bh \\ > Example code featureCounts : featureCounts \\ -p \\ -T 2 \\ -a \\ -o \\ *.bam","title":"Group work"},{"location":"group_work/#material","text":"Download the presentation","title":"Material"},{"location":"group_work/#roles-organisation","text":"Project based learning is about learning by doing, but also about peer instruction . This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding. Each project has tasks and questions . By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions. In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.","title":"Roles & organisation"},{"location":"group_work/#working-directories","text":"Each group has access to a shared working directory. It is mounted in the root directory ( / ). Make a soft link in your home directory: cd ~/project ln -s /group_work/GROUP_NAME/ ./ # replace [GROUP_NAME] with your group directory Now you can find your group directory at ~/ . Use this to share files. Warning Do not remove the soft link with rm -r , this will delete the entire source directory. If you want to remove only the softlink, use rm (without -r ), or unlink . More info here .","title":"Working directories"},{"location":"group_work/#project-1-variant-analysis-of-human-data","text":"Aim : Find variants on chromosome 20 from three samples In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them. You can get the data by running these commands: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz tar -xvf project1.tar.gz rm project1.tar.gz","title":"Project 1: Variant analysis of human data"},{"location":"group_work/#tasks","text":"Important! Stick to the principles for reproducible analysis described here Download the required data Do a QC on the data with fastqc Trim adapters and low quality bases with fastp . Make sure to include the option --detect_adapter_for_pe . To prevent overwriting fastp.html , specify a report filename for each sample with the option --html . After trimming the adapters, run fastqc again to see whether all adapters are gone. Create an index for bowtie2. At the same time create a fasta index ( .fai file) with samtools faidx . Check which options to use, and align with bowtie2 . At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. Mark duplicates on the individual bam files with gatk MarkDuplicates (see hints below). Merge the three bam files with samtools merge . Index the bam file afterwards. Run freebayes to call variants. Only call variants on the region chr20:10018000-10220000 by specifying the -r option. Load your alignments together with the vcf containing the variants in IGV. Check out e.g. chr20:10,026,397-10,026,638 . Run multiqc to get an overall quality report.","title":"Tasks"},{"location":"group_work/#questions","text":"Have a look at the quality of the reads. Are there any adapters in there? Did adapter trimming change that? How is the base quality? Could you improve that? How many duplicates were in the different samples (hint: use samtools flagstat )? Why is it important to remove them for variant analysis? Why did you add read groups to the bam files? Where is this information added in the bam file? Are there variants that look spurious? What could be the cause of that? What information in the vcf can you use to evaluate variant quality? There are two high quality variants in chr20:10,026,397-10,026,638 . What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV.","title":"Questions"},{"location":"group_work/#hints","text":"You can add readgroups to the alignment file with bowtie2 with the options --rg-id and --rg , e.g. ( $SAMPLE is a variable containing a sample identifier): bowtie2 \\ -x ref.fa \\ -1 r1.fastq.gz \\ -2 r2.fastq.gz \\ --rg-id $SAMPLE \\ --rg SM: $SAMPLE \\ To run gatk MarkDuplicates you will only need to specify --INPUT and --OUTPUT , e.g.: gatk MarkDuplicates \\ --INPUT sample.bam \\ --OUTPUT sample.md.bam \\ --METRICS_FILE sample.metrics.txt","title":"Hints"},{"location":"group_work/#project-2-long-read-genome-sequencing","text":"Aim : Align long reads from RNA-seq data to a reference genome. In this project, you will be working with data from: Clark, M. B. et al (2020). Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain . Molecular Psychiatry, 25(1), 37\u201347. https://doi.org/10.1038/s41380-019-0583-1 . Here you can find the BioProject page . It is Oxford Nanopore Technology sequencing data of amplicons of the gene CACNA1C. It is primarily used to discover new splice variants. In this project, we will align a few of the samples to the reference genome, and assess the quality of reads and the alignment. Download the human reference genome like this: wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz","title":" Project 2: Long-read genome sequencing"},{"location":"group_work/#tasks_1","text":"Important! Stick to the principles for reproducible analysis described here Check out the BioProject, and download two samples that interest you. Perform QC with fastqc Perform QC with NanoPlot Align with minimap2 with default parameters Figure how you should set parameters -x and -G Evaluate the alignment quality (e.g. alignment rates, mapping quality) Compare different samples in read quality, alignment rates, depth, etc.","title":"Tasks"},{"location":"group_work/#questions_1","text":"Have a look at the quality report. What are the average read lengths? Is that expected? What is the average read quality? What kind of accuracy would you expect? Note any differences between fastqc and NanoPlot ? How is that compared to the publication? Check out the options -x and -G of minimap2 . Are the defaults appropriate? You might consider using -x map-ont or -x splice . Do you see differences in the alignment in e.g. IGV? How are spliced alignments stored in the SAM file with the different settings of -x ? How deep is the gene sequenced? Do you already see evidence for splice variants in the alignments? Downloading from SRA prefetch [ SRR number ] fastq-dump --gzip [ SRR number ] Accuracy from quality scores Find the equation to calculate error probability from quality score on Wikipedia . Comparing fastqc and Nanoplot For comparing fastqc and NanoPlot , check out this blog of the author of NanoPlot, and this thread . Running minimap2 Here\u2019s an example command for minimap2 : minimap2 \\ -a \\ -x [ PARAMETER ] \\ -G [ PARAMETER ] \\ [ REFERENCE ] .fa \\ [ FASTQFILE ] .fastq.gz \\ | samtools sort \\ | samtools view -bh > [ OUTPUT ] .bam Intron sizes Check out the the intron sizes of CACNA1C in e.g. IGV or UCSC genome browser. How does that relate to the parameter -G ? More resources Need e.g. a gtf file? Here\u2019s the ensembl page","title":"Questions"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","text":"Aim: Generate a count matrix to estimate differential gene expression. In this project you will be working with data from: Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases . Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6 Here\u2019s the BioProject page . Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this: wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz tar -xvf project3.tar.gz rm project3.tar.gz","title":" Project 3: Short-read RNA-seq of mice."},{"location":"group_work/#tasks_2","text":"Important! Stick to the principles for reproducible analysis described here Download the tar file, and find out what\u2019s in the data folder Do a QC on the fastq files with fastqc Trim adapters and low quality bases with fastp After trimming the adapters, run fastqc again to see whether all adapters are gone. Check which options to use, and align with hisat2 Evaluate the alignment quality (e.g. alignment rates, mapping quality) Have a look at the alignments in IGV, e.g. check out Sparcl1 . For this, you can use the built-in genome ( Mouse (mm10) ). Do you see any evidence for differential splicing? Run featureCounts on both alignments. Have a look at the option -Q . For further suggestions, see the hints below. Compare the count matrices in R (find a script to get started here ; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio )","title":"Tasks"},{"location":"group_work/#questions_2","text":"Check the description at the SRA sample page. What kind of sample is this? How does the quality of the reads look? Anything special about the overrepresented sequences? (Hint: blast some overrepresented sequences, and see what they are) Did trimming improve the QC results? What could be the cause of the warnings/errors in the fastqc reports? What are the alignment rates? How are spliced alignments stored in the SAM file? Are there any differences between the treatments in the percentage of assigned alignments by featureCounts ? What is the cause of this? Can you find any genes that seem to be differentially expressed? What is the effect of setting the option -Q in featureCounts ?","title":"Questions"},{"location":"group_work/#hints_1","text":"We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing. command option bowtie2-build --threads hisat2-build --threads fastqc --threads cutadapt --cores bowtie2 --threads hisat2 --threads featureCounts -T Here\u2019s some example code for hisat2 and featureCounts . Everything in between <> should be replaced with specific arguments. Here\u2019s an example for hisat2 : hisat2-build hisat2 \\ -x \\ -1 \\ -2 \\ -p \\ | samtools sort \\ | samtools view -bh \\ > Example code featureCounts : featureCounts \\ -p \\ -T 2 \\ -a \\ -o \\ *.bam","title":"Hints"},{"location":"precourse/","text":"UNIX As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial . Software We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Precourse preparations"},{"location":"precourse/#unix","text":"As is stated in the course prerequisites at the announcement web page , we expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here . If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial .","title":"UNIX"},{"location":"precourse/#software","text":"We will be mainly working on an Amazon Web Services ( AWS ) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a jupyter notebook . All participants will be granted access to a personal workspace to be used during the course. The only software you need to install before the course is Integrative Genomics Viewer (IGV) .","title":"Software"},{"location":"day1/intro/","text":"If working independently If you are doing this course independently, you can skip this part. Material Download the presentation","title":"Introduction"},{"location":"day1/intro/#material","text":"Download the presentation","title":"Material"},{"location":"day1/quality_control/","text":"Learning outcomes After having completed this chapter you will be able to: Find information about a sequence run on the Sequence Read Archive Run fastqc on sequence reads and interpret the results Trim adapters and low quality bases using fastp Material Download the presentation fastqc command line documentation cutadapt manual Unix command line E-utilities documentation Exercises Download and evaluate an E. coli dataset Check out the dataset at SRA . Exercise: Browse around the SRA entry and answer these questions: A. Is the dataset paired-end or single end? B. Which instrument was used for sequencing? C. What is the read length? D. How many reads do we have? Answers A. paired-end B. Illumina MiSeq C. 2 x 251 bp D. 400596 Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools . Every time you open a new terminal, you will have to load the environment: conda activate ngs-tools Make a directory reads in ~/project and download the reads from the SRA database using prefetch and fastq-dump from SRA-Tools into the reads directory. Use the code snippet below to create a scripts called 01_download_reads.sh . Store it in ~/project/scripts/ , and run it. 01_download_reads.sh #!/usr/bin/env bash cd ~/project mkdir reads cd reads prefetch SRR519926 fastq-dump --split-files SRR519926 Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry. Tip A read in a fastq file consists of four lines (more on that at file types ). Use Google to figure out how to count the number of reads in a fastq file. Answer e.g. from this thread on Biostars: ## forward read echo $( cat SRR519926_1.fastq | wc -l ) /4 | bc ## reverse read echo $( cat SRR519926_2.fastq | wc -l ) /4 | bc Run fastqc Exercise: Create a script to run fastqc and call it 02_run_fastqc.sh . After that, run it. Tip fastqc accepts multiple files as input, so you can use a wildcard to run fastqc on all the files in one line of code. Use it like this: *.fastq . Answer Your script ~/project/scripts/02_run_fastqc.sh should look like: 02_run_fastqc.sh #!/usr/bin/env bash cd ~/project/reads fastqc *.fastq Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems? Downloading files You can download files by right-click the file and after that select Download : Answer There seems to be: Low quality towards the 3\u2019 end (per base sequence quality) Full sequence reads with low quality (per sequence quality scores) Adapters in the sequences (adapter content) We can probably fix most of these issues by trimming. Trim the reads We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp will use those by default. A reference for the adapter sequences can be found here . Exercise: Check out the documentation of fastp , and the option defaults by running fastp --help . What is the default for the minimum base quality for a qualified base? ( option --qualified_quality_phred ) What is the default for the maximum percentage of unqualified bases in a read? (option --unqualified_percent_limit ) What is the default for the minimum required read length? (option --length_required ) What happens if one read in the pair does not meet the required length after trimming? (it can be specified with the options --unpaired1 and --unpaired2 ) Answer The minimum base quality is 15: Default 15 means phred quality >=Q15 is qualified. (int [=15]) The minimum required length is also 15: reads shorter than length_required will be discarded, default is 15. (int [=15]) If one of the reads does not meet the required length, the pair is discarded if --unpaired1 and/or --unpaired2 are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=]) . Exercise: Complete the script below called 03_trim_reads.sh (replace everything in between brackets [] ) to run fastp to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/ is created to write the trimmed reads. 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ [ QUALIFIED BASE THRESHOLD ] \\ [ MINIMUM LENGTH THRESHOLD ] \\ [ UNQUALIFIED PERCENTAGE LIMIT ] \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe Additional options Note that we have set the options --cut_front and --cut_tail that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe is set, which ensures that adapters are detected automatically for both R1 and R2. Answer Your script ( ~/project/scripts/03_trim_reads.sh ) should look like this: 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ --qualified_quality_phred 10 \\ --length_required 25 \\ --unqualified_percent_limit 80 \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe The use of \\ In the script above you see that we\u2019re using \\ at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\ to ignore the newlines when executing. Exercise: Check out the report in fastp.html . Has the quality improved? How many reads do we have left? Bonus : Although there were adapters in R2 according to fastqc , fastp has trouble finding adapters in R2. Also, after running fastp there doesn\u2019t seem to be much adapter left (you can double check by running fastqc on trimmed_SRR519926_2.fastq ). How could that be? Answers Yes, low quality 3\u2019 end, per sequence quality and adapter sequences have improved. Also the percentages >20 and >30 are higher. 624724 reads, so 312362 pairs (78.0%) The 3\u2019 end of R2 has very low quality on average, this means that trimming for low quality removes almost all bases from the original 3\u2019 end, including any adapter.","title":"Quality control"},{"location":"day1/quality_control/#learning-outcomes","text":"After having completed this chapter you will be able to: Find information about a sequence run on the Sequence Read Archive Run fastqc on sequence reads and interpret the results Trim adapters and low quality bases using fastp","title":"Learning outcomes"},{"location":"day1/quality_control/#material","text":"Download the presentation fastqc command line documentation cutadapt manual Unix command line E-utilities documentation","title":"Material"},{"location":"day1/quality_control/#exercises","text":"","title":"Exercises"},{"location":"day1/quality_control/#download-and-evaluate-an-e-coli-dataset","text":"Check out the dataset at SRA . Exercise: Browse around the SRA entry and answer these questions: A. Is the dataset paired-end or single end? B. Which instrument was used for sequencing? C. What is the read length? D. How many reads do we have? Answers A. paired-end B. Illumina MiSeq C. 2 x 251 bp D. 400596 Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools . Every time you open a new terminal, you will have to load the environment: conda activate ngs-tools Make a directory reads in ~/project and download the reads from the SRA database using prefetch and fastq-dump from SRA-Tools into the reads directory. Use the code snippet below to create a scripts called 01_download_reads.sh . Store it in ~/project/scripts/ , and run it. 01_download_reads.sh #!/usr/bin/env bash cd ~/project mkdir reads cd reads prefetch SRR519926 fastq-dump --split-files SRR519926 Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry. Tip A read in a fastq file consists of four lines (more on that at file types ). Use Google to figure out how to count the number of reads in a fastq file. Answer e.g. from this thread on Biostars: ## forward read echo $( cat SRR519926_1.fastq | wc -l ) /4 | bc ## reverse read echo $( cat SRR519926_2.fastq | wc -l ) /4 | bc","title":"Download and evaluate an E. coli dataset"},{"location":"day1/quality_control/#run-fastqc","text":"Exercise: Create a script to run fastqc and call it 02_run_fastqc.sh . After that, run it. Tip fastqc accepts multiple files as input, so you can use a wildcard to run fastqc on all the files in one line of code. Use it like this: *.fastq . Answer Your script ~/project/scripts/02_run_fastqc.sh should look like: 02_run_fastqc.sh #!/usr/bin/env bash cd ~/project/reads fastqc *.fastq Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems? Downloading files You can download files by right-click the file and after that select Download : Answer There seems to be: Low quality towards the 3\u2019 end (per base sequence quality) Full sequence reads with low quality (per sequence quality scores) Adapters in the sequences (adapter content) We can probably fix most of these issues by trimming.","title":"Run fastqc"},{"location":"day1/quality_control/#trim-the-reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp will use those by default. A reference for the adapter sequences can be found here . Exercise: Check out the documentation of fastp , and the option defaults by running fastp --help . What is the default for the minimum base quality for a qualified base? ( option --qualified_quality_phred ) What is the default for the maximum percentage of unqualified bases in a read? (option --unqualified_percent_limit ) What is the default for the minimum required read length? (option --length_required ) What happens if one read in the pair does not meet the required length after trimming? (it can be specified with the options --unpaired1 and --unpaired2 ) Answer The minimum base quality is 15: Default 15 means phred quality >=Q15 is qualified. (int [=15]) The minimum required length is also 15: reads shorter than length_required will be discarded, default is 15. (int [=15]) If one of the reads does not meet the required length, the pair is discarded if --unpaired1 and/or --unpaired2 are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=]) . Exercise: Complete the script below called 03_trim_reads.sh (replace everything in between brackets [] ) to run fastp to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/ is created to write the trimmed reads. 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ [ QUALIFIED BASE THRESHOLD ] \\ [ MINIMUM LENGTH THRESHOLD ] \\ [ UNQUALIFIED PERCENTAGE LIMIT ] \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe Additional options Note that we have set the options --cut_front and --cut_tail that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe is set, which ensures that adapters are detected automatically for both R1 and R2. Answer Your script ( ~/project/scripts/03_trim_reads.sh ) should look like this: 03_trim_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed READS_DIR = ~/project/reads mkdir -p $TRIMMED_DIR cd $TRIMMED_DIR fastp \\ -i $READS_DIR /SRR519926_1.fastq \\ -I $READS_DIR /SRR519926_2.fastq \\ -o $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -O $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ --qualified_quality_phred 10 \\ --length_required 25 \\ --unqualified_percent_limit 80 \\ --cut_front \\ --cut_tail \\ --detect_adapter_for_pe The use of \\ In the script above you see that we\u2019re using \\ at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\ to ignore the newlines when executing. Exercise: Check out the report in fastp.html . Has the quality improved? How many reads do we have left? Bonus : Although there were adapters in R2 according to fastqc , fastp has trouble finding adapters in R2. Also, after running fastp there doesn\u2019t seem to be much adapter left (you can double check by running fastqc on trimmed_SRR519926_2.fastq ). How could that be? Answers Yes, low quality 3\u2019 end, per sequence quality and adapter sequences have improved. Also the percentages >20 and >30 are higher. 624724 reads, so 312362 pairs (78.0%) The 3\u2019 end of R2 has very low quality on average, this means that trimming for low quality removes almost all bases from the original 3\u2019 end, including any adapter.","title":"Trim the reads"},{"location":"day1/reproducibility/","text":"Learning outcomes After having completed this chapter you will be able to: Understand the importance of reproducibility Apply some basic rules to support reproducibilty in computational research Material Download the presentation Some good practices for reproducibility During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility: Execute the commands from a script in order to be able to trace back your steps Number scripts based on their order of execution (e.g. 01_download_reads.sh ) Give your scripts a descriptive and active name , e.g. 06_build_bowtie_index.sh Make your scripts specific , i.e. do not combine many different commands in the same script Refer to directories and variables at the beginning of the script By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research . By the end of day 2 ~/project should look (something) like this: . \u251c\u2500\u2500 alignment_output \u251c\u2500\u2500 reads \u251c\u2500\u2500 ref_genome \u251c\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 01_download_reads.sh \u2502 \u251c\u2500\u2500 02_run_fastqc.sh \u2502 \u251c\u2500\u2500 03_trim_reads.sh \u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh \u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh \u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh \u2502 \u251c\u2500\u2500 07_align_reads.sh \u2502 \u251c\u2500\u2500 08_compress_sort.sh \u2502 \u251c\u2500\u2500 09_extract_unmapped.sh \u2502 \u251c\u2500\u2500 10_extract_region.sh \u2502 \u2514\u2500\u2500 11_align_sort.sh \u2514\u2500\u2500 trimmed_data","title":"Reproducibility"},{"location":"day1/reproducibility/#learning-outcomes","text":"After having completed this chapter you will be able to: Understand the importance of reproducibility Apply some basic rules to support reproducibilty in computational research","title":"Learning outcomes"},{"location":"day1/reproducibility/#material","text":"Download the presentation","title":"Material"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility: Execute the commands from a script in order to be able to trace back your steps Number scripts based on their order of execution (e.g. 01_download_reads.sh ) Give your scripts a descriptive and active name , e.g. 06_build_bowtie_index.sh Make your scripts specific , i.e. do not combine many different commands in the same script Refer to directories and variables at the beginning of the script By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research . By the end of day 2 ~/project should look (something) like this: . \u251c\u2500\u2500 alignment_output \u251c\u2500\u2500 reads \u251c\u2500\u2500 ref_genome \u251c\u2500\u2500 scripts \u2502 \u251c\u2500\u2500 01_download_reads.sh \u2502 \u251c\u2500\u2500 02_run_fastqc.sh \u2502 \u251c\u2500\u2500 03_trim_reads.sh \u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh \u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh \u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh \u2502 \u251c\u2500\u2500 07_align_reads.sh \u2502 \u251c\u2500\u2500 08_compress_sort.sh \u2502 \u251c\u2500\u2500 09_extract_unmapped.sh \u2502 \u251c\u2500\u2500 10_extract_region.sh \u2502 \u2514\u2500\u2500 11_align_sort.sh \u2514\u2500\u2500 trimmed_data","title":"Some good practices for reproducibility"},{"location":"day1/sequencing_technologies/","text":"Learning outcomes After having completed this chapter you will be able to: Describe the major applications of next generation sequencing Reproduce the most frequently used sequencing methods Describe the major steps taken during library preparation for Illumina sequencing Explain why the length of the sequencing reads of Illumina sequencing are limited Describe the general methods used for Oxford Nanopore Sequencing and PacBio sequencing Material Download the presentation Illumina sequencing by synthesis on YouTube NEBnext library preparation poster","title":"Sequencing technologies"},{"location":"day1/sequencing_technologies/#learning-outcomes","text":"After having completed this chapter you will be able to: Describe the major applications of next generation sequencing Reproduce the most frequently used sequencing methods Describe the major steps taken during library preparation for Illumina sequencing Explain why the length of the sequencing reads of Illumina sequencing are limited Describe the general methods used for Oxford Nanopore Sequencing and PacBio sequencing","title":"Learning outcomes"},{"location":"day1/sequencing_technologies/#material","text":"Download the presentation Illumina sequencing by synthesis on YouTube NEBnext library preparation poster","title":"Material"},{"location":"day1/server_login/","text":"Learning outcomes Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda Exercises First login If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl + ` . Or by clicking Application menu > Terminal > New Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button: Material Instructions to install docker Instructions to set up to container Exercises First login Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-introduction-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory in your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-introduction-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this: Windows Mac/Linux Get the .exe file here Double click the file Follow the instructions on the screen (defaults are usually fine) Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. Get the installation ( .sh ) script here Run in your terminal: bash Miniconda3-latest-Linux-x86_64.sh Follow the prompts Close and reopen your terminal for changes to have effect Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. After installation, you can install the required software: Windows/MacOS Linux conda create -n ngs-tools conda activate ngs-tools conda install -y -c bioconda \\ samtools \\ bwa \\ fastqc \\ sra-tools \\ bowtie2 = 2 .4.2 \\ hisat2 = 2 .2.1 \\ subread = 2 .0.1 \\ entrez-direct \\ minimap2 \\ gatk4 \\ freebayes \\ multiqc \\ fastp Download ngs-tools.yml , and generate the conda environment like this: conda env create --name ngs-tools -f ngs-tools.yml Note If that did not succeed, follow the instructions for Windows/MacOS. This will create the conda environment ngs-tools Activate it like so: conda activate ngs-tools After successful installation and activating the environment all the software required to do the exercises should be available. If you are doing project 2 (long reads) If you are doing the project 2 as part of the course, you will need to install NanoPlot as well, using pip : pip install NanoPlot A UNIX command line interface (CLI) refresher Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet Make a new directory Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts File permissions Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16 :21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here . Redirection: > and | In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l Variables Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE shell scripts Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"Setup"},{"location":"day1/server_login/#learning-outcomes","text":"Note You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line. After having completed this chapter you will be able to: Use the command line to: Make a directory Change file permissions to \u2018executable\u2019 Run a bash script Pipe data from and to a file or other executable Program a loop in bash Choose your platform In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker. If you are doing the course with a teacher , you will have to login to the remote server. Therefore choose: Cloud notebook If you are doing this course independently (i.e. without a teacher) choose either: conda Docker Cloud notebook Docker conda","title":"Learning outcomes"},{"location":"day1/server_login/#exercises","text":"","title":"Exercises"},{"location":"day1/server_login/#first-login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002 ) in your browser. This should result in the following page: Info The link gives you access to a web version of Visual Studio Code . This is a powerful code editor that you can also use as a local application on your computer. Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl + ` . Or by clicking Application menu > Terminal > New Terminal : For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:","title":"First login"},{"location":"day1/server_login/#material","text":"Instructions to install docker Instructions to set up to container","title":"Material"},{"location":"day1/server_login/#exercises_1","text":"","title":"Exercises"},{"location":"day1/server_login/#first-login_1","text":"Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system. In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10. The command to run the environment required for this course looks like this (in a terminal): Modify the script Modify the path after -v to the working directory on your computer before running it. docker run \\ --rm \\ -p 8443 :8443 \\ -e PUID = 1000 \\ -e PGID = 1000 \\ -e DEFAULT_WORKSPACE = /config/project \\ -v $PWD :/config/project \\ geertvangeest/ngs-introduction-vscode:latest If this command has run successfully, navigate in your browser to http://localhost:8443 . The option -v mounts a local directory in your computer to the directory /config/project in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory. Don\u2019t mount directly in the home dir Don\u2019t directly mount your local directory to the home directory ( /root ). This will lead to unexpected behaviour. The part geertvangeest/ngs-introduction-vscode:latest is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately. If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this: Windows Mac/Linux Get the .exe file here Double click the file Follow the instructions on the screen (defaults are usually fine) Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. Get the installation ( .sh ) script here Run in your terminal: bash Miniconda3-latest-Linux-x86_64.sh Follow the prompts Close and reopen your terminal for changes to have effect Run the command conda list in the Ananconda prompt or terminal to check whether your installation has succeeded. After installation, you can install the required software: Windows/MacOS Linux conda create -n ngs-tools conda activate ngs-tools conda install -y -c bioconda \\ samtools \\ bwa \\ fastqc \\ sra-tools \\ bowtie2 = 2 .4.2 \\ hisat2 = 2 .2.1 \\ subread = 2 .0.1 \\ entrez-direct \\ minimap2 \\ gatk4 \\ freebayes \\ multiqc \\ fastp Download ngs-tools.yml , and generate the conda environment like this: conda env create --name ngs-tools -f ngs-tools.yml Note If that did not succeed, follow the instructions for Windows/MacOS. This will create the conda environment ngs-tools Activate it like so: conda activate ngs-tools After successful installation and activating the environment all the software required to do the exercises should be available. If you are doing project 2 (long reads) If you are doing the project 2 as part of the course, you will need to install NanoPlot as well, using pip : pip install NanoPlot","title":"First login"},{"location":"day1/server_login/#a-unix-command-line-interface-cli-refresher","text":"Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory. If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet: UNIX cheat sheet","title":"A UNIX command line interface (CLI) refresher"},{"location":"day1/server_login/#make-a-new-directory","text":"Make a directory scripts within ~/project and make it your current directory. Answer cd ~/project mkdir scripts cd scripts","title":"Make a new directory"},{"location":"day1/server_login/#file-permissions","text":"Generate an empty script in your newly made directory ~/project/scripts like this: touch new_script.sh Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it. Answer Generate a script as described above. The script should look like this: #!/usr/bin/env bash echo \"SIB courses are great!\" Usually, you can run it like this: ./new_script.sh But there\u2019s an error: bash: ./new_script.sh: Permission denied Why is there an error? Hint Use ls -lh new_script.sh to check the permissions. Answer ls -lh new_script.sh gives: -rw-r--r-- 1 user group 51B Nov 11 16 :21 new_script.sh There\u2019s no x in the permissions string. You should change at least the permissions of the user. Make the script executable for yourself, and run it. Answer Change permissions: chmod u+x new_script.sh ls -lh new_script.sh now gives: -rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh So it should be executable: ./new_script.sh More on chmod and file permissions here .","title":"File permissions"},{"location":"day1/server_login/#redirection-and","text":"In the root directory (go there like this: cd / ) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt in your working directory. Answer ls / > ~/project/system_dirs.txt The command wc -l counts the number of lines, and can read from stdin. Make a one-liner with a pipe | symbol to find out how many system directories and files there are. Answer ls / | wc -l","title":"Redirection: > and |"},{"location":"day1/server_login/#variables","text":"Store system_dirs.txt as variable (like this: VAR=variable ), and use wc -l on that variable to count the number of lines in the file. Answer FILE = ~/project/system_dirs.txt wc -l $FILE","title":"Variables"},{"location":"day1/server_login/#shell-scripts","text":"Make a shell script that automatically counts the number of system directories and files. Answer Make a script called e.g. current_system_dirs.sh : #!/usr/bin/env bash cd / ls | wc -l","title":"shell scripts"},{"location":"day2/file_types/","text":"Learning outcomes After having completed this chapter you will be able to: Describe the fasta and fastq file format Describe which information can be stored in a standard Illumina fastq title Reproduce how and why base quality is stored in a fastq file as a single ASCII character Lookup relevant information of an alignment in the header of a sam file Describe what information is stored in each column of a sam file Describe how information is stored in a sam flag Describe the bed and gtf file format Describe vcf file format Material Download the presentation File definition websites: FASTQ (wikipedia) GFF (ensembl) VCF (Wikipedia) SAM: Wikipedia samtools Zhuyi Xue","title":"File types"},{"location":"day2/file_types/#learning-outcomes","text":"After having completed this chapter you will be able to: Describe the fasta and fastq file format Describe which information can be stored in a standard Illumina fastq title Reproduce how and why base quality is stored in a fastq file as a single ASCII character Lookup relevant information of an alignment in the header of a sam file Describe what information is stored in each column of a sam file Describe how information is stored in a sam flag Describe the bed and gtf file format Describe vcf file format","title":"Learning outcomes"},{"location":"day2/file_types/#material","text":"Download the presentation File definition websites: FASTQ (wikipedia) GFF (ensembl) VCF (Wikipedia) SAM: Wikipedia samtools Zhuyi Xue","title":"Material"},{"location":"day2/read_alignment/","text":"Learning outcomes After having completed this chapter you will be able to: Explain what a sequence aligner does Explain why in some cases the aligner needs to be \u2018splice-aware\u2019 Calculate mapping quality out of the probability that a mapping position is wrong Build an index of the reference and perform an alignment of paired-end reads with bowtie2 Material Download the presentation Unix command line E-utilities documentation bowtie2 manual Ben Langmead\u2019s youtube channel for excellent lectures on e.g. BWT, suffix matrixes/trees, and binary search. Exercises Prepare the reference sequence Make a script called 05_download_ecoli_reference.sh , and paste in the code snippet below. Use it to retrieve the reference sequence using esearch and efetch : 05_download_ecoli_reference.sh #!/usr/bin/env bash REFERENCE_DIR = ~/project/ref_genome/ mkdir $REFERENCE_DIR cd $REFERENCE_DIR esearch -db nuccore -query 'U00096' \\ | efetch -format fasta > ecoli-strK12-MG1655.fasta Exercise: Check out the documentation of bowtie2-build , and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh . Answer 06_build_bowtie_index.sh #!/usr/bin/env bash cd ~/project/ref_genome bowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta Align the reads with bowtie2 Exercise: Check out the bowtie2 manual here . We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need? Answer According to the usage of bowtie2 : bowtie2 [ options ] * -x { -1 -2 | -U | --interleaved | --sra-acc | b } We\u2019ll need the options: -x to point to our index -1 and -2 to point to our forward and reverse reads Exercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh , and run it. 07_align_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome/ ALIGNED_DIR = ~/project/results/alignments mkdir -p $ALIGNED_DIR bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ > $ALIGNED_DIR /SRR519926.sam We\u2019ll go deeper into alignment statistics later on, but bowtie2 writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual . Can you guess what the reason could be?","title":"Read alignment"},{"location":"day2/read_alignment/#learning-outcomes","text":"After having completed this chapter you will be able to: Explain what a sequence aligner does Explain why in some cases the aligner needs to be \u2018splice-aware\u2019 Calculate mapping quality out of the probability that a mapping position is wrong Build an index of the reference and perform an alignment of paired-end reads with bowtie2","title":"Learning outcomes"},{"location":"day2/read_alignment/#material","text":"Download the presentation Unix command line E-utilities documentation bowtie2 manual Ben Langmead\u2019s youtube channel for excellent lectures on e.g. BWT, suffix matrixes/trees, and binary search.","title":"Material"},{"location":"day2/read_alignment/#exercises","text":"","title":"Exercises"},{"location":"day2/read_alignment/#prepare-the-reference-sequence","text":"Make a script called 05_download_ecoli_reference.sh , and paste in the code snippet below. Use it to retrieve the reference sequence using esearch and efetch : 05_download_ecoli_reference.sh #!/usr/bin/env bash REFERENCE_DIR = ~/project/ref_genome/ mkdir $REFERENCE_DIR cd $REFERENCE_DIR esearch -db nuccore -query 'U00096' \\ | efetch -format fasta > ecoli-strK12-MG1655.fasta Exercise: Check out the documentation of bowtie2-build , and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh . Answer 06_build_bowtie_index.sh #!/usr/bin/env bash cd ~/project/ref_genome bowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta","title":"Prepare the reference sequence"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","text":"Exercise: Check out the bowtie2 manual here . We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need? Answer According to the usage of bowtie2 : bowtie2 [ options ] * -x { -1 -2 | -U | --interleaved | --sra-acc | b } We\u2019ll need the options: -x to point to our index -1 and -2 to point to our forward and reverse reads Exercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh , and run it. 07_align_reads.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome/ ALIGNED_DIR = ~/project/results/alignments mkdir -p $ALIGNED_DIR bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ > $ALIGNED_DIR /SRR519926.sam We\u2019ll go deeper into alignment statistics later on, but bowtie2 writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual . Can you guess what the reason could be?","title":"Align the reads with bowtie2"},{"location":"day2/samtools/","text":"Learning outcomes After having completed this chapter you will be able to: Use samtools flagstat to get general statistics on the flags stored in a sam/bam file Use samtools view to: compress a sam file into a bam file filter on sam flags count alignments filter out a region Use samtools sort to sort an alignment file based on coordinate Use samtools index to create an index of a sorted sam/bam file Use the pipe ( | ) symbol to pipe alignments directly to samtools to perform sorting and filtering Material samtools documentation Explain sam flags tool Exercises Alignment statistics Exercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats by using samtools flagstat . Find the documentation here . Anything that draws your attention? Answer Code: cd ~/project/results/alignments/ samtools flagstat SRR519926.sam > SRR519926.sam.stats resulting in: 624724 + 0 in total (QC-passed reads + QC-failed reads) 624724 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 621624 + 0 mapped (99.50% : N/A) 621624 + 0 primary mapped (99.50% : N/A) 624724 + 0 paired in sequencing 312362 + 0 read1 312362 + 0 read2 300442 + 0 properly paired (48.09% : N/A) 619200 + 0 with itself and mate mapped 2424 + 0 singletons (0.39% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this: samtools stats SRR519926.sam | grep ^SN | cut -f 2,3 Now look at insert size average and insert size standard deviation . You can see the standard deviation is higher than the average, suggesting a wide distribution. Compression, sorting and indexing The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that. Exercise : Create a script called 08_compress_sort.sh . Add a samtools view command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Find the required documentation here . How much was the disk space reduced by compressing the file? Tip: Samtools writes to stdout By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o . Answer 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam By using ls -lh , you can find out that SRR519926.sam has a size of 264 Mb, while SRR519926.bam is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this: samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam Exercise : Add these lines to 08_compress_sort.sh , and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file ( SRR519926.bam ) and the sorted bam file ( SRR519926.sorted.bam ) with samtools view -H . What are the differences? Answer Your script should like like this: 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam samtools view -H SRR519926.bam returns: @HD VN:1.0 SO:unsorted @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam And samtools view -H SRR519926.sorted.bam returns: @HD VN:1.0 SO:coordinate @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam There are two main differences: The SO tag at @HD type code has changed from unsorted to coordinate . A line with the @PG type code for the sorting was added. Note that the command to view the header ( samtools -H ) is also added to the header for both runs. Filtering With samtools view you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads. Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation . Answer You will need the 0x4 flag. Filtering against unmapped reads (leaving only mapped reads) with samtools view would look like this: samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam or: samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam Exercise: Generate a script called 09_extract_unmapped.sh to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat ? Tip Check out the -f and -c options of samtools view Answer Your script 09_extract_unmapped.sh should look like this: 09_extract_unmapped.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam Counting like this: samtools view -c SRR519926.sorted.unmapped.bam This should correspond to the output of samtools flagstat (624724 - 621624 = 3100) samtools view also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files. Exercise: Generate a script called 10_extract_region.sh to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header. Tip: Specifying a region Our E. coli genome has only one chromosome, because only one line starts with > in the fasta file cd ~/project/ref_genome grep \">\" ecoli-strK12-MG1655.fasta gives: >U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END Answer 10_extract_region.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh \\ SRR519926.sorted.bam \\ U00096.3:2000000-2500000 \\ > SRR519926.sorted.region.bam Redirection Samtools is easy to use in a pipe. In this case you can replace the input file with a - . For example, you can sort and compress the output of your alignment software in a pipe like this: my_alignment_command \\ | samtools sort - \\ | samtools view -bh - \\ > alignment.bam The use of - In the modern versions of samtools, the use of - is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry. Exercise: Write a script called 11_align_sort.sh that maps the reads with bowtie2 (see chapter 2 of read alignment ), sorts them, and outputs them as a BAM file with a header. Answer 11_align_sort.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome ALIGNED_DIR = ~/project/results/alignments bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ 2 > $ALIGNED_DIR /bowtie2_SRR519926.log \\ | samtools sort - \\ | samtools view -bh - \\ > $ALIGNED_DIR /SRR519926.sorted.mapped.frompipe.bam Redirecting stderr Notice the line starting with 2> . This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log . This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc . QC summary The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc , fastp , samtools and bowtie2 ). You can specify a directory that contains any log files, and it will automatically search it for you. Exercise : Run the command multiqc . in ~/project and checkout the generated report.","title":"Samtools"},{"location":"day2/samtools/#learning-outcomes","text":"After having completed this chapter you will be able to: Use samtools flagstat to get general statistics on the flags stored in a sam/bam file Use samtools view to: compress a sam file into a bam file filter on sam flags count alignments filter out a region Use samtools sort to sort an alignment file based on coordinate Use samtools index to create an index of a sorted sam/bam file Use the pipe ( | ) symbol to pipe alignments directly to samtools to perform sorting and filtering","title":"Learning outcomes"},{"location":"day2/samtools/#material","text":"samtools documentation Explain sam flags tool","title":"Material"},{"location":"day2/samtools/#exercises","text":"","title":"Exercises"},{"location":"day2/samtools/#alignment-statistics","text":"Exercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats by using samtools flagstat . Find the documentation here . Anything that draws your attention? Answer Code: cd ~/project/results/alignments/ samtools flagstat SRR519926.sam > SRR519926.sam.stats resulting in: 624724 + 0 in total (QC-passed reads + QC-failed reads) 624724 + 0 primary 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 621624 + 0 mapped (99.50% : N/A) 621624 + 0 primary mapped (99.50% : N/A) 624724 + 0 paired in sequencing 312362 + 0 read1 312362 + 0 read2 300442 + 0 properly paired (48.09% : N/A) 619200 + 0 with itself and mate mapped 2424 + 0 singletons (0.39% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5) Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this: samtools stats SRR519926.sam | grep ^SN | cut -f 2,3 Now look at insert size average and insert size standard deviation . You can see the standard deviation is higher than the average, suggesting a wide distribution.","title":"Alignment statistics"},{"location":"day2/samtools/#compression-sorting-and-indexing","text":"The command samtools view is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that. Exercise : Create a script called 08_compress_sort.sh . Add a samtools view command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b and -h options. Find the required documentation here . How much was the disk space reduced by compressing the file? Tip: Samtools writes to stdout By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with > or use the the output option -o . Answer 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam By using ls -lh , you can find out that SRR519926.sam has a size of 264 Mb, while SRR519926.bam is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this: samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam Exercise : Add these lines to 08_compress_sort.sh , and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file ( SRR519926.bam ) and the sorted bam file ( SRR519926.sorted.bam ) with samtools view -H . What are the differences? Answer Your script should like like this: 08_compress_sort.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh SRR519926.sam > SRR519926.bam samtools sort SRR519926.bam > SRR519926.sorted.bam samtools index SRR519926.sorted.bam samtools view -H SRR519926.bam returns: @HD VN:1.0 SO:unsorted @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam And samtools view -H SRR519926.sorted.bam returns: @HD VN:1.0 SO:coordinate @SQ SN:U00096.3 LN:4641652 @PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\" @PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam @PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam There are two main differences: The SO tag at @HD type code has changed from unsorted to coordinate . A line with the @PG type code for the sorting was added. Note that the command to view the header ( samtools -H ) is also added to the header for both runs.","title":"Compression, sorting and indexing"},{"location":"day2/samtools/#filtering","text":"With samtools view you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads. Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation . Answer You will need the 0x4 flag. Filtering against unmapped reads (leaving only mapped reads) with samtools view would look like this: samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam or: samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam Exercise: Generate a script called 09_extract_unmapped.sh to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat ? Tip Check out the -f and -c options of samtools view Answer Your script 09_extract_unmapped.sh should look like this: 09_extract_unmapped.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam Counting like this: samtools view -c SRR519926.sorted.unmapped.bam This should correspond to the output of samtools flagstat (624724 - 621624 = 3100) samtools view also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files. Exercise: Generate a script called 10_extract_region.sh to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header. Tip: Specifying a region Our E. coli genome has only one chromosome, because only one line starts with > in the fasta file cd ~/project/ref_genome grep \">\" ecoli-strK12-MG1655.fasta gives: >U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END Answer 10_extract_region.sh #!/usr/bin/env bash cd ~/project/results/alignments samtools view -bh \\ SRR519926.sorted.bam \\ U00096.3:2000000-2500000 \\ > SRR519926.sorted.region.bam","title":"Filtering"},{"location":"day2/samtools/#redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a - . For example, you can sort and compress the output of your alignment software in a pipe like this: my_alignment_command \\ | samtools sort - \\ | samtools view -bh - \\ > alignment.bam The use of - In the modern versions of samtools, the use of - is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry. Exercise: Write a script called 11_align_sort.sh that maps the reads with bowtie2 (see chapter 2 of read alignment ), sorts them, and outputs them as a BAM file with a header. Answer 11_align_sort.sh #!/usr/bin/env bash TRIMMED_DIR = ~/project/results/trimmed REFERENCE_DIR = ~/project/ref_genome ALIGNED_DIR = ~/project/results/alignments bowtie2 \\ -x $REFERENCE_DIR /ecoli-strK12-MG1655.fasta \\ -1 $TRIMMED_DIR /trimmed_SRR519926_1.fastq \\ -2 $TRIMMED_DIR /trimmed_SRR519926_2.fastq \\ 2 > $ALIGNED_DIR /bowtie2_SRR519926.log \\ | samtools sort - \\ | samtools view -bh - \\ > $ALIGNED_DIR /SRR519926.sorted.mapped.frompipe.bam Redirecting stderr Notice the line starting with 2> . This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log . This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc .","title":"Redirection"},{"location":"day2/samtools/#qc-summary","text":"The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc , fastp , samtools and bowtie2 ). You can specify a directory that contains any log files, and it will automatically search it for you. Exercise : Run the command multiqc . in ~/project and checkout the generated report.","title":"QC summary"},{"location":"day3/igv_visualisation/","text":"Learning outcomes After having completed this chapter you will be able to: Prepare a bam file for loading it into IGV Use IGV to: Navigate through a reference genome and alignments Retrieve information on a specific alignment Investigate (possible) variants Identify repeats and large INDELs Material The exercises below are partly based on this tutorial from the Griffith lab . Exercises A first glance: the E. coli dataset Index the alignment that was filtered for the region between 2000 and 2500 kb: cd ~/project/results/alignments samtools index SRR519926.sorted.region.bam Download the alignment ( SRR519926.sorted.region.bam ) together with it\u2019s index file ( SRR519926.sorted.region.bam.bai ) and the reference genome ( ecoli-strK12-MG1655.fasta ) to your desktop. If working with Docker If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project , your files will be in C:\\Users\\myusername\\ngs-course . Load the genome ( .fasta ) into IGV: Genomes > Load Genome from File\u2026 Load the alignment file ( .bam ): File > Load from File\u2026 Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways: With the search box Select the region in the location bar View the reads as pairs, by right click on the reads and select View as pairs Exercise: There are lot of reads that are coloured red. Why is that? If you don\u2019t find any red reads.. The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size Answer According to IGV , reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size. Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click : Exercise: Click on one of the reads. What kind of information is there? Answer Most of the information from the SAM file. Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand . HCC1143 data set For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000. The BAM files containing these reads for the cancer cell line and the matched normal are: HCC1143.normal.21.19M-20M.bam HCC1143.normal.21.19M-20M.bam.bai A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu: Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam using the file browser. This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000 . Navigate to the gene CHODL by typing it in the search box. Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7 Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file. Click on the coverage track where the SNP is: Exercise: What is the sequence sequencing depth for that base? And the percentage T? Answer The depth is 62, and 25 reads (40%) T. Navigate to region chr21:19,800,320-19,818,162 Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker Note This might take a while to load. Right click in the alignment track and select Color alignments by > insert size and pair orientation Exercise: Why are some reads coloured white? What can be the cause of that? Answer The white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3. Navigate to region chr21:19,324,500-19,331,500 Right click in the main alignment track and select: Expanded View as pairs Color alignments by > insert size and pair orientation Sort alignments by > insert size Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion? Answer The insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.","title":"IGV and visualisation"},{"location":"day3/igv_visualisation/#learning-outcomes","text":"After having completed this chapter you will be able to: Prepare a bam file for loading it into IGV Use IGV to: Navigate through a reference genome and alignments Retrieve information on a specific alignment Investigate (possible) variants Identify repeats and large INDELs","title":"Learning outcomes"},{"location":"day3/igv_visualisation/#material","text":"The exercises below are partly based on this tutorial from the Griffith lab .","title":"Material"},{"location":"day3/igv_visualisation/#exercises","text":"","title":"Exercises"},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb: cd ~/project/results/alignments samtools index SRR519926.sorted.region.bam Download the alignment ( SRR519926.sorted.region.bam ) together with it\u2019s index file ( SRR519926.sorted.region.bam.bai ) and the reference genome ( ecoli-strK12-MG1655.fasta ) to your desktop. If working with Docker If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project , your files will be in C:\\Users\\myusername\\ngs-course . Load the genome ( .fasta ) into IGV: Genomes > Load Genome from File\u2026 Load the alignment file ( .bam ): File > Load from File\u2026 Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways: With the search box Select the region in the location bar View the reads as pairs, by right click on the reads and select View as pairs Exercise: There are lot of reads that are coloured red. Why is that? If you don\u2019t find any red reads.. The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size Answer According to IGV , reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size. Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click : Exercise: Click on one of the reads. What kind of information is there? Answer Most of the information from the SAM file. Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand .","title":"A first glance: the E. coli dataset"},{"location":"day3/igv_visualisation/#hcc1143-data-set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000. The BAM files containing these reads for the cancer cell line and the matched normal are: HCC1143.normal.21.19M-20M.bam HCC1143.normal.21.19M-20M.bam.bai A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu: Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam using the file browser. This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000 . Navigate to the gene CHODL by typing it in the search box. Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7 Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file. Click on the coverage track where the SNP is: Exercise: What is the sequence sequencing depth for that base? And the percentage T? Answer The depth is 62, and 25 reads (40%) T. Navigate to region chr21:19,800,320-19,818,162 Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker Note This might take a while to load. Right click in the alignment track and select Color alignments by > insert size and pair orientation Exercise: Why are some reads coloured white? What can be the cause of that? Answer The white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3. Navigate to region chr21:19,324,500-19,331,500 Right click in the main alignment track and select: Expanded View as pairs Color alignments by > insert size and pair orientation Sort alignments by > insert size Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion? Answer The insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.","title":"HCC1143 data set"}]} \ No newline at end of file diff --git a/2023.11/sitemap.xml b/2023.11/sitemap.xml index af98144..6ecb606 100644 --- a/2023.11/sitemap.xml +++ b/2023.11/sitemap.xml @@ -2,67 +2,67 @@ None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily None - 2023-11-22 + 2023-11-24 daily \ No newline at end of file diff --git a/2023.11/sitemap.xml.gz b/2023.11/sitemap.xml.gz index fca1d81..472d8d8 100644 Binary files a/2023.11/sitemap.xml.gz and b/2023.11/sitemap.xml.gz differ