Here you can find the course material for the SIB course 'NGS - Quality control, Alignment, Visualisation'. You can follow this course enrolled (check out upcoming training courses) or in your own time.
After this course, you will be able to:
License: CC BY-SA 4.0
Copyright: SIB Swiss Institute of Bioinformatics
Enrolled to the course / Independently: You can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there are any issues, use the issues page on our github repository.
Note
It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!
"},{"location":"#material","title":"Material","text":"After this course, you will be able to:
This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.
"},{"location":"#exercises","title":"Exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"#asking-questions","title":"Asking questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button:
A main source of communication will be our slack channel. Ask background questions that interest you personally at #background. During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a. This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option:
The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally.
To summarise:
After this course, you will be able to:
Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"course_schedule/","title":"Course schedule","text":"Note
Apart from the starting time, the schedule is indicative. Because we cannot plan a course by the minute, in practice the time points will deviate.
"},{"location":"course_schedule/#day-1","title":"Day 1","text":"block start end subject introduction 9:00 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-2","title":"Day 2","text":"block start end subject block 1 9:00 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-3","title":"Day 3","text":"block start end subject block 1 9:00 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations"},{"location":"group_work/","title":"Group work","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people.
If working with Docker
If you are working with Docker, I assume you are working independently and therefore cannot work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You'll probably need around 4 cores, 16G of RAM and 50G of hard disk space.
If online
If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.
"},{"location":"group_work/#material","title":"Material","text":"Download the presentation
"},{"location":"group_work/#roles-organisation","title":"Roles & organisation","text":"Project based learning is about learning by doing, but also about peer instruction. This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding.
Each project has tasks and questions. By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions.
In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.
"},{"location":"group_work/#working-directories","title":"Working directories","text":"Each group has access to a shared working directory. It is mounted in the root directory (/
). You can add this directory to your working space by clicking: File > Add Folder to Workspace\u2026. Then, type the path to your group directory: /group_work/groupX
(where X
is your group number).
Aim: Find variants on chromosome 20 from three samples
In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them.
You can get the data by running these commands:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz\ntar -xvf project1.tar.gz\nrm project1.tar.gz\n
"},{"location":"group_work/#tasks","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
. Make sure to include the option --detect_adapter_for_pe
. To prevent overwriting fastp.html
, specify a report filename for each sample with the option --html
. fastqc
again to see whether all adapters are gone..fai
file) with samtools faidx
. bowtie2
. At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. gatk MarkDuplicates
(see hints below).samtools merge
. Index the bam file afterwards. freebayes
to call variants. Only call variants on the region chr20:10018000-10220000
by specifying the -r
option. chr20:10,026,397-10,026,638
. multiqc
to get an overall quality report.samtools flagstat
)? Why is it important to remove them for variant analysis?chr20:10,026,397-10,026,638
. What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. You can add readgroups to the alignment file with bowtie2
with the options --rg-id
and --rg
, e.g. ($SAMPLE
is a variable containing a sample identifier):
bowtie2 \\\n-x ref.fa \\\n-1 r1.fastq.gz \\\n-2 r2.fastq.gz \\\n--rg-id $SAMPLE \\\n--rg SM:$SAMPLE \\\n
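The tasks also ask for an indexed reference and for a sorted, indexed bam file. This is not part of the original hints, but a minimal sketch of those steps (all file names are placeholders) could look like this:
## index the reference fasta (creates ref.fa.fai)\nsamtools faidx ref.fa\n\n## coordinate-sort the alignment and index it\nsamtools sort sample.bam > sample.sorted.bam\nsamtools index sample.sorted.bam\n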
To run gatk MarkDuplicates
you will only need to specify --INPUT
and --OUTPUT
, e.g.:
gatk MarkDuplicates \\\n--INPUT sample.bam \\\n--OUTPUT sample.md.bam \\\n--METRICS_FILE sample.metrics.txt \n
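The hints do not show a freebayes call; as a rough sketch (merged.md.bam and trio.vcf are placeholder names, -f points to the reference fasta and -r restricts calling to the region from the tasks):
freebayes \\\n-f ref.fa \\\n-r chr20:10018000-10220000 \\\nmerged.md.bam \\\n> trio.vcf\n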
"},{"location":"group_work/#project-2-long-read-genome-sequencing","title":"Project 2: Long-read genome sequencing","text":"Aim: Align long reads from RNA-seq data to a reference genome.
In this project, you will be working with data from:
Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For this project, we will work with two samples of this study, EV_2
(extracellular vesicle) and Cell_2
 (whole cell). Download and unpack the data files (reads and reference genome) like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project2.tar.gz\ntar -xvf project2.tar.gz\nrm project2.tar.gz\n
You can find the fastq files in the reads
folder and the reference genome and its annotation in the reference
folder. To reduce computational times we work with a subset of the data on a subset of the genome (chromosome 5 and X).
Important!
Stick to the principles for reproducible analysis described here
Tasks:
- Perform QC on the reads with fastqc and NanoPlot
- Align the reads to the reference genome with minimap2, using the default parameters for -x
- Check out the alignments around the gene ELOVL5.
Questions:
- What are the differences between the reports of fastqc and NanoPlot? How is that compared to the publication?
- Check out the option -x of minimap2. Are the defaults appropriate?
- Re-run the alignment with -x map-ont or -x splice. Do you see differences in the alignment in e.g. IGV?
- Which setting would you choose for -x?
- Is ELOVL5 sequenced in both samples?
Accuracy from quality scores
Find the equation to calculate error probability from quality score on Wikipedia.
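As a reminder, the Phred relation is P = 10^(-Q/10), so Q10 corresponds to a 10% error probability and Q20 to 1%. If you want to check a value on the command line, a small sketch with bc and its math library:
## error probability for Q=20, i.e. 10^(-20/10)\necho \"e(-20/10*l(10))\" | bc -l\n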
Comparing fastqc
and Nanoplot
For comparing fastqc
and NanoPlot
, check out this blog of the author of NanoPlot, and this thread.
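If you are unsure how to invoke NanoPlot, a minimal sketch could look like this (the fastq file name and the output directory are placeholders):
NanoPlot \\\n--fastq [FASTQFILE].fastq.gz \\\n--outdir nanoplot_[SAMPLE]\n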
Running minimap2
Here\u2019s an example command for minimap2
:
minimap2 \\\n-a \\\n-x [PARAMETER] \\\n[REFERENCE].fa \\\n[FASTQFILE].fastq.gz \\\n| samtools sort \\\n| samtools view -bh > [OUTPUT].bam\n
"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","title":"Project 3: Short-read RNA-seq of mice.","text":"Aim: Generate a count matrix to estimate differential gene expression.
In this project you will be working with data from:
Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6
Here\u2019s the BioProject page. Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz\ntar -xvf project3.tar.gz\nrm project3.tar.gz\n
"},{"location":"group_work/#tasks_2","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
fastqc
again to see whether all adapters are gone.hisat2
Sparcl1
. For this, you can use the built-in genome (Mouse (mm10)). Do you see any evidence for differential splicing?featureCounts
on both alignments. Have a look at the option -Q
. For further suggestions, see the hints below. R
(find a script to get started here; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio
)fastqc
reports?featureCounts
? What is the cause of this? -Q
in featureCounts
?We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing.
command option
bowtie2-build --threads
hisat2-build --threads
fastqc --threads
cutadapt --cores
bowtie2 --threads
hisat2 --threads
featureCounts -T
Here's some example code for hisat2 and featureCounts. Everything in between <> should be replaced with specific arguments.
Here's an example for hisat2:
hisat2-build <reference_sequence_fasta> <index_basename>\n\nhisat2 \\\n-x <index_basename> \\\n-1 <forward_reads.fastq.gz> \\\n-2 <reverse_reads.fastq.gz> \\\n-p <threads> \\\n| samtools sort \\\n| samtools view -bh \\\n> <alignment_file.bam>\n
Example code for featureCounts:
featureCounts \\\n-p \\\n-T 2 \\\n-a <annotations.gtf> \\\n-o <output.counts.txt> \\\n*.bam\n
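To see how many reads were assigned or unassigned (useful for the featureCounts question above), you can inspect the .summary file that featureCounts writes next to its output, for example:
## assignment statistics written by featureCounts\ncat <output.counts.txt>.summary\n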
"},{"location":"precourse/","title":"Precourse preparations","text":""},{"location":"precourse/#unix","title":"UNIX","text":"We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here. If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial.
"},{"location":"precourse/#software","title":"Software","text":"We will be mainly working on an Amazon Web Services (AWS) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a VSCode interface. All participants will be granted access to a personal workspace to be used during the course.
The only software you need to install before the course is Integrative Genomics Viewer (IGV).
"},{"location":"day1/intro/","title":"Introduction","text":"If working independently
If you are doing this course independently, you can skip this part.
"},{"location":"day1/intro/#material","title":"Material","text":"Download the presentation
"},{"location":"day1/quality_control/","title":"Quality control","text":""},{"location":"day1/quality_control/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
fastqc
on sequence reads and interpret the resultsfastp
Download the presentation
fastqc
command line documentationcutadapt
manualCheck out the dataset at SRA.
Exercise: Browse around the SRA entry and answer these questions:
A. Is the dataset paired-end or single end?
B. Which instrument was used for sequencing?
C. What is the read length?
D. How many reads do we have?
AnswersA. paired-end
B. Illumina MiSeq
C. 2 x 251 bp
D. 400596
Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools
. Every time you open a new terminal, you will have to load the environment:
conda activate ngs-tools\n
Make a directory reads
in ~/project
and download the reads from the SRA database using prefetch
and fastq-dump
from SRA-Tools into the reads
directory. Use the code snippet below to create a scripts called 01_download_reads.sh
. Store it in ~/project/scripts/
, and run it.
#!/usr/bin/env bash\n\ncd ~/project\nmkdir reads\ncd reads\nprefetch SRR519926\nfastq-dump --split-files SRR519926\n
Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry.
Tip
A read in a fastq file consists of four lines (more on that at file types). Use Google to figure out how to count the number of reads in a fastq file.
Answere.g. from this thread on Biostars:
## forward read\necho $(cat SRR519926_1.fastq | wc -l)/4 | bc\n\n## reverse read\necho $(cat SRR519926_2.fastq | wc -l)/4 | bc\n
"},{"location":"day1/quality_control/#run-fastqc","title":"Run fastqc","text":"Exercise: Create a script to run fastqc
and call it 02_run_fastqc.sh
. After that, run it.
Tip
fastqc
accepts multiple files as input, so you can use a wildcard to run fastqc
on all the files in one line of code. Use it like this: *.fastq
.
Your script ~/project/scripts/02_run_fastqc.sh
should look like:
#!/usr/bin/env bash\ncd ~/project/reads\n\nfastqc *.fastq\n
Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems?
Downloading files
You can download files by right-click the file and after that select Download:
Answer
There seems to be:
We can probably fix most of these issues by trimming.
"},{"location":"day1/quality_control/#trim-the-reads","title":"Trim the reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp
will use those by default. A reference for the adapter sequences can be found here.
Exercise: Check out the documentation of fastp, and the option defaults by running fastp --help
.
--qualified_quality_phred
)--unqualified_percent_limit
)--length_required
)--unpaired1
and --unpaired2
)Default 15 means phred quality >=Q15 is qualified. (int [=15])
reads shorter than length_required will be discarded, default is 15. (int [=15])
--unpaired1
and/or --unpaired2
are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
. Exercise: Complete the script below called 03_trim_reads.sh
(replace everything in between brackets []
) to run fastp
to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/
is created to write the trimmed reads.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n[QUALIFIED BASE THRESHOLD] \\\n[MINIMUM LENGTH THRESHOLD] \\\n[UNQUALIFIED PERCENTAGE LIMIT] \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
Additional options
Note that we have set the options --cut_front
and --cut_tail
that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe
is set, which ensures that adapters are detected automatically for both R1 and R2.
Your script (~/project/scripts/03_trim_reads.sh
) should look like this:
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n--qualified_quality_phred 10 \\\n--length_required 25 \\\n--unqualified_percent_limit 80 \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
The use of \\
In the script above you see that we\u2019re using \\
at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp
command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\
to ignore the newlines when executing.
Exercise: Check out the report in fastp.html
.
fastqc
, fastp
has trouble finding adapters in R2. Also, after running fastp
there doesn\u2019t seem to be much adapter left (you can double check by running fastqc
on trimmed_SRR519926_2.fastq
). How could that be? After having completed this chapter you will be able to:
Download the presentation
"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","title":"Some good practices for reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility:
01_download_reads.sh
)06_build_bowtie_index.sh
By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research.
By the end of day 2 ~/project
should look (something) like this:
.\n\u251c\u2500\u2500 alignment_output\n\u251c\u2500\u2500 reads\n\u251c\u2500\u2500 ref_genome\n\u251c\u2500\u2500 scripts\n\u2502 \u251c\u2500\u2500 01_download_reads.sh\n\u2502 \u251c\u2500\u2500 02_run_fastqc.sh\n\u2502 \u251c\u2500\u2500 03_trim_reads.sh\n\u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh\n\u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh\n\u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh\n\u2502 \u251c\u2500\u2500 07_align_reads.sh\n\u2502 \u251c\u2500\u2500 08_compress_sort.sh\n\u2502 \u251c\u2500\u2500 09_extract_unmapped.sh\n\u2502 \u251c\u2500\u2500 10_extract_region.sh\n\u2502 \u2514\u2500\u2500 11_align_sort.sh\n\u2514\u2500\u2500 trimmed_data\n
"},{"location":"day1/sequencing_technologies/","title":"Sequencing technologies","text":""},{"location":"day1/sequencing_technologies/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
Illumina sequencing by synthesis on YouTube
NEBnext library preparation poster
"},{"location":"day1/server_login/","title":"Setup","text":""},{"location":"day1/server_login/#learning-outcomes","title":"Learning outcomes","text":"Note
You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line.
After having completed this chapter you will be able to:
bash
scriptbash
Choose your platform
In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker.
If you are doing the course with a teacher, you will have to login to the remote server. Therefore choose:
If you are doing this course independently (i.e. without a teacher) choose either:
If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this:
WindowsMac/Linux.exe
file hereconda list
in the Ananconda prompt or terminal to check whether your installation has succeeded..sh
) script herebash Miniconda3-latest-Linux-x86_64.sh\n
conda list
in the Ananconda prompt or terminal to check whether your installation has succeeded.After installation, you can install the required software:
Windows/MacOSLinuxconda create -n ngs-tools\n\nconda activate ngs-tools\n\nconda install -y -c bioconda \\\n samtools \\\n bwa \\\n fastqc \\\n sra-tools \\\n bowtie2=2.4.2 \\\n hisat2=2.2.1 \\\n subread=2.0.1 \\\n entrez-direct \\\n minimap2 \\\n gatk4 \\\n freebayes \\\n multiqc \\\n fastp\n
Download ngs-tools.yml, and generate the conda environment like this:
conda env create --name ngs-tools -f ngs-tools.yml\n
Note
If that did not succeed, follow the instructions for Windows/MacOS.
This will create the conda environment ngs-tools
Activate it like so:
conda activate ngs-tools\n
After successful installation and activating the environment all the software required to do the exercises should be available.
If you are doing project 2 (long reads)
If you are doing the project 2 as part of the course, you will need to install NanoPlot
as well, using pip
:
pip install NanoPlot\n
"},{"location":"day1/server_login/#exercises","title":"Exercises","text":""},{"location":"day1/server_login/#first-login","title":"First login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002
) in your browser. This should result in the following page:
Info
The link gives you access to a web version of Visual Studio Code. This is a powerful code editor that you can also use as a local application on your computer.
Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+`. Or by clicking Application menu > Terminal > New Terminal:
For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:
"},{"location":"day1/server_login/#material","title":"Material","text":"
Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system.
In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10.
The command to run the environment required for this course looks like this (in a terminal):
Modify the script
Modify the path after -v
to the working directory on your computer before running it.
docker run \\\n--rm \\\n-p 8443:8443 \\\n-e PUID=1000 \\\n-e PGID=1000 \\\n-e DEFAULT_WORKSPACE=/config/project \\\n-v $PWD:/config/project \\\ngeertvangeest/ngs-introduction-vscode:latest\n
If this command has run successfully, navigate in your browser to http://localhost:8443.
The option -v
mounts a local directory in your computer to the directory /config/project
in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
Don\u2019t mount directly in the home dir
Don\u2019t directly mount your local directory to the home directory (/root
). This will lead to unexpected behaviour.
The part geertvangeest/ngs-introduction-vscode:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately.
Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory.
If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet:
UNIX cheat sheet
"},{"location":"day1/server_login/#make-a-new-directory","title":"Make a new directory","text":"Make a directory scripts
within ~/project
and make it your current directory.
cd ~/project\nmkdir scripts\ncd scripts\n
"},{"location":"day1/server_login/#file-permissions","title":"File permissions","text":"Generate an empty script in your newly made directory ~/project/scripts
like this:
touch new_script.sh\n
Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it.
AnswerGenerate a script as described above. The script should look like this:
#!/usr/bin/env bash\n\necho \"SIB courses are great!\"\n
Usually, you can run it like this:
./new_script.sh\n
But there\u2019s an error:
bash: ./new_script.sh: Permission denied\n
Why is there an error?
Hint
Use ls -lh new_script.sh
to check the permissions.
ls -lh new_script.sh\n
gives:
-rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
There\u2019s no x
in the permissions string. You should change at least the permissions of the user.
Make the script executable for yourself, and run it.
AnswerChange permissions:
chmod u+x new_script.sh\n
ls -lh new_script.sh
now gives:
-rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
So it should be executable:
./new_script.sh\n
More on chmod
and file permissions here.
>
and |
","text":"In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory.
ls / > ~/project/system_dirs.txt\n
The command wc -l
counts the number of lines, and can read from stdin. Make a one-liner with a pipe |
symbol to find out how many system directories and files there are.
ls / | wc -l\n
"},{"location":"day1/server_login/#variables","title":"Variables","text":"Store system_dirs.txt
as variable (like this: VAR=variable
), and use wc -l
on that variable to count the number of lines in the file.
FILE=~/project/system_dirs.txt\nwc -l $FILE\n
"},{"location":"day1/server_login/#shell-scripts","title":"shell scripts","text":"Make a shell script that automatically counts the number of system directories and files.
AnswerMake a script called e.g. current_system_dirs.sh
:
#!/usr/bin/env bash\ncd /\nls | wc -l\n
"},{"location":"day2/file_types/","title":"File types","text":""},{"location":"day2/file_types/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
File definition websites:
After having completed this chapter you will be able to:
bowtie2
Download the presentation
bowtie2
manualMake a script called 05_download_ecoli_reference.sh
, and paste in the code snippet below. Use it to retrieve the reference sequence using esearch
and efetch
:
#!/usr/bin/env bash\n\nREFERENCE_DIR=~/project/ref_genome/\n\nmkdir $REFERENCE_DIR\ncd $REFERENCE_DIR\n\nesearch -db nuccore -query 'U00096' \\\n| efetch -format fasta > ecoli-strK12-MG1655.fasta\n
Exercise: Check out the documentation of bowtie2-build
, and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh
.
#!/usr/bin/env bash\n\ncd ~/project/ref_genome\n\nbowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta\n
"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","title":"Align the reads with bowtie2
","text":"Exercise: Check out the bowtie2 manual here. We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need?
AnswerAccording to the usage of bowtie2
:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | b <bam>}\n
We\u2019ll need the options:
-x
to point to our index-1
and -2
to point to our forward and reverse readsExercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh
, and run it.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome/\nALIGNED_DIR=~/project/results/alignments\n\nmkdir -p $ALIGNED_DIR\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n> $ALIGNED_DIR/SRR519926.sam\n
We\u2019ll go deeper into alignment statistics later on, but bowtie2
writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual. Can you guess what the reason could be?
After having completed this chapter you will be able to:
samtools flagstat
to get general statistics on the flags stored in a sam/bam filesamtools view
to:samtools sort
to sort an alignment file based on coordinatesamtools index
to create an index of a sorted sam/bam file|
) symbol to pipe alignments directly to samtools
to perform sorting and filteringsamtools
documentationExercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats
by using samtools flagstat
. Find the documentation here. Anything that draws your attention?
Code:
cd ~/project/results/alignments/\nsamtools flagstat SRR519926.sam > SRR519926.sam.stats\n
resulting in:
624724 + 0 in total (QC-passed reads + QC-failed reads)\n624724 + 0 primary\n0 + 0 secondary\n0 + 0 supplementary\n0 + 0 duplicates\n0 + 0 primary duplicates\n621624 + 0 mapped (99.50% : N/A)\n621624 + 0 primary mapped (99.50% : N/A)\n624724 + 0 paired in sequencing\n312362 + 0 read1\n312362 + 0 read2\n300442 + 0 properly paired (48.09% : N/A)\n619200 + 0 with itself and mate mapped\n2424 + 0 singletons (0.39% : N/A)\n0 + 0 with mate mapped to a different chr\n0 + 0 with mate mapped to a different chr (mapQ>=5)\n
Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this:
samtools stats SRR519926.sam | grep ^SN | cut -f 2,3\n
Now look at insert size average
and insert size standard deviation
. You can see the standard deviation is higher than the average, suggesting a wide distribution.
The command samtools view
is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that.
Exercise: Create a script called 08_compress_sort.sh
. Add a samtools view
command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b
and -h
options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Tip: Samtools writes to stdout
By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with >
or use the the output option -o
.
08_compress_sort.sh
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\n
By using ls -lh
, you can find out that SRR519926.sam
has a size of 264 Mb, while SRR519926.bam
is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools
as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
Exercise: Add these lines to 08_compress_sort.sh
, and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file (SRR519926.bam
) and the sorted bam file (SRR519926.sorted.bam
) with samtools view -H
. What are the differences?
Your script should like like this:
08_compress_sort.sh#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\nsamtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
samtools view -H SRR519926.bam
returns:
@HD VN:1.0 SO:unsorted\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam\n
And samtools view -H SRR519926.sorted.bam
returns:
@HD VN:1.0 SO:coordinate\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam\n@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam\n
There are two main differences:
SO
tag at @HD
type code has changed from unsorted
to coordinate
.@PG
type code for the sorting was added.Note that the command to view the header (samtools -H
) is also added to the header for both runs.
With samtools view
you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads.
Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation.
AnswerYou will need the 0x4 flag.
Filtering against unmapped reads (leaving only mapped reads) with samtools view
would look like this:
samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
or:
samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
Exercise: Generate a script called 09_extract_unmapped.sh
to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat
?
Tip
Check out the -f
and -c
options of samtools view
Your script 09_extract_unmapped.sh
should look like this:
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam\n
Counting like this:
samtools view -c SRR519926.sorted.unmapped.bam\n
This should correspond to the output of samtools flagstat
(624724 - 621624 = 3100)
samtools view
also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files.
Exercise: Generate a script called 10_extract_region.sh
to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header.
Tip: Specifying a region
Our E. coli genome has only one chromosome, because only one line starts with >
in the fasta file
cd ~/project/ref_genome\ngrep \">\" ecoli-strK12-MG1655.fasta\n
gives:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome\n
The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh \\\nSRR519926.sorted.bam \\\nU00096.3:2000000-2500000 \\\n> SRR519926.sorted.region.bam\n
"},{"location":"day2/samtools/#redirection","title":"Redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a -
. For example, you can sort and compress the output of your alignment software in a pipe like this:
my_alignment_command \\\n| samtools sort - \\\n| samtools view -bh - \\\n> alignment.bam\n
The use of -
In the modern versions of samtools, the use of -
is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry.
Exercise: Write a script called 11_align_sort.sh
that maps the reads with bowtie2 (see chapter 2 of read alignment), sorts them, and outputs them as a BAM file with a header.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome\nALIGNED_DIR=~/project/results/alignments\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n2> $ALIGNED_DIR/bowtie2_SRR519926.log \\\n| samtools sort - \\\n| samtools view -bh - \\\n> $ALIGNED_DIR/SRR519926.sorted.mapped.frompipe.bam\n
Redirecting stderr
Notice the line starting with 2>
. This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log
. This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc
.
The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc
, fastp
, samtools
and bowtie2
). You can specify a directory that contains any log files, and it will automatically search it for you.
Exercise: Run the command multiqc .
in ~/project
and checkout the generated report.
After having completed this chapter you will be able to:
Presentation on the UCSC genome browser (after the exercises):
Download the presentation
The exercises below are partly based on this tutorial from the Griffith lab.
"},{"location":"day3/igv_visualisation/#exercises","title":"Exercises","text":""},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","title":"A first glance: the E. coli dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb:
cd ~/project/results/alignments\nsamtools index SRR519926.sorted.region.bam\n
Download the alignment (SRR519926.sorted.region.bam
) together with it\u2019s index file (SRR519926.sorted.region.bam.bai
) and the reference genome (ecoli-strK12-MG1655.fasta
) to your desktop. If working with Docker
If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v
option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project
, your files will be in C:\\Users\\myusername\\ngs-course
.
.fasta
) into IGV: Genomes > Load Genome from File\u2026.bam
): File > Load from File\u2026Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways:
View the reads as pairs, by right click on the reads and select View as pairs
Exercise: There are lot of reads that are coloured red. Why is that?
If you don\u2019t find any red reads..
The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size
AnswerAccording to IGV, reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size.
Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click:
Exercise: Click on one of the reads. What kind of information is there?
AnswerMost of the information from the SAM file.
Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand.
"},{"location":"day3/igv_visualisation/#hcc1143-data-set","title":"HCC1143 data set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer.
Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000.
The BAM files containing these reads for the cancer cell line and the matched normal are:
HCC1143.normal.21.19M-20M.bam
HCC1143.normal.21.19M-20M.bam.bai
A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu:
Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam
using the file browser.
This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000
.
Navigate to the gene CHODL by typing it in the search box.
Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7
Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file.
Click on the coverage track where the SNP is:
Exercise: What is the sequence sequencing depth for that base? And the percentage T?
AnswerThe depth is 62, and 25 reads (40%) T.
Navigate to region chr21:19,800,320-19,818,162
Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker
Note
This might take a while to load.
Right click in the alignment track and select Color alignments by > insert size and pair orientation
Exercise: Why are some reads coloured white? What can be the cause of that?
AnswerThe white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3.
Navigate to region chr21:19,324,500-19,331,500
Right click in the main alignment track and select:
Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion?
AnswerThe insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses) or in your own time.
After this course, you will be able to:
License: CC BY-SA 4.0
Copyright: SIB Swiss Institute of Bioinformatics
Enrolled to the courseIndependentlyYou can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository.
Note
It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!
"},{"location":"#material","title":"Material","text":"After this course, you will be able to:
This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.
"},{"location":"#exercises","title":"Exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"#asking-questions","title":"Asking questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button:
A main source of communication will be our slack channel. Ask background questions that interest you personally at #background. During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a. This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option:
The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally.
To summarise:
After this course, you will be able to:
Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"course_schedule/","title":"Course schedule","text":"Note
Apart from the starting time the time schedule is indicative. Because we can not plan a course by the minute, in practice the time points will deviate.
"},{"location":"course_schedule/#day-1","title":"Day 1","text":"block start end subject introduction 9:00 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-2","title":"Day 2","text":"block start end subject block 1 9:00 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-3","title":"Day 3","text":"block start end subject block 1 9:00 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations"},{"location":"group_work/","title":"Group work","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people.
If working with Docker
If you are working with Docker, I assume you are working independently and therefore can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 50G of harddisk.
If online
If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.
"},{"location":"group_work/#material","title":"Material","text":"Download the presentation
"},{"location":"group_work/#roles-organisation","title":"Roles & organisation","text":"Project based learning is about learning by doing, but also about peer instruction. This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding.
Each project has tasks and questions. By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions.
In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.
"},{"location":"group_work/#working-directories","title":"Working directories","text":"Each group has access to a shared working directory. It is mounted in the root directory (/
). You can add this directory to your working space by clicking: File > Add Folder to Workspace\u2026. Then, type the path to your group directory: /group_work/groupX
(where X
is your group number).
Aim: Find variants on chromosome 20 from three samples
In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them.
You can get the data by running these commands:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz\ntar -xvf project1.tar.gz\nrm project1.tar.gz\n
"},{"location":"group_work/#tasks","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
. Make sure to include the option --detect_adapter_for_pe
. To prevent overwriting fastp.html
, specify a report filename for each sample with the option --html
. fastqc
again to see whether all adapters are gone..fai
file) with samtools faidx
. bowtie2
. At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. gatk MarkDuplicates
(see hints below).samtools merge
. Index the bam file afterwards. freebayes
to call variants. Only call variants on the region chr20:10018000-10220000
by specifying the -r
option. chr20:10,026,397-10,026,638
. multiqc
to get an overall quality report.samtools flagstat
)? Why is it important to remove them for variant analysis?chr20:10,026,397-10,026,638
. What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. You can add readgroups to the alignment file with bowtie2
with the options --rg-id
and --rg
, e.g. ($SAMPLE
is a variable containing a sample identifier):
bowtie2 \\\n-x ref.fa \\\n-1 r1.fastq.gz \\\n-2 r2.fastq.gz \\\n--rg-id $SAMPLE \\\n--rg SM:$SAMPLE \\\n
To run gatk MarkDuplicates
you will only need to specify --INPUT
and --OUTPUT
, e.g.:
gatk MarkDuplicates \\\n--INPUT sample.bam \\\n--OUTPUT sample.md.bam \\\n--METRICS_FILE sample.metrics.txt \n
"},{"location":"group_work/#project-2-long-read-genome-sequencing","title":"Project 2: Long-read genome sequencing","text":"Aim: Align long reads from RNA-seq data to a reference genome.
In this project, you will be working with data from:
Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For this project, we will work with two samples of this study, EV_2
(extracellular vesicle) and Cell_2
(whole cell). Download and unpack the data files.
Download the human reference genome like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project2.tar.gz\ntar -xvf project2.tar.gz\nrm project2.tar.gz\n
You can find the fastq files in the reads
folder and the reference genome and its annotation in the reference
folder. To reduce computational times we work with a subset of the data on a subset of the genome (chromosome 5 and X).
Important!
Stick to the principles for reproducible analysis described here
fastqc
NanoPlot
minimap2
with default parameters-x
ELOVL5
.fastqc
and NanoPlot
? How is that compared to the publication?-x
of minimap2
. Are the defaults appropriate?-x map-ont
or -x splice
. Do you see differences in the alignment in e.g. IGV?-x
?ELOVL5
sequenced in both samples?Accuracy from quality scores
Find the equation to calculate error probability from quality score on Wikipedia.
Comparing fastqc
and Nanoplot
For comparing fastqc
and NanoPlot
, check out this blog of the author of NanoPlot, and this thread.
Running minimap2
Here\u2019s an example command for minimap2
:
minimap2 \\\n-a \\\n-x [PARAMETER] \\\n[REFERENCE].fa \\\n[FASTQFILE].fastq.gz \\\n| samtools sort \\\n| samtools view -bh > [OUTPUT].bam\n
"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","title":"Project 3: Short-read RNA-seq of mice.","text":"Aim: Generate a count matrix to estimate differential gene expression.
In this project you will be working with data from:
Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6
Here\u2019s the BioProject page. Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz\ntar -xvf project3.tar.gz\nrm project3.tar.gz\n
"},{"location":"group_work/#tasks_2","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
Check the quality of the reads with fastqc
Trim the reads with fastp
Run fastqc again to see whether all adapters are gone.
Align the reads with hisat2
Have a look at the alignments of the gene Sparcl1 in IGV. For this, you can use the built-in genome (Mouse (mm10)). Do you see any evidence for differential splicing?
Run featureCounts on both alignments. Have a look at the option -Q. For further suggestions, see the hints below.
Have a look at the counts in R (find a script to get started here; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio).
Questions to address:
How is the read quality according to the fastqc reports?
How many reads were assigned to a gene by featureCounts? What is the cause of this?
What is the effect of the option -Q in featureCounts?
We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing.
command          option
bowtie2-build    --threads
hisat2-build     --threads
fastqc           --threads
cutadapt         --cores
bowtie2          --threads
hisat2           --threads
featureCounts    -T
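As an illustration of how these options are passed (the file pattern is just a placeholder):
fastqc --threads 4 *.fastq\n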
Here\u2019s some example code for hisat2
and featureCounts
. Everything in between <>
should be replaced with specific arguments.
Here\u2019s an example for hisat2
:
hisat2-build <reference_sequence_fasta> <index_basename>\n\nhisat2 \\\n-x <index_basename> \\\n-1 <forward_reads.fastq.gz> \\\n-2 <reverse_reads.fastq.gz> \\\n-p <threads> \\\n| samtools sort \\\n| samtools view -bh \\\n> <alignment_file.bam>\n
Example code featureCounts
:
featureCounts \\\n-p \\\n-T 2 \\\n-a <annotations.gtf> \\\n-o <output.counts.txt> \\\n*.bam\n
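Besides the counts table, featureCounts writes a <output.counts.txt>.summary file listing how many alignments were assigned to a gene and why the remaining ones were not; it is worth checking when you work on the questions above:
cat <output.counts.txt>.summary\n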
"},{"location":"precourse/","title":"Precourse preparations","text":""},{"location":"precourse/#unix","title":"UNIX","text":"We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here. If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial.
"},{"location":"precourse/#software","title":"Software","text":"We will be mainly working on an Amazon Web Services (AWS) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a VSCode interface. All participants will be granted access to a personal workspace to be used during the course.
The only software you need to install before the course is Integrative Genomics Viewer (IGV).
"},{"location":"day1/intro/","title":"Introduction","text":"If working independently
If you are doing this course independently, you can skip this part.
"},{"location":"day1/intro/#material","title":"Material","text":"Download the presentation
"},{"location":"day1/quality_control/","title":"Quality control","text":""},{"location":"day1/quality_control/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
run fastqc on sequence reads and interpret the results
trim adapters and low-quality bases with fastp
Download the presentation
fastqc
command line documentation
cutadapt
manualCheck out the dataset at SRA.
Exercise: Browse around the SRA entry and answer these questions:
A. Is the dataset paired-end or single end?
B. Which instrument was used for sequencing?
C. What is the read length?
D. How many reads do we have?
AnswersA. paired-end
B. Illumina MiSeq
C. 2 x 251 bp
D. 400596
Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools
. Every time you open a new terminal, you will have to load the environment:
conda activate ngs-tools\n
Make a directory reads
in ~/project
and download the reads from the SRA database using prefetch
and fastq-dump
from SRA-Tools into the reads
directory. Use the code snippet below to create a scripts called 01_download_reads.sh
. Store it in ~/project/scripts/
, and run it.
#!/usr/bin/env bash\n\ncd ~/project\nmkdir reads\ncd reads\nprefetch SRR519926\nfastq-dump --split-files SRR519926\n
Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry.
Tip
A read in a fastq file consists of four lines (more on that at file types). Use Google to figure out how to count the number of reads in a fastq file.
Answere.g. from this thread on Biostars:
## forward read\necho $(cat SRR519926_1.fastq | wc -l)/4 | bc\n\n## reverse read\necho $(cat SRR519926_2.fastq | wc -l)/4 | bc\n
"},{"location":"day1/quality_control/#run-fastqc","title":"Run fastqc","text":"Exercise: Create a script to run fastqc
and call it 02_run_fastqc.sh
. After that, run it.
Tip
fastqc
accepts multiple files as input, so you can use a wildcard to run fastqc
on all the files in one line of code. Use it like this: *.fastq
.
Your script ~/project/scripts/02_run_fastqc.sh
should look like:
#!/usr/bin/env bash\ncd ~/project/reads\n\nfastqc *.fastq\n
Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems?
Downloading files
You can download files by right-click the file and after that select Download:
Answer
There seems to be:
We can probably fix most of these issues by trimming.
"},{"location":"day1/quality_control/#trim-the-reads","title":"Trim the reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp
will use those by default. A reference for the adapter sequences can be found here.
Exercise: Check out the documentation of fastp, and the option defaults by running fastp --help
.
--qualified_quality_phred
)--unqualified_percent_limit
)--length_required
)--unpaired1
and --unpaired2
)Default 15 means phred quality >=Q15 is qualified. (int [=15])
reads shorter than length_required will be discarded, default is 15. (int [=15])
--unpaired1
and/or --unpaired2
are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
. Exercise: Complete the script below called 03_trim_reads.sh
(replace everything in between brackets []
) to run fastp
to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/
is created to write the trimmed reads.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n[QUALIFIED BASE THRESHOLD] \\\n[MINIMUM LENGTH THRESHOLD] \\\n[UNQUALIFIED PERCENTAGE LIMIT] \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
Additional options
Note that we have set the options --cut_front
and --cut_tail
that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe
is set, which ensures that adapters are detected automatically for both R1 and R2.
Your script (~/project/scripts/03_trim_reads.sh
) should look like this:
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n--qualified_quality_phred 10 \\\n--length_required 25 \\\n--unqualified_percent_limit 80 \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
The use of \\
In the script above you see that we\u2019re using \\
at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp
command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\
to ignore the newlines when executing.
Exercise: Check out the report in fastp.html
.
fastqc
, fastp
has trouble finding adapters in R2. Also, after running fastp
there doesn\u2019t seem to be much adapter left (you can double check by running fastqc
on trimmed_SRR519926_2.fastq
). How could that be? After having completed this chapter you will be able to:
Download the presentation
"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","title":"Some good practices for reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility:
01_download_reads.sh
)06_build_bowtie_index.sh
By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research.
By the end of day 2 ~/project
should look (something) like this:
.\n\u251c\u2500\u2500 alignment_output\n\u251c\u2500\u2500 reads\n\u251c\u2500\u2500 ref_genome\n\u251c\u2500\u2500 scripts\n\u2502 \u251c\u2500\u2500 01_download_reads.sh\n\u2502 \u251c\u2500\u2500 02_run_fastqc.sh\n\u2502 \u251c\u2500\u2500 03_trim_reads.sh\n\u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh\n\u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh\n\u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh\n\u2502 \u251c\u2500\u2500 07_align_reads.sh\n\u2502 \u251c\u2500\u2500 08_compress_sort.sh\n\u2502 \u251c\u2500\u2500 09_extract_unmapped.sh\n\u2502 \u251c\u2500\u2500 10_extract_region.sh\n\u2502 \u2514\u2500\u2500 11_align_sort.sh\n\u2514\u2500\u2500 trimmed_data\n
"},{"location":"day1/sequencing_technologies/","title":"Sequencing technologies","text":""},{"location":"day1/sequencing_technologies/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
Illumina sequencing by synthesis on YouTube
NEBnext library preparation poster
"},{"location":"day1/server_login/","title":"Setup","text":""},{"location":"day1/server_login/#learning-outcomes","title":"Learning outcomes","text":"Note
You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line.
After having completed this chapter you will be able to:
bash
scriptbash
Choose your platform
In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker.
If you are doing the course with a teacher, you will have to login to the remote server. Therefore choose:
If you are doing this course independently (i.e. without a teacher) choose either:
If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this:
WindowsMac/Linux.exe
file hereconda list
in the Ananconda prompt or terminal to check whether your installation has succeeded..sh
) script herebash Miniconda3-latest-Linux-x86_64.sh\n
conda list
in the Ananconda prompt or terminal to check whether your installation has succeeded.After installation, you can install the required software:
Windows/MacOSLinuxconda create -n ngs-tools\n\nconda activate ngs-tools\n\nconda install -y -c bioconda \\\n samtools \\\n bwa \\\n fastqc \\\n sra-tools \\\n bowtie2=2.4.2 \\\n hisat2=2.2.1 \\\n subread=2.0.1 \\\n entrez-direct \\\n minimap2 \\\n gatk4 \\\n freebayes \\\n multiqc \\\n fastp\n
Download ngs-tools.yml, and generate the conda environment like this:
conda env create --name ngs-tools -f ngs-tools.yml\n
Note
If that did not succeed, follow the instructions for Windows/MacOS.
This will create the conda environment ngs-tools
Activate it like so:
conda activate ngs-tools\n
After successful installation and activating the environment all the software required to do the exercises should be available.
If you are doing project 2 (long reads)
If you are doing the project 2 as part of the course, you will need to install NanoPlot
as well, using pip
:
pip install NanoPlot\n
"},{"location":"day1/server_login/#exercises","title":"Exercises","text":""},{"location":"day1/server_login/#first-login","title":"First login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002
) in your browser. This should result in the following page:
Info
The link gives you access to a web version of Visual Studio Code. This is a powerful code editor that you can also use as a local application on your computer.
Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+`. Or by clicking Application menu > Terminal > New Terminal:
For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:
"},{"location":"day1/server_login/#material","title":"Material","text":"
Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system.
In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10.
The command to run the environment required for this course looks like this (in a terminal):
Modify the script
Modify the path after -v
to the working directory on your computer before running it.
docker run \\\n--rm \\\n-p 8443:8443 \\\n-e PUID=1000 \\\n-e PGID=1000 \\\n-e DEFAULT_WORKSPACE=/config/project \\\n-v $PWD:/config/project \\\ngeertvangeest/ngs-introduction-vscode:latest\n
If this command has run successfully, navigate in your browser to http://localhost:8443.
The option -v
mounts a local directory in your computer to the directory /config/project
in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
Don\u2019t mount directly in the home dir
Don\u2019t directly mount your local directory to the home directory (/root
). This will lead to unexpected behaviour.
The part geertvangeest/ngs-introduction-vscode:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately.
Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory.
If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet:
UNIX cheat sheet
"},{"location":"day1/server_login/#make-a-new-directory","title":"Make a new directory","text":"Make a directory scripts
within ~/project
and make it your current directory.
cd ~/project\nmkdir scripts\ncd scripts\n
"},{"location":"day1/server_login/#file-permissions","title":"File permissions","text":"Generate an empty script in your newly made directory ~/project/scripts
like this:
touch new_script.sh\n
Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it.
AnswerGenerate a script as described above. The script should look like this:
#!/usr/bin/env bash\n\necho \"SIB courses are great!\"\n
Usually, you can run it like this:
./new_script.sh\n
But there\u2019s an error:
bash: ./new_script.sh: Permission denied\n
Why is there an error?
Hint
Use ls -lh new_script.sh
to check the permissions.
ls -lh new_script.sh\n
gives:
-rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
There\u2019s no x
in the permissions string. You should change at least the permissions of the user.
Make the script executable for yourself, and run it.
AnswerChange permissions:
chmod u+x new_script.sh\n
ls -lh new_script.sh
now gives:
-rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
So it should be executable:
./new_script.sh\n
More on chmod
and file permissions here.
>
and |
","text":"In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory.
ls / > ~/project/system_dirs.txt\n
The command wc -l
counts the number of lines, and can read from stdin. Make a one-liner with a pipe |
symbol to find out how many system directories and files there are.
ls / | wc -l\n
"},{"location":"day1/server_login/#variables","title":"Variables","text":"Store system_dirs.txt
as variable (like this: VAR=variable
), and use wc -l
on that variable to count the number of lines in the file.
FILE=~/project/system_dirs.txt\nwc -l $FILE\n
"},{"location":"day1/server_login/#shell-scripts","title":"shell scripts","text":"Make a shell script that automatically counts the number of system directories and files.
AnswerMake a script called e.g. current_system_dirs.sh
:
#!/usr/bin/env bash\ncd /\nls | wc -l\n
"},{"location":"day2/file_types/","title":"File types","text":""},{"location":"day2/file_types/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
File definition websites:
After having completed this chapter you will be able to:
bowtie2
Download the presentation
bowtie2
manualMake a script called 05_download_ecoli_reference.sh
, and paste in the code snippet below. Use it to retrieve the reference sequence using esearch
and efetch
:
#!/usr/bin/env bash\n\nREFERENCE_DIR=~/project/ref_genome/\n\nmkdir $REFERENCE_DIR\ncd $REFERENCE_DIR\n\nesearch -db nuccore -query 'U00096' \\\n| efetch -format fasta > ecoli-strK12-MG1655.fasta\n
Exercise: Check out the documentation of bowtie2-build
, and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh
.
#!/usr/bin/env bash\n\ncd ~/project/ref_genome\n\nbowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta\n
"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","title":"Align the reads with bowtie2
","text":"Exercise: Check out the bowtie2 manual here. We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need?
AnswerAccording to the usage of bowtie2
:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | b <bam>}\n
We\u2019ll need the options:
-x
to point to our index-1
and -2
to point to our forward and reverse readsExercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh
, and run it.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome/\nALIGNED_DIR=~/project/results/alignments\n\nmkdir -p $ALIGNED_DIR\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n> $ALIGNED_DIR/SRR519926.sam\n
We\u2019ll go deeper into alignment statistics later on, but bowtie2
writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual. Can you guess what the reason could be?
After having completed this chapter you will be able to:
samtools flagstat
to get general statistics on the flags stored in a sam/bam filesamtools view
to:samtools sort
to sort an alignment file based on coordinatesamtools index
to create an index of a sorted sam/bam file|
) symbol to pipe alignments directly to samtools
to perform sorting and filteringsamtools
documentationExercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats
by using samtools flagstat
. Find the documentation here. Anything that draws your attention?
Code:
cd ~/project/results/alignments/\nsamtools flagstat SRR519926.sam > SRR519926.sam.stats\n
resulting in:
624724 + 0 in total (QC-passed reads + QC-failed reads)\n624724 + 0 primary\n0 + 0 secondary\n0 + 0 supplementary\n0 + 0 duplicates\n0 + 0 primary duplicates\n621624 + 0 mapped (99.50% : N/A)\n621624 + 0 primary mapped (99.50% : N/A)\n624724 + 0 paired in sequencing\n312362 + 0 read1\n312362 + 0 read2\n300442 + 0 properly paired (48.09% : N/A)\n619200 + 0 with itself and mate mapped\n2424 + 0 singletons (0.39% : N/A)\n0 + 0 with mate mapped to a different chr\n0 + 0 with mate mapped to a different chr (mapQ>=5)\n
Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this:
samtools stats SRR519926.sam | grep ^SN | cut -f 2,3\n
Now look at insert size average
and insert size standard deviation
. You can see the standard deviation is higher than the average, suggesting a wide distribution.
The command samtools view
is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that.
Exercise: Create a script called 08_compress_sort.sh
. Add a samtools view
command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b
and -h
options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Tip: Samtools writes to stdout
By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with >
or use the the output option -o
.
08_compress_sort.sh
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\n
By using ls -lh
, you can find out that SRR519926.sam
has a size of 264 Mb, while SRR519926.bam
is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools
as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
Exercise: Add these lines to 08_compress_sort.sh
, and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file (SRR519926.bam
) and the sorted bam file (SRR519926.sorted.bam
) with samtools view -H
. What are the differences?
Your script should like like this:
08_compress_sort.sh#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\nsamtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
samtools view -H SRR519926.bam
returns:
@HD VN:1.0 SO:unsorted\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam\n
And samtools view -H SRR519926.sorted.bam
returns:
@HD VN:1.0 SO:coordinate\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam\n@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam\n
There are two main differences:
SO
tag at @HD
type code has changed from unsorted
to coordinate
.@PG
type code for the sorting was added.Note that the command to view the header (samtools -H
) is also added to the header for both runs.
With samtools view
you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads.
Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation.
AnswerYou will need the 0x4 flag.
Filtering against unmapped reads (leaving only mapped reads) with samtools view
would look like this:
samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
or:
samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
Exercise: Generate a script called 09_extract_unmapped.sh
to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat
?
Tip
Check out the -f
and -c
options of samtools view
Your script 09_extract_unmapped.sh
should look like this:
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam\n
Counting like this:
samtools view -c SRR519926.sorted.unmapped.bam\n
This should correspond to the output of samtools flagstat
(624724 - 621624 = 3100)
samtools view
also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files.
Exercise: Generate a script called 10_extract_region.sh
to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header.
Tip: Specifying a region
Our E. coli genome has only one chromosome, because only one line starts with >
in the fasta file
cd ~/project/ref_genome\ngrep \">\" ecoli-strK12-MG1655.fasta\n
gives:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome\n
The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh \\\nSRR519926.sorted.bam \\\nU00096.3:2000000-2500000 \\\n> SRR519926.sorted.region.bam\n
"},{"location":"day2/samtools/#redirection","title":"Redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a -
. For example, you can sort and compress the output of your alignment software in a pipe like this:
my_alignment_command \\\n| samtools sort - \\\n| samtools view -bh - \\\n> alignment.bam\n
The use of -
In the modern versions of samtools, the use of -
is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry.
Exercise: Write a script called 11_align_sort.sh
that maps the reads with bowtie2 (see chapter 2 of read alignment), sorts them, and outputs them as a BAM file with a header.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome\nALIGNED_DIR=~/project/results/alignments\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n2> $ALIGNED_DIR/bowtie2_SRR519926.log \\\n| samtools sort - \\\n| samtools view -bh - \\\n> $ALIGNED_DIR/SRR519926.sorted.mapped.frompipe.bam\n
Redirecting stderr
Notice the line starting with 2>
. This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log
. This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc
.
The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc
, fastp
, samtools
and bowtie2
). You can specify a directory that contains any log files, and it will automatically search it for you.
Exercise: Run the command multiqc .
in ~/project
and checkout the generated report.
After having completed this chapter you will be able to:
Presentation on the UCSC genome browser (after the exercises):
Download the presentation
The exercises below are partly based on this tutorial from the Griffith lab.
"},{"location":"day3/igv_visualisation/#exercises","title":"Exercises","text":""},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","title":"A first glance: the E. coli dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb:
cd ~/project/results/alignments\nsamtools index SRR519926.sorted.region.bam\n
Download the alignment (SRR519926.sorted.region.bam
) together with it\u2019s index file (SRR519926.sorted.region.bam.bai
) and the reference genome (ecoli-strK12-MG1655.fasta
) to your desktop. If working with Docker
If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v
option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project
, your files will be in C:\\Users\\myusername\\ngs-course
.
.fasta
) into IGV: Genomes > Load Genome from File\u2026.bam
): File > Load from File\u2026Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways:
View the reads as pairs, by right click on the reads and select View as pairs
Exercise: There are lot of reads that are coloured red. Why is that?
If you don\u2019t find any red reads..
The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size
AnswerAccording to IGV, reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size.
Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click:
Exercise: Click on one of the reads. What kind of information is there?
AnswerMost of the information from the SAM file.
Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand.
"},{"location":"day3/igv_visualisation/#hcc1143-data-set","title":"HCC1143 data set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer.
Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000.
The BAM files containing these reads for the cancer cell line and the matched normal are:
HCC1143.normal.21.19M-20M.bam
HCC1143.normal.21.19M-20M.bam.bai
A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu:
Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam
using the file browser.
This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000
.
Navigate to the gene CHODL by typing it in the search box.
Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7
Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file.
Click on the coverage track where the SNP is:
Exercise: What is the sequence sequencing depth for that base? And the percentage T?
AnswerThe depth is 62, and 25 reads (40%) T.
Navigate to region chr21:19,800,320-19,818,162
Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker
Note
This might take a while to load.
Right click in the alignment track and select Color alignments by > insert size and pair orientation
Exercise: Why are some reads coloured white? What can be the cause of that?
AnswerThe white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3.
Navigate to region chr21:19,324,500-19,331,500
Right click in the main alignment track and select:
Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion?
AnswerThe insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.
"}]} \ No newline at end of file diff --git a/2024.4/sitemap.xml.gz b/2024.4/sitemap.xml.gz index b4fd8b0..c5321be 100644 Binary files a/2024.4/sitemap.xml.gz and b/2024.4/sitemap.xml.gz differ