diff --git a/2024.4/index.html b/2024.4/index.html index 94a8567..1bd83f2 100644 --- a/2024.4/index.html +++ b/2024.4/index.html @@ -944,6 +944,23 @@
Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses) or in your own time.
After this course, you will be able to:
License: CC BY-SA 4.0
Copyright: SIB Swiss Institute of Bioinformatics
Enrolled to the courseIndependentlyYou can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository.
Note
It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!
"},{"location":"#material","title":"Material","text":"After this course, you will be able to:
This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.
"},{"location":"#exercises","title":"Exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"#asking-questions","title":"Asking questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button:
A main source of communication will be our slack channel. Ask background questions that interest you personally at #background. During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a. This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option:
The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally.
To summarise:
After this course, you will be able to:
Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"course_schedule/","title":"Course schedule","text":"Note
Apart from the starting time the time schedule is indicative. Because we can not plan a course by the minute, in practice the time points will deviate.
"},{"location":"course_schedule/#day-1","title":"Day 1","text":"block start end subject introduction 9:00 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-2","title":"Day 2","text":"block start end subject block 1 9:00 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-3","title":"Day 3","text":"block start end subject block 1 9:00 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations"},{"location":"group_work/","title":"Group work","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people.
If working with Docker
If you are working with Docker, I assume you are working independently and therefore can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 50G of harddisk.
If online
If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.
"},{"location":"group_work/#material","title":"Material","text":"Download the presentation
"},{"location":"group_work/#roles-organisation","title":"Roles & organisation","text":"Project based learning is about learning by doing, but also about peer instruction. This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding.
Each project has tasks and questions. By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions.
In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.
"},{"location":"group_work/#working-directories","title":"Working directories","text":"Each group has access to a shared working directory. It is mounted in the root directory (/
). You can add this directory to your working space by clicking: File > Add Folder to Workspace\u2026. Then, type the path to your group directory: /group_work/groupX
(where X
is your group number).
Aim: Find variants on chromosome 20 from three samples
In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them.
You can get the data by running these commands:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz\ntar -xvf project1.tar.gz\nrm project1.tar.gz\n
"},{"location":"group_work/#tasks","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
. Make sure to include the option --detect_adapter_for_pe
. To prevent overwriting fastp.html
, specify a report filename for each sample with the option --html
. fastqc
again to see whether all adapters are gone..fai
file) with samtools faidx
. bowtie2
. At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. gatk MarkDuplicates
(see hints below).samtools merge
. Index the bam file afterwards. freebayes
to call variants. Only call variants on the region chr20:10018000-10220000
by specifying the -r
option. chr20:10,026,397-10,026,638
. multiqc
to get an overall quality report.samtools flagstat
)? Why is it important to remove them for variant analysis?chr20:10,026,397-10,026,638
. What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. You can add readgroups to the alignment file with bowtie2
with the options --rg-id
and --rg
, e.g. ($SAMPLE
is a variable containing a sample identifier):
bowtie2 \\\n-x ref.fa \\\n-1 r1.fastq.gz \\\n-2 r2.fastq.gz \\\n--rg-id $SAMPLE \\\n--rg SM:$SAMPLE \\\n
To run gatk MarkDuplicates
you will only need to specify --INPUT
and --OUTPUT
, e.g.:
gatk MarkDuplicates \\\n--INPUT sample.bam \\\n--OUTPUT sample.md.bam \\\n--METRICS_FILE sample.metrics.txt \n
"},{"location":"group_work/#project-2-long-read-genome-sequencing","title":"Project 2: Long-read genome sequencing","text":"Aim: Align long reads from RNA-seq data to a reference genome.
In this project, you will be working with data from:
Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For this project, we will work with two samples of this study, EV_2
(extracellular vesicle) and Cell_2
(whole cell). Download and unpack the data files.
Download the human reference genome like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project2.tar.gz\ntar -xvf project2.tar.gz\nrm project2.tar.gz\n
You can find the fastq files in the reads
folder and the reference genome and its annotation in the reference
folder. To reduce computational times we work with a subset of the data on a subset of the genome (chromosome 5 and X).
Important!
Stick to the principles for reproducible analysis described here
fastqc
NanoPlot
minimap2
with default parameters-x
ELOVL5
.fastqc
and NanoPlot
? How is that compared to the publication?-x
of minimap2
. Are the defaults appropriate?-x map-ont
or -x splice
. Do you see differences in the alignment in e.g. IGV?-x
?ELOVL5
sequenced in both samples?Accuracy from quality scores
Find the equation to calculate error probability from quality score on Wikipedia.
Comparing fastqc
and Nanoplot
For comparing fastqc
and NanoPlot
, check out this blog of the author of NanoPlot, and this thread.
Running minimap2
Here\u2019s an example command for minimap2
:
minimap2 \\\n-a \\\n-x [PARAMETER] \\\n[REFERENCE].fa \\\n[FASTQFILE].fastq.gz \\\n| samtools sort \\\n| samtools view -bh > [OUTPUT].bam\n
"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","title":"Project 3: Short-read RNA-seq of mice.","text":"Aim: Generate a count matrix to estimate differential gene expression.
In this project you will be working with data from:
Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6
Here\u2019s the BioProject page. Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz\ntar -xvf project3.tar.gz\nrm project3.tar.gz\n
"},{"location":"group_work/#tasks_2","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
fastqc
again to see whether all adapters are gone.hisat2
Sparcl1
. For this, you can use the built-in genome (Mouse (mm10)). Do you see any evidence for differential splicing?featureCounts
on both alignments. Have a look at the option -Q
. For further suggestions, see the hints below. R
(find a script to get started here; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio
)fastqc
reports?featureCounts
? What is the cause of this? -Q
in featureCounts
?We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing.
command optionbowtie2-build
--threads
hisat2-build
--threads
fastqc
--threads
cutadapt
--cores
bowtie2
--threads
hisat2
--threads
featureCounts
-T
Here\u2019s some example code for hisat2
and featureCounts
. Everything in between <>
should be replaced with specific arguments.
Here\u2019s an example for hisat2
:
hisat2-build <reference_sequence_fasta> <index_basename>\n\nhisat2 \\\n-x <index_basename> \\\n-1 <foward_reads.fastq.gz> \\\n-2 <reverse_reads.fastq.gz> \\\n-p <threads> \\\n| samtools sort \\\n| samtools view -bh \\\n> <alignment_file.bam>\n
Example code featureCounts
:
featureCounts \\\n-p \\\n-T 2 \\\n-a <annotations.gtf> \\\n-o <output.counts.txt> \\\n*.bam\n
"},{"location":"precourse/","title":"Precourse preparations","text":""},{"location":"precourse/#unix","title":"UNIX","text":"We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here. If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial.
"},{"location":"precourse/#software","title":"Software","text":"We will be mainly working on an Amazon Web Services (AWS) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a VSCode interface. All participants will be granted access to a personal workspace to be used during the course.
The only software you need to install before the course is Integrative Genomics Viewer (IGV).
"},{"location":"day1/intro/","title":"Introduction","text":"If working independently
If you are doing this course independently, you can skip this part.
"},{"location":"day1/intro/#material","title":"Material","text":"Download the presentation
"},{"location":"day1/quality_control/","title":"Quality control","text":""},{"location":"day1/quality_control/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
fastqc
on sequence reads and interpret the resultsfastp
Download the presentation
fastqc
command line documentationcutadapt
manualCheck out the dataset at SRA.
Exercise: Browse around the SRA entry and answer these questions:
A. Is the dataset paired-end or single end?
B. Which instrument was used for sequencing?
C. What is the read length?
D. How many reads do we have?
AnswersA. paired-end
B. Illumina MiSeq
C. 2 x 251 bp
D. 400596
Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools
. Every time you open a new terminal, you will have to load the environment:
conda activate ngs-tools\n
Make a directory reads
in ~/project
and download the reads from the SRA database using prefetch
and fastq-dump
from SRA-Tools into the reads
directory. Use the code snippet below to create a scripts called 01_download_reads.sh
. Store it in ~/project/scripts/
, and run it.
#!/usr/bin/env bash\n\ncd ~/project\nmkdir reads\ncd reads\nprefetch SRR519926\nfastq-dump --split-files SRR519926\n
Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry.
Tip
A read in a fastq file consists of four lines (more on that at file types). Use Google to figure out how to count the number of reads in a fastq file.
Answere.g. from this thread on Biostars:
## forward read\necho $(cat SRR519926_1.fastq | wc -l)/4 | bc\n\n## reverse read\necho $(cat SRR519926_2.fastq | wc -l)/4 | bc\n
"},{"location":"day1/quality_control/#run-fastqc","title":"Run fastqc","text":"Exercise: Create a script to run fastqc
and call it 02_run_fastqc.sh
. After that, run it.
Tip
fastqc
accepts multiple files as input, so you can use a wildcard to run fastqc
on all the files in one line of code. Use it like this: *.fastq
.
Your script ~/project/scripts/02_run_fastqc.sh
should look like:
#!/usr/bin/env bash\ncd ~/project/reads\n\nfastqc *.fastq\n
Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems?
Downloading files
You can download files by right-click the file and after that select Download:
Answer
There seems to be:
We can probably fix most of these issues by trimming.
"},{"location":"day1/quality_control/#trim-the-reads","title":"Trim the reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp
will use those by default. A reference for the adapter sequences can be found here.
Exercise: Check out the documentation of fastp, and the option defaults by running fastp --help
.
--qualified_quality_phred
)--unqualified_percent_limit
)--length_required
)--unpaired1
and --unpaired2
)Default 15 means phred quality >=Q15 is qualified. (int [=15])
reads shorter than length_required will be discarded, default is 15. (int [=15])
--unpaired1
and/or --unpaired2
are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
. Exercise: Complete the script below called 03_trim_reads.sh
(replace everything in between brackets []
) to run fastp
to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/
is created to write the trimmed reads.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n[QUALIFIED BASE THRESHOLD] \\\n[MINIMUM LENGTH THRESHOLD] \\\n[UNQUALIFIED PERCENTAGE LIMIT] \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
Additional options
Note that we have set the options --cut_front
and --cut_tail
that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe
is set, which ensures that adapters are detected automatically for both R1 and R2.
Your script (~/project/scripts/03_trim_reads.sh
) should look like this:
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n--qualified_quality_phred 10 \\\n--length_required 25 \\\n--unqualified_percent_limit 80 \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
The use of \\
In the script above you see that we\u2019re using \\
at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp
command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\
to ignore the newlines when executing.
Exercise: Check out the report in fastp.html
.
fastqc
, fastp
has trouble finding adapters in R2. Also, after running fastp
there doesn\u2019t seem to be much adapter left (you can double check by running fastqc
on trimmed_SRR519926_2.fastq
). How could that be? After having completed this chapter you will be able to:
Download the presentation
"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","title":"Some good practices for reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility:
01_download_reads.sh
)06_build_bowtie_index.sh
By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research.
By the end of day 2 ~/project
should look (something) like this:
.\n\u251c\u2500\u2500 alignment_output\n\u251c\u2500\u2500 reads\n\u251c\u2500\u2500 ref_genome\n\u251c\u2500\u2500 scripts\n\u2502 \u251c\u2500\u2500 01_download_reads.sh\n\u2502 \u251c\u2500\u2500 02_run_fastqc.sh\n\u2502 \u251c\u2500\u2500 03_trim_reads.sh\n\u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh\n\u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh\n\u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh\n\u2502 \u251c\u2500\u2500 07_align_reads.sh\n\u2502 \u251c\u2500\u2500 08_compress_sort.sh\n\u2502 \u251c\u2500\u2500 09_extract_unmapped.sh\n\u2502 \u251c\u2500\u2500 10_extract_region.sh\n\u2502 \u2514\u2500\u2500 11_align_sort.sh\n\u2514\u2500\u2500 trimmed_data\n
"},{"location":"day1/sequencing_technologies/","title":"Sequencing technologies","text":""},{"location":"day1/sequencing_technologies/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
Illumina sequencing by synthesis on YouTube
NEBnext library preparation poster
"},{"location":"day1/server_login/","title":"Setup","text":""},{"location":"day1/server_login/#learning-outcomes","title":"Learning outcomes","text":"Note
You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line.
After having completed this chapter you will be able to:
bash
scriptbash
Choose your platform
In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker.
If you are doing the course with a teacher, you will have to login to the remote server. Therefore choose:
If you are doing this course independently (i.e. without a teacher) choose either:
If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this:
WindowsMac/Linux.exe
file hereconda list
in the Ananconda prompt or terminal to check whether your installation has succeeded..sh
) script herebash Miniconda3-latest-Linux-x86_64.sh\n
conda list
in the Ananconda prompt or terminal to check whether your installation has succeeded.After installation, you can install the required software:
Windows/MacOSLinuxconda create -n ngs-tools\n\nconda activate ngs-tools\n\nconda install -y -c bioconda \\\n samtools \\\n bwa \\\n fastqc \\\n sra-tools \\\n bowtie2=2.4.2 \\\n hisat2=2.2.1 \\\n subread=2.0.1 \\\n entrez-direct \\\n minimap2 \\\n gatk4 \\\n freebayes \\\n multiqc \\\n fastp\n
Download ngs-tools.yml, and generate the conda environment like this:
conda env create --name ngs-tools -f ngs-tools.yml\n
Note
If that did not succeed, follow the instructions for Windows/MacOS.
This will create the conda environment ngs-tools
Activate it like so:
conda activate ngs-tools\n
After successful installation and activating the environment all the software required to do the exercises should be available.
If you are doing project 2 (long reads)
If you are doing the project 2 as part of the course, you will need to install NanoPlot
as well, using pip
:
pip install NanoPlot\n
"},{"location":"day1/server_login/#exercises","title":"Exercises","text":""},{"location":"day1/server_login/#first-login","title":"First login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002
) in your browser. This should result in the following page:
Info
The link gives you access to a web version of Visual Studio Code. This is a powerful code editor that you can also use as a local application on your computer.
Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+`. Or by clicking Application menu > Terminal > New Terminal:
For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:
"},{"location":"day1/server_login/#material","title":"Material","text":"
Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system.
In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10.
The command to run the environment required for this course looks like this (in a terminal):
Modify the script
Modify the path after -v
to the working directory on your computer before running it.
docker run \\\n--rm \\\n-p 8443:8443 \\\n-e PUID=1000 \\\n-e PGID=1000 \\\n-e DEFAULT_WORKSPACE=/config/project \\\n-v $PWD:/config/project \\\ngeertvangeest/ngs-introduction-vscode:latest\n
If this command has run successfully, navigate in your browser to http://localhost:8443.
The option -v
mounts a local directory in your computer to the directory /config/project
in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
Don\u2019t mount directly in the home dir
Don\u2019t directly mount your local directory to the home directory (/root
). This will lead to unexpected behaviour.
The part geertvangeest/ngs-introduction-vscode:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately.
Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory.
If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet:
UNIX cheat sheet
"},{"location":"day1/server_login/#make-a-new-directory","title":"Make a new directory","text":"Make a directory scripts
within ~/project
and make it your current directory.
cd ~/project\nmkdir scripts\ncd scripts\n
"},{"location":"day1/server_login/#file-permissions","title":"File permissions","text":"Generate an empty script in your newly made directory ~/project/scripts
like this:
touch new_script.sh\n
Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it.
AnswerGenerate a script as described above. The script should look like this:
#!/usr/bin/env bash\n\necho \"SIB courses are great!\"\n
Usually, you can run it like this:
./new_script.sh\n
But there\u2019s an error:
bash: ./new_script.sh: Permission denied\n
Why is there an error?
Hint
Use ls -lh new_script.sh
to check the permissions.
ls -lh new_script.sh\n
gives:
-rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
There\u2019s no x
in the permissions string. You should change at least the permissions of the user.
Make the script executable for yourself, and run it.
AnswerChange permissions:
chmod u+x new_script.sh\n
ls -lh new_script.sh
now gives:
-rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
So it should be executable:
./new_script.sh\n
More on chmod
and file permissions here.
>
and |
","text":"In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory.
ls / > ~/project/system_dirs.txt\n
The command wc -l
counts the number of lines, and can read from stdin. Make a one-liner with a pipe |
symbol to find out how many system directories and files there are.
ls / | wc -l\n
"},{"location":"day1/server_login/#variables","title":"Variables","text":"Store system_dirs.txt
as variable (like this: VAR=variable
), and use wc -l
on that variable to count the number of lines in the file.
FILE=~/project/system_dirs.txt\nwc -l $FILE\n
"},{"location":"day1/server_login/#shell-scripts","title":"shell scripts","text":"Make a shell script that automatically counts the number of system directories and files.
AnswerMake a script called e.g. current_system_dirs.sh
:
#!/usr/bin/env bash\ncd /\nls | wc -l\n
"},{"location":"day2/file_types/","title":"File types","text":""},{"location":"day2/file_types/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
File definition websites:
After having completed this chapter you will be able to:
bowtie2
Download the presentation
bowtie2
manualMake a script called 05_download_ecoli_reference.sh
, and paste in the code snippet below. Use it to retrieve the reference sequence using esearch
and efetch
:
#!/usr/bin/env bash\n\nREFERENCE_DIR=~/project/ref_genome/\n\nmkdir $REFERENCE_DIR\ncd $REFERENCE_DIR\n\nesearch -db nuccore -query 'U00096' \\\n| efetch -format fasta > ecoli-strK12-MG1655.fasta\n
Exercise: Check out the documentation of bowtie2-build
, and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh
.
#!/usr/bin/env bash\n\ncd ~/project/ref_genome\n\nbowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta\n
"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","title":"Align the reads with bowtie2
","text":"Exercise: Check out the bowtie2 manual here. We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need?
AnswerAccording to the usage of bowtie2
:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | b <bam>}\n
We\u2019ll need the options:
-x
to point to our index-1
and -2
to point to our forward and reverse readsExercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh
, and run it.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome/\nALIGNED_DIR=~/project/results/alignments\n\nmkdir -p $ALIGNED_DIR\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n> $ALIGNED_DIR/SRR519926.sam\n
We\u2019ll go deeper into alignment statistics later on, but bowtie2
writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual. Can you guess what the reason could be?
After having completed this chapter you will be able to:
samtools flagstat
to get general statistics on the flags stored in a sam/bam filesamtools view
to:samtools sort
to sort an alignment file based on coordinatesamtools index
to create an index of a sorted sam/bam file|
) symbol to pipe alignments directly to samtools
to perform sorting and filteringsamtools
documentationExercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats
by using samtools flagstat
. Find the documentation here. Anything that draws your attention?
Code:
cd ~/project/results/alignments/\nsamtools flagstat SRR519926.sam > SRR519926.sam.stats\n
resulting in:
624724 + 0 in total (QC-passed reads + QC-failed reads)\n624724 + 0 primary\n0 + 0 secondary\n0 + 0 supplementary\n0 + 0 duplicates\n0 + 0 primary duplicates\n621624 + 0 mapped (99.50% : N/A)\n621624 + 0 primary mapped (99.50% : N/A)\n624724 + 0 paired in sequencing\n312362 + 0 read1\n312362 + 0 read2\n300442 + 0 properly paired (48.09% : N/A)\n619200 + 0 with itself and mate mapped\n2424 + 0 singletons (0.39% : N/A)\n0 + 0 with mate mapped to a different chr\n0 + 0 with mate mapped to a different chr (mapQ>=5)\n
Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this:
samtools stats SRR519926.sam | grep ^SN | cut -f 2,3\n
Now look at insert size average
and insert size standard deviation
. You can see the standard deviation is higher than the average, suggesting a wide distribution.
The command samtools view
is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that.
Exercise: Create a script called 08_compress_sort.sh
. Add a samtools view
command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b
and -h
options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Tip: Samtools writes to stdout
By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with >
or use the the output option -o
.
08_compress_sort.sh
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\n
By using ls -lh
, you can find out that SRR519926.sam
has a size of 264 Mb, while SRR519926.bam
is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools
as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
Exercise: Add these lines to 08_compress_sort.sh
, and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file (SRR519926.bam
) and the sorted bam file (SRR519926.sorted.bam
) with samtools view -H
. What are the differences?
Your script should like like this:
08_compress_sort.sh#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\nsamtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
samtools view -H SRR519926.bam
returns:
@HD VN:1.0 SO:unsorted\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam\n
And samtools view -H SRR519926.sorted.bam
returns:
@HD VN:1.0 SO:coordinate\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam\n@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam\n
There are two main differences:
SO
tag at @HD
type code has changed from unsorted
to coordinate
.@PG
type code for the sorting was added.Note that the command to view the header (samtools -H
) is also added to the header for both runs.
With samtools view
you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads.
Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation.
AnswerYou will need the 0x4 flag.
Filtering against unmapped reads (leaving only mapped reads) with samtools view
would look like this:
samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
or:
samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
Exercise: Generate a script called 09_extract_unmapped.sh
to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat
?
Tip
Check out the -f
and -c
options of samtools view
Your script 09_extract_unmapped.sh
should look like this:
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam\n
Counting like this:
samtools view -c SRR519926.sorted.unmapped.bam\n
This should correspond to the output of samtools flagstat
(624724 - 621624 = 3100)
samtools view
also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files.
Exercise: Generate a script called 10_extract_region.sh
to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header.
Tip: Specifying a region
Our E. coli genome has only one chromosome, because only one line starts with >
in the fasta file
cd ~/project/ref_genome\ngrep \">\" ecoli-strK12-MG1655.fasta\n
gives:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome\n
The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh \\\nSRR519926.sorted.bam \\\nU00096.3:2000000-2500000 \\\n> SRR519926.sorted.region.bam\n
"},{"location":"day2/samtools/#redirection","title":"Redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a -
. For example, you can sort and compress the output of your alignment software in a pipe like this:
my_alignment_command \\\n| samtools sort - \\\n| samtools view -bh - \\\n> alignment.bam\n
The use of -
In the modern versions of samtools, the use of -
is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry.
Exercise: Write a script called 11_align_sort.sh
that maps the reads with bowtie2 (see chapter 2 of read alignment), sorts them, and outputs them as a BAM file with a header.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome\nALIGNED_DIR=~/project/results/alignments\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n2> $ALIGNED_DIR/bowtie2_SRR519926.log \\\n| samtools sort - \\\n| samtools view -bh - \\\n> $ALIGNED_DIR/SRR519926.sorted.mapped.frompipe.bam\n
Redirecting stderr
Notice the line starting with 2>
. This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log
. This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc
.
The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc
, fastp
, samtools
and bowtie2
). You can specify a directory that contains any log files, and it will automatically search it for you.
Exercise: Run the command multiqc .
in ~/project
and checkout the generated report.
After having completed this chapter you will be able to:
Presentation on the UCSC genome browser (after the exercises):
Download the presentation
The exercises below are partly based on this tutorial from the Griffith lab.
"},{"location":"day3/igv_visualisation/#exercises","title":"Exercises","text":""},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","title":"A first glance: the E. coli dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb:
cd ~/project/results/alignments\nsamtools index SRR519926.sorted.region.bam\n
Download the alignment (SRR519926.sorted.region.bam
) together with it\u2019s index file (SRR519926.sorted.region.bam.bai
) and the reference genome (ecoli-strK12-MG1655.fasta
) to your desktop. If working with Docker
If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v
option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project
, your files will be in C:\\Users\\myusername\\ngs-course
.
.fasta
) into IGV: Genomes > Load Genome from File\u2026.bam
): File > Load from File\u2026Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways:
View the reads as pairs, by right click on the reads and select View as pairs
Exercise: There are lot of reads that are coloured red. Why is that?
If you don\u2019t find any red reads..
The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size
AnswerAccording to IGV, reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size.
Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click:
Exercise: Click on one of the reads. What kind of information is there?
AnswerMost of the information from the SAM file.
Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand.
"},{"location":"day3/igv_visualisation/#hcc1143-data-set","title":"HCC1143 data set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer.
Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000.
The BAM files containing these reads for the cancer cell line and the matched normal are:
HCC1143.normal.21.19M-20M.bam
HCC1143.normal.21.19M-20M.bam.bai
A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu:
Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam
using the file browser.
This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000
.
Navigate to the gene CHODL by typing it in the search box.
Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7
Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file.
Click on the coverage track where the SNP is:
Exercise: What is the sequence sequencing depth for that base? And the percentage T?
AnswerThe depth is 62, and 25 reads (40%) T.
Navigate to region chr21:19,800,320-19,818,162
Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker
Note
This might take a while to load.
Right click in the alignment track and select Color alignments by > insert size and pair orientation
Exercise: Why are some reads coloured white? What can be the cause of that?
AnswerThe white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3.
Navigate to region chr21:19,324,500-19,331,500
Right click in the main alignment track and select:
Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion?
AnswerThe insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Home","text":"Here you can find the course material for the SIB course \u2018NGS - Quality control, Alignment, Visualisation\u2019. You can follow this course enrolled (check out upcoming training courses) or in your own time.
After this course, you will be able to:
License: CC BY-SA 4.0
Copyright: SIB Swiss Institute of Bioinformatics
Enrolled to the courseIndependentlyYou can do this course completely independently without a teacher. To do the exercises, we will set things up locally with a Docker container. If there any issues, use the issues page on our github repository.
Note
It might take us a while to respond to issues. Therefore, first check if a similar issue already exists, and/or try to fix it yourself. There\u2019s a lot of documentation/fora/threads on the web!
"},{"location":"#material","title":"Material","text":"After this course, you will be able to:
This course will consist of lectures, exercises and polls. During exercises, you are free to discuss with other participants. During lectures, focus on the lecture only.
"},{"location":"#exercises","title":"Exercises","text":"Each block has practical work involved. Some more than others. The practicals are subdivided into chapters, and we\u2019ll have a (short) discussion after each chapter. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"#asking-questions","title":"Asking questions","text":"During lectures, you are encouraged to raise your hand if you have questions (if in-person), or use the Zoom functionality (if online). Use the \u2018Reactions\u2019 button:
A main source of communication will be our slack channel. Ask background questions that interest you personally at #background. During the exercises, e.g. if you are stuck or don\u2019t understand what is going on, use the slack channel #q-and-a. This channel is not only meant for asking questions but also for answering questions of other participants. If you are replying to a question, use the \u201creply in thread\u201d option:
The teacher will review the answers, and add/modify if necessary. If you\u2019re really stuck and need specific tutor support, write the teachers or helpers personally.
To summarise:
After this course, you will be able to:
Each block has practical work involved. Some more than others. All answers to the practicals are incorporated, but they are hidden. Do the exercise first by yourself, before checking out the answer. If your answer is different from the answer in the practicals, try to figure out why they are different.
"},{"location":"course_schedule/","title":"Course schedule","text":"Note
Apart from the starting time the time schedule is indicative. Because we can not plan a course by the minute, in practice the time points will deviate.
"},{"location":"course_schedule/#day-1","title":"Day 1","text":"block start end subject introduction 9:00 AM 9:30 AM Introduction block 1 9:30 AM 10:30 AM Sequencing technologies 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Setup + Reproducibility 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Quality control 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-2","title":"Day 2","text":"block start end subject block 1 9:00 AM 10:30 AM Read alignment 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM File types 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Samtools 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Group work"},{"location":"course_schedule/#day-3","title":"Day 3","text":"block start end subject block 1 9:00 AM 10:30 PM IGV and visualisation 10:30 AM 11:00 AM BREAK block 2 11:00 AM 12:30 PM Group work 12:30 PM 1:30 PM BREAK block 3 1:30 PM 3:00 PM Group work 3:00 PM 3:30 PM BREAK block 4 3:30 PM 5:15 PM Presentations"},{"location":"group_work/","title":"Group work","text":"The last part of this course will consist of project-based-learning. This means that you will work in groups on a single question. We will split up into groups of five people.
If working with Docker
If you are working with Docker, I assume you are working independently and therefore can not work in a group. However, you can test your skills with these real biological datasets. Realize that the datasets and calculations are (much) bigger compared to the exercises, so check if your computer is up for it. You\u2019ll probably need around 4 cores, 16G of RAM and 50G of harddisk.
If online
If the course takes place online, we will use break-out rooms to communicate within groups. Please stay in the break-out room during the day, also if you are working individually.
"},{"location":"group_work/#material","title":"Material","text":"Download the presentation
"},{"location":"group_work/#roles-organisation","title":"Roles & organisation","text":"Project based learning is about learning by doing, but also about peer instruction. This means that you will be both a learner and a teacher. There will be differences in levels among participants, but because of that, some will learn efficiently from people that have just learned, and others will teach and increase their understanding.
Each project has tasks and questions. By performing the tasks, you should be able to answer the questions. You should consider the tasks and questions as a guidance. If interesting questions pop up during the project, you are encouraged to work on those. Also, you don\u2019t have to perform all the tasks and answer all the questions.
In the afternoon of day 1, you will start on the project. On day 3, you can work on the project in the morning and in the first part of the afternoon. We will conclude the projects with a 10-minute presentation of each group.
"},{"location":"group_work/#working-directories","title":"Working directories","text":"Each group has access to a shared working directory. It is mounted in the root directory (/
). You can add this directory to your working space by clicking: File > Add Folder to Workspace\u2026. Then, type the path to your group directory: /group_work/groupX
(where X
is your group number).
Aim: Find variants on chromosome 20 from three samples
In this project you will be working with Illumina reads from three samples: a father, mother and a child. You will perform quality control, align the reads, mark duplicates, detect variants and visualize them.
You can get the data by running these commands:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project1.tar.gz\ntar -xvf project1.tar.gz\nrm project1.tar.gz\n
"},{"location":"group_work/#tasks","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
. Make sure to include the option --detect_adapter_for_pe
. To prevent overwriting fastp.html
, specify a report filename for each sample with the option --html
. fastqc
again to see whether all adapters are gone..fai
file) with samtools faidx
. bowtie2
. At the same time add readgroups to the aligned reads (see hints below). Make sure you end up with an indexed and sorted bam file. gatk MarkDuplicates
(see hints below).samtools merge
. Index the bam file afterwards. freebayes
to call variants. Only call variants on the region chr20:10018000-10220000
by specifying the -r
option. chr20:10,026,397-10,026,638
. multiqc
to get an overall quality report.samtools flagstat
)? Why is it important to remove them for variant analysis?chr20:10,026,397-10,026,638
. What are the genotypes of the three samples according to freebayes? Is this according to what you see in the alignments? If the alternative alleles are present in the same individual, are they in phase or in repulsion? Note: you can also load vcf files in IGV. You can add readgroups to the alignment file with bowtie2
with the options --rg-id
and --rg
, e.g. ($SAMPLE
is a variable containing a sample identifier):
bowtie2 \\\n-x ref.fa \\\n-1 r1.fastq.gz \\\n-2 r2.fastq.gz \\\n--rg-id $SAMPLE \\\n--rg SM:$SAMPLE \\\n
To run gatk MarkDuplicates
you will only need to specify --INPUT
and --OUTPUT
, e.g.:
gatk MarkDuplicates \\\n--INPUT sample.bam \\\n--OUTPUT sample.md.bam \\\n--METRICS_FILE sample.metrics.txt \n
"},{"location":"group_work/#project-2-long-read-genome-sequencing","title":"Project 2: Long-read genome sequencing","text":"Aim: Align long reads from RNA-seq data to a reference genome.
In this project, you will be working with data from:
Padilla, Juan-Carlos A., Seda Barutcu, Ludovic Malet, Gabrielle Deschamps-Francoeur, Virginie Calderon, Eunjeong Kwon, and Eric L\u00e9cuyer. \u201cProfiling the Polyadenylated Transcriptome of Extracellular Vesicles with Long-Read Nanopore Sequencing.\u201d BMC Genomics 24, no. 1 (September 22, 2023): 564. https://doi.org/10.1186/s12864-023-09552-6.
The authors used RNA sequencing with Oxford Nanopore Technology of both extracellular vesicles and whole cells from cell culture. For this project, we will work with two samples of this study, EV_2
(extracellular vesicle) and Cell_2
(whole cell). Download and unpack the data files.
Download the human reference genome like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project2.tar.gz\ntar -xvf project2.tar.gz\nrm project2.tar.gz\n
You can find the fastq files in the reads
folder and the reference genome and its annotation in the reference
folder. To reduce computational times we work with a subset of the data on a subset of the genome (chromosome 5 and X).
Important!
Stick to the principles for reproducible analysis described here
fastqc
NanoPlot
minimap2
with default parameters-x
ELOVL5
.fastqc
and NanoPlot
? How is that compared to the publication?-x
of minimap2
. Are the defaults appropriate?-x map-ont
or -x splice
. Do you see differences in the alignment in e.g. IGV?-x
?ELOVL5
sequenced in both samples?Accuracy from quality scores
Find the equation to calculate error probability from quality score on Wikipedia.
Comparing fastqc
and Nanoplot
For comparing fastqc
and NanoPlot
, check out this blog of the author of NanoPlot, and this thread.
Running minimap2
Here\u2019s an example command for minimap2
:
minimap2 \\\n-a \\\n-x [PARAMETER] \\\n[REFERENCE].fa \\\n[FASTQFILE].fastq.gz \\\n| samtools sort \\\n| samtools view -bh > [OUTPUT].bam\n
"},{"location":"group_work/#project-3-short-read-rna-seq-of-mice","title":"Project 3: Short-read RNA-seq of mice.","text":"Aim: Generate a count matrix to estimate differential gene expression.
In this project you will be working with data from:
Singhania A, Graham CM, Gabry\u0161ov\u00e1 L, Moreira-Teixeira L, Stavropoulos E, Pitt JM, et al (2019). Transcriptional profiling unveils type I and II interferon networks in blood and tissues across diseases. Nat Commun. 10:1\u201321. https://doi.org/10.1038/s41467-019-10601-6
Here\u2019s the BioProject page. Since the mouse genome is rather large, we have prepared reads for you that originate from chromosome 5. Use those for the project. Download them like this:
wget https://ngs-introduction-training.s3.eu-central-1.amazonaws.com/project3.tar.gz\ntar -xvf project3.tar.gz\nrm project3.tar.gz\n
"},{"location":"group_work/#tasks_2","title":"Tasks","text":"Important!
Stick to the principles for reproducible analysis described here
fastqc
fastp
fastqc
again to see whether all adapters are gone.hisat2
Sparcl1
. For this, you can use the built-in genome (Mouse (mm10)). Do you see any evidence for differential splicing?featureCounts
on both alignments. Have a look at the option -Q
. For further suggestions, see the hints below. R
(find a script to get started here; Rstudio server is running on the same machine. Approach it with your credentials and username rstudio
)fastqc
reports?featureCounts
? What is the cause of this? -Q
in featureCounts
?We are now doing computations on a full genome, with full transcriptomic data. This is quite a bit more than we have used during the exercises. Therefore, computations take longer. However, most tools support parallel processing, in which you can specify how many cores you want to use to run in parallel. Your environment contains four cores, so this is also the maximum number of processes you can specify. Below you can find the options used in each command to specify multi-core processing.
command optionbowtie2-build
--threads
hisat2-build
--threads
fastqc
--threads
cutadapt
--cores
bowtie2
--threads
hisat2
--threads
featureCounts
-T
Here\u2019s some example code for hisat2
and featureCounts
. Everything in between <>
should be replaced with specific arguments.
Here\u2019s an example for hisat2
:
hisat2-build <reference_sequence_fasta> <index_basename>\n\nhisat2 \\\n-x <index_basename> \\\n-1 <foward_reads.fastq.gz> \\\n-2 <reverse_reads.fastq.gz> \\\n-p <threads> \\\n| samtools sort \\\n| samtools view -bh \\\n> <alignment_file.bam>\n
Example code featureCounts
:
featureCounts \\\n-p \\\n-T 2 \\\n-a <annotations.gtf> \\\n-o <output.counts.txt> \\\n*.bam\n
"},{"location":"precourse/","title":"Precourse preparations","text":""},{"location":"precourse/#unix","title":"UNIX","text":"We expect participants to have a basic understanding of working with the command line on UNIX-based systems. You can test your UNIX skills with a quiz here. If you don\u2019t have experience with UNIX command line, or if you\u2019re unsure whether you meet the prerequisites, follow our online UNIX tutorial.
"},{"location":"precourse/#software","title":"Software","text":"We will be mainly working on an Amazon Web Services (AWS) Elastic Cloud (EC2) server. Our Ubuntu server behaves like a \u2018normal\u2019 remote server, and can be approached through a VSCode interface. All participants will be granted access to a personal workspace to be used during the course.
The only software you need to install before the course is Integrative Genomics Viewer (IGV).
"},{"location":"day1/intro/","title":"Introduction","text":"If working independently
If you are doing this course independently, you can skip this part.
"},{"location":"day1/intro/#material","title":"Material","text":"Download the presentation
"},{"location":"day1/quality_control/","title":"Quality control","text":""},{"location":"day1/quality_control/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
fastqc
on sequence reads and interpret the resultsfastp
Download the presentation
fastqc
command line documentationcutadapt
manualCheck out the dataset at SRA.
Exercise: Browse around the SRA entry and answer these questions:
A. Is the dataset paired-end or single end?
B. Which instrument was used for sequencing?
C. What is the read length?
D. How many reads do we have?
AnswersA. paired-end
B. Illumina MiSeq
C. 2 x 251 bp
D. 400596
Now we will use some bioinformatics tools to do download reads and perform quality control. The tools are pre-installed in a conda environment called ngs-tools
. Every time you open a new terminal, you will have to load the environment:
conda activate ngs-tools\n
Make a directory reads
in ~/project
and download the reads from the SRA database using prefetch
and fastq-dump
from SRA-Tools into the reads
directory. Use the code snippet below to create a scripts called 01_download_reads.sh
. Store it in ~/project/scripts/
, and run it.
#!/usr/bin/env bash\n\ncd ~/project\nmkdir reads\ncd reads\nprefetch SRR519926\nfastq-dump --split-files SRR519926\n
Exercise: Check whether the download was successful by counting the number of reads in the fastq files and compare it to the SRA entry.
Tip
A read in a fastq file consists of four lines (more on that at file types). Use Google to figure out how to count the number of reads in a fastq file.
Answere.g. from this thread on Biostars:
## forward read\necho $(cat SRR519926_1.fastq | wc -l)/4 | bc\n\n## reverse read\necho $(cat SRR519926_2.fastq | wc -l)/4 | bc\n
"},{"location":"day1/quality_control/#run-fastqc","title":"Run fastqc","text":"Exercise: Create a script to run fastqc
and call it 02_run_fastqc.sh
. After that, run it.
Tip
fastqc
accepts multiple files as input, so you can use a wildcard to run fastqc
on all the files in one line of code. Use it like this: *.fastq
.
Your script ~/project/scripts/02_run_fastqc.sh
should look like:
#!/usr/bin/env bash\ncd ~/project/reads\n\nfastqc *.fastq\n
Exercise: Download the html files to your local computer, and view the results. How is the quality? Where are the problems?
Downloading files
You can download files by right-click the file and after that select Download:
Answer
There seems to be:
We can probably fix most of these issues by trimming.
"},{"location":"day1/quality_control/#trim-the-reads","title":"Trim the reads","text":"We will use fastp for trimming adapters and low quality bases from our reads. The most used adapters for Illumina are TruSeq adapters, and fastp
will use those by default. A reference for the adapter sequences can be found here.
Exercise: Check out the documentation of fastp, and the option defaults by running fastp --help
.
--qualified_quality_phred
)--unqualified_percent_limit
)--length_required
)--unpaired1
and --unpaired2
)Default 15 means phred quality >=Q15 is qualified. (int [=15])
reads shorter than length_required will be discarded, default is 15. (int [=15])
--unpaired1
and/or --unpaired2
are not specified: for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
. Exercise: Complete the script below called 03_trim_reads.sh
(replace everything in between brackets []
) to run fastp
to trim the data. The quality of our dataset is not great, so we will overwrite the defaults. Use a a minimum qualified base quality of 10, set the maximum percentage of unqalified bases to 80% and a minimum read length of 25. Note that a new directory called ~/project/results/trimmed/
is created to write the trimmed reads.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n[QUALIFIED BASE THRESHOLD] \\\n[MINIMUM LENGTH THRESHOLD] \\\n[UNQUALIFIED PERCENTAGE LIMIT] \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
Additional options
Note that we have set the options --cut_front
and --cut_tail
that will ensure low quality bases are trimmed in a sliding window from both the 5\u2019 and 3\u2019 ends. Also --detect_adapter_for_pe
is set, which ensures that adapters are detected automatically for both R1 and R2.
Your script (~/project/scripts/03_trim_reads.sh
) should look like this:
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREADS_DIR=~/project/reads\n\nmkdir -p $TRIMMED_DIR\n\ncd $TRIMMED_DIR\n\nfastp \\\n-i $READS_DIR/SRR519926_1.fastq \\\n-I $READS_DIR/SRR519926_2.fastq \\\n-o $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-O $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n--qualified_quality_phred 10 \\\n--length_required 25 \\\n--unqualified_percent_limit 80 \\\n--cut_front \\\n--cut_tail \\\n--detect_adapter_for_pe\n
The use of \\
In the script above you see that we\u2019re using \\
at the end of many lines. We use it to tell bash to ignore the newlines. If we would not do it, the fastp
command would become a very long line, and the script would become very difficult to read. It is in general good practice to put every option of a long command on a newline in your script and use \\
to ignore the newlines when executing.
Exercise: Check out the report in fastp.html
.
fastqc
, fastp
has trouble finding adapters in R2. Also, after running fastp
there doesn\u2019t seem to be much adapter left (you can double check by running fastqc
on trimmed_SRR519926_2.fastq
). How could that be? After having completed this chapter you will be able to:
Download the presentation
"},{"location":"day1/reproducibility/#some-good-practices-for-reproducibility","title":"Some good practices for reproducibility","text":"During today and tomorrow we will work with a small E. coli dataset to practice quality control, alignment and alignment filtering. You can consider this as a small project. During the exercise you will be guided to adhere to the following basic principles for reproducibility:
01_download_reads.sh
)06_build_bowtie_index.sh
By adhering to these simple principles it will be relatively straightforward to re-do your analysis steps only based on the scripts, and will get you started to adhere to the Ten Simple Rules for Reproducible Computational Research.
By the end of day 2 ~/project
should look (something) like this:
.\n\u251c\u2500\u2500 alignment_output\n\u251c\u2500\u2500 reads\n\u251c\u2500\u2500 ref_genome\n\u251c\u2500\u2500 scripts\n\u2502 \u251c\u2500\u2500 01_download_reads.sh\n\u2502 \u251c\u2500\u2500 02_run_fastqc.sh\n\u2502 \u251c\u2500\u2500 03_trim_reads.sh\n\u2502 \u251c\u2500\u2500 04_run_fastqc_trimmed.sh\n\u2502 \u251c\u2500\u2500 05_download_ecoli_reference.sh\n\u2502 \u251c\u2500\u2500 06_build_bowtie_index.sh\n\u2502 \u251c\u2500\u2500 07_align_reads.sh\n\u2502 \u251c\u2500\u2500 08_compress_sort.sh\n\u2502 \u251c\u2500\u2500 09_extract_unmapped.sh\n\u2502 \u251c\u2500\u2500 10_extract_region.sh\n\u2502 \u2514\u2500\u2500 11_align_sort.sh\n\u2514\u2500\u2500 trimmed_data\n
"},{"location":"day1/sequencing_technologies/","title":"Sequencing technologies","text":""},{"location":"day1/sequencing_technologies/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
Illumina sequencing by synthesis on YouTube
NEBnext library preparation poster
"},{"location":"day1/server_login/","title":"Setup","text":""},{"location":"day1/server_login/#learning-outcomes","title":"Learning outcomes","text":"Note
You might already be able to do some or all of these learning outcomes. If so, you can go through the corresponding exercises quickly. The general aim of this chapter is to work comfortably on a remote server by using the command line.
After having completed this chapter you will be able to:
bash
scriptbash
Choose your platform
In this part we will show you how to access the cloud server, or setup your computer to do the exercises with conda or with Docker.
If you are doing the course with a teacher, you will have to login to the remote server. Therefore choose:
If you are doing this course independently (i.e. without a teacher) choose either:
If you have a conda installation on your local computer, you can install the required software using conda. If not, you can install Miniconda like this:
WindowsMac/Linux.exe
file hereconda list
in the Ananconda prompt or terminal to check whether your installation has succeeded..sh
) script herebash Miniconda3-latest-Linux-x86_64.sh\n
conda list
in the Ananconda prompt or terminal to check whether your installation has succeeded.After installation, you can install the required software:
Windows/MacOSLinuxconda create -n ngs-tools\n\nconda activate ngs-tools\n\nconda install -y -c bioconda \\\n samtools \\\n bwa \\\n fastqc \\\n sra-tools \\\n bowtie2=2.4.2 \\\n hisat2=2.2.1 \\\n subread=2.0.1 \\\n entrez-direct \\\n minimap2 \\\n gatk4 \\\n freebayes \\\n multiqc \\\n fastp\n
Download ngs-tools.yml, and generate the conda environment like this:
conda env create --name ngs-tools -f ngs-tools.yml\n
Note
If that did not succeed, follow the instructions for Windows/MacOS.
This will create the conda environment ngs-tools
Activate it like so:
conda activate ngs-tools\n
After successful installation and activating the environment all the software required to do the exercises should be available.
If you are doing project 2 (long reads)
If you are doing the project 2 as part of the course, you will need to install NanoPlot
as well, using pip
:
pip install NanoPlot\n
"},{"location":"day1/server_login/#exercises","title":"Exercises","text":""},{"location":"day1/server_login/#first-login","title":"First login","text":"If you are participating in this course with a teacher, you have received a link and a password. Copy-paste the link (including the port, e.g.: http://12.345.678.91:10002
) in your browser. This should result in the following page:
Info
The link gives you access to a web version of Visual Studio Code. This is a powerful code editor that you can also use as a local application on your computer.
Type in the password that was provided to you by the teacher. Now let\u2019s open the terminal. You can do that with Ctrl+`. Or by clicking Application menu > Terminal > New Terminal:
For a.o. efficiency and reproducibility it makes sense to execute your commands from a script. With use of the \u2018new file\u2019 button:
"},{"location":"day1/server_login/#material","title":"Material","text":"
Docker can be used to run an entire isolated environment in a container. This means that we can run the software with all its dependencies required for this course locally in your computer. Independent of your operating system.
In the video below there\u2019s a tutorial on how to set up a docker container for this course. Note that you will need administrator rights, and that if you are using Windows, you need the latest version of Windows 10.
The command to run the environment required for this course looks like this (in a terminal):
Modify the script
Modify the path after -v
to the working directory on your computer before running it.
docker run \\\n--rm \\\n-p 8443:8443 \\\n-e PUID=1000 \\\n-e PGID=1000 \\\n-e DEFAULT_WORKSPACE=/config/project \\\n-v $PWD:/config/project \\\ngeertvangeest/ngs-introduction-vscode:latest\n
If this command has run successfully, navigate in your browser to http://localhost:8443.
The option -v
mounts a local directory in your computer to the directory /config/project
in the docker container. In that way, you have files available both in the container and on your computer. Use this directory on your computer to e.g. visualise data with IGV. Change the first path to a path on your computer that you want to use as a working directory.
Don\u2019t mount directly in the home dir
Don\u2019t directly mount your local directory to the home directory (/root
). This will lead to unexpected behaviour.
The part geertvangeest/ngs-introduction-vscode:latest
is the image we are going to load into the container. The image contains all the information about software and dependencies needed for this course. When you run this command for the first time it will download the image. Once it\u2019s on your computer, it will start immediately.
Most bioinformatics software are UNIX based and are executed through the CLI. When working with NGS data, it is therefore convenient to improve your knowledge on UNIX. For this course, we need basic understanding of UNIX CLI, so here are some exercises to refresh your memory.
If you need some reminders of the commands, here\u2019s a link to a UNIX command line cheat sheet:
UNIX cheat sheet
"},{"location":"day1/server_login/#make-a-new-directory","title":"Make a new directory","text":"Make a directory scripts
within ~/project
and make it your current directory.
cd ~/project\nmkdir scripts\ncd scripts\n
"},{"location":"day1/server_login/#file-permissions","title":"File permissions","text":"Generate an empty script in your newly made directory ~/project/scripts
like this:
touch new_script.sh\n
Add a command to this script that writes \u201cSIB courses are great!\u201d (or something you can better relate to.. ) to stdout, and try to run it.
AnswerGenerate a script as described above. The script should look like this:
#!/usr/bin/env bash\n\necho \"SIB courses are great!\"\n
Usually, you can run it like this:
./new_script.sh\n
But there\u2019s an error:
bash: ./new_script.sh: Permission denied\n
Why is there an error?
Hint
Use ls -lh new_script.sh
to check the permissions.
ls -lh new_script.sh\n
gives:
-rw-r--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
There\u2019s no x
in the permissions string. You should change at least the permissions of the user.
Make the script executable for yourself, and run it.
AnswerChange permissions:
chmod u+x new_script.sh\n
ls -lh new_script.sh
now gives:
-rwxr--r-- 1 user group 51B Nov 11 16:21 new_script.sh\n
So it should be executable:
./new_script.sh\n
More on chmod
and file permissions here.
>
and |
","text":"In the root directory (go there like this: cd /
) there are a range of system directories and files. Write the names of all directories and files to a file called system_dirs.txt
in your working directory.
ls / > ~/project/system_dirs.txt\n
The command wc -l
counts the number of lines, and can read from stdin. Make a one-liner with a pipe |
symbol to find out how many system directories and files there are.
ls / | wc -l\n
"},{"location":"day1/server_login/#variables","title":"Variables","text":"Store system_dirs.txt
as variable (like this: VAR=variable
), and use wc -l
on that variable to count the number of lines in the file.
FILE=~/project/system_dirs.txt\nwc -l $FILE\n
"},{"location":"day1/server_login/#shell-scripts","title":"shell scripts","text":"Make a shell script that automatically counts the number of system directories and files.
AnswerMake a script called e.g. current_system_dirs.sh
:
#!/usr/bin/env bash\ncd /\nls | wc -l\n
"},{"location":"day2/file_types/","title":"File types","text":""},{"location":"day2/file_types/#learning-outcomes","title":"Learning outcomes","text":"After having completed this chapter you will be able to:
Download the presentation
File definition websites:
After having completed this chapter you will be able to:
bowtie2
Download the presentation
bowtie2
manualMake a script called 05_download_ecoli_reference.sh
, and paste in the code snippet below. Use it to retrieve the reference sequence using esearch
and efetch
:
#!/usr/bin/env bash\n\nREFERENCE_DIR=~/project/ref_genome/\n\nmkdir $REFERENCE_DIR\ncd $REFERENCE_DIR\n\nesearch -db nuccore -query 'U00096' \\\n| efetch -format fasta > ecoli-strK12-MG1655.fasta\n
Exercise: Check out the documentation of bowtie2-build
, and build the indexed reference genome with bowtie2 using default options. Do that with a script called 06_build_bowtie_index.sh
.
#!/usr/bin/env bash\n\ncd ~/project/ref_genome\n\nbowtie2-build ecoli-strK12-MG1655.fasta ecoli-strK12-MG1655.fasta\n
"},{"location":"day2/read_alignment/#align-the-reads-with-bowtie2","title":"Align the reads with bowtie2
","text":"Exercise: Check out the bowtie2 manual here. We are going to align the sequences in paired-end mode. What are the options we\u2019ll minimally need?
AnswerAccording to the usage of bowtie2
:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | --sra-acc <acc> | b <bam>}\n
We\u2019ll need the options:
-x
to point to our index-1
and -2
to point to our forward and reverse readsExercise: Try to understand what the script below does. After that copy it to a script called 07_align_reads.sh
, and run it.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome/\nALIGNED_DIR=~/project/results/alignments\n\nmkdir -p $ALIGNED_DIR\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n> $ALIGNED_DIR/SRR519926.sam\n
We\u2019ll go deeper into alignment statistics later on, but bowtie2
writes already some statistics to stdout. General alignment rates seem okay, but there are quite some non-concordant alignments. That doesn\u2019t sound good. Check out the explanation about concordance at the bowtie2 manual. Can you guess what the reason could be?
After having completed this chapter you will be able to:
samtools flagstat
to get general statistics on the flags stored in a sam/bam filesamtools view
to:samtools sort
to sort an alignment file based on coordinatesamtools index
to create an index of a sorted sam/bam file|
) symbol to pipe alignments directly to samtools
to perform sorting and filteringsamtools
documentationExercise: Write the statistics of the E. coli alignment to file called SRR519926.sam.stats
by using samtools flagstat
. Find the documentation here. Anything that draws your attention?
Code:
cd ~/project/results/alignments/\nsamtools flagstat SRR519926.sam > SRR519926.sam.stats\n
resulting in:
624724 + 0 in total (QC-passed reads + QC-failed reads)\n624724 + 0 primary\n0 + 0 secondary\n0 + 0 supplementary\n0 + 0 duplicates\n0 + 0 primary duplicates\n621624 + 0 mapped (99.50% : N/A)\n621624 + 0 primary mapped (99.50% : N/A)\n624724 + 0 paired in sequencing\n312362 + 0 read1\n312362 + 0 read2\n300442 + 0 properly paired (48.09% : N/A)\n619200 + 0 with itself and mate mapped\n2424 + 0 singletons (0.39% : N/A)\n0 + 0 with mate mapped to a different chr\n0 + 0 with mate mapped to a different chr (mapQ>=5)\n
Of the reads, 47.87% is properly paired. The rest isn\u2019t. Proper pairing is quite hard to interpret. It usually means that the 0x2 flag (each segment properly aligned according to the aligner) is false. In this case it means that the insert size is high for a lot of sequences. That is because the insert size distribution is very wide. You can find info on insert size distribution like this:
samtools stats SRR519926.sam | grep ^SN | cut -f 2,3\n
Now look at insert size average
and insert size standard deviation
. You can see the standard deviation is higher than the average, suggesting a wide distribution.
The command samtools view
is very versatile. It takes an alignment file and writes a filtered or processed alignment to the output. You can for example use it to compress your SAM file into a BAM file. Let\u2019s start with that.
Exercise: Create a script called 08_compress_sort.sh
. Add a samtools view
command to compress our SAM file into a BAM file and include the header in the output. For this, use the -b
and -h
options. Find the required documentation here. How much was the disk space reduced by compressing the file?
Tip: Samtools writes to stdout
By default, samtools writes it\u2019s output to stdout. This means that you need to redirect your output to a file with >
or use the the output option -o
.
08_compress_sort.sh
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\n
By using ls -lh
, you can find out that SRR519926.sam
has a size of 264 Mb, while SRR519926.bam
is only 77 Mb. To look up specific alignments, it is convenient to have your alignment file indexed. An indexing can be compared to a kind of \u2018phonebook\u2019 of your sequence alignment file. Indexing is done with samtools
as well, but it first needs to be sorted on coordinate (i.e. the alignment location). You can do it like this:
samtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
Exercise: Add these lines to 08_compress_sort.sh
, and re-run te script in order to generate the sorted bam file. After that checkout the headers of the unsorted bam file (SRR519926.bam
) and the sorted bam file (SRR519926.sorted.bam
) with samtools view -H
. What are the differences?
Your script should like like this:
08_compress_sort.sh#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh SRR519926.sam > SRR519926.bam\nsamtools sort SRR519926.bam > SRR519926.sorted.bam\nsamtools index SRR519926.sorted.bam\n
samtools view -H SRR519926.bam
returns:
@HD VN:1.0 SO:unsorted\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools view -H SRR519926.bam\n
And samtools view -H SRR519926.sorted.bam
returns:
@HD VN:1.0 SO:coordinate\n@SQ SN:U00096.3 LN:4641652\n@PG ID:bowtie2 PN:bowtie2 VN:2.4.2 CL:\"/opt/conda/envs/ngs-tools/bin/bowtie2-align-s --wrapper basic-0 -x /config/project/ref_genome//ecoli-strK12-MG1655.fasta -1 /config/project/trimmed_data/trimmed_SRR519926_1.fastq -2 /config/project/trimmed_data/trimmed_SRR519926_2.fastq\"\n@PG ID:samtools PN:samtools PP:bowtie2 VN:1.12 CL:samtools view -bh SRR519926.sam\n@PG ID:samtools.1 PN:samtools PP:samtools VN:1.12 CL:samtools sort SRR519926.bam\n@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.12 CL:samtools view -H SRR519926.sorted.bam\n
There are two main differences:
SO
tag at @HD
type code has changed from unsorted
to coordinate
.@PG
type code for the sorting was added.Note that the command to view the header (samtools -H
) is also added to the header for both runs.
With samtools view
you can easily filter your alignment file based on flags. One thing that might be sensible to do at some point is to filter out unmapped reads.
Exercise: Check out the flag that you would need to filter for mapped reads. It\u2019s at page 7 of the SAM documentation.
AnswerYou will need the 0x4 flag.
Filtering against unmapped reads (leaving only mapped reads) with samtools view
would look like this:
samtools view -bh -F 0x4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
or:
samtools view -bh -F 4 SRR519926.sorted.bam > SRR519926.sorted.mapped.bam\n
Exercise: Generate a script called 09_extract_unmapped.sh
to get only the unmapped reads (so the opposite of the example). How many reads are in there? Is that the same as what we expect based on the output of samtools flagstat
?
Tip
Check out the -f
and -c
options of samtools view
Your script 09_extract_unmapped.sh
should look like this:
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh -f 0x4 SRR519926.sorted.bam > SRR519926.sorted.unmapped.bam\n
Counting like this:
samtools view -c SRR519926.sorted.unmapped.bam\n
This should correspond to the output of samtools flagstat
(624724 - 621624 = 3100)
samtools view
also enables you to filter alignments in a specific region. This can be convenient if you don\u2019t want to work with huge alignment files and if you\u2019re only interested in alignments in a particular region. Region filtering only works for sorted and indexed alignment files.
Exercise: Generate a script called 10_extract_region.sh
to filter our sorted and indexed BAM file for the region between 2000 and 2500 kb, and output it as a BAM file with a header.
Tip: Specifying a region
Our E. coli genome has only one chromosome, because only one line starts with >
in the fasta file
cd ~/project/ref_genome\ngrep \">\" ecoli-strK12-MG1655.fasta\n
gives:
>U00096.3 Escherichia coli str. K-12 substr. MG1655, complete genome\n
The part after the first space in the title is cut off for the alignment reference. So the code for specifying a region would be: U00096.3:START-END
#!/usr/bin/env bash\n\ncd ~/project/results/alignments\n\nsamtools view -bh \\\nSRR519926.sorted.bam \\\nU00096.3:2000000-2500000 \\\n> SRR519926.sorted.region.bam\n
"},{"location":"day2/samtools/#redirection","title":"Redirection","text":"Samtools is easy to use in a pipe. In this case you can replace the input file with a -
. For example, you can sort and compress the output of your alignment software in a pipe like this:
my_alignment_command \\\n| samtools sort - \\\n| samtools view -bh - \\\n> alignment.bam\n
The use of -
In the modern versions of samtools, the use of -
is not needed for most cases, so without an input file it reads from stdin. However, if you\u2019re not sure, it\u2019s better to be safe than sorry.
Exercise: Write a script called 11_align_sort.sh
that maps the reads with bowtie2 (see chapter 2 of read alignment), sorts them, and outputs them as a BAM file with a header.
#!/usr/bin/env bash\n\nTRIMMED_DIR=~/project/results/trimmed\nREFERENCE_DIR=~/project/ref_genome\nALIGNED_DIR=~/project/results/alignments\n\nbowtie2 \\\n-x $REFERENCE_DIR/ecoli-strK12-MG1655.fasta \\\n-1 $TRIMMED_DIR/trimmed_SRR519926_1.fastq \\\n-2 $TRIMMED_DIR/trimmed_SRR519926_2.fastq \\\n2> $ALIGNED_DIR/bowtie2_SRR519926.log \\\n| samtools sort - \\\n| samtools view -bh - \\\n> $ALIGNED_DIR/SRR519926.sorted.mapped.frompipe.bam\n
Redirecting stderr
Notice the line starting with 2>
. This redirects standard error to a file: $ALIGNED_DIR/bowtie2_SRR519926.log
. This file now contains the bowtie2 logs, that can later be re-read or used in e.g. multiqc
.
The software MultiQC is great for creating summaries out of log files and reports from many different bioinformatic tools (including fastqc
, fastp
, samtools
and bowtie2
). You can specify a directory that contains any log files, and it will automatically search it for you.
Exercise: Run the command multiqc .
in ~/project
and checkout the generated report.
After having completed this chapter you will be able to:
Presentation on the UCSC genome browser (after the exercises):
Download the presentation
The exercises below are partly based on this tutorial from the Griffith lab.
"},{"location":"day3/igv_visualisation/#exercises","title":"Exercises","text":""},{"location":"day3/igv_visualisation/#a-first-glance-the-e-coli-dataset","title":"A first glance: the E. coli dataset","text":"Index the alignment that was filtered for the region between 2000 and 2500 kb:
cd ~/project/results/alignments\nsamtools index SRR519926.sorted.region.bam\n
Download the alignment (SRR519926.sorted.region.bam
) together with it\u2019s index file (SRR519926.sorted.region.bam.bai
) and the reference genome (ecoli-strK12-MG1655.fasta
) to your desktop. If working with Docker
If you are working with Docker, you can find the files in the working directory that you mounted to the docker container (with the -v
option). So if you have used -v C:\\Users\\myusername\\ngs-course:/root/project
, your files will be in C:\\Users\\myusername\\ngs-course
.
.fasta
) into IGV: Genomes > Load Genome from File\u2026.bam
): File > Load from File\u2026Zoom in into the region U00096.3:2046000-2048000. You can do this in two ways:
View the reads as pairs, by right click on the reads and select View as pairs
Exercise: There are lot of reads that are coloured red. Why is that?
If you don\u2019t find any red reads..
The default setting is to color reads by insert size. However, if you\u2019ve used IGV before, that might have changed. To color according to insert size: right click on the reads, and select: Color alignments by > insert size
AnswerAccording to IGV, reads are coloured red if the insert size is larger than expected. As you remember, this dataset has a very large variation in insert size.
Modify the popup text behaviour by clicking on the yellow balloon to Show Details on Click:
Exercise: Click on one of the reads. What kind of information is there?
AnswerMost of the information from the SAM file.
Colour the alignment by pair orientation by right clicking on the reads, and click Color alignments by > read strand.
"},{"location":"day3/igv_visualisation/#hcc1143-data-set","title":"HCC1143 data set","text":"For this part, we will be using publicly available Illumina sequence data generated for the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer.
Sequence reads were aligned to version GRCh37 of the human reference genome. We will be working with subsets of aligned reads in the region: chromosome 21: 19,000,000 - 20,000,000.
The BAM files containing these reads for the cancer cell line and the matched normal are:
HCC1143.normal.21.19M-20M.bam
HCC1143.normal.21.19M-20M.bam.bai
A lot of model-organism genomes are built-in IGV. Select the human genome version hg19 from the drop down menu:
Select File > Load from File\u2026 from the main menu and select the BAM file HCC1143.normal.21.19M-20M.bam
using the file browser.
This BAM file only contains data for a 1 Megabase region of chromosome 21. Let\u2019s navigate there to see what genes this region covers. To do so, navigate to chr21:19,000,000-20,000,000
.
Navigate to the gene CHODL by typing it in the search box.
Load the dbsnp annotations by clicking File > Load From Server\u2026 > Annotations > Variation and Repeats > dbSNP 1.4.7
Like you did with the gene (i.e. by typing it in the search box), navigate to SNP rs3827160 that is annotated in the loaded file.
Click on the coverage track where the SNP is:
Exercise: What is the sequence sequencing depth for that base? And the percentage T?
AnswerThe depth is 62, and 25 reads (40%) T.
Navigate to region chr21:19,800,320-19,818,162
Load repeat tracks by selecting File > Load from Server\u2026 from the main menu and then select Annotations > Variation and Repeats > Repeat Masker
Note
This might take a while to load.
Right click in the alignment track and select Color alignments by > insert size and pair orientation
Exercise: Why are some reads coloured white? What can be the cause of that?
AnswerThe white coloured reads have a map quality of 0 (click on the read to find the mapping quality). The cause of that is a LINE repeat region called L1PA3.
Navigate to region chr21:19,324,500-19,331,500
Right click in the main alignment track and select:
Exercise: What is the insert size of the red flagged read pairs? Can you estimate the size of the deletion?
AnswerThe insert size is about 2.8 kb. This includes the reads. The deletion should be about 2.5 - 2.6 kb.
"}]} \ No newline at end of file diff --git a/2024.4/sitemap.xml.gz b/2024.4/sitemap.xml.gz index b4fd8b0..c5321be 100644 Binary files a/2024.4/sitemap.xml.gz and b/2024.4/sitemap.xml.gz differ