Outputs a list of coding regions that are uncovered above a given threshold
-
COAT is a Python2 program developed in 2016 which aims to automatically find bad quality region of coding sequences in a set of individual exome sequencing data.
-
It is a module integrated to the EXome Analysis and Mining.
-
It includes a graphical interface.
-
The output of this program is a spreadsheet with all the coding regions which are uncovered. Supplementary annotation can be used for further analysis. Additional files can be generated using
--cover_output
argument and/or--track_no_cover_UCSC_output
.
BAM is a binary file format. SAM and BAM files contain the same information. These files contain mapped reads sequence from Next Generation Sequencing.
To use COAT, the following programs are necessary :
The following files are necessary :
- BAM files folder
- reference sequence annotation file
Python (version 2.6.5 to version 2.7.8). To see which version of Python you have installed, open a command prompt and run:
python --version
If you don’t have Python, type the following command to install python version 2.x:
sudo apt-get install python 2.6
If you want to use COAT graphical interface, you will need to install the python module PyQt4
apt-cache search pyqt
sudo apt-get install python-qt4
To run COAT, you must have installed samtools 1.3. If you have a previous version of samtools, COAT will not work, so download and install samtools 1.3. Unzip the downloaded file. Go into the newly created directory and compile the code by typing "make".
tar -xvjf samtools-1.3.tar.bz2
cd samtools-1.3/
make
Store your BAM files into a dedicated folder. Prefix of BAM files name is used as subject identifier (see command-line argument --subject_list
).
Samtools folder must be specified (see command-line argument --samtoolsPath
)
BAM files must be indexed. To generate Index of BAM files, use samtools line-commands:
samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
samtools index aln.sorted.bam
Reference annotated sequence is needed. This file contains gene annotation of the genome reference you want to use. You can get file for h19 reference:
This table summarizes the command-line arguments which are using by COAT.
complete flag argument | short flag | Default value | Summary |
---|---|---|---|
--output |
-o |
stdout | Output folder path. Will overwrite contents if file exists |
--position |
-p |
NA | Interval positions in chromosome of the reference sequence FORMAT: chr:start-end |
--geneName |
-g |
NA | Select region of the gene. FORMAT: name of the gene according to reference |
--cds_regions |
-cds |
TRUE | regions of exons which are not into UTR in the reference sequence |
--add_utr_regions |
-add_utr |
FALSE | Add UTR to the selected region |
--add_intron_regions |
-add_intron |
FALSE | Add intron regions to the selected region |
--bamsPath |
-bams |
NA | BAM files folder |
--subject_list |
-subject |
NA | Coma-separated list of prefix BAM files name(s) of selected subject in BAM files repertory |
--coverage_threshold |
-cov |
5 | The minimum coverage to be actually considered as covered |
--mapQuality_threshold |
-mapqual |
20 | The minimum allowable mapping quality score to be counted for coverage |
--baseQuality_trehshold |
-basequal |
0 | The minimum allowable base quality score to be counted for coverage |
--cover_output |
-covout |
FALSE | Output as a supplementary file the list of covered regions with annotation |
--track_no_cover_UCSC_output |
-track |
FALSE | Output UCSC Genome Brower custom track which contains uncovered regions |
--refPath |
-ref |
NA | Annotation file for reference e.g. hg19 |
--samtoolsPath |
-samtools |
NA | Samtools 1.3 folder path |
python2 coverageAnalysis.py \
-gene GLIS3 \
-cds \
-add_utr \
-bams /storage/bams/ \
-subject family1_1,family1_3 \
-ref refFlat_hg19.txt \
-cov 15 \
-o coverage_check_results/
Chrom | Uncovered_start | Uncovered_end | subject | gene | transcript | Exon_numb | Uncovered_length |
---|---|---|---|---|---|---|---|
Chr9 | 3937027 | 3937120 | Family1_1 | GLIS3 | NM_001042413 | 5 | 94 |
Chr9 | 3937027 | 3937074 | Family1_3 | GLIS3 | NM_001042413 | 5 | 48 |
Each row is an uncovered region.
- Chrom: chromosome.
- Uncovered_start: the reference start position of an uncovered region.
- Uncovered_end: the reference end position of an uncovered region.
- subject: the sequenced subject identifier.
- gene: name of the gene including the uncovered region.
- transcript: name of the transcript of the gene including the uncovered region.
- exon_numb: numb of the exon of the transcript of the gene including the uncovered region. This field can optionally mention UTR5’, UTR3’ and intron region.
- uncovered_length: number of nucleotides in the uncovered region.
Visualising uncovered region on gene GLIS3 of the 2 subjects from family 1 with UCSC Genome Browser. Annotation from REFSEQ have been added. In red, uncovered region.
To run in graphical mode the program :
python2 Interface/coat_HCI.pyw
This command will display the following interface :
Click on Advanced settings to access advanced settings section.
- bam files folder: path of the folder in which your BAM files and BAM.bai files are stored.
- samtools folder: path of the folder in which samtools 1.3 is stored.
- Hg reference table file: path of the reference annotated sequence.
- Results folder: output folder path. Will overwrite contents if file exists.
Save settings will save all your settings for the next time you will use COAT. Reset gives the default value for all settings.
Fill the fields with your coma-separated list of prefix BAM files name(s) of selected subject in BAM files folder.
Select the region to check by specify coordinates else or gene name. Coordinates refers to an interval positions in chromosome of the reference sequence (FORMAT: chr:start-end). By default, coverage check is running on CDS region. But you can add UTR and intron region by toggle the corresponding option.
3 thresholds are availables:
- the minimum allowable coverage depth to be considered covered (default value = 5);
- the minimum allowable mapping quality score to be counted for coverage (default value = 20);
- the minimum allowable base quality score to be counted for coverage (default value = 0).
By default, output is a list of uncovered coding regions with annotation. You can add by toggle the corresponding option: a list of covered regions with annotation; an UCSC custom track which contains uncovered regions.
When all your parameters are selected, click on launch coverage check to run COAT.