Skip to content

Latest commit

 

History

History
266 lines (155 loc) · 8.82 KB

README.md

File metadata and controls

266 lines (155 loc) · 8.82 KB

TAPAS Logo

Transcription Analysis Plus Alternative Splicing for RNA-SEQ

TAPAS is the workflow we use in the lab to process RNA-SEQ.


1. Quick overview

2. Create Listing File

3. Create Init File

4. Set up Tools

5. Set up Annotation Files

Quick overview

One command is used to launch all the steps :

  • Alignment with STAR to generate bigWigs for visualisation
  • Differential Expression with DESEQ2
  • Splicing Analysis with WHIPPET 10.4
  • Custom filters, formats ...
    python3 signature.py -l listofRnaSeqProject.tsv -i init_disciplus.json

You end with bed files/excel describing spliced exons between two conditions.

Two files need to be created before.

Create listing File

alt text

  • STUDY : Can be anything you want.
  • RUNACCESSION : Can be anything you want.
  • LIBRARY_LAYOUT : KeyWords to use are "PAIRED" or "SINGLE"
  • FASTQ : Absolute path to your dataset or ftp url to download automatically, if paired-end it can be data_1_fastq.gz;data_2_fastq.gz or data_R1_fastq.gz;data_R2_fastq.gz
  • TREATMENT : Anything you want
  • CELL_LINE : Anything you want
  • CONDITION : Keywords to use are "TEST" and "CONTROL"
  • REP_NUMBER : should be 1,2,3 or rep1, rep2, rep3
  • TREATMENT_DAY : A number.
  • KMER : Keywords to use are "normal" or "short". Use "normal" everytime. Short is used when read are less than 50 and you want to give a try with whippet with lowers index value for kmersize.

Column names has to be the sames as showed in the picture. The order of the columns can be changed.

Very Important

  1. You create a directory where you put your fastq files. The results will be in this directory. "Output" path in init.json must be the same as in listing.tsv in column fastq. Input and Output directories sould be the same.

  2. You also have to create a directory called FINAL in this directory.

Create Init File

You need to set up a init.json . You will set up path to softwares and other annotation files.

First of all, you need to change init.json inside config directory to set up correctly paths and files.


Under the Hood / Go Fast for Splicing

Whippet works in two commands. First you create the .psi files with whippet-quant.jl

Then you do a differential analysis using whippet-delta.jl to create a .diff file.

Finally, you apply fancy filters to keep reliable events. (P>=0.95, only CE , |dpsi| >= 0.1)

A Rscript (annotSymbol.R) is used to add the gene symbol.

/home/jean-philippe.villemin/bin/julia-6/julia /home/jean-philippe.villemin/bin/Whippet.0.10.4/bin/whippet-quant.jl -x /home/jean-philippe.v
illemin/data/data/index_whippet/julia-6/human/9kmer/index_whippet  path2/A_1.fastq.gz path2/A_2.fastq.gz -o path2/A

/home/jean-philippe.villemin/bin/julia-6/julia /home/jean-philippe.villemin/bin/Whippet.0.10.4/bin/whippet-delta.jl -a A.psi.gz,B.psi.gz,C.psi.gz -b D.psi.gz,E.psi.gz,F.psi.gz

PATH2=/somewhereYouMustConfig/
FILE=output
TYPE="CE"

/usr/bin/gawk -F "\t"  'BEGIN {OFS="\t";}  {  if ( match($1, "^(\\w+)\\.([0-9]+)", ary) && $5=="CE" && sqrt($8*$8) >= sqrt(0.1*0.1) && $9 >= 0.95) print  ary[1],$3,$4,$5,$6,$7,$8,$9,$10,$11 ; }' ${PATH2}${FILE}.diff > ${PATH2}${FILE}.clean.${TYPE}.diff
/bin/sed -i $'1 i\\\ngene\tcoordinates\tstrand\tevent\tpsiA\tpsiB\tdpsi\tprobability\tcomplexity\tentropy'  ${PATH2}${FILE}.clean.${TYPE}.diff
Rscript /home/luco/work/annotSymbol.R --organism=human --file=${PATH2}${FILE}.clean.CE.diff
/usr/bin/gawk -F ","  'BEGIN {OFS="_";}  {  print $2,$1,$4; }'  ${PATH2}${FILE}.clean.${TYPE}.diff.annoted.csv > ${PATH2}${FILE}.final.txt

If you have discrepency with this go fast method and the pipeline it's because pipeline, remove some events that are on weakly expressed gene.


Notes For Expression

Genes differentially expressed are with a | foldchange | > 1.5 and p-value-adjusted < 0.05

You will find them in directory final.

DESEQ_res_annotated_DE.csv (genes up and down significant)

DESEQ_all_res_annoted_sorted_pvalAdj.csv contains everything without filtering.

In the outpout of STAR mapper, you will also have a file for TPM computed by TPMcalulator. gene_TPM.txt.

I used the column TPMexon.


The next part explain what you need to do if you start from scratch (softs to install, files to download or create)

You will need to download some files from Gencode, create a genome Index with STAR (explain in Alignment part), and create some custom files for which scripts are provided to generate them before using the main pipeline.

An example of the init.json is showed in config directory.
In the next section, I explain all the files you need.




Set up Annotation Files

Much more of the annotation files are given in the annot directory.

Download from ensembl biomart a text file with Ensembl Gene ID & Associated Gene Name Should look as follows (you can also the one provided) :

Ensembl Gene ID	Associated Gene Name
ENSG00000252760	RNA5SP54
ENSG00000252830	5S_rRNA

Now you can set up genes_biomart_ensembl value in init.json

Download from gencode the gtf file to set up path_to_gtf.

Download fasta sequence of the genome of interest. (Here we use PRIM_ASSEMBLY from GENCODE R25 based on ENSEMBL 85 for Human and GENCODE M15 based on ENSEMBL 90 for Mouse.

Create this file with chromosome files.

conda install pyfaidx
faidx yourgenome.fasta -i chromsizes > yourgenome.genome.sizes

Then configure in init.json the value for path_to_chrom_length_ref & genomeDir with the directory where you create yourgenome.genome.sizes directory as showed in the provided example

Also you need to create index for STAR :

STAR --runMode genomeGenerate --genomeDir /home/jean-philippe.villemin/mount_archive2/commun.luco/ref/indexes/GRCm38_PRIM_GENCODE_M15/ --genomeFastaFiles /home/jean-philippe.villemin/mount_archive2/commun.luco/ref/genome/GRCm38_PRIM_GENCODE_M15/GRCm38.primary_assembly.genome.fa --runThreadN 30 --sjdbGTFfile /home/jean-philippe.villemin/mount_archive2/commun.luco/ref/genes/GRCm38_PRIM_GENCODE_M15/gencode.vM15.primary_assembly.annotation.gtf

Generate the fasta file index :

samtools faidx reference.fa 

This creates a file called reference.fa.fai

Generate the sequence genome.2bit :

faToTwoBit genome.fa genome.2bit

Download the transcripts from gencode for your genome version and then inside a dir called transcriptome for instance, you will create indexes for Salmon as follows :

salmon index -t gencode.vM15.transcripts.fa.gz -i gencode.m15.transcripts.index --type quasi -k 31  

This gives you the value for transcriptome_index in init.json

Also run this R script. It will create two files : one with genomic non-redundant exons coordinates and the other with gene length using sum of exon length. In annot, you have two zipped examples for the human Gencode Annotation version 25 (Ensembl 85) for these two files. These files are also provided in annotation dir.

  • exons.non-redundant.csv
  • gene.length.exon.sums.csv
Rscript geneSize.R -f Yourannotation.gtf

Now you can complete postitionGenomicExon and gene_length values in init.json.

NB : Andrew pointed out you need to covert exons.non-redundant.csv to a bed format.

--


Set up Tools

Check that the following tools are installed on server/computer.

Scripts available here are in Python3.
It's advised to install Conda if python3 is not set up on your computer.
It will make things easier then for installing tools or switching to older python version if needed.

To be sure python3 is there, you can call it as follows :

python3 --version

Should return something like : Python 3.5.5 :: Anaconda custom (64-bit)

Conda : here

Then you can install all tierce tools needed by the pipeline. See below.

Follow the links below and look the Set up tools paragraph for each section.

-ALIGNEMENT : here

-DIFF_EXPRESSION : here

NB : RMysql package need to be installed in R also.

-SPLICING :

JULIA 0.6 : here

	wget https://julialang-s3.julialang.org/bin/linux/x64/0.6/julia-0.6.3-linux-x86_64.tar.gz
	tar -zxvf julia-0.6.3-linux-x86_64.tar.gz

WHIPPET 10.4 : here

In julia :

	Pkg.update()
	Pkg.add("Whippet")
	using Whippet

bedtools : Bed handler here