Update 2020/10/24: You can also use simplified and faster version of normalization script from here.
RPKM for RNAseq V1.3
Usage for sample input provided:
perl rpkm_script_beta.pl sample_count_test.count 2:9 28 > sample_count_test.rpkm
Description
In above example 'sample_count_test.count' file has count data from 2 to 9th column;
28th column has length of each genes calculated from Gencode GTF (Note below).
General usage:
perl rpkm_script_beta.pl input_count_file.txt ActualColumnStart:ActualColumnEnd ColumnGeneLength > OUTPUT_RPKM_FILE
ActualColumnStart = For example you have GeneID in first column and counts starts from second column. This should be '2'
ActualColumnEnd = Upto which column you need RPKM
ColumnGeneLength = Length of each gene (**NOTE below)
- Length of the gene can be obtained from Gencode GTF by following command (Successfully tested upto Gencode V19)
- Combine input_count_file.txt and YOUR_GENE_LENGTH_FILE by GeneID or First column
- Run the script over OUTPUT_ANNOTATED_COUNT_FILE
cat gencode.vXX.annotation.gtf | awk -F'\t' '{if($3=="gene") {split($9,a,";"); print a[1]"\t"$5-$4};}' | sed 's/[gene_id |"|]//g' > YOUR_GENE_LENGTH_FILE
join -j1 <(sort input_count_file.txt) <(sort YOUR_GENE_LENGTH_FILE) > OUTPUT_ANNOTATED_COUNT_FILE
perl rpkm_script_beta.pl OUTPUT_ANNOTATED_COUNT_FILE ActualColumnStart:ActualColumnEnd ColumnGeneLength > OUTPUT_ANNOTATED_RPKM_FILE
Description
ActualColumnStart = For example you have GeneID in first column and counts starts from second column. This should be '2'
ActualColumnEnd = Upto which column you need RPKM
ColumnGeneLength = Length of each gene
RPKM = (10^6 * C)/(N * L), where
C = Number of reads mapped to a gene
N = Total mapped reads in the experiment
L = gene length in base-pairs for a gene
Author: Santhilal Subhash