2.1. Automated Pipeline Starting with FASTQ or FASTA files
Ah, so you're starting from the beginning? Fantastic! This pipeline will run STAR and ALLSorts for you.
ALLSorts runs on hg19 (I know), so we need references related to that.
GTF - ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
FASTA - ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
Now ungzip both somewhere memorable!
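The download and decompression step above could be scripted roughly as follows. This is only a sketch: `REF_DIR`, `ref_path` and `fetch_reference` are names made up for illustration, and it assumes `wget` and `gunzip` are available on your system.

```shell
#!/usr/bin/env bash
# Sketch: fetch the two Ensembl GRCh37 references and decompress them.
# REF_DIR, ref_path and fetch_reference are illustrative names, not part of ALLSorts.

REF_DIR="${REF_DIR:-$PWD/references}"   # assumed location, pick somewhere memorable
GTF_URL="ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz"
FASTA_URL="ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz"

# Print the path the decompressed copy of a URL will have inside REF_DIR.
ref_path() {
    local name
    name="$(basename "$1")"
    printf '%s/%s\n' "$REF_DIR" "${name%.gz}"
}

# Download one reference into REF_DIR and decompress it (gunzip removes the .gz file).
fetch_reference() {
    local url="$1"
    mkdir -p "$REF_DIR" &&
        wget -P "$REF_DIR" "$url" &&
        gunzip "$REF_DIR/$(basename "$url")"
}
```

Calling `fetch_reference "$GTF_URL"` and `fetch_reference "$FASTA_URL"` performs the download; `ref_path "$GTF_URL"` then prints the path to hand to STAR.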
We need to align our fastq/fasta reads to a reference genome, so let's build a STAR genome index! The parameters below should do, but feel free to adjust them to make it work for other hg19 projects, so long as the fasta and gtf files remain the same.
# --limitGenomeGenerateRAM: 64000000000 is 64 GB. Choose as appropriate for your environment.
# --runThreadN: choose as appropriate for your environment.
STAR --runMode genomeGenerate \
    --genomeDir /path/to/desired/output/ \
    --limitGenomeGenerateRAM 64000000000 \
    --runThreadN 8 \
    --sjdbGTFfile /path/to/Homo_sapiens.GRCh37.87.chr.gtf \
    --genomeFastaFiles /path/to/Homo_sapiens.GRCh37.dna.primary_assembly.fa
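Once genome generation finishes, a quick sanity check on the output directory can save a failed alignment later. The sketch below assumes STAR has written its usual index files (`Genome`, `SA`, `SAindex`, `genomeParameters.txt`, `chrName.txt`) into the `--genomeDir` folder; the list is not exhaustive and `star_index_ok` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: confirm a STAR genome directory looks complete.
# The file list is an assumption about typical STAR output, not an exhaustive check.

star_index_ok() {
    local genome_dir="$1" f status=0
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        # -s is true only when the file exists and is not empty
        if [[ ! -s "$genome_dir/$f" ]]; then
            echo "missing or empty: $genome_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `star_index_ok /path/to/desired/output/ && echo "index looks complete"`.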
To install ALLSorts, just follow the instructions at https://github.com/Oshlack/ALLSorts/wiki.
Ok, ALLSorts and the prerequisites above have been installed? You're good to go!
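If you want to double check before launching anything, a small preflight like the one below reports which tools are not on your `PATH`. It is a sketch only: `missing_tools` is a hypothetical helper, and it checks just the two executables this page calls directly (`STAR` and `bpipe`).

```shell
#!/usr/bin/env bash
# Sketch: list any required executables that cannot be found on PATH.
# missing_tools is an illustrative helper, not part of ALLSorts.

missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

missing="$(missing_tools STAR bpipe)"
if [[ -n "$missing" ]]; then
    # Join the one-per-line list into a single line for the message
    echo "Not found on PATH: ${missing//$'\n'/ }" >&2
else
    echo "STAR and bpipe found on PATH."
fi
```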
The pipeline can be run with the command below. Note the parameter descriptions that follow it:
bpipe -p results=$results -p threads=$threads -p a_mem=$mem -p type=$type -p strand=$strand -p format=$format -p genome_dir=$genome_dir $COUNTSDIR/counts.groovy $fasta
Feel free to make these environment variables (I tend to) or just directly insert them into the command line snippet above.
$results = /path/to/desired/output
$threads = 8 # choose as appropriate
$mem = 64000000000 # 64GB - choose as appropriate
$type = "fasta" or "fastq" # choose as appropriate for your input
$strand = "yes" or "no" or "reverse" # No and Reverse will be the two most used (no = unstranded, reverse = stranded typically)
$format = the pattern describing your input fastq/fasta file names, as per bpipe's input spec. A brief example would be an input like /path/to/sample1_R1.fastq.gz and /path/to/sample1_R2.fastq.gz being represented by: format = /%_R.fastq.gz. This will use sample1 as the branch name.
$genome_dir = /path/to/Genome/ # the output from the STAR genome generation step provided earlier
$COUNTSDIR/counts.groovy should be the path /your/allsorts/clone/path/tools/counts/counts.groovy
$fasta = the path to your fasta/fastq files. This can be something as simple as /path/to/fastq/*.fastq.gz, so long as the format parameter is set correctly.
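One way to keep the parameters above in variables is a small wrapper script like this sketch. All paths are placeholders, the `format` value is borrowed from the test command on this page, and `check_choice`, `inputs` and `DRY_RUN` are illustrative names rather than anything bpipe or ALLSorts defines. By default it only prints the assembled command so you can review it.

```shell
#!/usr/bin/env bash
# Sketch: hold the pipeline parameters in variables, validate them, then run bpipe.
# check_choice, inputs and DRY_RUN are illustrative; every path is a placeholder.

results="/path/to/desired/output"
threads=8
mem=$((64 * 1000 * 1000 * 1000))     # 64 GB expressed in bytes
type="fasta"                         # "fasta" or "fastq"
strand="no"                          # "yes", "no" or "reverse"
format="*/%_*.fasta.gz"              # pattern borrowed from the test command
genome_dir="/path/to/Genome/"
COUNTSDIR="/your/allsorts/clone/path/tools/counts"
inputs=(/path/to/fasta/*.fasta.gz)   # stays a literal pattern if nothing matches

# Succeed only if the value is one of the allowed choices.
check_choice() {
    local name="$1" value="$2" allowed
    shift 2
    for allowed in "$@"; do
        if [[ "$value" == "$allowed" ]]; then
            return 0
        fi
    done
    echo "Invalid $name '$value'; expected one of: $*" >&2
    return 1
}

check_choice type "$type" fasta fastq || exit 1
check_choice strand "$strand" yes no reverse || exit 1

# Build the command as an array so values with glob characters stay intact.
cmd=(bpipe
    -p "results=$results"
    -p "threads=$threads"
    -p "a_mem=$mem"
    -p "type=$type"
    -p "strand=$strand"
    -p "format=$format"
    -p "genome_dir=$genome_dir"
    "$COUNTSDIR/counts.groovy"
    "${inputs[@]}")

if [[ "${DRY_RUN:-1}" == "1" ]]; then
    printf '%q ' "${cmd[@]}"         # print the shell-quoted command for review
    printf '\n'
else
    "${cmd[@]}"                      # actually launch the pipeline
fi
```

Run it with `DRY_RUN=0` once the printed command looks right.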
If you have set up your prerequisite tools correctly, the test command below should output a result fairly quickly! Just change the parameters as suitable for your environment.
bpipe -p results=/output/path/ -p threads=8 -p a_mem=64000000000 -p type="fasta" -p strand="no" -p format="*/%_*.fasta.gz" -p genome_dir="/path/to/Genome/" $COUNTSDIR/counts.groovy /your/allsorts/clone/path/tests/fastq/*.fasta.gz
The output will just be a collection of predictions. The test input is not a real sample, just a garbled mess of counts.
Please report any issues at https://github.com/Oshlack/ALLSorts/issues!