Query multiple vcf files simultaneously using sample identity, position(s), genotypes or other vcf data.
Grid-crawler uses a grid file to query multiple vcf files simultaneously by wrapping bcftools view and merge. Each query can either be processed sequentially via SHELL or submitted via an SBATCH script to SLURM and processed in parallel using xargs.
$ perl grid-crawler.pl grid-file.yaml -s sampleID_1 -s sampleID_2 -s sampleID_3 -p 1,X:200050-2000600,22:456793-456793 -g ^miss -e "DV>5 && INFO/SGB>3 " -env shell -sen source activate mip4.0 -ot b
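The parallel mode can be pictured with the general xargs pattern below. This is a minimal sketch, not grid-crawler's generated SBATCH script: the helper name run_parallel, the job limit of 2 and the echo commands standing in for real queries are all assumptions made for illustration.

```shell
# Hypothetical helper: run the commands read from stdin, one per line,
# with at most $1 of them executing at the same time.
run_parallel() {
  xargs -P "$1" -I {} sh -c '{}'
}

# The echo commands are placeholders; real queries would be bcftools calls.
results="$(printf '%s\n' \
  'echo query-for-file-A' \
  'echo query-for-file-B' \
  'echo query-for-file-C' | run_parallel 2)"

# Output order is not guaranteed, because the jobs run concurrently.
printf '%s\n' "$results"
```

Because the jobs finish in arbitrary order, anything that consumes their output should not rely on the order of the lines.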
Grid-crawler is written in perl and therefore requires that perl is installed on your OS.
$ git clone https://github.com/henrikstranneheim/grid-crawler.git
$ cd grid-crawler
After this you can make grid-crawler an "executable" by either adding the install directory to your $PATH, e.g. in "~/.bash_profile", or moving all the files from this directory to somewhere already in your path, like "~/usr/bin".
Remember to make the file(s) executable with "chmod +x file".
$ cd t; perl run-test.pl
SampleID: [Path to vcf data]
Can be compressed/uncompressed bcf/vcf format. SampleIDs that point to the same vcf path will be handled within the same query.
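A minimal sketch of a grid file, assuming the plain "SampleID: path" mapping described above. All sample IDs and paths are hypothetical placeholders. The first two samples point to the same multi-sample file, so they would be handled within the same query.

```shell
# Write a hypothetical grid file; every sample ID and path is a placeholder.
cat > grid-file.yaml <<'EOF'
sampleID_1: /path/to/family_A.bcf
sampleID_2: /path/to/family_A.bcf
sampleID_3: /path/to/sample_3.vcf.gz
EOF

# Count the unique vcf paths in the grid; samples that share a path
# are handled within the same query.
unique_paths="$(cut -d ' ' -f 2 grid-file.yaml | sort -u | wc -l | tr -d ' ')"
echo "Unique vcf paths: $unique_paths"
```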
Limit the output to the specified sampleID(s) within the grid and vcf file - based on flag '-s sampleID'.
Only query variants for positions - based on flag '-p [A,B:C,X:Y-Z]'.
Exclude ('-e') or include ('-i') variants based on a filtering expression - uses the bcftools expression syntax (see the bcftools manual page for details). Basically anything in the vcf can be filtered using this approach.
Include or exclude (prefix with ^) genotypes - based on flag '-g [het|hom|miss]'.
Execute the query sequentially in shell or in parallel (xargs or classic print/wait) via sbatch and SLURM ('-env [shell|sbatch]').
Using sbatch requires the mandatory flag '--projectID [X]'; additional options set email alerts and the SLURM quality of service.
Optionally activate a conda environment prior to executing queries.
Grid-crawler will automatically merge the output variants to a single file if analysing more than two unique paths.
Redirect query output data using ('--outDataDir').
Set the output type using ('--outputType [b: compressed BCF, z: compressed VCF]').
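Since grid-crawler wraps bcftools view and merge, the flags listed above map roughly onto one bcftools view call per unique vcf path, followed by a merge. The sketch below only builds and prints such commands; it does not run them. The file names and the helper view_cmd are hypothetical, and the exact commands grid-crawler generates may differ.

```shell
# Hypothetical input files; these names are placeholders.
in_a="family_A.bcf"
in_b="sample_3.vcf.gz"

# Build one 'bcftools view' command line (printed, not executed):
#   -s  limit output to the given samples
#   -r  limit output to the given regions (requires an indexed file)
#   -g  include or exclude genotypes (a leading ^ excludes)
#   -e  exclude variants matching the expression
#   -O b  write compressed BCF
view_cmd() {
  # $1 = input file, $2 = comma-separated samples, $3 = output file
  printf 'bcftools view -s %s -r %s -g %s -e "%s" -O b -o %s %s\n' \
    "$2" "1,X:200050-2000600,22:456793-456793" "^miss" \
    "DV>5 && INFO/SGB>3" "$3" "$1"
}

cmd_a="$(view_cmd "$in_a" "sampleID_1,sampleID_2" "query_A.bcf")"
cmd_b="$(view_cmd "$in_b" "sampleID_3" "query_B.bcf")"

# Combine the per-file results into a single file; bcftools merge
# expects indexed input files.
merge_cmd="bcftools merge -O b -o merged.bcf query_A.bcf query_B.bcf"

printf '%s\n' "$cmd_a" "$cmd_b" "$merge_cmd"
```

Printing the commands first makes it easy to inspect them before running anything against real data.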
- Perl (>=5.18)
- Perl modules: "Modern::Perl", "autodie", "IPC::Cmd", "YAML", "Log::Log4perl", "TAP::Harness"
- bcftools (>=1.3.0)
When using sbatch:
- SLURM
- xargs (>=4.4.2)