bsh-ref allows the sampling of bases from alignments (mappings) stored in BAM files
at positions defined in a .map
file. The most common use case will be that
population genotypes exist in a .map
and .ped
file pair (see here for file
format descriptions) as created by PLINK, and genotypes are to be sampled from
one or more BAM files at the positions for which population genotypes exist
already.
Additionally, bsh-ref allows to assess the fraction of derived alleles that a set of query samples share with combinations of user-defined test samples, polarized by a set of outgroup individuals.
To use bsh-ref, you will need a Linux machine where the gcc compiler, as well as the libz, libm and libpthread system libraries are installed. To acquire bsh-ref, first clone the repository recursively:
git clone --recursive https://github.com/clwgg/bsh-ref
This will clone both the bsh-ref code, as well as the htslib module, which is its only external dependency. After cloning, first compile the submodule, and then the bsh-ref code:
cd bsh-ref
make submodules
make
This will create the bsh-ref binary, which you can copy or move anywhere for subsequent use.
When updating to the current version, please make sure to also update the submodules:
git pull origin master
git submodule update
make submodules
make
Usage: ./bsh-ref [options] -m in.map [-p in.ped | -b in.lst] Options: -m map file containing positions (required) -p ped file containing known individuals (at least one of -p and -b) -b bam list file containing new individuals (at least one of -p and -b) -o outfile basename (default: out) -a sample 1 allele (pseudo-haploid) or 2 (default: 1) -r sample each bam file -r times (default: 1) -q min map quality (default: 0) -s create stat file for bam sampling -c restrict base sampling to reads with terminal C-to-T in one (1) or both (2) ends (default: 0) (Warning: this is so far only implemented for single stranded libraries, which also have C-to-T at 3p) -t ancestral vs. derived file (needs -p)
This file should be of the format name\tpath\n
, with path
being the path to
a (previously indexed) BAM file, and name
representing the name by which the
genotype of this sample will later be identified (e.g. in the .ped
file).
The file may contain multiple lines, if genotypes should be sampled from
multiple BAM files.
Example file:
sample1 path/to/sample1.bam sample2 path/to/sample2.bam
This file should be of the format flag fid iid\n
, with fid
representing the
family ID, and iid
the within-family ID in the supplied PED file (see here).
The flag
field may take the values 1, 2 or 3, indicating the use of this
individual as ancestral, test or query sample in the shared derived allele test,
respectively. All individuals sampled from BAM files (-b
) are automatically
used as query samples as well.
Example file:
1 outgroup indiv1 1 outgroup indiv2 2 testpop1 indiv1 2 testpop2 indiv1 2 testpop3 indiv1 3 querypop1 indiv1 3 querypop2 indiv1
If the -c
flag is set, the sampling will only consider bases from reads that
carry terminal C-to-T conversions in either the first or the last base. This is
so far only implemented for single-stranded libraries, which have C-to-T
conversions at both ends, rather then carrying G-to-A conversions at the 3’ end
due to the fill-in repair.