GitHub - panyuwen/Theta_D_H.Est: Calculator for statictics including Theta, D, H, and so on, based on the VCF input.

Theta_D_H.Est (by Yuwen Pan)

Calculator for statictics including Theta, D, H, and so on, based on the VCF input.

If you use the Theta_D_H.Est in your research work, please cite at least one of the following paper(s):

Genomic diversity and post-admixture adaptation in the Uyghurs (National Science Review, 2021)
Lineage-specific positive selection on ACE2 contributes to the genetic susceptibility of COVID-19 (National Science Review, 2022)

Description:

To calculate the following statistic within given regions:

number of sequence
number of genetic markers
number of singleton
ThetaPI
ThetaK
number of segregating site
number of haplotype
Haplotype diversity
Fay and Wu's H
normalized Fay and Wu's H
FU and Li's F(F*)
FU and Li's D(D*)
Tajima's D (with p-values and BH-corrected P-values)

Input:

phased VCF.gz file (no missing genotype) required
target regions to be analyzed (file) required
samples to be kept (file) optional
haplotypes to be kept (file) optional
length of sliding windows (bp) and increment (bp) (string) optional
whether the state of ancestral/derived allele is determined (string) optional
name of output file (string) optional

Usage:

./Theta_D_H.Est -h

./Theta_D_H.Est \
	--gzvcf phased.vcf.gz \
	--region region.txt \
	--samples sample.info \
	--haps haps.info \
	--window_shift 50000@10000 \
	--outgroup N \
	--out output.txt

Details about the arguments

--gzvcf (required): phased VCF.gz file, in GT format, like "1|0". Note: no duplicate physical positions, separate autosomes and X chromosome

--region (required): file including target regions to be analyzed. 4 columns: region ID chrom ID start pos end pos. no header line, tab or space sperated

--samples (optional): file including samples to be analyzed. 1 or 2 column: sample ID gender, optional. gender code, 1: male; 2: female. no header line. default: all samples in the VCF.gz file

--haps (optional): file including haplotypes to be analyzed. 2 columns: sample ID haplotype index (1/2), 1: first haplotype; 2: second haplotype. no header line. default: all haplotypes in the VCF.gz file

--window_shift (optional): window_length@increment, to partition the target region(s) into sliding windows of window_length advanced by increment. Then calculate all of the statistic within the sliding windows. default: calculate all of the statistic within target region(s)

--outgroup (optional): Y / N, whether the ancestral/derived allele is determined, required for FU&Li's and Fay&Wu's tests. default: N

--out (optional): output file name. output file would be gzipped. default: out.txt

Tips & Notes:

to calculate statistic across the whole chromosome / genome, you can define the chromosome boundaries in the region file,

ID1 21 1 48129895
ID2 22 1 51304566

then specify the window size and increment (e.g., 100000@5000). This script will help you to calculate all of the statistic within sliding windows of 100KB in length shift by 5KB.
heterozygotes are not allowed for male X chromosome. But this script can't help you to check the data format.
all individuals would be considered as diploid, if no gender information is provided
if both "--haps" and "--samples" are used, only samples provided by both arguments are remained for analysis
if "--haps" is used, it would be probably unnecessary to use "--samples", (for chrX, only if males are in homo format)
for chrX in males, only the first haplotypes would be used, no more second haplotypes
if the ancestral/derived alleles are not determined, there won't output Fay and Wu's H, normalized Fay and Wu's H, FU and Li's F*, or FU and Li's D*
if the ancestral/derived alleles are determined, number of singletons will be estimated as the number of derived singletons
computational complexity: linear. you are suggested to filter homozygous loci to speed up the program. it may takes <1h and ~2Gb memory for chromosome 1 of CHB (50000@20000, ~12.5K sliding windows, 103 individuals)
this program is compiled in centos7, older systems may not be supported.
If you have problem using the compiled program, it can still be run in the following way: python2 Theta_D_F_H.py2.py [--options]; OR python3 Theta_D_F_H.py3.py [--options]. All the required packages are accessible in conda.

Other Tools:

## allele calibration, for VCF
python2 allele_cali.one_chr.py <input .vcf.gz> <reference genome, chr${chr}.fa.gz> <output .vcf.gz>

This script can be used to change (REF -> Ancestral) and (ALT -> Derived) alleles in the VCF files (one chromosome for a single run). Developed in python2, use python3 at your own risk. The reference genome (ancestral alignment) can be found here http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/ (hg19, version e59, other versions can be found elsewhere). fasta for each chromosome can be extracted using tools like samtools.

By: Yuwen Pan, 2021
Contact: panyuwen.x@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
LICENSE		LICENSE
README.md		README.md
Theta_D_F_H.py2.py		Theta_D_F_H.py2.py
Theta_D_F_H.py3.py		Theta_D_F_H.py3.py
Theta_D_H.Est		Theta_D_H.Est
VCF_GT_format.py		VCF_GT_format.py
allele_cali.one_chr.py		allele_cali.one_chr.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Description:

Input:

Usage:

Tips & Notes:

Other Tools:

About

Releases

Packages

Languages

License

panyuwen/Theta_D_H.Est

Folders and files

Latest commit

History

Repository files navigation

Description:

Input:

Usage:

Tips & Notes:

Other Tools:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages