PanLineal: a new concept and pipeline for constructing and utilizing a linear pan-genome from de novo assembled genomes
Install
git clone https://github.com/lipingfangs/Panlineal.git
cd Panlineal
python setup.py install
Attention
This pipeline relies on MUMmer, LASTZ, bowtie2, samtools and svmu; users need to write the paths to these tools in the file location.lg
This pipeline only supports the one-line .fasta format, in which each sequence occupies a single line after its header
For example
This is supported:
>1dna_chromosomechromosome_IRGSP-1.0_1_1_43270923_1REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
But not this:
>1dna_chromosomechromosome_IRGSP-1.0_1_1_43270923_1REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The script goone.py in this package converts a multi-line .fasta file into the one-line format
usage:
python goone.py <in.fasta> <out.fasta>
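To illustrate what "one-line" means here, the following is a minimal sketch of such a conversion. It is not the goone.py source; the function name to_one_line_fasta is hypothetical, and the sketch assumes that header lines start with ">" and that everything up to the next header belongs to one sequence.

```python
import io


def to_one_line_fasta(in_handle, out_handle):
    """Write each record as one header line followed by one sequence line."""
    header = None
    chunks = []
    for raw in in_handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: flush the previous one first.
            if header is not None:
                out_handle.write(header + "\n" + "".join(chunks) + "\n")
            header = line
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence line of the current record
    if header is not None:
        out_handle.write(header + "\n" + "".join(chunks) + "\n")


# Toy input (hypothetical sequences, not taken from a real genome).
multi = io.StringIO(">chr1\nACGT\nTTGA\nCC\n>chr2\nGG\nAA\n")
single = io.StringIO()
to_one_line_fasta(multi, single)
print(single.getvalue(), end="")
```

For real files, the same function can be called with two handles from open(), so the genome is streamed rather than loaded at once.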
Usage of this program: the program is divided into pan-genome construction (Panlineal.py) and SV mapping (mappingtools.py). In addition, bestref.py creates an optimal reference genome from the pan-genome based on short reads.
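As a rough orientation, the three steps could be chained from Python as sketched here. The flags are the ones listed in the help output of each script; the helper function names, the read file names reads_R1.fq.gz and reads_R2.fq.gz, and the assumption that the scripts are callable by name after installation are illustrative only. The sketch prints the commands and does not execute them.

```python
import shlex


def panlineal_cmd(location, pairguide, output, threads=20, fliter_size=1000):
    """Pan-genome construction step (flags from the Panlineal.py help)."""
    return ["Panlineal.py", "-l", location, "-p", pairguide,
            "-t", str(threads), "-all", "yes", "-o", output,
            "-f", str(fliter_size), "-clean", "yes"]


def mapping_cmd(pangenome, reads1, reads2, goc, location, output, cov_filter=5):
    """Short-read mapping step (flags from the mappingtools.py help)."""
    return ["mappingtools.py", "-i", pangenome, "-1", reads1, "-2", reads2,
            "-g", goc, "-l", location, "-o", output, "-c", str(cov_filter)]


def bestref_cmd(pangenome, hapc, goc, output):
    """Optimal-reference step (flags from the bestref.py usage line)."""
    return ["bestref.py", "-i", pangenome, "-hap", hapc, "-g", goc, "-o", output]


# File names below are placeholders; adapt them to your own data.
steps = [
    panlineal_cmd("location.lg", "example.pair.cfg", "refquery"),
    mapping_cmd("pangenome.fa", "reads_R1.fq.gz", "reads_R2.fq.gz",
                "guide.goc", "location.lg", "outputcov"),
    bestref_cmd("pangenome.fa", "goin.hapc", "guide.goc", "output"),
]
for step in steps:
    print(shlex.join(step))
    # To actually run a step: subprocess.run(step, check=True)
```

Building each command as a list keeps file names with spaces intact if the commands are later passed to subprocess.run.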
Panlineal.py -h ;for help
usage: Panlineal.py [-h] [-p PAIRGUIDE] [-t THREADS] [-all {yes,no}]
[-o OUTPUT] [-f FLITERSIZE] [-l LOCATION] [-r RANGEFLITER]
[-merge {yes,no}] [-clean {yes,no}] [--version]
Create one-by-one read-pairs and sv compare example: Panlineal.py -l location.lg -p example.pair.cfg -t 20 -all yes -o refquery -f 1000 -clean yes
optional arguments:
-h, --help show this help message and exit
-p PAIRGUIDE, --pairguide PAIRGUIDE
input your pairguide file
-t THREADS, --threads THREADS
how many thread do you want to use
-all {yes,no}, --runall {yes,no}
if yes: run multiple sequence of whole process,
generate the pan genome file; if no:just splice;
default yes
-o OUTPUT, --output OUTPUT
name of the pan-genome output: <-o>.fasta
-f FLITERSIZE, --flitersize FLITERSIZE
fliter size of SV; default 1000
-l LOCATION, --location LOCATION
location of software "mummer" "lastz" and "svmu
-r RANGEFLITER, --rangefliter RANGEFLITER
SVs distance between red and query; default 1000000
-merge {yes,no}, --merge {yes,no}
merge .goc and generate the final location file;
default yes
-clean {yes,no}, --clean {yes,no}
Clean all of the middle file!; default no
--version show program's version number and exit
mappingtools.py -h ;for help
usage: mappingtools.py [-h] [-i INPAN] [-t THREADS] [-b INBASE] [-1 PAIREND1]
[-2 PAIREND2] [-l LOCATION] [-g GOCGUIDE] [-o OUTPUT]
[-c COVFLITER] [-clean {yes,no}] [-create {yes,no}]
Create one-by-one mapping and pav example: mappingtools.py -i pangenome.fa -1
illnumina_R1.fq.gz -2 illnumina_R3.fq.gz -g guide.goc -l location.lg -o
outputcov -c 5
optional arguments:
-h, --help show this help message and exit
-i INPAN, --inpan INPAN
input your reference .fasta
-t THREADS, --threads THREADS
how many thread do you want to use
-b INBASE, --inbase INBASE
how many base-pair will you consider it as a total
insertion rather than a replace
-1 PAIREND1, --pairend1 PAIREND1
input your pairend1 .fastq
-2 PAIREND2, --pairend2 PAIREND2
input your pairend2 .fastq
-l LOCATION, --location LOCATION
location of software "mummer" "lastz" and
"svmu","samtools","bowtie"
-g GOCGUIDE, --gocguide GOCGUIDE
input your .goc file generated by multiple.py
-o OUTPUT, --output OUTPUT
name of the pan-genome coverage output
-c COVFLITER, --covfliter COVFLITER
coverage fliter
-clean {yes,no}, --clean {yes,no}
Clean all of the middle file!
-create {yes,no}, --createfasta {yes,no}
Create a optimal reference genome for short read
bestref.py -h ;for help
usage: bestref.py [-h] [-i INPAN] [-o OUTPUT] [-g GOCGUIDE] [-hap HAPCGUIDE]
Create a optimal reference genome based on short reads from pan-genome:
bestref.py -i pangenome.fa -hap goin.hapc -g guide.goc -l location.lg -o
output
optional arguments:
-h, --help show this help message and exit
-i INPAN, --inpan INPAN
input your reference .fasta
-o OUTPUT, --output OUTPUT
name of the pan-genome coverage output
-g GOCGUIDE, --gocguide GOCGUIDE
input your .goc file generated by mappingtools.py
-hap HAPCGUIDE, --hapcguide HAPCGUIDE
input your .hapc file generated by mappingtools.py
The Hapmerge.py module merges several .hapc files into one
Hapmerge.py -h
usage: Hapmerge.py [-h] [-l HAPLIST] [-o OUTPUT] [--version]
Merge .hapc file; example: Hapmerge -l merge.li -o haplistmergeout.hapcs
optional arguments:
-h, --help show this help message and exit
-l HAPLIST, --haplist HAPLIST
list file of .hapc files ready to be merged
-o OUTPUT, --output OUTPUT
name of output .hapn file
--version show program's version number and exit
The .cfg and .lg files are required by this program
Input file formats:
- example.pair.cfg:
<ref.chromosome1> <query.Homologous.chromosome1>
<ref.chromosome2> <query.Homologous.chromosome2> ...
example:
Chr1 chr01
Chr2 chr02
Chr3 chr03
- location.lg
Mummer=
Lastz=
svmu=
bowtie2=
samtools=
ref=<reference sequence .fasta>
query=<query1 sequence .fasta>,<query2 sequence .fasta>,......
example:
Mummer=/home/lfp/soft/mummer-4.0.0beta2/
Lastz=/home/lfp/soft/lastz-master/src/
svmu=/home/lfp/soft/svmu/
bowtie2=/home/lfp/miniconda3/bin/
samtools=/home/lfp/miniconda3/bin/
ref=Oryza_IRGSP-1.0_genome.1m.fasta
query=Oryza_HJX74_top_level_v2.2.1m.fa,Oryza_R498_Chr.1m.fa
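Before a long run it can help to check both input files. The sketch below reads them the way the formats above describe: KEY=VALUE lines for location.lg, with the query value split on commas, and two whitespace-separated columns for the pair file. The function names and the REQUIRED_KEYS list are assumptions made for illustration; the parser inside Panlineal.py may behave differently.

```python
REQUIRED_KEYS = ["Mummer", "Lastz", "svmu", "bowtie2", "samtools", "ref", "query"]


def parse_location_lg(text):
    """Parse KEY=VALUE lines; the 'query' value becomes a list of files."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    settings["query"] = [q.strip() for q in settings.get("query", "").split(",") if q.strip()]
    return settings


def missing_keys(settings):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


def parse_pair_cfg(text):
    """Parse '<ref chromosome> <query chromosome>' lines into pairs."""
    pairs = []
    for number, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(fields)}")
        pairs.append((fields[0], fields[1]))
    return pairs


lg_text = """Mummer=/home/lfp/soft/mummer-4.0.0beta2/
Lastz=/home/lfp/soft/lastz-master/src/
svmu=/home/lfp/soft/svmu/
bowtie2=/home/lfp/miniconda3/bin/
samtools=/home/lfp/miniconda3/bin/
ref=Oryza_IRGSP-1.0_genome.1m.fasta
query=Oryza_HJX74_top_level_v2.2.1m.fa,Oryza_R498_Chr.1m.fa
"""
settings = parse_location_lg(lg_text)
print(settings["query"])
print(missing_keys(settings))
print(parse_pair_cfg("Chr1 chr01\nChr2 chr02\nChr3 chr03\n"))
```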
To merge .hapc files, Hapmerge.py needs a merge.li list file as input, in the following format:
<1.hapc>,<2.hapc>,<3.hapc>...
for example:
P133-DSW43038-S_L1-pan.cov.hapc,P208-DSW43111-S_L4-pan.cov.hapc,P229-DSW43149-S_L5-pan.cov.hapc,P236-DSW43156-S_L5-pan.cov.hapc,P54-DSW42960-S_L7-pan.cov.hapc,P91-DSW42997-S_L1-pan.cov.hapc
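Because a single wrong separator in this list is easy to miss, a small helper can build and check the line. This sketch assumes merge.li is one comma-separated line of .hapc file names, as the format above suggests; format_merge_li and parse_merge_li are hypothetical names and are not part of Hapmerge.py.

```python
def format_merge_li(hapc_paths):
    """Join .hapc file names into one comma-separated merge.li line."""
    for path in hapc_paths:
        # Reject entries that would break the comma-separated layout.
        if not path.endswith(".hapc") or "," in path:
            raise ValueError(f"not a usable .hapc entry: {path!r}")
    return ",".join(hapc_paths)


def parse_merge_li(text):
    """Split a merge.li line back into individual .hapc file names."""
    return [entry.strip() for entry in text.strip().split(",") if entry.strip()]


names = ["P133-DSW43038-S_L1-pan.cov.hapc", "P208-DSW43111-S_L4-pan.cov.hapc"]
line = format_merge_li(names)
print(line)
print(parse_merge_li(line))
```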
xxx.fasta (the pan-genome produced by Panlineal.py after alignment and Vgs-based linearized integration)
For example:
>chr10
TAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTTAAC
ACGTTCTGAATATATGTTTCATATATTATCTATATTTTTATTATTTTCAGAAGTTTTTTAAAATTCAAAACATATTTTTAG
TGTGTTCTCCTTATTTTCTAACTATTTTAATGATTTTAAGTTGAAATTATATAAATATAAATTCTATAAGATTCTAAACATTGTAAATAGATCATTCACGTATTATCTATACTTTAGTTTTTAATGTATTATTTTTATTATGCAATGTATTACTTTTAATTTTTA
xxx.goc (file recording the Vgs multidirectional branch information of the pan-genome (xxx.fasta) produced by Panlineal.py after alignment and Vgs-based linearized integration)
File format:
OrgID Chr Start1 end1 Start2 end2 length (if "more" appears in a row, the locus has more than 2 kinds of PAVs)
For example:
OrgID Chr Start1 end1 Start2 end2 length
1-1-chr10 chr10 81013 81013 81014 84495 3482
1-2-chr10 chr10 171843 171843 171844 172376 533
2-1-chr10 chr10 172379 172379 172380 173876 1497
2-2-chr10 chr10 213521 213616 213617 215808 2192
2-3-chr10 chr10 217433 217433 217434 218610 1177
2-4-chr10 chr10 243894 246549 246550 252770 6221
1-3-chr10 chr10 268308 268309 268310 268925 616
2-5-chr10 chr10 276199 276632 276633 277254 622
2-6-chr10 chr10 279697 280326 280327 282045 1719
2-7-chr10 chr10 365487 365487 365488 366046 559
2-8-chr10 chr10 395016 395016 395017 399115 4099
2-13-chr10 chr10 1338940 1339015 1339016 1343068 1343069 1344198 5183 more 1-17-chr10
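The rows above can be read with a few lines of Python. This sketch assumes whitespace-separated columns and treats everything between the chromosome name and the length as start/end pairs, so that rows containing "more" (which carry an extra coordinate pair and a further ID) are handled as well. In the example rows the length column equals the summed size of the segments after the first one; that is an observation from this sample, not a documented rule. The function names are hypothetical.

```python
def parse_goc_line(line):
    """Parse one .goc row into id, chromosome, segments, length and extras."""
    fields = line.split()
    extra = []
    if "more" in fields:
        idx = fields.index("more")
        extra = fields[idx + 1:]   # IDs listed after the "more" marker
        fields = fields[:idx]
    numbers = [int(value) for value in fields[2:]]
    coords, length = numbers[:-1], numbers[-1]
    segments = list(zip(coords[0::2], coords[1::2]))  # (start, end) pairs
    return {"id": fields[0], "chr": fields[1], "segments": segments,
            "length": length, "more": extra}


def parse_goc(text):
    """Parse a whole .goc file, skipping the header and blank lines."""
    return [parse_goc_line(line) for line in text.splitlines()
            if line.strip() and not line.startswith("OrgID")]


sample = """OrgID Chr Start1 end1 Start2 end2 length
1-1-chr10 chr10 81013 81013 81014 84495 3482
2-13-chr10 chr10 1338940 1339015 1339016 1343068 1343069 1344198 5183 more 1-17-chr10
"""
for record in parse_goc(sample):
    inserted = sum(end - start + 1 for start, end in record["segments"][1:])
    print(record["id"], record["segments"], record["length"], inserted, record["more"])
```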
xxx.cov (file recording the average mapping coverage of each PAV locus after short-read mapping with mappingtools.py)
File format:
ID PanPosition RefPosition coverage
For example:
ID PanPosition RefPosition coverage
1-1-chr10 81013 81013 -1 0.1057
1-2-chr10 171843 168361 -1 11.687
2-1-chr10 172379 168364 -1 9.9271
2-2-chr10 213521 208009 0.4631 19.489
2-3-chr10 217433 209729 -1 11.754
2-4-chr10 243894 235013 2.4470 12.244
1-3-chr10 268308 253206 -1 0.0
2-5-chr10 276199 260481 1.4988 11.404
2-6-chr10 279697 263357 5.7472 10.788
2-7-chr10 365487 347428 -1 13.508
2-8-chr10 395016 376398 -1 17.243
1-4-chr10 458138 435421 1.6117 0.452
1-5-chr10 683947 660729 -1 9.7587
2-9-chr10 824027 795916 0.5491 14.988
xxx.cov.hapc (file recording the presence/absence state of each PAV locus, derived from coverage, and the resulting SV haplotype after short-read mapping with mappingtools.py)
File format:
ID PanPosition RefPosition HaplotypeComposition Haplotype
For example:
ID PanPosition RefPosition HaplotypeComposition Haplotype
1-1-chr10 81013 81013 0 0 Hap00
1-2-chr10 171843 168361 0 1 Hap01
2-1-chr10 172379 168364 0 1 Hap01
2-2-chr10 213521 208009 0 1 Hap01
2-3-chr10 217433 209729 0 1 Hap01
2-4-chr10 243894 235013 0 1 Hap01
1-3-chr10 268308 253206 0 0 Hap00
2-5-chr10 276199 260481 0 1 Hap01
2-6-chr10 279697 263357 1 1 Hap11
2-7-chr10 365487 347428 0 1 Hap01
2-8-chr10 395016 376398 0 1 Hap01
1-4-chr10 458138 435421 0 0 Hap00
1-5-chr10 683947 660729 0 1 Hap01
2-9-chr10 824027 795916 0 1 Hap01
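Comparing the .cov and .cov.hapc examples, each of the two coverage values of a locus appears to be turned into a 0/1 call by comparing it with the coverage filter (-c 5 in the mappingtools.py example), and the two calls are joined into the Hap label. The sketch below reproduces the sample rows under that assumption. The actual rule inside mappingtools.py may differ, for example for values exactly at the threshold, and call_haplotype is a hypothetical name.

```python
def call_haplotype(cov_line, threshold=5.0):
    """Turn one .cov row into (id, pan position, ref position, calls, label).

    Assumption inferred from the example files: a coverage value at or
    above the threshold counts as present (1), anything lower, including
    the -1 placeholder, counts as absent (0).
    """
    fields = cov_line.split()
    locus_id, pan_pos, ref_pos = fields[0], int(fields[1]), int(fields[2])
    calls = [1 if float(value) >= threshold else 0 for value in fields[3:5]]
    label = "Hap" + "".join(str(call) for call in calls)
    return locus_id, pan_pos, ref_pos, calls, label


cov_rows = [
    "1-1-chr10 81013 81013 -1 0.1057",
    "2-2-chr10 213521 208009 0.4631 19.489",
    "2-6-chr10 279697 263357 5.7472 10.788",
]
for row in cov_rows:
    locus_id, pan_pos, ref_pos, calls, label = call_haplotype(row)
    print(locus_id, pan_pos, ref_pos, calls[0], calls[1], label)
```

With the default threshold of 5, the three rows give Hap00, Hap01 and Hap11, matching the corresponding lines of the .cov.hapc example.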
If you have any problems with Panlineal, please contact lpf_bio@foxmail.com or 86-13242867935