Personal collection of small tools to manipulate sequence files
- checkmd5sum.py : Compares the md5sum computed from a file against the md5sum provided for that file (missing lines, WIP).
- renameFilesRecursive.py: Given an input file of old/new names, renames all matching files recursively in an input directory. Designed for file prefixes, e.g. to replace random prefixes with QBiC barcodes.
- matchColumnInOneFile.py: Sorts file2 by column1 so that it follows the order of column1 in file1. Requires the pandas package.
- format_fasta_oneline.py : Converts a multi-line sequence fasta file into one-line format (seq1Name, newline, seq1 on one line, newline, seq2Name, ...). Uses only awk, so there is no dependency on Biopython or Perl.
- filterMinSpecies.py : Goes through each fasta file in an input directory, checks the number of sequences in the file, and only copies the file to the output directory if the number of sequences is at least `minNumSpecPer` percent of `totalSpecies`. E.g. if `totalSpecies` is 60 and `minNumSpecPer` is 50, a fasta file must have at least 30 sequences to be copied to the output folder.
- seqLengthDist.py: 1) Gets a distribution count of sequence lengths and 2) generates a histogram of the length distribution.
- statsFasta.py: Outputs stats for a fasta file (number of seqs, average seq length, longestSeq, shortestSeq, %GC).
- getIgphymlMSAclone.py: Parse the input file for `igPhyML`, typically named xxx_db-pass_productive-T_clone-pass_germ-pass.tsv, to create a standard MSA file in fasta format.
- readIgphymlOutPhylo.R : Rscript template to read in output from `igPhyML`, typically xxx_igphyml-pass.tab, to generate trees in newick format.
- bedstat.py: Customise and view statistical properties of a bed file. Written in Python 2.
- parseEnsemblGeneNamesDesc.py: Parses the downloaded Ensembl BioMart file to create a clean output. Combines multiple phenotypes from the same geneID into one entry. Written in Python 2.
- parseProteinFunction_csv.py: Parses protein sequences downloaded from Ensembl and outputs the proteinID and function in two columns. Written in Python 2.
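
The threshold rule described for filterMinSpecies.py can be sketched as below. This is a minimal illustration, not the script itself: the function names, the `extensions` argument, and the choice of `>=` for the comparison are assumptions, and sequences are counted simply as lines starting with `>`.

```python
import shutil
from pathlib import Path


def count_sequences(fasta_path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def min_sequences(total_species, min_num_spec_per):
    """Sequence count corresponding to min_num_spec_per percent of total_species."""
    return total_species * min_num_spec_per / 100


def filter_min_species(in_dir, out_dir, total_species, min_num_spec_per,
                       extensions=(".fasta", ".fa")):
    """Copy FASTA files that meet the threshold from in_dir to out_dir.

    Assumption: a file passes when its count is >= the threshold.
    Returns the names of the copied files, sorted.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    threshold = min_sequences(total_species, min_num_spec_per)
    copied = []
    for fasta in sorted(Path(in_dir).iterdir()):
        if not fasta.is_file() or fasta.suffix not in extensions:
            continue  # skip subdirectories and non-FASTA files
        if count_sequences(fasta) >= threshold:
            shutil.copy(fasta, out_path / fasta.name)
            copied.append(fasta.name)
    return copied
```

With `totalSpecies` 60 and `minNumSpecPer` 50, `min_sequences(60, 50)` gives 30, matching the example in the list above.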