smallScripts

Personal collection of small tools to manipulate sequence files

A. General utility

checkmd5sum.py : Compares the md5sum computed from a file against the md5sum provided for that file (missing lines, WIP).
renameFilesRecursive.py: Renames all files recursively in an input directory, based on an input file of old/new names. Designed for file prefixes: it replaces random prefixes with QBiC barcodes.
matchColumnInOneFile.py: Sorts file2 by column1 so that it follows the order of column1 in file1. Requires the pandas package.
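
The reordering that matchColumnInOneFile.py performs can be sketched as follows. This is a minimal illustration, not the script itself: it uses only the standard library instead of pandas, and the function name `reorder_by_reference`, the in-memory rows, and the handling of unmatched keys are assumptions.

```python
def reorder_by_reference(ref_rows, rows, key_col=0):
    """Return `rows` sorted so that column `key_col` follows the order
    of the same column in `ref_rows`.

    Rows whose key does not occur in `ref_rows` are placed at the end,
    in their original order (an assumption of this sketch).
    """
    # Map each key to its first position in the reference rows.
    rank = {}
    for i, row in enumerate(ref_rows):
        rank.setdefault(row[key_col], i)
    missing = len(ref_rows)  # sorts after every known key
    # sorted() is stable, so rows with equal rank keep their input order.
    return sorted(rows, key=lambda row: rank.get(row[key_col], missing))


# Hypothetical rows standing in for file1 (reference order) and file2.
file1 = [["geneB", "x"], ["geneA", "y"], ["geneC", "z"]]
file2 = [["geneA", "1"], ["geneC", "2"], ["geneB", "3"]]
reordered = reorder_by_reference(file1, file2)
```

Here `reordered` lists the file2 rows in the order geneB, geneA, geneC, matching file1.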

format_fasta_oneline.py : Converts a multi-line fasta file into single-line format (seq1Name, newline, seq1 on one line, newline, seq2Name, ...). Uses only awk, so it does not depend on Biopython or Perl.
filterMinSpecies.py : Goes through each fasta file in an input directory, counts the sequences in the file, and copies the file to the output directory only if the number of sequences is >= minNumSpecPer percent of totalSpecies. E.g. if totalSpecies is 60 and minNumSpecPer is 50, a fasta file must have at least 30 sequences to be copied to the output folder.
seqLengthDist.py: 1) gets a count distribution of sequence lengths and 2) generates a histogram of the length distribution.
statsFasta.py: Outputs stats for a fasta file (number of seqs, average seq length, longestSeq, shortestSeq, %GC).
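
The threshold used by filterMinSpecies.py can be worked through with a short sketch. It is illustrative only and not taken from the script: the function names are hypothetical, sequences are counted by their `>` header lines, and the comparison is assumed to be inclusive (>=).

```python
def count_sequences(fasta_text):
    """Count fasta records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def min_sequences(total_species, min_percent):
    """Smallest sequence count that reaches min_percent of total_species.

    Integer ceiling division avoids floating-point rounding.
    """
    return -(-total_species * min_percent // 100)


def passes_filter(n_seqs, total_species, min_percent):
    """True if the file has enough sequences (inclusive comparison assumed)."""
    return n_seqs * 100 >= total_species * min_percent


# Hypothetical two-record fasta content with a wrapped first sequence.
example = ">sp1\nACGT\nAC\n>sp2\nGGCC\n"
n_example = count_sequences(example)

# Worked example from the description: totalSpecies 60, minNumSpecPer 50.
threshold = min_sequences(60, 50)
```

With these inputs the threshold is 30 sequences, so under the inclusive assumption a file with 30 sequences is copied and a file with 29 is not.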

getIgphymlMSAclone.py: Parses the input file for igPhyML, typically named xxx_db-pass_productive-T_clone-pass_germ-pass.tsv, to create a standard MSA file in fasta format.
readIgphymlOutPhylo.R : R script template that reads the output from igPhyML, typically xxx_igphyml-pass.tab, and generates trees in newick format.

bedStat.py: Customise and view statistical properties of a bed file. Written in Python 2.
parseEnsemblGeneNamesDesc.py: Parses the downloaded Ensembl BioMart file to create a clean output. Combines multiple phenotypes from the same geneID into one entry. Written in Python 2.
parseProteinFunction_csv.py: Parses protein sequences downloaded from Ensembl and outputs the proteinID and function in two columns. Written in Python 2.
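
The phenotype merging described for parseEnsemblGeneNamesDesc.py can be sketched as below. This is an assumption-laden illustration rather than the script: it is written for Python 3, the function name `merge_phenotypes`, the (geneID, phenotype) pair layout, and the "; " separator are all hypothetical, and the gene IDs are placeholders.

```python
def merge_phenotypes(rows):
    """Collapse (gene_id, phenotype) pairs into one entry per gene_id.

    Empty and duplicate phenotypes are skipped; gene order follows first
    appearance in the input.
    """
    merged = {}
    for gene_id, phenotype in rows:
        phenotypes = merged.setdefault(gene_id, [])
        if phenotype and phenotype not in phenotypes:
            phenotypes.append(phenotype)
    # Join each gene's phenotypes into a single string (separator assumed).
    return {gene_id: "; ".join(p) for gene_id, p in merged.items()}


# Placeholder rows, not real Ensembl identifiers.
rows = [
    ("gene1", "phenoA"),
    ("gene2", "phenoC"),
    ("gene1", "phenoB"),
    ("gene1", "phenoA"),
]
merged = merge_phenotypes(rows)
```

Each geneID then appears once, with its phenotypes combined in a single field.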
