Skip to content

Collection of various small scripts for utility in bash or manipulating fasta files

Notifications You must be signed in to change notification settings

jonoave/smallScripts

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

smallScripts

Personal collection of small tools to manipulate sequence files

A. General utility

  1. checkmd5sum.py : Checks md5sum generated from a file vs md5sum provided for a file (missing lines, WIP).
  2. renameFilesRecursive.py: Based on an input file of old/new names, will rename all files recursively in an input directory. This was designed for file prefixes - to rename random prefixes with QBiC barcodes.
  3. matchColumnInOneFile.py: Sort by column1 in file2 based on the order of column1 in file1. Requires pandas package.

B. Manipulating fasta files (primarily phylogenetic purposes)

  1. format_fasta_oneline.py : Converts a multi-line sequence fasta file into a (seq1Name, newline, seq1 in one line, newline, seq2Name..). Just uses awk, so no dependencies biopython or perl.
  2. filterMinSpecies.py : Goes through each fasta file in an input directory, checks the number of sequences in a file, and only copies file to the output directory if number of sequences >= minNumSpecPer on totalSpecies. E.g if totalSpecies is 60 and minNumSpecPer is 50, a fasta file must have >30 sequences to be copied to output folder.
  3. seqLengthDist.py: # 1)get a distribution count of sequence length and 2)generate a histogram of length distribution.
  4. statsFasta.py: output stats for a fasta file (number of seqs, average seq length, longestSeq, shortestSeq, %GC)

C. Utility scripts of IgPhyML (B cell phylogenetic inference package)

  1. getIgphymlMSAclone.py: Parse the input file for igPhyML, typically named xxx_db-pass_productive-T_clone-pass_germ-pass.tsv to create a standard MSA file in fasta format.
  2. readIgphymlOutPhylo.R : Rscript template to read in output from igPhyML, typically xxx_igphyml-pass.tab, to generate trees in newick format.

D. niche utility scripts

  1. bedstat.py: Customise and view statistical properties of a bedfile. Written in python2.
  2. parseEnsemblGeneNamesDesc.py: parse the downloaded Ensembl biomart file to create a clean output. Combines multiple phenotypes from the same geneID into one entry. Written in python2.
  3. parseProteinFunction_csv.py: Parse protein sequences downloaded from Ensembl and output the proteinID and function in two columns. Written in python2

About

Collection of various small scripts for utility in bash or manipulating fasta files

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published