Clinical data

This is a collection of bioinformatics tools I have sourced from recent literature, organized by topic. I have not used most of these tools.

Table of Contents

Clinical data
- EHR
Data Management
Data Sets
Discovery
Genomics
General Programming Resources
Statistics/Machine Learning
Visualization
- Genome browsers
- Networks
- Phylogenetic trees
- R
- Python
- Javascript
- Examples
Publication/Archiving
- Code/Data sharing
- Journals
- Writing
- Posters
- Slides
- CV
Promising methods without software implementation

Clinical data

EHR

Data Management

Data Sets

Discovery

When looking for a bioinformatics tool for a specific application:

Genomics

General Information

Algorithms

Assay Design

Choosing assays based on complementarity to existing data: https://github.com/melodi-lab/Submodular-Selection-of-Assays
MPRAnator: design of MPRA libraries https://www.genomegeek.com/

Candidate Prioritization

Databases

RNA Meta Analysis (web app)
- Perform meta-analyses of GEO microarray and RNA-Seq studies
- Correlate resulting signatures to CMAP02/L1000 perts and 26,000+ GEO studies
- https://www.rnama.com
NAR catalog of databases, by subject: http://www.oxfordjournals.org/our_journals/nar/database/c/
Super-Enhancer Archive: http://www.bio-bigdata.com/SEA/
GWAS database: http://jjwanglab.org/gwasdb
rVarBase: regulatory features of human genetic variants http://rv.psych.ac.cn/
TransVar http://bioinformatics.mdanderson.org/transvarweb/
dbMAE: mono-allelic expression https://mae.hms.harvard.edu/
Disease-gene associations http://www.disgenet.org/web/DisGeNET/menu/rdf
BISQUE: convert between database identifiers http://bisque.yulab.org/
Human tissue-specific enhancers: http://www.enhanceratlas.org/
Search HLI's genome data: hli-opensearch.com
Chromatin-state annotations + per-base functionality scores for 164 cell types: http://noble.gs.washington.edu/proj/encyclopedia/
Feature-based classification of human transcription factors: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1349-2
Database of epifactors (epigenetic factors): http://epifactors.autosome.ru/
Database of disease-associated methylation: http://202.97.205.78/diseasemeth/
SRA metadata: http://deweylab.biostat.wisc.edu/metasra/
Human histone modifications: http://www.tongjidmb.com/human/index.html
iRegNet3D: SNP-focused catalog of TF-TF, TF-DNA, and DNA-DNA interactions http://iregnet3d.yulab.org/index/
CTD: drug interactions and toxicity http://ctdbase.org/
FANTOM lncRNA catalog: http://fantom.gsc.riken.jp/cat/
Integrated database of public ChIP-Seq datasets: http://chip-atlas.org/
Alternative splicing: http://vastdb.crg.eu/wiki/Main_Page
Interactive multi-omics tissue assay database: https://ccb-web.cs.uni-saarland.de/imota/
Database of cis-regulatory elements (enhancers): http://www.kostkalab.net/software.html
Database of genetic variant effects on gene expression: https://xhaubem01.u.hpc.mssm.edu/gwas2genes/
Structural variation: https://www.ncbi.nlm.nih.gov/dbvar
PharmacoDB
- Search multiple cancer pharmacogenomic databases with a single query
- Software is GPL licensed; target databases have various licenses
- http://pharmacodb.pmgenomics.ca
HIVE BioMuta: https://hive.biochemistry.gwu.edu/biomuta/readme
Database of druggable variant information (web only): http://depo-dinglab.ddns.net/
CHESS: new gene annotation database
- Contains most genes from RefSeq and Gencode, plus additional genes discovered from GTEx transcripts
- http://ccb.jhu.edu/chess/
Variant benchmark datasets: http://structure.bmc.lu.se/VariBench/
Metabase: aggregates various gene annotation sources: http://metascape.org
CIViC: https://civicdb.org/home
Search publications and clinical trials by genes/variants/drugs https://vist.informatik.hu-berlin.de/
Allele frequencies https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/

Data Formats

Functional Enrichment/Ontology

GWAS/QTL

Microarray

Correct for batch effects between training data and external datasets: https://cran.r-project.org/web/packages/bapred/index.html
Impute from Affy expression arrays: http://simtk.org/home/affyimpute
SNP
- Call haplotypes https://cran.r-project.org/web/packages/GHap/index.html
- Neural network for breakpoint detection for CNV calling: https://www.biorxiv.org/content/biorxiv/early/2018/06/24/354423.full.pdf
Methylation
- Minfi: R package for working with 450k methylation arrays
- D3M: two-sample test of differential methylation from distribution-valued data https://cran.r-project.org/web/packages/D3M/D3M.pdf
- Network-based approach to discovering epigenetic "modules" that can be associated with gene expression: http://bioinformatics.oxfordjournals.org/content/30/16/2360.long
- Filtering probes using technical replicates https://cran.r-project.org/web/packages/CpGFilter/index.html
- Imputation of genome-wide methylation: http://wanglab.ucsd.edu/star/LR450K/
- Tutorial for analysis using bioconductor packages: http://biorxiv.org/content/biorxiv/early/2016/05/25/055087.full.pdf
- Normalization
  - Probe design bias correction: https://www.bioconductor.org/packages/release/bioc/html/ENmix.html
- DMR calling
  - SeqLM: https://github.com/raivokolde/seqlm
- PyMAP: http://aminmahpour.github.io/PyMAP/
- Interactive exploration: http://bioconductor.org/packages/release/bioc/html/shinyMethyl.html
- eFORGE: identify cell type-specific signals in differentially methylated positions (mostly important for blood-based EWAS) http://eforge.cs.ucl.ac.uk/
- Model for EWAS using probe signal intensities: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1347-4
- Reference-based tissue deconvolution (three different algorithms): https://github.com/sjczheng/EpiDISH
- Glint pipeline (qc, EWAS, population structure): http://glint-epigenetics.readthedocs.io/en/latest/
  - Bayesian extension of Refactor cell type heterogeneity correction that incorporates experimentally determined cell counts: https://github.com/cozygene/BayesCCE
- Meffil: https://github.com/perishky/meffil/

Motif/TFBS

Network Analysis

Population genetics

Population history from unphased whole-genomes: https://github.com/popgenmethods/smcpp
Management of allele-frequency data: https://grenaud.github.io/glactools/
Calculation of LD from genome-wide genotype likelihoods: https://github.com/fgvieira/ngsLD
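
As a rough illustration of what an LD calculation involves, the sketch below computes r² between two biallelic sites as the squared Pearson correlation of hard-called genotype dosages (0, 1, 2). This is a simplification and an assumption on my part: ngsLD works from genotype likelihoods rather than called genotypes, and the function name `ld_r2` is hypothetical.

```python
def ld_r2(dosages_a, dosages_b):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2).

    Hypothetical helper for illustration; not the ngsLD algorithm.
    """
    if len(dosages_a) != len(dosages_b) or not dosages_a:
        raise ValueError("dosage vectors must be non-empty and the same length")
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    # Sums of cross-products and squares; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # monomorphic site: r^2 is undefined
    return cov * cov / (var_a * var_b)
```

Each list holds one dosage per sample, in the same sample order for both sites. With low-coverage data, hard calls are unreliable, which is the case likelihood-based tools are meant to address.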

Prediction

Gene expression
- Predicting gene expression from methylation: http://arxiv.org/abs/1603.08386
QTL
- Imputation of summary statistics in multi-ethnic cohorts: http://dleelab.github.io/jepegmix/
- Causal variant identification: http://genetics.cs.ucla.edu/caviar/
eQTL
- Imputation of gene expression from genotype data: https://github.com/hriordan/PrediXcan
Genetic risk
- AnnoPred: https://github.com/yiminghu/AnnoPred
Pathogenicity
- Disease-specific https://github.com/samesense/pathopredictor
- SVs https://github.com/gersteinlab/SVFX
- https://data.broadinstitute.org/alkesgroup/LDSCORE/Kim_annotboost/
- Frequency conservation score http://bioinfo.cnic.es/FCS
Causal variant
- Ensemble method: https://github.com/gifford-lab/EnsembleExpr/
- eCAVIAR: probability that a variant is causal for both QTL and eQTL http://genetics.cs.ucla.edu/caviar/
- qCat: https://github.com/dleelab/qcat
- Disease-associated risk variants: https://sites.google.com/site/emorydivan/
- Predicting gene targets from GWAS summary statistics https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4979185/
- Orion: https://github.com/igm-team/orion-public
Chromatin States
- GenoSTAN: http://bioconductor.org/packages/release/bioc/html/STAN.html
- R package for predicting chromatin states from histone marks across conditions https://github.com/ataudt/chromstaR
- Rule-based: http://www.statehub.org/
- Hierarchical HMM: https://github.com/gcyuan/diHMM
- Basenji: https://github.com/calico/basenji
- CMint: https://bitbucket.org/roygroup/cmint
Enhancers
- Prediction of enhancer strength from sequence http://bioinformatics.hitsz.edu.cn/iEnhancer-2L
- Prediction of core cell type-specific TFs from super enhancers https://bitbucket.org/young_computation/crcmapper
- Prediction of superenhancers https://github.com/asntech/improse
- Deep learning-based:
Coding mutations
- Predict mutation effects from sequence covariation: https://marks.hms.harvard.edu/evmutation/
- Predicting loss-of-function from expression: https://www.nature.com/articles/s41467-017-00443-5
- Prediction of missense variants: https://www.biorxiv.org/content/early/2018/02/02/259390
- Disease-specific functional prediction https://sites.google.com/site/emorydivan/
- Impact of coding SNPs: http://pantherdb.org/tools/csnpScoreForm.jsp
- Predict disease risk from GWAS summary statistics: https://github.com/yiminghu/AnnoPred
Regulatory variants/TF binding
- LedPred: prediction of regulatory sequences from ChIP-seq https://github.com/aitgon/LedPred
- GERV: prediction of regulatory variants that affect TF binding http://gerv.csail.mit.edu/
- Score variant deleteriousness: http://cadd.gs.washington.edu/
- BASSET: prediction of sequence activity https://github.com/davek44/Basset
- DanQ: hybrid convolutional and recurrent neural network model for predicting the function of DNA de novo from sequence http://github.com/uci-cbcl/DanQ
- DeepSequence: Generative model https://github.com/debbiemarkslab/DeepSequence
  - Can also be used for sequence clustering
- LINSIGHT
- Protein binding affinity: https://bitbucket.org/wenxiu/sequence-shape.git
- Change in local frustration index: https://github.com/gersteinlab/frustration
- TFImpute: multi-task learning from ChIP-seq data across factors and tissues to impute TF binding for an unassayed tissue/factor combination: https://bitbucket.org/feeldead/tfimpute
- PPI https://www.ncbi.nlm.nih.gov/research/mutabind/index.fcgi/
- Multiple instance learning of TF binding: http://www.cs.utsa.edu/~jruan/MIL/
- Cell type-specific: https://github.com/uci-cbcl/FactorNet
- Predict TF binding from ATAC-Seq using deep neural network: https://github.com/hiranumn/deepatac
- CentiSNPs: http://genome.grid.wayne.edu/centisnps/
- Benchmark: https://github.com/Oncostat/BenchmarkNCVTools
- PathoPredictor: https://github.com/samesense/pathopredictor
- Cell-type agnostic regulatory activity prediction from ENCODE: http://screen.encodeproject.org/
- Segway functional scores: https://noble.gs.washington.edu/proj/encyclopedia/
- Predict effect of variants on epigenetic factors (chromatin accessibility, histone marks, etc): http://deepfigv.mssm.edu/downloads.html
- HaploReg: http://www.broadinstitute.org/mammals/haploreg/haploreg.php
- Consensus approaches:
  - PredictSNP2: https://loschmidt.chemi.muni.cz/predictsnp2/
  - PRVCS: http://jjwanglab.org/PRVCS/
- Queryor: http://queryor.cribi.unipd.it/cgi-bin/queryor/mainpage.pl
- Using epigenomic data https://github.com/mulin0424/cepip
- Tissue-specific effects on expression: https://github.com/FunctionLab/ExPecto
Methylation
- CpGenie: predicts methylation from sequence, predicts impact of variants on nearby methylation https://github.com/gifford-lab/CpGenie
Chromatin accessibility
- SCM: http://scm.csail.mit.edu/
TFBS
- Predict TF binding affinities using open chromatin + PWMs: https://github.com/schulzlab/TEPIC
- LR-DNAse: TFBS prediction using features derived from DNase-seq: http://biorxiv.org/content/early/2016/10/24/082594
Classification of cis-regulatory modules: https://github.com/weiyangedward/IMMBoost
Imputation: https://github.com/tdurham86/PREDICTD
Variable selection for random forest: https://github.com/jomayer/SMuRF

Security

Secret sharing schemes for keeping patient ID secret while being able to reconstruct it from other identifiers: https://pdfs.semanticscholar.org/0307/48be9820512a1d5582351552bd0452711296.pdf
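
To make the idea of secret sharing concrete, here is a minimal sketch of the classic Shamir (k, n) threshold scheme over a prime field. I have not checked that the linked paper uses this exact construction, so treat it as a generic example; the names `split_secret`, `recover_secret`, and `PRIME` are hypothetical, and the patient ID is assumed to be encoded as an integer.

```python
import secrets

PRIME = 2**127 - 1  # Mersenne prime; the secret must be smaller than this

def split_secret(secret, n_shares, threshold):
    """Split an integer secret into n_shares points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= n_shares:
        raise ValueError("need 1 <= threshold <= n_shares")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

A short ID can be turned into an integer with something like `int.from_bytes(b"MRN-0042", "big")` before splitting. Fewer than `threshold` shares reveal nothing about the secret, which is the property that makes this family of schemes attractive for identifiers.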

Sequencing Protocols

Single cell
- Simultaneous RNA and methylation (and inference of CNV): http://www.nature.com/cr/journal/vaop/ncurrent/full/cr201623a.html
- Simultaneous RNA and methylation (scM&T-seq): http://www.nature.com/nmeth/journal/v13/n3/full/nmeth.3728.html
- Simultaneous RNA and protein measurements: http://www.sciencedirect.com/science/article/pii/S2211124715014345
CRISPR-DS: Uses CRISPR/Cas9 excision of target sequences to improve library quality relative to hybrid capture: https://www.biorxiv.org/content/early/2017/10/23/207027

Simulation

Variant annotation

WGSA pipeline https://sites.google.com/site/jpopgen/wgsa/
SnpEff: http://snpeff.sourceforge.net/
Normalization of SNP IDs from literature: https://github.com/rockt/SETH
HAIL: https://hail.is/
Tissue-specific https://github.com/kevinVervier/TiSAn
VCF visualization with Circos plot: http://legolas.ariel.ac.il/~tools/CircosVCF/
HGVS
- This is a standard notation for variants http://varnomen.hgvs.org/
- Toolkit for working with HGVS-formatted variants: https://github.com/biocommons/hgvs
Integration of multiple predictive measures of mutation deleterious effect: https://www.ncbi.nlm.nih.gov/research/snpdelscore/
Varsome: database that aggregates variant information from many sources
- https://varsome.com/
- Provides a REST API that can be queried with dbSNP or HGVS IDs
Constrained coding regions: https://github.com/quinlan-lab/ccrhtml
Database of human chromosomal fragile sites: http://webs.iiitd.edu.in/raghava/humcfs
FunSeq2: http://funseq2.gersteinlab.org
Deep learning-based: https://www.biorxiv.org/content/early/2017/12/18/235655.1
Search engine for indexing and intersecting with large numbers of annotation BED/VCF files: https://github.com/ryanlayer/giggle/
Clinical annotation: https://github.com/ding-lab/CharGer
Gene panel/disease-specific annotation and filtering: https://www.ebi.ac.uk/gene2phenotype/g2p_vep_plugin
Cravat: https://github.com/KarchinLab/open-cravat/tree/master/cravat
Individual, per-gene pathogenicity scores: https://github.com/UoS-HGIG/GenePy
Pharmacogenomic: https://github.com/PharmGKB/PharmCAT
SV
- AnnotSV: http://lbgi.fr/AnnotSV/
- Curation/annotation by depth: https://github.com/brentp/duphold
Protein sequence and structure annotation for variants https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/DisaStr/GetPage.pl?varmap=TRUE
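
For a feel of what HGVS-formatted variants look like (see the HGVS entry above), the sketch below parses only the simplest case, a single-nucleotide DNA substitution such as `NM_004006.2:c.4375C>T`. The regular expression and the names `Substitution` and `parse_substitution` are my own assumptions for illustration; the real grammar is far larger (deletions, duplications, intronic offsets, protein changes), so a dedicated toolkit is the better choice for real data.

```python
import re
from typing import NamedTuple, Optional

class Substitution(NamedTuple):
    accession: str   # versioned reference sequence, e.g. "NM_004006.2"
    coord_type: str  # "c" (coding), "g" (genomic), "m" (mitochondrial), "n" (non-coding)
    position: int
    ref: str
    alt: str

# Covers only <accession>:<type>.<position><ref>><alt> with plain integer positions.
_SUBSTITUTION = re.compile(
    r"^(?P<acc>[A-Z]{2}_\d+\.\d+):(?P<type>[cgmn])\."
    r"(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_substitution(variant: str) -> Optional[Substitution]:
    """Return the parsed substitution, or None if the string is not a simple substitution."""
    m = _SUBSTITUTION.match(variant.strip())
    if m is None:
        return None
    return Substitution(m["acc"], m["type"], int(m["pos"]), m["ref"], m["alt"])
```

Returning `None` instead of raising makes it easy to filter a mixed list of variant strings and hand everything this sketch cannot handle to a full parser.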

Sequence Analysis

Google Genomics R API: https://followthedata.wordpress.com/2015/02/05/notes-on-genomics-apis-2-google-genomics-api/
k-mer counting
- khmer:
  - https://github.com/ged-lab/khmer
  - https://docs.google.com/presentation/d/1biQmLkwPlCOA56mNZdUAiXa1OyGE0qyvI2nvU0qlOIE/mobilepresent?pli=1&slide=id.p58
- KCMBT: https://github.com/abdullah009/kcmbt_mt
- KAT https://github.com/TGAC/KAT
- ntCard https://github.com/bcgsc/ntCard
- KMC3 https://github.com/refresh-bio/KMC
- Encoding counts with kmer data: https://github.com/lzhLab/kmcEx
- Gerbil: GPU accelerated, outputs count table https://github.com/uni-halle/gerbil
- Toolkit for working with unique kmers: https://github.com/shenwei356/unikmer
- Across large number of datasets https://github.com/kamimrcht/REINDEER
k-mer hashing: https://github.com/czbiohub/kmer-hashing
Density-based clustering: https://bitbucket.org/jerry00/densitycut_dev
chopBAI: segment BAM indexes by region for faster access https://github.com/DecodeGenetics/chopBAI
GFFutils: http://daler.github.io/gffutils/
R package for aligned chromatin-oriented sequencing data: https://cran.r-project.org/web/packages/Pasha/
MMR: resolve multi-mapping reads https://github.com/ratschlab/mmr
BAMQL: query language for extracting reads from BAM files https://github.com/BoutrosLaboratory/bamql
SAMBAMBA: samtools alternative
BAMtools: another samtools alternative, plus some additional tools https://github.com/pezmaster31/bamtools/wiki
DeepTools: more useful SAM/BAM operations http://deeptools.readthedocs.io/en/latest/content/list_of_tools.html
Genomic intervals
- bedtools http://bedtools.readthedocs.io/en/latest/
- bedops alternative/additional BED operations http://bedops.readthedocs.io/en/latest/
- comparison of interval sets: https://github.com/deepstanding/seqpare
- Intersection and visualization of multiple gene/region sets: https://bitbucket.org/CBGR/intervene
Normalization:
- GLScale: https://github.com/allenxhcao/glscale
- QSmooth: https://github.com/stephaniehicks/qsmooth
- ORNA: https://github.com/SchulzLab/ORNA
- Suquan: https://github.com/jpvert/suquan
Demultiplexing/deduping barcoded reads w/ UMIs: http://gbcs.embl.de/portal/tiki-index.php?page=Je
Hardware acceleration of alignment (requires $5k FPGA module): https://github.com/BilkentCompGen/GateKeeper
Detection and removal of barcode swapping (an issue on Illumina sequencers that use patterned flow cells): https://github.com/MarioniLab/BarcodeSwapping2017
Data processing pipelines for many types of omics data, built using NextFlow and Singularity: https://github.com/c-guzman/cipher-workflow-platform
Index and fetch data from BGZF-compressed files.
Liftover
- Segment liftover: https://github.com/baudisgroup/segment-liftover
- Liftover alignments from one reference to another: https://github.com/CMU-SAFARI/AirLift
Recover unaligned reads: https://github.com/VCCRI/Scavenger
Mantis: index of raw-read datasets for efficient and exact queries https://github.com/splatlab/mantis
Machine learning method for determining sequence identity: https://github.com/TulsaBioinformaticsToolsmith/FASTCAR
Progressive MSA (DNA or protein) with indel evolution: https://github.com/acg-team/ProPIP
Tools to create ENCODE blacklists, and pre-computed blacklists for model organisms: https://github.com/Boyle-Lab/Blacklist
Server for reference sequences and indices: https://github.com/databio/refgenie
Coverage
- Fast coverage estimate from BAM index: https://github.com/brentp/goleft/tree/master/indexcov
- Quantification and normalization of coverage peaks: https://github.com/ncbi/BAMscale
- Library for distributed coverage calculation using Spark https://github.com/ZSI-Bio/bdg-sequila
Additional tools built on GATK4 https://bimberlab.github.io/DISCVRSeq/toolDoc/index.html
Query compressed GFF/GTF files https://github.com/qm2/gpress
Sort reads and remove duplicates (for compression and to accelerate alignment): https://github.com/bioinformatics-polito/BioSeqZip
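
As a baseline for what the k-mer counting tools listed above compute, here is a naive exact counter of canonical k-mers (a k-mer and its reverse complement are counted together). This is only a sketch with hypothetical function names; the listed tools use much more memory-efficient data structures and I am not reproducing any of their algorithms.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = set("ACGT")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def count_kmers(sequence, k):
    """Count canonical k-mers in a sequence, skipping windows with non-ACGT bases."""
    if k < 1:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= _DNA:  # skip windows containing N or other ambiguity codes
            counts[canonical(kmer)] += 1
    return counts
```

Memory grows with the number of distinct k-mers, so this approach is only practical for small inputs or small k.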

Model 3D chromosome structure from Hi-C contact maps + optional FISH constraints: https://github.com/yjzhang/FISH_MDS.jl, https://github.com/yjzhang/3DC-Browser
Predict enhancer targets: https://github.com/shwhalen/targetfinder
Predicting TADs from histone modifications: https://cb.utdallas.edu/CITD/index.htm#ajax=home
Resolution enhancement: http://dna.cs.miami.edu/HiCNN/
Binary storage format for interaction matrix: https://github.com/mirnylab/cooler
Call CNVs and translocations https://github.com/parklab/HiNT

Alignment-free functional binning and abundance estimation: https://github.com/snz20/carnelian
Map reads against redundant databases: https://bitbucket.org/genomicepidemiology/kma
Taxonomic classification using pseudoalignment: https://github.com/mreppell/Karp
Iterative bloom filter and Yara mapper for rapid updating and mapping to large metagenomic databases: https://gitlab.com/pirovc/dream_yara/
Spaced seed hashing: https://bitbucket.org/samu661/fish/overview
SNV calling: https://github.com/tseemann/snippy

Alternative/differential nucleosome positioning: https://github.com/airoldilab/cplate

QC
- Bias detection
  - https://www.degruyter.com/downloadpdf/j/jib.2017.14.issue-3/jib-2017-0025/jib-2017-0025.pdf
  - This method is targeted to RNA-Seq, but should work for CNV calling from Exome data as well
- Deconvolution of signal (allele frequency, gene expression, etc) from heterogeneous tissue data: https://github.com/tedroman/WSCUnmix
- Remove strand bias artifacts (e.g. FFPE samples): https://github.com/mikdio/SOBDetector
Personal reference editor: https://github.com/precisionomics/PRESM
Sentieon includes a step for "co-realignment", which appears to merge the tumor and normal samples and perform indel realignment on the merged sample, harmonizing indel alignments so that calls are more comparable after variant calling.
Variant Calling
- Review of somatic variant callers: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5852328/
- FMI abstracts on variant calling from tumor-only sequencing
  - http://cancerres.aacrjournals.org/content/74/19_Supplement/1893
  - http://ascopubs.org/doi/abs/10.1200/jco.2015.33.15_suppl.11084
- Sarek: https://github.com/SciLifeLab/Sarek
- Comparison of 9 somatic variant callers
  - http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0151664
  - From 2016; should be updated with MuTect2 and Strelka2
- Lancet: new somatic variant caller based on genome graphs
  - https://github.com/nygenome/lancet
  - Non-commercial license
- Some recommended tools/pipelines for calling low-frequency variants:
- Goby uses deep learning for somatic variant calling: http://campagnelab.org/software/goby/
- MC3/PanCan
  - https://www.synapse.org/#!Synapse:syn7214402/wiki/406010
  - Variant call merger: https://github.com/covingto/pancanmafmerge
- Ensemble method with random forest integration of results from multiple callers:
  - https://github.com/skandlab/SMuRF
  - Paper: https://www.biorxiv.org/content/early/2018/02/23/270413
- http://bioinform.github.io/somaticseq/
- TNScope: Sentieon's pipeline
  - https://www.biorxiv.org/content/early/2018/01/19/250647
  - Includes machine learning-based variant filtering
- xATLAS
  - https://github.com/jfarek/xatlas
  - Uses a logistic regression model based on truth sets (e.g. GIAB) to assign quality scores
- Consensus calling from multiple callers: https://www.itb.cnr.it/web/bioinformatics/isma
- Multi-sample caller based on mpileup: https://github.com/IARCbioinfo/needlestack
- Deep learning
  - https://github.com/jingmeng-bioinformatics/DeepSSV
  - NeuSomatic (non-commercial license)
- Optimized for FFPE: https://www.ciscall.org/en/ciscall7.html
- Optimized for rare variants in ultra-deep sequencing http://github.com/cibiobcg/abemus
- Learn mutation profile to refine variant calls: https://github.com/bmannakee/batcaver
- Reference-free: https://github.com/izaak-coleman/GeDi
CNA/SV
- Infer the tumor cell fraction of SV: https://github.com/mcmero/SVclone
- A study of using shallow WGS on low-quality FFPE samples for CNV calling used QDNASeq for segmentation
  - https://bioconductor.org/packages/release/bioc/html/QDNAseq.html
  - Adjusts for GC content and mappability
  - Found 7M reads was optimal
  - https://www.biorxiv.org/content/early/2017/12/08/231480
- Germline CNVs may be predictive of cancer susceptibility: https://www.biorxiv.org/content/early/2018/04/17/303339
- Allele-specific SV calling: https://github.com/ma-compbio/Weaver
  - Phasing SVs with noisy data: http://sci-hub.tw/http://liebertpub.com/doi/pdf/10.1089/cmb.2018.0022
- SEG: https://github.com/ZhaoS-Lab/SE
- Multi-sample (from same patient)
  - HATCHet
  - CNValidator
- Estimate copy number and purity: https://github.com/keyuan/ccube
- Allele-specific CNA: https://github.com/wheelerb/hsegHMM
- DEFOR: https://github.com/drzh/defor
- Joint tumor-normal detection: https://github.com/yongzhuang/TumorCNV
- Normalization: https://github.com/baudisgroup/mecan4cna
- Allele-specific CNV calling for multi-sample tumor/normal https://github.com/imgag/ClinCNV
- Background correction prior to CNV calling: https://github.com/mskilab/dryclean
- https://github.com/ExpressionAnalysis/CNV_Radar
Variant Filtering
- TNER: https://github.com/ctDNA/TNER
- This is a cool approach for constructing a ChIP-Seq control from integration of multiple public data sets: https://www.biorxiv.org/content/early/2018/03/08/278762. It strikes me that a similar approach could be used to generate matched controls for tumor-only samples.
- Features used to design a RF classifier: https://www.biorxiv.org/content/biorxiv/early/2019/06/13/670687.full.pdf
- SGZ method used by FoundationOne for filtering germline variants from tumor-only variant calls https://github.com/jsunfmi/SGZ
Tumor Purity
- IchorCNA
  - Estimates tumor cell fraction from low-pass WGS; should probably be adaptable to deep targeted sequencing
  - https://github.com/broadinstitute/ichorCNA
- Accurity (tumor-normal): https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty043/4827681
- Control for tumor purity differences in methylation analysis: https://www.biorxiv.org/content/early/2018/01/16/248781
- PyLOH: https://github.com/uci-cbcl/PyLOH
- All-FIT: https://github.com/KhiabanianLab/All-FIT
- TPES: using SNVs https://bitbucket.org/l0ka/tpes/src/master/
- From methylation:
  - https://github.com/xjtu-omics/MEpurity
  - https://github.com/mwsill/RFpurify
- DeepPurity https://www.biorxiv.org/content/10.1101/805135v1.full.pdf (no open-source code)
Joint calling
- CN, tumor purity, and LOH https://bioconductor.org/packages/release/bioc/html/PureCN.html
- purity, ploidy, and allele-specific CN https://github.com/Crick-CancerGenomics/ascat
- purity, ploidy, SV, and CN in tumor-normal: https://github.com/hartwigmedical/gridss-purple-linx
Tumor Mutational Burden
- http://moat.gersteinlab.org/
- ecTMB: https://github.com/bioinform/ecTMB
Variant Annotation
- CHASM and SNV-Box
  - CHASM predicts functional significance of somatic missense mutations
  - SNV-Box is a database of pre-computed features of all possible amino acid substitutions in the human genome
  - http://wiki.chasmsoftware.org/index.php/Main_Page
  - Not free for commercial use, but it was developed at JHU and Vogelstein is a co-author on at least some of the publications
- Orchid: framework for annotation and machine learning of cancer variants: https://github.com/wittelab/orchid
- Discover potential cancer driver elements
  - DriverPower: https://github.com/smshuai/DriverPower
  - NetSig: http://www.lagelab.org/projects/
  - CanDrA: http://bioinformatics.mdanderson.org/main/CanDrA#CanDrA
- Predict cancer subtype using biological networks based on somatic variants: https://www.biorxiv.org/content/early/2017/12/03/228031
- PCGR: Annotate cancer genomes with clinical information:
  - https://github.com/sigven/pcgr
  - Integrates many public databases
- Benchmark of methods to identify pathogenic non-coding variation: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty008/4798701?redirectedFrom=fulltext
- FASMIC: http://bioinformatics.mdanderson.org/main/FASMIC
- Database of TCGA alternative splicing events https://www.cell.com/cancer-cell/fulltext/S1535-6108(18)30306-4#secsectitle0080
- Consensus driver gene identification pipeline and database of cancer driver genes in TCGA: https://www.cell.com/cell/fulltext/S0092-8674(18)30237-X
- TCGA pathogenic signalling pathways: https://www.cell.com/cell/fulltext/S0092-8674(18)30359-3?cid=tw%26p
Predict cancer type/signature
- From mutations: https://www.nature.com/articles/nature12477
- From SNV+SV calls: https://www.biorxiv.org/content/early/2018/02/18/267500
- Machine learning method to predict cancer type from variant calls: https://www.biorxiv.org/content/early/2017/11/05/214494
- Prediction of inherited susceptibility to 20 different cancer types: http://www.pnas.org/content/115/6/1322.short
- Computing polygenic risk scores and association with cancers: https://www.biorxiv.org/content/early/2017/10/19/205021
- Predict driver mutations: https://github.com/KarchinLab/CHASMplus
- https://bioconductor.org/packages/release/bioc/html/SparseSignatures.html
Clonality
- QuantumClone: https://cran.r-project.org/web/packages/QuantumClone/index.html
- MOSEM: https://static-content.springer.com/esm/art%3A10.1038%2Fs41588-018-0128-6/MediaObjects/41588_2018_128_MOESM1_ESM.pdf
- Benchmarking, plus description of CloneFinder: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty469/5040314
- Tracking variants across longitudinal samples: https://github.com/ChristofferFlensburg/SuperFreq
MSI
- Sonics: https://github.com/kzkedzierska/sonics
- MSIsensor-pro https://github.com/xjtu-omics/msisensor-pro
- https://github.com/YixuanWang1120/ELMSI (requires tumor-normal)
Neoantigen
- pVACtools: https://github.com/griffithlab/pVACtools
- NeoPredPipe: https://github.com/MathOnco/NeoPredPipe
- https://github.com/neoanthill/neoANT-HILL
- Using phased somatic mutations https://github.com/pdxgx/neoepiscope
ctDNA/cfDNA
- TNER: background error reduction https://github.com/ctDNA/TNER
Other
- Bioconductor package with various methods for working with TCGA data: https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html
- maftools: https://bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html
- Sarek: end-to-end variant calling pipeline in NextFlow https://github.com/SciLifeLab/Sarek
- Multi-regional sequencing: Mutect2 in multi-sample mode with correction step performs best https://www.biorxiv.org/content/10.1101/655605v1
- Subclonality-aware association: https://github.com/Sun-lab/SMASH
- Regional mutation density classifies cancer type: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006953

General Programming Resources

Generate data type-specific compression formats: http://algorithms.cnag.cat/cargo/
Protocol buffers
- Protobuf: fast cross-language/platform serialization of fixed-format messages https://developers.google.com/protocol-buffers/
- Cap'n Proto: https://capnproto.org/
IDEs
- VisualStudio (now free): https://www.visualstudio.com/vs/visual-studio-mac/
Parameter optimization
- Spearmint: https://github.com/beamandrew/Spearmint
Diff/patch/merge for data tables
- js/python: http://paulfitz.github.io/daff/
- R: https://github.com/edwindj/daff
Pipe output of a shell command to a website (unfortunately can't be used in NIH HPC since nodes do not allow network connections): https://seashells.io/
Debugging
- Sandsifter: Fuzzer https://github.com/xoreaxeaxeax/sandsifter
- LivePython: https://github.com/agermanidis/livepython
JSON Diff: http://www.jsondiff.com/
Google C++ and python libraries: https://github.com/abseil/abseil-py
Share a terminal as a web page: https://github.com/yudai/gotty
Spec for CSV files with metadata: http://docs.astropy.org/en/stable/api/astropy.io.ascii.Ecsv.html

C/C++

kmers
- Streaming kmer counting https://github.com/bcgsc/ntCard
- kmer bloom filters: https://github.com/Kingsford-Group/kbf
High-performance concurrent hash table (C++11): https://github.com/efficient/libcuckoo
BWT that incorporates genetic variants: https://github.com/iqbal-lab/gramtools
Fast bitwise operations on nucleotide sequences: https://github.com/kloetzl/biotwiddle
C++ interface to htslib, BWA-MEM, and Fermi (local assembly) (would be useful to build python bindings for this): https://github.com/walaj/SeqLib
Minimal perfect hash function: fast, large data sets https://github.com/rizkg/BBHash
Counting quotient filter: https://github.com/splatlab/cqf
Succinct de Bruijn Graphs: http://alexbowe.com/succinct-debruijn-graphs/
Blazing signature filter: fast pairwise comparison of e.g. gene expression matrices https://github.com/PNNL-Comp-Mass-Spec/bsf-py
rr debugger: record and playback executions http://rr-project.org/
clipp: command line parser https://github.com/muellan/clipp
klib: generic library with fast implementations of striped Smith-Waterman and other NGS-related algorithms https://github.com/attractivechaos/klib
Sketch data structures (includes Python bindings): https://github.com/dnbaker/sketch
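
Several entries above (k-mer Bloom filters, counting quotient filters, sketches) are approximate membership structures. As a reminder of the basic idea, here is a minimal Bloom filter, written in Python for brevity even though this section lists C/C++ libraries. It is a generic sketch under my own assumptions (BLAKE2b plus double hashing, a hypothetical `BloomFilter` class) and does not reproduce the k-mer-specific tricks of any listed library.

```python
import hashlib

class BloomFilter:
    """Set-like structure with no false negatives and a tunable false-positive rate."""

    def __init__(self, n_bits, n_hashes):
        if n_bits < 1 or n_hashes < 1:
            raise ValueError("n_bits and n_hashes must be positive")
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit indices from the two 64-bit halves of one digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

After `bf = BloomFilter(1 << 20, 3)` and `bf.add("ACGTACGT")`, the test `"ACGTACGT" in bf` is always true, while a k-mer that was never added may occasionally also test true. More bits per stored item lowers that false-positive rate.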

R

MonetDB - embeddable column-store DB with R integration (MonetDB.R)

Access to Google spreadsheets from R: https://github.com/jennybc/googlesheets
Advanced table formatting in knitr: https://github.com/renkun-ken/formattable
Access data frames using SQL: sqldf package
Developing R packages: https://github.com/jtleek/rpackages
Work with PDF files: https://cran.r-project.org/web/packages/pdftools/index.html
Language-agnostic data frame format: https://github.com/wesm/feather
Make for R: https://github.com/richfitz/maker
Find root of current package: https://krlmlr.github.io/here/
Work with genomic ranges using dplyr-like syntax.

Python

Data structures/formats
- Chunked, compressed, disk-based arrays: https://github.com/alimanfoo/zarr
- Tabular data
  - Working with tabular data: http://docs.python-tablib.org/en/latest/
  - Apache Arrow
  - Pandas
  - Vaex: https://github.com/vaexio/vaex
- GFA: https://github.com/ggonnella/gfapy/tree/master/gfapy
A regular expression scanner: https://github.com/mitsuhiko/python-regex-scanner
API for interacting with databases: https://github.com/kennethreitz/records
RStudio-like IDE for Python: https://www.yhat.com/products/rodeo
Debugging
- boltons.debugutils: the boltons package has many useful utilities, but debugutils stands out: one line of code lets you drop into a debugger on a signal (e.g. Ctrl-C): https://boltons.readthedocs.io/en/latest/debugutils.html
- Web-based debugger: https://github.com/alexmojaki/birdseye
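
The "drop into a debugger on signal" idea can be sketched with the standard library alone. This is not the boltons implementation (boltons documents its own helper, `pdb_on_signal`); the function name `pdb_on_sigint` below is hypothetical, and the sketch assumes it is called from the main thread, which is where Python allows signal handlers to be installed.

```python
import pdb
import signal


def pdb_on_sigint():
    """Install a SIGINT (Ctrl-C) handler that opens pdb at the interrupted frame.

    Returns the previously installed handler so the caller can restore it.
    """
    previous = signal.getsignal(signal.SIGINT)

    def handler(signum, frame):
        # Restore the old handler first so a second Ctrl-C behaves normally.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGINT, restore)
        # Start the debugger in the frame that was running when the signal hit.
        pdb.Pdb().set_trace(frame)

    signal.signal(signal.SIGINT, handler)
    return previous
```

Calling `pdb_on_sigint()` once near the top of a long-running script is enough; pressing Ctrl-C then opens a pdb prompt instead of raising `KeyboardInterrupt`.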
Stats
- Non-negative matrix factorization: https://github.com/ccshao/nimfa
- Statsmodels: http://www.statsmodels.org/stable/index.html
- R formulas in python: https://github.com/pydata/patsy
pyrasite: code injection into running applications
Dexy: documentation
Event loops for asynchronous programming
- curio
- gevent
Fast microservices: https://github.com/squeaky-pl/japronto
dill: alternative serialization
arrow: alternative to datetime
Template for scientific projects: https://github.com/uwescience/shablona
FFI
- Python-to-Go transpiler: https://github.com/google/grumpy
- Calling Rust libraries from python: https://medium.com/@caulagi/complementing-python-with-rust-657a8cb3d066#.6in8v0bte
- pyjamas: javascript bridge
Disabling Python garbage collection speeds up programs: https://engineering.instagram.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172#.ri55nyjdu (only safe when the lifecycle of every object is straightforward, and thus reference counting is sufficient for memory management)
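
The caveat about reference counting can be illustrated with a small sketch. It assumes CPython, where `gc.disable()` switches off only the cyclic collector and reference counting keeps running; the `Node` class and `demo` function are illustrative names, not part of the linked article.

```python
import gc
import weakref


class Node:
    """Tiny object that can optionally point at another Node."""

    def __init__(self):
        self.other = None


def demo():
    gc.disable()  # turn off the cyclic collector; reference counting still runs
    try:
        # Acyclic object: freed as soon as its last reference goes away.
        plain = Node()
        plain_ref = weakref.ref(plain)
        del plain
        freed_by_refcount = plain_ref() is None

        # Cyclic pair: the reference counts never reach zero, so it lingers.
        a, b = Node(), Node()
        a.other, b.other = b, a
        cycle_ref = weakref.ref(a)
        del a, b
        cycle_leaked = cycle_ref() is not None

        # An explicit collection still works while automatic GC is disabled.
        gc.collect()
        cycle_freed_after_collect = cycle_ref() is None
    finally:
        gc.enable()
    return freed_by_refcount, cycle_leaked, cycle_freed_after_collect
```

If a program never creates reference cycles, the second case does not arise and disabling the collector costs nothing; if it does create cycles, they accumulate until `gc.collect()` is called explicitly.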
Easily implementing function proxies/wrappers: http://wrapt.readthedocs.io/en/latest/
Cache system: https://bitbucket.org/zzzeek/dogpile.cache
Parse TOML (an enhanced config-file spec): https://github.com/uiri/toml
Web scraping
- Goose: https://github.com/grangier/python-goose
- Scrapy: https://scrapy.org/
- Requestium: https://github.com/tryolabs/requestium
Visualize python code execution time as a heatmap in a Jupyter notebook: https://github.com/csurfer/pyheatmagic
Graph genomes
- High-level API for working with GFA (i.e. graph genomes): https://github.com/ggonnella/gfapy
- https://github.com/jambler24/GenGraph
Fast and memory-efficient object counting: https://github.com/RaRe-Technologies/bounter
Fast string matching: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
Trace system calls in multiprocessing context: https://github.com/pinterest/ptracer
skbio: http://scikit-bio.org/docs/0.4.1/index.html
Extract keywords from text: https://github.com/vi3k6i5/flashtext
Working with time-series data: https://github.com/RJT1990/pyflux
Probability distributions and other related functions: https://pomegranate.readthedocs.io/en/latest/
Global optimization: https://github.com/chrisstroemel/Simple
HPAT: framework for automatically parallelizing NumPy and pandas code using MPI https://github.com/IntelLabs/hpat
CuPy: NumPy-like implementation of GPU-accelerated array operations https://github.com/cupy/cupy
Command-line interface to download and manage reference genomes: https://github.com/simonvh/genomepy
API for working with HGVS-formatted variants: https://github.com/biocommons/hgvs
B+ Tree: dict-like key-value store backed by on-disk storage https://github.com/NicolasLM/bplustree
Chunked, compressed ndarrays http://zarr.readthedocs.io/en/latest/index.html
Easy colored/styled printing in command line apps https://github.com/UltimateHackers/hue
Launching a subprocess in a pseudo-terminal (e.g. for accepting passwords) https://github.com/pexpect/ptyprocess
Launch an editor from python: https://github.com/fmoo/python-editor
VCF:
- ReadVCF: http://alimanfoo.github.io/2017/06/14/read-vcf.html
- Faster alternative to pyvcf: https://github.com/brentp/cyvcf2
BAM:
- Pure python BAM parser https://github.com/betteridiot/bamnostic
Alternative to pandas for larger-than-memory datasets: https://vaex.readthedocs.io/en/latest/
Load PLINK data into Dask arrays https://github.com/limix/pandas-plink

HPC

Command Line (OSX/Linux)

Databases

CockroachDB: based on Google's distributed database https://github.com/cockroachdb/cockroach
In-memory key-value db in python: https://github.com/paxos-bankchain/subconscious

Reproducibility/Containerization

Other

Automated system for sequencing core facilities using microservices architecture: https://www.biorxiv.org/content/early/2017/11/06/214858
Aether manages bidding for AWS and Azure credits: http://aether.kosticlab.org/

Statistics/Machine Learning

Methods/algorithms

Python

Tools for data mining, NLP, ML, network analysis: http://www.clips.ua.ac.be/pages/pattern
Linear mixed-model solver https://github.com/nickFurlotte/pylmm
Exact inference in HMMs with large number of hidden states: https://github.com/regevs/factorial_hmm
scikit-learn-compatible package for graph statistics: https://graspy.neurodata.io/
Library for preprocessing various NGS formats for input into Keras or other DL libraries: https://github.com/BIMSBbioinfo/janggu

Deep Learning

Web APIs

Data Sets

Text classification

Visualization

Genome browsers

Networks

Phylogenetic trees

R

Sushi: publication-quality figures from multiple data types https://github.com/dphansti/Sushi
EpiViz: visualization of epigenomic data sets in R, http://epiviz.github.io/
Interaction data: https://github.com/kcakdemir/HiCPlotter
Network visualization from R using vis.js: http://dataknowledge.github.io/visNetwork/
Differential expression from RNA-seq: http://bioconductor.org/packages/devel/bioc/html/Glimma.html

Python

Javascript

Examples

Cool interactive visualization of differential data: http://graphics.wsj.com/gender-pay-gap/

Publication/Archiving

Licenses: http://choosealicense.com/licenses/
APIs for literature search: http://libguides.mit.edu/apis
Assessing credit for bioinformatics software authorship: http://depsy.org/
Icons for presentations: http://cameronneylon.net/blog/some-slides-for-granting-permissions-or-not-in-presentations/
Continuous analysis: https://github.com/greenelab/continuous_analysis
A nice template for bootstrapping your own academic website using GitHub: https://academicpages.github.io/
Desktop app for searching/managing papers: https://github.com/codeforscience/sciencefair
Recommend papers to cite based on your bibliography: http://labs.semanticscholar.org/citeomatic/
Word choices:
Prepare papers for any journal format: https://typeset.io/
Slideboards: Mashup of slides and FAQ to explain a publication http://slideboard.herokuapp.com/
Generate manuscripts on GitHub: https://github.com/greenelab/manubot-rootstock
General-purpose archival format: https://github.com/fair-research/bdbag/

Code/Data sharing

Journals

Writing

Scripts to identify "bad smells" in science writing (would want to convert this to python): http://matt.might.net/articles/shell-scripts-for-passive-voice-weasel-words-duplicates/
Collaborative writing
- DraftIn: https://draftin.com/documents
- Quip: http://quip.com
- Canvas: https://usecanvas.com/
Templates:
- InDesign template for preprint: https://github.com/cleterrier/ManuscriptTools/blob/master/biorxiv_template_CC2015.indd
- Rmarkdown templates for journal articles https://github.com/rstudio/rticles
- GitHub template for authoring papers: https://github.com/peerj/paper-now
- Two-column rmarkdown template: http://dirk.eddelbuettel.com/code/pinp.html
Convert between (R)markdown and IPython notebooks: https://github.com/aaren/notedown
Publication-quality tables for markdown: https://cran.r-project.org/web/packages/flextable/index.html
Pandoc scholar: https://github.com/pandoc-scholar/pandoc-scholar
- Nice paper showing example of generating manuscripts for two different journals: https://peerj.com/preprints/2648.pdf
Convert DOI to Bibtex entry
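
One way to do this conversion is DOI content negotiation: requesting `https://doi.org/<DOI>` with an `Accept: application/x-bibtex` header asks the DOI's registration agency for a BibTeX record. The sketch below uses only the standard library; the function names are illustrative, and whether a given DOI returns BibTeX depends on its registration agency supporting that format.

```python
import urllib.parse
import urllib.request


def bibtex_request(doi):
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi, timeout=10):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_bibtex("10.1186/s12859-016-1349-2")` would request a BibTeX entry for the transcription-factor classification paper listed under Databases.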
Online equation editor: https://www.mathcha.io/
Simplified markup language for equations: http://asciimath.org/
Editoria: https://editoria.pub/
Texture: https://github.com/substance/texture
Scientific writing template for R Markdown: https://rstudio.github.io/radix/
Extension of Markdown that handles diagrams, equations, etc. https://casual-effects.com/markdeep/

Posters

Slides

CV

Promising methods without software implementation

DWGFAKER/biotools
