This repository is a curated collection of awesome bio-foundation-model papers, covering domains including DNA, RNA, gene, protein, single-cell, and multimodal data.
🌟 If you'd like to add a paper or resource, feel free to submit a pull request or open an issue.
The logos below indicate each paper's publisher, with a link to the paper.
Papers are listed in chronological order.
- (Enformer) Effective gene expression prediction from sequence by integrating long-range interactions
- MoDNA: motif-oriented pre-training for DNA language model
- Obtaining genetics insights from deep learning via explainable artificial intelligence
- Deciphering microbial gene function using natural language processing
- To Transformers and Beyond: Large Language Models for the Genome
- GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
- (GeneFormer) Transfer learning enables predictions in network biology
- DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
- Species-aware DNA language modeling
- DNA language models are powerful predictors of genome-wide variant effects
- GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
- GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences
- EpiGePT: a Pretrained Transformer model for epigenomics
- DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks
- The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
- DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
- DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models
- Single-cell gene expression prediction from DNA sequence at large contexts
- Genomic language model predicts protein co-regulation and function
- Clustering and classification methods for single-cell RNA-sequencing data
- EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction
- scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data
- scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data
- (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions
- (RNABERT) Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning
- (MRM-BERT) Prediction of Multiple Types of RNA Modifications via Biological Language Model
- (SpliceBERT) Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction
- UNI-RNA: universal pre-trained models revolutionize RNA research
- A Deep Dive into Single-Cell RNA Sequencing Foundation Models
- xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data
- trRosettaRNA: automated prediction of RNA 3D structure with transformer network
- (RfamGen) Deep generative design of RNA family sequences
- (RNA-MFM) Multiple sequence alignment-based RNA language model and its application to structural inference
- RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks
- ATOM-1: A Foundation Model for RNA Structure and Function Built on Chemical Mapping Data
- RNAformer: A Simple Yet Effective Deep Learning Model for RNA Secondary Structure Prediction
- scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data
- CellPLM: Pre-training of Cell Language Model Beyond Single Cells
- Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis
- A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
- ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations
- Parapred: antibody paratope prediction using convolutional and recurrent neural networks
- (UniRep) Unified rational protein engineering with sequence-based deep representation learning
- (TAPE) Evaluating Protein Transfer Learning with TAPE
- ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing
- (ESM) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
- (ESM-1v) Language models enable zero-shot prediction of the effects of mutations on protein function
- (IgLM) Generative language modeling for antibody design
- (ESM-2 & ESMFold) Language models of protein sequences at the scale of evolution enable accurate structure prediction
- ProtGPT2 is a deep unsupervised language model for protein design
- ProteinBERT: a universal deep-learning model of protein sequence and function
- OntoProtein: Protein Pretraining With Gene Ontology Embedding
- (AntiBERTa) Deciphering the language of antibodies using self-supervised learning
- AbLang: an antibody language model for completing antibody sequences
- ProGen2: Exploring the boundaries of protein language models
- SaProt: Protein Language Modeling with Structure-aware Vocabulary
- Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
- (GearNet) Protein Representation Learning by Geometric Structure Pretraining
- ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
- Efficient evolution of human antibodies from general protein language models
- (CARP) Convolutions are competitive with transformers for protein sequence pretraining
- (HelixFold-Single) A method for multiple-sequence-alignment-free protein structure prediction using a protein language model
- (ABGNN) Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design
- (ReprogBert) Reprogramming Pretrained Language Models for Antibody Sequence Infilling
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
- ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing
- (ESM-GearNet) A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
- (ProteinINR) Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning
- (CaLM) Codon language embeddings provide strong signals for use in protein engineering
- (DeepGo) Protein function prediction as approximate semantic entailment
- PLMSearch: Protein language model powers accurate and fast sequence search for remote homology
- Genomic language model predicts protein co-regulation and function
Protein foundation models are a hot topic; more papers can be found in the related repositories listed below.
- (DCell) Using deep learning to model the hierarchical structure and function of a cell
- scVAE: variational auto-encoders for single-cell gene expression data
- A sandbox for prediction and integration of DNA, RNA, and proteins in single cells
- scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data
- scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data
- (scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics
- (DPI) Modeling and analyzing single-cell multimodal data with deep parametric inference
- (ScPROTEIN) A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding
- scGPT: toward building a foundation model for single-cell multi-omics using generative AI
- scMulan: a multitask generative pre-trained language model for single-cell analysis
- scDiffusion: conditional generation of high-quality single-cell data using diffusion model
- Cell2Sentence: Teaching Large Language Models the Language of Biology
- CellPLM: Pre-training of Cell Language Model Beyond Single Cells
- Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis
- Pretraining model for biological sequence data
- BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models
- A sandbox for prediction and integration of DNA, RNA, and proteins in single cells
- Galactica: A Large Language Model for Science
- BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
- DARWIN Series: Domain Specific Large Language Models for Natural Science
- (scMoFormer) Single-Cell Multimodal Prediction via Transformers
- BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
- ChatCell: Facilitating Single-Cell Analysis with Natural Language
- (Evo) Sequence modeling and design from molecular to genome scale with Evo
- Learning the protein language: Evolution, structure, and function
- Protein Language Models and Structure Prediction: Connection and Progression
- Progress and Opportunities of Foundation Models in Bioinformatics
- Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models
- Applications of transformer-based language models in bioinformatics: a survey
- Best practices for single-cell analysis across modalities
- Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
- Scientific Large Language Models: A Survey on Biological & Chemical Domains
- Large language models in bioinformatics: applications and perspectives
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
- Machine-Learning-for-Proteins
- Scientific-Large-Language-Models
- Awesome-Scientific-Language-Models
- Awesome-Biomolecule-Language-Cross-Modeling
- Awesome-Deep-Learning-Single-Cell-Papers
- Awesome-Protein-Representation-Learning
- Awesome-Molecule-Protein-Pretrain-Papers
- Awesome-Pretrain-on-Molecules
- Awesome-Molecule-Text
- Aswesome-Deep-Learing-for-Life-Sciences
- Awesome-Docking
- Awesome-Biology
- Awesome-Single-Cell
- Awesome-Computational-Biology
- Awesome-Bioinformatics
- Awesome-Multi-Omics
- LLM4ScientificDiscovery
- LLM4Mol