A curated list of prompt-based papers in computer vision and vision-language learning.
- Task tag, e.g.,
- Abbreviation tag, e.g.,
- Characteristic tag: a characteristic that makes the paper unique, e.g.,
- Bold font: marks pioneering work that may have contributed to the prevalence of visual prompting.
This section collects papers that prompt pretrained vision foundation models (e.g., ViT) for parameter-efficient adaptation.
-
DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning [paper] [code]
-
AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [paper] [code]
-
Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [paper] [code]
NeurIPS 2022
-
P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting [paper] [code]
-
Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models [paper] [code]
-
Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation [paper]
AAAI 2023
-
LPT: Long-tailed Prompt Tuning for Image Classification [paper]
ICLR 2023
-
Diversity-Aware Meta Visual Prompting [paper] [code]
CVPR 2023
-
Semantic Prompt for Few-Shot Image Recognition [paper]
-
Visual Prompt Tuning for Generative Transfer Learning [paper] [code]
-
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching [paper] [code]
-
Images Speak in Images: A Generalist Painter for In-Context Visual Learning [paper] [code]
-
PIVOT: Prompting for Video Continual Learning [paper]
-
Learning Expressive Prompting With Residuals for Vision Transformers [paper]
-
BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning [paper] [code]
-
A-La-Carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting [paper]
-
Understanding and Improving Visual Prompting: A Label-Mapping Perspective [paper] [code]
CVPR 2023
-
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning [paper] [code]
CVPR 2023
-
Explicit Visual Prompting for Low-Level Structure Segmentations [paper] [code]
low-level segmentation
ArXiv Papers
-
Exploring Visual Prompts for Adapting Large-Scale Models [paper] [code]
arXiv 2022/03
-
Vision Transformer Adapter for Dense Predictions [paper] [code]
-
Convolutional Bypasses Are Better Vision Transformer Adapters [paper] [code]
arXiv 2022/07
-
Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets [paper]
arXiv 2022/08
-
Prompt Vision Transformer for Domain Generalization [paper]
-
Prompt-Matched Semantic Segmentation [paper]
-
Visual Prompt Tuning for Test-time Domain Adaptation [paper]
arXiv 2022/10
-
Visual Prompting for Adversarial Robustness [paper]
-
Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers [paper] [code]
arXiv 2022/10
-
Towards a Unified View on Visual Parameter-Efficient Transfer Learning [paper] [code]
This section collects papers that prompt pretrained vision-language foundation models (e.g., CLIP) for parameter-efficient adaptation.
-
Learning Transferable Visual Models From Natural Language Supervision [paper] [code]
-
Learning to Prompt for Vision-Language Models [paper] [code]
-
Prompt Distribution Learning [paper]
CVPR 2022
-
Conditional Prompt Learning for Vision-Language Models [paper] [code]
-
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [paper] [code]
-
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [paper] [code]
-
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [paper] [code]
-
A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [paper]
-
Expanding Language-Image Pretrained Models for General Video Recognition [paper] [code]
-
Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [paper] [code]
ECCV 2022
-
OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [paper]
-
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [paper] [code]
NeurIPS 2022
-
Learning to Decompose Visual Features with Latent Textual Prompts [paper]
ICLR 2023
-
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models [paper] [code]
ICLR 2023
-
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization [paper] [code]
-
Open-Set Fine-Grained Retrieval Via Prompting Vision-Language Evaluator [paper]
-
Multimodal Prompting With Missing Modalities for Visual Recognition [paper] [code]
CVPR 2023
-
Efficient Multimodal Fusion Via Interactive Prompting [paper]
-
Hierarchical Prompt Learning for Multi-Task Learning [paper] [code]
-
Text-Visual Prompting for Efficient 2D Temporal Video Grounding [paper]
-
VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval [paper] [code]
-
MaPLe: Multi-modal Prompt Learning [paper] [code]
CVPR 2023
-
Texts as Images in Prompt Tuning for Multi-Label Image Recognition [paper] [code]
-
Vita-CLIP: Video and Text Adaptive CLIP Via Multimodal Prompting [paper] [code]
-
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [paper] [code]
CVPR 2023
-
$\pi$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation [paper] [code]
ICML 2023
-
POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models [paper] [code]
ICML 2023
-
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization [paper] [code]
ArXiv Papers
-
Colorful Prompt Tuning for Pre-trained Vision-Language Models [paper]
-
ActionCLIP: A New Paradigm for Video Action Recognition [paper] [code]
-
CLIP-Adapter: Better Vision-Language Models with Feature Adapters [paper] [code]
arXiv 2021/10
-
Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization [paper]
-
Prompting Visual-Language Models for Efficient Video Understanding [paper] [code]
-
Unsupervised Prompt Learning for Vision-Language Models [paper] [code]
-
Prompt-aligned Gradient for Prompt Tuning [paper] [code]
arXiv 2022/05
-
Parameter-Efficient Image-to-Video Transfer Learning [paper]
-
DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [paper]
-
Prompt Tuning for Generative Multimodal Pretrained Models [paper] [code]
-
Prompt Tuning with Soft Context Sharing for Vision-Language Models [paper]
-
CPL: Counterfactual Prompt Learning for Vision and Language Models [paper] [code]
-
Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [paper] [code]
-
Unified Vision and Language Prompt Learning [paper]
arXiv 2022/10
-
Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation [paper]
-
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition [paper] [code]
arXiv 2023/04
A language-interactable prompter develops zero/few-shot capabilities by prompting several independent foundation models (VLMs, LLMs, VMs, etc.) through a language interface. One of its most attractive applications is the multimodal chatbot.
-
Multimodal Few-Shot Learning with Frozen Language Models [paper]
-
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [paper] [code]
-
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning [paper] [code]
-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [paper] [code]
ArXiv Papers
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models [paper] [code] [demo]
arXiv 2023/03
-
Chameleon: Plug-and-play compositional reasoning with large language models [paper] [code]
arXiv 2023/04
-
Flamingo: a Visual Language Model for Few-Shot Learning [paper]
-
Language Models Can See: Plugging Visual Controls in Text Generation [paper] [code]
-
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [paper]
-
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [paper]
The goal of vision-language instruction tuning is to train a model that can effectively understand instructions for general-purpose multimodal tasks.
-
Visual Instruction Tuning [paper] [code] [demo]
arXiv 2023/04
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models [paper] [code] [demo]
arXiv 2023/04
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning [paper] [code] [demo]
arXiv 2023/05
-
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans [paper] [code]
arXiv 2023/05
-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [paper] [code]
arXiv 2023/05
-
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [paper] [code] [demo]
arXiv 2023/09
- PromptPapers: a comprehensive curated list of prompting papers (mainly in natural language processing).
- Awesome Multimodal Assistant: a curated list for vision-language instruction tuning and LLM-based chatbots.