Transformer-in-Vision

A list of recent Transformer-based computer vision papers. If you notice a paper that is missing, please open an issue or a pull request.

Last updated: 2022/03/11

Update log

2021/April - updated with all recent Transformer-in-Vision papers.
2021/May - updated with all recent Transformer-in-Vision papers.
2021/June - updated with all recent Transformer-in-Vision papers.
2021/July - updated with all recent Transformer-in-Vision papers.
2021/August - updated with all recent Transformer-in-Vision papers.
2021/September - updated with all recent Transformer-in-Vision papers.
2021/October - updated with all recent Transformer-in-Vision papers.
2021/November - updated with all recent Transformer-in-Vision papers.
2021/December - updated with all recent Transformer-in-Vision papers.
2022/January - updated with all recent Transformer-in-Vision papers.

Survey:

(arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work. [Paper]
(arXiv 2022.02) Transformers in Medical Image Analysis: A Review. [Paper]
(arXiv 2022.01) Transformers in Medical Imaging: A Survey. [Paper], [Awesome]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks. [Paper]
(arXiv 2022.01) Video Transformers: A Survey. [Paper]
(arXiv 2021.11) A Survey of Visual Transformers. [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training. [Paper]
(arXiv 2021.03) Multi-modal Motion Prediction with Stacked Transformers. [Paper], [Code]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision. [Paper]
(arXiv 2020.09) Efficient Transformers: A Survey. [Paper]
(arXiv 2021.01) Transformers in Vision: A Survey. [Paper]

Recent Papers

Action

(CVPR'20) Speech2Action: Cross-modal Supervision for Action Recognition, [Paper]
(arXiv 2021.01) Trear: Transformer-based RGB-D Egocentric Action Recognition, [Paper]
(arXiv 2021.02) Relaxed Transformer Decoders for Direct Action Proposal Generation, [Paper], [Code]
(arXiv 2021.04) TubeR: Tube-Transformer for Action Detection, [Paper]
(arXiv 2021.04) Few-Shot Transformation of Common Actions into Time and Space, [Paper]
(arXiv 2021.05) Temporal Action Proposal Generation with Transformers, [Paper]
(arXiv 2021.06) End-to-end Temporal Action Detection with Transformer, [Paper], [Code]
(arXiv 2021.06) OadTR: Online Action Detection with Transformers, [Paper], [Code]
(arXiv 2021.07) Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition, [Paper]
(arXiv 2021.07) VideoLightFormer: Lightweight Action Recognition using Transformers, [Paper]
(arXiv 2021.07) Long Short-Term Transformer for Online Action Detection, [Paper]
(arXiv 2021.07) STAR: Sparse Transformer-based Action Recognition, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper], [Code]
(arXiv 2021.10) Lightweight Transformer in Federated Setting for Human Activity Recognition, [Paper]
(arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
(arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
(arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2021.11) Evaluating Transformers for Lightweight Action Recognition, [Paper]
(arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
(arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
(arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
(arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
(arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]
(arXiv 2022.03) Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition, [Paper]
(arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]
(arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]

Active Learning

(arXiv 2021.06) Visual Transformer for Task-aware Active Learning, [Paper], [Code]

Anomaly Detection

(arXiv 2021.04) VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization, [Paper]
(arXiv 2021.04) Inpainting Transformer for Anomaly Detection, [Paper]

Assessment

(arXiv 2021.01) Transformer for Image Quality Assessment, [Paper], [Code]
(arXiv 2021.04) Perceptual Image Quality Assessment with Transformers, [Paper], [Code]
(arXiv 2021.08) No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency, [Paper], [Code]
(arXiv 2021.08) MUSIQ: Multi-scale Image Quality Transformer, [Paper], [Code]
(arXiv 2021.10) VTAMIQ: Transformers for Attention Modulated Image Quality Assessment, [Paper]
(arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]

Captioning

(arXiv 2021.01) CPTR: Full Transformer Network for Image Captioning, [Paper]
(arXiv 2021.01) Dual-Level Collaborative Transformer for Image Captioning, [Paper]
(arXiv 2021.02) VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining, [Paper], [Code]
(arXiv 2021.06) Semi-Autoregressive Transformer for Image Captioning, [Paper], [Code]
(arXiv 2021.08) Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers, [Paper]
(arXiv 2021.08) Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning, [Paper], [Code]
(arXiv 2021.09) Bornon: Bengali Image Captioning with Transformer-based Deep learning approach, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.10) Geometry Attention Transformer with Position-aware LSTMs for Image Captioning, [Paper]
(arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
(arXiv 2021.11) SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
(arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
(arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
(arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
(arXiv 2022.02) Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation, [Paper], [Code]
(arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]

Classification (Backbone)

(ICLR'21) Modeling Long-Range Interactions without Attention, [Paper], [Code]
(ECCV'20) Feature Pyramid Transformer, [Paper], [Code]
(ICLR'21) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, [Paper], [Code]
(arXiv 2020.06) Visual Transformers: Token-based Image Representation and Processing for Computer Vision, [Paper]
(arXiv 2020.12) Training data-efficient image transformers & distillation through attention, [Paper], [Code]
(arXiv 2021.01) Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, [Paper], [Code]
(arXiv 2021.01) Bottleneck Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.02) Conditional Positional Encodings for Vision Transformers, [Paper], [Code]
(arXiv 2021.02) Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, [Paper], [Code]
(arXiv 2021.03) Transformer in Transformer, [Paper], [Code]
(arXiv 2021.03) ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, [Paper], [Code]
(arXiv 2021.03) Scalable Visual Transformers with Hierarchical Pooling, [Paper]
(arXiv 2021.03) Incorporating Convolution Designs into Visual Transformers, [Paper]
(arXiv 2021.03) DeepViT: Towards Deeper Vision Transformer, [Paper], [Code]
(arXiv 2021.03) Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, [Paper], [Code]
(arXiv 2021.03) Understanding Robustness of Transformers for Image Classification, [Paper]
(arXiv 2021.03) Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, [Paper]
(arXiv 2021.03) CvT: Introducing Convolutions to Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Rethinking Spatial Dimensions of Vision Transformers, [Paper], [Code]
(arXiv 2021.03) Going deeper with Image Transformers, [Paper]
(arXiv 2021.04) LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference, [Paper]
(arXiv 2021.04) On the Robustness of Vision Transformers to Adversarial Examples, [Paper]
(arXiv 2021.04) LocalViT: Bringing Locality to Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Escaping the Big Data Paradigm with Compact Transformers, [Paper], [Code]
(arXiv 2021.04) Co-Scale Conv-Attentional Image Transformers, [Paper], [Code]
(arXiv 2021.04) Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet, [Paper], [Code]
(arXiv 2021.04) So-ViT: Mind Visual Tokens for Vision Transformer, [Paper]
(arXiv 2021.04) Multiscale Vision Transformers, [Paper], [Code]
(arXiv 2021.04) Visformer: The Vision-friendly Transformer, [Paper], [Code]
(arXiv 2021.04) Improve Vision Transformers Training by Suppressing Over-smoothing, [Paper], [Code]
(arXiv 2021.04) Twins: Revisiting the Design of Spatial Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.04) ConTNet: Why not use convolution and transformer at the same time, [Paper], [Code]
(arXiv 2021.05) Rethinking the Design Principles of Robust Vision Transformer, [Paper], [Code]
(arXiv 2021.05) Vision Transformers are Robust Learners, [Paper], [Code]
(arXiv 2021.05) Rethinking Skip Connection with Layer Normalization in Transformers and ResNets, [Paper], [Code]
(arXiv 2021.05) Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead, [Paper]
(arXiv 2021.05) Intriguing Properties of Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Aggregating Nested Transformers, [Paper]
(arXiv 2021.05) ResT: An Efficient Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, [Paper], [Code]
(arXiv 2021.06) When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, [Paper]
(arXiv 2021.06) Container: Context Aggregation Network, [Paper]
(arXiv 2021.06) TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification, [Paper]
(arXiv 2021.06) KVT: k-NN Attention for Boosting Vision Transformers, [Paper]
(arXiv 2021.06) MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens, [Paper], [Code]
(arXiv 2021.06) Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length, [Paper]
(arXiv 2021.06) Less is More: Pay Less Attention in Vision Transformers, [Paper]
(arXiv 2021.06) FoveaTer: Foveated Transformer for Image Classification, [Paper]
(arXiv 2021.06) An Attention Free Transformer, [Paper]
(arXiv 2021.06) Glance-and-Gaze Vision Transformer, [Paper], [Code]
(arXiv 2021.06) RegionViT: Regional-to-Local Attention for Vision Transformers, [Paper]
(arXiv 2021.06) Chasing Sparsity in Vision Transformers: An End-to-End Exploration, [Paper], [Code]
(arXiv 2021.06) Scaling Vision Transformers, [Paper]
(arXiv 2021.06) CAT: Cross Attention in Vision Transformer, [Paper], [Code]
(arXiv 2021.06) On Improving Adversarial Transferability of Vision Transformers, [Paper], [Code]
(arXiv 2021.06) Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight, [Paper]
(arXiv 2021.06) Patch Slimming for Efficient Vision Transformers, [Paper]
(arXiv 2021.06) Transformer in Convolutional Neural Networks, [Paper], [Code]
(arXiv 2021.06) ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias, [Paper], [Code]
(arXiv 2021.06) Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, [Paper]
(arXiv 2021.06) Refiner: Refining Self-attention for Vision Transformers, [Paper]
(arXiv 2021.06) Reveal of Vision Transformers Robustness against Adversarial Attacks, [Paper]
(arXiv 2021.06) Efficient Training of Visual Transformers with Small-Size Datasets, [Paper]
(arXiv 2021.06) Delving Deep into the Generalization of Vision Transformers under Distribution Shifts, [Paper]
(arXiv 2021.06) BEIT: BERT Pre-Training of Image Transformers, [Paper], [Code]
(arXiv 2021.06) XCiT: Cross-Covariance Image Transformers, [Paper], [Code]
(arXiv 2021.06) How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, [Paper], [Code1], [Code2]
(arXiv 2021.06) Exploring Vision Transformers for Fine-grained Classification, [Paper], [Code]
(arXiv 2021.06) TokenLearner: What Can 8 Learned Tokens Do for Images and Videos, [Paper]
(arXiv 2021.06) Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers, [Paper], [Code]
(arXiv 2021.06) VOLO: Vision Outlooker for Visual Recognition, [Paper], [Code]
(arXiv 2021.06) IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers, [Paper], [Project]
(arXiv 2021.06) PVTv2: Improved Baselines with Pyramid Vision Transformer, [Paper], [Code]
(arXiv 2021.06) Early Convolutions Help Transformers See Better, [Paper]
(arXiv 2021.06) Multi-Exit Vision Transformer for Dynamic Inference, [Paper]
(arXiv 2021.07) Augmented Shortcuts for Vision Transformers, [Paper]
(arXiv 2021.07) Improving the Efficiency of Transformers for Resource-Constrained Devices, [Paper]
(arXiv 2021.07) CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, [Paper], [Code]
(arXiv 2021.07) Focal Self-attention for Local-Global Interactions in Vision Transformers, [Paper]
(arXiv 2021.07) Cross-view Geo-localization with Evolving Transformer, [Paper]
(arXiv 2021.07) What Makes for Hierarchical Vision Transformer, [Paper]
(arXiv 2021.07) Efficient Vision Transformers via Fine-Grained Manifold Distillation, [Paper]
(arXiv 2021.07) Vision Xformers: Efficient Attention for Image Classification, [Paper]
(arXiv 2021.07) Long-Short Transformer: Efficient Transformers for Language and Vision, [Paper]
(arXiv 2021.07) Feature Fusion Vision Transformer for Fine-Grained Visual Categorization, [Paper]
(arXiv 2021.07) Local-to-Global Self-Attention in Vision Transformers, [Paper], [Code]
(arXiv 2021.07) Visual Parser: Representing Part-whole Hierarchies with Transformers, [Paper], [Code]
(arXiv 2021.07) CMT: Convolutional Neural Networks Meet Vision Transformers, [Paper]
(arXiv 2021.07) Combiner: Full Attention Transformer with Sparse Computation Cost, [Paper]
(arXiv 2021.07) A Comparison of Deep Learning Classification Methods on Small-scale Image Data set: from Convolutional Neural Networks to Visual Transformers, [Paper]
(arXiv 2021.07) Contextual Transformer Networks for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) Rethinking and Improving Relative Position Encoding for Vision Transformer, [Paper], [Code]
(arXiv 2021.08) CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention, [Paper], [Code]
(arXiv 2021.08) Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer, [Paper]
(arXiv 2021.08) Vision Transformer with Progressive Sampling, [Paper], [Code]
(arXiv 2021.08) Armour: Generalizable Compact Self-Attention for Vision Transformers, [Paper]
(arXiv 2021.08) ConvNets vs. Transformers: Whose Visual Representations are More Transferable, [Paper]
(arXiv 2021.08) Mobile-Former: Bridging MobileNet and Transformer, [Paper]
(arXiv 2021.08) Do Vision Transformers See Like Convolutional Neural Networks, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers, [Paper], [Code]
(arXiv 2021.09) Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, [Paper]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.10) MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
(arXiv 2021.10) Token Pooling in Visual Transformers, [Paper]
(arXiv 2021.10) NViT: Vision Transformer Compression and Parameter Redistribution, [Paper]
(arXiv 2021.10) Adversarial Token Attacks on Vision Transformers, [Paper]
(arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
(arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
(arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
(arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
(arXiv 2021.11) Can Vision Transformers Perform Convolution, [Paper]
(arXiv 2021.11) Sliced Recursive Transformer, [Paper], [Code]
(arXiv 2021.11) Hybrid BYOL-ViT: Efficient approach to deal with small Datasets, [Paper]
(arXiv 2021.11) Are Transformers More Robust Than CNNs, [Paper], [Code]
(arXiv 2021.11) iBOT: Image BERT Pre-Training with Online Tokenizer, [Paper]
(arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
(arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
(arXiv 2021.11) Are Vision Transformers Robust to Patch Perturbations, [Paper]
(arXiv 2021.11) Discrete Representations Strengthen Vision Transformer Robustness, [Paper]
(arXiv 2021.11) Zero-Shot Certified Defense against Adversarial Patches with Vision Transformers, [Paper]
(arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
(arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
(arXiv 2021.11) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
(arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
(arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
(arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
(arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
(arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
(arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
(arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
(arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
(arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
(arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
(arXiv 2021.12) Dynamic Token Normalization Improves Vision Transformer, [Paper], [Code]
(arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
(arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
(arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
(arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
(arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
(arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
(arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
(arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation, [Paper]
(arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
(arXiv 2021.12) SimViT: Exploring a Simple Vision Transformer with sliding windows, [Paper], [Code]
(arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
(arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
(arXiv 2021.12) Augmenting Convolutional networks with attention-based aggregation, [Paper]
(arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
(arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
(arXiv 2021.12) Stochastic Layers in Vision Transformers, [Paper]
(arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
(arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
(arXiv 2022.01) QuadTree Attention for Vision Transformers, [Paper], [Code]
(arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
(arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
(arXiv 2022.01) Patches Are All You Need?, [Paper], [Code]
(arXiv 2022.01) Convolutional Xformers for Vision, [Paper], [Code]
(arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
(arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
(arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
(arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
(arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
(arXiv 2022.02) How Do Vision Transformers Work, [Paper], [Code]
(arXiv 2022.02) Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations, [Paper], [Code]
(arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]
(arXiv 2022.02) Learning to Merge Tokens in Vision Transformers, [Paper]
(arXiv 2022.02) Auto-scaling Vision Transformers without Training, [Paper], [Code]
(arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]
(arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]
(arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]
(arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]
(arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]
(arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]
(arXiv 2022.03) EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers, [Paper]
(arXiv 2022.03) WaveMix: Resource-efficient Token Mixing for Images, [Paper], [Code]

Completion

(arXiv 2021.03) High-Fidelity Pluralistic Image Completion with Transformers, [Paper], [Code]
(arXiv 2021.04) TFill: Image Completion via a Transformer-Based Architecture, [Paper], [Code]

Compression

(arXiv 2021.10) Accelerating Framework of Transformer by hardware Design and Model Compression Co-Optimization, [Paper]
(arXiv 2021.11) Transformer-based Image Compression, [Paper]
(arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper], [Code]
(arXiv 2021.12) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
(arXiv 2022.01) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
(arXiv 2022.02) Entroformer: A Transformer-based Entropy Model for Learned Image Compression, [Paper], [Code]

Crowd

(arXiv 2021.04) TransCrowd: Weakly-Supervised Crowd Counting with Transformer, [Paper], [Code]
(arXiv 2021.05) Boosting Crowd Counting with Transformers, [Paper], [Code]
(arXiv 2021.08) Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer, [Paper]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper], [Code]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
(arXiv 2022.01) Scene-Adaptive Attention Network for Crowd Counting, [Paper]
(arXiv 2022.03) An End-to-End Transformer Model for Crowd Localization, [Paper]

Depth

(arXiv 2020.11) Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2021.03) Vision Transformers for Dense Prediction, [Paper], [Code]
(arXiv 2021.03) Transformers Solve the Limited Receptive Field for Monocular Depth Prediction, [Paper], [Code]
(arXiv 2021.09) Improving 360 Monocular Depth Estimation via Non-local Dense Prediction Transformer and Joint Supervised and Self-supervised Learning, [Paper]
(arXiv 2022.02) GLPanoDepth: Global-to-Local Panoramic Depth Estimation, [Paper]
(arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
(arXiv 2022.03) OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion, [Paper]

Deepfake Detection

(arXiv 2021.02) Deepfake Video Detection Using Convolutional Vision Transformer, [Paper]
(arXiv 2021.04) Deepfake Detection Scheme Based on Vision Transformer and Distillation, [Paper]
(arXiv 2021.04) M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection, [Paper]
(arXiv 2021.07) Combining EfficientNet and Vision Transformers for Video Deepfake Detection, [Paper]
(arXiv 2021.08) Video Transformer for Deepfake Detection with Incremental Learning, [Paper]
(arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]

Dehazing

(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]

Denoising

(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2022.03) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper], [Code]

Detection

(ECCV'20) DETR: End-to-End Object Detection with Transformers, [Paper], [Code]
(ICLR'21) Deformable DETR: Deformable Transformers for End-to-End Object Detection, [Paper], [Code]
(CVPR'21) UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, [Paper], [Code]
(arXiv 2020.11) End-to-End Object Detection with Adaptive Clustering Transformer, [Paper]
(arXiv 2020.11) Rethinking Transformer-based Set Prediction for Object Detection, [Paper]
(arXiv 2020.12) Toward Transformer-Based Object Detection, [Paper]
(arXiv 2020.12) DETR for Pedestrian Detection, [Paper]
(arXiv 2021.01) Line Segment Detection Using Transformers without Edges, [Paper]
(arXiv 2021.01) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper]
(arXiv 2021.02) GEM: Glare or Gloom, I Can Still See You – End-to-End Multimodal Object Detector, [Paper]
(arXiv 2021.03) SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving, [Paper]
(arXiv 2021.03) Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning, [Paper]
(arXiv 2021.03) TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.03) CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, [Paper]
(arXiv 2021.03) DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention, [Paper]
(arXiv 2021.04) Efficient DETR: Improving End-to-End Object Detector with Dense Prior, [Paper]
(arXiv 2021.04) Points as Queries: Weakly Semi-supervised Object Detection by Points, [Paper]
(arXiv 2021.04) CAT: Cross-Attention Transformer for One-Shot Object Detection, [Paper]
(arXiv 2021.05) Content-Augmented Feature Pyramid Network with Light Linear Transformers, [Paper]
(arXiv 2021.06) You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection, [Paper]
(arXiv 2021.06) DETReg: Unsupervised Pretraining with Region Priors for Object Detection, [Paper], [Project]
(arXiv 2021.06) Oriented Object Detection with Transformer, [Paper]
(arXiv 2021.06) MODETR: Moving Object Detection with Transformers, [Paper]
(arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
(arXiv 2021.07) OODformer: Out-Of-Distribution Detection Transformer, [Paper]
(arXiv 2021.07) Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers, [Paper], [Code]
(arXiv 2021.08) Fast Convergence of DETR with Spatially Modulated Co-Attention, [Paper], [Code]
(arXiv 2021.08) PSViT: Better Vision Transformer via Token Pooling and Attention Sharing, [Paper]
(arXiv 2021.08) Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation), [Paper], [Code]
(arXiv 2021.08) Conditional DETR for Fast Training Convergence, [Paper], [Code]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios, [Paper]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
(arXiv 2021.10) ViDT: An Efficient and Effective Fully Transformer-based Object Detector, [Paper], [Code]
(arXiv 2021.10) DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries, [Paper], [Code]
(arXiv 2021.10) CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector, [Paper], [Code]
(arXiv 2021.11) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper], [Code]
(arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
(arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
(arXiv 2021.11) Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity, [Paper], [Code]
(arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
(arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
(arXiv 2021.12) BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird-Eye-View, [Paper]
(arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
(arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
(arXiv 2022.01) DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR, [Paper], [Code]
(arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]
(arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]
(arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]

Face

(arXiv 2021.03) Face Transformer for Recognition, [Paper]
(arXiv 2021.03) Robust Facial Expression Recognition with Convolutional Visual Transformers, [Paper]
(arXiv 2021.04) TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection, [Paper]
(arXiv 2021.04) Facial Attribute Transformers for Precise and Robust Makeup Transfer, [Paper]
(arXiv 2021.04) Learning to Cluster Faces via Transformer, [Paper]
(arXiv 2021.06) VidFace: A Full-Transformer Solver for Video Face Hallucination with Unaligned Tiny Snapshots, [Paper]
(arXiv 2021.06) MViT: Mask Vision Transformer for Facial Expression Recognition in the wild, [Paper]
(arXiv 2021.06) Shuffle Transformer with Feature Alignment for Video Face Parsing, [Paper]
(arXiv 2021.06) A Latent Transformer for Disentangled and Identity-Preserving Face Editing, [Paper], [Code]
(arXiv 2021.07) ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer, [Paper]
(arXiv 2021.08) FT-TDR: Frequency-guided Transformer and Top-Down Refinement Network for Blind Face Inpainting, [Paper]
(arXiv 2021.08) Learning Fair Face Representation With Progressive Cross Transformer, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
(arXiv 2021.09) TANet: A new Paradigm for Global Face Super-resolution via Transformer-CNN Aggregation Network, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition, [Paper]
(arXiv 2021.11) FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations, [Paper]
(arXiv 2021.12) SSAT: A Symmetric Semantic-Aware Transformer Network for Makeup Transfer and Removal, [Paper], [Code]
(arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
(arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
(arXiv 2022.01) RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs, [Paper]
(arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]

Few-shot Learning

(arXiv 2021.04) Rich Semantics Improve Few-shot Learning, [Paper], [Code]
(arXiv 2021.04) Few-Shot Segmentation via Cycle-Consistent Transformer, [Paper]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper]
(arXiv 2021.12) Cost Aggregation Is All You Need for Few-Shot Segmentation, [Paper], [Code]
(arXiv 2022.01) HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning, [Paper]
(arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]

Fusion

(arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
(arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]

GAN

(arXiv 2021.02) TransGAN: Two Transformers Can Make One Strong GAN, [Paper], [Code]
(arXiv 2021.03) Generative Adversarial Transformers, [Paper], [Code]
(arXiv 2021.04) VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Combining Transformer Generators with Convolutional Discriminators, [Paper], [Code]
(arXiv 2021.06) ViT-Inception-GAN for Image Colourising, [Paper]
(arXiv 2021.06) Improved Transformer for High-Resolution GANs, [Paper]
(arXiv 2021.06) Styleformer: Transformer based Generative Adversarial Networks with Style Vector, [Paper], [Code]
(arXiv 2021.07) ViTGAN: Training GANs with Vision Transformers, [Paper]
(arXiv 2021.10) Generating Symbolic Reasoning Problems with Transformer GANs, [Paper]
(arXiv 2021.10) STransGAN: An Empirical Study on Transformer in GANs, [Paper], [Project]
(arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
(arXiv 2022.01) RFormer: Transformer-based Generative Adversarial Network for Real Fundus Image Restoration on A New Clinical Benchmark, [Paper]

Gaze

(arXiv 2021.06) Gaze Estimation using Transformer, [Paper], [Code]

Hand Gesture

(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]

HOI

(CVPR'21) HOTR: End-to-End Human-Object Interaction Detection with Transformers, [Paper], [Code]
(arXiv 2021.03) QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information, [Paper], [Code]
(arXiv 2021.03) Reformulating HOI Detection as Adaptive Set Prediction, [Paper], [Code]
(arXiv 2021.03) End-to-End Human Object Interaction Detection with HOI Transformer, [Paper], [Code]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) GTNet: Guided Transformer Network for Detecting Human-Object Interactions, [Paper], [Code]
(arXiv 2021.12) Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer, [Paper], [Code]

Hyperspectral

(arXiv 2021.07) SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Learning A 3D-CNN and Transformer Prior for Hyperspectral Image Super-Resolution, [Paper]
(arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper]
(arXiv 2022.03) Multiscale Convolutional Transformer with Center Mask Pretraining for Hyperspectral Image Classification, [Paper]
(arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]

Incremental Learning

(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]

In-painting

(ECCV'20) Learning Joint Spatial-Temporal Transformations for Video Inpainting, [Paper], [Code]
(arXiv 2021.04) Aggregated Contextual Transformations for High-Resolution Image Inpainting, [Paper], [Code]
(arXiv 2021.04) Decoupled Spatial-Temporal Transformer for Video Inpainting, [Paper], [Code]
(arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]

Instance Segmentation

(CVPR'21) End-to-End Video Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.04) ISTR: End-to-End Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) SOTR: Segmenting Objects with Transformers, [Paper], [Code]
(arXiv 2021.12) SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation, [Paper]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]

Layout

(CVPR'21) Variational Transformer Networks for Layout Generation, [Paper]
(arXiv 2021.10) The Layout Generation Algorithm of Graphic Design Based on Transformer-CVAE, [Paper]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2022.02) ATEK: Augmenting Transformers with Expert Knowledge for Indoor Layout Synthesis, [Paper]
(arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]

Lighting

(arXiv 2022.02) Spatio-Temporal Outdoor Lighting Aggregation on Image Sequences using Transformer Networks, [Paper]

Matching

(CVPR'21) LoFTR: Detector-Free Local Feature Matching with Transformers, [Paper], [Code]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices, [Paper], [Code]
(arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper], [Code]

Medical

(arXiv 2021.02) TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.02) Medical Transformer: Gated Axial-Attention for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.03) SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation, [Paper], [Code]
(arXiv 2021.03) TransBTS: Multimodal Brain Tumor Segmentation Using Transformer, [Paper], [Code]
(arXiv 2021.03) TransMed: Transformers Advance Multi-modal Medical Image Classification, [Paper]
(arXiv 2021.03) U-Net Transformer: Self and Cross Attention for Medical Image Segmentation, [Paper]
(arXiv 2021.03) UNETR: Transformers for 3D Medical Image Segmentation, [Paper]
(arXiv 2021.04) DeepProg: A Multi-modal Transformer-based End-to-end Framework for Predicting Disease Prognosis, [Paper]
(arXiv 2021.04) ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration, [Paper], [Code]
(arXiv 2021.04) Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification, [Paper]
(arXiv 2021.04) Shoulder Implant X-Ray Manufacturer Classification: Exploring with Vision Transformer, [Paper]
(arXiv 2021.04) Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, [Paper]
(arXiv 2021.04) Crossmodal Matching Transformer for Interventional in TEVAR, [Paper]
(arXiv 2021.04) GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification, [Paper]
(arXiv 2021.04) Pyramid Medical Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.05) Anatomy-Guided Parallel Bottleneck Transformer Network for Automated Evaluation of Root Canal Therapy, [Paper]
(arXiv 2021.05) Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.05) Is Image Size Important? A Robustness Comparison of Deep Learning Methods for Multi-scale Cell Image Classification Tasks: from Convolutional Neural Networks to Visual Transformers, [Paper]
(arXiv 2021.05) Unsupervised MRI Reconstruction via Zero-Shot Learned Adversarial Transformers, [Paper]
(arXiv 2021.05) Medical Image Segmentation using Squeeze-and-Expansion Transformers, [Paper], [Code]
(arXiv 2021.05) POCFormer: A Lightweight Transformer Architecture for Detection of COVID-19 Using Point of Care Ultrasound, [Paper]
(arXiv 2021.05) COTR: Convolution in Transformer Network for End to End Polyp Detection, [Paper]
(arXiv 2021.05) PTNet: A High-Resolution Infant MRI Synthesizer Based on Transformer, [Paper]
(arXiv 2021.06) TED-net: Convolution-free T2T Vision Transformer-based Encoder-decoder Dilation network for Low-dose CT Denoising, [Paper]
(arXiv 2021.06) A Multi-Branch Hybrid Transformer Network for Corneal Endothelial Cell Segmentation, [Paper]
(arXiv 2021.06) Task Transformer Network for Joint MRI Reconstruction and Super-Resolution, [Paper], [Code]
(arXiv 2021.06) DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation, [Paper]
(arXiv 2021.06) More than Encoder: Introducing Transformer Decoder to Upsample, [Paper]
(arXiv 2021.06) Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image, [Paper]
(arXiv 2021.06) MTrans: Multi-Modal Transformer for Accelerated MR Imaging, [Paper], [Code]
(arXiv 2021.06) Multi-Compound Transformer for Accurate Biomedical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) ResViT: Residual vision transformers for multi-modal medical image synthesis, [Paper]
(arXiv 2021.07) E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception, [Paper]
(arXiv 2021.07) UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation, [Paper]
(arXiv 2021.07) COVID-VIT: Classification of Covid-19 from CT chest images based on vision transformer models, [Paper]
(arXiv 2021.07) RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting, [Paper], [Code]
(arXiv 2021.07) Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation, [Paper]
(arXiv 2021.07) Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries, [Paper]
(arXiv 2021.07) EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification, [Paper]
(arXiv 2021.07) Visual Transformer with Statistical Test for COVID-19 Classification, [Paper]
(arXiv 2021.07) TransAttUnet: Multi-level Attention-guided U-Net with Transformer for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Few-Shot Domain Adaptation with Polymorphic Transformers, [Paper], [Code]
(arXiv 2021.07) TransClaw U-Net: Claw U-Net with Transformers for Medical Image Segmentation, [Paper]
(arXiv 2021.07) Surgical Instruction Generation with Transformers, [Paper]
(arXiv 2021.07) LeViT-UNet: Make Faster Encoders with Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.07) TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee Topology Preservation in Segmentations, [Paper], [Code]
(arXiv 2021.08) Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers, [Paper], [Code]
(arXiv 2021.08) Is it Time to Replace CNNs with Transformers for Medical Images, [Paper], [Code]
(arXiv 2021.09) nnFormer: Interleaved Transformer for Volumetric Segmentation, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) MISSFormer: An Effective Medical Image Segmentation Transformer, [Paper]
(arXiv 2021.09) Eformer: Edge Enhancement based Transformer for Medical Image Denoising, [Paper]
(arXiv 2021.09) Transformer-Unet: Raw Image Processing with Unet, [Paper]
(arXiv 2021.09) BiTr-Unet: a CNN-Transformer Combined Network for MRI Brain Tumor Segmentation, [Paper]
(arXiv 2021.09) GT U-Net: A U-Net Like Group Transformer Network for Tooth Root Segmentation, [Paper]
(arXiv 2021.10) Transformer Assisted Convolutional Network for Cell Instance Segmentation, [Paper]
(arXiv 2021.10) A transformer-based deep learning approach for classifying brain metastases into primary organ sites using clinical whole brain MRI images, [Paper]
(arXiv 2021.10) Boundary-aware Transformers for Skin Lesion Segmentation, [Paper], [Code]
(arXiv 2021.10) Vision Transformer based COVID-19 Detection using Chest X-rays, [Paper]
(arXiv 2021.10) Combining CNNs With Transformer for Multimodal 3D MRI Brain Tumor Segmentation With Self-Supervised Pretraining, [Paper], [Code]
(arXiv 2021.10) CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans, [Paper], [Code]
(arXiv 2021.10) COVID-19 Detection in Chest X-ray Images Using Swin-Transformer and Transformer in Transformer, [Paper], [Code]
(arXiv 2021.10) Bilateral-ViT for Robust Fovea Localization, [Paper]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) Vision Transformer for Classification of Breast Ultrasound Images, [Paper]
(arXiv 2021.11) Federated Split Vision Transformer for COVID-19 CXR Diagnosis using Task-Agnostic Training, [Paper]
(arXiv 2021.11) Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention, [Paper]
(arXiv 2021.11) Lymph Node Detection in T2 MRI with Transformers, [Paper]
(arXiv 2021.11) Mixed Transformer U-Net For Medical Image Segmentation, [Paper], [Code]
(arXiv 2021.11) Transformer for Polyp Detection, [Paper]
(arXiv 2021.11) DuDoTrans: Dual-Domain Transformer Provides More Attention for Sinogram Restoration in Sparse-View CT Reconstruction, [Paper], [Code]
(arXiv 2021.11) A Volumetric Transformer for Accurate 3D Tumor Segmentation, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis, [Paper], [Code]
(arXiv 2021.11) MIST-net: Multi-domain Integrative Swin Transformer network for Sparse-View CT Reconstruction, [Paper]
(arXiv 2021.12) MT-TransUNet: Mediating Multi-Task Tokens in Transformers for Skin Lesion Segmentation and Classification, [Paper], [Code]
(arXiv 2021.12) 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis, [Paper], [Code]
(arXiv 2021.12) Semi-Supervised Medical Image Segmentation via Cross Teaching between CNN and Transformer, [Paper], [Code]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper], [Code]
(arXiv 2021.12) MSHT: Multi-stage Hybrid Transformer for the ROSE Image Analysis of Pancreatic Cancer, [Paper], [Code]
(arXiv 2022.01) D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation, [Paper]
(arXiv 2022.01) Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images, [Paper], [Code]
(arXiv 2022.01) Swin Transformer for Fast MRI, [Paper], [Code]
(arXiv 2022.01) ViTBIS: Vision Transformer for Biomedical Image Segmentation, [Paper]
(arXiv 2022.01) Improving Across-Dataset Brain Tissue Segmentation Using Transformer, [Paper], [Code]
(arXiv 2022.01) SegTransVAE: Hybrid CNN -- Transformer with Regularization for medical image segmentation, [Paper], [Code]
(arXiv 2022.01) ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer, [Paper], [Code]
(arXiv 2022.01) Fast MRI Reconstruction: How Powerful Transformers Are, [Paper]
(arXiv 2022.01) Class-Aware Generative Adversarial Transformers for Medical Image Segmentation, [Paper]
(arXiv 2022.01) RTNet: Relation Transformer Network for Diabetic Retinopathy Multi-lesion Segmentation, [Paper]
(arXiv 2022.01) Joint Liver and Hepatic Lesion Segmentation using a Hybrid CNN with Transformer Layers, [Paper]
(arXiv 2022.01) DSFormer: A Dual-domain Self-supervised Transformer for Accelerated Multi-contrast MRI Reconstruction, [Paper]
(arXiv 2022.01) TransPPG: Two-stream Transformer for Remote Heart Rate Estimate, [Paper]
(arXiv 2022.01) TransBTSV2: Wider Instead of Deeper Transformer for Medical Image Segmentation, [Paper], [Code]
(arXiv 2022.01) Brain Cancer Survival Prediction on Treatment-naïve MRI using Deep Anchor Attention Learning with Vision Transformer, [Paper]
(arXiv 2022.02) Indication as Prior Knowledge for Multimodal Disease Classification in Chest Radiographs with Transformers, [Paper], [Code]
(arXiv 2022.02) AI can evolve without labels: self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation, [Paper], [Code]
(arXiv 2022.02) ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification, [Paper]
(arXiv 2022.02) A hybrid 2-stage vision transformer for AI-assisted 5 class pathologic diagnosis of gastric endoscopic biopsies, [Paper]
(arXiv 2022.02) TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery, [Paper]
(arXiv 2022.02) RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification, [Paper]
(arXiv 2022.02) A Transformer-based Network for Deformable Medical Image Registration, [Paper]
(arXiv 2022.03) Using Multi-scale SwinTransformer-HTC with Data augmentation in CoNIC Challenge, [Paper]
(arXiv 2022.03) CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising, [Paper], [Code]
(arXiv 2022.03) Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology, [Paper], [Code]
(arXiv 2022.03) A Multi-scale Transformer for Medical Image Segmentation: Architectures, Model Efficiency, and Benchmarks, [Paper], [Code]
(arXiv 2022.03) Tempera: Spatial Transformer Feature Pyramid Network for Cardiac MRI Segmentation, [Paper]
(arXiv 2022.03) Contextual Attention Network: Transformer Meets U-Net, [Paper], [Code]
(arXiv 2022.03) Characterizing Renal Structures with 3D Block Aggregate Transformers, [Paper]
(arXiv 2022.03) Uni4Eye: Unified 2D and 3D Self-supervised Pre-training via Masked Image Modeling Transformer for Ophthalmic Image Classification, [Paper]

Motion

(arXiv 2021.03) Single-Shot Motion Completion with Transformer, [Paper], [Code]
(arXiv 2021.03) DanceNet3D: Music Based Dance Generation with Parametric Motion Transformer, [Paper]
(arXiv 2021.03) Multimodal Motion Prediction with Stacked Transformers, [Paper], [Code]
(arXiv 2021.04) Action-Conditioned 3D Human Motion Synthesis with Transformer VAE, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]

Multi-label

(arXiv 2021.06) MlTr: Multi-label Classification with Transformer, [Paper], [Code]
(arXiv 2021.07) Query2Label: A Simple Transformer Way to Multi-Label Classification, [Paper], [Code]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper], [Code]
(arXiv 2020.11) General Multi-label Image Classification with Transformers, [Paper]
(arXiv 2020.11) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]

Multi-task/modal

(arXiv 2021.02) Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, [Paper], [Code]
(arXiv 2021.04) MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding, [Paper], [Code]
(arXiv 2021.04) Multi-Modal Fusion Transformer for End-to-End Autonomous Driving, [Paper]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper]
(arXiv 2021.04) Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning, [Paper]
(arXiv 2021.06) Scene Transformer: A Unified Multi-task Model for Behavior Prediction and Planning, [Paper]
(arXiv 2021.06) Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation, [Paper]
(arXiv 2021.06) A Transformer-based Cross-modal Fusion Model with Adversarial Training, [Paper]
(arXiv 2021.07) Attention Bottlenecks for Multimodal Fusion, [Paper]
(arXiv 2021.07) Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots, [Paper]
(arXiv 2021.07) Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions, [Paper]
(arXiv 2021.07) Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers, [Paper], [Code]
(arXiv 2021.08) StrucTexT: Structured Text Understanding with Multi-Modal Transformers, [Paper]
(arXiv 2021.08) Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations, [Paper]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper], [Code]
(arXiv 2021.10) VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing, [Paper]
(arXiv 2021.10) Detecting Dementia from Speech and Transcripts using Transformers, [Paper]
(arXiv 2021.11) MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition, [Paper]
(arXiv 2021.11) VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code1], [Code2]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper], [Code]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper], [Code]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling, [Paper]
(arXiv 2021.12) VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper], [Code]
(arXiv 2022.01) Robust Self-Supervised Audio-Visual Speech Recognition, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Code]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Code]
(arXiv 2022.01) A Pre-trained Audio-Visual Transformer for Emotion Recognition, [Paper]
(arXiv 2022.01) Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition, [Paper]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]
(arXiv 2022.02) DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]
(arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
(arXiv 2022.03) DXM-TransFuse U-net: Dual Cross-Modal Transformer Fusion U-net for Automated Nerve Identification, [Paper]
(arXiv 2022.03) LILE: Look In-Depth before Looking Elsewhere -- A Dual Attention Network using Transformers for Cross-Modal Information Retrieval in Histopathology Archives, [Paper]
(arXiv 2022.03) VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer, [Paper], [Project]

Multi-view Stereo

(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]

NAS

(CVPR'21) HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers, [Paper], [Code]
(arXiv 2021.02) Towards Accurate and Compact Architectures via Neural Architecture Transformer, [Paper]
(arXiv 2021.03) BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search, [Paper], [Code]
(arXiv 2021.06) Vision Transformer Architecture Search, [Paper], [Code]
(arXiv 2021.07) AutoFormer: Searching Transformers for Visual Recognition, [Paper], [Code]
(arXiv 2021.07) GLiT: Neural Architecture Search for Global and Local Image Transformer, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper]
(arXiv 2021.10) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]

Navigation

(ICLR'21) VTNet: Visual Transformer Network for Object Goal Navigation, [Paper]
(arXiv 2021.03) MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation, [Paper]
(arXiv 2021.04) Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, [Paper]
(arXiv 2021.05) Episodic Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.07) Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World, [Paper]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Code]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper], [Project]

OCR

(arXiv 2021.04) Handwriting Transformers, [Paper]
(arXiv 2021.05) I2C2W: Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition, [Paper]
(arXiv 2021.05) Vision Transformer for Fast and Efficient Scene Text Recognition, [Paper]
(arXiv 2021.06) DocFormer: End-to-End Transformer for Document Understanding, [Paper]
(arXiv 2021.08) A Transformer-based Math Language Model for Handwritten Math Expression Recognition, [Paper]
(arXiv 2021.09) TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models, [Paper], [Code]
(arXiv 2021.10) Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks, [Paper], [Code]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) SPTS: Single-Point Text Spotting, [Paper]
(arXiv 2022.02) Arbitrary Shape Text Detection using Transformers, [Paper]
(arXiv 2022.03) DiT: Self-supervised Pre-training for Document Image Transformer, [Paper], [Code]
(arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]

Octree

(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]

Panoptic Segmentation

(arXiv 2020.12) MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, [Paper]
(arXiv 2021.09) Panoptic SegFormer, [Paper]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.10) An End-to-End Trainable Video Panoptic Segmentation Method using Transformers, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation, [Paper], [Code]

Point Cloud

(ICRA'21) NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation, [Paper]
(arXiv 2020.12) Point Transformer, [Paper]
(arXiv 2020.12) 3D Object Detection with Pointformer, [Paper]
(arXiv 2020.12) PCT: Point Cloud Transformer, [Paper]
(arXiv 2021.03) You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module, [Paper], [Code]
(arXiv 2021.04) Group-Free 3D Object Detection via Transformers, [Paper], [Code]
(arXiv 2021.04) M3DETR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers, [Paper]
(arXiv 2021.04) Dual Transformer for Point Cloud Analysis, [Paper]
(arXiv 2021.04) Point Cloud Learning with Transformer, [Paper]
(arXiv 2021.08) SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer, [Paper], [Code]
(arXiv 2021.08) PTT: Point-Track-Transformer Module for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
(arXiv 2021.08) Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning, [Paper], [Code]
(arXiv 2021.08) PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2021.08) Improving 3D Object Detection with Channel-wise Transformer, [Paper], [Code]
(arXiv 2021.09) PQ-Transformer: Jointly Parsing 3D Objects and Layouts from Point Clouds, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) PatchFormer: A Versatile 3D Transformer Based on Patch Attention, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Adaptive Channel Encoding Transformer for Point Cloud Analysis, [Paper], [Code]
(arXiv 2021.11) Fast Point Transformer, [Paper]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
(arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
(arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
(arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper]
(arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]
(arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]
(arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]

Pose

(arXiv 2020.12) End-to-End Human Pose and Mesh Reconstruction with Transformers, [Paper]
(arXiv 2020.12) TransPose: Towards Explainable Human Pose Estimation by Transformer, [Paper]
(arXiv 2021.03) 3D Human Pose Estimation with Spatial and Temporal Transformers, [Paper], [Code]
(arXiv 2021.03) End-to-End Trainable Multi-Instance Pose Estimation with Transformers, [Paper]
(arXiv 2021.03) Lifting Transformer for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2021.03) TFPose: Direct Human Pose Estimation with Transformers, [Paper]
(arXiv 2021.04) Pose Recognition with Cascade Transformers, [Paper], [Code]
(arXiv 2021.04) TokenPose: Learning Keypoint Tokens for Human Pose Estimation, [Paper]
(arXiv 2021.04) Skeletor: Skeletal Transformers for Robust Body-Pose Estimation, [Paper]
(arXiv 2021.04) HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction, [Paper]
(arXiv 2021.07) Test-Time Personalization with a Transformer for Human Pose Estimation, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Code]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper], [Code]
(arXiv 2021.12) Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans, [Paper]
(arXiv 2021.12) End-to-End Learning of Multi-category 3D Pose and Shape Estimation, [Paper]
(arXiv 2022.01) Swin-Pose: Swin Transformer Based Human Pose Estimation, [Paper]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
(arXiv 2022.02) HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders, [Paper]

Planning

(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]

Pruning & Quantization

(arXiv 2021.04) Visual Transformer Pruning, [Paper]
(arXiv 2021.06) Post-Training Quantization for Vision Transformer, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
(arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]
(arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]

Recognition

(arXiv 2021.03) Global Self-Attention Networks for Image Recognition, [Paper]
(arXiv 2021.03) TransFG: A Transformer Architecture for Fine-grained Recognition, [Paper]
(arXiv 2021.05) Are Convolutional Neural Networks or Transformers more like human vision? [Paper]
(arXiv 2021.07) Transformer with Peak Suppression and Knowledge Guidance for Fine-grained Image Recognition, [Paper]
(arXiv 2021.07) RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition, [Paper]
(arXiv 2021.08) DPT: Deformable Patch-based Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.10) A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition, [Paper]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
(arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]

Reconstruction

(arXiv 2021.03) Multi-view 3D Reconstruction with Transformer, [Paper]
(arXiv 2021.06) THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers, [Paper]
(arXiv 2021.06) LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction, [Paper]
(arXiv 2021.07) TransformerFusion: Monocular RGB Scene Reconstruction using Transformers, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.11) Reference-based Magnetic Resonance Image Reconstruction Using Texture Transformer, [Paper]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]

Re-identification

(arXiv 2021.02) TransReID: Transformer-based Object Re-Identification, [Paper]
(arXiv 2021.03) Spatiotemporal Transformer for Video-based Person Re-identification, [Paper]
(arXiv 2021.04) AAformer: Auto-Aligned Transformer for Person Re-Identification, [Paper]
(arXiv 2021.04) A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification, [Paper]
(arXiv 2021.06) Transformer-Based Deep Image Matching for Generalizable Person Re-identification, [Paper]
(arXiv 2021.06) Diverse Part Discovery: Occluded Person Re-identification with Part-Aware Transformer, [Paper]
(arXiv 2021.06) Person Re-Identification with a Locally Aware Transformer, [Paper]
(arXiv 2021.07) Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification, [Paper], [Code]
(arXiv 2021.07) GiT: Graph Interactive Transformer for Vehicle Re-identification, [Paper]
(arXiv 2021.07) HAT: Hierarchical Aggregation Transformers for Person Re-identification, [Paper]
(arXiv 2021.09) Pose-guided Inter- and Intra-part Relational Transformer for Occluded Person Re-Identification, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2022.01) Short Range Correlation Transformer for Occluded Person Re-Identification, [Paper]
(arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]

Restoration

(arXiv 2021.06) Uformer: A General U-Shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.08) SwinIR: Image Restoration Using Swin Transformer, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.12) U2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper], [Code]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]

Retrieval

(CVPR'21) Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers, [Paper]
(arXiv 2021.01) Investigating the Vision Transformer Model for Image Retrieval Tasks, [Paper]
(arXiv 2021.02) Training Vision Transformers for Image Retrieval, [Paper]
(arXiv 2021.03) Instance-level Image Retrieval using Reranking Transformers, [Paper]
(arXiv 2021.04) Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, [Paper]
(arXiv 2021.04) Self-supervised Video Retrieval Transformer Network, [Paper]
(arXiv 2021.05) TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval, [Paper], [Code]
(arXiv 2021.06) Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features, [Paper]
(arXiv 2021.06) All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]

Robotic

(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Code]
(arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]

Salient Object Detection

(arXiv 2021.04) Transformer Transforms Salient Object Detection and Camouflaged Object Detection, [Paper]
(arXiv 2021.04) Visual Saliency Transformer, [Paper]
(arXiv 2021.04) CoSformer: Detecting Co-Salient Object with Transformers, [Paper]
(arXiv 2021.08) Unifying Global-Local Representations in Salient Object Detection with Transformer, [Paper], [Code]
(arXiv 2021.08) TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network, [Paper], [Code]
(arXiv 2021.08) Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net, [Paper]
(arXiv 2021.12) Transformer-based Network for RGB-D Saliency Detection, [Paper]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]

Scene

(arXiv 2020.12) SceneFormer: Indoor Scene Generation with Transformers, [Paper]
(arXiv 2021.05) SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation, [Paper]
(arXiv 2021.06) P2T: Pyramid Pooling Transformer for Scene Understanding, [Paper], [Code]
(arXiv 2021.07) Scenes and Surroundings: Scene Graph Generation using Relation Transformer, [Paper]
(arXiv 2021.07) Spatial-Temporal Transformer for Dynamic Scene Graph Generation, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper], [Code]

Self-supervised Learning

(arXiv 2021.03) Can Vision Transformers Learn without Natural Images? [Paper], [Code]
(arXiv 2021.04) An Empirical Study of Training Self-Supervised Visual Transformers, [Paper]
(arXiv 2021.04) SiT: Self-supervised vIsion Transformer, [Paper], [Code]
(arXiv 2021.04) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, [Paper], [Code]
(arXiv 2021.04) Emerging Properties in Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2021.05) Self-Supervised Learning with Swin Transformers, [Paper], [Code]
(arXiv 2021.06) MST: Masked Self-Supervised Transformer for Visual Representation, [Paper]
(arXiv 2021.06) Efficient Self-supervised Vision Transformers for Representation Learning, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper], [Code]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper], [Code]
(arXiv 2022.02) Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut, [Paper], [Project]

Semantic Segmentation

(arXiv 2020.12) Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, [Paper], [Code]
(arXiv 2021.01) Trans2Seg: Transparent Object Segmentation with Transformer, [Paper], [Code]
(arXiv 2021.05) Segmenter: Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.06) SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.06) Fully Transformer Networks for Semantic Image Segmentation, [Paper]
(arXiv 2021.06) Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images, [Paper]
(arXiv 2021.06) OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments, [Paper]
(arXiv 2021.07) Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images, [Paper]
(arXiv 2021.07) A Unified Efficient Pyramid Transformer for Semantic Segmentation, [Paper]
(arXiv 2021.08) Boosting Few-shot Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.08) Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer, [Paper], [Code]
(arXiv 2021.08) Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation, [Paper], [Code]
(arXiv 2021.08) Trans4Trans: Efficient Transformer for Transparent Object and Semantic Scene Segmentation in Real-World Navigation Assistance, [Paper], [Code]
(arXiv 2021.08) Evaluating Transformer based Semantic Segmentation Networks for Pathological Image Segmentation, [Paper]
(arXiv 2021.08) Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models, [Paper]
(arXiv 2021.09) Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation, [Paper]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]
(arXiv 2021.11) Dynamically pruning segformer for efficient semantic segmentation, [Paper]
(arXiv 2021.11) APANet: Adaptive Prototypes Alignment Network for Few-Shot Semantic Segmentation, [Paper]
(arXiv 2021.11) Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers, [Paper]
(arXiv 2021.11) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation, [Paper]
(arXiv 2022.01) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]
(arXiv 2022.03) Transformer-based Knowledge Distillation for Efficient Semantic Segmentation of Road-driving Scenes, [Paper], [Code]
(arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) BEVSegFormer: Bird’s Eye View Semantic Segmentation From Arbitrary Camera Rigs, [Paper]
(arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]

Shape

(WACV'21) End-to-end Lane Shape Prediction with Transformers, [Paper], [Code]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]

Super-Resolution

(CVPR'20) Learning Texture Transformer Network for Image Super-Resolution, [Paper], [Code]
(arXiv 2021.06) LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation, [Paper]
(arXiv 2021.06) Video Super-Resolution Transformer, [Paper], [Code]
(arXiv 2021.08) Light Field Image Super-Resolution with Transformers, [Paper], [Code]
(arXiv 2021.08) Efficient Transformer for Single Image Super-Resolution, [Paper]
(arXiv 2021.09) Fusformer: A Transformer-based Fusion Approach for Hyperspectral Image Super-resolution, [Paper]
(arXiv 2021.12) Implicit Transformer Network for Screen Content Image Continuous Super-Resolution, [Paper]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

Synthesis

(arXiv 2020.12) Taming Transformers for High-Resolution Image Synthesis, [Paper], [Code]
(arXiv 2021.04) Geometry-Free View Synthesis: Transformers and no 3D Priors, [Paper]
(arXiv 2021.05) High-Resolution Complex Scene Synthesis with Transformers, [Paper]
(arXiv 2021.06) The Image Local Autoregressive Transformer, [Paper]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Project]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]

Tracking

(EMNLP'19) Effective Use of Transformer Networks for Entity Tracking, [Paper], [Code]
(CVPR'21) Transformer Tracking, [Paper], [Code]
(CVPR'21) Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking, [Paper], [Code]
(arXiv 2020.12) TransTrack: Multiple-Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.01) TrackFormer: Multi-Object Tracking with Transformers, [Paper]
(arXiv 2021.03) TransCenter: Transformers with Dense Queries for Multiple-Object Tracking, [Paper]
(arXiv 2021.03) Learning Spatio-Temporal Transformer for Visual Tracking, [Paper], [Code]
(arXiv 2021.04) Multitarget Tracking with Transformers, [Paper]
(arXiv 2021.04) Spatial-Temporal Graph Transformer for Multiple Object Tracking, [Paper]
(arXiv 2021.05) MOTR: End-to-End Multiple-Object Tracking with TRansformer, [Paper], [Code]
(arXiv 2021.05) TrTr: Visual Tracking with Transformer, [Paper], [Code]
(arXiv 2021.08) HiFT: Hierarchical Feature Transformer for Aerial Tracking, [Paper], [Code]
(arXiv 2021.10) Siamese Transformer Pyramid Networks for Real-Time UAV Tracking, [Paper], [Code]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]

Traffic

(arXiv 2021.05) Novelty Detection and Analysis of Traffic Scenario Infrastructures in the Latent Space of a Vision Transformer-Based Triplet Autoencoder, [Paper]
(arXiv 2021.11) DetectorNet: Transformer-enhanced Spatial Temporal Graph Neural Network for Traffic Prediction, [Paper]
(arXiv 2021.11) ProSTformer: Pre-trained Progressive Space-Time Self-attention Model for Traffic Flow Forecasting, [Paper]
(arXiv 2022.01) SwinUNet3D -- A Hierarchical Architecture for Deep Traffic Prediction using Shifted Window Transformers, [Paper], [Code]
(arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]
(arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]

Transfer Learning

(arXiv 2021.06) Transformer-Based Source-Free Domain Adaptation, [Paper], [Code]
(arXiv 2021.08) TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]
(arXiv 2022.03) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]

Translation

(arXiv 2021.10) Tensor-to-Image: Image-to-Image Translation with Vision Transformers, [Paper]
(arXiv 2022.03) UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation, [Paper], [Code]

Texture

(arXiv 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]
(arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]

Unsupervised Learning

(arXiv 2022.02) Handcrafted Histological Transformer (H2T): Unsupervised Representation of Whole Slide Images, [Paper]

Video

(ECCV'20) Multi-modal Transformer for Video Retrieval, [Paper]
(ICLR'21) Support-set bottlenecks for video-text representation learning, [Paper]
(arXiv 2021.01) SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.02) Video Transformer Network, [Paper]
(arXiv 2021.02) Is Space-Time Attention All You Need for Video Understanding? [Paper], [Code]
(arXiv 2021.02) A Straightforward Framework For Video Retrieval Using CLIP, [Paper], [Code]
(arXiv 2021.03) Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning, [Paper]
(arXiv 2021.03) Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training, [Paper]
(arXiv 2021.03) MDMMT: Multidomain Multimodal Transformer for Video Retrieval, [Paper]
(arXiv 2021.03) An Image is Worth 16x16 Words, What is a Video Worth? [Paper]
(arXiv 2021.03) ViViT: A Video Vision Transformer, [Paper]
(arXiv 2021.04) Composable Augmentation Encoding for Video Representation Learning, [Paper]
(arXiv 2021.04) Temporal Query Networks for Fine-grained Video Understanding, [Paper], [Project]
(arXiv 2021.04) Higher Order Recurrent Space-Time Transformer, [Paper], [Code]
(arXiv 2021.04) VideoGPT: Video Generation using VQ-VAE and Transformers, [Paper], [Code]
(arXiv 2021.04) VidTr: Video Transformer Without Convolutions, [Paper]
(arXiv 2021.05) Local Frequency Domain Transformer Networks for Video Prediction, [Paper]
(arXiv 2021.05) End-to-End Video Object Detection with Spatial-Temporal Transformers, [Paper], [Code]
(arXiv 2021.06) Anticipative Video Transformer, [Paper], [Project]
(arXiv 2021.06) TransVOS: Video Object Segmentation with Transformers, [Paper]
(arXiv 2021.06) Associating Objects with Transformers for Video Object Segmentation, [Paper]
(arXiv 2021.06) Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers, [Paper]
(arXiv 2021.06) Space-time Mixing Attention for Video Transformer, [Paper]
(arXiv 2021.06) Video Instance Segmentation using Inter-Frame Communication Transformers, [Paper]
(arXiv 2021.06) Long-Short Temporal Contrastive Learning of Video Transformers, [Paper]
(arXiv 2021.06) Video Swin Transformer, [Paper], [Code]
(arXiv 2021.06) Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection, [Paper]
(arXiv 2021.07) Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation, [Paper], [Code]
(arXiv 2021.07) Generative Video Transformer: Can Objects be the Words? [Paper]
(arXiv 2021.07) Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection, [Paper]
(arXiv 2021.08) Token Shift Transformer for Video Classification, [Paper], [Code]
(arXiv 2021.08) Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering, [Paper]
(arXiv 2021.08) Video Relation Detection via Tracklet based Visual Transformer, [Paper], [Code]
(arXiv 2021.08) MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition, [Paper]
(arXiv 2021.08) ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) Hierarchical Multimodal Transformer to Summarize Videos, [Paper]
(arXiv 2021.10) Object-Region Video Transformers, [Paper], [Code]
(arXiv 2021.10) Can't Fool Me: Adversarially Robust Transformer for Video Understanding, [Paper], [Code]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) Sparse Adversarial Video Attacks with Spatial Transformations, [Paper], [Code]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) TBN-ViT: Temporal Bilateral Network with Vision Transformer for Video Scene Parsing, [Paper]
(arXiv 2021.12) Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper]
(arXiv 2021.12) A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]
(arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]
(arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]
(arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]
(arXiv 2022.03) Spatio-temporal Vision Transformer for Super-resolution Microscopy, [Paper], [Code]
(arXiv 2022.03) ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection, [Paper]

Visual Grounding

(arXiv 2021.04) TransVG: End-to-End Visual Grounding with Transformers, [Paper]
(arXiv 2021.05) Visual Grounding with Transformers, [Paper]
(arXiv 2021.06) Referring Transformer: A One-step Approach to Multi-task Visual Grounding, [Paper]
(arXiv 2021.08) Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding, [Paper]
(arXiv 2021.08) TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding, [Paper]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]
(arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]

Visual Reasoning

(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]

Visual Relationship Detection

(arXiv 2021.04) RelTransformer: Balancing the Visual Relationship Detection from Local Context, Scene and Memory, [Paper]
(arXiv 2021.05) Visual Composite Set Detection Using Part-and-Sum Transformers, [Paper]
(arXiv 2021.08) Discovering Spatial Relationships by Transformers for Domain Generalization, [Paper]

Voxel

(arXiv 2021.05) SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]

Weakly Supervised Learning

(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2022.01) CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization, [Paper]

Zero-Shot Learning

(arXiv 2021.08) Multi-Head Self-Attention via Vision Transformer for Zero-Shot Learning, [Paper]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]

Others

(CVPR'21) Transformer Interpretability Beyond Attention Visualization, [Paper], [Code]
(CVPR'21) Pre-Trained Image Processing Transformer, [Paper]
(ICCV'21) PlaneTR: Structure-Guided Transformers for 3D Plane Recovery, [Paper], [Code]
(arXiv 2021.01) Learn to Dance with AIST++: Music Conditioned 3D Dance Generation, [Paper], [Code]
(arXiv 2021.01) VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search, [Paper]
(arXiv 2021.01) Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry, [Paper]
(arXiv 2021.04) Cloth Interactive Transformer for Virtual Try-On, [Paper], [Code]
(arXiv 2021.04) Fourier Image Transformer, [Paper], [Code]
(arXiv 2021.05) Attention for Image Registration (AiR): an unsupervised Transformer approach, [Paper]
(arXiv 2021.05) IntFormer: Predicting pedestrian intention with the aid of the Transformer architecture, [Paper]
(arXiv 2021.05) CogView: Mastering Text-to-Image Generation via Transformers, [Paper]
(arXiv 2021.06) A Comparison for Anti-noise Robustness of Deep Learning Classification Methods on a Tiny Object Image Dataset: from Convolutional Neural Network to Visual Transformer and Performer, [Paper]
(arXiv 2021.06) Predicting Vehicles Trajectories in Urban Scenarios with Transformer Networks and Augmented Information, [Paper]
(arXiv 2021.06) StyTr2: Unbiased Image Style Transfer with Transformers, [Paper]
(arXiv 2021.06) Semantic Correspondence with Transformers, [Paper]
(arXiv 2021.06) Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue, [Paper]
(arXiv 2021.07) Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation, [Paper], [Code]
(arXiv 2021.07) Image Fusion Transformer, [Paper], [Code]
(arXiv 2021.07) PiSLTRc: Position-informed Sign Language Transformer with Content-aware Convolution, [Paper]
(arXiv 2021.07) PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion, [Paper]
(arXiv 2021.08) Applications of Artificial Neural Networks in Microorganism Image Analysis: A Comprehensive Review from Conventional Multilayer Perceptron to Popular Convolutional Neural Network and Potential Visual Transformer, [Paper]
(arXiv 2021.08) Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, [Paper], [Code]
(arXiv 2021.08) The Right to Talk: An Audio-Visual Transformer Approach, [Paper], [Code]
(arXiv 2021.08) Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion, [Paper], [Code]
(arXiv 2021.08) Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper], [Code]
(arXiv 2021.08) Investigating transformers in the decomposition of polygonal shapes as point collections, [Paper]
(arXiv 2021.08) Convolutional Neural Network (CNN) vs Visual Transformer (ViT) for Digital Holography, [Paper]
(arXiv 2021.08) Construction material classification on imbalanced datasets for construction monitoring automation using Vision Transformer (ViT) architecture, [Paper]
(arXiv 2021.08) Spatial Transformer Networks for Curriculum Learning, [Paper]
(arXiv 2021.09) TransforMesh: A Transformer Network for Longitudinal modeling of Anatomical Meshes, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper], [Code]
(arXiv 2021.10) ProTo: Program-Guided Transformer for Program-Guided Tasks, [Paper]
(arXiv 2021.10) TranSalNet: Visual saliency prediction using transformers, [Paper]
(arXiv 2021.10) Development and testing of an image transformer for explainable autonomous driving systems, [Paper]
(arXiv 2021.10) Leveraging redundancy in attention with Reuse Transformers, [Paper]
(arXiv 2021.10) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.10) TNTC: two-stream network with transformer-based complementarity for gait-based emotion recognition, [Paper]
(arXiv 2021.11) The self-supervised channel-spatial attention-based transformer network for automated, accurate prediction of crop nitrogen status from UAV imagery, [Paper]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) U-shape Transformer for Underwater Image Enhancement, [Paper]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper]
(arXiv 2021.11) Attention-based Dual-stream Vision Transformer for Radar Gait Recognition, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) BuildFormer: Automatic building extraction with vision transformer, [Paper]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) 3D Question Answering, [Paper]
(arXiv 2021.12) Light Field Neural Rendering, [Paper], [Project]
(arXiv 2021.12) Nonlinear Transform Source-Channel Coding for Semantic Communications, [Paper]
(arXiv 2021.12) APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) A Transformer-Based Siamese Network for Change Detection, [Paper], [Code]
(arXiv 2022.01) Learning class prototypes from Synthetic InSAR with Vision Transformers, [Paper]
(arXiv 2022.01) Swin transformers make strong contextual encoders for VHR image road extraction, [Paper]
(arXiv 2022.01) Technical Report for ICCV 2021 Challenge SSLAD-Track3B: Transformers Are Better Continual Learners, [Paper]
(arXiv 2022.01) Spectral Compressive Imaging Reconstruction Using Convolution and Spectral Contextual Transformer, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Code]
(arXiv 2022.01) A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization, [Paper], [Code]
(arXiv 2022.01) Transformer-based SAR Image Despeckling, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Dual-Tasks Siamese Transformer Framework for Building Damage Assessment, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]
(arXiv 2022.02) Spherical Transformer, [Paper]
(arXiv 2022.02) Exploiting Spatial Sparsity for Event Cameras with Visual Transformers, [Paper]
(arXiv 2022.02) Spatial Transformer K-Means, [Paper]
(arXiv 2022.02) RNGDet: Road Network Graph Detection by Transformer in Aerial Images, [Paper]
(arXiv 2022.02) Image-to-Graph Transformers for Chemical Structure Recognition, [Paper]
(arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]
(arXiv 2022.03) TableFormer: Table Structure Understanding with Transformers, [Paper]
(arXiv 2022.03) BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning, [Paper]
(arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]
(arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]

Contact & Feedback

If you have any suggestions about this project, feel free to contact me.

[e-mail: yzhangcst[at]gmail.com]
