- Short survey of trends in VLMs since LLaVA 1.0 came out · (𝕏) · (huggingface) · (youtube)
- Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective, arXiv, 2410.22217, arxiv, pdf, citations: -1, Shenghao Xie, Wenqiang Zu, Mingyang Zhao, ..., Shanghang Zhang, Lei Ma
- A Survey of Hallucination in Large Visual Language Models, arXiv, 2410.15359, arxiv, pdf, citations: -1, Wei Lan, Wenyi Chen, Qingfeng Chen, ..., Huiyu Zhou, Yi Pan
- Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations, arXiv, 2411.10414, arxiv, pdf, citations: -1, Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, ..., Kartikeya Upasani, Mahesh Pasupuleti · (llama) · (llama-recipes - meta-llama)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models, arXiv, 2411.00304, arxiv, pdf, citations: -1, Wei Chow, Juncheng Li, Qifan Yu, ..., Hanwang Zhang, Qianru Sun
- 🌟 CLEAR: Character Unlearning in Textual and Visual Modalities, arXiv, 2410.18057, arxiv, pdf, citations: -1, Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, ..., Ivan Oseledets, Elena Tutubalina · (huggingface) · (multimodal_unlearning - somvy)
- Improve Vision Language Model Chain-of-thought Reasoning, arXiv, 2410.16198, arxiv, pdf, citations: -1, Ruohong Zhang, Bowen Zhang, Yanghao Li, ..., Ruoming Pang, Yiming Yang · (LLaVA-Reasoner-DPO - RifleZhang)
- Mitigating Object Hallucination via Concentric Causal Attention, arXiv, 2410.15926, arxiv, pdf, citations: -1, Yun Xing, Yiheng Li, Ivan Laptev, ..., Shijian Lu
- DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding, arXiv, 2411.14347, arxiv, pdf, citations: -1, Tianhe Ren, Yihao Chen, Qing Jiang, ..., Kent Yu, Lei Zhang
- Teach Multimodal LLMs to Comprehend Electrocardiographic Images, arXiv, 2410.19008, arxiv, pdf, citations: -1, Ruoqi Liu, Yuelin Bai, Xiang Yue, ..., Ping Zhang
- Number it: Temporal Grounding Videos like Flipping Manga, arXiv, 2411.10332, arxiv, pdf, citations: -1, Yongliang Wu, Xinting Hu, Yuyang Sun, ..., Bernt Schiele, Xu Yang · (NumPro - yongliang-wu)
- Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension, arXiv, 2411.13093, arxiv, pdf, citations: -1, Yongdong Luo, Xiawu Zheng, Xiao Yang, ..., Jiebo Luo, Rongrong Ji
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance, arXiv, 2411.02327, arxiv, pdf, citations: -1, Ruyang Liu, Haoran Tang, Haibo Liu, ..., Chen Li, Jiankun Yang · (PPLLaVA - farewellthree)
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos, arXiv, 2411.04923, arxiv, pdf, citations: -1, Shehan Munasinghe, Hanan Gani, Wenqi Zhu, ..., Fahad Shahbaz Khan, Salman Khan · (mbzuai-oryx.github)
- xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs, arXiv, 2410.16267, arxiv, pdf, citations: -1, Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, ..., Caiming Xiong, Juan Carlos Niebles
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding, arXiv, 2410.17434, arxiv, pdf, citations: -1, Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, ..., Mohamed Elhoseiny, Vikas Chandra · (vision-cair.github) · (LongVU - Vision-CAIR) · (huggingface) · (huggingface)
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI, arXiv, 2410.11623, arxiv, pdf, citations: -1, Sijie Cheng, Kechen Fang, Yangyang Yu, ..., Lei Han, Yang Liu
- OMCAT: Omni Context Aware Transformer, arXiv, 2410.12109, arxiv, pdf, citations: -1, Arushi Goel, Karan Sapra, Matthieu Le, ..., Andrew Tao, Bryan Catanzaro · (om-cat.github)
- REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents, arXiv, 2411.13552, arxiv, pdf, citations: -1, Rui Tian, Qi Dai, Jianmin Bao, ..., Zuxuan Wu, Yu-Gang Jiang · (Reducio-VAE - microsoft)
- Multimodal Autoregressive Pre-training of Large Vision Encoders, arXiv, 2411.14402, arxiv, pdf, citations: -1, Enrico Fini, Mustafa Shukor, Xiujun Li, ..., Joshua M. Susskind, Alaaeldin El-Nouby · (ml-aim - apple) · (huggingface)
- Don't Look Twice: Faster Video Transformers with Run-Length Tokenization, arXiv, 2411.05222, arxiv, pdf, citations: -1, Rohan Choudhury, Guanglei Zhu, Sihan Liu, ..., Kris M. Kitani, László Jeni · (rccchoudhury.github) · (rlt - rccchoudhury) · (mp.weixin.qq)
- 🌟 LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation, arXiv, 2411.04997, arxiv, pdf, citations: -1, Weiquan Huang, Aoqi Wu, Yifan Yang, ..., Chong Luo, Lili Qiu · (aka) · (LLM2CLIP - microsoft)
- In Search of Forgotten Domain Generalization, arXiv, 2410.08258, arxiv, pdf, citations: -1, Prasanna Mayilvahanan, Roland S. Zimmermann, Thaddäus Wiedemer, ..., Matthias Bethge, Wieland Brendel · (𝕏)
- Adaptive Length Image Tokenization via Recurrent Allocation, arXiv, 2411.02393, arxiv, pdf, citations: -1, Shivam Duggal, Phillip Isola, Antonio Torralba, ..., William T. Freeman
- LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior, arXiv, 2410.21264, arxiv, pdf, citations: -1, Hanyu Wang, Saksham Suri, Yixuan Ren, ..., Hao Chen, Abhinav Shrivastava · (hywang66.github)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss, arXiv, 2410.17243, arxiv, pdf, citations: -1, Zesen Cheng, Hang Zhang, Kehan Li, ..., Xin Li, Lidong Bing
- SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization, arXiv, 2411.11909, arxiv, pdf, citations: -1, Hongrui Jia, Chaoya Jiang, Haiyang Xu, ..., Fei Huang, Shikun Zhang
- V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization, arXiv, 2411.02712, arxiv, pdf, citations: -1, Yuxi Xie, Guanzhen Li, Xiao Xu, ..., Min-Yen Kan · (V-DPO - YuxiXie)
- MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models, arXiv, 2410.17637, arxiv, pdf, citations: -1, Ziyu Liu, Yuhang Zang, Xiaoyi Dong, ..., Dahua Lin, Jiaqi Wang
- 🌟 LLaVA-o1: Let Vision Language Models Reason Step-by-Step, arXiv, 2411.10440, arxiv, pdf, citations: -1, Guowei Xu, Peng Jin, Li Hao, ..., Lichao Sun, Li Yuan · (LLaVA-o1 - PKU-YuanGroup)
- Vision-Language Models Can Self-Improve Reasoning via Reflection, arXiv, 2411.00855, arxiv, pdf, citations: -1, Kanzhi Cheng, Yantao Li, Fangzhi Xu, ..., Hao Zhou, Yang Liu
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation, arXiv, 2411.13281, arxiv, pdf, citations: -1, Ziyang Luo, Haoning Wu, Dongxu Li, ..., Mohan Kankanhalli, Junnan Li · (videoautoarena.github)
- ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models, arXiv, 2411.10867, arxiv, pdf, citations: -1, Vipula Rawte, Sarthak Jain, Aarush Sinha, ..., Amit P. Sheth, Amitava Das
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework, arXiv, 2411.06176, arxiv, pdf, citations: -1, Yew Ken Chia, Liying Cheng, Hou Pong Chan, ..., Soujanya Poria, Lidong Bing · (multimodal-documents.github)
- DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models, arXiv, 2411.00836, arxiv, pdf, citations: -1, Chengke Zou, Xingang Guo, Rui Yang, ..., Bin Hu, Huan Zhang · (DynaMath - DynaMath) · (huggingface)
- 🌟 Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination, arXiv, 2411.03823, arxiv, pdf, citations: -1, Dingjie Song, Sicheng Lai, Shunian Chen, ..., Lichao Sun, Benyou Wang · (MM-Detect - MLLM-Data-Contamination)
- StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding, arXiv, 2411.03628, arxiv, pdf, citations: -1, Junming Lin, Zheng Fang, Chi Chen, ..., Yang Liu, Maosong Sun
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, arXiv, 2410.23266, arxiv, pdf, citations: -1, Ziyao Shangguan, Chuhan Li, Yuxuan Ding, ..., Tesca Fitzgerald, Arman Cohan · (TOMATO - yale-nlp)
- Image2Struct: Benchmarking Structure Extraction for Vision-Language Models, arXiv, 2410.22456, arxiv, pdf, citations: -1, Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, ..., Yifan Mai, Percy Liang · (crfm.stanford) · (x)
- AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models, arXiv, 2410.18325, arxiv, pdf, citations: -1, Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, ..., Joon Son Chung, Tae-Hyun Oh · (AVHBench - AVHBench)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models, arXiv, 2410.10139, arxiv, pdf, citations: -1, Peng Xia, Siwei Han, Shi Qiu, ..., Lijuan Wang, Huaxiu Yao · (mmie-bench.github) · (MMIE - Lillianwei-h) · (huggingface)
- MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks, arXiv, 2410.10563, arxiv, pdf, citations: -1, Jiacheng Chen, Tianhao Liang, Sherman Siu, ..., Xiang Yue, Wenhu Chen · (tiger-ai-lab.github)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content, arXiv, 2410.10783, arxiv, pdf, citations: -1, Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, ..., Leonid Karlinsky, Raja Giryes
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models, arXiv, 2410.10818, arxiv, pdf, citations: -1, Mu Cai, Reuben Tan, Jianrui Zhang, ..., Yong Jae Lee, Jianwei Yang
- NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples, arXiv, 2410.14669, arxiv, pdf, citations: -1, Baiqi Li, Zhiqiu Lin, Wenxuan Peng, ..., Graham Neubig, Deva Ramanan · (arxiv) · (huggingface) · (linzhiqiu.github)
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines, arXiv, 2410.12705, arxiv, pdf, citations: -1, Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, ..., Alice Oh, Chong-Wah Ngo
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks, arXiv, 2410.12381, arxiv, pdf, citations: -1, Fengji Zhang, Linquan Wu, Huiyu Bai, ..., Bei Chen, Jacky Keung
- 🌟 BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices, arXiv, 2411.10640, arxiv, pdf, citations: -1, Xudong Lu, Yinghao Chen, Cheng Chen, ..., Shuai Ren, Hongsheng Li
- Inference Optimal VLMs Need Only One Visual Token but Larger Models, arXiv, 2411.03312, arxiv, pdf, citations: -1, Kevin Y. Li, Sachin Goyal, Joao D. Semedo, ..., J. Zico Kolter · (llava-token-compression - locuslab)
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, arXiv, 2410.17247, arxiv, pdf, citations: -1, Long Xing, Qidong Huang, Xiaoyi Dong, ..., Feng Wu, Dahua Lin · (PyramidDrop - Cooperx521)
- 🌟 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation, arXiv, 2411.07975, arxiv, pdf, citations: -1, Yiyang Ma, Xingchao Liu, Xiaokang Chen, ..., Jiaying Liu, Chong Ruan · (Janus - deepseek-ai)
- 🌟 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models, arXiv, 2411.04996, arxiv, pdf, citations: -1, Weixin Liang, Lili Yu, Liang Luo, ..., Luke Zettlemoyer, Xi Victoria Lin
- VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing · (Vitron - SkyworkAI)
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation, arXiv, 2410.13848, arxiv, pdf, citations: -1, Chengyue Wu, Xiaokang Chen, Zhiyu Wu, ..., Chong Ruan, Ping Luo
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation, arXiv, 2410.13861, arxiv, pdf, citations: -1, Rongyao Fang, Chengqi Duan, Kun Wang, ..., Hongsheng Li, Xihui Liu
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions, arXiv, 2411.07461, arxiv, pdf, citations: -1, Anas Awadalla, Le Xue, Manli Shu, ..., Caiming Xiong, Ran Xu · (huggingface)
- HumanVLM: Foundation for Human-Scene Vision-Language Model, arXiv, 2411.03034, arxiv, pdf, citations: -1, Dawei Dai, Xu Long, Li Yutang, ..., Zhang Yuanhui, Shuyin Xia
- HourVideo: 1-Hour Video-Language Understanding, arXiv, 2411.04998, arxiv, pdf, citations: -1, Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, ..., Jiajun Wu, Li Fei-Fei · (hourvideo.stanford)
- Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data, arXiv, 2410.18558, arxiv, pdf, citations: -1, Shuhao Gu, Jialing Zhang, Siyuan Zhou, ..., Fangxiang Feng, Guang Liu
- Marqo-GS-10M: a multimodal, fine-grained ranking dataset built from Google Shopping data 🤗
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions, arXiv, 2410.10816, arxiv, pdf, citations: -1, Tianwei Xiong, Yuqing Wang, Daquan Zhou, ..., Jiashi Feng, Xihui Liu · (LVD-2M - SilentView) · (silentview.github)
- Harnessing Webpage UIs for Text-Rich Visual Understanding, arXiv, 2410.13824, arxiv, pdf, citations: -1, Junpeng Liu, Tianyue Ou, Yifan Song, ..., Graham Neubig, Xiang Yue
- IPLoc - SivanDoveh
- OmniVision-968M: World's Smallest Vision Language Model · (huggingface)
- neptune - google-deepmind · (storage.googleapis) · (research)