MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

English | 简体中文

The official implementation of the paper "MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding". MMFuser addresses the limitations of current MLLMs in capturing complex image details by integrating multi-layer features from ViTs in a simple yet efficient way.

📣 News

  • [Oct 16, 2024] The paper has been released on arXiv!
  • [May 30, 2024] 🔥🔥🔥 Code has been released.


Overview

Comparison with previous architectures: (a) CLIP only: only a single layer of visual features is used, such as the second-to-last layer; (b) Hybrid: multiple visual encoders are combined to enrich the image representation; (c) MMFuser (ours): a multi-layer feature fusion module that processes image features from different layers of the vision backbone (e.g., CLIP).

MMFuser is designed for multimodal multi-layer feature fusion, which enhances the visual representation of MLLMs. Features from the last few layers of CLIP are well aligned with text but lack detailed information, while features from the shallow and intermediate layers contain more image details but are poorly aligned semantically. MMFuser therefore uses the output features from the last layers of CLIP as queries ($Q$) and the features from the shallow and intermediate layers as keys ($K$) and values ($V$) to improve the semantic alignment of the earlier layers. By applying an attention mechanism, we fuse features from multiple layers to obtain image representations that are both richly detailed and text-aligned.
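To make the idea concrete, here is a minimal PyTorch sketch of this query-key-value fusion using standard multi-head attention. It is illustrative only, not the released module: the actual implementation uses Deformable Attention (see llava/model/multimodal_projector/deformable_attention), and the layer choices, hidden size, and residual combination below are assumptions.

    # Minimal sketch: deep CLIP features attend to shallow/intermediate layers.
    # Illustrative only -- the released module uses Deformable Attention, and the
    # hidden size, number of heads, and residual combination are assumptions.
    import torch
    import torch.nn as nn

    class MultiLayerFusion(nn.Module):
        def __init__(self, dim=1024, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, layer_feats):
            # layer_feats: list of per-layer ViT features, each of shape (B, N, C)
            q = layer_feats[-1]                      # deep, text-aligned features -> Q
            kv = torch.cat(layer_feats[:-1], dim=1)  # shallow/intermediate features -> K, V
            fused, _ = self.attn(q, kv, kv)          # pull image details from earlier layers
            return self.norm(q + fused)              # residual keeps the aligned semantics

    # Example: fuse three hidden states from a ViT with hidden size 1024
    feats = [torch.randn(2, 576, 1024) for _ in range(3)]
    print(MultiLayerFusion()(feats).shape)  # torch.Size([2, 576, 1024])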

Performance

Results on General Multimodal Benchmarks

Performance comparison across model sizes. (Left) Compared with 7B models, including Qwen-VL-Chat and LLaVA-1.5-7B, our model achieves SoTA on 11 of 12 benchmarks. (Right) Compared with 13B models, including InstructBLIP and LLaVA-1.5-13B, our model achieves SoTA on 10 of 12 benchmarks.

Comparison with state-of-the-art VLLMs on traditional VQA benchmarks and recent Multi-modal benchmarks. The best results are marked in bold, and the second best results are underlined.

After adding MMFuser, performance improves substantially, surpassing the original LLaVA-1.5 on multiple benchmarks. Specifically, the scores on VizWiz, MME, and MMBench are 57.4, 1585.2, and 69.9, exceeding LLaVA-1.5 by 3.8, 53.9, and 2.2 points, respectively.

Results on OCRBench

OCRBench is a comprehensive OCR benchmark containing 1,000 manually curated and corrected OCR-related VQA instructions. As shown in the table, our 7B and 13B models achieve an average improvement of 15 points over LLaVA-1.5.

Results on Region-level Benchmarks

To assess regional understanding and grounding capabilities, we evaluate MMFuser on two representative region-level tasks.

  1. Results of Region Captioning: On region captioning tasks, our model shows significant improvements. As shown in the table, the 7B version of MMFuser surpasses LLaVA-1.5 by 2.5 points on average, while the 13B version improves by 3.9 points.

  2. Results of Referring Expression Comprehension (REC): As shown in the table, our model consistently outperforms LLaVA-1.5 across all benchmarks, with a particularly notable average improvement of 5.7 points for the 7B model over LLaVA-1.5-7B.

Visual Representation Visualization

To intuitively validate the impact of MMFuser on visual features, we present the input and output feature map visualizations for four example images in the figure.

Install

  1. Clone this repository and navigate to the MMFuser folder

    git clone git@github.com:yuecao0119/MMFuser.git
    cd MMFuser
  2. Install Package

    Our project is built on LLaVA-1.5; set up the environment following the LLaVA-1.5 Install instructions.

    conda create -n MMFuser python=3.10 -y
    conda activate MMFuser
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
  3. Install additional packages

    Flash-Attention is needed.

    pip install -e ".[train]"
    pip install flash-attn==2.3.6 --no-build-isolation

    Deformable Attention from Deformable DETR is used in our project. Run the following script to compile the CUDA operators (a quick sanity check follows the commands below).

    cd llava/model/multimodal_projector/deformable_attention/ops
    sh ./make.sh
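
    Optionally, you can verify the build with a quick import check. The extension name below (MultiScaleDeformableAttention) follows the upstream Deformable DETR build and is an assumption for this repository.

    # Optional sanity check (run after make.sh succeeds).
    # The extension name follows upstream Deformable DETR and may differ here.
    import torch
    import MultiScaleDeformableAttention  # ImportError => the CUDA build failed

    print("CUDA available:", torch.cuda.is_available())
    print("Deformable attention CUDA ops imported successfully")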

Training

Our training pipeline and datasets are directly borrowed from LLaVA-v1.5. The training consists of two stages:

  • Pretraining: Train a projector on a subset of ~558K image-text pairs to connect a frozen pretrained vision encoder and a frozen LLM.
    sh scripts/mmfuser/pertrain.sh
  • Instruction Tuning: Fine-tune the entire MLLM using the LLaVA-665K multimodal instruction data.
    sh scripts/mmfuser/finetune.sh
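
Before launching either stage, make sure the training data is in place. The paths in the sketch below follow LLaVA-1.5's documented data layout and are assumptions; check the data paths in the scripts under scripts/mmfuser and adjust accordingly.

    # Quick pre-flight check (a sketch; paths follow LLaVA-1.5's data layout and
    # are assumptions -- adjust them to match the scripts under scripts/mmfuser).
    from pathlib import Path

    expected = [
        "playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json",  # ~558K pretraining pairs
        "playground/data/llava_v1_5_mix665k.json",                     # LLaVA-665K instruction mix
    ]
    for p in expected:
        print(p, "->", "found" if Path(p).exists() else "missing")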

Evaluation

We follow LLaVA-v1.5 to conduct evaluations. You should download eval.zip and unzip it to ./playground/data/eval. Please refer to Evaluation.md to prepare the data.

Then, you can run our evaluation scripts in scripts/v1_5/eval.

You can run inference with:

sh scripts/mmfuser/inference.sh

👍 Acknowledgement

  • LLaVA: The codebase we built upon.

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA and Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violation.

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{cao2024mmfuser,
  title={MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding},
  author={Cao, Yue and Liu, Yangzhou and Chen, Zhe and Shi, Guangchen and Wang, Wenhai and Zhao, Danhuai and Lu, Tong},
  journal={arXiv preprint arXiv:2410.11829},
  year={2024}
}
