This repository is the official implementation of the paper DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects.
DivScene is built on the AI2-THOR simulator on macOS with Python 3.9.16.
You can find the requirements in the GitHub repository of Holodeck and follow their instructions to build the environment. Specifically, the Holodeck commit we used is `156f8e10`.
We train and test our agent NatVLM with the Megatron-LM framework on Linux (CentOS).
The requirements are listed in `requirement.txt`.
In our work, we build a new scene dataset, DivScene, which contains 4,614 houses across 81 distinct scene types.
The data `DivScene.zip` is released at DivScene-DivTraj on the Hugging Face Hub. You can use the `unzip` or `7z` command to extract the house JSONs from the zip file. The training/validation/test split is given in `split_file.json`.
Notice: our houses are built with Holodeck, so you need to configure the Objaverse assets correctly.
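As a quick sanity check, the sketch below loads the split file and one extracted house, then opens it in AI2-THOR. The directory layout (`DivScene/<house_id>.json`), the split keys, and the ProcTHOR-style `Controller(scene=house)` call are assumptions for illustration, not the exact layout of the release.

```python
# A minimal loading sketch; the paths, split keys, and house file layout below are
# assumptions and may differ from the released zip.
import json
from ai2thor.controller import Controller

with open("split_file.json") as f:
    splits = json.load(f)

house_id = splits["test"][0]                     # assumed split key
with open(f"DivScene/{house_id}.json") as f:     # assumed extraction layout
    house = json.load(f)

# Holodeck houses follow the ProcTHOR house format, so the house dict can be
# passed as the scene (requires a compatible AI2-THOR build and correctly
# configured Objaverse assets).
controller = Controller(scene=house)
event = controller.step(action="RotateRight")
print(event.metadata["agent"]["position"])
```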
Build new houses:
- Gather the textual house descriptions with the code in `sample_data/gather_gpt4_prompt`.
- Feed those descriptions to Holodeck.
- Use `sample_data/regenerate_init_position.py` to search for a valid initial position of the embodied agent (a conceptual sketch follows this list).
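For reference, the core idea of the initial-position search can be sketched as follows; this is a conceptual illustration, not the released `regenerate_init_position.py`.

```python
# Conceptual sketch: query reachable positions and teleport the agent until a
# placement succeeds (not the released regenerate_init_position.py).
import random
from ai2thor.controller import Controller

def find_init_position(controller: Controller, max_tries: int = 20):
    positions = controller.step(action="GetReachablePositions").metadata["actionReturn"]
    for pos in random.sample(positions, min(max_tries, len(positions))):
        event = controller.step(action="Teleport", position=pos)
        if event.metadata["lastActionSuccess"]:
            return pos  # a valid initial position
    return None
```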
Similarly, the episodes of shortest paths we sampled are at DivScene-DivTraj on the Hugging Face Hub. There are 5 episodes per house in the training set and 4 episodes per house in the validation and test sets.
Format of episode names: `{house_id}-{traj_id}`
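For example, an episode name can be split back into its house and trajectory ids (the concrete name below is hypothetical):

```python
episode_name = "12-3"                            # hypothetical episode name
house_id, traj_id = episode_name.rsplit("-", 1)  # -> "12", "3"
```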
Sampling new episodes: use `sample_data/generate_trajectories.py` to generate more trajectories.
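The sketch below illustrates the general idea of sampling a shortest path on AI2-THOR's navigation grid with a simple BFS; it is a simplified stand-in for `generate_trajectories.py`, and the grid size and position format are assumptions.

```python
# Simplified BFS over the grid of reachable positions (not the released script);
# grid_size and the (x, z) position format are assumptions for illustration.
from collections import deque

def bfs_shortest_path(reachable_positions, start, goal, grid_size=0.25):
    def key(pos):
        return (round(pos["x"] / grid_size), round(pos["z"] / grid_size))

    nodes = {key(p) for p in reachable_positions}
    start_k, goal_k = key(start), key(goal)
    queue, visited = deque([(start_k, [start_k])]), {start_k}
    while queue:
        cur, path = queue.popleft()
        if cur == goal_k:
            return path                      # sequence of grid cells to follow
        for dx, dz in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dz)
            if nxt in nodes and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None                              # goal not reachable from start
```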
1. Prepare Data: We revise our training code based on the Megatron-LM framework and Pai-Megatron.
We provide a Large Vision-Language Model (LVLM) with the instruction for a step and ask it to generate the next step. Here, we follow the instruction data format of LLaVA. We use `convert_to_llava_format_with_pos_cot.py` to convert DivTraj trajectories into the LLaVA format, and we also list useful commands in `convert_to_llava_format.sh`.
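For reference, a converted sample roughly follows the LLaVA conversation schema sketched below; the exact prompt wording, image paths, and chain-of-thought fields produced by the converter may differ.

```python
# Rough shape of a LLaVA-style instruction sample; the ids, paths, and prompt
# text here are hypothetical.
example = {
    "id": "12-3_step_0",
    "image": "12-3/step_0.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nYou are looking for a coffee machine. "
                     "What is your next action?",
        },
        {
            "from": "gpt",
            "value": "The coffee machine is on the counter to my right, "
                     "so I should turn right. Next action: RotateRight.",
        },
    ],
}
```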
- First, use `webdataset` to compress the data with the script `agent_training/toolkits/pretrain_data_preprocessing/move_bulk_data.py`; `webdataset` speeds up data loading when training the model (see the loading sketch after this list).
- The training script is `agent_training/examples/idefics2/train_llava_instruct_webdataset_cot.sh`. We also leave some useful commands in `agent_training/examples/idefics2/run_cot_cmd.sh`.
- Please use the code in `model_checkpoints_convertor` to convert the model between the Hugging Face and Megatron formats.
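As an illustration of the webdataset loading path mentioned above, the sketch below reads image/annotation pairs from tar shards; the shard naming pattern and per-sample keys are assumptions, not the exact output of `move_bulk_data.py`.

```python
# Reading compressed shards with webdataset; the shard pattern and sample keys
# are assumptions for illustration.
import webdataset as wds

dataset = (
    wds.WebDataset("shards/train-{000000..000009}.tar")  # assumed shard pattern
    .decode("pil")                                        # decode images to PIL
    .to_tuple("jpg", "json")                              # assumed per-sample keys
)

for image, annotation in dataset:
    print(image.size, annotation)
    break
```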
License notice: we release our revisions of Pai-Megatron and Megatron-LM. If you use this code, all licenses are subject to their original releases.
We conduct inference in model-serving mode: we deploy the trained LVLM on Linux servers, then run AI2-THOR on macOS and call the API of the LVLM to carry out navigation.
- See the commands in `agent_inference/run_server.sh` to deploy the model with FastAPI.
- Run the commands in `agent_inference/run_client.sh` on macOS with AI2-THOR to test your model.
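The overall serving setup can be sketched as below; the endpoint name, request fields, and response format are assumptions for illustration, not the contents of `run_server.sh` or `run_client.sh`.

```python
# Minimal FastAPI serving sketch (server on Linux); the endpoint and payload
# fields are assumptions, not the released scripts.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StepRequest(BaseModel):
    instruction: str       # step instruction for the agent
    image_base64: str      # egocentric observation, base64-encoded

@app.post("/predict")
def predict(req: StepRequest) -> dict:
    # run the trained LVLM here and return its predicted next action
    return {"action": "MoveAhead"}

# On the macOS client, inside the AI2-THOR loop, one would post the current
# observation and apply the returned action, e.g.:
#   action = requests.post("http://<server>:8000/predict", json=payload).json()["action"]
#   controller.step(action=action)
```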
Please cite our paper if you use the data or code:
@inproceedings{wang2024divscenebenchmarkinglvlmsobject,
title={DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects},
author={Zhaowei Wang and Hongming Zhang and Tianqing Fang and Ye Tian and Yue Yang and Kaixin Ma and Xiaoman Pan and Yangqiu Song and Dong Yu},
year={2024},
eprint={2410.02730},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02730}
}
This repo is maintained by Zhaowei Wang.