Code for our paper "Detailed Object Description with Controllable Dimensions".
Figure: Diagram of Dimension Tailor, our description refinement pipeline.
- Clone this repository and navigate to the project folder
git clone https://anonymous.4open.science/r/145501C3/
cd ObjectDescription
- Install packages
conda create -n tailor python=3.10 -y
conda activate tailor
pip install --upgrade pip # enable PEP 660 support
pip install -r requirements.txt
We used BLIP-itm and Llama 3 to score and process the descriptions.
Please download the weights of BLIP-itm from https://huggingface.co/Salesforce/blip-itm-large-coco/tree/main.
For Llama 3, please follow the official download guidance to obtain the weights of Meta-Llama-3-8B-Instruct.
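As a minimal sketch, assuming the Hugging Face CLI (`huggingface-cli`, from the `huggingface_hub` package) is installed and that both models are stored under `$HUGGINGFACE_DIR` as the inference commands expect, the weights can be fetched and their layout checked as follows. The default `$HOME/huggingface` location and the helper names `download_weights` and `check_weights` are hypothetical. `meta-llama/Meta-Llama-3-8B-Instruct` is a gated repository, so you need to accept its license on Hugging Face and run `huggingface-cli login` first.

```shell
# Placeholder default (assumption): set HUGGINGFACE_DIR yourself to override it
HUGGINGFACE_DIR="${HUGGINGFACE_DIR:-$HOME/huggingface}"

# Hypothetical helper: download both models into the folders used by
# --llama_model_dir and --blip_model_dir (requires network access)
download_weights() {
  huggingface-cli download Salesforce/blip-itm-large-coco \
    --local-dir "$HUGGINGFACE_DIR/blip-itm-large-coco"
  # Gated model: accept the license and run `huggingface-cli login` beforehand
  huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
    --local-dir "$HUGGINGFACE_DIR/Meta-Llama-3-8B-Instruct"
}

# Hypothetical helper: return non-zero and list every model folder
# under $1 that lacks a config.json
check_weights() {
  local root="$1" missing=0 name
  for name in blip-itm-large-coco Meta-Llama-3-8B-Instruct; do
    if [ ! -f "$root/$name/config.json" ]; then
      echo "missing: $root/$name/config.json" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run `download_weights` once, then `check_weights "$HUGGINGFACE_DIR"`; a non-zero exit status comes with the list of missing model folders.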
You can refine any image together with its caption using our model. For our experiments we used images from the COCO2017 dataset; the LLaVA captions were generated in advance and saved to a JSON file.
- Download the COCO2017 val dataset
Download the COCO2017 val images and place them under $DATASET_DIR/COCO/val2017, the path passed to --image_root in the commands below.
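Before running inference, a quick check can catch a wrong `--image_root` or a malformed caption file. The helper below is hypothetical (`check_inputs` is not part of the repository). It only assumes that the images are `.jpg` files directly under the image folder, as in the official val2017 split of 5,000 images, and uses Python's standard `json.tool` module to confirm that the caption file parses as JSON; it does not check the schema that `description_refiner.py` expects.

```shell
# Hypothetical helper: check_inputs IMAGE_ROOT F_DESCRIPTION
check_inputs() {
  local image_root="$1" f_description="$2" n
  # Count .jpg files directly under image_root; tr strips the padding some wc versions add
  n=$(find "$image_root" -maxdepth 1 -name '*.jpg' | wc -l | tr -d ' ')
  echo "images found: $n"
  [ "$n" -gt 0 ] || return 1
  # json.tool exits non-zero if the file is not valid JSON
  python3 -m json.tool "$f_description" > /dev/null
}
```

For example: `check_inputs "$DATASET_DIR/COCO/val2017" LLaVA/llava_detailed_answer.json`.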
- Detailed Object Descriptions Inference
CUDA_VISIBLE_DEVICES=$gpu python description_refiner.py \
--llama_model_dir $HUGGINGFACE_DIR/Meta-Llama-3-8B-Instruct \
--blip_model_dir $HUGGINGFACE_DIR/blip-itm-large-coco \
--image_root $DATASET_DIR/COCO/val2017 \
--f_description LLaVA/llava_detailed_answer.json \
--f_result results/llava_refined_detailed_answer.json \
--f_ovad_anno ovad/ovad2000.json \
--complete True \
--complete_threshold 0.3 \
--l_threshold 0.1 \
--seed 33 \
--f_obj_dim_attr_comb pred_obj_dim_attr_comb.json
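The command above assumes that `$gpu`, `$HUGGINGFACE_DIR` and `$DATASET_DIR` are already set. A minimal sketch of that setup follows. The default values are placeholders, not paths from the repository, the `require_vars` helper is hypothetical, and the `mkdir -p results` line is only a precaution in case `description_refiner.py` does not create the output folder itself.

```shell
# Placeholder defaults (assumptions): export your own values to override them
export gpu="${gpu:-0}"
export HUGGINGFACE_DIR="${HUGGINGFACE_DIR:-$HOME/huggingface}"
export DATASET_DIR="${DATASET_DIR:-$HOME/datasets}"

# Hypothetical helper: fail early if any named variable is empty or unset
require_vars() {
  for v in "$@"; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "unset: $v" >&2
      return 1
    fi
  done
}

require_vars gpu HUGGINGFACE_DIR DATASET_DIR
# Precaution: make sure the folder used by --f_result exists
mkdir -p results
```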
- Controllable Object Descriptions Inference
CUDA_VISIBLE_DEVICES=$gpu python description_refiner.py \
--llama_model_dir $HUGGINGFACE_DIR/Meta-Llama-3-8B-Instruct \
--blip_model_dir $HUGGINGFACE_DIR/blip-itm-large-coco \
--image_root $DATASET_DIR/COCO/val2017 \
--f_description LLaVA/llava_ctrl_detailed_answer.json \
--f_result results/llava_refined_ctrl_detailed_answer.json \
--f_ovad_anno ovad/ovad2000.json \
--refine_control True \
--complete True \
--complete_threshold 0.3 \
--l_threshold 0.2 \
--seed 33 \
--f_obj_dim_attr_comb pred_obj_dim_attr_comb.json
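The two commands above differ only in the input and output files, the `--refine_control True` flag, and `--l_threshold` (0.1 vs 0.2). A hypothetical wrapper that keeps the shared flags in one place could look like this; `run_refiner`, its mode names and the `DRY_RUN` switch are our own illustration, not part of the repository, and it assumes `$gpu`, `$HUGGINGFACE_DIR` and `$DATASET_DIR` are set as for the commands above.

```shell
# Hypothetical wrapper: run_refiner detailed|ctrl
# Flag values are copied from the two inference commands in this README
run_refiner() {
  local mode="$1"
  case "$mode" in
    detailed)
      set -- --f_description LLaVA/llava_detailed_answer.json \
             --f_result results/llava_refined_detailed_answer.json \
             --l_threshold 0.1 ;;
    ctrl)
      set -- --f_description LLaVA/llava_ctrl_detailed_answer.json \
             --f_result results/llava_refined_ctrl_detailed_answer.json \
             --refine_control True \
             --l_threshold 0.2 ;;
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  # With DRY_RUN set, the command is printed instead of executed
  ${DRY_RUN:+echo} env CUDA_VISIBLE_DEVICES="$gpu" python description_refiner.py \
    --llama_model_dir "$HUGGINGFACE_DIR/Meta-Llama-3-8B-Instruct" \
    --blip_model_dir "$HUGGINGFACE_DIR/blip-itm-large-coco" \
    --image_root "$DATASET_DIR/COCO/val2017" \
    --f_ovad_anno ovad/ovad2000.json \
    --complete True \
    --complete_threshold 0.3 \
    --seed 33 \
    --f_obj_dim_attr_comb pred_obj_dim_attr_comb.json \
    "$@"
}
```

`DRY_RUN=1 run_refiner ctrl` prints the full controllable-description command without launching it; `run_refiner detailed` runs the detailed-description setting.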
Here's the model we tested in our paper:
If you find this work helpful, please consider citing our paper:
@misc{wang2024detailedobjectdescriptioncontrollable,
title={Detailed Object Description with Controllable Dimensions},
author={Xinran Wang and Haiwen Zhang and Baoteng Li and Kongming Liang and Hao Sun and Zhongjiang He and Zhanyu Ma and Jun Guo},
year={2024},
eprint={2411.19106},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.19106},
}