
clip-synthetic-captions

Traditionally, CLIP models are trained on images and their associated alt-texts, which are often of dubious quality. Higher-quality captions are likely to improve the performance of CLIP models, but they are not available in the same quantity as alt-texts. An alternative is to generate synthetic captions using multimodal models.

To investigate this alternative, this repo contains a tiny-scale experiment showing that CLIP models trained using detailed captions generated by multimodal models (CogVLM and LLaVA 1.5) outperform models trained using the original alt-texts on a range of classification and retrieval tasks.

The ye-pop dataset is used for this experiment. The dataset is based on LAION-POP and consists of 491,567 images. Each image is accompanied by three captions: one generated by CogVLM, one generated by LLaVA 1.5, and the original alt-text.

Identical CLIP models (ViT B/32) are trained for 4 epochs on each of the provided caption types and evaluated on the DataComp benchmark suite, which comprises 38 image classification and retrieval tasks across multiple domains.
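For reference, the ViT B/32 architecture used here can be instantiated with the open_clip library, which the DataComp training scripts build on. This is only an illustrative sketch, not part of the training pipeline; the models in this experiment are trained from scratch, so no pretrained weights are loaded.

import open_clip

# Instantiate the ViT-B/32 CLIP architecture from scratch (no pretrained weights),
# matching the model size used in this experiment.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")  # roughly 151M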

Results

| Captions | Average score (higher = better) |
|----------|---------------------------------|
| CogVLM   | 0.095 |
| LLaVA    | 0.095 |
| alt-text | 0.086 |

The evaluations show that a small CLIP model trained using CogVLM captions performs better at the classification and retrieval tasks in the DataComp benchmark collection than the model trained on alt-texts. The same holds true for LLaVA-generated captions.

For completeness, different training hyperparameters are investigated. Performance increases slightly at some higher epoch counts, but overall 4 epochs seem to suffice for passable convergence:

| Epochs | Average score |
|--------|---------------|
| 4   | 0.095 |
| 10  | 0.094 |
| 26  | 0.098 |
| 104 | 0.094 |

The DALL-E 3 report suggests mixing detailed synthetic captions and alt-texts at a 95%:5% ratio, while Stable Diffusion 3 uses a 50%:50% split. For the dataset used in this experiment, this did not significantly affect model performance:

| Captions | Average score |
|----------|---------------|
| CogVLM | 0.095 |
| 95% CogVLM + 5% alt-text | 0.095 |
| 50% CogVLM + 50% alt-text | 0.096 |
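One way to implement such a mix is to select the caption source per sample when building the training shards. The following is a minimal sketch under assumed names, not the conversion code used in this repo: it presumes per-image records exposing cogvlm_caption and alt_txt fields and a hypothetical select_caption helper that would be wired into the conversion step.

import random

def select_caption(record, synthetic_ratio=0.95, rng=random.Random(0)):
    # With probability synthetic_ratio use the detailed CogVLM caption,
    # otherwise fall back to the original alt-text.
    if rng.random() < synthetic_ratio:
        return record["cogvlm_caption"]
    return record["alt_txt"]

# Example: a 95%:5% CogVLM/alt-text mix for a single record.
record = {"cogvlm_caption": "A detailed description of the scene ...", "alt_txt": "photo"}
print(select_caption(record, synthetic_ratio=0.95))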

Training on CogVLM-generated captions with different CLIP architectures changes the results only marginally:

| Architecture | Average score |
|--------------|---------------|
| ViT B/32 | 0.095 |
| ViT B/16 | 0.096 |
| ViT S/32 | 0.095 |

Overall, the results suggest that generating captions using a strong vision-language model such as CogVLM is a worthwhile addition for future training datasets. However, the findings are strongly limited by the small scale of the conducted experiments.

Reproduce

Dataset

Setup

For processing the dataset, clone this repo and install the required packages in a conda environment.

conda env create -f environment.yaml
pip install -r requirements.txt

Download

Download the ye-pop dataset:

git clone https://huggingface.co/datasets/Ejafa/ye-pop

Convert it into the img2dataset WebDataset format using the CogVLM-generated captions:

python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-cogvlm_caption

The same can be done for the LLaVA-generated captions, if required:

python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-llava_caption --caption llava_caption

Alternatively, the original alt-text can be used instead of a generated caption:

python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-alt_txt --caption alt_txt
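To sanity-check the converted shards, the image-caption pairs can be iterated with the webdataset package. This is an optional sketch that assumes img2dataset-style sample keys (jpg for the image, txt for the caption) and globs whatever .tar shards the conversion produced:

import glob
import webdataset as wds

# Iterate over the converted shards and print the first image size and caption.
shards = sorted(glob.glob("data/processed/ye-pop-img2dataset-cogvlm_caption/*.tar"))
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt")
for image, caption in dataset:
    print(image.size, caption[:100])
    break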

Experiment

Setup

The CLIP models are trained and evaluated using the scripts from the DataComp repo. Clone the repo and follow their installation steps. At the time of writing:

cd ..
git clone https://github.com/mlfoundations/datacomp
cd datacomp
bash create_env.sh
conda activate datacomp

Note: this creates a separate conda environment from the one created during the dataset creation phase.

Afterwards, adapt the configuration for the training runs:

cd ../datacomp
git apply ../clip-synthetic-captions/datacomp.patch

Decontamination

TODO: streamline

For a valid evaluation, it is necessary to remove images that are also contained in (or too similar to) the test datasets. This removes 47 images, leading to a dataset size of N=491,520.

To do this, first calculate the image similarity scores using the dataset2metadata package and the decontamination configuration in this repo:

cd ../clip-synthetic-captions
dataset2metadata --yml config/decontamination-cogvlm_caption.yaml

It may be necessary to install dataset2metadata if it was not already installed by the first step:

pip install git+https://github.com/mlfoundations/dataset2metadata@0bced76b1d45239f0932b0e5abf76935c7de6f84

The sample ids with acceptable similarity scores are now in data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata. To filter out the contaminated samples and create the clean dataset, use the apply_deduplication_filter.py script from this repo followed by the resharder.py script from the DataComp repo:

mkdir -p data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards
python apply_deduplication_filter.py data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata/decontaminated.npy
python ../datacomp/resharder.py -i data/processed/ye-pop-img2dataset-cogvlm_caption -o data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards/ -s data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata/decontaminated.npy --shard-format '{:03d}.tar' --shard-stats-format '{:03d}_stats.json'

The same can be done for LLaVA-generated captions or alt-texts by replacing all occurrences of cogvlm_caption with llava_caption or alt_txt, respectively.
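As a quick check, the number of retained samples can be read from the generated subset file. A small sketch, assuming decontaminated.npy follows the DataComp convention of one entry per kept sample uid:

import numpy as np

# Count the sample uids that survived decontamination; expected 491,520 (491,567 - 47).
uids = np.load("data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata/decontaminated.npy")
print(len(uids))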

Run

The CLIP models can be trained using the train.py script provided by the DataComp repo and the datasets in the data/postprocessed directory created above. The following is an example for CogVLM-generated captions:

data_dir=data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards
scale=tiny
num_gpus=4  # Replace with the number of available GPUs
train_output_dir=output
exp_name=ye-pop-cogvlm_caption
epochs=4
torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $train_output_dir --exp_name $exp_name --workers 2 --num_checkpoints $epochs

By changing the $data_dir variable, it is possible to train a model on the LLaVA-generated captions or alt-texts.

Evaluation

The evaluation can be conducted by following the instructions of the DataComp repo. For example:

download_dir=data/evalsets
python download_evalsets.py $download_dir
python evaluate.py --train_output_dir $train_output_dir/$exp_name --data_dir $download_dir

The evaluation results on all tasks are stored in $train_output_dir/$exp_name/eval_results.jsonl and can be aggregated using:

python aggregate_scores.py --input $train_output_dir/$exp_name/eval_results.jsonl
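For reference, the aggregation essentially averages the per-task main metric. A rough equivalent is sketched below, assuming each line of eval_results.jsonl follows the DataComp format with a metrics dict containing a main_metric entry:

import json

# Average the main metric over all evaluated tasks in eval_results.jsonl.
scores = []
with open("output/ye-pop-cogvlm_caption/eval_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        metric = record.get("metrics", {}).get("main_metric")
        if metric is not None:
            scores.append(metric)

if scores:
    print(f"average score over {len(scores)} tasks: {sum(scores) / len(scores):.3f}")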
