Traditionally, CLIP models are trained using images and their associated alt-texts, which are often of dubious quality. Higher-quality captions are likely to improve the performance of CLIP models, but they are not available in nearly the same quantity as alt-texts. An alternative could be to generate synthetic captions using multimodal models.
To investigate this alternative, this repo contains a tiny-scale experiment showing that CLIP models trained using detailed captions generated by multimodal models (CogVLM and LLaVA 1.5) outperform models trained using the original alt-texts on a range of classification and retrieval tasks.
The ye-pop dataset is used for this experiment. It is based on LAION-POP and consists of 491,567 images. Each image is accompanied by three captions: one generated by CogVLM, one generated by LLaVA 1.5, and the original alt-text.
Identical CLIP models (ViT B/32) are trained for 4 epochs using each of the provided captions and evaluated on the DataComp benchmark suite, which comprises 38 image classification and retrieval tasks across multiple domains.
Captions | Average score (higher = better) |
---|---|
CogVLM | 0.095 |
LLaVA | 0.095 |
alt-text | 0.086 |
The evaluations show that a small CLIP model trained on CogVLM-generated captions performs better on the classification and retrieval tasks in the DataComp benchmark collection than the model trained on alt-texts. The same holds true for the LLaVA-generated captions.
For completeness, different training hyperparameters are investigated. Performance increases slightly at some higher epoch counts, but overall 4 epochs seem to suffice for passable convergence:
Epochs | Average score |
---|---|
4 | 0.095 |
10 | 0.094 |
26 | 0.098 |
104 | 0.094 |
The DALL-E 3 report suggests mixing detailed synthetic captions with alt-texts at a 95%:5% ratio, while Stable Diffusion 3 uses a 50%:50% mix. For the datasets used in this experiment, the mixing ratio did not significantly affect model performance (a minimal sketch of this mixing step follows the table):
Captions | Average score |
---|---|
CogVLM | 0.095 |
95% CogVLM + 5% alt-text | 0.095 |
50% CogVLM + 50% alt-text | 0.096 |
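One straightforward way to implement such a split is to pick a single caption per sample when building the training shards, choosing the synthetic caption with the desired probability. The following is a minimal sketch of that idea; the function and sample keys are illustrative and not taken from this repo's scripts:

```python
# Minimal sketch of per-sample caption mixing (illustrative only; not this repo's code).
import random

def mix_captions(samples, synthetic_key="cogvlm_caption", alt_key="alt_txt",
                 synthetic_ratio=0.95, seed=0):
    """Pick one caption per sample: the synthetic one with probability
    `synthetic_ratio`, the original alt-text otherwise."""
    rng = random.Random(seed)
    return [
        s[synthetic_key] if rng.random() < synthetic_ratio else s[alt_key]
        for s in samples
    ]

samples = [
    {"cogvlm_caption": "A close-up photo of a tabby cat on a wooden table ...",
     "alt_txt": "IMG_0001.jpg"},
    {"cogvlm_caption": "A wide-angle shot of a mountain lake at sunrise ...",
     "alt_txt": "lake photo"},
]
print(mix_captions(samples, synthetic_ratio=0.5))
```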
Training on CogVLM-generated captions with different CLIP architectures only marginally changes the results:
Architecture | Average score |
---|---|
ViT B/32 | 0.095 |
ViT B/16 | 0.096 |
ViT S/32 | 0.095 |
Overall, the results suggest that generating captions with a strong vision-language model such as CogVLM is a worthwhile addition for future training datasets. However, the findings are strongly limited by the small scale of the experiments conducted.
For processing the dataset, clone this repo and install the required packages in a conda environment.
conda env create -f environment.yaml
pip install -r requirements.txt
Download the ye-pop dataset:
git clone https://huggingface.co/datasets/Ejafa/ye-pop
Convert it into the img2dataset WebDataset format using the CogVLM-generated captions:
python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-cogvlm_caption
The same for the LLaVA-generated captions, if required:
python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-llava_caption --caption llava_caption
The original alt-text can also be used instead of a generated caption:
python ye_pop_to_img2dataset.py ye-pop data/processed/ye-pop-img2dataset-alt_txt --caption alt_txt
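To spot-check the converted data, the shards can be iterated with the webdataset package. This is a minimal sketch that assumes the standard img2dataset WebDataset layout (one .jpg/.txt/.json entry per sample); the glob pattern is only a guess at the shard naming:

```python
# Sanity check of the converted shards (assumes the usual img2dataset layout with
# a .jpg image and a .txt caption per sample; adjust the glob to the actual shard names).
import glob
import webdataset as wds

shards = sorted(glob.glob("data/processed/ye-pop-img2dataset-cogvlm_caption/*.tar"))
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt")

for image, caption in dataset:
    print(image.size, caption[:120])
    break  # only inspect the first sample
```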
The CLIP models are trained and evaluated using the scripts from the DataComp repo. Clone the repo and follow their installation steps. At the time of writing:
cd ..
git clone https://github.com/mlfoundations/datacomp
cd datacomp
bash create_env.sh
conda activate datacomp
Note: this creates a separate conda environment from the one created during the dataset creation phase.
Afterwards, adapt the configuration for the training runs:
cd ../datacomp
git apply ../clip-synthetic-captions/datacomp.patch
TODO: streamline
For a valid evaluation, it is necessary to remove images that are also contained in (or too similar to) the test datasets. This removes 47 images, leading to a dataset size of N=491,520.
To do this, first calculate the image similarity scores using the dataset2metadata package and the decontamination configuration in this repo:
cd ../clip-synthetic-captions
dataset2metadata --yml config/decontamination-cogvlm_caption.yaml
It may be necessary to install dataset2metadata if it was not already installed by the first step:
pip install git+https://github.com/mlfoundations/dataset2metadata@0bced76b1d45239f0932b0e5abf76935c7de6f84
The sample ids with acceptable similarity scores are now stored in `data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata`. To filter out the contaminated samples and create the clean dataset, the `resharder.py` script of the DataComp repo can be used:
mkdir -p data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards
python apply_deduplication_filter.py data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata/decontaminated.npy
python ../datacomp/resharder.py -i data/processed/ye-pop-img2dataset-cogvlm_caption -o data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards/ -s data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata/decontaminated.npy --shard-format '{:03d}.tar' --shard-stats-format '{:03d}_stats.json'
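Conceptually, `apply_deduplication_filter.py` keeps only the sample ids whose similarity to the evaluation images stays below a threshold and writes them in the subset format expected by `resharder.py`. The following is a rough sketch of that idea only; the column names, the threshold value, and the exact array format are assumptions, so the script in this repo remains authoritative:

```python
# Conceptual sketch of the decontamination filter (column names, threshold, and the
# uid format expected by resharder.py are assumptions, not this repo's actual code).
import glob
import numpy as np
import pandas as pd

metadata_dir = "data/postprocessed/ye-pop-img2dataset-cogvlm_caption/metadata"
SIMILARITY_THRESHOLD = 0.5  # placeholder; samples above it are treated as contaminated

kept = []
for path in sorted(glob.glob(f"{metadata_dir}/*.parquet")):
    df = pd.read_parquet(path)
    # Keep samples whose similarity to any evaluation image is below the threshold.
    kept.append(df.loc[df["dedup_score"] < SIMILARITY_THRESHOLD, "uid"].to_numpy())

np.save(f"{metadata_dir}/decontaminated.npy", np.concatenate(kept))
```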
The same can be done for the LLaVA-generated captions or the alt-texts by replacing all occurrences of `cogvlm_caption` with `llava_caption` or `alt_txt`, respectively.
The CLIP models can be trained using the `train.py` script provided by the DataComp repo and the datasets in the `data/postprocessed` directory created above. The following is an example for the CogVLM-generated captions:
data_dir=data/postprocessed/ye-pop-img2dataset-cogvlm_caption/shards
scale=tiny
num_gpus=4 # Replace with actually available number of GPUs
train_output_dir=output
exp_name=ye-pop-cogvlm_caption
epochs=4
torchrun --nproc_per_node $num_gpus train.py --scale $scale --data_dir $data_dir --output_dir $train_output_dir --exp_name $exp_name --workers 2 --num_checkpoints $epochs
By changing the `$data_dir` variable, it is possible to train a model on the LLaVA-generated captions or the alt-texts.
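After training, the resulting checkpoint can be loaded with open_clip for a quick zero-shot sanity check. A minimal sketch; the checkpoint path and the image file are placeholders and should be adjusted to the actual run:

```python
# Zero-shot sanity check of a trained model (the checkpoint path is a guess at where
# the DataComp/open_clip training run stores its weights; replace with the real path).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="output/ye-pop-cogvlm_caption/checkpoints/epoch_4.pt"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat", "a diagram"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each prompt matching the image
```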
The evaluation can be conducted by following the instructions of the DataComp repo. For example:
download_dir=data/evalsets
python download_evalsets.py $download_dir
python evaluate.py --train_output_dir $train_output_dir/$exp_name --data_dir $download_dir
The evaluation results on all tasks are stored in `$train_output_dir/$exp_name/eval_results.jsonl` and can be aggregated using:
python aggregate_scores.py --input $train_output_dir/$exp_name/eval_results.jsonl
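For a quick manual look at the per-task numbers, the JSONL file can also be read directly. The field names used below ("dataset", "metrics", "main_metric") are assumptions about the file's schema; `aggregate_scores.py` remains the reference:

```python
# Manual inspection of eval_results.jsonl (field names are assumptions).
import json

scores = []
with open("output/ye-pop-cogvlm_caption/eval_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        metric = record.get("metrics", {}).get("main_metric")
        if metric is not None:  # some tasks may not report a main metric
            scores.append((record.get("dataset", record.get("key", "?")), metric))

for name, value in sorted(scores):
    print(f"{name:45s} {value:.4f}")
print("average:", sum(v for _, v in scores) / len(scores))
```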