TULIP (Token-length Upgraded CLIP) is a method for upgrading the caption length of CLIP-like models so they can understand long captions. This repository contains the code associated with the paper:
"TULIP: Token-length Upgraded CLIP"
Ivona Najdenkoska*, Mohammad M. Derakhshani*, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
* Equal core contributions
- Architectural Enhancement: TULIP injects relative positional encodings into contrastive vision-language models to handle long image captions (see the sketch after this list).
- Seamless Integration: A plug-and-play approach that works with CLIP-like models.
- Improved Performance: Achieves state-of-the-art results, surpassing baselines like CLIP and LongCLIP on long-caption understanding tasks.
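For intuition about the relative positional encodings mentioned above, below is a minimal, self-contained sketch of rotary position embeddings (RoPE) applied to the query tensor of a text-encoder attention layer. This is an illustration only, not the TULIP implementation; the helper names (build_rope_cache, apply_rope) and the tensor shapes are assumptions.

import torch

def build_rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per pair of channels, as in rotary position embeddings.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq_len, head_dim); rotate each channel pair by its
    # position-dependent angle. Applying the same rotation to queries and keys
    # makes attention scores depend only on relative token offsets.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Because the encoding is relative, the cache can be built for a longer context
# (e.g. 248 tokens) than the 77 tokens CLIP was originally trained with.
cos, sin = build_rope_cache(seq_len=248, head_dim=64)
q = torch.randn(1, 8, 248, 64)   # dummy queries: (batch, heads, seq_len, head_dim)
q_rope = apply_rope(q, cos, sin)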
- To begin, clone this repository and navigate to the tulip folder:
git clone https://github.com/ivonajdenkoska/tulip.git
cd tulip
- Our repo is based on open_clip. First, create and activate a conda environment:
conda create -n openclip python=3.10 -y
conda activate openclip
- For training TULIP and further development, please install additional packages:
cd open_clip
make install
make install-training
We use the ShareGPT4V dataset for TULIP's training. You can download the data annotations from here (use share-captioner_coco_lcs_sam_1246k_1107.json) and the images from the links below. For more information on how to organize the folders, check here.
- LAION-CC-SBU-558K: images.zip
- COCO: train2017
- WebData: images (academic usage)
- SAM: images
- GQA: images
- OCR-VQA: download script
- TextVQA: trainvalimages
- VisualGenome: part1, part2
To further prepare the dataset for the training stage, convert the JSON into a CSV file by running get_csv(args) in open_clip.data_prep.sharegpt_preprocessing.py. Afterwards, create two separate CSV files, *_train.csv and *_val.csv: use the first 1k instances of share-captioner_coco_lcs_sam_1246k_1107 for the validation split and the remaining instances for training.
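If you prefer to do the split manually, the sketch below shows one possible way to build the two CSV files. The JSON field names ("image", "conversations", "from", "value") and the output column names are assumptions; check sharegpt_preprocessing.py for the exact schema and for the column names and separator that open_clip's CSV loader expects (--csv-img-key, --csv-caption-key, --csv-separator).

import json
import pandas as pd

# Hypothetical paths -- adjust to your local setup.
JSON_PATH = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107.json"
OUT_PREFIX = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107"

with open(JSON_PATH) as f:
    records = json.load(f)

rows = []
for rec in records:
    # Assumed schema: the long caption is the model ("gpt") turn of the conversation.
    caption = next(t["value"] for t in rec["conversations"] if t["from"] == "gpt")
    rows.append({"filepath": rec["image"], "title": caption})

df = pd.DataFrame(rows)

# First 1k instances as the validation split, the rest for training.
df.iloc[:1000].to_csv(f"{OUT_PREFIX}_val.csv", index=False)
df.iloc[1000:].to_csv(f"{OUT_PREFIX}_train.csv", index=False)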
The training process of TULIP consists of two stages: relative position distillation and relative position expansion.
The first stage distills the knowledge of an existing CLIP model with fixed positional encodings into a student model with relative positional encodings (e.g., RoPE, CoPE). To perform this, you can use the following bash script to run the training on a single GPU. Additionally, make sure to set the correct paths for the dataset location (--train-data and --val-data) and the logs (--logs).
python -m training.main_distill_rope \
--dataset-type "csv" \
--batch-size 20 \
--train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
--val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
--logs "/logs/sharegpt4v/" \
--warmup 1000 \
--lr 5e-4 \
--wd 0.1 \
--epochs 30 \
--workers 8 \
--save-frequency 5 \
--model "ViT-L-14" \
--pretrained "openai" \
--precision 'amp_bf16' \
--log-every-n-steps 100 \
--accum-freq 4 \
--context-length 77 \
--student-context-length 248 \
--wandb-project-name "dense-cap-distill" \
--loss-type "cosine" \
--report-to "wandb" \
For training on multiple GPUs (e.g., launching the job on a node with 8 GPUs), you can use torchrun and the --nproc_per_node flag. Simply replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_distill_rope \
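For intuition about what this stage optimizes with --loss-type "cosine", the sketch below shows one way to phrase the objective: a frozen teacher text encoder with fixed positional encodings (77-token context) provides targets for a student encoder with relative positional encodings (248-token context). The function and variable names are illustrative and not the repository's training code.

import torch
import torch.nn.functional as F

def distillation_step(teacher, student, texts_77, texts_248):
    """One illustrative distillation step (not the actual training loop).

    teacher: frozen CLIP text encoder with fixed positional encodings (77-token context).
    student: text encoder with relative positional encodings (e.g. RoPE, 248-token context).
    texts_77 / texts_248: the same captions tokenized to the two context lengths.
    """
    with torch.no_grad():
        target = teacher(texts_77)      # (batch, embed_dim), teacher stays frozen
    pred = student(texts_248)           # (batch, embed_dim)

    # Cosine distillation loss: push student embeddings toward the teacher's.
    target = F.normalize(target, dim=-1)
    pred = F.normalize(pred, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()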
For the second stage of training, we fine-tune the student model by optimizing the CLIP loss with the new positional encodings for a single epoch. Please make sure to set the correct path to the distilled model via --student-model.
python -m training.main_context_finetune_rope \
--dataset-type "csv" \
--batch-size 4 \
--train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
--val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
--logs "/logs/sharegpt4v/" \
--warmup 1000 \
--lr 1e-5 \
--wd 0.1 \
--epochs 30 \
--workers 8 \
--save-frequency 5 \
--model "ViT-L-14" \
--pretrained "openai" \
--precision 'amp_bf16' \
--log-every-n-steps 100 \
--accum-freq 4 \
--context-length 77 \
--student-context-length 248 \
--loss-type "clip_loss" \
--student-model "/logs/sharegpt4v/checkpoints/epoch_5.pt" \
--wandb-project-name "dense-cap-ctx-extension" \
--report-to "wandb" \
Similarly to before, to train on multiple GPUs (e.g., launching the job on a node with 8 GPUs), replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_context_finetune_rope \
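As a reminder of what --loss-type "clip_loss" corresponds to in this stage, here is a generic version of the symmetric CLIP contrastive loss over a batch of image and text embeddings (a sketch only; the repository relies on open_clip's implementation).

import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # Symmetric InfoNCE loss over the (N x N) similarity matrix;
    # matching image-text pairs lie on the diagonal.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()   # (N, N)
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2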
To evaluate TULIP on cross-modal retrieval tasks, you can simply use the eval_tulip.py script. Flags such as --run_sharegpt4v determine which benchmarks are used.
python eval_tulip.py \
--model_name ViT-L-14 \
--pretrained /path_to_best_checkpoint/epoch_1.pt \
--run_sharegpt4v --run_urban1k --run_dci_long --run_coco --run_flickr \
--wandb
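The retrieval benchmarks essentially rank images and captions by cosine similarity. The snippet below is a generic recall@K computation for text-to-image retrieval with hypothetical feature tensors; it is not the eval_tulip.py code.

import torch
import torch.nn.functional as F

def recall_at_k(text_features, image_features, k=1):
    # Assumes text_features[i] describes image_features[i] (one caption per image).
    text_features = F.normalize(text_features, dim=-1)
    image_features = F.normalize(image_features, dim=-1)
    sims = text_features @ image_features.t()             # (num_texts, num_images)
    topk = sims.topk(k, dim=-1).indices                   # (num_texts, k)
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()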
We compare our TULIP-based model to several image generation baselines using Stable Diffusion XL with different text encoders: CLIP, Long-CLIP, and two T5-based ones, PIXART-Alpha and ELLA.
If you find the TULIP paper and code useful for your research and applications, please cite using this BibTeX:
@article{najdenkoska2024tulip,
title={TULIP: Token-length Upgraded CLIP},
author={Najdenkoska, Ivona and Derakhshani, Mohammad Mahdi and
Asano, Yuki M and van Noord, Nanne and Worring, Marcel and Snoek,
Cees GM},
journal={arXiv preprint arXiv:2410.10034},
year={2024}
}
This project is based on open_clip - special thanks to all the contributors. We also thank Long-CLIP for providing the pretrained models and code, and ShareGPT4V for providing the data.