🌷 TULIP: Token-length Upgraded CLIP

Overview 🌟

TULIP (Token-length Upgraded CLIP) is a method for upgrading the token length of CLIP-like models, enabling long caption understanding. This repository contains the code associated with the paper:

"TULIP: Token-length Upgraded CLIP"
Ivona Najdenkoska٭, Mohammad M. Derakhshani٭, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
٭ Equal core contributions

Highlights 🚀

  • Architectural Enhancement: TULIP injects relative positional encodings into contrastive vision-language models to handle long image captions (see the sketch after this list).
  • Seamless Integration: A plug-and-play approach that works with CLIP-like models.
  • Improved Performance: Achieves state-of-the-art results, surpassing baselines like CLIP and LongCLIP on long-caption understanding tasks.
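
As an illustration of the architectural point above, here is a minimal, generic sketch of rotary position embeddings (RoPE), one of the relative positional encodings mentioned in the paper, applied to the query/key tensors of a text-encoder attention layer. This is a sketch under common conventions, not the exact implementation used in this repository.

import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles, duplicated to cover all head_dim channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, head_dim/2)
    return torch.cat([angles, angles], dim=-1)                                  # (seq_len, head_dim)

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate queries or keys; x has shape (batch, heads, seq_len, head_dim)."""
    return x * angles.cos() + rotate_half(x) * angles.sin()

# Queries and keys are rotated by position before the dot product, so attention
# scores depend on relative offsets rather than on a fixed absolute
# positional-embedding table of length 77.
q = torch.randn(2, 8, 248, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(2, 8, 248, 64)
angles = rope_angles(seq_len=248, head_dim=64)
q, k = apply_rope(q, angles), apply_rope(k, angles)
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5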

Usage 🛠️

  1. To begin, clone this repository and navigate to the tulip folder:
git clone https://github.com/ivonajdenkoska/tulip.git
cd tulip
  2. Our repo is based on open_clip, so first create and activate a conda environment:
conda create -n openclip python=3.10 -y
conda activate openclip
  3. For training TULIP and further development, install the additional packages:
cd open_clip
make install
make install-training
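
If the installation succeeded, a quick sanity check (a hypothetical snippet, not part of the repo's scripts) is to build the same backbone used in the commands below and encode a short caption:

import torch
import open_clip

# Build the ViT-L-14 backbone with OpenAI weights (downloads the checkpoint on first use).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

tokens = tokenizer(["a photo of a field of tulips at sunrise"])  # padded/truncated to 77 tokens
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 768]) for ViT-L-14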

Data 🗂️

We use the ShareGPT4V dataset for TULIP's training. You can download the data annotations from here (use share-captioner_coco_lcs_sam_1246k_1107.json) and the images from the links below. For more information on how to organize the folders, check here.

To further prepare the dataset for the training stage, convert the JSON into a CSV file by running get_csv(args) in open_clip.data_prep.sharegpt_preprocessing.py. Afterwards, create two separate CSV files, *_train.csv and *_val.csv: use the first 1k instances of share-captioner_coco_lcs_sam_1246k_1107 for the validation split and the remaining instances for training (a standalone sketch of this conversion is shown below).
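
If you prefer to write the conversion yourself, a minimal sketch follows. The JSON field names ("image", "conversations") and the CSV columns are assumptions about the ShareGPT4V annotation format and open_clip's CSV dataset; keep the columns and separator consistent with the --csv-img-key, --csv-caption-key, and --csv-separator flags used in training.

import csv
import json

JSON_PATH = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107.json"   # adjust paths
OUT_PREFIX = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107"

with open(JSON_PATH) as f:
    entries = json.load(f)

rows = []
for entry in entries:
    # Assumed schema: each entry holds an image path and a "conversations" list
    # whose "gpt" turn carries the long caption.
    caption = next(turn["value"] for turn in entry["conversations"] if turn["from"] == "gpt")
    rows.append({"filepath": entry["image"], "title": caption.replace("\n", " ")})

# First 1k instances for validation, the rest for training.
for split, split_rows in {"val": rows[:1000], "train": rows[1000:]}.items():
    with open(f"{OUT_PREFIX}_{split}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filepath", "title"])
        writer.writeheader()
        writer.writerows(split_rows)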

Training 🏋️‍♂️

[Figure: overview of the TULIP framework]

The training process of TULIP consists of two stages: relative position distillation and relative position expansion. The first stage distills the knowledge of an existing CLIP model with fixed positional encodings into a student model with relative positional encodings (e.g., RoPE or CoPE). To perform this, you can use the following bash script to run the training on a single GPU. Additionally, make sure to set the correct paths for the dataset location (--train-data and --val-data) and the logs (--logs).

python -m training.main_distill_rope \
    --dataset-type "csv" \
    --batch-size 20 \
    --train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
    --val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
    --logs "/logs/sharegpt4v/" \
    --warmup 1000 \
    --lr 5e-4 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --save-frequency 5 \
    --model "ViT-L-14" \
    --pretrained "openai" \
    --precision 'amp_bf16' \
    --log-every-n-steps 100 \
    --accum-freq 4 \
    --context-length 77 \
    --student-context-length 248 \
    --wandb-project-name "dense-cap-distill" \
    --loss-type "cosine" \
    --report-to "wandb" \

For training on multiple GPUs (e.g., launching the job on a node of 8 GPUs), you can use torchrun with the --nproc_per_node flag. Simply replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_distill_rope \ .
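
For intuition, the objective selected with --loss-type "cosine" can be pictured as follows; this is a schematic sketch, not the repository's exact code. The frozen teacher encodes the caption within its 77-token window, the student with relative positional encodings encodes up to 248 tokens, and the loss pulls the two text embeddings together:

import torch
import torch.nn.functional as F

def cosine_distill_loss(student_features: torch.Tensor, teacher_features: torch.Tensor) -> torch.Tensor:
    """1 - cos(student, teacher), averaged over the batch; the teacher stays frozen."""
    student = F.normalize(student_features, dim=-1)
    teacher = F.normalize(teacher_features, dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# Hypothetical training step:
# with torch.no_grad():
#     teacher_feats = teacher.encode_text(tokens_77)   # caption tokenized to the 77-token window
# student_feats = student.encode_text(tokens_248)      # same caption, tokenized up to 248 tokens
# loss = cosine_distill_loss(student_feats, teacher_feats)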

For the second stage of the training, we fine-tune the student model for a single epoch by optimizing the CLIP loss with the new positional encodings. Please make sure to set the correct path to the distilled model via --student-model.

python -m training.main_context_finetune_rope \
    --dataset-type "csv" \
    --batch-size 4 \
    --train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
    --val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
    --logs "/logs/sharegpt4v/" \
    --warmup 1000 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --save-frequency 5 \
    --model "ViT-L-14" \
    --pretrained "openai" \
    --precision 'amp_bf16' \
    --log-every-n-steps 100 \
    --accum-freq 4 \
    --context-length 77 \
    --student-context-length 248 \
    --loss-type "clip_loss" \
    --student-model "/logs/sharegpt4v/checkpoints/epoch_5.pt" \
    --wandb-project-name "dense-cap-ctx-extension" \
    --report-to "wandb" \

Similarly to before, to train on multiple GPUs (e.g., launching the job on a node of 8 GPUs), replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_context_finetune_rope \ .
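
The --loss-type "clip_loss" option corresponds to the standard symmetric contrastive objective used by CLIP; a minimal reference form (not the repository's exact implementation) is:

import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor, text_features: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over the in-batch image-text similarity matrix."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()       # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))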

Evaluation 📊

To evaluate TULIP on cross-modal retrieval tasks, you can simply use the eval_tulip.py script. Flags such as --run_sharegpt4v determine which benchmarks are used.

python eval_tulip.py \
    --model_name ViT-L-14 \
    --pretrained /path_to_best_checkpoint/epoch_1.pt \
    --run_sharegpt4v --run_urban1k --run_dci_long --run_coco --run_flickr \
    --wandb
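
For reference, text-to-image recall@K on these benchmarks is typically computed as sketched below, assuming one paired caption per image; the script's exact metrics and data loading may differ.

import torch
import torch.nn.functional as F

def recall_at_k(image_features: torch.Tensor, text_features: torch.Tensor, k: int = 1) -> float:
    """Fraction of captions whose paired image (same row index) ranks in the top-k."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    sims = text_features @ image_features.t()                 # (N, N) caption-to-image similarities
    topk = sims.topk(k, dim=-1).indices                       # (N, k)
    targets = torch.arange(sims.shape[0]).unsqueeze(-1)       # (N, 1)
    return (topk == targets).any(dim=-1).float().mean().item()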

Cross-modal retrieval examples 🔍

Image-to-Text retrieval (comparison to CLIP)

[Figure: image-to-text retrieval examples, TULIP vs. CLIP]

Text-to-Image retrieval (comparison to CLIP)

[Figure: text-to-image retrieval examples, TULIP vs. CLIP]

Image generation examples 🔍

We compare our TULIP-based model to several image generation baselines using Stable Diffusion XL with different text encoders: CLIP, Long-CLIP, and two T5-based ones, PIXART-Alpha and ELLA.

[Figure: image generation comparison]

More image generation comparisons (to T5-based models)

[Figures: additional image generation comparisons to T5-based models]

Citation 📜

If you find the TULIP paper and code useful for your research and applications, please cite using this BibTeX:

@article{najdenkoska2024tulip,
  title={TULIP: Token-length Upgraded CLIP},
  author={Najdenkoska, Ivona and Derakhshani, Mohammad Mahdi and 
  Asano, Yuki M and van Noord, Nanne and Worring, Marcel and Snoek, 
  Cees GM},
  journal={arXiv preprint arXiv:2410.10034},
  year={2024}
}

Acknowledgements 🌸

This project is based on open_clip - special thanks to all the contributors. We also thank Long-CLIP for providing the pretrained models and code, and ShareGPT4V for providing the data.
