🌷 TULIP: Token-length Upgraded CLIP

Overview 🌟

TULIP (Token-length Upgraded CLIP) is a method for upgrading the token length of CLIP-like models, enabling long caption understanding. This repository contains the code associated with the paper:

"TULIP: Token-length Upgraded CLIP"
Ivona Najdenkoska٭, Mohammad M. Derakhshani٭, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek
٭ Equal core contributions

Highlights 🚀

  • Architectural Enhancement: TULIP injects relative positional encodings into contrastive vision-language models to handle long image captions (see the sketch after this list).
  • Seamless Integration: A plug-and-play approach that works with CLIP-like models.
  • Improved Performance: Achieves state-of-the-art results, surpassing baselines like CLIP and LongCLIP on long-caption understanding tasks.
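
As an illustration of the architectural point above, here is a minimal, generic sketch of rotary position embeddings (RoPE), one of the relative positional encodings mentioned in the paper, applied to the query/key tensors of a text-encoder attention layer. This is a sketch under common conventions, not the exact implementation used in this repository.

import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles, duplicated to cover all head_dim channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq_len, head_dim/2)
    return torch.cat([angles, angles], dim=-1)                                  # (seq_len, head_dim)

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate queries or keys; x has shape (batch, heads, seq_len, head_dim)."""
    return x * angles.cos() + rotate_half(x) * angles.sin()

# Queries and keys are rotated by position before the dot product, so attention
# scores depend on relative offsets rather than on a fixed absolute
# positional-embedding table of length 77.
q = torch.randn(2, 8, 248, 64)   # (batch, heads, tokens, head_dim)
k = torch.randn(2, 8, 248, 64)
angles = rope_angles(seq_len=248, head_dim=64)
q, k = apply_rope(q, angles), apply_rope(k, angles)
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5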

Usage 🛠️

  1. To begin, clone this repository and navigate to the tulip folder:
git clone https://github.com/ivonajdenkoska/tulip.git
cd tulip
  2. Our repo is based on open_clip, so first create and activate a conda environment:
conda create -n openclip python=3.10 -y
conda activate openclip
  3. For training TULIP and further development, install the additional packages:
cd open_clip
make install
make install-training
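
If the installation succeeded, a quick sanity check (a hypothetical snippet, not part of the repo's scripts) is to build the same backbone used in the commands below and encode a short caption:

import torch
import open_clip

# Build the ViT-L-14 backbone with OpenAI weights (downloads the checkpoint on first use).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

tokens = tokenizer(["a photo of a field of tulips at sunrise"])  # padded/truncated to 77 tokens
with torch.no_grad():
    text_features = model.encode_text(tokens)
print(text_features.shape)  # torch.Size([1, 768]) for ViT-L-14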

Data 🗂️

We use the ShareGPT4V dataset for TULIP's training. You can download the data annotations from here (use share-captioner_coco_lcs_sam_1246k_1107.json) and the images from the links below. For more information on how to organize the folders, check here.

To further prepare the dataset for the training stage, convert the JSON into a CSV file by running get_csv(args) in open_clip.data_prep.sharegpt_preprocessing.py. Afterwards, create two separate CSV files, *_train.csv and *_val.csv: use the first 1k instances of share-captioner_coco_lcs_sam_1246k_1107 for the validation split and the remaining instances for training (a standalone sketch of this conversion is shown below).
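
If you prefer to write the conversion yourself, a minimal sketch follows. The JSON field names ("image", "conversations") and the CSV columns are assumptions about the ShareGPT4V annotation format and open_clip's CSV dataset; keep the columns and separator consistent with the --csv-img-key, --csv-caption-key, and --csv-separator flags used in training.

import csv
import json

JSON_PATH = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107.json"   # adjust paths
OUT_PREFIX = "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107"

with open(JSON_PATH) as f:
    entries = json.load(f)

rows = []
for entry in entries:
    # Assumed schema: each entry holds an image path and a "conversations" list
    # whose "gpt" turn carries the long caption.
    caption = next(turn["value"] for turn in entry["conversations"] if turn["from"] == "gpt")
    rows.append({"filepath": entry["image"], "title": caption.replace("\n", " ")})

# First 1k instances for validation, the rest for training.
for split, split_rows in {"val": rows[:1000], "train": rows[1000:]}.items():
    with open(f"{OUT_PREFIX}_{split}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filepath", "title"])
        writer.writeheader()
        writer.writerows(split_rows)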

Training 🏋️‍♂️

[Figure: overview of the TULIP framework]

The training process of TULIP consists of two stages: relative position distillation and relative position expansion. The first stage distills the knowledge of an existing CLIP model with fixed positional encodings into a student model with relative positional encodings (e.g., RoPE or CoPE). To perform this, you can use the following bash script to run the training on a single GPU. Additionally, make sure to set the correct paths for the dataset location (--train-data and --val-data) and the logs (--logs).

python -m training.main_distill_rope \
    --dataset-type "csv" \
    --batch-size 20 \
    --train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
    --val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
    --logs "/logs/sharegpt4v/" \
    --warmup 1000 \
    --lr 5e-4 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --save-frequency 5 \
    --model "ViT-L-14" \
    --pretrained "openai" \
    --precision 'amp_bf16' \
    --log-every-n-steps 100 \
    --accum-freq 4 \
    --context-length 77 \
    --student-context-length 248 \
    --wandb-project-name "dense-cap-distill" \
    --loss-type "cosine" \
    --report-to "wandb" \

For training on multiple GPUs (e.g., launching the job on a node of 8 GPUs), you can use torchrun with the --nproc_per_node flag. Simply replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_distill_rope \ .
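
For intuition, the objective selected with --loss-type "cosine" can be pictured as follows; this is a schematic sketch, not the repository's exact code. The frozen teacher encodes the caption within its 77-token window, the student with relative positional encodings encodes up to 248 tokens, and the loss pulls the two text embeddings together:

import torch
import torch.nn.functional as F

def cosine_distill_loss(student_features: torch.Tensor, teacher_features: torch.Tensor) -> torch.Tensor:
    """1 - cos(student, teacher), averaged over the batch; the teacher stays frozen."""
    student = F.normalize(student_features, dim=-1)
    teacher = F.normalize(teacher_features, dim=-1)
    return (1.0 - (student * teacher).sum(dim=-1)).mean()

# Hypothetical training step:
# with torch.no_grad():
#     teacher_feats = teacher.encode_text(tokens_77)   # caption tokenized to the 77-token window
# student_feats = student.encode_text(tokens_248)      # same caption, tokenized up to 248 tokens
# loss = cosine_distill_loss(student_feats, teacher_feats)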

For the second stage of the training, we fine-tune the student model for a single epoch by optimizing the CLIP loss with the new positional encodings. Please make sure to set the correct path to the distilled model via --student-model.

python -m training.main_context_finetune_rope \
    --dataset-type "csv" \
    --batch-size 4 \
    --train-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_train.csv" \
    --val-data "/ShareGPT4V/data/share-captioner_coco_lcs_sam_1246k_1107_val.csv" \
    --logs "/logs/sharegpt4v/" \
    --warmup 1000 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 30 \
    --workers 8 \
    --save-frequency 5 \
    --model "ViT-L-14" \
    --pretrained "openai" \
    --precision 'amp_bf16' \
    --log-every-n-steps 100 \
    --accum-freq 4 \
    --context-length 77 \
    --student-context-length 248 \
    --loss-type "clip_loss" \
    --student-model "/logs/sharegpt4v/checkpoints/epoch_5.pt" \
    --wandb-project-name "dense-cap-ctx-extension" \
    --report-to "wandb" \

Similarly to before, to train on multiple GPUs (e.g., launching the job on a node of 8 GPUs), replace the first line with the following: TORCH_CUDNN_V8_API_ENABLED=1 torchrun --nproc_per_node 8 -m training.main_context_finetune_rope \ .
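
The --loss-type "clip_loss" option corresponds to the standard symmetric contrastive objective used by CLIP; a minimal reference form (not the repository's exact implementation) is:

import torch
import torch.nn.functional as F

def clip_loss(image_features: torch.Tensor, text_features: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over the in-batch image-text similarity matrix."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()       # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))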

Evaluation 📊

To evaluate TULIP on cross-modal retrieval tasks, you can simply use the eval_tulip.py script. Flags such as --run_sharegpt4v determine which benchmarks are used.

python eval_tulip.py \
    --model_name ViT-L-14 \
    --pretrained /path_to_best_checkpoint/epoch_1.pt \
    --run_sharegpt4v --run_urban1k --run_dci_long --run_coco --run_flickr \
    --wandb
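
For reference, text-to-image recall@K on these benchmarks is typically computed as sketched below, assuming one paired caption per image; the script's exact metrics and data loading may differ.

import torch
import torch.nn.functional as F

def recall_at_k(image_features: torch.Tensor, text_features: torch.Tensor, k: int = 1) -> float:
    """Fraction of captions whose paired image (same row index) ranks in the top-k."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    sims = text_features @ image_features.t()                 # (N, N) caption-to-image similarities
    topk = sims.topk(k, dim=-1).indices                       # (N, k)
    targets = torch.arange(sims.shape[0]).unsqueeze(-1)       # (N, 1)
    return (topk == targets).any(dim=-1).float().mean().item()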

Cross-modal retrieval examples 🔍

Image-to-Text retrieval (comparison to CLIP)

[Figure: image-to-text retrieval examples, TULIP vs. CLIP]

Text-to-Image retrieval (comparison to CLIP)

[Figure: text-to-image retrieval examples, TULIP vs. CLIP]

Image generation examples 🔍

We compare our TULIP-based model to several image generation baselines using Stable Diffusion XL with different text encoders: CLIP, Long-CLIP, and two T5-based ones, PIXART-Alpha and ELLA.

[Figure: image generation comparison]

More image generation comparisons (to T5-based models)

[Figures: additional image generation comparisons to T5-based models]

Citation 📜

If you find the TULIP paper and code useful for your research and applications, please cite using this BibTeX:

@article{najdenkoska2024tulip,
  title={TULIP: Token-length Upgraded CLIP},
  author={Najdenkoska, Ivona and Derakhshani, Mohammad Mahdi and 
  Asano, Yuki M and van Noord, Nanne and Worring, Marcel and Snoek, 
  Cees GM},
  journal={arXiv preprint arXiv:2410.10034},
  year={2024}
}

Acknowledgements 🌸

This project is based on open_clip - special thanks to all the contributors. We also thank Long-CLIP for providing the pretrained models and code, and ShareGPT4V for providing the data.
