This repository collects the most important papers on contrastive pretraining for vision, language, and audio. Papers are grouped by category and sorted by year and month of publication.
The following table lists papers that are directly related to CLIP or that extend it in some way, for example by improving the training process or by changing the data filtering process. Every entry uses contrastive learning as its primary pretraining objective, as opposed to models that combine contrastive learning with other pretraining objectives such as masked language modeling (MLM).
Model | Year | Month | Paper Title | Novel Development | Arxiv | Github | Open Source | License | Model Card | OpenCLIP Integration |
---|---|---|---|---|---|---|---|---|---|---|
CLIP | 2021 | 2 | Learning Transferable Visual Models From Natural Language Supervision | Simplified Contrastive Language-Image Pretraining | | | ✔️ | License | Model Card | ✔️ |
ALIGN | 2021 | 2 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Extend from captions to noisy alt-text to avoid expensive filtering and post-processing | | | ✔️ | | Model Card | ❌ |
CLOOB | 2021 | 10 | CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP | Avoid saturation of InfoNCE objective | | | ✔️ | License | | ❌ |
DeCLIP | 2021 | 10 | Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm | Data efficiency through supervision | | | ✔️ | License | | ❌ |
FILIP | 2021 | 11 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Adds token-wise maximum similarity between visual and textual features for efficient and fine-grained semantic alignment | | | ✔️ | | | ❌ |
DeFILIP | 2022 | 3 | Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision | Combines DeCLIP and FILIP | | | ✔️ | License | | ❌ |
PyramidCLIP | 2022 | 4 | PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining | Relax assumption that image and metadata are in one-to-one correspondence | | | ❌ | | | ❌ |
KLITE | 2022 | 4 | K-LITE: Learning Transferable Visual Models with External Knowledge | Augment caption text with external knowledge | | | ✔️ | License | | ❌ |
CyCLIP | 2022 | 5 | CyCLIP: Cyclic Contrastive Language-Image Pretraining | Formalize and optimize for geometric consistency in image and text spaces | | | ✔️ | License | | ❌ |
FLIP | 2022 | 12 | Scaling Language-Image Pre-training via Masking | Masking images prior to encoding improves speed-accuracy trade-off for CLIP | | | ✔️ | License | | ❌ |
OpenCLIP | 2022 | 12 | Reproducible scaling laws for contrastive language-image learning | Open-source implementation of CLIP | | | ✔️ | License | Model Card | ✔️ |
EVA-CLIP | 2023 | 3 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | Improved representation learning, optimization, and augmentation for faster training | | | ✔️ | | Model Card | ✔️ |
SigLIP | 2023 | 3 | Sigmoid Loss for Language Image Pre-Training | Sigmoid loss allows disentangling loss from batch size | | | ✔️ | License | | ✔️ |
CLIPA | 2023 | 5 | An Inverse Scaling Law for CLIP Training | Insight into relationship between encoder size and training input sequence lengths leads to more efficient training | | | ✔️ | License | | ✔️ |
MetaCLIP | 2023 | 9 | Demystifying CLIP Data | Rigorous study to reveal CLIP's data curation process | | | ✔️ | License | | ✔️ |
DFN | 2023 | 11 | Data Filtering Networks | A model trained on high-quality data can be used to filter massive online data employed to train the final CLIP model | | | ✔️ | License | Model Card | ✔️ |
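For readers new to the area, the objective shared by the entries above can be sketched in a few lines. The snippet below is a minimal, dependency-free illustration, not any paper's reference implementation: `clip_style_loss` is a symmetric InfoNCE loss over a batch of paired embeddings, and `sigmoid_pairwise_loss` is a sigmoid-based variant in the spirit of the SigLIP row. The function names and the default `temperature`, `scale`, and `bias` values are assumptions chosen for illustration; in practice these quantities are typically learned.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text in the batch."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the k-th image and k-th text form the positive pair."""
    sims = similarity_matrix(image_embs, text_embs)
    n = len(sims)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return (i2t + t2i) / 2


def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sigmoid variant: every image-text pair is an independent binary decision,
    so no softmax normalization across the batch is needed."""
    sims = similarity_matrix(image_embs, text_embs)
    n = len(sims)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            z = label * (scale * sims[i][j] + bias)
            # -log(sigmoid(z)) = log(1 + exp(-z)), computed stably
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / n
```

Both functions take two equally sized lists of embedding vectors, where index `k` in each list belongs to the same image-text pair. Correctly paired batches should score a lower loss than batches whose captions have been shuffled.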
Models that extend CLIP by adding additional pretraining objectives, such as masked language modeling (MLM).
The acronyms used in the table below are as follows:
- DR: Dataset Reinforcement
- H-ITC: Hierarchical Image-Text Contrastive
- ISS: Image Self-Supervision
- ITM: Image-Text Matching
- LM: Language Modeling
- MIM: Masked Image Modeling
- MLM: Masked Language Modeling
- MMM: Masked Multimodal Modeling
- MMR: Multi-Modal Reinforced Training
- MSD: Masked Self-Distillation
All models in this table also use CLIP-style contrastive learning as a pretraining objective.
Model | Year | Month | Paper Title | Pretraining Techniques | Arxiv | Github | Open Source | License |
---|---|---|---|---|---|---|---|---|
SLIP | 2021 | 12 | SLIP: Self-supervision meets Language-Image Pre-training | ISS | | | ✔️ | License |
FLAVA | 2021 | 12 | FLAVA: A Foundational Language And Vision Alignment Model | ITM+MMM+MIM+MLM | | | ✔️ | License |
BLIP | 2022 | 1 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | ITM+LM | | | ✔️ | License |
MaskCLIP | 2022 | 8 | MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | MLM+MSD | | | ❌ | |
ViCHA | 2022 | 8 | Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment | H-ITC+ITM+MMM+MIM+MLM | | | ✔️ | License |
RILS | 2023 | 1 | RILS: Masked Visual Reconstruction in Language Semantic Space | MIM | | | ❌ | |
MobileCLIP | 2023 | 11 | MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training | MMR | | | ✔️ | License |
This section contains collections of papers that are related to contrastive pretraining for other modalities, such as audio, video, and 3D data.
Models that use CLIP-style contrastive learning as a pretraining objective for audio.
Model | Year | Month | Paper Title | Modalities | Arxiv | Github | Open Source | License |
---|---|---|---|---|---|---|---|---|
AudioCLIP | 2021 | 6 | AudioCLIP: Extending CLIP to Image, Text and Audio | audio+image+text | | | ✔️ | License |
Wav2CLIP | 2021 | 10 | Wav2CLIP: Learning Robust Audio Representations From CLIP | audio+image+text | | | ✔️ | License |
SpeechCLIP | 2022 | 10 | SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model | speech+image+text | | | ✔️ | License |
CLAP | 2023 | 4 | Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation | audio+text | | | ✔️ | License |
CLVP | 2023 | 5 | Better speech synthesis through scaling | speech+text | | | ✔️ | License |
Models that extend CLIP to the video domain.
Model | Year | Month | Paper Title | Arxiv | Github | Open Source | License |
---|---|---|---|---|---|---|---|
CLIP4Clip | 2021 | 4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | | | ✔️ | License |
VideoCLIP | 2021 | 9 | VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding | | | ✔️ | License |
X-CLIP | 2022 | 7 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | | | ✔️ | License |
Models that extend CLIP to the 3D domain.
Contributions are welcome! Submit a pull request to add a new paper, or to update an existing paper. Please follow the format of the existing papers in the table 😄