1 |
COOT |
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning |
paper code |
NeurIPS 2020 |
University of Freiburg |
1 Nov 2020 |
2 |
MMT |
Multi-modal Transformer for Video Retrieval |
paper code |
ECCV 2020 |
Inria & Google |
21 Jul 2020 |
3 |
HiT |
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval |
paper |
arXiv |
Peking University |
28 Mar 2021 |
4 |
ClipBERT |
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling |
paper code |
CVPR 2021 |
UNC Chapel Hill |
11 Feb 2021 |
5 |
SVRTN |
Self-supervised Video Retrieval Transformer Network |
paper |
arXiv |
Alibaba DAMO Academy |
16 Apr 2021 |
6 |
VATT |
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text |
paper |
arXiv |
Google |
22 Apr 2021 |
7 |
Frozen in Time |
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval |
paper code |
arXiv |
University of Oxford |
1 Apr 2021 |
8 |
CLIP4Clip |
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
paper code |
arXiv |
Southwest Jiaotong University |
18 Apr 2021 |
9 |
CLIP2Video |
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP |
paper code |
arXiv |
PCG, Tencent |
21 Jun 2021 |
10 |
T2VLAD |
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval |
paper |
CVPR 2021 |
Baidu |
20 Apr 2021 |
11 |
- |
On Semantic Similarity in Video Retrieval |
paper code |
CVPR 2021 |
University of Bristol |
21 Jun 2021 |
12 |
VLM |
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding |
paper |
arXiv |
Facebook AI |
20 May 2021 |
13 |
VideoBERT |
VideoBERT: A Joint Model for Video and Language Representation Learning |
paper |
CVPR 2019 |
Google Research |
11 Sep 2019 |
14 |
CBT |
Learning Video Representations using Contrastive Bidirectional Transformer |
paper |
arXiv |
Google Research |
27 Sep 2019 |
15 |
ActBERT |
ActBERT: Learning Global-Local Video-Text Representations |
paper |
CVPR 2020 |
Baidu Research |
14 Nov 2020 |
16 |
HERO |
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training |
paper code |
EMNLP 2020 |
Microsoft Dynamics 365 AI Research |
29 Sep 2020 |
17 |
UniVL |
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation |
paper code |
arXiv |
MSRA |
15 Sep 2021 |
18 |
G-TAD |
Boundary-sensitive Pre-training for Temporal Localization in Videos |
paper |
ICCV 2021 |
Samsung AI Centre Cambridge, UK |
26 Mar 2021 |
19 |
UniVL |
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation |
paper code |
arXiv |
Microsoft |
15 Feb 2020 |
20 |
ActBERT |
ActBERT: Learning Global-Local Video-Text Representations |
paper |
CVPR 2020 |
Baidu Research |
14 Nov 2020 |
21 |
HERO |
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training |
paper code |
EMNLP 2020 |
Microsoft Dynamics 365 AI Research |
1 May 2020 |
22 |
MM-ViT |
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition |
paper |
arXiv |
Oppo |
20 Aug 2021 |
23 |
TPT |
Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering |
paper |
arXiv |
Chinese Academy of Sciences |
10 Sep 2021 |
24 |
ActionCLIP |
ActionCLIP: A New Paradigm for Video Action Recognition |
paper code |
arXiv |
Zhejiang University |
17 Sep 2021 |
25 |
justAsk |
Just Ask: Learning to Answer Questions from Millions of Narrated Videos |
paper code |
ICCV 2021 |
Inria Paris |
12 Aug 2021 |
26 |
- |
A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer |
paper code |
arXiv |
Zhejiang University |
9 Dec 2021 |
27 |
SwinBERT |
SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning |
paper |
arXiv |
Microsoft |
25 Nov 2021 |
28 |
VIOLET |
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling |
paper code |
arXiv |
UC Santa Barbara |
24 Nov 2021 |
29 |
FashionViL |
FashionViL: Fashion-Focused Vision-and-Language Representation Learning |
paper code |
ECCV 2022 |
University of Surrey |
17 Jul 2022 |