Video & Language Transformer

| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | COOT | COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | paper, code | NeurIPS 2020 | University of Freiburg | 1 Nov 2020 |
| 2 | MMT | Multi-modal Transformer for Video Retrieval | paper, code | ECCV 2020 | Inria & Google | 21 Jul 2020 |
| 3 | HiT | HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval | paper | arXiv | Peking University | 28 Mar 2021 |
| 4 | CLIPBERT | Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | paper, code | CVPR 2021 | UNC Chapel Hill | 11 Feb 2021 |
| 5 | SVRTN | Self-supervised Video Retrieval Transformer Network | paper | arXiv | Alibaba DAMO Academy | 16 Apr 2021 |
| 6 | VATT | VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | paper | arXiv | Google | 22 April 2021 |
| 7 | Frozen in Time | Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | paper, code | arXiv | University of Oxford | 1 April 2021 |
| 8 | CLIP4Clip | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | paper, code | arXiv | Southwest Jiaotong University | 18 April 2021 |
| 9 | CLIP2Video | CLIP2Video: Mastering Video-Text Retrieval via Image CLIP | paper, code | arXiv | PCG, Tencent | 21 June 2021 |
| 10 | T2VLAD | T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval | paper | CVPR 2021 | Baidu | 20 April 2021 |
| 11 | - | On Semantic Similarity in Video Retrieval | paper, code | CVPR 2021 | University of Bristol | 21 June 2021 |
| 12 | VLM | VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding | paper | arXiv | Facebook AI | 20 May 2021 |
| 13 | VideoBERT | VideoBERT: A Joint Model for Video and Language Representation Learning | paper | ICCV 2019 | Google Research | 11 Sep 2019 |
| 14 | CBT | Learning Video Representations using Contrastive Bidirectional Transformer | paper | arXiv | Google Research | 27 Sep 2019 |
| 15 | ActBERT | ActBERT: Learning Global-Local Video-Text Representations | paper | CVPR 2020 | Baidu Research | 14 Nov 2020 |
| 16 | HERO | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | paper, code | EMNLP 2020 | Microsoft Dynamics 365 AI Research | 29 Sep 2020 |
| 17 | UniVL | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | paper, code | arXiv | MSRA | 15 Sep 2021 |
| 18 | G-TAD | Boundary-sensitive Pre-training for Temporal Localization in Videos | paper | ICCV 2021 | Samsung AI Centre Cambridge, UK | 26 Mar 2021 |
| 19 | UniVL | UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | paper, code | arXiv | Microsoft | 15 Feb 2020 |
| 20 | ActBERT | ActBERT: Learning Global-Local Video-Text Representations | paper | CVPR 2020 | Baidu Research | 14 Nov 2020 |
| 21 | HERO | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | paper, code | EMNLP 2020 | Microsoft Dynamics 365 AI Research | 1 May 2020 |
| 22 | MM-ViT | MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition | paper | arXiv | Oppo | 20 Aug 2021 |
| 23 | TPT | Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering | paper | arXiv | Chinese Academy of Sciences | 10 Sep 2021 |
| 24 | ActionCLIP | ActionCLIP: A New Paradigm for Video Action Recognition | paper, code | arXiv | Zhejiang University | 17 Sep 2021 |
| 25 | justAsk | Just Ask: Learning to Answer Questions from Millions of Narrated Videos | paper, code | ICCV 2021 | Inria Paris | 12 Aug 2021 |
| 26 | - | A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer | paper, code | arXiv | Zhejiang University | 9 Dec 2021 |
| 27 | SWINBERT | SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning | paper | arXiv | Microsoft | 25 Nov 2021 |
| 28 | VIOLET | VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | paper, code | arXiv | UC Santa Barbara | 24 Nov 2021 |
| 29 | FashionViL | FashionViL: Fashion-Focused Vision-and-Language Representation Learning | paper, github | ECCV 2022 | University of Surrey | 17 Jul 2022 |

cross-domain video retrieval

| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | - | Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | paper | CVPR 2021 | Zhejiang University | 20 April 2021 |

vision & language navigation

| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Episodic Transformer | Episodic Transformer for Vision-and-Language Navigation | paper | arXiv | Inria | 13 May 2021 |