Xinhao Li, Zhenpeng Huang, Jing Wang, Kunchang Li, and Limin Wang.
- 2024/06/12: Released the annotations and evaluation code of VideoEval, which includes VidTAB and VidEB.
VidTAB builds on MMAction2 for training and evaluation:
pip install -U openmim
mim install mmengine 'mmcv>=2.0.0rc1'
mim install "mmdet>=3.0.0rc5"
mim install "mmpose>=1.0.0rc0"
git clone https://github.com/leexinhao/VideoEval.git
cd VideoEval/VidTAB
pip install -v -e .
Due to potential copyright issues, please refer to DATA.md to download the original videos of each dataset separately. We will share our version of the dataset once we have confirmed that there are no copyright issues.
For VidTAB, you can directly use the annotations we prepared.
For training and evaluation, you can refer to here; we also provide configs for different VFMs for your reference.
In brief, you can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU (our experiments also use only one GPU).
python tools/train.py ${CONFIG_FILE} [ARGS]
We also provide a training script, tools/my_train.sh, so that you do not have to set [ARGS] yourself. You can then start using VidTAB by running commands like these:
bash tools/my_train.sh configs/video_eval/AR_in_Dark/Internvideo2/frozen_tuning/InternVideo2-1B-stage1-pt_16_shot_bs16.py
bash tools/my_train.sh configs/video_eval/AR_in_Dark/Internvideo2/frozen_tuning/InternVideo2-1B-stage1_100_shot_bs16.py
bash tools/my_train.sh configs/video_eval/AR_in_Dark/Internvideo2/frozen_tuning/InternVideo2-1B-stage1-pt_100_shot_bs16.py
...
bash tools/my_train.sh configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-200M_16_shot_bs16.py
bash tools/my_train.sh configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-10M_100_shot_bs16.py
bash tools/my_train.sh configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-10M_16_shot_bs16.py
bash tools/my_train.sh configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-200M_100_shot_bs16.py
bash tools/my_train.sh configs/video_eval/Fake_face/ZeroI2V/linear_adapter0d125/ZeroI2V-CLIP-L_100_shot_bs16.py
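To queue several configs one after another, a small Python wrapper around the same command can help. This is a minimal sketch and not part of the repository: `build_commands` and `run_all` are hypothetical helper names, and it assumes you run it from the VidTAB directory so that tools/my_train.sh resolves.

```python
import subprocess

# Config paths copied from the examples above; extend the list as needed.
CONFIGS = [
    "configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-10M_16_shot_bs16.py",
    "configs/video_eval/Fake_face/ViCLIP/frozen_tuning/ViCLIP-10M_100_shot_bs16.py",
]


def build_commands(configs, script="tools/my_train.sh"):
    """Return one `bash <script> <config>` argument list per config."""
    return [["bash", script, cfg] for cfg in configs]


def run_all(configs, dry_run=True):
    """Run the configs sequentially; with dry_run=True only print the commands."""
    for cmd in build_commands(configs):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the queue at the first failing run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(CONFIGS, dry_run=True)
```

Set `dry_run=False` once the printed commands look right.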
You can then find the corresponding log file in the work dir to see the result. In all our experiments, we ran validation during training and selected the epoch with the highest accuracy, so no additional testing is needed after training completes. Please also note that we used a single clip rather than three clips to obtain the final performance metrics.
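The selection rule (report the epoch with the highest validation accuracy) can be sketched as follows. This is an illustration only: `select_best_epoch` is a hypothetical helper, the accuracy values are made-up placeholders rather than results from any experiment, and reading the per-epoch accuracies out of the log file is left to you.

```python
def select_best_epoch(val_acc_by_epoch):
    """Return (epoch, accuracy) for the epoch with the highest validation accuracy.

    Ties are resolved in favour of the earliest epoch.
    """
    if not val_acc_by_epoch:
        raise ValueError("no validation results given")
    best_epoch = min(val_acc_by_epoch, key=lambda e: (-val_acc_by_epoch[e], e))
    return best_epoch, val_acc_by_epoch[best_epoch]


# Placeholder numbers for illustration only (epoch -> top-1 accuracy in %).
example = {1: 41.2, 2: 47.9, 3: 46.5}
epoch, acc = select_best_epoch(example)
print(f"best epoch: {epoch}, accuracy: {acc}")
```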
Prompts for Zero-Shot Evaluation: see prompts for image backbones, prompts for video backbones.
bash exp/vid_zs.sh  # for video language models
bash exp/img_zs.sh  # for image language models
For evaluation, we provide an example that demonstrates the pipeline of embedding extraction and evaluation.
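To give a rough idea of what evaluating extracted embeddings can look like, here is a generic retrieval-style check in plain Python. It is a sketch under stated assumptions, not the VidEB protocol: the metric (top-1 accuracy under cosine similarity), the helper names, and the toy 2-D vectors are all illustrative, and the actual tasks and metrics are defined by the provided example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_retrieval_accuracy(queries, gallery):
    """Fraction of queries whose most similar gallery item shares their label.

    `queries` and `gallery` are lists of (embedding, label) pairs.
    """
    hits = 0
    for q_emb, q_label in queries:
        _, best_label = max(gallery, key=lambda item: cosine(q_emb, item[0]))
        hits += best_label == q_label
    return hits / len(queries)


# Toy 2-D embeddings standing in for features extracted by a video backbone.
gallery = [([1.0, 0.0], "a"), ([0.0, 1.0], "b")]
queries = [([0.9, 0.1], "a"), ([0.2, 0.8], "b"), ([0.7, 0.3], "b")]
print(top1_retrieval_accuracy(queries, gallery))
```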
Thanks to the following open-source projects and datasets: ARID, Breakfast, Animal Kingdom, SurgicalActions160, FaceForensics++, MOB, DOVER, CAER, vsc2022, FIVR-200K, Ask-Anything, UMT, EVA, InternVideo, SigLIP, CLIP, jepa, dinov2, VideoMAE, VideoMAEv2, MMAction2.
If you find this project useful in your research, please consider citing:
@article{li2024videoeval,
title={VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model},
author={Li, Xinhao and Huang, Zhenpeng and Wang, Jing and Li, Kunchang and Wang, Limin},
journal={arXiv preprint arXiv:2407.06491},
year={2024}
}