🦜 VideoChat [paper/demo/Chinese docs]

In this study, we initiate an exploration into video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research.
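
As a minimal conceptual sketch of this design (not the repository's actual classes; the dimensions and module names below are made up for illustration), a frozen video encoder and a frozen LLM are bridged by a small learnable interface whose parameters are the only ones that need training:

    # Conceptual sketch only: learnable query tokens plus cross-attention summarize
    # pre-extracted video features and project them into the LLM embedding space.
    import torch
    import torch.nn as nn

    class LearnableVideoInterface(nn.Module):
        def __init__(self, video_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
            super().__init__()
            # Learnable query tokens that summarize the video features.
            self.queries = nn.Parameter(torch.randn(1, num_queries, video_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
            # Projection into the (frozen) LLM's embedding space.
            self.proj = nn.Linear(video_dim, llm_dim)

        def forward(self, video_feats):  # video_feats: (B, num_tokens, video_dim)
            q = self.queries.expand(video_feats.size(0), -1, -1)
            summary, _ = self.cross_attn(q, video_feats, video_feats)
            return self.proj(summary)    # (B, num_queries, llm_dim) "video tokens"

    # Toy usage: one clip, 8 frames x 16 patch features per frame.
    feats = torch.randn(1, 8 * 16, 1024)
    video_tokens = LearnableVideoInterface()(feats)
    print(video_tokens.shape)  # torch.Size([1, 32, 4096])

In the actual system the video features come from the video foundation model and the projected tokens are prepended to the text-token embeddings of the frozen LLM; the sketch above only illustrates the data flow.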

🔥 Updates

  • 2023/11/29: VideoChat2 and MVBench are released.

  • 2023/06/09: Release code and scripts for pre-training and instruction tuning:

    • Simply run a script, e.g. bash ./exp/run_7b_stage1.sh.
    • You can change NNODE and set MASTER_NODE yourself. Stage 1 requires at least 8 GPUs for fast training; for stage 2, 4 GPUs are enough.
  • 2023/05/24: Release the stage-pretrained models.

  • 2023/05/12: Release the 7B version:

    • 🎊 Model-7B: the 7B model requires ~20 GB of GPU memory, while the 13B model requires ~32 GB (a quick free-memory check is sketched after this list).
  • 2023/05/11: Release the 🦜VideoChat V1, which can handle both image and video understanding!
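
If you are unsure whether your GPU fits either model, here is a quick free-memory check (a convenience sketch, not part of the repo, assuming PyTorch with CUDA is already installed as the project requires):

    # Convenience sketch: report free memory on the current CUDA device and compare
    # it against the rough 7B (~20 GB) / 13B (~32 GB) requirements listed above.
    import torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    print(f"Free GPU memory: {free_gb:.1f} GB")
    print("7B:", "likely fits" if free_gb >= 20 else "likely too tight")
    print("13B:", "likely fits" if free_gb >= 32 else "likely too tight")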

⏳ Schedule

  • Small-scale video instruction data and tuning
  • Instruction tuning on BLIP+UniFormerV2+Vicuna
  • Large-scale and complex video instruction data
  • Instruction tuning on strong video foundation model
  • User-friendly interactions with longer videos
  • ...

💬 Online Examples 🦜

Comparison with ChatGPT, MiniGPT-4, LLaVA, and mPLUG-Owl.
Our VideoChat can handle both image and video understanding well!

  • [Video] Why is the video funny?
  • [Video] Spatial perception
  • [Video] Temporal perception
  • [Video] Multi-turn conversation
  • Image understanding

🏃 Usage

  • Prepare the environment.

    pip install -r requirements.txt
  • Download BLIP2 model:

    • ViT: wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
    • QFormer: wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
    • Change the vit_model_path and q_former_model_path in config.json or config_7b.json (a small path-patching helper is sketched after this list).
  • Download StableVicuna model:

    • LLaMA: Download it from the original repo or Hugging Face.
    • If you download LLaMA from the original repo, please convert it via the following commands:
    # convert_llama_weights_to_hf is copied from transformers
    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
      --input_dir /path/to/downloaded/llama/weights \
      --model_size 13B --output_dir /output/path
    # fastchat v0.1.10
    python3 apply_delta.py \
      --base /path/to/model_weights/llama-13b \
      --target stable-vicuna-13b \
      --delta CarperAI/stable-vicuna-13b-delta
    # fastchat v0.1.10
    python3 apply_delta.py \
      --base /path/to/model_weights/llama-7b \
      --target vicuna-7b-v0 \
      --delta lmsys/vicuna-7b-delta-v0
  • Download VideoChat-13B or VideoChat-7B.

  • Running demo with Gradio:

    python demo.py
  • Another demo, as a Jupyter notebook, can be found in demo.ipynb.
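
For the "change the paths in config.json" step above, here is a small helper sketch (not part of the repo): it assumes the config is plain JSON and rewrites the named keys wherever they appear, and the paths shown are placeholders to replace with your own:

    # Helper sketch: point config.json at the downloaded checkpoints.
    # Key names come from the instructions above; the file paths are placeholders.
    import json

    def set_paths(config_file, updates):
        with open(config_file) as f:
            cfg = json.load(f)

        def patch(node):
            if isinstance(node, dict):
                for key, value in node.items():
                    if key in updates:
                        node[key] = updates[key]
                    else:
                        patch(value)
            elif isinstance(node, list):
                for item in node:
                    patch(item)

        patch(cfg)
        with open(config_file, "w") as f:
            json.dump(cfg, f, indent=2)

    set_paths("config.json", {
        "vit_model_path": "/path/to/eva_vit_g.pth",
        "q_former_model_path": "/path/to/blip2_pretrained_flant5xxl.pth",
    })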

🤖 Instruction Tuning

  • Simply run the scripts.
    bash ./exp/run_7b_stage1.sh
    bash ./exp/run_7b_stage2.sh
  • You can change NNODE and set MASTER_NODE yourself. Stage 1 requires at least 8 GPUs for fast training; for stage 2, 4 GPUs are enough.

📄 Citation

If you find this project useful in your research, please consider citing it:

@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}

👍 Acknowledgement

Thanks to the following open-source projects:

InternVideo, UniFormerV2, MiniGPT-4, LLaVA, BLIP2, StableLM.