🦜 VideoChat [paper/demo/Chinese docs]

In this study, we initiate an exploration into video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set the standard for future research.
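
As a minimal conceptual sketch of this design (not the repository's actual classes; the dimensions and module names below are made up for illustration), a frozen video encoder and a frozen LLM are bridged by a small learnable interface whose parameters are the only ones that need training:

    # Conceptual sketch only: learnable query tokens plus cross-attention summarize
    # pre-extracted video features and project them into the LLM embedding space.
    import torch
    import torch.nn as nn

    class LearnableVideoInterface(nn.Module):
        def __init__(self, video_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
            super().__init__()
            # Learnable query tokens that summarize the video features.
            self.queries = nn.Parameter(torch.randn(1, num_queries, video_dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
            # Projection into the (frozen) LLM's embedding space.
            self.proj = nn.Linear(video_dim, llm_dim)

        def forward(self, video_feats):  # video_feats: (B, num_tokens, video_dim)
            q = self.queries.expand(video_feats.size(0), -1, -1)
            summary, _ = self.cross_attn(q, video_feats, video_feats)
            return self.proj(summary)    # (B, num_queries, llm_dim) "video tokens"

    # Toy usage: one clip, 8 frames x 16 patch features per frame.
    feats = torch.randn(1, 8 * 16, 1024)
    video_tokens = LearnableVideoInterface()(feats)
    print(video_tokens.shape)  # torch.Size([1, 32, 4096])

In the actual system the video features come from the video foundation model and the projected tokens are prepended to the text-token embeddings of the frozen LLM; the sketch above only illustrates the data flow.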

🔥 Updates

  • 2023/11/29: VideoChat2 and MVBench are released.

  • 2023/06/09: Release code and scripts for pre-training and instruction tuning:

    • Simply run a script, e.g. bash ./exp/run_7b_stage1.sh.
    • You can change NNODE and set MASTER_NODE yourself. Stage 1 requires at least 8 GPUs for fast training; for stage 2, 4 GPUs are enough.
  • 2023/05/24: Release the stage-pretrained models.

  • 2023/05/12: Release the 7B version:

    • 🎊 Model-7B: the 7B model requires ~20 GB of GPU memory, while the 13B model requires ~32 GB (a quick free-memory check is sketched after this list).
  • 2023/05/11: Release the 🦜VideoChat V1, which can handle both image and video understanding!
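
If you are unsure whether your GPU fits either model, here is a quick free-memory check (a convenience sketch, not part of the repo, assuming PyTorch with CUDA is already installed as the project requires):

    # Convenience sketch: report free memory on the current CUDA device and compare
    # it against the rough 7B (~20 GB) / 13B (~32 GB) requirements listed above.
    import torch

    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    print(f"Free GPU memory: {free_gb:.1f} GB")
    print("7B:", "likely fits" if free_gb >= 20 else "likely too tight")
    print("13B:", "likely fits" if free_gb >= 32 else "likely too tight")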

⏳ Schedule

  • Small-scale video instruction data and tuning
  • Instruction tuning on BLIP+UniFormerV2+Vicuna
  • Large-scale and complex video instruction data
  • Instruction tuning on strong video foundation model
  • User-friendly interactions with longer videos
  • ...

💬 Online Examples 🦜

Comparison with ChatGPT, MiniGPT-4, LLaVA, and mPLUG-Owl.
Our VideoChat can handle both image and video understanding well!

  • [Video] Why is the video funny?
  • [Video] Spatial perception
  • [Video] Temporal perception
  • [Video] Multi-turn conversation
  • Image understanding

🏃 Usage

  • Prepare the environment.

    pip install -r requirements.txt
  • Download BLIP2 model:

    • ViT: wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
    • QFormer: wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
    • Change the vit_model_path and q_former_model_path in config.json or config_7b.json (a small path-patching helper is sketched after this list).
  • Download StableVicuna model:

    • LLaMA: Download it from the original repo or Hugging Face.
    • If you download LLaMA from the original repo, please convert it via the following commands:
    # convert_llama_weights_to_hf is copied from transformers
    python src/transformers/models/llama/convert_llama_weights_to_hf.py \
      --input_dir /path/to/downloaded/llama/weights \
      --model_size 13B --output_dir /output/path
    # fastchat v0.1.10
    python3 apply_delta.py \
      --base /path/to/model_weights/llama-13b \
      --target stable-vicuna-13b \
      --delta CarperAI/stable-vicuna-13b-delta
    # fastchat v0.1.10
    python3 apply_delta.py \
      --base /path/to/model_weights/llama-7b \
      --target vicuna-7b-v0 \
      --delta lmsys/vicuna-7b-delta-v0
  • Download VideoChat-13B or VideoChat-7B.

  • Running demo with Gradio:

    python demo.py
  • Another demo, as a Jupyter notebook, can be found in demo.ipynb.
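
For the "change the paths in config.json" step above, here is a small helper sketch (not part of the repo): it assumes the config is plain JSON and rewrites the named keys wherever they appear, and the paths shown are placeholders to replace with your own:

    # Helper sketch: point config.json at the downloaded checkpoints.
    # Key names come from the instructions above; the file paths are placeholders.
    import json

    def set_paths(config_file, updates):
        with open(config_file) as f:
            cfg = json.load(f)

        def patch(node):
            if isinstance(node, dict):
                for key, value in node.items():
                    if key in updates:
                        node[key] = updates[key]
                    else:
                        patch(value)
            elif isinstance(node, list):
                for item in node:
                    patch(item)

        patch(cfg)
        with open(config_file, "w") as f:
            json.dump(cfg, f, indent=2)

    set_paths("config.json", {
        "vit_model_path": "/path/to/eva_vit_g.pth",
        "q_former_model_path": "/path/to/blip2_pretrained_flant5xxl.pth",
    })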

🤖 Instruction Tuning

  • Simply run the scripts.
    bash ./exp/run_7b_stage1.sh
    bash ./exp/run_7b_stage2.sh
  • You can change NNODE and set MASTER_NODE yourself. Stage 1 requires at least 8 GPUs for fast training; for stage 2, 4 GPUs are enough.

📄 Citation

If you find this project useful in your research, please consider citing it:

@article{2023videochat,
  title={VideoChat: Chat-Centric Video Understanding},
  author={KunChang Li and Yinan He and Yi Wang and Yizhuo Li and Wenhai Wang and Ping Luo and Yali Wang and Limin Wang and Yu Qiao},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}

👍 Acknowledgement

Thanks to the following open-source projects:

InternVideo, UniFormerV2, MiniGPT-4, LLaVA, BLIP2, StableLM.