magic-research/Sa2VA

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[🏠 Sa2VA] [📜 arXiv] [🤗 HuggingFace] [🎥 Introduction] [🧑‍💻 GitHub] [Gradio Demo (Our internal: Sa2VA-4B)] [Gradio Demo (By HuggingFace Official)]

Haobo Yuan1* · Xiangtai Li2*† · Tao Zhang2,3* · Zilong Huang2 · Shilin Xu4 · Shunping Ji3 · Yunhai Tong4 · Lu Qi2 · Jiashi Feng2 · Ming-Hsuan Yang1

1UC Merced    2ByteDance Seed    3WHU    4PKU

† Project lead. * The first three authors contributed equally to this work.

Teaser

Open-source progress

  • Release open-source training datasets.
  • Release Ref-SAV dataset.
  • Release evaluation code for each dataset.
  • Release 1B, 4B, 8B, and 26B models.
  • Release training code.
  • Release inference and test code.
  • Release demo code.

Overview

This repository contains the code for the paper "Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos".

Sa2VA is the first unified model for the dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space.

Model Zoo

We provide the following models:

| Model Name | Base MLLM | Language Part | HF Link |
|---|---|---|---|
| Sa2VA-1B | InternVL2.0-1B | Qwen2-0.5B-Instruct | 🤗 link |
| Sa2VA-4B | InternVL2.5-4B | Qwen2.5-3B-Instruct | 🤗 link |
| Sa2VA-8B | InternVL2.5-8B | internlm2_5-7b-chat | 🤗 link |
| Sa2VA-26B | InternVL2.5-26B | internlm2_5-20b-chat | 🤗 link |

🤗 Gradio Demos

We provide a script that implements interactive chat with Gradio. It requires gradio==4.42.0 and lets you quickly build a local chat interface.

PYTHONPATH=. python projects/llava_sam2/gradio/app.py ByteDance/Sa2VA-4B

🚀 Quick Start

Our Sa2VA models are available on 🤗 HuggingFace. You can try them on your own data in a few steps. To avoid installing training-only packages, install the dependencies listed in demo/requirements.txt instead.

Option 1 - scripts:

Suppose you have a folder (PATH_TO_FOLDER) that contains the frames of a video as images. You can use the following script to chat with the Sa2VA model or to segment objects in the video.

> cd scripts
> python demo/demo.py PATH_TO_FOLDER --model_path ByteDance/Sa2VA-8B --work-dir OUTPUT_DIR --text "<image>Please describe the video content."

If the output contains the segmentation results, the results will be saved to OUTPUT_DIR.
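
When running the demo over several videos, it can help to check each folder and assemble the command programmatically. The following is a minimal sketch, not part of the repository: the helper names `list_frames` and `build_demo_command` are hypothetical, the set of frame extensions is an assumption, and prepending `<image>` to the prompt simply mirrors the example command above.

```python
from pathlib import Path

# Extensions treated as video frames; this set is an assumption for illustration.
FRAME_EXTS = {".jpg", ".jpeg", ".png"}


def list_frames(folder):
    """Return the image files in `folder`, sorted by file name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in FRAME_EXTS)


def build_demo_command(folder, out_dir, text, model="ByteDance/Sa2VA-8B"):
    """Assemble the demo invocation shown above as an argument list."""
    if not list_frames(folder):
        raise ValueError(f"no image frames found in {folder}")
    if not text.startswith("<image>"):
        # The example prompt above starts with this tag; we mirror it here.
        text = "<image>" + text
    return [
        "python", "demo/demo.py", str(folder),
        "--model_path", model,
        "--work-dir", str(out_dir),
        "--text", text,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory used in the command above.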

Option 2 - Jupyter Notebook:

Please refer to demo.ipynb.

🎥 Demo

Demo 1 Input Video (Source: La La Land 2016):

Instruction: "Please segment the girl wearing the yellow dress."

Demo 2 Input Video (Source: La La Land 2016):

Instruction: "Please segment the main character."

Demo 3 Input Video (Source: Internet):

Instruction: "Please segment the person wearing sun glasses."

Demo 4 Input Video (Source: Internet):

Instruction: "Please segment the singing girl."

Demo 5 Input Video:

Instruction: "What is the atmosphere of the scene?"

Answer: "The scene has a dark and mysterious atmosphere, with the men dressed in suits and ties, and the dimly lit room."

Training

Installation

  1. Install Python and PyTorch first:
> conda create -n vlm python=3.10
> conda activate vlm
> conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch  -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
  2. Install mmcv:
> pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
  3. Install other dependencies:
> pip install -r requirements.txt
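
After the three steps above, a quick way to confirm that the pinned versions ended up in the environment is to query the installed package metadata. This is a hedged sketch: the helper name `check_pins` is hypothetical, and it assumes the packages register under the distribution names `torch`, `torchvision`, and `mmcv`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the installation commands above.
PINNED = {"torch": "2.3.1", "torchvision": "0.18.1", "mmcv": "2.2.0"}


def check_pins(pins=PINNED):
    """Map each package name to 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as '+cu121' are ignored when comparing.
        report[name] = "ok" if found.split("+")[0] == wanted else f"found {found}"
    return report
```

Calling `check_pins()` inside the `vlm` environment returns one entry per pinned package, so a mismatch is visible before a long training run starts.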

Pretrained Model Preparation

You are expected to download the following pretrained models and place them in the ./pretrained directory:

Data Preparation

Please download the training datasets and place them in the data directory. The download link is here.

Please directly put the zip files into the data directory and unzip them. For example, you can download the video_datas_mevis.zip and unzip it in the data directory like:

> unzip video_datas_mevis.zip
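
If several archives have been downloaded, they can be extracted in one pass. A minimal sketch using only the Python standard library; the function name `unzip_all` is hypothetical, and it assumes every archive sits directly inside the data directory.

```python
import zipfile
from pathlib import Path


def unzip_all(data_dir="data"):
    """Extract every .zip archive found in `data_dir` in place.

    Returns the names of the archives that were extracted.
    """
    data_dir = Path(data_dir)
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Same effect as running `unzip <archive>` inside data_dir.
            zf.extractall(data_dir)
        extracted.append(archive.name)
    return extracted
```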

The final data structure should be like:

data/
├── video_datas
|   ├── revos
|   ├── mevis
|   └── davis17
├── glamm_data
|   ├── images
|   ├── annotations
├── osprey-724k
|   ├── Osprey-724K
|   ├── coco
├── llava_data
|   ├── llava_images
|   ├── LLaVA-Instruct-150K
|   ├── LLaVA-Pretrain
├── ref_sav
|   ├── sam_v_full
|   ├── Ref-SAV.json

sam_v_full is the SA-V dataset, which is not included in the download link. You can download it from here.
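
Before launching training, it may be worth checking that the layout matches the tree above. The sketch below is illustrative only: `missing_entries` is a hypothetical helper, and the path list is transcribed from the tree rather than taken from the training configs.

```python
from pathlib import Path

# Relative paths transcribed from the directory tree above.
EXPECTED = [
    "video_datas/revos",
    "video_datas/mevis",
    "video_datas/davis17",
    "glamm_data/images",
    "glamm_data/annotations",
    "osprey-724k/Osprey-724K",
    "osprey-724k/coco",
    "llava_data/llava_images",
    "llava_data/LLaVA-Instruct-150K",
    "llava_data/LLaVA-Pretrain",
    "ref_sav/sam_v_full",
    "ref_sav/Ref-SAV.json",
]


def missing_entries(data_dir="data"):
    """Return the expected paths that do not exist under `data_dir`."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty result means every listed entry is present; `ref_sav/sam_v_full` will be reported until the SA-V dataset has been downloaded separately.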

Training Script

Please run the following script to train:

> bash tools/dist.sh train projects/llava_sam2/configs/sa2va_4b.py 8

Convert trained model to HuggingFace format

Please run the following script to convert:

> python projects/llava_sam2/hf/convert_to_hf.py projects/llava_sam2/configs/sa2va_4b.py --pth-model PATH_TO_PTH_MODEL --save-path PATH_TO_SAVE_FOLDER
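
Training and conversion share the same config file, so a small wrapper can keep the two commands in sync. This is a sketch under assumptions: the function names are hypothetical, and the trailing `8` in the training command is assumed here to be the number of GPUs.

```python
import shlex

# Config path used by both commands above.
CONFIG = "projects/llava_sam2/configs/sa2va_4b.py"


def train_command(config=CONFIG, num_gpus=8):
    """Build the training command; `num_gpus` is our reading of the final argument."""
    return ["bash", "tools/dist.sh", "train", config, str(num_gpus)]


def convert_command(pth_model, save_path, config=CONFIG):
    """Build the HuggingFace conversion command for a trained checkpoint."""
    return [
        "python", "projects/llava_sam2/hf/convert_to_hf.py", config,
        "--pth-model", pth_model,
        "--save-path", save_path,
    ]


def as_shell(cmd):
    """Render an argument list as a single shell-quoted line."""
    return shlex.join(cmd)
```

Both lists can be passed to `subprocess.run(cmd, check=True)` from the repository root, or printed with `as_shell` for copy and paste.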

References

If you find this repository useful, please consider citing the following paper:

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv},
  year={2025}
}
