
MM Grounding DINO

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details because its training code has not been released. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result together with detailed settings for reproduction. Extensive experiments on these benchmarks demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.

Dataset Preparation

Please refer to dataset_prepare.md or the Chinese-language version of the dataset preparation guide.

✨ What's New

💎 We have released the pre-trained weights for Swin-B and Swin-L. You are welcome to try them and give us feedback.

Usage

Please refer to usage.md or the Chinese-language version of the usage guide. A minimal inference sketch is shown below.
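
For a quick sanity check before reading usage.md, the following is a minimal zero-shot inference sketch. It assumes MMDetection 3.x with MM Grounding DINO installed and that DetInferencer accepts a text prompt as described in usage.md; the config and checkpoint paths are placeholders.

```python
# Minimal zero-shot inference sketch (assumes MMDetection 3.x with
# MM Grounding DINO installed; config/checkpoint paths are placeholders).
from mmdet.apis import DetInferencer

config = 'configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py'  # placeholder
checkpoint = 'checkpoints/mm_grounding_dino_swin-t.pth'  # placeholder: a downloaded weight file

inferencer = DetInferencer(model=config, weights=checkpoint, device='cuda:0')

# Open-vocabulary prompt: category names separated by ' . ', Grounding DINO style.
results = inferencer(
    'demo/demo.jpg',
    texts='bench . car . person .',
    pred_score_thr=0.3,
    out_dir='outputs/',
)
print(results['predictions'][0]['labels'][:5], results['predictions'][0]['scores'][:5])
```

See usage.md for the full set of inference, evaluation, and fine-tuning commands.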

Zero-Shot COCO Results and Models

| Model | Backbone | Style | COCO mAP | Pre-Train Data | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | Swin-T | Zero-shot | 46.7 | O365 | | |
| GDINO-T | Swin-T | Zero-shot | 48.1 | O365,GoldG | | |
| GDINO-T | Swin-T | Zero-shot | 48.4 | O365,GoldG,Cap4M | config | model |
| MM-GDINO-T | Swin-T | Zero-shot | 48.5(+1.8) | O365 | config | |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.3) | O365,GoldG | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.5(+2.1) | O365,GoldG,GRIT | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.6(+2.2) | O365,GoldG,V3Det | config | model \| log |
| MM-GDINO-T | Swin-T | Zero-shot | 50.4(+2.0) | O365,GoldG,GRIT,V3Det | config | model \| log |
| MM-GDINO-B | Swin-B | Zero-shot | 52.5 | O365,GoldG,V3Det | config | model \| log |
| MM-GDINO-B* | Swin-B | - | 59.5 | O365,ALL | config | model \| log |
| MM-GDINO-L | Swin-L | Zero-shot | 53.0 | O365V2,OpenImageV6,GoldG | config | model \| log |
| MM-GDINO-L* | Swin-L | - | 60.3 | O365V2,OpenImageV6,ALL | config | model \| log |
  • * indicates that the model has not been fully trained yet; we will release the final weights in the future.
  • ALL: GoldG, V3Det, COCO2017, LVISV1, COCO2014, GRIT, RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO.
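
The COCO mAP column above is standard COCO-style bbox AP. As a reference point (not the project's own evaluation script), here is a minimal sketch of how such a number can be recomputed offline with pycocotools from a dumped prediction file; both file paths are placeholders.

```python
# Hedged sketch: recompute COCO bbox mAP from a dumped results file with pycocotools.
# Paths are placeholders; the results file is assumed to be in the standard COCO
# detection-results JSON format (image_id, category_id, bbox, score).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = 'data/coco/annotations/instances_val2017.json'   # placeholder
res_file = 'outputs/mm_gdino_val2017_results.bbox.json'     # placeholder

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # the first printed AP value corresponds to the mAP column above
```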

Zero-Shot LVIS Results

| Model | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP | Pre-Train Data |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO-T | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 | O365,GoldG,Cap4M |
| MM-GDINO-T | 28.1 | 30.2 | 42.0 | 35.7(+6.9) | 17.1 | 22.4 | 36.5 | 27.0(+6.9) | O365,GoldG |
| MM-GDINO-T | 26.6 | 32.4 | 41.8 | 36.5(+7.7) | 17.3 | 22.6 | 36.4 | 27.1(+7.0) | O365,GoldG,GRIT |
| MM-GDINO-T | 33.0 | 36.0 | 45.9 | 40.5(+11.7) | 21.5 | 25.5 | 40.2 | 30.6(+10.5) | O365,GoldG,V3Det |
| MM-GDINO-T | 34.2 | 37.4 | 46.2 | 41.4(+12.6) | 23.6 | 27.6 | 40.5 | 31.9(+11.8) | O365,GoldG,GRIT,V3Det |

Zero-Shot ODinW (Object Detection in the Wild) Results

Results and models of ODinW13

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| CottontailRabbits | 0.799 | 0.771 | 0.810 | 0.778 | 0.786 |
| EgoHands | 0.608 | 0.499 | 0.537 | 0.506 | 0.519 |
| NorthAmericaMushrooms | 0.507 | 0.331 | 0.462 | 0.669 | 0.767 |
| Packages | 0.687 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.556 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| pothole | 0.215 | 0.136 | 0.285 | 0.199 | 0.243 |
| Raccoon | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.542 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| Average | 0.514 | 0.453 | 0.511 | 0.516 | 0.533 |
  • The MM-GDINO-T config file is odinw13

Results and models of ODinW35

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| AerialMaritimeDrone_large | 0.173 | 0.133 | 0.155 | 0.177 | 0.151 |
| AerialMaritimeDrone_tiled | 0.206 | 0.170 | 0.225 | 0.184 | 0.206 |
| AmericanSignLanguageLetters | 0.002 | 0.016 | 0.020 | 0.011 | 0.007 |
| Aquarium | 0.195 | 0.252 | 0.261 | 0.266 | 0.283 |
| BCCD | 0.161 | 0.069 | 0.118 | 0.083 | 0.077 |
| boggleBoards | 0.000 | 0.002 | 0.001 | 0.001 | 0.002 |
| brackishUnderwater | 0.021 | 0.033 | 0.021 | 0.025 | 0.025 |
| ChessPieces | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| CottontailRabbits | 0.806 | 0.771 | 0.810 | 0.778 | 0.786 |
| dice | 0.004 | 0.002 | 0.005 | 0.001 | 0.001 |
| DroneControl | 0.042 | 0.047 | 0.097 | 0.088 | 0.074 |
| EgoHands_generic | 0.608 | 0.527 | 0.537 | 0.506 | 0.519 |
| EgoHands_specific | 0.002 | 0.001 | 0.005 | 0.007 | 0.003 |
| HardHatWorkers | 0.046 | 0.048 | 0.070 | 0.070 | 0.108 |
| MaskWearing | 0.004 | 0.009 | 0.004 | 0.011 | 0.009 |
| MountainDewCommercial | 0.430 | 0.453 | 0.465 | 0.194 | 0.430 |
| NorthAmericaMushrooms | 0.471 | 0.331 | 0.462 | 0.669 | 0.767 |
| openPoetryVision | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 |
| OxfordPets_by_breed | 0.003 | 0.002 | 0.004 | 0.006 | 0.004 |
| OxfordPets_by_species | 0.011 | 0.019 | 0.016 | 0.020 | 0.015 |
| PKLot | 0.001 | 0.004 | 0.002 | 0.008 | 0.007 |
| Packages | 0.695 | 0.707 | 0.687 | 0.710 | 0.706 |
| PascalVOC | 0.563 | 0.565 | 0.580 | 0.566 | 0.566 |
| pistols | 0.726 | 0.585 | 0.709 | 0.671 | 0.729 |
| plantdoc | 0.005 | 0.005 | 0.007 | 0.008 | 0.011 |
| pothole | 0.215 | 0.136 | 0.219 | 0.077 | 0.168 |
| Raccoons | 0.549 | 0.469 | 0.511 | 0.553 | 0.535 |
| selfdrivingCar | 0.089 | 0.091 | 0.076 | 0.094 | 0.083 |
| ShellfishOpenImages | 0.393 | 0.321 | 0.437 | 0.519 | 0.488 |
| ThermalCheetah | 0.087 | 0.063 | 0.081 | 0.030 | 0.045 |
| thermalDogsAndPeople | 0.657 | 0.556 | 0.603 | 0.493 | 0.543 |
| UnoCards | 0.006 | 0.012 | 0.010 | 0.009 | 0.005 |
| VehiclesOpenImages | 0.613 | 0.566 | 0.603 | 0.614 | 0.615 |
| WildfireSmoke | 0.134 | 0.106 | 0.154 | 0.042 | 0.127 |
| websiteScreenshots | 0.012 | 0.02 | 0.016 | 0.016 | 0.016 |
| Average | 0.227 | 0.202 | 0.228 | 0.214 | 0.284 |
  • The MM-GDINO-T config file is odinw35

Zero-Shot Referring Expression Comprehension Results

| Method | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- |
| RefCOCO val @1,5,10 | 50.8/89.5/94.9 | 53.1/89.9/94.7 | 53.4/90.3/95.5 | 52.1/89.8/95.0 | 53.1/89.7/95.1 |
| RefCOCO testA @1,5,10 | 57.4/91.3/95.6 | 59.7/91.5/95.9 | 58.8/91.70/96.2 | 58.4/86.8/95.6 | 59.1/91.0/95.5 |
| RefCOCO testB @1,5,10 | 45.0/86.5/92.9 | 46.4/86.9/92.2 | 46.8/87.7/93.3 | 45.4/86.2/92.6 | 46.8/87.8/93.6 |
| RefCOCO+ val @1,5,10 | 51.6/86.4/92.6 | 53.1/87.0/92.8 | 53.5/88.0/93.7 | 52.5/86.8/93.2 | 52.7/87.7/93.5 |
| RefCOCO+ testA @1,5,10 | 57.3/86.7/92.7 | 58.9/87.3/92.9 | 59.0/88.1/93.7 | 58.1/86.7/93.5 | 58.7/87.2/93.1 |
| RefCOCO+ testB @1,5,10 | 46.4/84.1/90.7 | 47.9/84.3/91.0 | 47.9/85.5/92.7 | 46.9/83.7/91.5 | 48.4/85.8/92.1 |
| RefCOCOg val @1,5,10 | 60.4/92.1/96.2 | 61.2/92.6/96.1 | 62.7/93.3/97.0 | 61.7/92.9/96.6 | 62.9/93.3/97.2 |
| RefCOCOg test @1,5,10 | 59.7/92.1/96.3 | 61.1/93.3/96.7 | 62.6/94.9/97.1 | 61.0/93.1/96.8 | 62.9/93.9/97.4 |

| Method | thresh_score | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 39.3/70.4 | | | | 39.4/67.5 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 40.5/83.8 | | | | 40.6/83.1 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 41.3/91.8 | 39.8/84.7 | 40.7/89.7 | 40.3/88.8 | 41.0/91.3 |
| gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 41.5/96.8 | | | | 41.1/96.4 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 31.9/70.4 | | | | 33.1/69.5 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 29.3/82.9 | | | | 29.2/84.3 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 27.2/90.2 | 26.3/89.0 | 26.0/91.9 | 25.4/91.8 | 26.1/93.0 |
| gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 25.1/96.3 | | | | 23.8/97.2 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.5 | 30.9/72.5 | | | | 33.0/69.6 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.6 | 30.0/86.1 | | | | 31.6/96.7 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.7 | 29.7/93.5 | 31.3/84.8 | 30.6/90.2 | 30.7/89.9 | 30.4/92.3 |
| gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc | 0.8 | 29.1/97.4 | | | | 29.5/84.2 |
  • The MM-GDINO-T config file is here

Zero-Shot Description Detection Dataset (DOD) Results

Install the evaluation dependency first: `pip install ddd-dataset`

| Method | mode | GDINO-T (O365,GoldG,Cap4M) | MM-GDINO-T (O365,GoldG) | MM-GDINO-T (O365,GoldG,GRIT) | MM-GDINO-T (O365,GoldG,V3Det) | MM-GDINO-T (O365,GoldG,GRIT,V3Det) |
| --- | --- | --- | --- | --- | --- | --- |
| FULL/short/middle/long/very long | concat | 17.2/18.0/18.7/14.8/16.3 | 15.6/17.3/16.7/14.3/13.1 | 17.0/17.7/18.0/15.7/15.7 | 16.2/17.4/16.8/14.9/15.4 | 17.5/23.4/18.3/14.7/13.8 |
| FULL/short/middle/long/very long | parallel | 22.3/28.2/24.8/19.1/13.9 | 21.7/24.7/24.0/20.2/13.7 | 22.5/25.6/25.1/20.5/14.9 | 22.3/25.6/24.5/20.6/14.7 | 22.9/28.1/25.4/20.4/14.4 |
| PRES/short/middle/long/very long | concat | 17.8/18.3/19.2/15.2/17.3 | 16.4/18.4/17.3/14.5/14.2 | 17.9/19.0/18.3/16.5/17.5 | 16.6/18.8/17.1/15.1/15.0 | 18.0/23.7/18.6/15.4/13.3 |
| PRES/short/middle/long/very long | parallel | 21.0/27.0/22.8/17.5/12.5 | 21.3/25.5/22.8/19.2/12.9 | 21.5/25.2/23.0/19.0/15.0 | 21.6/25.7/23.0/19.5/14.8 | 21.9/27.4/23.2/19.1/14.2 |
| ABS/short/middle/long/very long | concat | 15.4/17.1/16.4/13.6/14.9 | 13.4/13.4/14.5/13.5/11.9 | 14.5/13.1/16.7/13.6/13.3 | 14.8/12.5/15.6/14.3/15.8 | 15.9/22.2/17.1/12.5/14.4 |
| ABS/short/middle/long/very long | parallel | 26.0/32.0/33.0/23.6/15.5 | 22.8/22.2/28.7/22.9/14.7 | 25.6/26.8/33.9/24.5/14.7 | 24.1/24.9/30.7/23.8/14.7 | 26.0/30.3/34.1/23.9/14.6 |

Note:

  1. Because Inter-scenario evaluation takes a very long time and its performance is low, it is not supported for now; all metrics above are Intra-scenario.
  2. concat is the default inference mode of Grounding DINO: multiple sub-sentences are concatenated with "." into a single sentence for one inference pass, whereas parallel runs inference on each sub-sentence separately in a for-loop (see the sketch below).
  3. The MM-GDINO-T config files are concat_dod and parallel_dod
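
To make the two modes concrete, here is a small framework-free sketch of how the prompts are assembled; `run_inference` is a hypothetical stand-in for a single forward pass of the detector, not an MMDetection API.

```python
# Illustration of the concat vs. parallel prompt modes described in note 2.
# `run_inference` is a hypothetical stand-in for one detector forward pass.
def run_inference(image_path, text_prompt):
    """Placeholder detector call: would return (box, score, phrase) tuples."""
    return []

def concat_mode(image_path, sub_sentences):
    # Join all sub-sentences with ' . ' and run the detector once.
    prompt = ' . '.join(sub_sentences) + ' .'
    return run_inference(image_path, prompt)

def parallel_mode(image_path, sub_sentences):
    # Run the detector once per sub-sentence and merge the predictions.
    predictions = []
    for sentence in sub_sentences:
        predictions.extend(run_inference(image_path, sentence))
    return predictions

descriptions = ['a dog lying on the grass', 'a person holding a red umbrella']
concat_preds = concat_mode('demo.jpg', descriptions)
parallel_preds = parallel_mode('demo.jpg', descriptions)
```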

Pretrain Flickr30k Results

| Model | Pre-Train Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP-T | O365,GoldG | 84.9 | 94.9 | 96.3 | 85.6 | 95.4 | 96.7 |
| GLIP-T | O365,GoldG,CC3M,SBU | 85.3 | 95.5 | 96.9 | 86.0 | 95.9 | 97.2 |
| GDINO-T | O365,GoldG,Cap4M | 87.8 | 96.6 | 98.0 | 88.1 | 96.9 | 98.2 |
| MM-GDINO-T | O365,GoldG | 85.5 | 95.6 | 97.2 | 86.2 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT | 86.7 | 95.8 | 97.6 | 87.0 | 96.2 | 97.7 |
| MM-GDINO-T | O365,GoldG,V3Det | 85.9 | 95.7 | 97.4 | 86.3 | 95.7 | 97.4 |
| MM-GDINO-T | O365,GoldG,GRIT,V3Det | 86.7 | 96.0 | 97.6 | 87.2 | 96.2 | 97.7 |

Note:

  1. @1,5,10 refers to precision at the top 1, 5, and 10 positions in a predicted ranked list (a sketch of this kind of top-k metric follows these notes).
  2. The MM-GDINO-T config file is here
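
As an illustration of this kind of top-k metric (a sketch only, not the exact evaluation code used for these tables), the snippet below counts a phrase as a hit at rank k if any of the k highest-scoring boxes reaches IoU ≥ 0.5 with the ground-truth box; averaging the hit indicator over all query phrases gives one @k number.

```python
# Sketch of a top-k grounding metric; boxes are assumed to be [x1, y1, x2, y2].
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def hit_at_k(ranked_boxes, gt_box, k, iou_thr=0.5):
    """ranked_boxes must be sorted by predicted score, highest first."""
    return any(iou(box, gt_box) >= iou_thr for box in ranked_boxes[:k])

# Example: one phrase evaluated at k = 1, 5, 10.
preds = [[10, 10, 50, 60], [12, 8, 55, 58], [100, 100, 150, 160]]
gt = [11, 9, 52, 59]
print([hit_at_k(preds, gt, k) for k in (1, 5, 10)])
```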

Validating the generalization of a pre-trained model through fine-tuning

RTTS

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 48.1 |
| Cascade R-CNN | R-50 | 1x | 50.8 |
| ATSS | R-50 | 1x | 48.2 |
| TOOD | R-50 | 1x | 50.8 |
| MM-GDINO (zero-shot) | Swin-T | | 49.8 |
| MM-GDINO | Swin-T | 1x | 69.1 |

RUOD

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 52.4 |
| Cascade R-CNN | R-50 | 1x | 55.3 |
| ATSS | R-50 | 1x | 55.7 |
| TOOD | R-50 | 1x | 57.4 |
| MM-GDINO (zero-shot) | Swin-T | | 29.8 |
| MM-GDINO | Swin-T | 1x | 65.5 |

Brain Tumor

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 43.5 |
| Cascade R-CNN | R-50 | 50e | 46.2 |
| DINO | R-50 | 50e | 46.4 |
| Cascade-DINO | R-50 | 50e | 48.6 |
| MM-GDINO | Swin-T | 50e | 47.5 |

Cityscapes

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 30.1 |
| Cascade R-CNN | R-50 | 50e | 31.8 |
| DINO | R-50 | 50e | 34.5 |
| Cascade-DINO | R-50 | 50e | 34.8 |
| MM-GDINO (zero-shot) | Swin-T | | 34.2 |
| MM-GDINO | Swin-T | 50e | 51.5 |

People in Painting

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 50e | 17.0 |
| Cascade R-CNN | R-50 | 50e | 18.0 |
| DINO | R-50 | 50e | 12.0 |
| Cascade-DINO | R-50 | 50e | 13.4 |
| MM-GDINO (zero-shot) | Swin-T | | 23.1 |
| MM-GDINO | Swin-T | 50e | 38.9 |

COCO

(1) Closed-set performance

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| Faster R-CNN | R-50 | 1x | 37.4 |
| Cascade R-CNN | R-50 | 1x | 40.3 |
| ATSS | R-50 | 1x | 39.4 |
| TOOD | R-50 | 1x | 42.4 |
| DINO | R-50 | 1x | 50.1 |
| GLIP (zero-shot) | Swin-T | | 46.6 |
| GDINO (zero-shot) | Swin-T | | 48.5 |
| MM-GDINO (zero-shot) | Swin-T | | 50.4 |
| GLIP | Swin-T | 1x | 55.4 |
| GDINO | Swin-T | 1x | 58.1 |
| MM-GDINO | Swin-T | 1x | 58.2 |
  • The MM-GDINO-T config file is here

(2) Open-set continuing pretraining performance

| Architecture | Backbone | Lr schd | box AP |
| --- | --- | --- | --- |
| GLIP (zero-shot) | Swin-T | | 46.7 |
| GDINO (zero-shot) | Swin-T | | 48.5 |
| MM-GDINO (zero-shot) | Swin-T | | 50.4 |
| MM-GDINO | Swin-T | 1x | 54.7 |
  • The MM-GDINO-T config file is here
  • Due to the small size of the COCO dataset, continued pretraining solely on COCO can easily lead to overfitting. The results shown above are from the third epoch. We do not recommend training with this approach.

(3) Open vocabulary performance

| Architecture | Backbone | Lr schd | box AP | Base box AP | Novel box AP | box AP@50 | Base box AP@50 | Novel box AP@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO (zero-shot) | Swin-T | | 51.1 | 48.4 | 58.9 | 66.7 | 64.0 | 74.2 |
| MM-GDINO | Swin-T | 1x | 57.2 | 56.1 | 60.4 | 73.6 | 73.0 | 75.3 |
  • The MM-GDINO-T config file is here

LVIS 1.0

(1) Open-set continuing pretraining performance

| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GLIP (zero-shot) | Swin-T | | 18.1 | 21.2 | 33.1 | 26.7 | 10.8 | 14.7 | 29.0 | 19.6 |
| GDINO (zero-shot) | Swin-T | | 18.8 | 24.2 | 34.7 | 28.8 | 10.1 | 15.3 | 29.9 | 20.1 |
| MM-GDINO (zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 | 23.6 | 27.6 | 40.5 | 31.9 |
| MM-GDINO | Swin-T | 1x | 50.7 | 58.8 | 60.1 | 58.7 | 45.2 | 50.2 | 56.1 | 51.7 |
  • The MM-GDINO-T config file is here

(2) Open vocabulary performance

| Architecture | Backbone | Lr schd | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP |
| --- | --- | --- | --- | --- | --- | --- |
| MM-GDINO (zero-shot) | Swin-T | | 34.2 | 37.4 | 46.2 | 41.4 |
| MM-GDINO | Swin-T | 1x | 43.2 | 57.4 | 59.3 | 57.1 |
  • The MM-GDINO-T config file is here

RefEXP

RefCOCO

| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO (zero-shot) | Swin-T | | 50.8 | 89.5 | 94.9 | 57.5 | 91.3 | 95.6 | 45.0 | 86.5 | 92.9 |
| MM-GDINO (zero-shot) | Swin-T | | 53.1 | 89.7 | 95.1 | 59.1 | 91.0 | 95.5 | 46.8 | 87.8 | 93.6 |
| GDINO | Swin-T | UNK | 89.2 | | | 91.9 | | | 86.0 | | |
| MM-GDINO | Swin-T | 5e | 89.5 | 98.6 | 99.4 | 91.4 | 99.2 | 99.8 | 86.6 | 97.9 | 99.1 |
  • The MM-GDINO-T config file is here

RefCOCO+

| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | testA @1 | testA @5 | testA @10 | testB @1 | testB @5 | testB @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO (zero-shot) | Swin-T | | 51.6 | 86.4 | 92.6 | 57.3 | 86.7 | 92.7 | 46.4 | 84.1 | 90.7 |
| MM-GDINO (zero-shot) | Swin-T | | 52.7 | 87.7 | 93.5 | 58.7 | 87.2 | 93.1 | 48.4 | 85.8 | 92.1 |
| GDINO | Swin-T | UNK | 81.1 | | | 87.4 | | | 74.7 | | |
| MM-GDINO | Swin-T | 5e | 82.1 | 97.8 | 99.2 | 87.5 | 99.2 | 99.7 | 74.0 | 96.3 | 96.4 |
  • The MM-GDINO-T config file is here

RefCOCOg

| Architecture | Backbone | Lr schd | val @1 | val @5 | val @10 | test @1 | test @5 | test @10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO (zero-shot) | Swin-T | | 60.4 | 92.1 | 96.2 | 59.7 | 92.1 | 96.3 |
| MM-GDINO (zero-shot) | Swin-T | | 62.9 | 93.3 | 97.2 | 62.9 | 93.9 | 97.4 |
| GDINO | Swin-T | UNK | 84.2 | | | 84.9 | | |
| MM-GDINO | Swin-T | 5e | 85.5 | 98.4 | 99.4 | 85.8 | 98.6 | 99.4 |
  • The MM-GDINO-T config file is here

gRefCOCO

| Architecture | Backbone | Lr schd | val Pr@(F1=1, IoU≥0.5) | val N-acc | testA Pr@(F1=1, IoU≥0.5) | testA N-acc | testB Pr@(F1=1, IoU≥0.5) | testB N-acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GDINO (zero-shot) | Swin-T | | 41.3 | 91.8 | 27.2 | 90.2 | 29.7 | 93.5 |
| MM-GDINO (zero-shot) | Swin-T | | 41.0 | 91.3 | 26.1 | 93.0 | 30.4 | 92.3 |
| MM-GDINO | Swin-T | 5e | 45.1 | 64.7 | 42.5 | 65.5 | 40.3 | 63.2 |
  • The MM-GDINO-T config file is here

Citation

If you find this project useful in your research, please consider citing:

@article{zhao2024open,
  title={An Open and Comprehensive Pipeline for Unified Object Grounding and Detection},
  author={Zhao, Xiangyu and Chen, Yicheng and Xu, Shilin and Li, Xiangtai and Wang, Xinjiang and Li, Yining and Huang, Haian},
  journal={arXiv preprint arXiv:2401.02361},
  year={2024}
}