
Simple Object Tracking

Multi-Object Tracking with YOLOv5, CLIP, DINO and DeepSORT

Introduction

This is a simple two-stage multi-object tracker built from YOLOv5 and DeepSORT with zero-shot or self-supervised feature extractors.

Normally, the deep part of DeepSORT is trained on a person re-identification dataset like Market1501. Here, that model is replaced with a zero-shot or self-supervised feature extractor, which makes the tracker ready to track any class without re-training.

SOTA models like CLIP (zero-shot) and DINO (self-supervised) are currently being experimented with. If better models come out, I will consider adding them.
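
As a minimal sketch of the idea (not this repository's actual code; the `extract_features` helper and the box format are illustrative), a frozen CLIP image encoder can embed each detected box, and those embeddings feed DeepSORT's appearance-based association:

```python
import clip            # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a frozen zero-shot image encoder; no re-identification training is needed.
model, preprocess = clip.load("RN50", device=device)

@torch.no_grad()
def extract_features(frame: Image.Image, boxes):
    """Illustrative helper: crop each (x1, y1, x2, y2) box and embed it with CLIP."""
    crops = [preprocess(frame.crop(box)) for box in boxes]
    batch = torch.stack(crops).to(device)
    feats = model.encode_image(batch)                 # (N, 1024) for RN50
    return feats / feats.norm(dim=-1, keepdim=True)   # cosine-ready embeddings for DeepSORT
```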

Requirements

  • torch >= 1.8.1
  • torchvision >= 0.9.1

Other requirements can be installed with pip install -r requirements.txt.

Clone the repository recursively:

$ git clone --recursive https://github.com/sithu31296/simple-object-tracking.git

Then, download YOLO model weights from YOLOv5 and place them in the checkpoints directory.
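
If you want to sanity-check the downloaded weights outside of track.py, one way is to load them through the official Ultralytics hub entry point (this is only a quick check; track.py uses its own loader):

```python
import torch

# Load the downloaded checkpoint via the YOLOv5 'custom' hub entry point.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='checkpoints/yolov5s.pt')

# Run inference on any test image and print the detections.
results = model('https://ultralytics.com/images/zidane.jpg')
results.print()
```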

Tracking

Track all classes:

## webcam
$ python track.py --source 0 --yolo-model checkpoints/yolov5s.pt --reid-model CLIP-RN50

## video
$ python track.py --source VIDEO_PATH --yolo-model checkpoints/yolov5s.pt --reid-model CLIP-RN50

Track only specified classes:

## track only person class
$ python track.py --source 0 --yolo-model checkpoints/yolov5s.pt --reid-model CLIP-RN50 --filter-class 0

## track person and car classes
$ python track.py --source 0 --yolo-model checkpoints/yolov5s.pt --reid-model CLIP-RN50 --filter-class 0 2

Available ReID models (Feature Extractors):

  • CLIP: CLIP-RN50, CLIP-ViT-B/32
  • DINO: DINO-XciT-S12/16, DINO-XciT-M24/16, DINO-ViT-S/16, DINO-ViT-B/16

Check here to get the COCO class index for your class.
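
For reference, --filter-class follows YOLOv5's 80-class COCO ordering (0 = person, 2 = car, ...). A minimal sketch of what such a filter amounts to, with illustrative detection tuples rather than the repository's internal format:

```python
# Each detection: (x1, y1, x2, y2, confidence, class_id) in YOLOv5's COCO order.
dets = [
    (10, 20, 110, 220, 0.91, 0),   # person
    (50, 60, 300, 200, 0.80, 2),   # car
    (15, 25, 90, 180, 0.70, 16),   # dog
]

def filter_detections(detections, keep_classes):
    """Keep only detections whose class id is in keep_classes."""
    keep = set(keep_classes)
    return [d for d in detections if int(d[5]) in keep]

print(filter_detections(dets, keep_classes=[0, 2]))  # drops the dog
```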

Evaluate on MOT16

  • Download MOT16 dataset from here and unzip it.
  • Download mot-challenge ground-truth data for evaluating with TrackEval. Then, unzip it under the project directory.
  • Save the tracking results of MOT16 with the following command:
$ python eval_mot.py --root MOT16_ROOT_DIR --yolo-model checkpoints/yolov5m.pt --reid-model CLIP-RN50
  • Evaluate with TrackEval:
$ python TrackEval/scripts/run_mot_challenge.py \
    --BENCHMARK MOT16 \
    --GT_FOLDER PROJECT_ROOT/data/gt/mot_challenge/ \
    --TRACKERS_FOLDER PROJECT_ROOT/data/trackers/mot_challenge/ \
    --TRACKERS_TO_EVAL mot_det \
    --SPLIT_TO_EVAL train \
    --USE_PARALLEL True \
    --NUM_PARALLEL_CORES 4 \
    --PRINT_ONLY_COMBINED True

Note: the FOLDER parameters in run_mot_challenge.py must be absolute paths.
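
For context, TrackEval reads per-sequence text files in the standard MOTChallenge 2D format, one row per tracked box: frame, id, bb_left, bb_top, bb_width, bb_height, conf, -1, -1, -1. eval_mot.py writes these files for you; the sketch below only illustrates the format (the output path and track values are made up):

```python
from pathlib import Path

def write_mot_results(path, tracks):
    """tracks: iterable of (frame, track_id, x, y, w, h, conf) tuples."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        for frame, tid, x, y, w, h, conf in tracks:
            # MOTChallenge 2D format: unused world coordinates are set to -1.
            f.write(f"{frame},{tid},{x:.2f},{y:.2f},{w:.2f},{h:.2f},{conf:.2f},-1,-1,-1\n")

# Example: one track over two frames of MOT16-02 (hypothetical values).
write_mot_results(
    "data/trackers/mot_challenge/MOT16-train/mot_det/data/MOT16-02.txt",
    [(1, 1, 100.0, 200.0, 50.0, 120.0, 0.95),
     (2, 1, 103.0, 201.0, 50.0, 120.0, 0.94)],
)
```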

For tracking persons, a model trained on a multi-person dataset will give better accuracy than a COCO-pretrained model. You can download a YOLOv5m model trained on the CrowdHuman dataset from here. The weights are from deepakcrk/yolov5-crowdhuman. It has 2 classes, 'person' and 'head', so you can use this model for both person and head tracking.

Results

MOT16 Evaluation Results

| Detector | Feature Extractor | MOTA↑ | HOTA↑ | IDF1↑ | IDsw↓ | MT↑ | ML↓ | FP↓ | FN↓ | FPS (GTX1660ti) |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5m (COCO) | CLIP (RN50) | 35.42 | 35.37 | 39.42 | 486 | 115 | 192 | 6880 | 63931 | 7 |
| YOLOv5m (CrowdHuman) | CLIP (RN50) | 53.25 | 43.25 | 52.12 | 912 | 196 | 89 | 14076 | 36625 | 6 |
| YOLOv5m (CrowdHuman) | CLIP (ViT-B/32) | 53.35 | 43.03 | 51.25 | 896 | 199 | 91 | 14035 | 36575 | 4 |
| YOLOv5m (CrowdHuman) | DINO (XciT-S12/16) | 54.41 | 47.44 | 59.01 | 511 | 184 | 101 | 12265 | 37555 | 8 |
| YOLOv5m (CrowdHuman) | DINO (ViT-S/16) | 54.56 | 47.61 | 58.94 | 519 | 189 | 97 | 12346 | 37308 | 8 |
| YOLOv5m (CrowdHuman) | DINO (XciT-M24/16) | 54.56 | 47.71 | 59.77 | 504 | 187 | 96 | 12364 | 37306 | 5 |
| YOLOv5m (CrowdHuman) | DINO (ViT-B/16) | 54.58 | 47.55 | 58.89 | 507 | 184 | 97 | 12017 | 37621 | 5 |

FPS Results

| Detector | Feature Extractor | GPU | Precision | Image Size | Detections/Frame | FPS |
|---|---|---|---|---|---|---|
| YOLOv5s | CLIP-RN50 | GTX-1660ti | FP32 | 480x640 | 1 | 38 |
| YOLOv5s | CLIP-ViT-B/32 | GTX-1660ti | FP32 | 480x640 | 1 | 30 |
| YOLOv5s | DINO-XciT-S12/16 | GTX-1660ti | FP32 | 480x640 | 1 | 36 |
| YOLOv5s | DINO-ViT-B/16 | GTX-1660ti | FP32 | 480x640 | 1 | 30 |
| YOLOv5s | DINO-XciT-M24/16 | GTX-1660ti | FP32 | 480x640 | 1 | 25 |

References

Citations

@inproceedings{caron2021emerging,
  title={Emerging Properties in Self-Supervised Vision Transformers},
  author={Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J\'egou, Herv\'e  and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
  booktitle={Proceedings of the International Conference on Computer Vision (ICCV)},
  year={2021}
}

@article{el2021xcit,
  title={XCiT: Cross-Covariance Image Transformers},
  author={El-Nouby, Alaaeldin and Touvron, Hugo and Caron, Mathilde and Bojanowski, Piotr and Douze, Matthijs and Joulin, Armand and Laptev, Ivan and Neverova, Natalia and Synnaeve, Gabriel and Verbeek, Jakob and others},
  journal={arXiv preprint arXiv:2106.09681},
  year={2021}
}

@misc{radford2021learning,
  title={Learning Transferable Visual Models From Natural Language Supervision}, 
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  year={2021},
  eprint={2103.00020},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@inproceedings{Wojke2017simple,
  title={Simple Online and Realtime Tracking with a Deep Association Metric},
  author={Wojke, Nicolai and Bewley, Alex and Paulus, Dietrich},
  booktitle={2017 IEEE International Conference on Image Processing (ICIP)},
  year={2017},
  pages={3645--3649},
  organization={IEEE},
  doi={10.1109/ICIP.2017.8296962}
}

@inproceedings{Wojke2018deep,
  title={Deep Cosine Metric Learning for Person Re-identification},
  author={Wojke, Nicolai and Bewley, Alex},
  booktitle={2018 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  year={2018},
  pages={748--756},
  organization={IEEE},
  doi={10.1109/WACV.2018.00087}
}