The official implementation of MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking, ICCV 2023.
Authors: Ruopeng Gao, Limin Wang.
MeMOTR is a fully end-to-end, memory-augmented multi-object tracker built on the Transformer. It injects long-term memory through a customized memory-attention layer, which significantly improves association performance.
- 2024.05.09: We release MOTIP, a new perspective that regards the multi-object tracking task as an ID prediction problem 🔭.
- 2024.02.21: We add the results on SportsMOT to our arXiv version (supplementary material). We would appreciate it if you could CITE our trackers in the SportsMOT comparison 📈.
- 2023.12.24: We release the code, scripts and checkpoints on BDD100K 🚗.
- 2023.12.13: We implement a Jupyter notebook to run our model on your own video 🎥.
- 2023.11.07: We release the scripts and checkpoints on SportsMOT 🏀.
- 2023.08.24: We release the scripts and checkpoints on DanceTrack 💃.
- 2023.08.09: We release the main code. More configurations, scripts and checkpoints will be released soon 🔜.
conda create -n MeMOTR python=3.10 # create a virtual env
# The code may rely on features introduced in Python 3.10, although we have not verified this.
conda activate MeMOTR # activate the env
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
# Our code primarily runs on PyTorch 1.13.1,
# but it should also be compatible with earlier PyTorch versions (e.g., 1.12.1).
# However, much older PyTorch versions may need fixes, because we use some newer PyTorch features (e.g., ResNet50_Weights).
conda install matplotlib pyyaml scipy tqdm tensorboard
pip install opencv-python
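After installation, a quick check can confirm that the environment matches the notes above. The following is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it only tests whether each module can be located, without importing heavy packages or checking their versions.

```python
import importlib.util
import sys

# Import names of the packages installed above
# (opencv-python is imported as cv2, pyyaml as yaml).
REQUIRED_MODULES = [
    "torch", "torchvision", "matplotlib", "yaml",
    "scipy", "tqdm", "tensorboard", "cv2",
]


def check_environment(modules=REQUIRED_MODULES):
    """Report the Python version and whether each module can be found."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": sys.version_info >= (3, 10),
    }
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Any entry reported as `False` points to a package that still needs to be installed in the `MeMOTR` environment.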
You also need to compile the Deformable Attention CUDA ops:
# From https://github.com/fundamentalvision/Deformable-DETR
cd ./models/ops/
sh make.sh
# Optionally, test the compiled ops:
python test.py
Put the unzipped MOT17 and CrowdHuman datasets into DATADIR/MOT17/images/ and DATADIR/CrowdHuman/images/, respectively. Then generate the ground truth files by running the corresponding scripts: ./data/gen_mot17_gts.py and ./data/gen_crowdhuman_gts.py.
Finally, you should get the following dataset structure:
DATADIR/
├── DanceTrack/
│   ├── train/
│   ├── val/
│   ├── test/
│   ├── train_seqmap.txt
│   ├── val_seqmap.txt
│   └── test_seqmap.txt
├── SportsMOT/
│   ├── train/
│   ├── val/
│   ├── test/
│   ├── train_seqmap.txt
│   ├── val_seqmap.txt
│   └── test_seqmap.txt
├── MOT17/
│   ├── images/
│   │   ├── train/     # unzip from MOT17
│   │   └── test/      # unzip from MOT17
│   └── gts/
│       └── train/     # generate by ./data/gen_mot17_gts.py
└── CrowdHuman/
    ├── images/
    │   ├── train/     # unzip from CrowdHuman
    │   └── val/       # unzip from CrowdHuman
    └── gts/
        ├── train/     # generate by ./data/gen_crowdhuman_gts.py
        └── val/       # generate by ./data/gen_crowdhuman_gts.py
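Before training, it can help to verify that the data directory matches the layout above. The following is a minimal sketch and not part of the repository: `EXPECTED_LAYOUT` and `find_missing` are hypothetical names, and the expected paths are simply copied from the directory tree.

```python
from pathlib import Path

# Expected sub-paths per dataset, copied from the directory tree above.
SEQ_LAYOUT = ["train", "val", "test",
              "train_seqmap.txt", "val_seqmap.txt", "test_seqmap.txt"]
EXPECTED_LAYOUT = {
    "DanceTrack": SEQ_LAYOUT,
    "SportsMOT": SEQ_LAYOUT,
    "MOT17": ["images/train", "images/test", "gts/train"],
    "CrowdHuman": ["images/train", "images/val", "gts/train", "gts/val"],
}


def find_missing(data_root, datasets=None):
    """Return the expected paths that do not exist under data_root."""
    root = Path(data_root)
    names = datasets if datasets is not None else list(EXPECTED_LAYOUT)
    missing = []
    for name in names:
        for rel in EXPECTED_LAYOUT[name]:
            if not (root / name / rel).exists():
                missing.append((Path(name) / rel).as_posix())
    return missing


if __name__ == "__main__":
    # Replace with your own DATADIR; only the datasets you use need to pass.
    for path in find_missing("/path/to/DATADIR", ["DanceTrack"]):
        print("missing:", path)
```

Passing a list of dataset names restricts the check to the datasets you actually plan to train or evaluate on.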
We initialize our model with the official DAB-Deformable-DETR weights (R50 backbone) pretrained on the COCO dataset. You can also download the checkpoint we used here. Put the checkpoint in the root directory of this project.
Train MeMOTR with 8 GPUs on DanceTrack (GPUs with at least 32 GB of memory, such as the V100-32GB, are recommended):
python -m torch.distributed.run --nproc_per_node=8 main.py --use-distributed --config-path ./configs/train_dancetrack.yaml --outputs-dir ./outputs/memotr_dancetrack/ --batch-size 1 --data-root <your data dir path>
If your GPUs have less than 32 GB of memory, you can use the memory-optimized version by adding the --use-checkpoint option. As discussed in the paper, it uses gradient checkpointing to reduce the allocated GPU memory. The following training script takes only about 10 GB of GPU memory:
python -m torch.distributed.run --nproc_per_node=8 main.py --use-distributed --config-path ./configs/train_dancetrack.yaml --outputs-dir ./outputs/memotr_dancetrack/ --batch-size 1 --data-root <your data dir path> --use-checkpoint
You can use this script to evaluate the trained model on the DanceTrack val set:
python main.py --mode eval --data-root <your data dir path> --eval-mode specific --eval-model <filename of the checkpoint> --eval-dir ./outputs/memotr_dancetrack/ --eval-threads <your gpus num>
For submission, you can use the following script:
python -m torch.distributed.run --nproc_per_node=8 main.py --mode submit --submit-dir ./outputs/memotr_dancetrack/ --submit-model <filename of the checkpoint> --use-distributed --data-root <your data dir path>
If you only want to evaluate or submit with our trained checkpoint, you can get the checkpoint we used in the paper here. Put it into ./outputs/memotr_dancetrack/ and run the scripts above.
For submission on MOT17, you can use the following script:
python -m torch.distributed.run --nproc_per_node=8 main.py --mode submit --config-path ./outputs/memotr_mot17/train/config.yaml --submit-dir ./outputs/memotr_mot17/ --submit-model <filename of the checkpoint> --use-distributed --data-root <your data dir path>
You can also download our trained checkpoint here. Put it into ./outputs/memotr_mot17/ and run the script above to generate the submission files for the MOT17 test set.
To train on SportsMOT, replace the --config-path in the DanceTrack scripts, e.g., change ./configs/train_dancetrack.yaml to ./configs/train_sportsmot.yaml.
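Since the training commands differ only in a few arguments, they can be assembled programmatically. The following is a minimal sketch and not part of the repository: `build_train_command` is a hypothetical helper that uses only the flags shown in the commands above, and the `./outputs/memotr_sportsmot/` directory name in the usage example is an assumption.

```python
import shlex


def build_train_command(data_root,
                        config_path="./configs/train_dancetrack.yaml",
                        outputs_dir="./outputs/memotr_dancetrack/",
                        num_gpus=8,
                        use_checkpoint=False):
    """Assemble the distributed training command as a list of arguments."""
    cmd = [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={num_gpus}",
        "main.py", "--use-distributed",
        "--config-path", config_path,
        "--outputs-dir", outputs_dir,
        "--batch-size", "1",
        "--data-root", data_root,
    ]
    if use_checkpoint:
        # Gradient checkpointing for GPUs with less than 32 GB of memory.
        cmd.append("--use-checkpoint")
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "/path/to/DATADIR",
        config_path="./configs/train_sportsmot.yaml",
        outputs_dir="./outputs/memotr_sportsmot/",  # assumed directory name
        use_checkpoint=True,
    )
    # Print a shell-ready command line; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Returning a list rather than a string avoids quoting problems when the data path contains spaces.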
Methods | HOTA | DetA | AssA | checkpoint |
---|---|---|---|---|
MeMOTR | 68.5 | 80.5 | 58.4 | Google Drive |
MeMOTR (Deformable DETR) | 63.4 | 77.0 | 52.3 | Google Drive |
For all experiments, we do not use extra data (like CrowdHuman) for training.
Methods | HOTA | DetA | AssA | checkpoint |
---|---|---|---|---|
MeMOTR | 70.0 | 83.1 | 59.1 | Google Drive |
MeMOTR (Deformable DETR) | 68.8 | 82.0 | 57.8 | Google Drive |
Methods | HOTA | DetA | AssA | checkpoint |
---|---|---|---|---|
MeMOTR | 58.8 | 59.6 | 58.4 | Google Drive |
Methods | mTETA | mLocA | mAssocA | checkpoint |
---|---|---|---|---|
MeMOTR | 53.6 | 38.1 | 56.7 | Google Drive |
- Ruopeng Gao: [email protected]
@InProceedings{MeMOTR,
author = {Gao, Ruopeng and Wang, Limin},
title = {{MeMOTR}: Long-Term Memory-Augmented Transformer for Multi-Object Tracking},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {9901-9910}
}