Is MOTIP a class-agnostic model for tracking-by-detection task? #36

max-unfinity opened this issue Dec 16, 2024 · 5 comments

@max-unfinity

Hi, thank you for the amazing work!

Is MOTIP a class-agnostic model for tracking-by-detection task?
For example, if I have my Deformable DETR checkpoint trained on some marine dataset, can I use your pre-trained MOTIP models without any re-training or fine-tuning? Specifically, is the SeqDecoder module class-agnostic?

@HELLORPG
Collaborator

I'm sorry for the delay in my responses. I'll be in the hospital for a while (maybe a month or more), so my replies might be slower than usual. Thanks for your understanding and patience.

Thanks for your interest in our work.
Currently, MOTIP is not a class-agnostic model, because our SeqDecoder and DETR are trained jointly. The weights (checkpoints) of the two parts therefore have to be used together. Specifically, the SeqDecoder consumes the object features from DETR (each query's output embedding), so the two must stay consistent.
However, this pipeline (ID prediction) can easily be extended to a class-agnostic method. One feasible approach is to decouple the detector (DETR) from the SeqDecoder. The detector would provide only the bounding boxes (currently, we also use its output embeddings), and the SeqDecoder would independently extract the corresponding object features from the raw images based on the given boxes. The two parts would then be connected solely through the detector's output boxes, with no need for consistent training and usage.
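The decoupling described above can be sketched as follows. This is a toy illustration, not code from this repo: `crop_feature` and `extract_features` are hypothetical names, and grid average pooling over a grayscale image stands in for a learned feature extractor. The point is only that the boxes are the sole input coming from the detector.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2/y2 exclusive


def crop_feature(image: Sequence[Sequence[float]], box: Box, grid: int = 2) -> List[float]:
    """Toy stand-in for a learned extractor: average-pool the crop into grid x grid cells."""
    x1, y1, x2, y2 = box
    height, width = y2 - y1, x2 - x1
    if height < grid or width < grid:
        raise ValueError("box is smaller than the pooling grid")
    feature = []
    for gy in range(grid):
        # Row range covered by this grid cell.
        r0 = y1 + gy * height // grid
        r1 = y1 + (gy + 1) * height // grid
        for gx in range(grid):
            # Column range covered by this grid cell.
            c0 = x1 + gx * width // grid
            c1 = x1 + (gx + 1) * width // grid
            cell = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(cell) / len(cell))
    return feature


def extract_features(image, boxes: Sequence[Box], grid: int = 2) -> List[List[float]]:
    """Only the boxes cross the detector/association boundary; no detector embeddings are used."""
    return [crop_feature(image, box, grid) for box in boxes]


# The boxes could come from any detector (Deformable DETR, YOLO, ...).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
features = extract_features(image, [(0, 0, 4, 4), (0, 0, 2, 2)])
```

In a real system the pooling step would be replaced by a trained network, but the interface would stay the same: image plus boxes in, one feature vector per box out.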
🤗 Our current design aims to minimize the engineering details and complex designs that have to be considered during exploration, so it may not be the most suitable for practical applications. However, I believe we DO demonstrate that the proposed method has significant potential. I am more than willing to help transition this work to a wide range of application scenarios or assist in future research.

HELLORPG pinned this issue Dec 22, 2024
HELLORPG added the discussion label Dec 22, 2024

@MattLiutt

Thanks for the excellent repo! So do you mean that we can separate the detector and SeqDecoder during inference? For instance, if I have a general-purpose detector (DETR or the YOLO series), can I then use it like SORT and similar tracking-by-detection methods? Or do we need to train both the detector and SeqDecoder jointly and then do the inference? Please correct me if I've misunderstood anything. Thanks a lot!

@HELLORPG
Collaborator

HELLORPG commented Jan 7, 2025

> So do you mean that we can separate the detector and SeqDecoder during inference? For instance, if I have a general-purpose detector (DETR or the YOLO series), can I then use it like SORT and similar tracking-by-detection methods?

The codebase in this repo does not support this feature. My reply above means that our idea (ID prediction for target association) can be extended to an association-only model (similar to ReID methods) rather than the joint detection-and-association model we built.

If you want an association-only tracking-by-detection model, you need to write your own code based on ours. Once feature extraction in the SeqDecoder is decoupled from the detector, the SeqDecoder can be combined with any trained detector to form the kind of tracking-by-detection method you mentioned (not tied to any specific detector). You can refer to work like MASA/PuTR for inspiration, where a decoupled feature extractor is trained.
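Once the features no longer depend on the detector, the association step only needs one embedding per detection. The sketch below is a deliberately simplified stand-in: it assigns IDs by greedy nearest cosine similarity against stored track embeddings, whereas MOTIP's SeqDecoder learns to predict ID labels. The names `assign_ids` and `new_id_threshold` are hypothetical and do not come from this repo.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def assign_ids(
    tracks: Dict[int, List[float]],
    detections: Sequence[Sequence[float]],
    new_id_threshold: float = 0.7,
) -> List[int]:
    """Greedily give each detection the best-matching free track ID, or a new ID."""
    next_id = max(tracks, default=-1) + 1
    used = set()
    ids = []
    for emb in detections:
        # Score the detection against every track not yet claimed in this frame.
        candidates = [
            (cosine(emb, t_emb), tid) for tid, t_emb in tracks.items() if tid not in used
        ]
        best_sim, best_id = max(candidates, default=(-1.0, -1))
        if best_sim < new_id_threshold:
            # No sufficiently similar track: treat the detection as a newborn object.
            best_id = next_id
            next_id += 1
        used.add(best_id)
        tracks[best_id] = list(emb)  # keep the latest embedding for this ID
        ids.append(best_id)
    return ids


tracks = {0: [1.0, 0.0], 1: [0.0, 1.0]}
ids = assign_ids(tracks, [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]])
```

Because the function sees only embeddings, the detections can come from any detector, which is the property the decoupled design is after.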

> Or do we need to train both the detector and SeqDecoder jointly and then do the inference?

For this repo, YES. The detector (Deformable DETR) also serves as the feature extractor for the SeqDecoder, so the two are coupled: trained together and run together at inference.

I hope this clarifies your concerns. Please let me know if you need additional details.

@MattLiutt

Thanks for the prompt response! Appreciated!

> For this repo, YES. The detector (Deformable DETR) also serves as the feature extractor for the SeqDecoder, so the two are coupled: trained together and run together at inference.

Just one last question: this repo uses Deformable DETR as both the detector and the feature extractor for the SeqDecoder. Is it feasible to replace it with another transformer?

Thanks so much!

@HELLORPG
Collaborator

HELLORPG commented Jan 9, 2025

> Is it feasible to replace it with another transformer?

Yes, of course. In addition to the default MOTIP-Deformable-DETR, this repo also provides MOTIP-DAB-Deformable-DETR, as reported in the DanceTrack results.

Specifically, you can refer to the following code to use your own transformer detector (self.detr) and the corresponding criterion function (self.detr_criterion):

MOTIP/models/motip.py

Lines 91 to 100 in 1dda4c4

    if self.detr_framework == "Deformable-DETR":
        # DETR model and criterion:
        self.detr, self.detr_criterion, _ = build_deformable_detr(detr_args)
    elif self.detr_framework == "DAB-Deformable-DETR":
        detr_args.num_patterns = 0
        detr_args.random_refpoints_xy = False
        self.detr, self.detr_criterion, _ = build_dab_deformable_detr(detr_args)
        # TODO: We will upload the DAB-DETR code soon.
    else:
        raise RuntimeError(f"Unknown DETR framework: {self.detr_framework}.")
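
To plug in another detector, one option is to add a branch for it next to the two existing ones. The sketch below expresses the same dispatch as a lookup table. Everything in it is hypothetical (`build_my_detr`, `DETR_BUILDERS`, `build_detr`, the `num_queries` argument, and the stub return values); the only thing assumed from the snippet is that a builder returns a three-element tuple whose first two items are the model and its criterion.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Tuple

Builder = Callable[[Any], Tuple[Any, Any, Any]]


def build_my_detr(detr_args: Any) -> Tuple[Any, Any, Any]:
    """Hypothetical builder for a custom detector; returns string stubs here."""
    model = f"MyDETR(num_queries={detr_args.num_queries})"
    criterion = "MyCriterion()"
    return model, criterion, None


# Hypothetical registry; the existing builders would be registered the same way.
DETR_BUILDERS: Dict[str, Builder] = {"My-DETR": build_my_detr}


def build_detr(framework: str, detr_args: Any) -> Tuple[Any, Any]:
    """Look up the builder by name and return (model, criterion)."""
    try:
        builder = DETR_BUILDERS[framework]
    except KeyError:
        # Mirror the error raised for unknown frameworks in the snippet above.
        raise RuntimeError(f"Unknown DETR framework: {framework}.") from None
    model, criterion, _ = builder(detr_args)
    return model, criterion


detr, detr_criterion = build_detr("My-DETR", SimpleNamespace(num_queries=300))
```

Whichever form is used, the new detector's criterion has to be swapped in together with the model, since the two are returned and trained as a pair.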

Additionally, the return values of the DETR detector need a small modification so that it also returns the target features (output embeddings):

    # Output the outputs of last decoder layer.
    # We need these outputs to generate the embeddings for objects.
    out["outputs"] = hs[-1]
    return out
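
When swapping in your own detector, it may help to check that its output dictionary carries one embedding per predicted box. The helper below is a sketch under assumptions: `pred_logits` and `pred_boxes` are the usual DETR-style output keys, `outputs` is the key added above, and the `[batch, num_queries, dim]` layout is assumed rather than taken from this repo. `shape_of` and `check_detr_output` are hypothetical names.

```python
from typing import Any, Dict, Tuple


def shape_of(x: Any) -> Tuple[int, ...]:
    """Shape of a tensor-like object (via .shape) or of a nested list."""
    if hasattr(x, "shape"):
        return tuple(x.shape)
    shape = []
    while isinstance(x, (list, tuple)):
        shape.append(len(x))
        x = x[0] if x else None
    return tuple(shape)


def check_detr_output(out: Dict[str, Any]) -> int:
    """Return the embedding dimension, or raise if the detector output is inconsistent."""
    for key in ("pred_logits", "pred_boxes", "outputs"):
        if key not in out:
            raise KeyError(f"detector output is missing '{key}'")
    boxes, embeds = shape_of(out["pred_boxes"]), shape_of(out["outputs"])
    if len(boxes) != 3 or len(embeds) != 3:
        raise ValueError("expected [batch, num_queries, dim] layouts")
    if boxes[:2] != embeds[:2]:
        raise ValueError(f"boxes {boxes[:2]} and embeddings {embeds[:2]} do not align")
    return embeds[2]


# Toy output: batch of 1, 2 queries, 4 box coordinates, 3-dim embeddings.
out = {
    "pred_logits": [[[0.2], [0.8]]],
    "pred_boxes": [[[0.5, 0.5, 0.2, 0.2], [0.3, 0.3, 0.1, 0.1]]],
    "outputs": [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
}
embed_dim = check_detr_output(out)
```

Because `shape_of` reads `.shape` when it exists, the same check should also work on tensor outputs, not only on the nested lists used in this toy example.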
