
Can you please point to the code that performs tracking during inference? #18

Open
sawhney-medha opened this issue Apr 17, 2024 · 3 comments

@sawhney-medha

I am confused about how tracking is performed during inference for videos longer than the sample length (in frames). What part of the code connects those shorter tracks?

@HELLORPG (Collaborator)

MeMOTR is an RNN-like model: it processes the video frame by frame, just as an RNN processes a sentence word by word. So, in theory, the processing length is unlimited. Videos longer than the sample length used during training therefore make no difference at inference time; processing is still frame by frame.

So we do not connect several shorter tracks into an overall trajectory. At time step t during inference, we already have the trajectories from the past t-1 frames and only need to connect these past tracks with the targets in the current frame. Frame-by-frame is the key, not clip-by-clip (or, you could say, shorter-tracks-by-shorter-tracks).
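
To make the frame-by-frame idea concrete, here is a toy sketch in Python. A simple greedy IoU matcher stands in for MeMOTR's learned track-query propagation, and per-frame detections are assumed given; every name here (`Track`, `step`, etc.) is illustrative, not the actual MeMOTR code.

```python
from dataclasses import dataclass

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

@dataclass
class Track:
    track_id: int
    box: tuple

def step(tracks, detections, next_id, iou_thresh=0.5):
    """Associate one frame's detections with the existing tracks."""
    unmatched = list(detections)
    for t in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(t.box, d))
        if iou(t.box, best) >= iou_thresh:
            t.box = best              # extend the existing trajectory
            unmatched.remove(best)
    for d in unmatched:               # unmatched detections start new tracks
        tracks.append(Track(next_id, d))
        next_id += 1
    return tracks, next_id

# Frame-by-frame loop: the video can be arbitrarily long, because each
# step only connects the current frame to the tracks carried so far.
tracks, next_id = [], 0
video_detections = [
    [(0, 0, 10, 10)],                    # frame 0
    [(1, 1, 11, 11), (50, 50, 60, 60)],  # frame 1
]
for dets in video_detections:
    tracks, next_id = step(tracks, dets, next_id)
    print([(t.track_id, t.box) for t in tracks])
```

Note that there is no clip stitching anywhere: the trajectory state is simply carried from one frame to the next, which is why training clip length places no hard limit on inference length.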

However, inconsistent lengths between training and inference can indeed cause issues for the model. I discuss this topic further in my recent work.

@hxchashao

Hello, may I ask about the inconsistent lengths between training and inference that you mentioned? Could you explain in more depth? My training videos are 300 frames long and my test videos are 18,000 frames. Once a test sequence passes about 1,000 frames, serious tracking confusion appears. Is this caused by the inconsistent lengths of the training and test sets? Have you encountered such problems in your experiments?

@HELLORPG (Collaborator)

I think that is not what I mean by inconsistent length. Let me explain:
During training, we sample at most 5 frames per clip. Therefore, the longest occlusion the model ever sees during training does not exceed 3 frames. During inference, however, we need to handle very long occlusions (e.g., 30-frame occlusions on DanceTrack, determined by the MISS_TOLERANCE parameter). This 3-frame vs. 30-frame occlusion gap is the inconsistency I was pointing out.

In your case, although each training video is 300 frames long, training still only ever sees 5-frame clips, so it is no different from training on 5-frame videos. The inconsistency is therefore not between 300 and 18,000.
That said, we have indeed not tried inference on 18,000-frame videos, because sequences of that length are extremely rare in MOT benchmarks.
Very long videos may indeed cause unexpected behavior. Could you describe your situation in more detail? I do not quite understand the specific tracking confusion you are seeing.
