
Can you please point to the code that performs tracking during inference? #18

Open
sawhney-medha opened this issue Apr 17, 2024 · 3 comments

@sawhney-medha

I am confused about how tracking is performed during inference for videos longer than the sample length (in frames). What part of the code connects those shorter tracks?

@HELLORPG (Collaborator)

MeMOTR is an RNN-like model: it processes the video frame by frame, just as an RNN processes a sentence word by word. So, in theory, the processing length is unlimited. Videos longer than the sample length used during training therefore make no difference at inference time; processing is still frame by frame.

So we do not connect several shorter tracks into an overall trajectory. At time step t during inference, we already have the trajectories from the past t-1 frames and only need to connect these past tracks with the targets in the current frame. Frame-by-frame is the key, not clip-by-clip (or, you could say, shorter-tracks-by-shorter-tracks).
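
To make the frame-by-frame idea concrete, here is a toy sketch in Python. A simple greedy IoU matcher stands in for MeMOTR's learned track-query propagation, and per-frame detections are assumed given; every name here (`Track`, `step`, etc.) is illustrative, not the actual MeMOTR code.

```python
from dataclasses import dataclass

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

@dataclass
class Track:
    track_id: int
    box: tuple

def step(tracks, detections, next_id, iou_thresh=0.5):
    """Associate one frame's detections with the existing tracks."""
    unmatched = list(detections)
    for t in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(t.box, d))
        if iou(t.box, best) >= iou_thresh:
            t.box = best              # extend the existing trajectory
            unmatched.remove(best)
    for d in unmatched:               # unmatched detections start new tracks
        tracks.append(Track(next_id, d))
        next_id += 1
    return tracks, next_id

# Frame-by-frame loop: the video can be arbitrarily long, because each
# step only connects the current frame to the tracks carried so far.
tracks, next_id = [], 0
video_detections = [
    [(0, 0, 10, 10)],                    # frame 0
    [(1, 1, 11, 11), (50, 50, 60, 60)],  # frame 1
]
for dets in video_detections:
    tracks, next_id = step(tracks, dets, next_id)
    print([(t.track_id, t.box) for t in tracks])
```

Note that there is no clip stitching anywhere: the trajectory state is simply carried from one frame to the next, which is why training clip length places no hard limit on inference length.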

However, inconsistent lengths between training and inference can indeed cause issues for the model. I discuss this topic further in my recent work.

@hxchashao

Hello, may I ask about the inconsistent lengths between training and inference that you mentioned? Could you explain in more depth? My training videos are 300 frames long and my test videos are 18,000 frames. Once a test sequence passes about 1,000 frames, serious tracking confusion appears. Is this caused by the inconsistent lengths of the training and test sets? Have you encountered such problems in your experiments?

@HELLORPG (Collaborator)

I think that is not what I mean by inconsistent length. Let me explain:
During training, we sample at most 5 frames per clip. Therefore, the longest occlusion the model ever sees during training does not exceed 3 frames. During inference, however, we need to handle very long occlusions (e.g., 30-frame occlusions on DanceTrack, determined by the MISS_TOLERANCE parameter). This 3-frame vs. 30-frame occlusion gap is the inconsistency I was pointing out.

In your case, although each training video is 300 frames long, training still only ever sees 5-frame clips, so it is no different from training on 5-frame videos. The inconsistency is therefore not between 300 and 18,000.
That said, we have indeed not tried inference on 18,000-frame videos, because sequences of that length are extremely rare in MOT benchmarks.
Very long videos may indeed cause unexpected behavior. Could you describe your situation in more detail? I do not quite understand the specific tracking confusion you are seeing.
