Input format-Training on one frame of the video clip? #16

Open
sawhney-medha opened this issue Apr 10, 2024 · 6 comments

Comments

@sawhney-medha

Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."

Does this mean the actual model is trained on one frame at a time randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder.

Also, what is the role of no_grad_frames?

Thank you!!

@HELLORPG
Collaborator

In our experiments, batch_size refers to the number of video clips (samples). So "the batch size is set to 1 per GPU" means we process one video clip (which contains multiple frames) on a single GPU. Within each clip, the inter-frame interval is a random number from 1 to 10.
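
For intuition, here is a minimal sketch of that sampling scheme. The function name sample_clip_indices and its arguments are made up for illustration; this is not the repository's actual sampler:

    import random

    def sample_clip_indices(video_len, clip_len, max_interval=10):
        # Illustrative sketch only, not MeMOTR's real sampling code.
        # Draw a random inter-frame interval from 1 to max_interval, then
        # pick clip_len frame indices spaced by that interval.
        interval = random.randint(1, max_interval)
        span = interval * (clip_len - 1)
        start = random.randint(0, max(0, video_len - 1 - span))
        return [start + i * interval for i in range(clip_len)]

    # Example: one "batch" per GPU is a single clip of 5 frames from a 200-frame video.
    print(sample_clip_indices(video_len=200, clip_len=5))  # e.g. [37, 44, 51, 58, 65]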

The no_grad_frames setting means that those frames are forwarded in grad-free mode (under torch.no_grad()):

MeMOTR/train_engine.py, lines 217 to 230 in f46ae3d:

    with torch.no_grad():
        frame = [fs[frame_idx] for fs in batch["imgs"]]
        for f in frame:
            f.requires_grad_(False)
        frame = tensor_list_to_nested_tensor(tensor_list=frame).to(device)
        res = model(frame=frame, tracks=tracks)
        previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
            model_outputs=res,
            tracked_instances=tracks,
            frame_idx=frame_idx
        )
        if frame_idx < len(batch["imgs"][0]) - 1:
            tracks = get_model(model).postprocess_single_frame(
                previous_tracks, new_tracks, unmatched_dets,
                no_augment=frame_idx < no_grad_frames - 1)

However, in our experiments we deprecated this part; I just have not deleted the code from this repo. My suggestion is not to pay attention to this process, as enabling it will not bring performance improvements.

@sawhney-medha
Author

Thank you for the prompt reply!! This is helpful.

The input to the model (backbone and encoder/decoder) is a single frame at a time, right? So the way we use the temporal information from the video clip is through the track information/embeddings and memory. Am I understanding correctly?

Thank you again :)

@HELLORPG
Collaborator

Yes. We process only one frame at each time step. The track embedding will propagate the temporal information.

The only difference is that, during training, we process multiple time steps before calling optimizer.step(). In this way, the model can learn temporal modeling.
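
As a rough sketch of what that looks like (hypothetical names such as clip_frames; the real loop in train_engine.py handles the per-frame losses and track bookkeeping in more detail), the loss over the T frames is accumulated and the optimizer steps once per clip:

    # Sketch only: `model`, `criterion`, and `clip_frames` are placeholders,
    # not the exact objects used in MeMOTR's train_engine.py.
    optimizer.zero_grad()
    tracks = None        # no active trajectories before the first frame
    total_loss = 0.0

    for frame in clip_frames:                         # T frames, one time step each
        outputs = model(frame=frame, tracks=tracks)   # single-frame forward pass
        loss, tracks = criterion(outputs, tracks)     # per-frame loss + updated tracks
        total_loss = total_loss + loss

    total_loss.backward()   # gradients flow back through all T time steps
    optimizer.step()        # one parameter update per video clip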

@HELLORPG
Collaborator

This is equivalent to processing T time steps in each training iteration.

@sawhney-medha
Author

Thank you so much! Can you also please explain the working of the "process_single_frame" function? I want to understand how tracks are generated and how sub-clips are connected to each other while predicting. Thank you!!

@HELLORPG
Collaborator

Our model, as an online tracker, processes the image sequence frame by frame. So the function criterion.process_single_frame is used to compute the criterion for a single frame at a time. For example, as shown below:

for frame_idx in range(len(batch["imgs"][0])):

We will call this function (criterion.process_single_frame) T times in each training iteration, where T is the sampling length for each video clip (from 2 to 5 in our setting on DanceTrack).

At the same time, the function criterion.process_single_frame will also generate the track information (embed & ref_pts, etc.) for the next time step. As shown here:

MeMOTR/train_engine.py, lines 223 to 227 in f46ae3d:

    previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
        model_outputs=res,
        tracked_instances=tracks,
        frame_idx=frame_idx
    )

It will update the tracked trajectories previous_tracks and the newborn trajectories new_tracks. Then, they will be combined into the overall tracks here:

MeMOTR/train_engine.py, lines 229 to 230 in f46ae3d:

    tracks = get_model(model).postprocess_single_frame(
        previous_tracks, new_tracks, unmatched_dets,
        no_augment=frame_idx < no_grad_frames - 1)

Then, the tracks will be fed into the processing of the next frame, like here:
res = model(frame=frame, tracks=tracks)

which connects the frames in the video clip by propagating the trajectories frame by frame. Therefore, our model supports a fully end-to-end training strategy and can backpropagate gradients all the way to the beginning (the first frame).
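
Putting the pieces above together, the per-clip loop is roughly the following (a condensed paraphrase of the loop in train_engine.py; the initialization of tracks and the loss bookkeeping are simplified here):

    tracks = None  # simplified: the real code initializes empty track instances

    for frame_idx in range(len(batch["imgs"][0])):    # iterate over the T frames of the clip
        frame = tensor_list_to_nested_tensor(
            tensor_list=[fs[frame_idx] for fs in batch["imgs"]]).to(device)

        # Detect and track on the current frame, conditioned on the existing trajectories.
        res = model(frame=frame, tracks=tracks)

        # Compute the criterion for this frame and split the results into already-tracked
        # trajectories, newborn trajectories, and unmatched detections.
        previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
            model_outputs=res, tracked_instances=tracks, frame_idx=frame_idx)

        if frame_idx < len(batch["imgs"][0]) - 1:
            # Merge them into the overall `tracks`, which become the input of the next frame.
            tracks = get_model(model).postprocess_single_frame(
                previous_tracks, new_tracks, unmatched_dets,
                no_augment=frame_idx < no_grad_frames - 1)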
