Input format-Training on one frame of the video clip? #16

Open
sawhney-medha opened this issue Apr 10, 2024 · 6 comments

Comments

@sawhney-medha

Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."

Does this mean the actual model is trained on one frame at a time randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder.

Also, what is the role of no_grad_frames?

Thank you!!

@HELLORPG
Collaborator

In our experiments, batch_size refers to the number of video clips (samples). So "the batch size is set to 1 per GPU" means we process one video clip (which contains multiple frames) on a single GPU. Within each clip, the inter-frame interval is a random number from 1 to 10.
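
For intuition, here is a minimal sketch of that sampling scheme. The function name sample_clip_indices and its arguments are made up for illustration; this is not the repository's actual sampler:

    import random

    def sample_clip_indices(video_len, clip_len, max_interval=10):
        # Illustrative sketch only, not MeMOTR's real sampling code.
        # Draw a random inter-frame interval from 1 to max_interval, then
        # pick clip_len frame indices spaced by that interval.
        interval = random.randint(1, max_interval)
        span = interval * (clip_len - 1)
        start = random.randint(0, max(0, video_len - 1 - span))
        return [start + i * interval for i in range(clip_len)]

    # Example: one "batch" per GPU is a single clip of 5 frames from a 200-frame video.
    print(sample_clip_indices(video_len=200, clip_len=5))  # e.g. [37, 44, 51, 58, 65]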

The no_grad_frames setting means that those frames are forwarded in grad-free mode (under torch.no_grad()):

MeMOTR/train_engine.py, lines 217 to 230 in f46ae3d:

    with torch.no_grad():
        frame = [fs[frame_idx] for fs in batch["imgs"]]
        for f in frame:
            f.requires_grad_(False)
        frame = tensor_list_to_nested_tensor(tensor_list=frame).to(device)
        res = model(frame=frame, tracks=tracks)
        previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
            model_outputs=res,
            tracked_instances=tracks,
            frame_idx=frame_idx
        )
        if frame_idx < len(batch["imgs"][0]) - 1:
            tracks = get_model(model).postprocess_single_frame(
                previous_tracks, new_tracks, unmatched_dets,
                no_augment=frame_idx < no_grad_frames - 1)

However, in our experiments we deprecated this part; I just have not deleted the code from this repo. My suggestion is not to pay attention to this process, as enabling it will not bring performance improvements.

@sawhney-medha
Author

Thank you for the prompt reply!! This is helpful.

The input to the model (backbone and encoder/decoder) is a single frame at a time, right? So the way we use the temporal information from the video clip is through the track information/embeddings and memory. Am I understanding correctly?

Thank you again :)

@HELLORPG
Collaborator

Yes. We process only one frame at each time step. The track embedding will propagate the temporal information.

The only difference is that, during training, we process multiple time steps before calling optimizer.step(). In this way, the model can learn temporal modeling.
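
As a rough sketch of what that looks like (hypothetical names such as clip_frames; the real loop in train_engine.py handles the per-frame losses and track bookkeeping in more detail), the loss over the T frames is accumulated and the optimizer steps once per clip:

    # Sketch only: `model`, `criterion`, and `clip_frames` are placeholders,
    # not the exact objects used in MeMOTR's train_engine.py.
    optimizer.zero_grad()
    tracks = None        # no active trajectories before the first frame
    total_loss = 0.0

    for frame in clip_frames:                         # T frames, one time step each
        outputs = model(frame=frame, tracks=tracks)   # single-frame forward pass
        loss, tracks = criterion(outputs, tracks)     # per-frame loss + updated tracks
        total_loss = total_loss + loss

    total_loss.backward()   # gradients flow back through all T time steps
    optimizer.step()        # one parameter update per video clip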

@HELLORPG
Collaborator

This is equivalent to processing T time steps in each training iteration.

@sawhney-medha
Author

Thank you so much! Can you also please explain the working of the "process_single_frame" function? I want to understand how tracks are generated and how sub-clips are connected to each other while predicting. Thank you!!

@HELLORPG
Collaborator

Our model, as an online tracker, processes the image sequence frame by frame. So the function criterion.process_single_frame is used to compute the criterion for a single frame at a time. For example, as shown below:

for frame_idx in range(len(batch["imgs"][0])):

We will call this function (criterion.process_single_frame) T times in each training iteration, where T is the sampling length for each video clip (from 2 to 5 in our setting on DanceTrack).

At the same time, the function criterion.process_single_frame will also generate the track information (embed & ref_pts, etc.) for the next time step. As shown here:

MeMOTR/train_engine.py, lines 223 to 227 in f46ae3d:

    previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
        model_outputs=res,
        tracked_instances=tracks,
        frame_idx=frame_idx
    )

It will update the tracked trajectories previous_tracks and the newborn trajectories new_tracks. Then, they will be combined into the overall tracks here:

MeMOTR/train_engine.py, lines 229 to 230 in f46ae3d:

    tracks = get_model(model).postprocess_single_frame(
        previous_tracks, new_tracks, unmatched_dets,
        no_augment=frame_idx < no_grad_frames - 1)

Then, the tracks will be fed into the processing of the next frame, like here:
res = model(frame=frame, tracks=tracks)

which connects the frames in the video clip by propagating the trajectories frame by frame. Therefore, our model supports a fully end-to-end training strategy and can backpropagate gradients all the way to the beginning (the first frame).
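
Putting the pieces above together, the per-clip loop is roughly the following (a condensed paraphrase of the loop in train_engine.py; the initialization of tracks and the loss bookkeeping are simplified here):

    tracks = None  # simplified: the real code initializes empty track instances

    for frame_idx in range(len(batch["imgs"][0])):    # iterate over the T frames of the clip
        frame = tensor_list_to_nested_tensor(
            tensor_list=[fs[frame_idx] for fs in batch["imgs"]]).to(device)

        # Detect and track on the current frame, conditioned on the existing trajectories.
        res = model(frame=frame, tracks=tracks)

        # Compute the criterion for this frame and split the results into already-tracked
        # trajectories, newborn trajectories, and unmatched detections.
        previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
            model_outputs=res, tracked_instances=tracks, frame_idx=frame_idx)

        if frame_idx < len(batch["imgs"][0]) - 1:
            # Merge them into the overall `tracks`, which become the input of the next frame.
            tracks = get_model(model).postprocess_single_frame(
                previous_tracks, new_tracks, unmatched_dets,
                no_augment=frame_idx < no_grad_frames - 1)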
