Input format: training on one frame of the video clip? #16
The part you are asking about corresponds to Lines 217 to 230 in f46ae3d.
However, in our experiments we deprecated this part; I just have not deleted the code from this repo. My suggestion is not to pay attention to this process, since enabling it will not bring any performance improvement.
Thank you for the prompt reply!! This is helpful. So the input to the model (backbone and encoder/decoder) is a single frame at a time, right? And the way we use the temporal information from the video clip is through the track information/embedding and memory. Am I understanding correctly? Thank you again :)
Yes. We process only one frame at each time step, and the track embedding propagates the temporal information. The only difference is that during training, we process multiple time steps before the backward pass.
In other words, in each training iteration we process T time steps.
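If it helps, here is a minimal sketch of such a training iteration. This is not the repo's exact code: `model`, `optimizer`, `clip_frames`, and the signature of `criterion.process_single_frame` are all illustrative assumptions; only the idea of "forward T frames, then one backward" comes from the discussion above.

```python
import torch

def train_one_iteration(model, criterion, optimizer, clip_frames):
    """clip_frames: the T frames sampled from one video clip (T is 2 to 5 here)."""
    tracks = None            # no trajectories exist before the first frame
    total_loss = 0.0

    for frame in clip_frames:                          # one frame per time step
        outputs = model(frame, tracks)                 # single-frame forward pass
        loss, tracks = criterion.process_single_frame(outputs, tracks)
        total_loss = total_loss + loss                 # accumulate the clip loss

    optimizer.zero_grad()
    total_loss.backward()    # a single backward pass spanning all T time steps
    optimizer.step()
```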
Thank you so much! Can you also please explain the working of the "process_single_frame" function? I want to understand how tracks are generated and how are sub clips connected to each other while predicting. Thank you!! |
Our model, as an online tracker, processes the image sequence frame by frame. The corresponding function is at Line 201 in f46ae3d.
We call this function (criterion.process_single_frame) T times in each training iteration, where T is the sampling length of each video clip (from 2 to 5 in our setting on DanceTrack).
At the same time, the function at Lines 223 to 227 in f46ae3d updates the tracked trajectories previous_tracks and the newborn trajectories new_tracks. Then, they are combined into the overall tracks here: Lines 229 to 230 in f46ae3d.
The resulting tracks are then fed into the processing of the next frame, as here: Line 222 in f46ae3d,
which connects the frames in the video clip by propagating the trajectories frame by frame. Therefore, our model achieves a fully end-to-end training strategy and backpropagates the gradients all the way to the beginning (the first frame).
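To make that propagation concrete, here is a structural sketch only. The real process_single_frame in this repo differs in its details (it also computes the loss), and the placeholder dictionaries below stand in for the repo's actual track and output objects.

```python
def process_single_frame(frame_index, frame_outputs, tracks):
    """Structural sketch: consume one frame's outputs, return the trajectories
    to carry into the next frame. Data structures are placeholders."""
    # 1. Update the trajectories that were already being tracked.
    previous_tracks = [dict(t, last_seen=frame_index) for t in tracks]

    # 2. Start trajectories for newborn objects detected in this frame.
    new_tracks = [dict(obj, born_at=frame_index)
                  for obj in frame_outputs["newborn_objects"]]

    # 3. Combine tracked and newborn trajectories into one overall set ...
    tracks = previous_tracks + new_tracks

    # 4. ... and return it so it is fed into the next frame's processing,
    #    which connects the frames and lets gradients reach the first frame.
    return tracks


# Usage sketch: propagate trajectories through a clip of T frames.
tracks = []
clip = [{"newborn_objects": [{"id": 0}]}, {"newborn_objects": []}]
for frame_index, frame_outputs in enumerate(clip):
    tracks = process_single_frame(frame_index, frame_outputs, tracks)
```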
Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."?
Does this mean the actual model is trained on one frame at a time, randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder (see my sketch of how I currently picture the sampling after this message).
And what is the role of no_grad_frames?
Thank you!!
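To make my question concrete, this is how I currently picture the "random intervals from 1 to 10" sampling. It is only my guess as an illustration, not code from this repo:

```python
import random

def sample_clip_indices(num_video_frames, clip_len, max_interval=10):
    """Guess: pick clip_len frame indices where each consecutive pair is
    separated by a random gap of 1..max_interval frames."""
    gaps = [random.randint(1, max_interval) for _ in range(clip_len - 1)]
    span = sum(gaps)
    # Assumes the video has more than `span` frames.
    start = random.randint(0, num_video_frames - 1 - span)
    indices = [start]
    for gap in gaps:
        indices.append(indices[-1] + gap)
    return indices

# e.g. sample_clip_indices(200, clip_len=4) might return [57, 63, 64, 72]
```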