MixViT ConvMAE #103

samueleruffino99 · 2023-12-13T13:16:07Z

Hello, I have seen that you refer to the ConvMAE-pretrained method as MixViT-ConvMAE, but looking at your implementation, the backbone is much more similar to the MixCvT layout, with multiple patch embeddings and blocks.
Am I missing something?
I ask because I am trying to adapt PiMAE in the same way you adapted the ConvMAE model. Thank you!

Moreover, I have seen that during training you pass the template and search tokens through the same backbone multiple times. How does the training procedure deal with this? I ask because I would like to enrich your model with some notion of hand trajectory (for example, when the tracked object is being handled).

yutaocui · 2023-12-13T14:10:15Z

In terms of the patch embedding style, MixViT-ConvMAE is more like MixCvT, so you are right.
For the second question, I don't understand what you mean. Can you give a more detailed explanation?
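
To make the patch-embedding point concrete, here is a minimal sketch of how the token grid shrinks with a single patch embedding (plain ViT style) versus several stacked patch embeddings (CvT style). The helper `token_grid`, the 224-pixel input, and the strides `[16]` and `[4, 2, 2]` are illustrative assumptions, not values taken from the repository.

```python
def token_grid(image_size, strides):
    """Return the token-grid side length after each patch-embedding stage.

    image_size: side length of a square input image, in pixels.
    strides:    downsampling factor of each patch-embedding stage
                (hypothetical values, chosen only for illustration).
    """
    sizes = []
    side = image_size
    for stride in strides:
        if side % stride != 0:
            raise ValueError(f"side {side} is not divisible by stride {stride}")
        side //= stride
        sizes.append(side)
    return sizes


# One patch embedding: the image is tokenized once, then all blocks
# run at that single resolution.
single_stage = token_grid(224, [16])

# Several patch embeddings: each stage downsamples again, so blocks run
# at progressively coarser resolutions.
multi_stage = token_grid(224, [4, 2, 2])

print("single patch embedding:", single_stage)
print("multi-stage patch embedding:", multi_stage)
```

Under these assumed strides, both layouts end at the same final grid, but the multi-stage one also produces higher-resolution intermediate grids, which is the "multiple patch embeddings and blocks" structure described in the question.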
