- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning: This paper studies semi-supervised learning (SSL). It argues that consistency regularization, a popular approach in SSL, has limitations such as requiring domain-specific data augmentation. Pseudo-labeling, on the other hand, does not have these limitations but underperforms relative to consistency regularization. The authors attribute this gap to the high amount of noise in the pseudo-labels, which results from over-confident models. They connect network calibration to uncertainty estimation and, by including model uncertainty in the pseudo-label selection process, reduce the noise level and improve overall performance. They experiment with multiple uncertainty estimation methods and show that all of these methods achieve similar results.
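A minimal sketch of uncertainty-aware pseudo-label selection, under stated assumptions: uncertainty is taken as the standard deviation of the predicted-class probability over several stochastic forward passes (for example MC dropout), and a sample is kept only if it is both confident and low-uncertainty. The function name, thresholds, and array layout are hypothetical and not taken from the paper.

```python
import numpy as np

def select_pseudo_labels(mc_probs, conf_threshold=0.7, unc_threshold=0.05):
    """Pick pseudo-labels that are confident AND stable across stochastic passes.

    mc_probs: array of shape (T, N, C) holding softmax outputs from T
              stochastic forward passes over N unlabeled samples, C classes.
    Returns (labels, keep): predicted class per sample and a boolean mask.
    """
    mean_probs = mc_probs.mean(axis=0)                   # (N, C) averaged prediction
    labels = mean_probs.argmax(axis=1)                   # (N,) candidate pseudo-labels
    idx = np.arange(mean_probs.shape[0])
    confidence = mean_probs[idx, labels]                 # mean prob of predicted class
    uncertainty = mc_probs[:, idx, labels].std(axis=0)   # spread across the T passes
    keep = (confidence >= conf_threshold) & (uncertainty <= unc_threshold)
    return labels, keep

# Toy data: 3 passes, 3 samples, 2 classes (values are made up).
mc_probs = np.array([
    [[0.90, 0.10], [0.99, 0.01], [0.55, 0.45]],
    [[0.90, 0.10], [0.60, 0.40], [0.55, 0.45]],
    [[0.90, 0.10], [0.99, 0.01], [0.55, 0.45]],
])
labels, keep = select_pseudo_labels(mc_probs)
print(labels, keep)
```

In this toy input only the first sample is kept: the second is confident on average but unstable across passes, and the third is stable but not confident enough.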
- COTR: Correspondence Transformer for Matching Across Images
- Warp Consistency for Unsupervised Learning of Dense Correspondences
- Learning Target Candidate Association to Keep Track of What Not to Track
- Learning Position and Target Consistency for Memory-based Video Object Segmentation: Matching-based methods do not use any prior about the sequential order of the frames or about how the pixels of an object move together.
This paper addresses this problem by introducing 1) a global retrieval module, 2) a position guidance module, and 3) an object relation module. Global retrieval mainly follows the architecture in STM. In the position guidance module, additional local keys are extracted from the query embedding and the previous adjacent memory embedding.
This module adds positional encoding to both of these embeddings, making the local keys position-sensitive.
Finally, the object relation module brings object-level information from the first frame to improve target consistency.
This way, the model pays specific attention to the first frame, unlike previous methods that treat the first frame the same as the other frames stored in the memory bank.
- Similar idea for temporal consistency: Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning, One-Shot Object Detection with Co-Attention and Co-Excitation.
- Why is the position embedding added only to the previous frame?
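Since global retrieval is said to mainly follow STM, here is a rough sketch of an STM-style memory read: every query location attends over all memory locations via a softmax on key similarity and reads a weighted sum of memory values. The function name and the flattened `(channels, locations)` layout are assumptions for illustration, not the paper's code.

```python
import numpy as np

def memory_read(mem_keys, mem_values, query_keys):
    """STM-style read from a memory bank.

    mem_keys:   (Ck, M) keys for M memory locations (all stored frames, flattened)
    mem_values: (Cv, M) values for the same M locations
    query_keys: (Ck, Q) keys for Q query-frame locations
    Returns (Cv, Q): one retrieved value vector per query location.
    """
    affinity = mem_keys.T @ query_keys                          # (M, Q) dot-product similarity
    affinity = affinity - affinity.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(affinity)
    weights = weights / weights.sum(axis=0, keepdims=True)      # softmax over memory locations
    return mem_values @ weights                                 # weighted sum of memory values

rng = np.random.default_rng(0)
read = memory_read(rng.normal(size=(8, 30)), rng.normal(size=(16, 30)), rng.normal(size=(8, 10)))
print(read.shape)
```

Note that this read is order-agnostic: shuffling the memory locations leaves the output unchanged, which is the missing sequential prior the position guidance module is meant to address.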
- Delving Deep into Many-to-many Attention for Few-shot Video Object Segmentation: In few-shot VOS, a support set showing multiple appearances of a target object is provided; given query frames containing an object instance from the same class, the model should segment the objects of that category. The two main approaches are either to compute a prototype feature vector from the support set and detect the object in the query by comparison to the prototype, or to perform many-to-many attention between the support set and the query frames, which is computationally expensive. This paper takes the latter approach and proposes a way to reduce the cost of the many-to-many attention operation from quadratic to linear without performance loss.
- A limitation is the naive choice of the agent frame from the video (the middle frame), which could be the subject of future work.
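A hedged sketch of how routing attention through a single agent frame cuts the number of frame-to-frame attentions from `S * T` (every support frame with every query frame) to `S + T`. The two-stage layout, the averaging over support frames, and all names are assumptions made for illustration; the paper's actual module may differ.

```python
import numpy as np

def attend(queries, keys, values):
    """Scaled dot-product attention. queries: (Nq, C); keys, values: (Nk, C)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values

def agent_attention(support, query):
    """support: (S, N, C) support-frame features; query: (T, N, C) query-frame features.

    Returns the attended query features (T, N, C) and the number of
    frame-to-frame attention calls that were needed.
    """
    agent = query[len(query) // 2]                 # middle frame as the agent (N, C)
    # Stage 1: the agent gathers information from each support frame (S calls).
    agent_ctx = np.mean([attend(agent, s, s) for s in support], axis=0)
    # Stage 2: each query frame reads from the enriched agent (T calls).
    out = np.stack([attend(q, agent, agent_ctx) for q in query])
    return out, len(support) + len(query)

rng = np.random.default_rng(0)
support = rng.normal(size=(4, 6, 8))   # S=4 frames, N=6 locations, C=8 channels
query = rng.normal(size=(5, 6, 8))     # T=5 frames
out, n_attn = agent_attention(support, query)
print(out.shape, n_attn, "vs full many-to-many:", len(support) * len(query))
```

The `len(query) // 2` line is the naive middle-frame choice mentioned above; a learned or content-based selection would replace only that line.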
- Task: Co-salient object detection aims to detect the common salient objects that share the same attributes across a group of relevant images.
- Why: Instead of only using images from the same group (similar things), teach the network about dissimilar things using images from other groups. The goal is therefore to increase intra-group compactness and inter-group distinctiveness.
- How: The Group Affinity module pulls the embeddings of objects from the same category closer by computing a general group consensus from a group of images containing the same object (using correlation operations). The Group Collaborative Learning Network improves inter-group separability with similar operations, adding only cross-group correlation. The consensus computed from this cross-group operation should not be able to detect the common object.
- Reference: SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks
- Task: Any architecture that uses convolutional layers.
- Why: To obtain an adaptive receptive field by searching for optimal dilation rates across the spatial and channel dimensions instead of using a fixed, manually chosen dilation.
- How: Using a search algorithm referred to as EDO (efficient dilation optimization). The statistical optimization minimizes the L1 error between the expected output of the pre-trained weights (from the so-called supernet) and the expected output of the sampled dilation weights. For more information about the role of the pre-trained weights, refer to the DARTS method.
- Question: Why should the dilation pattern give the same expected value as the pre-trained supernet? Does this optimization happen together with the actual training of the backbone weights?
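A toy 1-D sketch of the selection criterion described above, under heavy assumptions: a dense pre-trained kernel stands in for the supernet weights, each candidate dilation keeps only the three taps at offsets `-d, 0, +d`, and the dilation whose mean output is closest in L1 to the dense kernel's mean output is chosen. All function names and the setup are hypothetical; this is not the EDO implementation.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Plain 1-D cross-correlation without padding."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

def dilated_subkernel(dense_kernel, dilation):
    """Keep the 3 taps at offsets -d, 0, +d around the centre; zero out the rest."""
    centre = len(dense_kernel) // 2
    sub = np.zeros_like(dense_kernel)
    for offset in (-dilation, 0, dilation):
        sub[centre + offset] = dense_kernel[centre + offset]
    return sub

def pick_dilation(dense_kernel, signals, candidates):
    """Return the candidate dilation with the smallest L1 gap in mean output."""
    dense_mean = np.mean([conv1d_valid(s, dense_kernel) for s in signals], axis=0)
    errors = {}
    for d in candidates:
        sub = dilated_subkernel(dense_kernel, d)
        sub_mean = np.mean([conv1d_valid(s, sub) for s in signals], axis=0)
        errors[d] = float(np.abs(dense_mean - sub_mean).sum())
    return min(errors, key=errors.get), errors

rng = np.random.default_rng(0)
signals = rng.uniform(0.5, 1.5, size=(8, 20))        # 8 sample inputs of length 20
dense_kernel = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # mass sits at offsets -2, 0, +2
best, errors = pick_dilation(dense_kernel, signals, candidates=(1, 2))
print(best, errors)
```

Here the dense kernel only has weight at offsets -2, 0, and +2, so dilation 2 reproduces its output exactly while dilation 1 drops two of the three active taps.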
- Improving Multiple Object Tracking with Single Object Tracking: This paper proposes the SOTMOT architecture for multiple object tracking (MOT), bringing advances in single object tracking (SOT) to the MOT setting.
The training pipeline consists of offline and online phases.
During offline training, the SOT branch (which is based on CenterNet) is trained by minimizing a ridge regression loss.
CenterNet produces a heatmap of the objects as well as an offset for each object center and the bounding box sizes.
During online inference, an association algorithm (DeepSORT) is used to find the optimal trajectory for each object.
- Additional references: Learning Feature Embeddings for Discriminant Model based Tracking, Simple Online and Realtime Tracking with a Deep Association Metric
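For reference, a generic ridge regression sketch (closed-form solution plus the loss it minimizes). The notes only state that the SOT branch minimizes a ridge regression loss, so the feature and target shapes, the regularization weight `lam`, and the function names here are illustrative assumptions rather than the SOTMOT formulation.

```python
import numpy as np

def ridge_fit(features, targets, lam=1e-3):
    """Closed-form ridge solution W = (X^T X + lam * I)^{-1} X^T Y.

    features: (n, d) sample features; targets: (n, k) regression targets.
    """
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def ridge_loss(features, targets, weights, lam=1e-3):
    """Squared error plus L2 penalty on the weights."""
    residual = features @ weights - targets
    return float(np.sum(residual ** 2) + lam * np.sum(weights ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([[1.0], [-2.0], [0.5]])
Y = X @ w_true
w = ridge_fit(X, Y, lam=1e-3)
print(w.ravel(), ridge_loss(X, Y, w, lam=1e-3))
```

Because the solution is available in closed form, such a discriminative model can be re-solved cheaply per target, which is one reason ridge regression is popular in tracking.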
- Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation
- Self-supervised Augmentation Consistency for Adapting Semantic Segmentation
- Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
- Adversarial Generation of Continuous Images
- Region-aware Adaptive Instance Normalization for Image Harmonization
- DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution
- Discriminative Appearance Modeling with Multi-track Pooling for Real-time Multi-object Tracking
- [Multiple Object Tracking with Correlation Learning](https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_Multiple_Object_Tracking_With_Correlation_Learning_CVPR_2021_paper.pdf)
- InverseForm: A Loss Function for Structured Boundary-Aware Segmentation
- Adaptive Consistency Regularization for Semi-Supervised Transfer Learning
- Interactive Self-Training with Mean Teachers for Semi-supervised Object Detection
- Convolutional Hough Matching Networks
- Detector-Free Local Feature Matching with Transformers
- Additional references: Neighbourhood Consensus Networks
- Partial Optimal Tranport with applications on Positive-Unlabeled Learning
- Adversarial Self-Supervised Contrastive Learning
- Normalizing Kalman Filters for Multivariate Time Series Analysis
- Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization
- Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
- Domain Generalization via Entropy Regularization
- Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
- Additional references: Learning from labeled and unlabeled data with label propagation