Implement DINO strategy for learning. #203
Closed
This PR changes the learning method (the architecture and outputs are unchanged) from the MAE (Masked Autoencoder) approach to the DINO (self-DIstillation with NO labels) approach.
Background on MAE:
MAE operates by masking a large portion of the input patches (typically 75%) and training the model to reconstruct the missing parts. This forces the model to learn representations from the context provided by the unmasked patches, with the transformer producing an embedding per patch. One limitation: when a distinctive feature is confined to a single masked patch, the model may be unable to infer it from the surrounding context.
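As a rough illustration of the MAE objective described above, here is a minimal numpy sketch (not our actual model code; `reconstruct` stands in for whatever decoder produces patch predictions): mask 75% of the patches, reconstruct them from the visible ones, and score the loss only on the masked patches.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_loss(patches, reconstruct, mask_ratio=0.75):
    """MAE-style objective: hide most patches, reconstruct them,
    and compute MSE only on the hidden (masked) patches."""
    n = patches.shape[0]
    n_masked = int(n * mask_ratio)
    idx = rng.permutation(n)
    masked, visible = idx[:n_masked], idx[n_masked:]
    # the model only ever sees the visible patches
    pred = reconstruct(patches[visible], masked)
    return np.mean((pred - patches[masked]) ** 2)

# toy usage: 16 patches of dim 8; a trivial "decoder" that predicts
# the mean visible patch for every masked position
patches = rng.normal(size=(16, 8))
loss = mae_loss(patches, lambda vis, m_idx: np.tile(vis.mean(axis=0), (len(m_idx), 1)))
```

Note that the loss is computed on the masked patches only, which is exactly why an isolated feature in a masked patch is hard to recover: nothing visible hints at it.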
DINO:
DINO shifts the focus from reconstruction to a student-teacher framework (two copies of the model running in parallel). The "student" model learns to match the output of the "teacher" model, whose weights are an exponential moving average of the student's past weights. Learning is driven by the whole input rather than by missing parts, which aims to refine the model's representations of everything it sees.
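The student-teacher objective can be sketched as follows (a simplified numpy version, assuming the standard DINO recipe of a sharper teacher temperature than student; the temperature values here are illustrative defaults, not tuned for our data):

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher's sharpened distribution
    (low temperature, no gradient in practice) and the student's."""
    p_teacher = softmax(teacher_logits, t_teacher)
    log_p_student = np.log(softmax(student_logits, t_student))
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))

# toy usage: batch of 4 outputs over 10 prototype dimensions
rng = np.random.default_rng(1)
loss = dino_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)))
```

The loss is defined over the full output distribution for every input, which is the "learning from the entirety of the input" point above.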
Key Differences and Advantages:
Holistic Learning vs. Reconstruction by Extrapolation: Unlike MAE, where learning is driven by the need to fill in gaps, DINO pushes the model to understand the full scope of the input data.
Dynamic Updating: The teacher model in DINO is dynamically updated, slowly moving the target towards better representations.
Patch-Level Embeddings: Both MAE and DINO produce embeddings at the patch level, but DINO can capture more nuanced patterns within and around each patch, informed by the accumulated history of the teacher model.
DINO downsides:
Currently running a small experiment over Bali with DINO; I'll then do the same with MAE and compare the runs.