
I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

This paper from Meta proposes a self-supervised technique that learns highly semantic image features without relying on hand-crafted augmentations. The main idea is to predict the representations of several target blocks from a single context block. The method achieves +7.9 points higher top-1 accuracy than masked auto-encoders (MAE) on the ImageNet-1K dataset.

Based on how the image representations are compared and generated, self-supervised models can be divided into three categories: joint-embedding architectures, which learn to produce similar embeddings for two augmented views of the same image; generative architectures, which reconstruct the masked or corrupted part of the input in pixel space; and joint-embedding predictive architectures (JEPA), which predict the representations of one part of the input from another part, so prediction happens in representation space.

There are three main components in I-JEPA: context selection, target selection, and the predictor. The context encoder, the target encoder, and the predictor are all Vision Transformer (ViT) architectures.

For the context, the authors select a random large square block with a scale in the range (0.85, 1.0), where scale means the size relative to the original image. The context block can therefore cover anywhere from 85% of the image to the full image. The context block itself consists of multiple patches, and any patches that overlap with the target blocks are removed from it before it is processed by the context encoder. The context encoder is a ViT encoder.

The targets are the feature representations of different image blocks. Given an input image, it is converted into a sequence of non-overlapping patches, which are passed through the target encoder to obtain feature representations; each patch of the image gets a corresponding feature vector. In Figure 3, the original image is split into 25 non-overlapping patches, which are passed through the target encoder to get the feature representations of the 25 tokens. We then select groups of neighbouring patch representations to create the targets. In Figure 2, the red target consists of the representations of 4 patches, and the same holds for the yellow target; one patch overlaps between the red and yellow targets. The authors mention that they normally select 4 target blocks, each sampled with a random aspect ratio in (0.75, 1.5) and a random scale in (0.15, 0.2). The target encoder is a ViT encoder.

A mask token is learned during training (similar to the learned [MASK] token in NLP). For each patch we want to predict, the corresponding positional encoding is added to this mask token. To predict M target blocks, the predictor is called M times, each time taking the encoded context and the (mask token + positional embedding) for every patch of that target block as input. The overall architecture is shown in Figure 2.

The loss is the L2 distance between the predicted patch-level representations and the target patch-level representations. The context encoder weights are updated through the gradients from the loss, while the target encoder weights are updated as an exponential moving average of the context-encoder parameters. In general, an exponential-moving-average target encoder works very well for training JEAs with Vision Transformers.
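To make the block-sampling scheme above concrete, here is a minimal sketch of how context and target blocks could be drawn on a patch grid. The helper names and the 14×14 patch grid are illustrative assumptions rather than the authors' code; only the scale and aspect-ratio ranges come from the paper.

```python
import math
import random

def sample_block(grid_h, grid_w, scale_range, ar_range):
    """Sample a rectangular block of patch indices on a grid_h x grid_w patch grid."""
    scale = random.uniform(*scale_range)        # fraction of all patches to cover
    aspect = random.uniform(*ar_range)          # block height / width
    n_patches = scale * grid_h * grid_w
    h = max(1, min(grid_h, round(math.sqrt(n_patches * aspect))))
    w = max(1, min(grid_w, round(math.sqrt(n_patches / aspect))))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return {r * grid_w + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_context_and_targets(grid_h=14, grid_w=14, num_targets=4):
    # Target blocks: scale in (0.15, 0.2), aspect ratio in (0.75, 1.5).
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    # Context block: large, roughly square, scale in (0.85, 1.0).
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    # Drop every context patch that overlaps a target so the task is not trivial.
    context -= set().union(*targets)
    return context, targets

ctx, tgts = sample_context_and_targets()
print(len(ctx), [len(t) for t in tgts])
```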
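The prediction and loss step can be sketched as follows, under a few stated assumptions: `context_encoder`, `target_encoder`, and `predictor` stand in for ViT-style modules that map token sequences to feature sequences, the predictor is fed the context features concatenated with the positional-embedded mask tokens and its outputs are read off at the mask positions, and the smoke test at the bottom uses linear layers purely as placeholders. This is an illustration of the idea, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ijepa_loss(image_patches, context_idx, target_blocks,
               context_encoder, target_encoder, predictor,
               mask_token, pos_embed):
    """Simplified I-JEPA loss for one image batch.

    image_patches: (B, N, D) patchified image tokens
    context_idx:   indices of context patches (target patches already removed)
    target_blocks: list of index tensors, one per target block
    mask_token:    learned (1, 1, D) parameter
    pos_embed:     (1, N, D) positional embeddings
    """
    # Targets come from the EMA target encoder; no gradients flow through it.
    with torch.no_grad():
        target_feats = target_encoder(image_patches)              # (B, N, D)

    # Encode only the visible context patches.
    ctx_tokens = image_patches[:, context_idx] + pos_embed[:, context_idx]
    ctx_feats = context_encoder(ctx_tokens)                       # (B, |ctx|, D)

    loss = 0.0
    for block_idx in target_blocks:
        # One mask token per patch to predict, tagged with its position.
        queries = mask_token + pos_embed[:, block_idx]            # (1, |blk|, D)
        queries = queries.expand(ctx_feats.shape[0], -1, -1)
        # The predictor sees [context features; mask tokens]; keep its
        # outputs at the mask-token positions.
        pred = predictor(torch.cat([ctx_feats, queries], dim=1))
        pred = pred[:, -queries.shape[1]:]
        # Squared L2 distance between predicted and target representations.
        loss = loss + F.mse_loss(pred, target_feats[:, block_idx])
    return loss / len(target_blocks)

# Smoke test with linear layers standing in for the ViT modules.
if __name__ == "__main__":
    B, N, D = 2, 196, 64
    patches = torch.randn(B, N, D)
    ctx_idx = torch.arange(0, 150)
    blocks = [torch.arange(150 + 10 * i, 160 + 10 * i) for i in range(4)]
    enc, tgt, prd = (torch.nn.Linear(D, D) for _ in range(3))
    mask = torch.zeros(1, 1, D)
    pos = torch.randn(1, N, D)
    print(ijepa_loss(patches, ctx_idx, blocks, enc, tgt, prd, mask, pos))
```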
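The target encoder itself receives no gradient updates; its weights simply track the context encoder. A minimal sketch of that exponential-moving-average update is below, with `momentum=0.996` as an illustrative default rather than the paper's exact schedule.

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Move each target-encoder weight toward the corresponding
    context-encoder weight: theta_tgt <- m * theta_tgt + (1 - m) * theta_ctx."""
    for tgt_p, ctx_p in zip(target_encoder.parameters(),
                            context_encoder.parameters()):
        tgt_p.mul_(momentum).add_(ctx_p, alpha=1.0 - momentum)
```

Called once per optimizer step, this keeps the target representations stable while still letting them improve as the context encoder learns.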
To check whether the model has learned good semantic features during pre-training, the authors report results on various image classification tasks using linear probing and partial fine-tuning protocols. In the semi-supervised setting (pre-training followed by linear probing or fine-tuning on 1% of the ImageNet-1K dataset), I-JEPA produces better results; for each model, the better of linear probing and fine-tuning is reported.

On linear evaluation of downstream image classification tasks, I-JEPA significantly outperforms previous methods that also do not use augmentations (MAE and data2vec), and it narrows the gap to the best view-invariance-based methods that leverage hand-crafted data augmentations during pre-training.

On linear evaluation of downstream low-level tasks, object counting (Clevr/Count) and depth prediction (Clevr/Dist), I-JEPA effectively captures low-level image features during pre-training and outperforms view-invariance-based methods. I-JEPA also benefits from pre-training on larger datasets.

One of the reasons this works is that prediction happens in latent space rather than pixel space. When predicting in latent space, each missing piece corresponds to high-level features or objects. Masking out objects or high-level features and predicting them from the context makes more sense, because the masking idea comes from NLP, where a masked token/word is predicted from its neighbouring tokens/words. In the image domain, words map more naturally to objects than to pixels, since individual pixels carry little semantic information compared to words. Passing the image through an encoder yields high-level object features while redundant pixel-level information is largely removed, and this is why the approach works well.

Hope you enjoyed this paper. Have a nice day.


