
Masked Autoencoders Are Scalable Vision Learners

Souvik Mandal · ITNEXT

This paper is from Facebook AI Research (FAIR), published in 2021. It introduces a novel architecture, Masked Autoencoders (MAE), for training models in a self-supervised way. The training methodology is mainly targeted at vision-transformer (ViT) style architectures. Meta Research (not Facebook anymore 😛) recently released Segment Anything, and that model uses a backbone pre-trained with this method. MAE produces better results than existing methods in much less training time (about 3x faster). Let's understand why and how it works.

We will first go through a high-level overview of the architecture and the method; then we will go into one component at a time. At a high level, the pipeline looks like this:

1. Split the image into non-overlapping patches and compute an embedding for each patch.
2. Add positional encodings to the patch embeddings.
3. Randomly mask out a large fraction (75%) of the patches.
4. Pass only the remaining, visible patch embeddings through the encoder.
5. Project the encoder outputs to the decoder dimension, append a shared mask token for every masked patch, and restore the original patch order.
6. Pass the full sequence through the decoder.
7. Once we have the decoder outputs, we reconstruct each of the patches from the corresponding tokens.

The MAE network is an asymmetric encoder-decoder architecture, i.e. one where the encoder and the decoder have different sizes. The encoder operates only on the subset of image patches that are not masked; the decoder reconstructs the original image from the latent representation and the mask tokens. In the case of MAE the encoder is the larger one: the embedding dimension of each patch is 768 for the encoder but 512 for the decoder, and the decoder uses 8 blocks (depth) while the encoder uses 24. Because the decoder is small and the encoder processes only a small number of patches, training is much faster (a small configuration sketch is given a few paragraphs below).

If you follow the research on vision transformers, one of their main problems is the large amount of training data they need to perform well; with the same small dataset, CNNs still perform better, and getting large labelled datasets is a problem. The transformer idea came from NLP, so let's check how this problem is solved there: self-supervised pretraining. Part of a sentence is removed, and the model tries to predict it from the parts that are still available. With this simple pre-training method, people are training NLP models with over a trillion parameters (GPT-4 is rumored to have 1.8 trillion parameters 😶).

Removing some of the patches, processing only the remaining ones in the encoder, and processing everything in the decoder is the main concept of MAE. A denoising autoencoder (DAE) corrupts the image (by adding noise, changing contrast, dropping a color channel and so on) and tries to predict the uncorrupted image. You can think of MAE as a special form of DAE, but not the other way around.

Another question that comes up is why so many patches are dropped (around 75% of them), which is normally not the case in NLP. The answer is that language is highly semantic and information-dense. Pick any sentence in this blog, randomly remove 75% of the words, and see whether you can still predict what the remaining parts are trying to say. For NLP, dropping even a few words and trying to predict them is hard enough that the model has to learn relations across words. Images, in contrast, have high spatial redundancy: a missing patch can easily be recovered just by interpolating from the neighbouring patches.
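Here is that configuration sketch: a minimal, hypothetical set of hyper-parameters for the asymmetric design described above. The names are my own and the values are the ones quoted in this post, not a definitive reference.

```python
# Hypothetical hyper-parameters for the asymmetric MAE setup described above.
# Names are illustrative; the values are the ones quoted in this post.
mae_config = dict(
    img_size=224,          # input images are resized to 224 x 224
    patch_size=16,         # 16 x 16 patches -> 14 * 14 = 196 patches per image
    mask_ratio=0.75,       # 75% of the patches are dropped before the encoder
    encoder_dim=768,       # per-patch embedding dimension in the encoder
    encoder_depth=24,      # transformer blocks in the encoder
    decoder_dim=512,       # smaller embedding dimension in the decoder
    decoder_depth=8,       # far fewer transformer blocks in the decoder
)
```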
Take an image of a road, for example: you can remove lots of patches from the road and still reconstruct them with just interpolation. This strategy (dropping a large proportion of the patches) largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics.

Based on the previous discussion, we know that images contain very redundant information, and the decoder is basically reconstructing the latent-space information (a fancy term for the features extracted by the encoder) back to the input level, i.e. pixel space. So the decoder only needs low-level semantic information; its task is not complicated, and a small decoder is fine. Also, note that the encoder only processes the patches that remain after masking, while the decoder processes all the patches. Having a small decoder and a large encoder therefore speeds up training compared with an equal-size encoder-decoder pair, and this asymmetric design consumes less memory than traditional encoder-decoder models.

Mean squared error is used to train the MAE, and the loss is computed only for the masked patches, not for the non-masked ones.

Finally, let's go through the fine-grained details of the implementation (minimal sketches of each step are given at the end of this section).

We take an image, resize it to a fixed size, define a few parameters (image size, patch size, embedding dimension, masking ratio), create non-overlapping patches, and visualize the patches on the image.

Next, we create the feature embeddings for each of the patches. Here the patch embedding dimension is 768. We use a 2D convolution that slides over the patches and outputs the feature embeddings; note that its kernel size and stride are both equal to the patch size. We then reshape the output to (batch, number_of_patches, embedding_dim), which gives us 196 patches with 768 feature dimensions each. We use learned positional embeddings in the encoder and add them to the patch embeddings, leaving out the class token's positional encoding.

Let's do the random masking next. We use a 75% masking ratio, so out of 196 patches we keep only 49 after masking. We create a random noise tensor of shape (batch, num_patches) and keep the indices of the first 49 values in ascending order. For example, if we have a single data point and assume we create 5 patches per image (instead of 196), we might get a noise tensor like [0.6, 0.9, 0.4, 0.7, 0.8]. Now let's say we want to keep 3 patches instead of 49: we keep the patches at indices [2, 0, 3] and mask out the rest.

Once we know which patches' embeddings we are keeping, we gather those embeddings from the projected output (proj_op). Sticking with the smaller example, selecting patches [2, 0, 3] gives us the non-masked embeddings. We then prepend the class token and pass the final output x through the transformer encoder.

Now, remember that we down-project the embeddings from 768 to 512 dimensions for the decoder. We also define the mask token, which is used for every masked patch. Like the cls token, it is learned during training and shared across batches and masked patches; it represents a single masked patch. Out of the 196 patches in total, we have processed only 49 (not counting the cls token), so we copy the mask token 196 - 49 = 147 times, once for each masked patch.
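Below are minimal PyTorch sketches of these steps. The variable names are my own choosing and the code follows the shapes quoted above rather than any particular reference implementation. First, patchification via a strided convolution plus the learned positional embeddings:

```python
import torch
import torch.nn as nn

img_size, patch_size = 224, 16
embed_dim = 768                                 # encoder embedding dimension quoted above
num_patches = (img_size // patch_size) ** 2     # 14 * 14 = 196

img = torch.randn(1, 3, img_size, img_size)     # a dummy batch with one image

# Patch embedding: a conv whose kernel and stride both equal the patch size
# slides over non-overlapping patches and projects each one to embed_dim.
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
x = proj(img)                                   # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)                # (1, 196, 768) = (batch, num_patches, embed_dim)

# Learned positional embeddings; slot 0 is reserved for the cls token.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
x = x + pos_embed[:, 1:, :]                     # add positional encodings, skipping the cls slot
```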
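Next, a sketch of the 75% random masking, the gather of the kept patch embeddings, the cls token, and a stand-in for the 24-block transformer encoder, continuing from the tensors defined in the previous sketch:

```python
mask_ratio = 0.75
len_keep = int(num_patches * (1 - mask_ratio))      # 49 patches are kept

noise = torch.rand(x.shape[0], num_patches)         # (batch, 196) random noise
ids_shuffle = torch.argsort(noise, dim=1)           # ascending: the first 49 indices are kept
ids_restore = torch.argsort(ids_shuffle, dim=1)     # used later to undo the shuffle
ids_keep = ids_shuffle[:, :len_keep]

# Gather only the embeddings of the non-masked patches.
x_visible = torch.gather(
    x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, embed_dim)
)                                                   # (batch, 49, 768)

# Prepend the learned class token, with its own positional embedding.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
cls = (cls_token + pos_embed[:, :1, :]).expand(x_visible.shape[0], -1, -1)
x_visible = torch.cat([cls, x_visible], dim=1)      # (batch, 1 + 49, 768)

# Stand-in for the 24 ViT encoder blocks mentioned above.
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=24)
latent = encoder(x_visible)                         # (batch, 1 + 49, 768)
```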
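And a sketch of the down-projection to the 512-dimensional decoder width together with the shared mask token, repeated 147 times, continuing from the encoder output latent above:

```python
decoder_dim = 512

# Project the encoder output (cls token + 49 visible patches) to the decoder width.
decoder_embed = nn.Linear(embed_dim, decoder_dim)
y = decoder_embed(latent)                                 # (batch, 1 + 49, 512)

# One learned mask token, copied once for every masked patch: 196 - 49 = 147 copies.
mask_token = nn.Parameter(torch.zeros(1, 1, decoder_dim))
n_masked = num_patches - len_keep
mask_tokens = mask_token.repeat(y.shape[0], n_masked, 1)  # (batch, 147, 512)
```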
Now we need to put the encoder-processed tokens and the mask tokens back in the original order. Going back to the smaller example: sorting the noise tensor [0.6, 0.9, 0.4, 0.7, 0.8] gives the shuffled indices [2, 0, 3, 4, 1]. Suppose that, after the encoder, the outputs for the three selected patches (the patches at indices [2, 0, 3]) are [1.7, 1.8, 1.9], [1.1, 1.2, 1.3] and [2.0, 2.1, 2.2]. Appending the mask tokens for the remaining two patches gives [[1.7, 1.8, 1.9], [1.1, 1.2, 1.3], [2.0, 2.1, 2.2], mask_token, mask_token]. We need to convert this tensor to the original patch order, i.e. [[1.1, 1.2, 1.3], mask_token, [1.7, 1.8, 1.9], [2.0, 2.1, 2.2], mask_token]. If we argsort the shuffled indices [2, 0, 3, 4, 1] we get the restore indices [1, 4, 0, 2, 3], and if we use those indices to gather from the mask-concatenated tensor ([[1.7, 1.8, 1.9], [1.1, 1.2, 1.3], [2.0, 2.1, 2.2], mask_token, mask_token]) we get exactly the result we want ([[1.1, 1.2, 1.3], mask_token, [1.7, 1.8, 1.9], [2.0, 2.1, 2.2], mask_token]).

We do the same in code: un-shuffle the tokens, add the decoder positional encodings, define the decoder, and process the data through it. We then need to project the feature embedding of each patch back to patch_height * patch_width * channels values; this is the image-reconstruction step. Finally, we remove the class token from the decoder output.

On the input side, we patchify the original image to compute the loss, so the ground truth (patched_img) and the prediction (x) have the same shape. We only want to compute the loss on the masked patches, so we build a binary mask and un-shuffle it back to the original patch order, in the same way we un-shuffled the patches before the decoder. Then we compute the loss. Sketches of the decoder side and of the masked loss follow below.

That's it for this blog. Have a nice day.
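A minimal sketch of the un-shuffling gather described above and of the decoder side, continuing from the mask-token sketch earlier. The stand-in transformer blocks and names are my own simplification, not the exact blocks used in the paper:

```python
# Un-shuffle: append the mask tokens to the visible tokens (without the cls token),
# then gather with ids_restore to recover the original patch order.
y_ = torch.cat([y[:, 1:, :], mask_tokens], dim=1)          # (batch, 196, 512)
y_ = torch.gather(y_, dim=1,
                  index=ids_restore.unsqueeze(-1).repeat(1, 1, decoder_dim))
y = torch.cat([y[:, :1, :], y_], dim=1)                    # put the cls token back

# Decoder positional embeddings plus a small stand-in decoder (8 blocks).
decoder_pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, decoder_dim))
y = y + decoder_pos_embed
decoder_layer = nn.TransformerEncoderLayer(d_model=decoder_dim, nhead=16,
                                           batch_first=True)
decoder = nn.TransformerEncoder(decoder_layer, num_layers=8)
y = decoder(y)

# Reconstruction head: predict patch_size * patch_size * 3 pixel values per patch,
# then drop the class token.
decoder_pred = nn.Linear(decoder_dim, patch_size * patch_size * 3)
pred = decoder_pred(y)[:, 1:, :]                           # (batch, 196, 768)
```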
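And a sketch of the masked MSE loss, again continuing from the tensors above: patchify the input image as the target, build a binary mask in the original patch order, and average the per-patch error over the masked patches only:

```python
# Patchify the input image into (batch, 196, 16*16*3) to serve as the target.
p = patch_size
target = img.reshape(img.shape[0], 3, img_size // p, p, img_size // p, p)
target = target.permute(0, 2, 4, 3, 5, 1).reshape(img.shape[0], num_patches, p * p * 3)

# Binary mask in the original patch order: 0 = kept, 1 = masked.
mask = torch.ones(img.shape[0], num_patches)
mask[:, :len_keep] = 0
mask = torch.gather(mask, dim=1, index=ids_restore)

# Mean squared error, averaged over the masked patches only.
loss = ((pred - target) ** 2).mean(dim=-1)     # per-patch MSE, (batch, 196)
loss = (loss * mask).sum() / mask.sum()
```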


