
An all-MLP Architecture for Vision | MLP-Mixer

A vision architecture based exclusively on multi-layer perceptrons (MLPs)


Convolutional neural networks (CNNs) and attention-based networks (Vision Transformers) have been the two dominant architectures in computer vision. However, a recent paper proposes a new architecture called MLP-Mixer[1], which is based exclusively on multi-layer perceptrons (MLPs). The paper shows that MLP-Mixer can achieve performance competitive with CNNs and attention-based networks while being simpler and more efficient.

Contents

  • Overview
  • MLP-Mixer Architecture
  • Experimental Results

Overview

MLP-Mixer[1] is a novel image classification architecture that is fundamentally different from convolutional neural networks (CNNs) and self-attention based models (Vision Transformers). It relies solely on the hierarchical processing capabilities of MLPs to learn spatial and channel dependencies in images.

MLP-Mixer consists of two types of layers:

  • Token-mixing MLPs operate on each channel independently, taking individual columns of the input table as inputs. This allows them to communicate between different spatial locations (patches) in the input.
  • Channel-mixing MLPs operate on each patch (token) independently, taking individual rows of the input table as inputs. This allows them to communicate between different feature channels at a single location.

The two types of layers, token-mixing MLPs and channel-mixing MLPs, are interleaved to enable the interaction of both input dimensions. MLP-Mixer also uses residual connections and layer normalization to improve training stability.
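The row/column intuition above can be made concrete with a few NumPy shapes. The sizes below (196 patches, 512 channels) are illustrative assumptions, not values fixed by the architecture:

```python
import numpy as np

# Hypothetical sizes: S = 196 patches (a 14x14 grid), C = 512 hidden channels.
S, C = 196, 512
X = np.random.randn(S, C)  # input table: one row per patch, one column per channel

# Token mixing operates along the patch axis: transposing X turns each of the
# C channels into a length-S vector that a shared MLP can mix across locations.
tokens_view = X.T          # shape (C, S): rows are channels, columns are patches

# Channel mixing operates along the channel axis: each of the S rows is a
# length-C feature vector for one patch.
channels_view = X          # shape (S, C)

print(tokens_view.shape, channels_view.shape)  # (512, 196) (196, 512)
```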

MLP-Mixer Architecture

The idea behind the Mixer architecture is to clearly separate the per-location (channel-mixing) operations from the cross-location (token-mixing) operations. Both operations are implemented with MLPs, as illustrated in Figure 1.

Figure 1. MLP-Mixer (from the original paper[1])

MLP-Mixer includes three main sections (Figure 1 is cropped to better illustrate each step):

Per-patch linear embeddings

The input image with resolution H×W is partitioned into S non-overlapping patches of size P×P, so the number of patches is S = HW/P². Each patch is flattened and linearly projected to a hidden dimension C. This produces a two-dimensional real-valued input table X of shape S×C: one row per patch (S rows) and one column per hidden channel (C columns).
All patches are linearly projected with the same projection matrix, so the same transformation is applied to every patch regardless of its location in the image. This helps the model learn features that are invariant to spatial location.
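The partition-and-project step can be sketched in NumPy as follows. The image size (224×224), patch size (16), and hidden dimension (512) are assumed for illustration; the random projection stands in for a learned matrix:

```python
import numpy as np

def patch_embed(image, P, C, W_proj=None):
    """Split an H x W x ch image into non-overlapping P x P patches and
    project each flattened patch to C dimensions with one shared matrix."""
    H, W, ch = image.shape
    assert H % P == 0 and W % P == 0
    S = (H // P) * (W // P)                      # S = HW / P^2
    # Rearrange the image into a (S, P*P*ch) table of flattened patches.
    patches = image.reshape(H // P, P, W // P, P, ch)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(S, P * P * ch)
    if W_proj is None:  # stand-in for the learned projection
        rng = np.random.default_rng(0)
        W_proj = rng.standard_normal((P * P * ch, C)) * 0.02
    return patches @ W_proj                      # input table X, shape (S, C)

img = np.zeros((224, 224, 3))
X = patch_embed(img, P=16, C=512)
print(X.shape)  # (196, 512) -- S = 224*224 / 16^2 = 196
```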

Mixer-layers

Each layer consists of two MLP blocks:

  • Token-mixing MLP: This block acts on the transpose of X. It is responsible for learning dependencies, including long-range ones, between different image patches. It does this by applying a shared fully-connected MLP to each column of X, i.e., to each channel across all S patches.
  • Channel-mixing MLP: This block acts on the rows of X. It is responsible for learning relationships between different channels within a single image patch. It does this by applying a shared fully-connected MLP to each row of X, i.e., to the C channels of one patch.

In addition to MLP blocks, Mixer uses other standard architectural components, such as skip connections and layer normalization. However, unlike Vision Transformers (ViTs), Mixer does not use positional embeddings: the token-mixing MLPs are sensitive to the order of the input tokens, so positional information is available to the model implicitly.
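Putting the two blocks together with pre-layer-normalization and skip connections gives a full Mixer layer. The sketch below is a minimal NumPy version; the hidden widths D_S and D_C and the GELU nonlinearity follow the paper, while the specific sizes are assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis (no learned scale/shift, for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, W2):
    return gelu(x @ W1) @ W2

def mixer_layer(X, token_W1, token_W2, chan_W1, chan_W2):
    # Token mixing: normalize, transpose so columns become MLP inputs,
    # apply the shared MLP, transpose back, and add the skip connection.
    Y = X + mlp(layer_norm(X).T, token_W1, token_W2).T
    # Channel mixing: the shared MLP acts on each row of Y, plus a skip.
    return Y + mlp(layer_norm(Y), chan_W1, chan_W2)

S, C, Ds, Dc = 196, 512, 256, 2048   # assumed sizes for illustration
rng = np.random.default_rng(0)
w = lambda a, b: rng.standard_normal((a, b)) * 0.02
out = mixer_layer(rng.standard_normal((S, C)),
                  w(S, Ds), w(Ds, S), w(C, Dc), w(Dc, C))
print(out.shape)  # (196, 512) -- the table's shape is preserved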

Classifier Head

The Mixer network uses a standard classification head to predict the class of an image. The classification head consists of a global average pooling layer followed by a linear classifier.
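In table form, this head averages over the S patch rows and applies one linear map. A minimal NumPy sketch, with 1000 classes assumed for an ImageNet-style setup:

```python
import numpy as np

def classifier_head(X, W_cls, b_cls):
    """Global average pool over the S patch rows, then a linear classifier."""
    pooled = X.mean(axis=0)          # (C,) -- one averaged feature vector
    return pooled @ W_cls + b_cls    # (num_classes,) logits

rng = np.random.default_rng(0)
X = rng.standard_normal((196, 512))              # output table of the last layer
logits = classifier_head(X,
                         rng.standard_normal((512, 1000)) * 0.02,
                         np.zeros(1000))
print(logits.shape)  # (1000,)
```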

Experimental Results

Training configuration: Adam with β₁ = 0.9, β₂ = 0.999, linear learning-rate warmup of 10k steps followed by linear decay, batch size 4096, weight decay, and gradient clipping at global norm 1.

Table 1 compares the largest Mixer models to the best models that have been published before. The “ImNet” and “ReaL” columns show the models’ performance on the original ImageNet validation dataset and the cleaned-up ReaL dataset, respectively. The “Avg. 5” column shows the average performance of the models across five different downstream tasks: ImageNet, CIFAR-10, CIFAR-100, Pets, and Flowers.

The MLP-based Mixer models are marked in pink (🟣), convolution-based models in yellow (🟡), and attention-based models in blue (🔵).

Table 1. Transfer performance, inference throughput, and training cost of ViT-based, CNN-based, and MLP-Mixer models

When MLP-Mixer is pre-trained on a large dataset of images (ImageNet-21k), it achieves strong performance on the ImageNet validation set (84.15% top-1 accuracy), though it remains slightly inferior to other models pre-trained on the same dataset.

When the size of the upstream dataset is increased further, Mixer’s performance improves significantly. For example, Mixer-H/14 achieves 87.94% top-1 accuracy on ImageNet, which is 0.5% better than BiT-ResNet152×4 and only 0.5% lower than ViT-H/14.

This suggests that Mixer is a promising model for image classification and that its performance can be further improved by using larger datasets.

References

[1] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, D. Keysers, J. Uszkoreit, M. Lucic, A. Dosovitskiy, MLP-Mixer: An all-MLP Architecture for Vision (2021)



An all-MLP Architecture for Vision | MLP-Mixer was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

By: Golnaz Hosseini
Title: An all-MLP Architecture for Vision | MLP-Mixer
Sourced From: ai.plainenglish.io/an-all-mlp-architecture-for-vision-mlp-mixer-6c0b00890cb2?source=rss—-78d064101951—4
Published Date: Tue, 27 Jun 2023 10:17:12 GMT
