
The CLIP Foundation Model

Sascha Kirch, Towards Data Science

In this article we go through the paper behind CLIP (Contrastive Language-Image Pre-Training). We will extract key concepts and break them down to make them easy to understand. Further, images and data graphs are annotated to clarify the most important points.

Paper: Learning Transferable Visual Models From Natural Language Supervision
Code: https://github.com/OpenAI/CLIP
First published: 26 Feb. 2021
Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Category: multi-modal deep learning, computer vision, natural language processing, foundation models, representation learning

CLIP (Contrastive Language-Image Pre-Training) is a multi-modal model that learns the correspondence between natural language and images. It is trained on 400 million text-image pairs collected from the internet. As we will discover later in this article, CLIP has strong zero-shot performance, meaning it performs well on downstream tasks different from those it was trained on, without any fine-tuning. In short, CLIP aims to learn transferable visual representations from natural language supervision and to apply them to downstream tasks in a zero-shot fashion.

Why is this a big deal, you might ask yourself? First of all, many computer vision models are trained on crowd-sourced labeled datasets. These datasets often contain hundreds of thousands of samples; some exceptions reach into the single- or double-digit millions. As you can imagine, creating such datasets is a very time-consuming and costly process. Datasets for natural language models, on the other hand, are usually several orders of magnitude larger and are scraped from the internet. Secondly, if an object detection model has been trained on certain classes and you want to add an extra class, you would need to label this new class in your data and retrain the model.

CLIP’s ability to combine natural language and image features, together with its zero-shot performance, has led to wide adoption in many other popular foundation models such as UnCLIP, EVA, SAM, Stable Diffusion, GLIDE or VQGAN-CLIP, to name a few.

Now let’s dive into the method of CLIP. Fig. 1 below shows the architecture of CLIP and the process by which it is trained.

The model architecture consists of two encoder models, one for each modality. For the text encoder a transformer is used, while the image encoder uses either a version of ResNet or a ViT (Vision Transformer). A learned linear transformation, one per modality, maps the encoder features into embeddings of matching size. Finally, the cosine similarity is calculated between every pair of embeddings of opposing modality and is scaled by a learned temperature parameter. During training, the cosine similarity between matching pairs is maximized while it is minimized for incorrect pairs, hence the term “contrastive” in the framework’s name.

There are some subtleties that are crucial for this to succeed, beside the large dataset of course. First, the contrastive learning approach strongly depends on the batch size N: the more negative samples that are provided along with the correct ones, the stronger the learning signal. CLIP was trained with a batch size of 32,768, which is quite large. Second, CLIP does not learn to match the exact wording of a caption, but an easier proxy task of matching the text as a whole, also called a bag of words (BoW).

Fun fact: the version of CLIP using a ResNet50x64 as image encoder was trained for 18 days on 592 V100 GPUs, while the version with the ViT model was trained for 12 days on 256 V100 GPUs. In other words, over 29 years and over 8 years on a single GPU, respectively (ignoring the fact that a different batch size would be used).
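To make the contrastive objective described above more concrete, here is a minimal sketch of one training step in PyTorch. It is close in spirit to the pseudocode in the paper, but the encoder outputs are replaced by random tensors and all dimensions are purely illustrative.

import torch
import torch.nn.functional as F

# Illustrative sizes; not the values used in the paper
batch_size, d_img, d_txt, d_emb = 8, 1024, 512, 256

# Stand-ins for the encoder outputs. In the real model these come from the
# image encoder (ResNet/ViT) and the text encoder (transformer).
image_features = torch.randn(batch_size, d_img)
text_features = torch.randn(batch_size, d_txt)

# Learned linear projections into the shared embedding space
W_image = torch.randn(d_img, d_emb, requires_grad=True)
W_text = torch.randn(d_txt, d_emb, requires_grad=True)
# Learned temperature, parameterized via its logarithm for stability
log_temperature = torch.log(torch.tensor(0.07)).requires_grad_()

# Project and L2-normalize so that the dot product equals the cosine similarity
image_emb = F.normalize(image_features @ W_image, dim=-1)
text_emb = F.normalize(text_features @ W_text, dim=-1)

# Pairwise cosine similarities, scaled by the learned temperature
logits = image_emb @ text_emb.T / log_temperature.exp()

# The i-th image belongs to the i-th text, so the matching pairs lie on the diagonal
targets = torch.arange(batch_size)
loss_image = F.cross_entropy(logits, targets)    # image-to-text direction
loss_text = F.cross_entropy(logits.T, targets)   # text-to-image direction
loss = (loss_image + loss_text) / 2
loss.backward()

Maximizing the diagonal entries of the scaled similarity matrix while suppressing the off-diagonal ones is exactly the symmetric contrastive objective described above.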
Once the model is trained, it can be used to perform object classification on images. The question is: how do you perform classification with a model that has neither been trained to classify images nor takes class labels as input, but text prompts? Fig. 2 shows how.

A class label can be seen as a text prompt formed by a single word. To tell the model which classes are available for the classification task, a set of N classes is input into the model. This is a huge advantage compared to classification models trained on a fixed set of labels: we can now input 3 classes or 100; it is our choice. As we will see later, to improve the performance of CLIP, the class label is transformed into a prompt to provide further context to the model. Each prompt is fed to the text encoder and transformed into an embedding vector. The input image is fed into the image encoder to obtain its embedding vector. Then the cosine similarity is calculated for each pair of text and image embeddings. A softmax is applied to the obtained similarity values to form a probability distribution, and finally the class with the highest probability is selected as the prediction. (A code sketch of this zero-shot procedure is shown further below.)

The CLIP paper presents a vast number of experiments and ablations. Here we will cover five findings that I think are important for understanding the success of CLIP, each summarized by a takeaway as formulated by the authors.

During training, the image encoder and the text encoder are trained jointly, meaning with a single training objective and at the same time. Not only does CLIP follow a contrastive learning scheme, but the text prompts are compared as a whole against a given image, hence the order of words does not matter. It is simply a “bag of words”: the phrase “my name is Sascha” results in the same embedding as “Sascha name is my”. Predicting a bag of words instead of the correct words and their positions in a phrase is a much easier proxy objective. Fig. 3 below shows the zero-shot accuracy on ImageNet over the number of training samples for an initial transformer model trained to predict the exact words, the same transformer trained to predict a bag of words, and the CLIP model that performs contrastive learning using a bag of words.

“CLIP is much more efficient at zero-shot transfer than our image caption baseline” — CLIP Authors

As we have seen in Fig. 2, to perform object classification the class label is converted into a text prompt. Of course, this was not by chance, since CLIP would be totally fine with a single word: it was done to leverage the descriptiveness of language and to provide context that resolves possible ambiguities. Take the word “boxer”, for example: it could be a breed of dog or a type of athlete. The authors of CLIP have shown that the format of the text prompt matters a lot and can boost performance as well as increase efficiency.

“Prompt engineering and ensembling improve zero-shot performance” — CLIP Authors

In another experiment, the authors compared the zero-shot image classification performance of CLIP against models that were trained specifically on the datasets under comparison.

“Zero-shot CLIP is competitive with fully supervised baseline” — CLIP Authors
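To tie Fig. 2 and the prompt engineering discussion together, here is a minimal sketch of the zero-shot classification procedure, including prompt ensembling. The two encoders are replaced by dummy functions, and all class names, templates, dimensions and the similarity scale are purely illustrative.

import torch
import torch.nn.functional as F

# Hypothetical class names and prompt templates. In practice the templates are
# tuned per dataset ("prompt engineering") and several are averaged ("ensembling").
class_names = ["boxer (dog)", "boxer (athlete)", "cat"]
templates = ["a photo of a {}.", "a blurry photo of a {}.", "a close-up photo of a {}."]

d_emb = 256  # illustrative embedding size

def encode_text(prompt: str) -> torch.Tensor:
    """Stand-in for the CLIP text encoder plus projection."""
    torch.manual_seed(abs(hash(prompt)) % 2**31)  # deterministic dummy embedding
    return F.normalize(torch.randn(d_emb), dim=-1)

def encode_image(image) -> torch.Tensor:
    """Stand-in for the CLIP image encoder plus projection."""
    return F.normalize(torch.randn(d_emb), dim=-1)

# One ensembled text embedding per class: average over templates, then re-normalize
class_embeddings = []
for name in class_names:
    prompt_embeddings = torch.stack([encode_text(t.format(name)) for t in templates])
    class_embeddings.append(F.normalize(prompt_embeddings.mean(dim=0), dim=-1))
class_embeddings = torch.stack(class_embeddings)   # [num_classes, d_emb]

image_embedding = encode_image(None)               # [d_emb]

# Cosine similarity between the image and every class prompt, then a softmax
similarities = class_embeddings @ image_embedding            # [num_classes]
probabilities = (similarities / 0.01).softmax(dim=-1)        # fixed scale stands in for the learned temperature
prediction = class_names[probabilities.argmax().item()]
print(prediction, probabilities.tolist())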
While zero-shot predictors are not fine-tuned on the downstream task, few-shot predictors are. The authors experimented with multiple publicly available pre-trained models and compared their few-shot performance on 20 different datasets against zero-shot and few-shot CLIP. The few-shot models were fine-tuned on 1, 2, 4, 8 and 16 examples per class. Interestingly, zero-shot CLIP performs roughly as well as 4-shot CLIP. When comparing CLIP to other models, one must also consider that the publicly available models under comparison (i.e. BiT, SimCLR and ResNet) have been pre-trained on different and smaller datasets than the CLIP model.

“Zero-shot CLIP outperforms few-shot linear probes” — CLIP Authors

(A sketch of such a linear probe on frozen CLIP features is shown at the end of this article.)

Generally speaking, a model’s robustness towards distribution shift refers to its capability to perform as well on data from a different data distribution as on the distribution it was trained on. Ideally, it would perform equally well; in reality, its performance drops. The robustness of zero-shot CLIP was compared to a ResNet101 ImageNet model: both models are evaluated on natural distribution shifts of ImageNet, as depicted in Fig. 7.

“Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models” — CLIP Authors

As mentioned at the beginning of this article, CLIP has been widely adopted by a vast number of projects, among them UnCLIP, EVA, SAM, Stable Diffusion, GLIDE and VQGAN-CLIP. If you want to dive into the implementation and test it yourself, the official repository at https://github.com/OpenAI/CLIP is a good starting point.
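As a quick hands-on example, here is a minimal zero-shot classification sketch based on the usage example in that repository. The image path and the candidate class names are placeholders.

import clip   # installed from https://github.com/OpenAI/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate classes
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
class_names = ["dog", "cat", "car"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    # The model returns the temperature-scaled similarity logits for the
    # image against every prompt; a softmax turns them into probabilities.
    logits_per_image, logits_per_text = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1).cpu()

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")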

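Relatedly, the few-shot linear probes mentioned above boil down to fitting a simple classifier on frozen CLIP image features. The sketch below follows the linear-probe evaluation example from the official repository, using CIFAR-100 purely as a small stand-in dataset; in the paper's few-shot setting the training set would be restricted to 1 to 16 labeled examples per class.

import numpy as np
import torch
import clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Small stand-in dataset; the root path is a placeholder
train_set = CIFAR100(root="./data", download=True, train=True, transform=preprocess)
test_set = CIFAR100(root="./data", download=True, train=False, transform=preprocess)

def get_features(dataset):
    """Encode all images with the frozen CLIP image encoder."""
    features, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=100):
            features.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(features), np.concatenate(labels)

train_x, train_y = get_features(train_set)
test_x, test_y = get_features(test_set)

# The "linear probe": a logistic regression classifier on the frozen features
# (the regularization strength is taken from the repository example)
probe = LogisticRegression(C=0.316, max_iter=1000)
probe.fit(train_x, train_y)
print(f"Linear probe accuracy: {probe.score(test_x, test_y):.3f}")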

