
Large Language Models: TinyBERT — Distilling BERT for NLP

Vyacheslav Efimov, Towards Data Science

In recent years, the evolution of large language models has skyrocketed. BERT became one of the most popular and efficient models, solving a wide range of NLP tasks with high accuracy. After BERT, a set of other models appeared on the scene, demonstrating outstanding results as well.

An obvious trend is that, over time, large language models (LLMs) tend to become more complex, with the number of parameters and the amount of training data growing exponentially. Research in deep learning has shown that such scaling usually leads to better results. Unfortunately, the machine learning world has already run into several problems regarding LLMs, and scalability has become the main obstacle to training, storing and using them effectively.

With this issue in mind, special methods have been elaborated for compressing LLMs. In this article, we will focus on Transformer distillation, which led to the development of a small version of BERT called TinyBERT. Additionally, we will walk through the learning process in TinyBERT and several subtleties that make TinyBERT so robust. This article is based on the official TinyBERT paper.

We have already covered how distillation works in DistilBERT: in short, the loss function is modified so that the predictions of the student and the teacher become similar. In DistilBERT, the loss function compares the output distributions of the student and the teacher and also takes into account the output embeddings of both models (for the similarity loss).

On the surface, the distillation framework in TinyBERT does not differ much from that of DistilBERT: the loss function is again modified to make the student imitate the teacher. However, TinyBERT goes a step further: its loss function takes into consideration not only WHAT both models produce but also HOW the predictions are obtained. According to the paper, the TinyBERT loss function consists of three components that cover different aspects of both models:

1. the output of the embedding layer;
2. the hidden states and attention matrices derived from the Transformer layers;
3. the logits output by the prediction layer.

What is the point of comparing the hidden states of both models? Including the outputs of the hidden states and the attention matrices makes it possible for the student to learn the hidden layers of the teacher, thus constructing layers similar to those of the teacher. This way, the distilled model not only imitates the output of the original model but also its inner behaviour.

Why is it important to replicate the teacher's behaviour? The researchers claim that the attention weights learned by BERT can be beneficial for capturing language structure. Therefore, distilling them into another model also gives the student more chances to gain linguistic knowledge.

Representing a smaller BERT version, TinyBERT has fewer encoder layers. Let us denote the number of BERT layers by N and the number of TinyBERT layers by M. Since the numbers of layers differ, it is not obvious how the distillation loss should be computed.

For this purpose, a special function n = g(m) is introduced, which defines the BERT layer n whose knowledge is distilled into the corresponding TinyBERT layer m. The chosen BERT layers are then used for the loss calculation during training.

The function n = g(m) has two boundary constraints:

- g(0) = 0: the embedding layer of TinyBERT (index 0) is mapped to the embedding layer of BERT;
- g(M + 1) = N + 1: the prediction layer of TinyBERT (index M + 1) is mapped to the prediction layer of BERT.

For all other TinyBERT layers 1 ≤ m ≤ M, the corresponding values n = g(m) need to map each student layer to one of the teacher's Transformer layers. For now, let us suppose that such a function is defined; the concrete mapping used by TinyBERT will be studied later in this article.

1. Embedding-layer distillation

Before raw input is passed to the model, it is first tokenized and then mapped to learned embeddings. These embeddings are used as the input to the first layer of the model, and all of them can be expressed in the form of a matrix. To measure how different the student and teacher embeddings are, a standard regression metric can be applied to their respective embedding matrices E; Transformer distillation uses MSE.

Since the student and teacher embedding matrices have different sizes, they cannot be compared element-wise with MSE directly. That is why the student embedding matrix is multiplied by a learnable weight matrix W, so that the resulting matrix has the same shape as the teacher embedding matrix.

Since the embedding spaces of the student and the teacher are different, the matrix W also plays an important role in linearly transforming the embedding space of the student into that of the teacher.
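As a rough illustration of this step, here is a minimal PyTorch-style sketch of the embedding-layer loss. It is not the authors' code: the dimensions and the name W_e are assumptions chosen for the example. The student embeddings are projected by a learnable matrix and compared to the teacher embeddings with MSE.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only (e.g. 312 for the student, 768 for BERT base).
d_student, d_teacher = 312, 768
batch, seq_len = 8, 128

# Learnable projection W_e that maps student embeddings into the teacher's space.
W_e = nn.Linear(d_student, d_teacher, bias=False)

def embedding_distillation_loss(E_student, E_teacher):
    """MSE between the projected student embeddings and the teacher embeddings."""
    return nn.functional.mse_loss(W_e(E_student), E_teacher)

# Toy tensors standing in for the embedding-layer outputs of both models.
E_s = torch.randn(batch, seq_len, d_student)
E_t = torch.randn(batch, seq_len, d_teacher)
loss_emb = embedding_distillation_loss(E_s, E_t)
```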
2A. Attention-layer distillation

At its core, the multi-head attention mechanism in the Transformer produces several attention matrices containing rich linguistic knowledge. By transferring the attention weights from the teacher, the student can also learn important language concepts. To implement this idea, a loss function is used that measures the differences between the student's and the teacher's attention weights.

In TinyBERT, all the attention layers of the student are considered, and the resulting loss value for each layer equals the sum of MSE values between the respective student and teacher attention matrices over all heads.

The attention matrices A used for attention-layer distillation are the unnormalized scores, rather than their softmax output softmax(A). According to the researchers, this subtlety leads to faster convergence and improved performance.

2B. Hidden-layer distillation

Following the idea of capturing rich linguistic knowledge, distillation is applied to the outputs of the Transformer layers as well. Here, the weight matrix W plays the same role as the one described above for embedding-layer distillation: it projects the student's hidden states into the teacher's space. (A code sketch covering both transformer-layer terms appears further below.)

3. Prediction-layer distillation

Finally, to make the student reproduce the output of the teacher, the prediction-layer loss is considered. It consists of computing the cross-entropy between the logit vectors predicted by both models.

Sometimes the logits are divided by a temperature parameter T, which controls the smoothness of the output distribution. In TinyBERT, the temperature T is set to 1.

In TinyBERT, each layer has its own loss function based on its type. To give some layers more or less importance, the corresponding loss values are multiplied by a constant a. The ultimate loss function equals a weighted sum of the loss values over all TinyBERT layers.

Numerous experiments showed that, among the three loss components, the transformer-layer distillation loss has the highest impact on the model's performance.

It is important to note that most NLP models (including BERT) are developed in two stages:

1. pre-training on a large unlabeled corpus;
2. fine-tuning on a downstream task.

Following the same paradigm, the researchers developed a framework in which the TinyBERT learning process also consists of two stages: general distillation and task-specific distillation. In both training stages, Transformer distillation is used to transfer BERT knowledge to TinyBERT.

A special data augmentation technique was elaborated for task-specific distillation. It consists of taking sequences from a given dataset and substituting a percentage of their words in one of two ways:

- if a word is tokenized into a single word piece, it is masked and BERT, used as a masked language model, proposes candidate replacements;
- otherwise, the closest words in the GloVe embedding space are used as candidates.
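Before moving on, the loss terms described above can be made concrete with a few short sketches. The one below covers the transformer-layer terms, under assumed shapes and with made-up names (W_h, A_s, H_s and so on); it mirrors the description rather than the authors' implementation. The attention term compares the unnormalized attention score matrices head by head, and the hidden-state term reuses the learnable projection trick from the embedding loss.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
d_student, d_teacher, heads, batch, seq_len = 312, 768, 12, 8, 128

# Learnable projection for hidden states, analogous to W_e in the embedding sketch.
W_h = nn.Linear(d_student, d_teacher, bias=False)

def attention_distillation_loss(A_student, A_teacher):
    """MSE between the unnormalized attention score matrices (no softmax),
    averaged over heads and positions. Shapes: (batch, heads, seq_len, seq_len)."""
    return nn.functional.mse_loss(A_student, A_teacher)

def hidden_distillation_loss(H_student, H_teacher):
    """MSE between the projected student hidden states and the teacher hidden states."""
    return nn.functional.mse_loss(W_h(H_student), H_teacher)

# Toy tensors for one mapped pair of layers: student layer m and teacher layer g(m).
A_s = torch.randn(batch, heads, seq_len, seq_len)
A_t = torch.randn(batch, heads, seq_len, seq_len)
H_s = torch.randn(batch, seq_len, d_student)
H_t = torch.randn(batch, seq_len, d_teacher)

transformer_layer_loss = attention_distillation_loss(A_s, A_t) + hidden_distillation_loss(H_s, H_t)
```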
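The prediction-layer term and the overall objective can be sketched in the same spirit. The soft cross-entropy below uses the teacher's softened distribution as the target (with T = 1, as in TinyBERT), and the layer weights are hypothetical placeholders that simply show how the per-layer losses are combined into a weighted sum.

```python
import torch
import torch.nn.functional as F

def prediction_distillation_loss(logits_student, logits_teacher, T=1.0):
    """Soft cross-entropy between the teacher's and the student's logits,
    softened by the temperature T (T = 1 in TinyBERT)."""
    teacher_probs = F.softmax(logits_teacher / T, dim=-1)
    student_log_probs = F.log_softmax(logits_student / T, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Hypothetical layer weights a_m: one for the embedding layer (m = 0), one per student
# transformer layer (m = 1..M) and one for the prediction layer (m = M + 1).
M = 4
layer_weights = [1.0] * (M + 2)

def tinybert_loss(per_layer_losses, weights=layer_weights):
    """Weighted sum of the per-layer distillation losses."""
    return sum(w * l for w, l in zip(weights, per_layer_losses))

# Toy example for a binary classification head.
logits_s, logits_t = torch.randn(8, 2), torch.randn(8, 2)
loss_pred = prediction_distillation_loss(logits_s, logits_t)
```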
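The augmentation procedure itself can be sketched in a heavily simplified form as follows. The get_candidates function is a hypothetical stand-in for the actual candidate sources (BERT masked-word predictions or GloVe nearest neighbours), and the replacement probability and number of copies are illustrative defaults rather than prescribed values.

```python
import random

# Hypothetical stand-in for the real candidate sources: in the actual procedure the
# candidates come from BERT's masked-word predictions (single-piece words) or from
# GloVe nearest neighbours (multi-piece words). Here it returns the word itself so
# that the sketch stays runnable.
def get_candidates(word, k=5):
    return [word] * k

def augment_sentence(words, replace_prob=0.4, n_copies=20):
    """Create several augmented copies of a sentence by randomly replacing words
    with one of their candidate substitutes."""
    augmented = []
    for _ in range(n_copies):
        new_words = [random.choice(get_candidates(w)) if random.random() < replace_prob else w
                     for w in words]
        augmented.append(" ".join(new_words))
    return augmented

print(augment_sentence("the movie was surprisingly good".split())[:3])
```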
Despite the considerable reduction in model size, the described data augmentation mechanism has a high impact on TinyBERT's performance, allowing it to learn from more diverse examples.

With only 14.5M parameters, TinyBERT is about 7.5x smaller than BERT base (a detailed architecture comparison is given in the original paper).

For the layer mapping, the authors propose a uniform strategy in which the mapping function assigns every TinyBERT layer to every third BERT layer: g(m) = 3 * m (for example, TinyBERT layer 2 distils knowledge from BERT layer 6). Other strategies were also studied (such as taking only the bottom or only the top BERT layers), but the uniform strategy showed the best results. This seems logical, since it transfers knowledge from different levels of abstraction, making the transferred information more varied.

Speaking of the training process, TinyBERT is trained on English Wikipedia (2,500M words) and uses mostly the same hyperparameters as BERT base.

Transformer distillation is a big step forward in natural language processing. Given that Transformer-based models are currently among the most powerful in machine learning, we can go further by applying Transformer distillation to compress them effectively. One of the best examples is TinyBERT, which is 7.5x smaller than BERT base.

Despite such a huge reduction in parameters, experiments show that TinyBERT performs comparably to BERT base: with a 77.0% score on the GLUE benchmark, TinyBERT is not far from BERT, whose score equals 79.5%. Obviously, this is an amazing achievement! Finally, other popular compression techniques, such as quantization or pruning, can be applied to TinyBERT to make it even smaller.

All images unless otherwise noted are by the author.


