
How to Train BERT for Masked Language Modeling Tasks

By Ransaka Ravihara, Towards Data Science

In recent years, large language models (LLMs) have taken most of the attention from the machine learning community. Before LLMs arrived, there was a crucial research phase on various language modeling techniques, including masked language modeling, causal language modeling, and sequence-to-sequence language modeling.

Among these, masked language models such as BERT became the most usable in downstream NLP tasks such as classification and clustering. Thanks to libraries such as Hugging Face Transformers, adapting these models for downstream tasks became accessible and manageable. And thanks to the open-source community, we have plenty of language models to choose from, covering widely used languages and domains.

When adapting existing language models to your specific use case, you can sometimes use an existing model without further tuning (so-called fine-tuning). For example, if you want an English sentiment/intent detection model, you can go to HuggingFace.co and find a suitable model for your use case.

However, this only works for some of the tasks encountered in the real world. That's where we need an additional technique: fine-tuning. First, you must choose a base model to fine-tune, and here you must pay attention to the lexical similarity between the selected model's language and your target language.

If you can't find a suitable model pre-trained on the desired language, consider building one from scratch. In this tutorial, we will implement a BERT model for masked language modeling.

Even though describing the BERT architecture is out of the scope of this tutorial, let's go through it very briefly for the sake of clarity. BERT, or Bidirectional Encoder Representations from Transformers, belongs to the encoder-only transformer family. It was introduced in 2018 by researchers at Google. Quoting the paper (https://arxiv.org/abs/1810.04805):

"We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement)."

In the abstract above, we can see an interesting keyword: bidirectional. This bidirectional nature gives BERT a human-like ability. Assume you have to fill in a blank like the one below:

"War may sometimes be a necessary evil. But no matter how necessary, it is always an _____, never a good."

To guess the word that goes in the blank, you need to be aware of a few things: the words before the blank, the words after it, and the overall context of the sentence. BERT works the same way. During training, we hide some words and ask BERT to predict them.

When training is finished, BERT can predict a masked token based on the words before and after it. To do this, the model must allocate different amounts of attention to the words in the input sequence, since some of them significantly influence the prediction of the masked token. As you can see here, the model identifies evil as a suitable word for the hidden position and attends to the first sentence's evil to make this prediction. This is a noticeable point: it implies that the model understands the context of the input sequence. This context awareness allows BERT to generate meaningful sentence embeddings for given tasks, and these embeddings can then be used in downstream tasks such as clustering and classification. Enough about BERT; let's build one from scratch.

We generally have BERT base and BERT large. Both have 64 dimensions per attention head. The large variant contains 24 encoder layers, while the base variant has only 12. However, we are not limited to these configurations. Surprisingly, we have complete control over defining the model with the Hugging Face Transformers library. All we have to do is define the desired model configuration using the BertConfig class. I chose 6 heads and 384 total model dimensions to comply with the original implementation: this way, each head has 64 dimensions, just like the original. Let's initialize our BERT model.
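A minimal sketch of this step is shown below. Apart from the 6 attention heads, the 384 hidden dimensions, and the 5,000-token vocabulary used later in this article, the remaining values (layer count, intermediate size, maximum sequence length) are illustrative assumptions.

```python
from transformers import BertConfig, BertForMaskedLM

# 384 hidden dimensions / 6 attention heads = 64 dimensions per head,
# matching the head size of the original BERT implementation.
config = BertConfig(
    vocab_size=5000,              # must match the tokenizer we train below
    hidden_size=384,
    num_attention_heads=6,
    num_hidden_layers=6,          # assumption; BERT base uses 12, BERT large 24
    intermediate_size=1536,       # conventionally 4 x hidden_size
    max_position_embeddings=512,
)

model = BertForMaskedLM(config)
print(f"Model parameters: {model.num_parameters():,}")
```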
Here, I won't describe how tokenization works under the hood. Instead, let's train a tokenizer from scratch using the Hugging Face Tokenizers library. Please note that the tokenizer used in the original BERT implementation is the WordPiece tokenizer, yet another subword-based tokenization method. You can learn more about it in the Hugging Face course (huggingface.co).

The dataset used here is the Sinhala-400M dataset (under apache-2.0). You can follow the same steps with any dataset you have. As you may notice, some Sinhalese words have been typed using English characters as well. Let's train a tokenizer for this corpus.

Let's import the necessary modules first. The good thing about training tokenizers with the Hugging Face Tokenizers library is that we can take an existing tokenizer and replace only its vocabulary (and merges, where applicable) based on our training corpus. This means tokenization steps such as pre-tokenization and post-tokenization are preserved. For this, we can use the train_new_from_iterator method of the BertTokenizerFast class.
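A sketch of that workflow, under a few assumptions, might look like the following: the corpus is loaded from a local text file whose name ("corpus.txt") is a placeholder for the Sinhala-400M data or your own corpus, bert-base-uncased serves as the starting tokenizer, and the batch size is an arbitrary choice.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the raw corpus. "corpus.txt" is a placeholder path; point it at the
# Sinhala-400M data or any plain-text corpus you want to train on.
dataset = load_dataset("text", data_files={"train": "corpus.txt"}, split="train")

def batch_iterator(batch_size=1000):
    # Yield batches of raw text for the tokenizer trainer.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from an existing WordPiece tokenizer and learn a new vocabulary from our
# corpus; pre- and post-tokenization behaviour is carried over unchanged.
base_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = base_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=5000)

# Quick sanity check on a mixed-language word.
print(tokenizer.tokenize("cricketer"))  # e.g. ['cricket', '##er']
```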
You can see words like 'cricketer' decomposed into cricket and ##er, indicating that the tokenizer has been adequately trained. That said, try out different vocab sizes; mine is 5,000, which is relatively small but suitable for this toy example. Finally, we can save the trained tokenizer into our directory.

Let's define a collator for the MLM task. Here, we will mask 15% of the tokens, although we could set a different masking probability.
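Both steps can be sketched as follows; the save directory name is an assumption.

```python
from transformers import DataCollatorForLanguageModeling

# Persist the tokenizer trained above (the directory name is an arbitrary choice).
tokenizer.save_pretrained("sinhala-bert-tokenizer")

# Dynamically mask 15% of the tokens in each batch; mlm_probability is freely adjustable.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)
```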
Let's tokenize the dataset using the tokenizer we just created. In place of the original LineByLineTextDataset, I'm using a custom class built on Hugging Face Accelerate.
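That custom Accelerate-based class isn't reproduced here; as a simpler stand-in, a plain datasets.map call performs the same tokenization (the max_length value is an assumption).

```python
def tokenize_function(examples):
    # Padding is left to the data collator; truncate to the model's maximum sequence length.
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],  # keep only the tokenizer outputs (input_ids, attention_mask, ...)
)
```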
Alright, let's set up the training loop. We can invoke the trainer using its train() method.
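A minimal Trainer setup along those lines could look like this; every hyperparameter below is an illustrative assumption that you should tune for your corpus and hardware.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="sinhala-bert-mlm",      # checkpoint directory; the name is an assumption
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=1e-4,
    logging_steps=500,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()
```

Once training is done, trainer.save_model() writes the final weights to output_dir, from where they can be reloaded with BertForMaskedLM.from_pretrained().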
After sufficient training, our model can be used for downstream tasks such as zero-shot classification and clustering. You can find an example in this Hugging Face space (huggingface.co).

With limited resources, pre-trained models may only recognize specific linguistic patterns, but they can still be helpful for particular use cases. It is highly recommended to fine-tune them when possible.

In this article, all images, unless otherwise noted, are by the author.