
Transforming text into vectors: TSDAE’s unsupervised approach to enhanced embeddings

By Silvia Onofrei, Towards Data Science

Combine TSDAE pre-training on a target domain with supervised fine-tuning on a general-purpose corpus to enhance the quality of the embeddings for a specialized domain.

Embeddings encode text into high-dimensional vector spaces, using dense vectors to represent words and capture their semantic relationships. Recent developments in generative AI and LLMs, such as contextual search and RAG, rely heavily on the quality of the underlying embeddings. While similarity search rests on basic mathematical concepts such as cosine similarity, the methods used to build the embedding vectors significantly influence the downstream results.

In most cases, a pre-trained sentence transformer will work out of the box and provide reasonable results. There are many BERT-based pre-trained contextual embeddings to choose from, some of them domain specialized, and they are available for download from platforms such as HuggingFace.

The issues arise when the corpus contains many technical terms specific to a narrow domain or originates from a low-resource language. In these cases we need to address the unknown words that were not seen during pre-training or fine-tuning. For example, a model pre-trained on general text will have a hard time assigning proper vectors to titles from a corpus of mathematical research papers. Since the model was not exposed to the domain-specific words, it struggles to determine their meaning and to place them accurately in the vector space relative to the other words in the corpus. The more unknown words there are, the bigger the impact and the lower the performance of the model. Hence an out-of-the-box pre-trained model will underperform in such scenarios, while pre-training a custom model from scratch is hampered by the lack of labeled data and the significant computational resources required.

This work was prompted by recent research [aviation_article] focusing on the aviation domain, whose data has unique characteristics such as technical jargon, abbreviations and unconventional grammar. To address the lack of labeled data, the authors employed one of the most effective unsupervised techniques for pre-training embeddings, TSDAE, followed by a fine-tuning stage that uses labeled data from a general-purpose corpus. The adapted sentence transformers outperform the general-purpose ones, demonstrating the effectiveness of the approach in capturing the characteristics of the aviation data.

Domain adaptation is about tailoring text embeddings to a specific domain without the need for labeled training data. In this experiment, I use a two-step approach which, according to [tsdae_article], works better than training on the target domain alone.

Firstly, I pre-train on the target domain, a phase often termed adaptive pre-training. It only requires a collection of sentences from our dataset. I employ TSDAE for this stage, a method that excels in domain adaptation as a pre-training task and significantly surpasses other methods, including the masked language model, as emphasized in [tsdae_article]. I closely follow the script train_tsdae_from_file.py.

Subsequently, I fine-tune the model on the generic labeled AllNLI dataset, employing a multiple negatives ranking loss strategy. For this stage I use the script training_nli_v2.py.
As documented in [tsdae_article], this additional step not only counters over-fitting but also significantly improves the model's performance.

TSDAE (Transformer-based Sequential Denoising Auto-Encoder) is an unsupervised sentence embedding method that was first introduced by K. Wang, N. Reimers and I. Gurevych in [tsdae_article]. TSDAE uses a modified encoder-decoder transformer design in which the keys and values of the cross-attention are confined to the sentence embedding. I'll outline the details in the context of the optimal architecture choices highlighted in the original paper [tsdae_article].

For good reconstruction quality, the sentence embedding produced by the encoder must optimally capture the semantics. A pre-trained transformer such as bert-base-uncased is used for the encoder, while the decoder's weights are copied from it. The decoder's attention mechanism is restricted to the sentence representation produced by the encoder. This modification of the original transformer encoder-decoder architecture limits the information the decoder can retrieve from the encoder and introduces a bottleneck that forces the encoder to produce meaningful sentence representations. At inference time, only the encoder is used to create sentence embeddings.

The model is trained to reconstruct the clean sentence from the corrupted one, which is accomplished by maximizing the objective (in the notation of [tsdae_article]):

J_SDAE(θ) = E_{x∼D} [ log P_θ(x | x̃) ],

where x is a sentence from the training corpus D, x̃ is its corrupted version, and the decoder factorizes this probability token by token while attending only to the fixed sentence embedding.

Natural Language Inference (NLI) determines the relationship between two sentences. It categorizes the truth of the hypothesis (the second sentence) as entailment (true given the premise), contradiction (false given the premise), or neutral (neither guaranteed nor contradicted by the premise). NLI datasets are large labeled datasets in which pairs of sentences are annotated with their relationship class. For this experiment, I use the AllNLI dataset, a collection of more than 900K records combining the Stanford Natural Language Inference (SNLI) and MultiNLI datasets. This dataset can be downloaded from the AllNLI download site.

To build our domain-specific data, we use the Kaggle arXiv dataset, comprising roughly 1.7M scholarly STEM papers sourced from the established electronic preprint platform arXiv. Besides title, abstract and authors, there is a significant amount of metadata associated with each article; here, however, we are concerned only with the titles.

After the download, I select the mathematics preprints. Given the hefty size of the Kaggle file, I've added a reduced version of the mathematics papers file to GitHub for easier access. However, if you're inclined towards a different subject, download the dataset and replace math with your desired topic, as in the data-preparation sketch below.

I've loaded our dataset into a Pandas dataframe df. A quick inspection shows that the reduced dataset contains 55,497 preprints, a more practical size for our experiment. While [tsdae_article] suggests that around 10K entries are adequate, I'll keep the entire reduced dataset. Mathematics titles may contain LaTeX code, which I swap for ISO code to optimize processing.

I'll use the parsed_title entries for training, so let's extract them as a list. Next, we form the corrupted sentences by removing approximately 60% of the tokens from each entry; both steps appear in the data-preparation sketch below.
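To make the data-preparation steps concrete, here is a minimal sketch. The file name and the filtering line are my own assumptions rather than the author's exact code, and the parsed_title column is assumed to already hold the LaTeX-cleaned titles; DenoisingAutoEncoderDataset from sentence-transformers deletes roughly 60% of the tokens of each sentence by default.

```python
import pandas as pd
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Load the reduced mathematics subset (hypothetical file name); parsed_title
# is assumed to already contain the LaTeX-cleaned titles.
df = pd.read_json("arxiv_math_titles.json", lines=True)

# If you start from the full Kaggle metadata file instead, you could filter
# on the "categories" field and swap "math" for another topic, e.g.:
# df = df_full[df_full["categories"].str.contains("math")]

# The cleaned titles are the unlabeled, in-domain training sentences
train_sentences = df["parsed_title"].tolist()

# DenoisingAutoEncoderDataset pairs each clean sentence with a corrupted
# copy; its default noise function deletes about 60% of the tokens.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
```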
If you're interested in exploring further or trying different deletion ratios, check out the denoising script. Inspecting one entry after processing shows, for instance, that the words Bethe equations and model were removed from the initial title. The last step in our data processing is to load the dataset in batches; this appears near the top of the pre-training sketch below.

While I follow the approach from train_tsdae_from_file.py, I construct the model step by step for better understanding. Start by selecting a pre-trained transformer checkpoint and stick with the default option, bert-base-uncased. Choose CLS as the pooling method and specify the dimension of the vectors to be constructed. Next, build the sentence transformer by combining the two layers. Lastly, specify the loss function and tie the encoder-decoder parameters for the training phase. Now we're set to invoke the fit method, train the model, and store it for the subsequent steps; these steps are gathered in the pre-training sketch below. You're welcome to tweak the hyperparameters to optimize your experiment. The pre-training stage took about 15 minutes on a Google Colab Pro instance with an A100 GPU set on High-RAM.

For the supervised stage, start by downloading the AllNLI dataset, then unzip the file and parse the data for training; the training set has about 563K samples. Finally, use a special loader that loads the data in batches and avoids duplicates within a batch, as in the fine-tuning sketch below. The batch size I use here is smaller than the default of 128 from the script. Although a larger batch would give better results, it would require more GPU memory, and since I am limited by my computational resources, I chose a smaller batch size. Finally, fine-tune the pre-trained model on the AllNLI dataset using MultipleNegativesRankingLoss, where entailment pairs are positives and contradiction pairs serve as hard negatives. I fine-tuned the model on the entire dataset, which took about 40 minutes on Google Colab Pro for one epoch with a batch size of 32.

I will conduct some preliminary evaluation on the STS (semantic textual similarity) dataset from HuggingFace, using the EmbeddingSimilarityEvaluator, which returns the Spearman rank correlation. However, these evaluations do not involve the specific domain I am focusing on, so they may not showcase the model's true performance; for details see Section 4 in [tsdae_article]. I start by downloading the dataset from HuggingFace and selecting the validation split. Each entry of the resulting Dataset object has four features: an index, two sentences and a label created by a human annotator. The label takes values between 0 and 5 and measures the similarity of the two sentences (with 5 being most similar); in the entry I inspected, the two sentences are on completely different topics.

To evaluate a model, sentence embeddings are created for each pair of sentences and the cosine similarity of each pair is computed; the Spearman rank correlation between the labels and the similarity scores serves as the evaluation score. Since cosine similarity takes values between 0 and 1, I first normalize the labels, then wrap the data in the InputExample class from sentence-transformers and create the evaluator with the EmbeddingSimilarityEvaluator class from the same library. I compute the scores for the TSDAE model, for the fine-tuned model and for a couple of pre-trained sentence transformers; see the evaluation sketch below.
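Pulling the pre-training steps together, this stage might look roughly like the sketch below. It follows the structure of train_tsdae_from_file.py and reuses the train_dataset built in the data-preparation sketch; the batch size, learning rate and output path are my own choices, not necessarily the author's.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses

model_name = "bert-base-uncased"

# Encoder: a pre-trained transformer with CLS pooling on top
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), "cls"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Batch the (corrupted, clean) pairs from the data-preparation sketch
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)

# Denoising auto-encoder loss; the decoder is initialized from the same
# checkpoint and its parameters are tied to the encoder
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

# Train and store the domain-adapted model for the fine-tuning stage
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("output/tsdae-bert-base-uncased-math")  # hypothetical path
```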
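For the fine-tuning stage, a simplified sketch in the spirit of training_nli_v2.py follows. The download URL and parsing logic mirror the sentence-transformers example script; the local paths are assumptions, the batch size of 32 reflects the article's setup, and the grouping code is a pared-down version of the original (which also adds the reversed anchor-positive pairs).

```python
import csv
import gzip
import os

from sentence_transformers import SentenceTransformer, InputExample, losses, util
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Download AllNLI (SNLI + MultiNLI combined) if not already present
nli_dataset_path = "data/AllNLI.tsv.gz"
if not os.path.exists(nli_dataset_path):
    util.http_get("https://sbert.net/datasets/AllNLI.tsv.gz", nli_dataset_path)

# Group the training split by premise: entailments become positives,
# contradictions become hard negatives for the same anchor
train_data = {}
with gzip.open(nli_dataset_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["split"] != "train":
            continue
        anchor = row["sentence1"]
        entry = train_data.setdefault(
            anchor, {"entailment": set(), "contradiction": set(), "neutral": set()}
        )
        entry[row["label"]].add(row["sentence2"])

train_samples = [
    InputExample(texts=[anchor, pairs["entailment"].pop(), pairs["contradiction"].pop()])
    for anchor, pairs in train_data.items()
    if pairs["entailment"] and pairs["contradiction"]
]

# Load the TSDAE-adapted model and fine-tune it; NoDuplicatesDataLoader
# avoids duplicate sentences within a batch, which matters for this loss
model = SentenceTransformer("output/tsdae-bert-base-uncased-math")
train_dataloader = NoDuplicatesDataLoader(train_samples, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),
    show_progress_bar=True,
)
model.save("output/tsdae-bert-base-uncased-math-allnli")  # hypothetical path
```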
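Finally, a sketch of the preliminary evaluation. The article only says it uses an STS dataset from HuggingFace; since it describes four features (an index, two sentences and a human-annotated label between 0 and 5), I assume the GLUE STS-B validation split here, so treat the dataset name and column names as assumptions.

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Validation split of an STS dataset (assumed to be GLUE STS-B, whose rows
# carry idx, sentence1, sentence2 and a 0-5 similarity label)
sts = load_dataset("glue", "stsb", split="validation")

# Normalize the 0-5 labels to [0, 1] and wrap each pair in an InputExample
# (a sentence-transformers class)
dev_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["label"] / 5.0)
    for row in sts
]

# The evaluator embeds both sentences of each pair, computes their cosine
# similarity and reports the Spearman rank correlation against the labels
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-dev")

for name in [
    "output/tsdae-bert-base-uncased-math",         # TSDAE pre-training only
    "output/tsdae-bert-base-uncased-math-allnli",  # TSDAE + AllNLI fine-tuning
    "all-mpnet-base-v2",                           # general-purpose baseline
    "bert-base-uncased",                           # untouched starting checkpoint
]:
    print(name, evaluator(SentenceTransformer(name)))
```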
Thus, on a general-scope dataset, some pre-trained models, such as all-mpnet-base-v2, outperform the TSDAE fine-tuned model. However, through pre-training, the performance of the initial model bert-base-uncased more than doubled. It is conceivable that better results could be attained by further tweaking the fine-tuning hyperparameters.

For low-resource domains, TSDAE in conjunction with fine-tuning is a rather efficient strategy for building embeddings. The results obtained here are noteworthy given the amount of data and the computational means. However, for datasets that are not particularly unusual or domain specific, taking efficiency and cost into account, it might be preferable to choose a pre-trained embedding that provides comparable performance.

GitHub link to the Colab notebook and sample dataset.

And so, my friends, we should always embrace the good, the bad, and the messy on our learning journey!

[tsdae_article] K. Wang, N. Reimers and I. Gurevych, TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (2021), arXiv:2104.06979

[aviation_article] L. Wang et al., Adapting Sentence Transformers for the Aviation Domain (2023), arXiv:2305.09556


