
TaatikNet: Sequence-to-Sequence Learning for Hebrew Transliteration

Morris Alper | Towards Data Science

This article describes TaatikNet and how to easily implement seq2seq models. For code and documentation, see the TaatikNet GitHub repo. For an interactive demo, see TaatikNet on HF Spaces.

Many tasks of interest in NLP involve converting between texts in different styles, languages, or formats. Such tasks are known collectively as sequence-to-sequence (seq2seq) learning. In all of these tasks, the input and desired output are strings, which may be of different lengths and which are usually not in one-to-one correspondence with each other.

Suppose you have a dataset of paired examples (e.g. lists of sentences and their translations, or many examples of misspelled and corrected texts). Nowadays, it is fairly easy to train a neural network on these, as long as there is enough data for the model to learn to generalize to new inputs. Let's take a look at how to train seq2seq models with minimal effort, using PyTorch and the Hugging Face transformers library.

We'll focus on a particularly interesting use case: learning to convert between Hebrew text and Latin transliteration. We'll give an overview of this task below, but the ideas and code presented here are useful beyond this particular case; this tutorial should be useful for anyone who wants to perform seq2seq learning from a dataset of examples.

In order to demonstrate seq2seq learning with an interesting and fairly novel use case, we apply it to transliteration. In general, transliteration refers to converting between different scripts. While English is written with the Latin script ("ABC…"), the world's languages use many different writing systems. What if we want to use the Latin alphabet to write out a word from a language originally written in a different script? This challenge is illustrated by the many ways to write the name of the Jewish holiday of Hanukkah. The current introduction to its Wikipedia article reads:

Hanukkah (/ˈhɑːnəkə/; Hebrew: חֲנֻכָּה‎, Modern: Ḥanukka, Tiberian: Ḥănukkā) is a Jewish festival commemorating the recovery of Jerusalem and subsequent rededication of the Second Temple at the beginning of the Maccabean Revolt against the Seleucid Empire in the 2nd century BCE.

The Hebrew word חֲנֻכָּה‎ may be transliterated in Latin script as Hanukkah, Chanukah, Chanukkah, Ḥanukka, or one of many other variants. In Hebrew, as in many other writing systems, there are various conventions and ambiguities that make transliteration complex and not a simple one-to-one mapping between characters.

In the case of Hebrew, it is largely possible to transliterate text with nikkud (vowel signs) into Latin characters using a complex set of rules, though there are various edge cases that make this deceptively complex. Furthermore, attempting to transliterate text without vowel signs, or to perform the reverse mapping (e.g. Chanukah → חֲנֻכָּה), is much more difficult, since there are many possible valid outputs.

Thankfully, with deep learning applied to existing data, we can make great headway on this problem with only a minimal amount of code. Let's see how we can train a seq2seq model, TaatikNet, to learn how to convert between Hebrew text and Latin transliteration on its own. We note that this is a character-level task, since it involves reasoning about the correlations between individual characters in Hebrew text and its transliterations. We will discuss the significance of this in more detail below.

As an aside, you may have heard of UNIKUD, our model for adding vowel points to unvocalized Hebrew text. There are some similarities between these tasks, but the key difference is that UNIKUD performed character-level classification, where for each character we learned whether to insert one or more vowel symbols adjacent to it. By contrast, in our case the input and output texts may not exactly correspond in length or order due to the complex nature of transliteration, which is why we use seq2seq learning here (and not just per-character classification).

As with most machine learning tasks, we are fortunate if we can collect many examples of inputs and desired outputs of our model, so that we may train it using supervised learning. For many tasks regarding words and phrases, a great resource is Wiktionary and its multilingual counterparts (think Wikipedia meets dictionary). In particular, the Hebrew Wiktionary (ויקימילון) contains entries with structured grammatical information, including Latin transliteration (e.g. agvaniya, with stress marked in bold in the entry). Along with section titles containing nikkud (vowel characters), this gives us the freely licensed data that we need to train our model.

In order to create a dataset, we scrape these items using the Wikimedia REST API (example here). Please note that original texts in Wiktionary entries have permissive licenses for derivative works (CC and GNU licenses, details here) and require share-alike licensing (TaatikNet license here); in general, if you perform data scraping, make sure that you are using permissively licensed data, scraping appropriately, and using the correct license for your derivative work.
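To give a concrete (if simplified) picture of what this scraping involves, here is a minimal sketch that fetches the rendered HTML of a single Hebrew Wiktionary entry through the Wikimedia REST API. The entry title, User-Agent string, and absence of parsing logic are illustrative assumptions; this is not the exact code used to build the TaatikNet dataset.

```python
# Minimal sketch: fetch the rendered HTML of one Hebrew Wiktionary entry
# via the Wikimedia REST API. Title and headers are illustrative only.
import requests
from urllib.parse import quote

def fetch_entry_html(title: str) -> str:
    """Return the rendered HTML of a Hebrew Wiktionary entry."""
    url = f"https://he.wiktionary.org/api/rest_v1/page/html/{quote(title, safe='')}"
    response = requests.get(url, headers={"User-Agent": "taatiknet-example/0.1"})
    response.raise_for_status()
    return response.text

# Example title (hypothetical choice): the entry for a common noun.
html = fetch_entry_html("עגבנייה")
print(html[:500])  # inspect the markup before deciding how to parse it
```

In practice one would parse the returned HTML (for example with BeautifulSoup) to extract the vocalized headword and its transliteration, and rate-limit requests politely.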
We perform various preprocessing steps on this data. After data scraping and preprocessing, we are left with nearly 15k word–transliteration pairs (CSV file available here). The transliterations are by no means consistent or error-free; for example, stress is inconsistently and often incorrectly marked, and various spelling conventions are used (e.g. ח may correspond to h, kh, or ch). Rather than attempting to clean these, we will simply feed them directly to the model and have it make sense of them by itself.

Now that we have our dataset, let's get to the "meat" of our project: training a seq2seq model on our data. We call the final model TaatikNet after the Hebrew word תעתיק taatik, meaning "transliteration". We will describe TaatikNet's training at a high level here, but you are highly recommended to peruse the annotated training notebook. The training code itself is quite short and instructive.

To achieve state-of-the-art results on NLP tasks, a common paradigm is to take a pretrained transformer neural network and apply transfer learning by continuing to fine-tune it on a task-specific dataset. For seq2seq tasks, the most natural choice of base model is an encoder-decoder (enc-dec) model. Common enc-dec models such as T5 and BART are excellent for typical seq2seq tasks like text summarization, but because they tokenize text (split it into subword tokens, roughly words or chunks of words), they are less appropriate for our task, which requires reasoning on the level of individual characters. For this reason, we use the tokenizer-free ByT5 enc-dec model (paper, HF model page), which performs calculations on the level of individual bytes (roughly characters, but see Joel Spolsky's excellent post on Unicode and character sets for a better understanding of how Unicode glyphs map to bytes).

We first create a PyTorch Dataset object to encapsulate our training data. We could simply wrap the data from our dataset CSV file with no modifications, but we add some random augmentations to make the model's training procedure more interesting. These augmentations teach TaatikNet to accept either Hebrew script or Latin script as input and to produce the corresponding output in the other script; we also randomly drop vowel signs or accents to train the model to be robust to their absence. In general, random augmentation is a nice trick when you would like your network to learn to handle various types of inputs without calculating all possible inputs and outputs from your dataset ahead of time.
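The following is a rough sketch of how such augmentations might be wired into a PyTorch Dataset. It is not the exact class from the training notebook; the probability values, helper functions, and field names are illustrative.

```python
# Sketch of a Dataset with the random augmentations described above:
# direction swapping and random dropping of vowel signs / accents.
import random
import re
import unicodedata
from torch.utils.data import Dataset

# Approximate range of Hebrew niqqud and cantillation marks.
NIKKUD_PATTERN = re.compile("[\u0591-\u05C7]")

def strip_nikkud(text: str) -> str:
    return NIKKUD_PATTERN.sub("", text)

def strip_accents(text: str) -> str:
    # Remove combining marks (e.g. stress accents) from Latin transliterations.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

class TransliterationDataset(Dataset):
    def __init__(self, pairs, p_reverse=0.5, p_strip=0.2):
        self.pairs = pairs          # list of (hebrew, latin) string tuples
        self.p_reverse = p_reverse  # chance of swapping input/output direction
        self.p_strip = p_strip      # chance of dropping vowel signs / accents

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        hebrew, latin = self.pairs[idx]
        if random.random() < self.p_strip:
            hebrew = strip_nikkud(hebrew)
        if random.random() < self.p_strip:
            latin = strip_accents(latin)
        # Randomly choose the direction: Hebrew -> Latin or Latin -> Hebrew.
        if random.random() < self.p_reverse:
            return {"input_text": latin, "target_text": hebrew}
        return {"input_text": hebrew, "target_text": latin}
```

Returning plain input/target strings keeps the Dataset independent of the tokenizer; tokenization can then happen in the collation step.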
We load the base model with the Hugging Face pipeline API using a single line of code. After handling data collation and setting hyperparameters (number of epochs, batch size, learning rate), we train our model on our dataset and print out selected results after each epoch. The training loop is standard PyTorch, apart from the evaluate(…) function, which we define elsewhere and which prints out the model's current predictions on various inputs.
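The sketch below illustrates this setup, reusing the TransliterationDataset from the previous sketch. For clarity it loads the checkpoint directly with from_pretrained rather than the pipeline API, and the hyperparameters, collation details, sample data, and base checkpoint name ("google/byt5-small") are illustrative choices rather than the exact values used to train TaatikNet.

```python
# Sketch of standard PyTorch fine-tuning for a byte-level seq2seq model.
# Assumes the TransliterationDataset defined above; settings are illustrative.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small").to(device)

# Stand-in for the ~15k word/transliteration pairs loaded from the CSV.
pairs = [("שָׁלוֹם", "shalom"), ("חֲנֻכָּה", "chanuka")]
train_dataset = TransliterationDataset(pairs)

def collate(batch):
    # Tokenize inputs and targets at the byte level and pad within the batch.
    inputs = tokenizer([ex["input_text"] for ex in batch],
                       padding=True, return_tensors="pt")
    labels = tokenizer([ex["target_text"] for ex in batch],
                       padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # evaluate(...) would go here: print predictions on a few fixed inputs.

# Inference on a single word with beam search (5 beams).
model.eval()
inputs = tokenizer(["חנוכה"], return_tensors="pt").to(device)
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The generate call at the end uses beam search with 5 beams, matching the decoding strategy of the interactive demo described below.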
Comparing results from early epochs with those at the end of training: before training, the model outputs gibberish, as expected. During training, we see that the model first learns how to construct valid-looking Hebrew and transliterations, but takes longer to learn the connection between them. It also takes longer to learn rare items such as ג׳ (gimel + geresh) corresponding to j.

A caveat: we did not attempt to optimize the training procedure; the hyperparameters were chosen rather arbitrarily, and we did not set aside validation or test sets for rigorous evaluation. The purpose of this was only to provide a simple example of seq2seq training and a proof of concept of learning transliterations; however, hyperparameter tuning and rigorous evaluation would be a promising direction for future work, along with the points mentioned in the limitations section below.

The trained model converts between Hebrew text (with or without vowels) and Latin transliteration, in both directions. You may try playing with TaatikNet yourself at the interactive demo on HF Spaces. Note that it uses beam search (5 beams) for decoding, and inference is run on each word separately.

For the sake of simplicity, we implemented TaatikNet as a minimal seq2seq model without extensive tuning. However, if you are interested in improving results on conversion between Hebrew text and transliteration, there are many promising directions for future work, such as the hyperparameter tuning and more rigorous evaluation mentioned above. If you try these or other ideas and find that they lead to an improvement, I would be very interested in hearing from you, and crediting you here; feel free to reach out via my contact info below this article.

We have seen that it is quite easy to train a seq2seq model with supervised learning, teaching it to generalize from a large set of paired examples. In our case, we used a character-level model (TaatikNet, fine-tuned from the base ByT5 model), but nearly the same procedure and code could be used for a more standard seq2seq task such as machine translation.

I hope you have learned as much from this tutorial as I did from putting it together! Feel free to contact me with any questions, comments, or suggestions; my contact information may be found at my website, linked below.

Morris Alper, MSc is a PhD student at Tel Aviv University researching multimodal learning (NLP, Computer Vision, and other modalities). Please see his webpage for more information and contact info: https://morrisalp.github.io/


