Mastering Python RegEx: A Deep Dive into Pattern Matching

Nathan Rosidi
Towards Data Science

Regular expressions, often shortened to regex, serve as a potent instrument for handling text. In essence, they consist of a series of characters that establish a pattern for searching. This pattern can be used for a wide range of string manipulations, including matching patterns, replacing text, and splitting strings.

Mathematician Stephen Cole Kleene first introduced regular expressions in the 1950s as a notation to describe regular sets, or regular languages. Today, regular expressions have become an essential skill for programmers, data scientists, and IT professionals.

Before delving into how regular expressions can be used in Python, it is worth remembering how wide their range of applications is. With that motivation in place, let’s get started with the fundamentals of Python’s re module, which is all about regular expressions. In the next sections, we will cover more advanced topics.

Python provides built-in support for regular expressions via the re module. This module is part of Python’s standard library, which means you don’t have to install it separately; it comes with every Python installation.

The re module contains various functions and classes for working with regular expressions. Some of the functions are used for matching text, some for splitting text, and others for replacing text.

To start using regular expressions in Python, you need to import the re library first.
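Because re ships with Python, setup is a single import. Here is a minimal sketch that imports the module and runs one lookup to confirm it works. The sample sentence is made up for illustration and is not taken from the article.

```python
import re

# Hypothetical sample text, chosen only for this sketch.
text = "Python is popular. Many data scientists use Python daily."

# re.findall returns every non-overlapping match as a list of strings.
matches = re.findall("Python", text)
print(matches)
```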
A plain import re statement is all it takes. After the library is imported, you can start using the functions and classes provided by the re module.

Let’s start with a simple example. Say you want to find all occurrences of the word “Python” in a string. We can use the findall() function from the re module: called with the pattern “Python” and the text, it returns a list containing one entry for each occurrence.

There are many more functions in the re module that we can use to build more complex patterns. But first, let’s look at the common functions in the re module, which will make the remaining concepts easier to grasp. The re module includes many different functions, and by using them we can perform different operations. In the following parts, we will discover some of them.

The re.match() function checks whether the beginning of the string matches the regular expression. If there is a match, the function returns a match object; if not, it returns None. The example checks whether the string text starts with the word “Python” and prints the result to the console. The output shows that the pattern “Python” matches the beginning of the text.

In contrast to re.match(), the re.search() function scans the entire string for a match and returns a match object for the first one it finds. The example uses re.search() to look for the word “amazing” anywhere in the string text. If the word is found, we print it; otherwise, we print “No match found”. The output shows that our code catches “amazing” in the given text.

The re.findall() function collects all the non-overlapping matches of a pattern in the string and returns them as a list of strings. In the following example, we use the re.findall() function to find every “a” in the string.
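The three functions described so far can be sketched side by side. The sample string is an assumption made for illustration; the words “Python” and “amazing” and the letter “a” follow the article's description.

```python
import re

text = "Python is an amazing language"  # hypothetical sample string

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match("Python", text)
not_at_start = re.match("amazing", text)  # None: "amazing" is not at the start

# re.search scans the whole string and returns the first match it finds.
found = re.search("amazing", text)

# re.findall collects every non-overlapping match into a list of strings.
all_a = re.findall("a", text)

print(at_start.group() if at_start else "No match found")
print(not_at_start)
print(found.group() if found else "No match found")
print(all_a)
```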
The matches are returned as a list, which we then print to the console. The output contains all non-overlapping occurrences of the letter “a” found in our text.

The re.finditer() function resembles re.findall(), but it returns an iterator that yields match objects. In the example, re.finditer() is used to find all occurrences of the letter “a” in the string text, and we print the index and value of each match. The output shows the index of each “a” in the text.

The re.sub() function replaces every match of a pattern with another string. In the example, we use re.sub() to replace “Python” with “Java” and then print the modified string. The output shows that “Python” has been replaced with “Java” in our text.

In the next section, we will dive into the basic patterns that can be used in regular expressions to match a variety of text.

Regular expressions are constructed by combining literal characters, meta-characters, and quantifiers, so grasping these fundamental components is important for creating effective regular expressions.

Let’s start with literal characters. Literal characters are the simplest form of pattern matching in regular expressions. They match themselves exactly and have no special meaning. For example, the regular expression python will match the string python exactly. The output shows that our re.findall() function found all instances of the pattern “python”.

Meta-characters such as “.”, “^”, and “$” have special meanings and can be very important for manipulating strings. Let’s see.

The dot . is like a Joker card.
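Before going further with the dot, the re.finditer() and re.sub() calls described above can be sketched as follows. The sample string is an assumption made for illustration.

```python
import re

text = "Python is amazing"  # hypothetical sample string

# finditer yields match objects one at a time; each one knows
# where it starts and which text it matched.
positions = [(m.start(), m.group()) for m in re.finditer("a", text)]

# sub returns a new string with every match replaced; the original
# string is left unchanged.
replaced = re.sub("Python", "Java", text)

print(positions)
print(replaced)
```

With those two functions sketched, the discussion returns to the dot meta-character.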
The dot can stand in for any single character except a newline. In the example, we use the regular expression pattern “p.t”. The output shows that our code found every three-character sequence that starts with “p” and ends with “t”.

The caret ^ is used to check whether a string starts with a certain pattern. Let’s see an example. The code checks whether the text starts with “Hello”: if so, it prints “Match found:” followed by the match, and otherwise it prints “No match found”. The output shows that our code catches the “Hello” pattern at the beginning of the text.

The dollar sign $ is used to check whether a string ends with a certain pattern. The code checks whether the text ends with “world”, using the pattern world$: if so, it prints “Match found:” followed by the match, and otherwise “No match found”. The output shows that re.search() found the word “world” at the end of the text.

Quantifiers define how many times a character (or characters) should appear in the pattern you are trying to match. In this subsection, we look at the asterisk (*), the plus sign (+), the question mark (?), and curly braces ({}).

Let’s start with the asterisk. The asterisk (*) in a regular expression signifies that the previous character can appear zero or more times. In the example, we first define the pattern “py*” and then use the findall() function. The output includes every candidate, because the asterisk allows “y” to appear zero or more times.

The plus + matches one or more repetitions of the previous character. Here we again use the findall() function, this time with the pattern “py+”. As the output shows, the plus requires at least one “y” after “p”.

The question mark ? matches 0 or 1 repetition of the previous character.
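The meta-characters and quantifiers covered so far can be sketched in one place. The sample strings are assumptions made for illustration; the patterns “p.t”, “^Hello”, and “world$” follow the article's description, and “py*”, “py+”, and “py?” are the natural readings of its “py” examples.

```python
import re

# Dot: matches any single character except a newline (default flags).
dot_hits = re.findall(r"p.t", "pat pet p\nt pit")

# Caret and dollar: anchor the pattern to the start or end of the string.
greeting = "Hello, world"  # hypothetical sample
starts = re.search(r"^Hello", greeting)
ends = re.search(r"world$", greeting)
no_start = re.search(r"^world", greeting)  # None: "world" is not at the start

# Quantifiers apply to the character just before them.
sample = "p py pyy pyyy"  # hypothetical sample
star_hits = re.findall(r"py*", sample)      # zero or more "y"
plus_hits = re.findall(r"py+", sample)      # one or more "y"
optional_hits = re.findall(r"py?", sample)  # zero or one "y"

print(dot_hits)
print(starts.group(), ends.group(), no_start)
print(star_hits)
print(plus_hits)
print(optional_hits)
```

The last call shows the question mark at work: each match is either “p” or “py”, never longer.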
In other words, it makes the previous character optional. In the output, you can see that it only matches “p” and “py”, since the question mark allows “y” to appear either zero times or one time.

Curly braces {} allow you to match a specific number of repetitions. In this example, the pattern matches “pyy” and “pyyy” but not “py”, because we specified that we want exactly 2 or 3 “y” characters after “p”. A longer run such as “pyyyy” is not matched in full; at most the first three “y” characters are taken.

Special sequences can be used to build more complex patterns. Let’s see character classes first; in the following examples, we will see three of them.

Let’s start with \d and \D. “\d” matches digits (0 to 9); conversely, “\D” matches any character that is not a digit. In the example, “\d” scans through the text string and retrieves the digits from it. The output shows that we found all digits (0–9) in the text.

“\s” matches whitespace characters; conversely, “\S” matches anything that is not whitespace. In the example, the regular expression “\s” identifies all whitespace in the given text, such as spaces and tabs. We can see from the output that every whitespace character is identified.

“\w” matches word characters (letters, digits, and the underscore); “\W” is the opposite. In the example, “\w” retrieves all letters and digits from the text.

Predefined character classes offer shortcuts for common classes.
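The braces quantifier and the shorthand classes described above can be sketched as follows. All sample strings are assumptions made for illustration, and “py{2,3}” is an assumed spelling of the “2 or 3” pattern the article describes.

```python
import re

# Braces: between 2 and 3 "y" characters after "p".
# In "pyyyy" only the first three "y" characters are consumed.
braces_hits = re.findall(r"py{2,3}", "p py pyy pyyy pyyyy")

note = "Room 42,\tfloor 7"  # hypothetical sample containing a tab

digits = re.findall(r"\d", note)          # single digits
non_digits = re.findall(r"\D", "a1b2")    # everything that is not a digit
spaces = re.findall(r"\s", note)          # spaces, tabs, newlines, ...
word_chars = re.findall(r"\w", "a_1 b!")  # letters, digits, underscore

print(braces_hits)
print(digits, non_digits)
print(spaces)
print(word_chars)
```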
For example, “\d” is a predefined character class that represents digits. In this case, the “\d” pattern extracts all numerical digits from the given text. The output shows that our code has found all instances of the predefined character class “\d” in the text.

Custom character classes allow you to define your own set of characters using square brackets []. In the example, the custom character class “[aeiou]” is used to find all vowels in the text. The output shows all instances of the vowels we defined.

We can also use “-” to define a range of characters; a class such as “[A-Z]”, for instance, covers the uppercase letters. Here the output consists of the uppercase letters in the text.

When you use the same regular expression multiple times in a script, it saves time to compile it into a pattern object first, because the regular expression doesn’t need to be parsed again with each use. The re.compile() function compiles a regular expression pattern into a pattern object. Once we have this pattern object, we can call its methods for matching, searching, and other operations. In the example, the output shows digits.

Compiled regular expressions bring benefits in reusability, performance, and readability. In a simple example, a compiled pattern is applied to one text and then reused on a second text. The example is deliberately simple, so that you can grasp the importance of reusability, performance, and readability, especially when we plan to use a pattern repeatedly.

In this section, let’s test what we have learned by writing a Python script to extract phone numbers from text. This is a common use of regular expressions, especially in the data-cleaning process. Phone numbers come in different formats, especially across countries, so you can adjust the pattern to match yours. For this example, let’s consider the format XXX-XXX-XXXX, where X is a digit.

The code first defines a pattern that matches the format above and compiles it into a regular expression object. Next, we use the findall() method to extract the phone numbers that match our pattern. Finally, we print the extracted phone numbers to the console. The full Python script combines all the steps above.

As you continue to work with regular expressions, a few best practices are worth keeping in mind.

In this guide, we explored the realm of Python RegEx, or regular expressions. We started with common functions and fundamentals and went through more advanced concepts and practical examples. But remember that working on real-life projects is what deepens this understanding and counts as experience for your career. By doing so, you’ll build lasting knowledge and save yourself from googling whenever you work on Python regular expressions.

Check out this comprehensive guide to advanced Python concepts to get an overview of such concepts.

I hope you gained valuable information about Python RegEx by reading this article. Thanks for reading!

Originally published at https://www.stratascratch.com.
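The phone-number walkthrough above can be sketched as one short script. This is an illustration under stated assumptions, not the article's original code: the function name extract_phone_numbers, both sample texts, and the \b word boundaries (added so the pattern does not match inside longer digit runs) are all hypothetical choices.

```python
import re

# Compile once, then reuse the pattern object on several texts.
# Format assumed from the article: XXX-XXX-XXXX, where X is a digit.
phone_pattern = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_phone_numbers(text):
    """Return every phone number in text that fits XXX-XXX-XXXX."""
    return phone_pattern.findall(text)


# Hypothetical sample texts.
first = "Call 555-123-4567 or 555-987-6543 before noon."
second = "No numbers here, only the ID 12-34."

first_numbers = extract_phone_numbers(first)
second_numbers = extract_phone_numbers(second)

print(first_numbers)
print(second_numbers)
```

Reusing phone_pattern for both texts mirrors the reusability point made for re.compile(): the pattern is defined in one place and applied wherever it is needed.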



This post first appeared on VedVyas Articles, please read the original post: here
