11. Demystifying Regex
11.1. Introduction
Regular expressions, often referred to as regex or regexp, are tools many programmers use for pattern matching and text manipulation. They provide a concise and flexible syntax for describing complex patterns within strings of text. They can be scary to look at first, but once you understand them, they are a powerful tool. They’re often used for things like email validation or for searching text with tools like grep. Here’s an example of how an email address pattern can be constructed with regex.
Image Source: https://paulvanderlaken.com/2017/10/03/regular-expressions-in-r-part-1-introduction-and-base-r-functions/
11.2. Code Examples
Here’s how that would look in Python after importing the built-in module, re:
import re

def extract_emails(text):
    # create the regex pattern for matching email addresses
    # (the final class is [A-Za-z]; a | inside brackets would match a literal pipe)
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    # find all matches of the pattern in the text
    matches = re.findall(pattern, text)
    return matches

# sample text with email addresses
sample_text = """
Hello, my email address is john.doe@cats.com.
You can also reach me at janelovescats@gmail.com.
"""

# find email addresses from the sample text
emails = extract_emails(sample_text)

# print the extracted email addresses
for email in emails:
    print(email)
john.doe@cats.com
janelovescats@gmail.com
Here’s how it would look in JavaScript:
function extractEmails(text) {
    //define the regex pattern for matching email addresses
    var pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g;
    //find all matches of the pattern in the text
    //(match returns null when nothing is found, so fall back to an empty array)
    var matches = text.match(pattern) || [];
    return matches;
}

//text containing email addresses
//(a double-quoted string cannot span lines, so the two lines are joined with \n)
var sampleText = "Hello, my email address is john.doe@regex.com.\n" +
    "You can also reach me at janeregex@textbook.com.";

//find email addresses
var emails = extractEmails(sampleText);

//print the extracted email addresses
for (var i = 0; i < emails.length; i++) {
    console.log(emails[i]);
}
… and in Perl:
use strict;
use warnings;

sub extract_emails {
    my ($text) = @_;
    #define the regex pattern for matching email addresses
    my $pattern = qr/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/;
    #find pattern matches
    my @matches = $text =~ /$pattern/g;
    return @matches;
}

#text containing email addresses
my $sample_text = <<'END_TEXT';
Hello, my email address is john.lovescats@example.com.
You can also reach me at janelovesdogs@gmail.com.
END_TEXT

#taking emails from text
my @emails = extract_emails($sample_text);

#printing each email
foreach my $email (@emails) {
    print $email . "\n";
}
Now that we know what regex looks like within the context of programming, let’s go through each character and what it means, so you can use regular expressions in your own software!
12. Regex Breakdown
Literal Characters: Regular expressions can contain literal characters, which match themselves in the text. For example, the regular expression cat matches the characters “cat” in a string.
Metacharacters: Metacharacters are special characters with predefined meanings in regular expressions. Some common metacharacters include:
. (dot/period): Matches any single character except a newline. It is often referred to as a wildcard.
An example would be:
k.n could match with kin or ken, but not kitten
c.t could match with cat, cot or cut, but not ct or can’t
Because the period matches exactly one character, whatever that character is
*: Matches zero or more occurrences of the previous character or group
An example would be:
cof*ee could match with coee, cofee, coffee, cofffee, coffffffffee
Another example would be:
ca*t could match with ct, cat or caat
+: Matches one or more occurrences of the previous character or group
An example would be:
go+gle could match with gogle, google, gooogle, goooogle
But it would NOT match with ggle, because the + requires at least one o. If you wanted ggle to match as well, you would use the * metacharacter instead.
?: Matches zero or one occurrence of the previous character or group
An example would be:
gg? could match with the single g in hiking (zero occurrences of the optional second g) or with the gg in Kellogg (one occurrence)
^: Lets the computer know that the match has to begin at the start of the line
An example would be:
^\d{3} will match with patterns like “973” in “973-333-7039”
Here \d is a shorthand that matches any digit from 0-9
$: Lets the computer know that the match has to finish at the end of the line
An example would be:
\d{4}$ will match with patterns like “7039” in “973-333-7039”
Note that the $ goes after the part of the pattern it anchors
[ ]: Defines a character class, matching any single character within the brackets
An example would be:
[aeiou] matches any lowercase vowel, and [0-9] matches any digit
|: Acts as a logical OR, allowing alternatives within a pattern
An example would be:
cat|kitty matches a string that contains either cat or kitty
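The metacharacters above can be tried out directly in Python. The following is a minimal sketch, not a definitive reference: the helper name first_match and the sample strings are made up for illustration, and re.search is used so that a pattern may match anywhere in the text.

```python
import re

def first_match(pattern, text):
    """Return the first substring that matches, or None if nothing does."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (pattern, sample text) pairs; the sample strings are illustrative only.
checks = [
    (r"c.t", "cat"),              # . stands in for exactly one character
    (r"c.t", "ct"),               # no character for the dot, so no match
    (r"ca*t", "ct"),              # * accepts zero occurrences of "a"
    (r"go+gle", "gooogle"),       # + needs at least one "o"
    (r"go+gle", "ggle"),          # no "o" at all, so no match
    (r"gg?", "Kellogg"),          # ? makes the second "g" optional
    (r"^\d{3}", "973-333-7039"),  # ^ pins the match to the start
    (r"\d{4}$", "973-333-7039"),  # $ pins the match to the end
    (r"[aeiou]", "sky cat"),      # any one character from the class
    (r"cat|kitty", "my kitty"),   # either alternative
]

results = [first_match(p, t) for p, t in checks]
for (p, t), r in zip(checks, results):
    print(f"{p:10} on {t!r:16} -> {r!r}")
```

A result of None means the pattern found nothing in that sample, which is the expected outcome for the "ct" and "ggle" rows.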
Quantifiers: Quantifiers specify how many occurrences of a character or group should be matched. They can be used with both literal characters and metacharacters. For example:
*: Matches zero or more occurrences
+: Matches one or more occurrences
?: Matches zero or one occurrence
{n}: Matches exactly n occurrences
{n,}: Matches n or more occurrences
{n,m}: Matches between n and m occurrences (inclusive)
Here are some examples using curly braces:
a{3} would match with {aaa}
z{3,6} would match with {zzz, zzzz, zzzzz, zzzzzz}
c{3,} would match with {ccc, cccc, ccccc, …}
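The curly-brace quantifiers can be checked the same way. This is a small sketch that assumes we want the whole string to match, so it uses re.fullmatch; the helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs chosen to sit on either side of each limit.
cases = [
    (r"a{3}", "aaa"),         # exactly three
    (r"a{3}", "aa"),          # one too few
    (r"z{3,6}", "zzzzzz"),    # six is still within 3..6
    (r"z{3,6}", "zzzzzzz"),   # seven is too many
    (r"c{3,}", "c" * 10),     # no upper limit
    (r"c{3,}", "cc"),         # below the minimum
]

outcomes = [matches_whole(p, t) for p, t in cases]
for (p, t), ok in zip(cases, outcomes):
    print(f"{p} vs {t!r}: {ok}")
```

With re.search instead of re.fullmatch, z{3,6} would still find a match inside a run of seven z characters, because it is free to match only part of the text.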
Grouping and Capturing: Parentheses () are used to group characters or subpatterns together. Groups can be quantified as a whole and can be referred to later in the regular expression or during text replacement operations.
An example would be:
(dog|cat)lover would match with doglover or catlover
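Here is a short sketch of what capturing buys you in Python, using the (dog|cat)lover pattern from above. The sample sentences, the replacement text " fan" and the extra patterns (ha)+ and (\w)\1 are assumptions added for illustration.

```python
import re

pattern = re.compile(r"(dog|cat)lover")

# group(0) is the whole match, group(1) is what the parentheses captured.
m = pattern.search("I am a catlover")
whole, animal = m.group(0), m.group(1)
print(whole, animal)

# In a replacement string, \1 refers back to the first captured group.
replaced = pattern.sub(r"\1 fan", "doglover and catlover")
print(replaced)

# A group can be quantified as a whole: one or more repetitions of "ha".
laugh = re.fullmatch(r"(ha)+", "hahaha") is not None
print(laugh)

# Inside the pattern itself, \1 matches the same text the group captured,
# so (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "Kellogg").group(0)
print(doubled)
```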
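Escaping: Some characters have special meanings in regular expressions and need to be escaped with a backslash if you want to match them literally. For example, if you want to match a literal dot ., you need to escape it as \.
An example would be:
[A-Z]{3} [\d.,]+|[\d.,]+ [A-Z]{3}
This matches a currency code and an amount in either order, like USD 4,800.23. Here the backslash works the other way around: \d turns the ordinary letter d into the shorthand for any digit. The dot and comma need no escaping because they sit inside square brackets, where a dot is already treated literally.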
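The difference between an escaped and an unescaped dot is easy to see in Python. The sketch below uses made-up sample strings; re.escape is the standard-library helper that adds the backslashes for you when a literal string has to be placed inside a pattern.

```python
import re

text = "abc a.c"

# Unescaped dot: matches any character, so both words are found.
any_char = re.findall(r"a.c", text)
# Escaped dot: matches only a literal period.
literal_dot = re.findall(r"a\.c", text)
print(any_char, literal_dot)

# re.escape builds a pattern that matches the given string literally.
escaped = re.escape("3.50")
print(escaped)

# The currency pattern from the text, applied to an illustrative sentence.
currency = r"[A-Z]{3} [\d.,]+|[\d.,]+ [A-Z]{3}"
amounts = re.findall(currency, "Price: USD 4,800.23 or 4.500,00 EUR")
print(amounts)
```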
So, as you can see, using regex is often a lot easier than writing a long chain of complicated if-else statements. For text processing, regex can make your code simpler and cleaner. It does take a while to understand and conceptualize, but there are many places online (see: Conclusion) where you can test out your regex.
Image Source: https://www.datacamp.com/cheat-sheet/regular-expresso
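To make the comparison with if-else chains concrete, here is a rough sketch that checks a phone number in the 973-333-7039 format two ways. Both function names are hypothetical, and the hand-written version is only one of many ways such a check could be written.

```python
import re

def is_phone_manual(s):
    """Check the NNN-NNN-NNNN shape character by character."""
    if len(s) != 12:
        return False
    for i, ch in enumerate(s):
        if i in (3, 7):
            # positions 3 and 7 must hold the dashes
            if ch != "-":
                return False
        elif ch not in "0123456789":
            return False
    return True

def is_phone_regex(s):
    """The same check expressed as a single pattern."""
    return re.fullmatch(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", s) is not None

samples = ["973-333-7039", "9733337039", "973-333-703x", "973-3333-039"]
for s in samples:
    print(s, is_phone_manual(s), is_phone_regex(s))
```

Both functions agree on every sample, but the regex version states the expected shape in one line.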
12.1. Conclusion
Regex is versatile: its uses range from validating input to extracting specific data from large datasets. That matters a great deal in computing, a field that is always dealing with data! Here are a couple of sites that you can use to help strengthen your understanding and use of regex!
Want some more example regexes? Check these out: https://support.google.com/a/answer/1371417?hl=en https://regexr.com/
Want to build and test your regex? Here are some great places to do so: https://regex101.com/ https://regexr.com/
Happy Coding!