
Chapter 8. Regular Expressions

WHAT YOU WILL LEARN IN THIS CHAPTER

  • Basic regular expression matching

  • Handling characters and numbers

  • Using quantifiers

  • Character classes and grouping

  • Understanding escape characters

  • Extracting data

  • Substitutions

  • Useful regular expression modules

Sometimes instead of exactly matching text, you want to find some text that looks like something you’re expecting. This is where Perl’s regular expressions come in.

A regular expression is a pattern that describes what your text should look like. Regular expressions can get very complex, but most of the time they’re pretty straightforward once you understand the syntax. Note that regular expressions are often called regexes (a single regular expression is sometimes just called a regex or worse, a regexp).

Note that an entire book can be (and has been) written on this topic. We are going to focus on those aspects of regular expressions you’re most likely to encounter.

Basic matching

Let’s say you have a list of strings and you want to print all strings containing the letters cat because, like your author, you love cats.

my @words = (
   'laphroaig',
   'house cat',
   'catastrophe',
   'cat',
   'is awesome',
);
foreach my $word (@words) {
    if ( $word =~ /cat/ ) {
        print "$word\n";
    }
}
That prints out:
house cat
catastrophe
cat

The basic syntax of a regular expression match looks like this:

STRING =~ REGEX

The =~ is known as a binding operator. By default, regular expressions match against the built-in $_ variable, but the binding operator binds the match to a different string. So we could have written the loop like this:

foreach (@words) {
    if (/cat/) {
        print "$_\n";
    }
}

There is also a negated form of the binding operator, !~, that is used to identify strings not matching a given regular expression.

foreach my $word (@words) {
    if ( $word !~ /cat/ ) {
        print "$word\n";
    }
}

And that prints:

laphroaig
is awesome

Without the binding operator, just use negation like normal:

foreach (@words) {
    if ( !/cat/ ) {
        print "$_\n";
    }
}

If you want to match a forward slash (/), you can escape it with a backslash. Alternatively, just as with quote-like operators, you can use a different set of delimiters if you precede them with the letter m (for ‘m’atch). The following are all equivalent and match the string 1/2.

/1\/2/
m"1/2"
m{1/2}
m(1/2)
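
A quick way to convince yourself that these forms are equivalent is to run all of them against the same string. This is only a minimal sketch; the $fraction and @results names and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

my $fraction = 'The recipe calls for 1/2 cup of sugar';

# The same pattern written with four different delimiters.
my @results = (
    ( $fraction =~ /1\/2/ ? 1 : 0 ),
    ( $fraction =~ m"1/2" ? 1 : 0 ),
    ( $fraction =~ m{1/2} ? 1 : 0 ),
    ( $fraction =~ m(1/2) ? 1 : 0 ),
);
print "@results\n";    # every entry is 1
```

Only the first form needs the backslash, because only there is the slash also the delimiter.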

Quantifiers

If you just want to match an exact string, using the index() builtin is faster, but sometimes you want to match more or less of a particular string. That’s when you want to use quantifiers in your regular expression. For example, to match the letter a followed by an optional letter b, and then the letter c, use the ? quantifier to show that the b is optional. The following will match both abc and ac:

if ( $word =~ /ab?c/ ) { ... }

The * is used to show that you can match zero or more of a given letter:

if ( $word =~ /ab*c/ ) { ... }

The + is used to show that you can match one or more of a given letter:

if ( $word =~ /ab+c/ ) { ... }

This sample code should make this clear. We will use the qr() quote-like operator. This allows us to properly quote a regular expression without trying to match it to anything before we’re ready.

my @strings = qw(
    abba
    abacus
    abbba
    babble
    Barbarella
    Yello
);
my @regexes = (
    qr/ab?/,
    qr/ab*/,
    qr/ab+/,
);
foreach my $string (@strings) {
    foreach my $regex (@regexes) {
        if ( $string =~ $regex ) {
            print "'$regex' matches '$string'\n";
        }
    }
}

And that prints out:

'(?-xism:ab?)' matches 'abba'
'(?-xism:ab*)' matches 'abba'
'(?-xism:ab+)' matches 'abba'
'(?-xism:ab?)' matches 'abacus'
'(?-xism:ab*)' matches 'abacus'
'(?-xism:ab+)' matches 'abacus'
'(?-xism:ab?)' matches 'abbba'
'(?-xism:ab*)' matches 'abbba'
'(?-xism:ab+)' matches 'abbba'
'(?-xism:ab?)' matches 'babble'
'(?-xism:ab*)' matches 'babble'
'(?-xism:ab+)' matches 'babble'
'(?-xism:ab?)' matches 'Barbarella'
'(?-xism:ab*)' matches 'Barbarella'

Sadly, nothing matches Yello, an excellent music group, but studying the rest of the matches should make it clear what is going on.

However, you may be wondering what that bizarre (?-xism:ab*) is doing on the regex we’ve printed out. Those are regular expression modifiers, which we’ll cover later in this chapter. (Perl 5.14 and later print the more compact (?^:ab*) instead.)

If you need to be more precise, you can use the {n,m} syntax. This tells Perl that you want to match at least n times and no more than m times. There are three variants of this:

/ab{3}c/    # 1 a, 3 'b's, 1 c (only "abbbc")
/ab{3,}c/   # 1 a, 3 or more 'b's, 1 c
/ab{3,6}c/  # 1 a, 3 to 6 'b's, 1 c
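
To see the three variants side by side, the following sketch runs each of them against strings containing 1, 2, 3, 6 and 7 b's. The variable names are our own, and grep is used only to filter the list.

```perl
use strict;
use warnings;

# Build 'a', then 1, 2, 3, 6 and 7 'b's, then 'c'.
my @candidates;
push @candidates, 'a' . ( 'b' x $_ ) . 'c' for 1, 2, 3, 6, 7;

my @exactly_three = grep { /ab{3}c/ }   @candidates;
my @three_or_more = grep { /ab{3,}c/ }  @candidates;
my @three_to_six  = grep { /ab{3,6}c/ } @candidates;

print "exactly 3: @exactly_three\n";    # abbbc
print "3 or more: @three_or_more\n";    # abbbc abbbbbbc abbbbbbbc
print "3 to 6:    @three_to_six\n";     # abbbc abbbbbbc
```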

Table 8.1, “Regex quantifiers” summarizes the different types of regex quantifiers and their meaning:

Table 8.1. Regex quantifiers

Quantifier    Meaning
*             Match 0 or more times
+             Match 1 or more times
?             Match 0 or 1 times
{n}           Match exactly n times
{n,}          Match at least n times
{n,m}         Match at least n times but not more than m times

By default, all quantifiers in Perl are greedy. That means they’ll try to match as much as possible. For example, the dot metacharacter (.) means “match anything” except newlines and .* will match the rest of the string up to a newline. Later in the chapter when you learn to print out just the bits you’ve matched, you’ll discover that for the word cataract, the regular expression a.+a matches atara and not just ata. If you want a quantifier to be lazy (match as little as possible) instead of greedy, just follow it with a question mark:

if ( "cataract" =~ /a.+?a/ ) {
    # the first match is now "ata" instead of "atara"
}
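
If you would like to check the greedy and lazy behavior for yourself right now, here is a small sketch. It relies on capturing parentheses and $1, which are explained later in this chapter under “Extracting Data”; the $greedy and $lazy names are ours.

```perl
use strict;
use warnings;

my $word = 'cataract';
my ( $greedy, $lazy );

# Parentheses capture what the pattern matched into $1.
$greedy = $1 if $word =~ /(a.+a)/;
$lazy   = $1 if $word =~ /(a.+?a)/;

print "greedy: $greedy\n";    # greedy: atara
print "lazy:   $lazy\n";      # lazy:   ata
```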

Note

By now you’ve noticed that some characters in regexes have a special meaning. These are called metacharacters. The following are the metacharacters that Perl regular expressions recognize:

{}[]()^$.|*+?\

If you want to match the literal version of any of those characters, you must precede them with a backslash, \. As we go through the chapter, the meaning of these metacharacters will become clear.

Escape Sequences

Sometimes you want to match a wide variety of different things that are difficult to type, or may match a wide range of characters. Many of the common cases are handled with escape sequences. Table 8.2, “Common escape sequences” explains some of these sequences and we’ll give a few practical examples. This is not an exhaustive list, just a list of the more common sequences.

Table 8.2. Common escape sequences

Escape          Meaning
\A              Beginning of string
\b              Match backspace in character class
\b              Word boundary
\cX             ASCII control character (for example, CNTL-C is \cC)
\d              Unicode digit
\D              Not a Unicode digit
\E              End case (\F, \L, \U) or quotemeta (\Q) translation, only if interpolated.
\e              Escape character (ESC, not the backslash)
\g{GROUP}       Named or numbered capture
\G              End of match of m//g
\k<GROUP>       Named capture
\l              Lowercase next character only, if interpolated
\L              Lowercase until \E, if interpolated
\N{CHARNAME}    Named character, alias, or sequence, if interpolated. You must “use charnames” (See Unicode in Chapter 9, Files and Directories)
\n              Newline
\p{PROPERTY}    Character with named Unicode property
\P{PROPERTY}    Character without named Unicode property
\Q              Ignore metacharacters until \E
\r              Return character
\s              Whitespace
\S              Not whitespace
\t              Tab
\u              Upper case next character only, if interpolated
\U              Upper case until \E, if interpolated
\w              Word character
\W              Not word character
\z              True at end of string only.
\Z              True right before final newline or at end of string

Of those, the ones you’ll most commonly see are \w (“word” characters), \d (digits), \s (whitespace) and \b (“word” boundary).

So let’s say you have some strings and you want to find all strings containing phone numbers matching the pattern XXX-XXX-XXXX where X can be any digit. You might use the following regular expression:

for my $string (@strings) {
    if ( $string =~ /\d{3}-\d{3}-\d{4}/ ) {
        print "Phone number found: $string\n";
    }
}

And that will indeed match 555-867-5309. Unfortunately, it will also match a string containing 555555555-867-444444444 and that, presumably, is not a phone number. There are several ways of dealing with this. If you know the phone number has whitespace on either side, you could try to match whitespace with the \s escape:

for my $string (@strings) {
    if ( $string =~ /\s\d{3}-\d{3}-\d{4}\s/ ) {
        print "Phone number found: $string\n";
    }
}

But maybe you don’t know what is on either side of the phone number. You might make a mistake and try to match “non-digits” with \D:

for my $string (@strings) {
    if ( $string =~ /\D\d{3}-\d{3}-\d{4}\D/ ) {
        print "Phone number found: $string\n";
    }
}

That looks reasonable, but try this:

print "Phone: 123-456-7890" =~ /\D\d{3}-\d{3}-\d{4}\D/
    ? "Yes"
    : "No";

That prints No. Why? Because \D has to match something. The first \D matches a space, but the second one has nothing to match. What you actually want is the \b. That matches a word boundary. A word is matched by \w and that’s any alphanumeric character, plus the underscore. A word boundary matches no characters, but matches when there is a transition between a word and non-word character (this means that \w\b\w can never match anything).

print "Phone: 123-456-7890" =~ /\b\d{3}-\d{3}-\d{4}\b/
   ? "Yes"
   : "No";

That prints Yes because the final \b matches between the final digit and the end of the string.
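
Putting the pieces together, the following sketch runs the \b version over a small list. The sample strings and the @found array are invented for this illustration.

```perl
use strict;
use warnings;

my @strings = (
    'Call 555-867-5309 now',
    'Phone: 123-456-7890',
    '555555555-867-444444444',
    'no number here',
);

my @found;
for my $string (@strings) {
    # \b on both sides rejects numbers embedded in longer runs of digits.
    if ( $string =~ /\b\d{3}-\d{3}-\d{4}\b/ ) {
        push @found, $string;
    }
}
print "Phone number found: $_\n" for @found;
```

Only the first two strings are reported. The long run of digits is rejected because there is no word boundary in the middle of a run of digits.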

Warning

Actually, the \d matches any Unicode (Chapter 9, Files and Directories) character which is a digit and there are far more of those than you probably know about, including a few mistakes that have crept into the Unicode standard. If you only want to match the digits 0 through 9, use the [0-9] character class. See “Character Classes and Grouping” in this chapter.
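
As a small illustration of the difference, this sketch tests a single non-ASCII digit against both patterns. The choice of U+0663 (ARABIC-INDIC DIGIT THREE) and the variable names are ours.

```perl
use strict;
use warnings;

# U+0663 is ARABIC-INDIC DIGIT THREE: a digit as far as Unicode is concerned.
my $arabic_three = "\x{0663}";

my $matches_d     = $arabic_three =~ /\d/    ? 'yes' : 'no';
my $matches_ascii = $arabic_three =~ /[0-9]/ ? 'yes' : 'no';

print "\\d matches:    $matches_d\n";        # yes
print "[0-9] matches: $matches_ascii\n";    # no
```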

Extracting Data

At this point, you’re probably thinking “that’s nice, but what good is that data if you can’t get it?” It’s simple: just put parentheses around any data in a regular expression that you want to extract. For every set of capturing parentheses, use a $1, $2, $3, and so on, to access that data.

if ( "Phone: 123-456-7890" =~ /(\b\d{3}-\d{3}-\d{4}\b)/ ) {
    my $phone = $1;
    print "The phone number is $phone\n";
}

And that will print:

The phone number is 123-456-7890

You can use this to populate data structures. Consider the following block of text. You want to create a hash of names and their ages. Example 8.1, “Building data structures with regexes” shows an example of this.

Example 8.1. Building data structures with regexes

use strict;
use warnings;
use diagnostics;
use Data::Dumper;
my $text = <<'END';
Name: Alice Allison Age: 23
Occupation: Spy
Name: Bob Barkely   Age: 45
Occupation: Fry Cook
Name: Carol Carson  Age: 44
Occupation: Manager
Name: Prince        Age: 53
Occupation: World Class Musician
END
my %age_for;
foreach my $line (split /\n/, $text) {
    if ( $line =~ /Name:\s+(.*?)\s+Age:\s+(\d+)/ ) {
        $age_for{$1} = $2;
    }
}
print Dumper(\%age_for);

Note

data_structure.pl available for download at Wrox.com.

And that will print something like:

$VAR1 = {
          'Bob Barkely' => '45',
          'Alice Allison' => '23',
          'Carol Carson' => '44',
          'Prince' => '53'
        };

Note

If captures starting with $1 sound odd, it might be because other indexes in Perl start with 0 and not 1. In this case, $0 is reserved for the name of the program being executed.

If that regular expression is confusing, here’s a way to make it read easier: put an /x modifier at the end and all whitespace (unless escaped with a backslash) will be ignored. You can then put comments at the end of each part to explain it.

my $name_and_age = qr{
   Name:
   \s+      # 1 or more whitespace
   (.*?)    # The name in $1
   \s+      # 1 or more whitespace
   Age:
   \s+      # 1 or more whitespace
   (\d+)    # The age in $2
}x;
foreach my $line (split /\n/, $text) {
    if ( $line =~ $name_and_age ) {
        $age_for{$1} = $2;
    }
}

That makes regexes much easier to read.

As was explained earlier, the . metacharacter will match anything except newlines (but see the /s modifier later in this chapter). So .* means match zero or more of anything. Note that we made the .* lazy by adding a question mark after it. If we didn’t do this, it would have matched greedily and pulled in all of the whitespace it could before the \s+. The resulting data structure would have looked like this:

$VAR1 = {
          'Carol Carson ' => '44',
          'Alice Allison' => '23',
          'Bob Barkely  ' => '45',
          'Prince       ' => '53'
        };

Warning

Be careful when using the . metacharacter. In fact, you should avoid it if you possibly can. Because it matches indiscriminately, it’s very easy for it to match something you don’t intend. It’s far better to have a regular expression state explicitly what you’re trying to match. For the $name_and_age regex, your author probably would have written [[:alpha:] ]*?, but we haven’t yet covered that in this chapter.

You can also use those “digit” variables in a regular expression. However, you precede them with a backslash. The $1 captured by the first set of parentheses is matched by \1. Here’s how you can find double words:

print "Four score score and seven years ago" =~ /\b(\w+)\s+\1\b/
    ? "The word ($1) was doubled"
    : "No doubles found";

And that will print:

The word (score) was doubled

We use the \b (word boundary) after the \1 to ensure that strings like the theremin will not be reported as doubled words.

Modifiers and Anchors

A regular expression modifier is one or more characters appended to the end of the regular expression that modifies it.

Earlier, when printing a regular expression, you saw (?-xism:ab?). The (?-) syntax shows the modifiers in effect for the regular expression. If the modifying letter is after the minus sign (-), then it does not apply to the regex. For the $name_and_age regular expression we used earlier, let’s also add an /i modifier at the end of it. When that’s added, it makes the regular expression case-insensitive. /name/i will match Name, name, nAMe, and so on.

For the (?-) syntax, if a modifying letter is before the minus sign, it means that it applies to this regex:

my $name_and_age = qr{
   Name:
   \s+      # 1 or more whitespace
   (.*?)    # The name in $1
   \s+      # 1 or more whitespace
   Age:
   \s+      # 1 or more whitespace
   (\d+)    # The age in $2
}xi;
print $name_and_age;

And that will print out:

(?ix-sm:
    Name:
    \s+      # 1 or more whitespace
    (.*?)    # The name in $1
    \s+      # 1 or more whitespace
    Age:
    \s+      # 1 or more whitespace
    (\d+)    # The age in $2
 )

The most common modifiers you will see are explained in Table 8.3, “Common Regex Modifiers”.

Table 8.3. Common Regex Modifiers

Modifier    Meaning
/x          Ignore unescaped whitespace
/i          Case-insensitive match
/g          Global matching (keep matching until no more matches)
/m          Multiline mode (we’ll explain in a bit)
/s          Single line mode (the . metacharacter will now match \n)

You already know about the /x and /i modifiers, so let’s look at the /g modifier. That allows you to globally match something. For example, to print every non-number in a string:

my $string = '';
while ("a1b2c3dddd444eee66" =~ /(\D+)/g ) {
    $string .= $1;
}
print $string;

And that will print out abcddddeee, as we expect.

You can also use this to count things, if you’re so inclined. Here’s how to count every occurrence of a word ending in the letters at. The trailing \b ensures that the at really is at the end of the word.

my $silly = 'The fat cat sat on the mat';
my $at_words = 0;
$at_words++ while $silly =~ /\b\w+at\b/g;

The $at_words variable will contain the number 4 after that code runs. If you don’t like statement modifiers (putting the while at the end of the statement), you can write it this way:

while ( $silly =~ /\b\w+at\b/g ) {
    $at_words++;
}

You might recall that while loops are often used with iterators. The /g modifier effectively turns the regular expression into an iterator.
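
The iterator behavior applies in scalar context. In list context, a match with /g returns all of the matches at once, which is often the shorter way to collect or count them. A minimal sketch, with a trailing \b added so that only words really ending in at are collected (the @at_words and $count names are ours):

```perl
use strict;
use warnings;

my $silly = 'The fat cat sat on the mat';

# In list context, m//g returns every match in one go.
my @at_words = $silly =~ /\b\w+at\b/g;
my $count    = @at_words;

print "$count words: @at_words\n";    # 4 words: fat cat sat mat
```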

The /m and /s modifiers look a bit strange, but to discuss those, we should explain anchor metacharacters first.

Anchor metacharacters are used to “anchor” a regular expression to a particular place in a string. They do not match an actual character. You’ve already seen one anchor: \b. The ^ is used to match the start of the string and the $ is used to match the end of the string. By default, they are synonymous with \A and \Z. Note that both $ and \Z match at the end of a string or just before a newline at the end of the string. Thus, if your string ends with a newline, the $ will match immediately before that newline.

my $prisoner = <<"END";
I will not be pushed, filed, stamped, indexed, briefed, debriefed or numbered.
My life is my own.
END
print $prisoner =~ /^I/          ? "Yes\n" : "No\n";
print $prisoner =~ /^My/         ? "Yes\n" : "No\n";
print $prisoner =~ /numbered\.$/ ? "Yes\n" : "No\n";
print $prisoner =~ /own\.$/      ? "Yes\n" : "No\n";
That prints:
Yes
No
No
Yes

In other words, only /^I/ and /own\.$/ matched. If you want /^My/ and /numbered\.$/ to match, use the /m switch to force “multiline” mode. That will force the ^ and $ to match at the beginning and end of every line (lines being separated by newlines) instead of only at the beginning and end of the entire string.
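
Here is a sketch of both modifiers in action on the same text. The last two patterns are our own additions to show /s: without it, the . refuses to cross the newline between the two lines.

```perl
use strict;
use warnings;

my $prisoner = <<"END";
I will not be pushed, filed, stamped, indexed, briefed, debriefed or numbered.
My life is my own.
END

my @answers = (
    ( $prisoner =~ /^My/m           ? 'Yes' : 'No' ),    # ^ matches after a newline
    ( $prisoner =~ /numbered\.$/m   ? 'Yes' : 'No' ),    # $ matches before a newline
    ( $prisoner =~ /numbered\..My/  ? 'Yes' : 'No' ),    # . will not match \n
    ( $prisoner =~ /numbered\..My/s ? 'Yes' : 'No' ),    # with /s, it will
);
print "@answers\n";    # Yes Yes No Yes
```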

It’s also important to be aware that if the $ is not the last character in the regular expression, then Perl will assume that this is the sigil introducing a scalar variable:

my $match = "aa";
if ( $some_string =~ /$match/ ) {
    # match words containing aa
}

Later, we’ll see how we can take advantage of this to build complicated regular expressions that would ordinarily be too difficult to write.
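
One caveat when interpolating a variable: any metacharacters in its value keep their special meaning. The \Q and \E escapes from Table 8.2 switch that off. A small sketch with made-up sample data (the @loose and @strict names are ours):

```perl
use strict;
use warnings;

my $match   = '1.5';
my @strings = ( '1.5 litres', '125 litres' );

# Interpolated as-is, the . still means "any character".
my @loose = grep { /$match/ } @strings;

# \Q...\E escapes the metacharacters in the interpolated value.
my @strict = grep { /\Q$match\E/ } @strings;

print scalar(@loose),  " loose matches\n";     # 2 loose matches
print scalar(@strict), " strict matches\n";    # 1 strict matches
```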

Character Classes

Sometimes you want to match any one of a few characters, such as the even digits. You can do that with a character class. You put the characters you want in square brackets, []. Here’s a silly way of extracting all positive even integers from a string.

my $string = '42 85 abcd 8 4ever foobar 666 43';
my @even;
push @even => $1 while $string =~ /\b(\d*[02468])\b/g;

That will leave @even containing the numbers 42, 8 and 666. Here’s how it works.

By now you already know that the \b matches a word boundary, so the 4 in 4ever cannot be matched because not only is that an abomination to the English language, there is no “boundary” between the 4 and ever.

The \d*[02468] means “zero or more digits, followed by a zero, two, four, six or eight”. In other words, a positive even integer.

In a character class, only the -]\^$ characters are considered “special”. So a . will match a literal dot, not “any character except newline”. If the first character is a caret, ^, then it’s a negated character class. This means it will match anything except what’s listed in the character class. Be careful, though: [^02468] matches not only odd digits but also letters, spaces and anything else that is not an even digit, so /\b(\d*[^02468])\b/ would happily capture "42 " (trailing space included). To match odd numbers, the negated class must exclude non-digits (\D) as well:

my $string = '42 85 abcd 8 4ever foobar 666 43';
my @odd;
push @odd => $1 while $string =~ /\b(\d*[^\D02468])\b/g;

That will push 85 and 43 onto the @odd array (of course, you could have simply used [13579] for the character class).

The dash, -, if used anywhere other than as the first or last character in a character class, creates a range. For example, we mentioned earlier that \d matches any Unicode digit (Unicode will be discussed in Chapter 9, Files and Directories). If you want to match only the ASCII digits 0 through 9, you can use [0-9]. This is generally easier to read than [0123456789], though they mean the same thing.

You can have multiple ranges in a character class. [0-9a-fA-F] will match all hexadecimal digits.
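
As an illustration, this sketch pulls six-digit hexadecimal values out of a string. The sample text and the @hex array are invented for the example.

```perl
use strict;
use warnings;

my $string = 'colors: ff8800 and C0FFEE but not xyz123';

# Exactly six hexadecimal digits standing alone as a word.
my @hex = $string =~ /\b([0-9a-fA-F]{6})\b/g;

print "@hex\n";    # ff8800 C0FFEE
```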

Perl also supports POSIX character classes. These have the form [:name:]. Despite the square brackets around them, you must use an additional set of square brackets around them. For example, to match all alphabetical and numeric characters (the same as \w, but without the underscore), you could use [[:alnum:]]. You can combine these, too. To match all digits and punctuation characters: [[:digit:][:punct:]]. Table 8.4, “POSIX Character Classes” explains Perl’s POSIX-style character classes and their meaning.

Note

A common, confusing mistake for regular expressions is to try to use POSIX-style character classes like this:

if ( $string =~ /[:alnum:]/ ) {
    ...
}

Not only does that not work, but it doesn’t generate an error (though with warnings enabled, Perl will warn that POSIX syntax belongs inside character classes). This is because Perl sees [:alnum:] as being a character class matching :, a, l, n, u, m (it’s OK to list a character more than once in a character class). You must write that [[:alnum:]] for Perl to recognize the regex correctly.

Table 8.4. POSIX Character Classes

Class         Meaning
[:alpha:]     Letters (think “Unicode”, covered in Chapter 9, Files and Directories; it’s more than you think)
[:alnum:]     [:alpha:] plus Unicode digits
[:ascii:]     ASCII only
[:cntrl:]     Control characters
[:digit:]     Unicode digits
[:graph:]     Alphanumeric and punctuation characters
[:lower:]     Lower case letters
[:print:]     Printable characters ([:graph:] plus the space character)
[:space:]     Whitespace, much like \s: space, tab, newline, vertical tab, form feed and carriage return
[:upper:]     Upper case letters
[:xdigit:]    Hexadecimal digits ([0-9a-fA-F])
[:word:]      \w

As a Perl extension to POSIX character classes, you can include a ^ after the [: to indicate negation. So to match anything that is not a control character, use [[:^cntrl:]].
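
A brief sketch of a POSIX class and its negated form working on the same string. The sample string contains a bell (\a) and a tab (\t), both control characters; the variable names are ours.

```perl
use strict;
use warnings;

my $raw = "bell\a and tab\t removed";

# Every control character, and every character that is not one.
my @controls = $raw =~ /[[:cntrl:]]/g;
my @others   = $raw =~ /[[:^cntrl:]]/g;
my $clean    = join '', @others;

print scalar(@controls), " control characters dropped: $clean\n";
```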

Grouping

For a character class, you list what types of characters you’re looking for. For a group, you can list what types of words you’re looking for. To group words (or patterns), just put parentheses around them. With that, you can do all sorts of interesting things, including using quantifiers:

# cat, optionally followed by astrophe
/cat(astrophe)?/

You can use a | character in the group to alternate between different patterns:

# matches catastrophe, cataract and catapult, but not cat
/cat(astrophe|aract|apult)/

We’ve already seen parentheses before. These are used when we want to extract data into the $1, $2, $3 variables and so on. If you want to group but don’t want to extract the data (perhaps you’re inserting a group in an existing regex and don’t want to change all of your match variables), use the (?:...) syntax:

# matches catastrophe or cataract, but without setting $1
/cat(?:astrophe|aract)/

As you’ve already seen (?-xism:...) earlier in this chapter, you may wonder if the (?:...) syntax is related. In fact, it’s the same thing. You can set those modifiers yourself to tell Perl how to behave. For example, let’s make part of a regex case-insensitive. Maybe you’re writing code to list everyone who is a volunteer. Unfortunately, the people who typed in the data typed volunteer, Volunteer, VOLUNTEER.

use Data::Dumper;
my $text = <<'END';
Name: Alice Allison Position: VOLUNTEER
Name: Bob Barkely   Position: Manager
Name: Carol Carson  Position: Volunteer
Name: David Dark    Position: Geek
Name: e.e. cummings Position: Volunteer
name: Fran Francis  Position: volunteer
END
my @volunteers;
foreach my $line (split /\n/, $text) {
    if ( $line =~ m<Name:\s+(.*?)\s+Position:\s+(?i-xsm:volunteer)\b> ) {
        push @volunteers => $1;
    }
}
print Dumper(\@volunteers);

And that prints:

$VAR1 = [
          'Alice Allison',
          'Carol Carson',
          'e.e. cummings'
        ];

(And you’ll note how we sneakily put the . in the Name pattern so we could still match e.e. cummings).

Why didn’t it add Fran Francis to that list? Because she has name: in front of her name but we didn’t make that part of the regular expression case-insensitive.

You might find typing (?i-xsm:volunteer) to be a bit cumbersome. If the entire regular expression is not using the /x, /s or /m modifiers, you don’t need the -xsm in the group. You only need them if you need to explicitly disable them (and you don’t need to list all of them, either). So we could have written (?i:volunteer), which is cleaner.

Advanced matching

As you work with regular expressions more, you’ll find yourself wanting to do more powerful things with them. Regular expressions are actually a special purpose declarative language embedded in Perl. Though they’re generally agreed to not be Turing complete (http://en.wikipedia.org/wiki/Turing_complete), they’re still pretty powerful (even if they were Turing Complete, you’d upset a lot of programmers if you wrote your programs solely in terms of regular expressions).

Substitutions

While perhaps not an “advanced” feature of regexes, substitutions are the next logical step in your programming journey. They have the following form:

s/regular expression/replacement text/

You prefer a rare steak to a well-done steak (as you should), so you need to fix this menu item:

my $main_course = "A well-done filet mignon";
$main_course =~ s/well-done/rare/;
print $main_course;

And that prints A rare filet mignon.

As with the normal m//, you can use the /g modifier to make substitutions global. Here’s a (very stupid) technique to remove all doubled words from a text:

my $text = "a a b b c cat dd dd";
$text =~ s/\b(\w+)\s+\1\b/$1/g;
print $text;

And that leaves us with a b c cat dd.

Let’s use the /x modifier to make this a bit clearer.

$text =~ s/
    \b          # word boundary
    (\w+)       # capture to $1
    \s+         # whitespace
    \1          # doubled word (matches $1)
    \b          # word boundary
/$1/gx;         # replace doubled with $1

The left side of the substitution is a regular expression and the right side is not. Thus, we use \1 inside the regex and $1 outside the regex.
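
One detail worth knowing: a substitution also returns the number of replacements it made, which is handy for reporting. A minimal sketch reusing the doubled-word example ($count is just an illustrative name):

```perl
use strict;
use warnings;

my $text = 'a a b b c cat dd dd';

# s///g returns how many substitutions were performed.
my $count = ( $text =~ s/\b(\w+)\s+\1\b/$1/g );

print "Removed $count doubled words: $text\n";
# Removed 3 doubled words: a b c cat dd
```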

Lookahead/Lookbehind Anchors

As you know by now, an anchor matches a particular place in a string without actually matching a character. Lookahead/behind anchors (and their negative counterparts) are primarily used with substitutions (and sometimes split()) to allow fine-grained control over matching. A positive lookahead allows you to match text following a regular expression, but not including it in the regular expression. The positive lookahead syntax is

(?=$regex)

For example, if you want to replace all xxx followed by yyy with ---, but not replacing the yyy itself, you can do this:

my $string = 'xxxyyy xxxbbb xxxyyy';
$string =~ s/
    xxx      # match xxx
    (?=yyy)  # followed by yyy, but not included in the match
  /---/xg;
print $string;

And that prints out:

---yyy xxxbbb ---yyy

The negative lookahead syntax is (?!$regex). That will allow you to match a regular expression not followed by another regular expression, but the negative lookahead will not be included in the match. So let’s say your young child is writing a “compare and contrast” essay regarding Queen Elizabeth of the United Kingdom and queen bees and ants. She writes this:

The queen rules over the United Kingdom and is loved by
her subjects but a queen ant just lays a lot of eggs.
The queen lives in a palace and the queen bee lives
in a hive.

Obviously you are horrified because the queen of the United Kingdom should be referred to as Queen Elizabeth in this context. So you write this:

my $childs_essay = <<'END_ESSAY';
The queen rules over the United Kingdom and is loved by
her subjects but a queen ant just lays a lot of eggs.
The queen lives in a palace and the queen bee lives
in a hive.
END_ESSAY
$childs_essay =~ s/the queen/Queen Elizabeth/gi;
print $childs_essay;

And that prints out:

Queen Elizabeth rules over the United Kingdom and is loved by
her subjects but a queen ant just lays a lot of eggs.
Queen Elizabeth lives in a palace and Queen Elizabeth bee lives
in a hive.

Obviously that’s not going to earn your daughter a good grade, so let’s use a negative lookahead to only replace those instances of queen not followed by the words ant or bee.

my $childs_essay = <<'END_ESSAY';
The queen rules over the United Kingdom and is loved by
her subjects but a queen ant just lays a lot of eggs.
The queen lives in a palace and the queen bee lives
in a hive.
END_ESSAY
$childs_essay =~
  s/
    the
    \s+
    queen
    \s+
    (?!ant|bee)
   /Queen Elizabeth /gxi;
print $childs_essay;

And that prints out the desired paragraph:

Queen Elizabeth rules over the United Kingdom and is loved by
her subjects but a queen ant just lays a lot of eggs.
Queen Elizabeth lives in a palace and the queen bee lives
in a hive.

Your daughter may not get a wonderful grade for the essay, but at least she’ll be following proper editorial style.

Note

The Queen Elizabeth/queen ant example seems fairly contrived, but it’s based on a true story of an online news organization whose computer-driven editorial rules had a news story about ants referring to Queen Elizabeth laying thousands of eggs and having a lifespan of many times that of her workers. We hope Her Majesty was amused.

Positive lookbehinds are designated with (?<=$regex) and negative lookbehinds are written as (?<!$regex). They are identical to their lookahead counterparts with two exceptions:

  • They match text before the regular expression

  • They cannot match a variable-width regex, meaning that *, + and ? quantifiers are not allowed.
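
To make lookbehinds concrete, here is a sketch using both forms. The price string and the $masked and @not_prices names are invented for the example.

```perl
use strict;
use warnings;

my $prices = 'cost $40, weight 40kg, was $55';

# Positive lookbehind: replace digits preceded by a dollar sign,
# leaving the dollar sign itself alone.
( my $masked = $prices ) =~ s/(?<=\$)\d+/XX/g;

# Negative lookbehind: find numbers NOT preceded by a dollar sign.
my @not_prices = $prices =~ /(?<!\$)\b\d+/g;

print "$masked\n";        # cost $XX, weight 40kg, was $XX
print "@not_prices\n";    # 40
```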

Named Subexpressions (5.10)

If you are using Perl 5.10 or better, you can also use named subexpressions. Ordinarily you refer to a captured group in the regex with \1, \2 and so on. After a successful match, those are $1, $2 and so on. With named subexpressions you can name them and make things easier to read.

To name a subexpression, use the syntax (?<name>...). To refer to it again inside of the regex, use \g{name}. To refer to the match outside of the regex, be aware that it’s a key in the special %+ hash. For example, our double-word stripper would look something like this:

Note

The %+ hash is a special variable that only contains entries for the last successfully matched named subexpressions in the current scope. Thus, if a named subexpression fails to match, it will not have an entry in the %+ hash. There is a corresponding %- hash that we will not cover here. See perldoc perlvar and perldoc perlretut for more information.

use v5.10;
my $text = "a a b b c cat dd dd";
$text =~
   s/
    \b
    (?<word>\w+)
    \s+
    \g{word}
    \b
    /$+{word}/gx;
print $text;

For a clearer example consider matching dates. You may remember our code for converting a date to the ISO 8601 format. Here we rewrite it with named subexpressions.

Before:

my $provided_date = '28-9-2011';
$provided_date =~
s{
    (\d\d?)      # day
    [-/]         # - or /
    (\d\d?)      # month
    [-/]         # - or /
    (\d\d\d\d)   # year
}
{
    sprintf "$3-%02d-%02d", $2, $1
}ex;
print $provided_date;    # 2011-09-28

After:

my $provided_date = '28-9-2011';
$provided_date =~
s{
    (?<day>\d\d?)
    [-/]
    (?<month>\d\d?)
    [-/]
    (?<year>\d\d\d\d)
}
{
    sprintf "$+{year}-%02d-%02d", $+{month}, $+{day}
}ex;
print $provided_date;    # 2011-09-28

This has an added advantage of no longer requiring you to keep track of the number of the capture. Thus, if you needed to switch the day and month around:

s{
        (?<month>\d\d?)      # month
        [-/]
        (?<day>\d\d?)
        [-/]
        (?<year>\d\d\d\d)
    }
    {
        sprintf "$+{year}-%02d-%02d", $+{month}, $+{day}
    }ex;

You’ll note that the regular expression changed, but the substitution did not.

You can also use the named parameters outside of the substitution as long as you’re in the same scope. For example:

print LOGFILE "converted provided date to ",
     sprintf "$+{year}-%02d-%02d", $+{month}, $+{day};
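
The same idea works with a plain match rather than a substitution. The following is a sketch only, and $iso_date is a name we made up; the named captures are read from %+ inside the if block, where the match is still in scope.

```perl
use strict;
use warnings;
use v5.10;

my $provided_date = '28-9-2011';
my $iso_date      = '';

if ( $provided_date =~ m{(?<day>\d\d?)[-/](?<month>\d\d?)[-/](?<year>\d\d\d\d)} ) {
    # %+ holds the named captures from the match above.
    $iso_date = sprintf "%04d-%02d-%02d", $+{year}, $+{month}, $+{day};
}
say "converted provided date to $iso_date";
# converted provided date to 2011-09-28
```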

Common Regular Expression Issues

While we’re messing about with regular expressions, let’s consider a few common issues that arise. These may seem almost superfluous to this chapter, but these issues are raised so often that they bear mentioning. We’ll cover a few things you can do with regular expressions, along with a few things you should not do.

Regexp::Common

You know a number might be represented as 2, 2.3, .4, -3e17 and so on. There are a variety of ways in which you can legally write a number and writing a regular expression for it is hard. So don’t write it. When you need a regular expression that you think someone else has already written, look at the Regexp::Common module and see if it’s in there. Here’s how to match a real number:

use Regexp::Common;
print "yes" if '-3e17' =~ $RE{num}{real};
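To see why a hand-rolled pattern is risky, here is a small sketch that runs the sample numbers above through a deliberately naive regular expression. The pattern and the naive_is_number() helper are illustrations invented for this sketch, not something from Regexp::Common.

```perl
use strict;
use warnings;

# A deliberately naive, hand-rolled pattern (an illustration, not from the
# chapter): optional minus sign, digits, optional fractional part.
my $naive_number_re = qr/^-?\d+(?:\.\d+)?$/;

sub naive_is_number {
    my ($candidate) = @_;
    return $candidate =~ $naive_number_re ? 1 : 0;
}

foreach my $candidate ( '2', '2.3', '.4', '-3e17' ) {
    printf "%-6s %s\n", $candidate,
        naive_is_number($candidate) ? 'matched' : 'missed';
}
```

The first two candidates match, but .4 and -3e17 are missed. Every fix you bolt on tends to open another gap.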

Here’s how to blank out profanity (knowing it would never get through the editorial process, I regretfully omitted the full example).

use Regexp::Common;
my $text = 'something awful or amusing';
$text =~ s/($RE{profanity})/'*' x length($1)/eg;
print $text;

There’s plenty more in this module, so go install it and have fun reading the docs.

Email Addresses

If you’ve never read RFC 822 (http://tools.ietf.org/html/rfc822), your author recommends that you do. It’s a great way to get to sleep. It’s also a great way to realize that if you’ve been trying to validate email addresses with a regular expression, you’ve been doing it wrong.

Email addresses can contain comments. The local part of an address (the part before the @) cannot contain spaces (unless they’re in comments), but it can contain dashes. It can even start with a dash. Lots of people with last names like O’Malley have trouble sending and receiving email because an address containing an apostrophe is perfectly valid, but many email validation tools think that apostrophe is naughty.

So whatever you do, don’t do this:

if ( $email =~ /^\w+\@(?:\w+\.)+\w+$/ ) {
    # Congrats. Many good emails rejected!
}

We see that a lot in code. It doesn’t work. In fact, you can’t use regular expressions to match email addresses. The one your author knows of that is closest to being correct is an almost one-hundred line (beautiful) monstrosity written by Jeffrey Friedl. You can see it in the source code of Email::Valid, the module you should use instead.

use Email::Valid;
print (Email::Valid->address($maybe_email) ? 'yes' : 'no');

Actually, Email::Valid will only tell you if the email address is well-formed. If you ask nicely, it will try to tell you if the host exists. It cannot tell you if the email is valid.

Note

There is one and only one way to know if an email is valid: send an email to it and hope someone responds. Even then, you may get a false bounce or the mail server might be down. Nothing’s perfect.
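To make the earlier warning concrete, this sketch runs a few addresses through a naive pattern in the spirit of the one shown in this section. The addresses are hypothetical, all on example.com, and naive_accepts() is a helper invented for this illustration.

```perl
use strict;
use warnings;

# A naive pattern in the spirit of the one shown earlier in this section.
my $naive_email_re = qr/^\w+\@(?:\w+\.)+\w+$/;

sub naive_accepts {
    my ($address) = @_;
    return $address =~ $naive_email_re ? 1 : 0;
}

# Hypothetical addresses, for illustration only.
my @addresses = (
    'bob@example.com',
    q{o'malley@example.com},
    'first.last@example.com',
    '-dash@example.com',
);

foreach my $address (@addresses) {
    printf "%-24s %s\n", $address,
        naive_accepts($address) ? 'accepted' : 'rejected';
}
```

Only the first address is accepted. The other three are well-formed, but the apostrophe, the dot, and the leading dash in the local part all fall outside \w.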

HTML

Sooner or later every programmer hears about people trying to parse HTML with regular expressions. Here’s an attempt your author tried to make once:

$html =~ s{
             (<a\s(?:[^>](?!href))*href\s*)
             (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
             ([^>]+>)
          }
          {$1 . decode_entities($2) .  $4}gsexi;

Do you know what that does? Neither does your author. He can’t remember, but he doesn’t care because it doesn’t work. He learned to use a proper HTML parser. HTML does not have a regular grammar and thus cannot be properly parsed with regular expressions.

That being said, if you are using a well-defined subset of HTML and you are writing small, one-off scripts for extracting data, feel free to use regexes. Just don’t blame anyone but yourself when it breaks. Instead, consider HTML::TreeBuilder, HTML::TokeParser::Simple, or any of a variety of other great HTML parsing modules.
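As a rough illustration of how quickly a regex-based approach breaks, the sketch below uses a naive link extractor on a few lines of hypothetical markup. Both naive_hrefs() and the sample HTML are invented for this example.

```perl
use strict;
use warnings;

# A naive link extractor (an illustration, not from the chapter). It assumes
# href is the first attribute and is always double-quoted.
sub naive_hrefs {
    my ($html) = @_;
    return $html =~ /<a\s+href="([^"]*)"/g;
}

my $html = <<'END_HTML';
<a href="/one">one</a>
<a class="nav" href="/two">two</a>
<a href='/three'>three</a>
<!-- <a href="/four">commented out</a> -->
END_HTML

print "$_\n" for naive_hrefs($html);
```

This prints /one and /four. It misses two real links (a different attribute order and single quotes) and reports one that is commented out, which is exactly the kind of breakage a real parser avoids.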

Composing regular expressions

Sometimes you find that a regular expression gets very complicated. For example, you might want to match an employee number in the format “department-grade-number”, where the department is one of four valid department codes for a company (AC, IT, MG, or JA), the grade is a two-digit number from 00 to 20, and the number is any five- or six-digit number. The regular expression might look like this:

if ( /\b(AC|IT|MG|JA)-([01]\d|20)-(\d{5,6})\b/ ) {
    my $dept       = $1;
    my $grade      = $2;
    my $emp_number = $3;
    ...
}

As far as regular expressions go, that one isn’t too bad, but maybe you want it to be a bit easier to read. You can compose regular expressions easily by using variables and the qr// operator.

my $depts         = join '|' => qw(AC IT MG JA);
my $dept_re       = qr/$depts/;
my $grade_re      = qr/[01]\d|20/;
my $emp_number_re = qr/\d{5,6}/;
if ( /\b($dept_re)-($grade_re)-($emp_number_re)\b/ ) {
    my $dept       = $1;
    my $grade      = $2;
    my $emp_number = $3;
    ...
}

The qr// operator will “quote” your regular expression and, in some cases, pre-compile it, which can improve performance when you later use it in a match.
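One detail worth seeing in action: a qr// object carries its own modifiers and is wrapped in its own group when interpolated, so its alternation and flags do not leak into the surrounding pattern. The following sketch uses hypothetical patterns and a hypothetical mentions_pet() helper.

```perl
use strict;
use warnings;

# Hypothetical patterns, for illustration only.
my $animal_re = qr/cat|dog/i;              # case-insensitive on its own
my $pet_re    = qr/\bmy $animal_re\b/;     # the outer part stays case-sensitive

sub mentions_pet {
    my ($text) = @_;
    return $text =~ $pet_re ? 1 : 0;
}

print mentions_pet('I fed my CAT'), "\n";    # prints 1
print mentions_pet('I fed MY cat'), "\n";    # prints 0
```

The /i on $animal_re applies only to cat|dog, and the alternation stays contained, so the outer pattern means “my” followed by either animal rather than “my cat” or “dog”.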

As a more complicated example, your author was writing a pre-processor for Prolog code (Prolog is a programming language) and wanted to match math expressions. The following are all valid math expressions in Prolog (the actual code is far more complicated):

2 + 3
Var
-3.2e5 % SomeVar / Var

The code to match those is presented in Example 8.2, “Building complex regular expressions from smaller ones.”

Example 8.2. Building complex regular expressions from smaller ones.

use strict;
use warnings;
use diagnostics;
use Regexp::Common;
my $num_re = $RE{num}{real};
my $var_re = qr/[[:upper:]][[:alnum:]_]*/;
my $op_re  = qr{[-+*/%]};
my $math_term_re = qr/$num_re|$var_re/;
my $expression_re = qr/
    $math_term_re
    (?:
        \s*
        $op_re
        \s*
        $math_term_re
    )*
/x;
my @expressions = (
    '2 + 3',
    ' + 2 - 3',
    'Var',
    '-3.2e5 % SomeVar / Var',
    'not_a_var + 2',
);
foreach my $expression (@expressions) {
    if ( $expression =~ /^$expression_re$/ ) {
        print "($expression) is a valid expression\n";
    }
    else {
        print "($expression) is not a valid expression\n";
    }
}

Note

composed_regexes.pl available for download at Wrox.com.

That will print our desired output:

(2 + 3) is a valid expression
( + 2 - 3) is not a valid expression
(Var) is a valid expression
(-3.2e5 % SomeVar / Var) is a valid expression
(not_a_var + 2) is not a valid expression

You may be thinking that the regular expression isn’t that complicated, but if you print out the entire thing, it looks like this (wrapped to fit this page; the line breaks are not part of the actual pattern):

/(?x-ism:(?-xism:(?:(?i)(?:[+-]?)(?:(?=[.]?[0123456789])(?:[
0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[
E])(?:(?:[+-]?)(?:[0123456789]+))|))|(?-xism:[[:upper:]][
[:alnum:]_]*))(?:\s*(?-xism:[-+*/%])\s*(?-xism:(?:(?i)(?:[
+-]?)(?:(?=[.]?[0123456789])(?:[0123456789]*)(?:(?:[.])
(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789
]+))|))|(?-xism:[[:upper:]][[:alnum:]_]*)))*)/x

If you want to write that by hand, be my guest, but don’t ask anyone (including yourself) to debug it.
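If you’d like to inspect a composed pattern yourself, interpolating a qr// object into a string gives you its source form. The sketch below uses a smaller, hypothetical $pair_re built only from core features. Note that the exact notation depends on your Perl version: Perl 5.14 and later print flags in the shorter (?^...) form rather than the (?-xism:...) form shown above.

```perl
use strict;
use warnings;

my $var_re  = qr/[[:upper:]][[:alnum:]_]*/;
my $op_re   = qr{[-+*/%]};

# A smaller, hypothetical composition: variable, operator, variable.
my $pair_re = qr/$var_re \s* $op_re \s* $var_re/x;

# Interpolating a qr// object into a string yields its source form.
my $as_text = "$pair_re";
print "$as_text\n";
```

Printing the pattern this way is a handy debugging step when a composed regular expression doesn’t match what you expect.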

Summary

Regular expressions are very powerful and we’ve only skimmed the surface of what they can do. This chapter has tried to focus on what you’ll most likely encounter in the real world, but there are many areas of regular expressions we’ve only touched on. You should read the following to learn more:

perldoc perlre
perldoc perlretut
perldoc perlrequick
perldoc perlreref

If you have Perl version 5.12 or above installed, you can also read perldoc perlrebackslash and perldoc perlrecharclass. You can also read them on http://perldoc.perl.org/.

Additionally, the excellent book “Mastering Regular Expressions” by Jeffrey Friedl is highly recommended.

By now, you should understand most common uses of regular expressions including matching arbitrary text, making substitutions and extracting useful data from strings.

Exercises

  1. In the US, Social Security numbers are a sequence of three digits, followed by a dash, followed by two digits, followed by another dash, followed by four digits. They might look like this: 123-45-6789.

    Ignoring that not all combinations of numbers are valid, write a regular expression that matches a US Social Security number.

  2. Imagine you have a block of the following text read from a file:

    my $employee_numbers = <<'END_EMPLOYEES';
    alice: 48472
    bob:34582
    # we need to fire charlie
       charlie : 45824
    # denise is a new hire
    denise : 34553
    END_EMPLOYEES

    Those are employee login names and their user numbers. Obviously an admin has been sloppy in just keeping these in a text file. Write code that will read that text and create a hash with employee usernames as the keys and employee numbers as the values. There should be no leading or trailing whitespace in either the keys or the values. Empty lines and lines starting with a # can be ignored.

  3. Given the following text with dates embedded in the YYYY-MM-DD format, write code that will rewrite them as $monthname $day, $year. For example, 2011-02-03 should become February 3, 2011. Assume the dates are valid (in other words, not January 40th or something stupid like that).

    my $text = <<'END';
    We hired Mark in 2011-02-03. He's working on product
    1034-34-345A. He is expected to finish the work on or
    before 2012-12-12 because our idiot CEO thinks the world
    will end.
    END

WHAT YOU LEARNED IN THIS CHAPTER

Topic                          Key Concepts

Regular expressions            Patterns to describe strings
Quantifiers                    Matching a pattern a variable number of times
Escape sequences               Sequences for controlling matches
Extracting data                Extracting matched data into variables
Modifiers                      Special trailing characters that alter regex behavior
Anchors                        Matching “places” in a string and not characters
Character classes              Groups of individual characters
Grouping                       Groups of patterns
Substitutions                  Replacing matched text
Regexp::Common                 A module providing many common regular expressions
Email::Valid                   A module to properly validate an email address
Lookahead/lookbehind anchors   Anchors to match text before and after a regex
Named subexpressions           A cleaner way to match data
Composed regexes               Building complex regexes from smaller ones

Answers to exercises

  1. In the US, Social Security numbers are a sequence of three digits, followed by a dash, followed by two digits, followed by another dash, followed by four digits. They might look like this: 123-45-6789.

    Ignoring that not all combinations of numbers are valid, write a regular expression that matches a US Social Security number.

    my $social_security_re = qr/\b(\d{3})-(\d{2})-(\d{4})\b/;
    # or
    my $social_security_re = qr/\b(\d\d\d)-(\d\d)-(\d\d\d\d)\b/;

    There are, of course, other solutions.

    Note that we use word boundaries at the beginning and end of the regex. If we didn’t, we could easily have something like this matching:

    44444444444444-44-44444444444444

    We don’t want that, obviously.

  2. Imagine you have a block of the following text read from a file:

    my $employee_numbers = <<'END_EMPLOYEES';
    alice: 48472
    bob:34582
    # we need to fire charlie
       charlie : 45824
    # denise is a new hire
    denise : 34553
    END_EMPLOYEES

    Those are employee login names and their user numbers. Obviously an admin has been sloppy in just keeping these in a text file. Write code that will read that text and create a hash with employee usernames as the keys and employee numbers as the values. There should be no leading or trailing whitespace in either the keys or the values. Empty lines and lines starting with a # can be ignored.

    Part of the art of writing regular expressions is knowing your data. A regular expression is often crafted for a particular quick and dirty job. In this case, we’ll assume that usernames contain only alphabetical characters and user numbers are five-digit numbers.

    use strict;
    use warnings;
    use Data::Dumper;
    my $employee_numbers = <<'END_EMPLOYEES';
    alice: 48472
    bob:34582
    # we need to fire charlie
       charlie : 45824
    # denise is a new hire
    denise : 34553
    END_EMPLOYEES
    my %employee_number_for;
    while ( $employee_numbers =~ /^ \s* (\w+) \s* : \s* (\d{5}) \s* $/gmx ) {
       $employee_number_for{$1} = $2;
    }
    print Dumper \%employee_number_for;

    This example looked easy, but the /g was needed to match all of the employees and the /m (multiline) is used to make the ^ and $ anchors treat each line of text as a separate string. That prints out (the order of the keys may differ on your system):

    $VAR1 = {
              'alice' => '48472',
              'denise' => '34553',
              'charlie' => '45824',
              'bob' => '34582'
            };

  3. Given the following text with dates embedded in the YYYY-MM-DD format, write code that will rewrite them as $monthname $day, $year. For example, 2011-02-03 should become February 3, 2011. Assume the dates are valid (in other words, not January 40th or something stupid like that).

    my $text = <<'END';
    We hired Mark in 2011-02-03. He's working on product
    1034-34-345A. He is expected to finish the work on or
    before 2012-12-12 because our idiot CEO thinks the world
    will end.
    END
    my %month_for = (
       '01' => 'January',
       '02' => 'February',
       '03' => 'March',
       '04' => 'April',
       '05' => 'May',
       '06' => 'June',
       '07' => 'July',
       '08' => 'August',
       '09' => 'September',
       '10' => 'October',
       '11' => 'November',
       '12' => 'December',
    );
    $text =~ s{\b(\d\d\d\d)-(\d\d)-(\d\d)\b}
              {sprintf "$month_for{$2} %d, %d", $3, $1}ge;
    print $text;

    And that will print out:

    We hired Mark in February 3, 2011. He's working on product
    1034-34-345A. He is expected to finish the work on or
    before December 12, 2012 because our idiot CEO thinks the world
    will end.

    There’s nothing terribly tricky in this one, but we had to quote the hash keys because otherwise Perl would interpret those as octal numbers (see Chapter 4, Working With data).

    The sprintf() formats are also fairly straightforward. In reality, using a module like DateTime would help us validate that these are valid dates.
