Vol. 1, No. 34
Intro to Regular Expressions
In Perl, as in life, we spend a great deal of time looking for that
one thing that would make our lives perfect. In real life, this can
be an exhausting and fruitless quest. Perl, on the other hand, was
specifically designed with this kind of task in mind. As a result,
Perl tools make it easy to find what you're searching for. Regular
expressions are one of the most powerful and fun tools provided by
Perl that will help you actually search out just what you've been
looking for.
If you've ever experimented with using wildcards to search for
filenames or the like (e.g., typing the string "win*" to match both
"windows.exe" and "winning_is_everything.mp3"), you've played with
regular expressions' little brother. Regular expressions (let's call
them regexps and save ourselves a little typing, OK?) are based on
the same idea, using general patterns that match certain specific
strings of characters, but they're much more convoluted and robust.
Regexps are used for many purposes in programming when you're dealing
with a list of data and you need to look through it efficiently.
Here's an introduction to the basics.
The asterisk ( * ) and its ilk are known as metacharacters:
characters that have a special meaning for regexps. A regular
expression that doesn't include metacharacters will match any string
that contains it. For example, the regexp /peak/ will match the
strings "peak performance" and "Speaker of the House." (Don't mind
the slashes; they just mark the beginning and end of the regexp.)
Regexps are case-sensitive, though: /peak/ won't match "Pikes Peak,"
because the latter has a capital P. Brackets make a regexp match any
one of the characters listed inside them. So, the regexp
/[pP]eak/ would match both "Peak" and "peak," but not "PEAK."
Similarly, /pe[ae]k/ would match both "peak" and "peek," as well
as "peekskill," but not "peaek."
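To try these out, here's a minimal sketch in Perl. The array and
variable names are made up for illustration, and grep is just one
convenient way to keep the strings a regexp matches.

```perl
use strict;
use warnings;

# Sample strings borrowed from the examples above.
my @samples = ("peak performance", "Speaker of the House", "Pikes Peak",
               "PEAK", "peekskill", "peaek");

# grep keeps only the strings that the regexp matches.
my @plain  = grep { /peak/ }    @samples;  # literal, case-sensitive
my @either = grep { /[pP]eak/ } @samples;  # p or P in the first spot
my @vowel  = grep { /pe[ae]k/ } @samples;  # a or e in the third spot

print "/peak/     matches: ", join(", ", @plain),  "\n";
print "/[pP]eak/  matches: ", join(", ", @either), "\n";
print "/pe[ae]k/  matches: ", join(", ", @vowel),  "\n";
```

Note that "Speaker of the House" shows up in all three lists, because
"peak" is hiding inside "Speaker."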
You can also indicate a range of numbers or letters within brackets:
/[3-8]/ would match any of the following characters: 3, 4, 5, 6, 7,
or 8. Also, if you stick a "^" at the beginning of the bracketed
sequence, the regexp will match any single character except the ones
inside the brackets. Therefore, /[^aeiou]/ will match any character
besides a, e, i, o, or u.
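A quick sketch of ranges and negated brackets, assuming some made-up
sample text. It leans on one thing not covered above: a match with a
trailing g, used in list context, hands back every piece that matched.

```perl
use strict;
use warnings;

my $text = "room 42, floor 9, seat 7";   # hypothetical sample text

# Every single character that falls in the range 3 through 8.
my @in_range = $text =~ /[3-8]/g;

# Every single character that is NOT a lowercase vowel.
my @non_vowels = "education" =~ /[^aeiou]/g;

print "in range:   @in_range\n";
print "non-vowels: @non_vowels\n";
```

The 2 and the 9 are left out of the first list because they fall
outside 3-8.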
Using a period ( . ) in a regexp will match any single character
in that particular spot, including spaces but not including line
breaks. Hence, the regexp /p.ck/ will match "pick" and "puck,"
but not "pluck." The regexp /p..ck/ will match "pluck," "pinck,"
and "pp ck."
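Here's a small sketch of the period at work. The word list is made
up, and the last entry has a line break in the middle to show the one
thing a period won't match.

```perl
use strict;
use warnings;

# "p\nck" contains a line break between the p and the ck.
my @words = ("pick", "puck", "p ck", "pluck", "pinck", "p\nck");

my @one_between = grep { /p.ck/ }  @words;  # exactly one character between p and ck
my @two_between = grep { /p..ck/ } @words;  # exactly two characters between p and ck

print "one between: ", join(", ", @one_between), "\n";
print "two between: ", join(", ", @two_between), "\n";
```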
The asterisk has a very specific meaning in regexp-land. In a
regexp, it will match any number (zero or more) of the preceding
character or bracketed set. For example, the regexp /b*/ will match
"bbb" or "bbbbbbbb" or "b" or even "" (a string consisting of no b's).
Along the same lines, "?" will match zero or one of the preceding,
and "+" will match one or more of the preceding. Putting what we've
learned in this paragraph together with what we learned in the
previous one, we see that the regexp /.*/ will match anything
at all, or nothing. Now that's flexibility! By the same token,
/.?/ will match any single character, or no character.
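To see the three side by side, here's a minimal sketch that runs the
same (made-up) list of strings past *, +, and ?.

```perl
use strict;
use warnings;

my @tries = ("ac", "abc", "abbbc");   # zero, one, and three b's

my @star  = grep { /ab*c/ } @tries;   # zero or more b's
my @plus  = grep { /ab+c/ } @tries;   # one or more b's
my @maybe = grep { /ab?c/ } @tries;   # zero or one b

# An empty string still matches /b*/, since zero b's is allowed.
my $empty_ok = ("" =~ /b*/) ? "yes" : "no";

print "*: @star\n";
print "+: @plus\n";
print "?: @maybe\n";
print "empty string matches /b*/: $empty_ok\n";
```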
Instead of *, +, or ?, we can use a specific number in curly
brackets. /b.{2}k/ will match "book," "beak," or "br k." If the
number in brackets is followed by a comma ( , ), it means "or more."
So, /bo{2,}k/ will match "book," "booook," and so on.
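The same two patterns, sketched in Perl against a few made-up test
strings (including "bok," which is one letter short for both):

```perl
use strict;
use warnings;

# Exactly two characters of any kind between b and k.
my @exactly_two = grep { /b.{2}k/ }  ("book", "beak", "br k", "bok");

# Two or more o's between b and k.
my @two_or_more = grep { /bo{2,}k/ } ("bok", "book", "booook");

print "exactly two: ", join(", ", @exactly_two), "\n";
print "two or more: ", join(", ", @two_or_more), "\n";
```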
All right, now we're getting someplace. Here are a few more
metacharacters: The backslash ( \ ) is the "escape character,"
indicating that the next character is special (or, if it's already a
metacharacter, that it should be taken literally). For example, /\w/
will match any "word" character (a letter, a digit, or an
underscore); /\W/ will match any character that is NOT a word
character; /\d/ will match any digit (/\d/ is synonymous with
/[0-9]/); /\D/ will match any character that's not a digit
( /[^0-9]/ ). /\s/ will match a whitespace character: a space, tab,
or line break. And, guess what: /\S/ will match any character
that's not whitespace.
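Here's a rough sketch of the backslash shortcuts pulling a line
apart. The sample line is invented (it has a tab in the middle), and
the trailing g in list context returns every match.

```perl
use strict;
use warnings;

my $line = "Order 66:\tship 3 crates";   # hypothetical sample; \t is a tab

my @digits = $line =~ /\d/g;    # each single digit
my @words  = $line =~ /\w+/g;   # runs of word characters
my @chunks = $line =~ /\S+/g;   # runs of anything that isn't whitespace

print "digits: ", join(", ", @digits), "\n";
print "words:  ", join(", ", @words),  "\n";
print "chunks: ", join(", ", @chunks), "\n";
```

The difference between the last two lists is the colon: it isn't a
word character, but it isn't whitespace either, so only /\S+/ keeps
it attached to the 66.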
Just a couple more: When "^" is not within brackets, it means "only
at the beginning of the search string"; similarly, "$" means "only
at the end of the search string." So, the regexp /^all/ would match
"all" or "all of the time" or "alleyway," but not "ball." /^.{5}$/
would match only lines of five characters.
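A short sketch of the two anchors, using the strings above plus a
few made-up ones:

```perl
use strict;
use warnings;

# "all" must sit at the very beginning of the string.
my @starts = grep { /^all/ } ("all", "all of the time", "alleyway", "ball");

# Exactly five characters between the beginning and the end.
my @five = grep { /^.{5}$/ } ("hello", "hi", "hello!", "a b c");

print "starts with all: ", join(", ", @starts), "\n";
print "five characters: ", join(", ", @five),   "\n";
```

"a b c" makes the second list because the spaces count as characters
too.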
Now that you have a decent grounding in how to put together regexp
strings, let's learn how you can apply that knowledge. Perl uses
regexps in simple operations on variables that accomplish a
variety of tasks. The simplest is to search the contents of a variable
to see if a certain pattern can be found. This operation returns
true if it is found and false if it is not. We can use the "m"
(match) operator for a simple search. If the variable $x contains the text
we want to search, and we want to find anything that might be a
Social Security number (that is, any number having the pattern
xxx-xx-xxxx), the syntax would look like this:
($x =~ m/\d{3}-\d{2}-\d{4}/);
It's statements like this that lead to the claim that a Perl
program looks like an assortment of random characters. But in
fact, it does make sense: The regexp is delimited by the /
characters, and it means "search for three digits, then a dash,
then two digits, then a dash, then four digits."
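Here's one way that search might sit inside a small program. The
helper name looks_like_ssn and the sample strings are hypothetical;
only the regexp itself comes from the example above.

```perl
use strict;
use warnings;

# Returns 1 if the text contains something shaped like xxx-xx-xxxx,
# and 0 otherwise.
sub looks_like_ssn {
    my ($text) = @_;
    return ($text =~ m/\d{3}-\d{2}-\d{4}/) ? 1 : 0;
}

my $x = "Please file form 123-45-6789 by Friday.";
my $y = "No numbers to see here.";
my $z = "Part no. 1234-56-78901";   # longer digit runs around the dashes

print "x: ", looks_like_ssn($x), "\n";
print "y: ", looks_like_ssn($y), "\n";
print "z: ", looks_like_ssn($z), "\n";
```

The third string also comes back true, since the pattern is happy to
find its digits in the middle of longer numbers. That's why this only
finds things that MIGHT be a Social Security number.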
Once you've mastered "m," we can move on to "s," the substitution
command. Let's go through our text and replace every instance of
"TriPod" with "Tripod."
($x =~ s/TriPod/Tripod/g);
The g at the end means "do it globally": replace every single
occurrence. Without the g, only the first "TriPod" the command
found would be replaced.
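To see the difference the g makes, here's a minimal sketch that runs
the substitution both ways on the same made-up sentence. As a bonus,
s returns the number of replacements it made.

```perl
use strict;
use warnings;

my $once  = "TriPod hosts pages. TriPod rocks.";   # hypothetical sample
my $every = $once;                                 # a second copy to compare

$once =~ s/TriPod/Tripod/;                   # no g: first occurrence only
my $count = ($every =~ s/TriPod/Tripod/g);   # with g: every occurrence

print "without g: $once\n";
print "with g:    $every ($count replacements)\n";
```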
That's a healthy dose of regular expressions. Just a few simple
metacharacters, but they can be used very effectively. Here's one
more little regexp for you to puzzle out: /[^0-9a-fA-F]$/
HINTS, POINTERS, AND TIPS O' THE TRADE:
A regular expression will by default try to match the largest
possible string it can. Be careful: this can trip you up if
you're not expecting it. If you're trying to match a phrase within
parentheses, for example, a regexp like /\(.*\)/ will match the
entirety of the text between the first open parenthesis and the
last close parenthesis. (The backslashes are needed because bare
parentheses are metacharacters used for grouping.) If you want to
just get what's within a particular pair of parentheses, you'll need
something like /\([^)]*\)/ to match everything after the first open
parenthesis until THE NEXT close parenthesis.
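A sketch of the greedy trap, assuming a made-up sentence with two
parenthesized bits. Two things here go beyond the text above: the
parentheses are written with backslashes so Perl treats them as
literal characters rather than grouping, and the special variable $&
holds whatever the last successful match grabbed.

```perl
use strict;
use warnings;

my $text = "call (alpha) then (beta) now";   # hypothetical sample

# Greedy: .* runs all the way to the LAST close parenthesis.
my $greedy = ($text =~ /\(.*\)/)    ? $& : "";

# [^)]* refuses to cross a close parenthesis, so it stops at the NEXT one.
my $tight  = ($text =~ /\([^)]*\)/) ? $& : "";

print "greedy: $greedy\n";
print "tight:  $tight\n";
```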
Attaching the modifier "i" to the end of a regexp command means
"don't pay attention to letter case in what you're searching."
This command:
($x =~ s/tripod/Tripod/gi);
will make every "tripod," "TRIpod," and "TripOD" into "Tripod."
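Here's that substitution as a runnable sketch, along with a plain
match that uses the same modifier. The sample strings are the ones
from the examples above.

```perl
use strict;
use warnings;

my $x = "tripod, TRIpod, and TripOD";
$x =~ s/tripod/Tripod/gi;     # g: every occurrence, i: ignore letter case

# The i modifier works on plain matches too.
my $peak = ("Pikes Peak" =~ /peak/i) ? "match" : "no match";

print "$x\n";
print "Pikes Peak vs. /peak/i: $peak\n";
```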
Removing gum from clothing is a three-step process. First make
the gum brittle by holding plastic-wrapped ice cubes against it.
Then chip off as much gum as you can with a butter knife. Third,
spray on cheap hairspray to lift off the residue.
RESOURCES:
The Perl regexp manual
Webmonkey's regexp rundown