Tripod

   handcrafted

Vol. 1, No. 34
Intro to Regular Expressions


In Perl, as in life, we spend a great deal of time looking for that one thing that would make our lives perfect. In real life, this can be an exhausting and fruitless quest. Perl, on the other hand, was specifically designed with this kind of task in mind. As a result, Perl tools make it easy to find what you're searching for. Regular expressions are one of the most powerful and fun tools provided by Perl that will help you actually search out just what you've been looking for.

If you've ever experimented with using wildcards to search for filenames or the like (e.g., typing the string "win*" to match both "windows.exe" and "winning_is_everything.mp3"), you've played with regular expressions' little brother. Regular expressions (let's call them regexps and save ourselves a little typing, OK?) are based on the same idea — using general patterns that match certain specific strings of characters — but they're much more convoluted and robust. Regexps are used for many purposes in programming when you're dealing with a list of data and you need to look through it efficiently. Here's an introduction to the basics.

The asterisk ( * ) and its ilk are known as metacharacters: characters that have a special meaning for regexps. A regular expression that doesn't include metacharacters will match any string that contains it. For example, the regexp /peak/ will match the strings "peak performance" and "Speaker of the House." (Don't mind the slashes; they just mark the beginning and end of the regexp.) Regexps are case-sensitive, though: /peak/ won't match "Pikes Peak," because the latter has a capital P. Brackets make a regexp match any one of the characters listed inside them. So, the regexp /[pP]eak/ would match both "Peak" and "peak," but not "PEAK." Similarly, /pe[ae]k/ would match both "peak" and "peek," as well as "peekskill," but not "peaek."
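To see these first examples run, here is a minimal sketch. The helper name verdict is made up for this illustration; the patterns and strings are the ones from the paragraph above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "match" or "no match" for a string and a pattern.
sub verdict {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? "match" : "no match";
}

# A plain pattern matches anywhere inside the string, but is case-sensitive.
print "peak performance:     ", verdict("peak performance",     qr/peak/), "\n";
print "Speaker of the House: ", verdict("Speaker of the House", qr/peak/), "\n";
print "Pikes Peak:           ", verdict("Pikes Peak",           qr/peak/), "\n";

# Brackets match any ONE of the listed characters at that position.
print "PEAK vs [pP]eak:      ", verdict("PEAK",      qr/[pP]eak/), "\n";
print "peekskill vs pe[ae]k: ", verdict("peekskill", qr/pe[ae]k/), "\n";
print "peaek vs pe[ae]k:     ", verdict("peaek",     qr/pe[ae]k/), "\n";
```

Only "Pikes Peak," "PEAK," and "peaek" should come back as "no match."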

You can also indicate a range of digits or letters within brackets: /[3-8]/ would match any one of the following characters: 3, 4, 5, 6, 7, or 8. Also, if you stick a "^" at the beginning of the bracketed sequence, the regexp will match any single character except those inside the brackets. Therefore, /[^aeiou]/ will match any character besides a, e, i, o, or u.
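Ranges and negated brackets are easy to try out by filtering a list of single characters. This is only a sketch: the candidate lists are invented for the example, and grep here is Perl's built-in list filter, not the command-line tool.

```perl
use strict;
use warnings;

# Keep only the candidates that contain a character in the range 3 through 8.
my @candidates = (1, 3, 5, 8, 9);
my @in_range   = grep { /[3-8]/ } @candidates;
print "In range 3-8: @in_range\n";        # prints "In range 3-8: 3 5 8"

# Keep only the letters that are NOT lowercase vowels.
my @letters    = qw(a b e z u);
my @non_vowels = grep { /[^aeiou]/ } @letters;
print "Not a vowel: @non_vowels\n";       # prints "Not a vowel: b z"
```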

Using a period ( . ) in a regexp will match any single character in that particular spot, including spaces but not including line breaks. Hence, the regexp /p.ck/ will match "pick," "puck," and "pock," but not "pluck." The regexp /p..ck/ will match "pluck," "pinck," and "pp ck."
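Here is a small sketch that tries a one-period pattern and a two-period pattern against a handful of words. The word list and the hash names are chosen just for this example.

```perl
use strict;
use warnings;

my %one_dot;    # word => does /p.ck/ match?
my %two_dots;   # word => does /p..ck/ match?

for my $word ("pick", "puck", "pluck", "pinck") {
    $one_dot{$word}  = $word =~ /p.ck/  ? "yes" : "no";
    $two_dots{$word} = $word =~ /p..ck/ ? "yes" : "no";
    print "'$word'  one period: $one_dot{$word}  two periods: $two_dots{$word}\n";
}

# By default the period does not match a line break.
my $across_newline = "p\nck" =~ /p.ck/ ? "yes" : "no";
print "across a line break: $across_newline\n";
```

The four-letter words should match only the one-period pattern, and the five-letter words only the two-period pattern.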

The asterisk has a very specific meaning in regexp-land. In a regexp, it matches zero or more repetitions of whatever immediately precedes it. For example, the regexp /b*/ will match "bbb" or "bbbbbbbb" or "b" or even "" (an empty string, which consists of no b's). Along the same lines, "?" will match zero or one of the preceding and "+" will match one or more of the preceding. Putting what we've learned in this paragraph together with what we learned in the previous one, we see that the regexp /.*/ will match anything at all, or nothing. Now that's flexibility! By the same token, /.?/ will match any single character, or no character.
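The difference between "zero or more" and "one or more" shows up as soon as you test a string with no b's in it. A rough sketch follows; the pattern /colou?r/ is an illustration of "?" added here, not an example from the text above.

```perl
use strict;
use warnings;

my %star;   # text => does /b*/ (zero or more b's) match?
my %plus;   # text => does /b+/ (one or more b's) match?

for my $text ("bbb", "b", "", "cat") {
    $star{$text} = $text =~ /b*/ ? "yes" : "no";
    $plus{$text} = $text =~ /b+/ ? "yes" : "no";
    print "'$text'  star: $star{$text}  plus: $plus{$text}\n";
}

# "?" makes the preceding character optional (zero or one of it).
my %optional;   # word => does /colou?r/ match?
for my $word ("color", "colour", "colouur") {
    $optional{$word} = $word =~ /colou?r/ ? "yes" : "no";
    print "'$word'  optional u: $optional{$word}\n";
}
```

Because /b*/ is happy with zero b's, it matches every string here, even "cat" and the empty string.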

Instead of *, +, or ?, we can use a specific number in curly brackets. /b.{2}k/ will match "book," "beak," or "br k." If the number in brackets is followed by a comma ( , ), it means "or more." So, /bo{2,}k/ will match "book," "booook," and so on.
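A quick sketch of the curly-bracket counts, using the words from the paragraph above plus a couple of near misses ("bk" and "bok") invented for contrast:

```perl
use strict;
use warnings;

my %two_any;      # word => does /b.{2}k/ match? (b, exactly two characters, k)
my %two_or_more;  # word => does /bo{2,}k/ match? (b, two or more o's, k)

for my $word ("book", "beak", "br k", "bk") {
    $two_any{$word} = $word =~ /b.{2}k/ ? "yes" : "no";
    print "'$word'  exactly two in the middle: $two_any{$word}\n";
}

for my $word ("book", "booook", "bok") {
    $two_or_more{$word} = $word =~ /bo{2,}k/ ? "yes" : "no";
    print "'$word'  two or more o's: $two_or_more{$word}\n";
}
```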

All right, now we're getting someplace. Here are a few more metacharacters: The backslash ( \ ) is the "escape character," indicating that the next character is special. For example, /\w/ will match any "word" character (a letter, a digit, or an underscore); /\W/ will match any character that is NOT a word character; /\d/ will match any digit (for plain ASCII text, /\d/ is synonymous with /[0-9]/); /\D/ will match any character that's not a digit ( /[^0-9]/ ). /\s/ will match a whitespace character: a space, tab, or line break. And, guess what: /\S/ will match any character that's not whitespace.
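One way to get a feel for the backslash escapes is to run a few single characters past all six of them. The sample characters and the %matched hash below are assumptions made for this sketch.

```perl
use strict;
use warnings;

my %matched;   # character => the escapes that match it, as one string

for my $char ("a", "7", "_", " ", "!") {
    my @hits;
    push @hits, '\w' if $char =~ /\w/;
    push @hits, '\W' if $char =~ /\W/;
    push @hits, '\d' if $char =~ /\d/;
    push @hits, '\D' if $char =~ /\D/;
    push @hits, '\s' if $char =~ /\s/;
    push @hits, '\S' if $char =~ /\S/;
    $matched{$char} = "@hits";
    print "'$char' is matched by: $matched{$char}\n";
}
```

Each character lands in exactly one of every lowercase/uppercase pair. Note that the underscore counts as a \w character.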

Just a couple more: When "^" is not within brackets, it means "only at the beginning of the search string"; similarly, "$" means "only at the end of the search string." So, the regexp /^all/ would match "all" or "all of the time" or "alleyway," but not "ball." /^.{5}$/ would match only lines of five characters.
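The anchors can be tried out the same way. This sketch uses the strings from the paragraph above for "^" and a few made-up lines ("hello," "hi," "goodbye") for the five-character test.

```perl
use strict;
use warnings;

my %starts_with_all;   # text => does /^all/ match?
for my $text ("all", "all of the time", "alleyway", "ball") {
    $starts_with_all{$text} = $text =~ /^all/ ? "yes" : "no";
    print "'$text' starts with all: $starts_with_all{$text}\n";
}

my %five_chars;        # line => does /^.{5}$/ match?
for my $line ("hello", "hi", "goodbye") {
    $five_chars{$line} = $line =~ /^.{5}$/ ? "yes" : "no";
    print "'$line' is exactly five characters: $five_chars{$line}\n";
}
```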

Now that you have a decent grounding in how to put together regexp strings, let's learn how you can apply that knowledge. Perl uses regexps in a few simple operations on variables that accomplish a variety of tasks. The simplest is to search the contents of a variable to see if a certain pattern can be found. This operation returns true if the pattern is found and false if it is not. We can use the "m" (match) operator for a simple search. If the variable $x contains the text we want to search, and we want to find anything that might be a Social Security number (that is, any string of digits having the pattern xxx-xx-xxxx), the syntax would look like this:

($x =~ m/\d{3}-\d{2}-\d{4}/);

It's statements like this that lead to the claim that a Perl program looks like an assortment of random characters. But in fact, it does make sense: The regexp is delimited by the / characters, and it means "search for a number with three digits then a dash, then two digits then a dash, then four digits."
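Since the match returns true or false, it usually ends up inside an if statement. Here is a minimal sketch; the contents of $x and $y are made-up sample data, and the number is an obvious placeholder.

```perl
use strict;
use warnings;

# Sample text invented for this example.
my $x = "Name: Pat Example, ID: 123-45-6789, Phone: 555-0100";
my $y = "Name: Pat Example, Phone: 555-0100";

# The match is true if the pattern appears anywhere in the string.
my $found_in_x = $x =~ m/\d{3}-\d{2}-\d{4}/ ? 1 : 0;
my $found_in_y = $y =~ m/\d{3}-\d{2}-\d{4}/ ? 1 : 0;

if ($found_in_x) {
    print "First string: something shaped like xxx-xx-xxxx was found.\n";
}
if (!$found_in_y) {
    print "Second string: nothing of that shape.\n";
}
```

The phone number has the wrong shape (three digits, a dash, four digits), so the second string doesn't match.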

Once you've mastered "m," we can move on to "s," the substitution command. Let's go through our text and replace every instance of "TriPod" with "Tripod."

($x =~ s/TriPod/Tripod/g);

The g at the end means "do it globally" — replace every single occurrence. Without the g, only the first "TriPod" the command found would be replaced.
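To see the difference the g makes, this sketch runs the same substitution with and without it on two copies of a made-up sentence. The variable names are chosen for the example.

```perl
use strict;
use warnings;

my $x = "TriPod hosts pages. TriPod is free.";

# Without g: only the first occurrence is replaced.
my $first_only = $x;
$first_only =~ s/TriPod/Tripod/;
print "$first_only\n";    # prints "Tripod hosts pages. TriPod is free."

# With g: every occurrence is replaced. In scalar context the
# substitution also returns how many replacements it made.
my $everywhere = $x;
my $count = ($everywhere =~ s/TriPod/Tripod/g);
print "$everywhere ($count replacements)\n";
```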

That's a healthy dose of regular expressions. Just a few simple metacharacters, but they can be used very effectively. Here's one more little regexp for you to puzzle out: /[^0-9a-fA-F]$/

HINTS, POINTERS, AND TIPS O' THE TRADE:

A regular expression will by default try to match the largest possible string it can. Be careful, because this can trip you up if you're not expecting it. If you're trying to match a phrase within parentheses, for example, a regexp like /\(.*\)/ will match the entirety of the text between the first open parenthesis and the last close parenthesis. (The backslashes are needed because a bare parenthesis is itself a metacharacter.) If you want to just get what's within a particular pair of parentheses, you'll need something like /\([^)]*\)/ to match everything after the first open parenthesis until THE NEXT close parenthesis.
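Here is a sketch of that greedy behavior in action. Note two assumptions in the code: literal parentheses are written with a backslash, because a bare parenthesis is a grouping metacharacter in Perl, and the unescaped outer pair is used here only to capture the matched text so it can be printed. The sample sentence is invented.

```perl
use strict;
use warnings;

my $x = "call (first) and then (second) now";

# Greedy: runs from the first "(" all the way to the LAST ")".
my ($greedy) = $x =~ /(\(.*\))/;
print "greedy: $greedy\n";    # prints "greedy: (first) and then (second)"

# Restricted: [^)]* cannot cross a ")", so it stops at the NEXT one.
my ($narrow) = $x =~ /(\([^)]*\))/;
print "narrow: $narrow\n";    # prints "narrow: (first)"
```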

Attaching the modifier "i" to the end of a regexp command means "don't pay attention to letter case in what you're searching." This regexp:

($x =~ s/tripod/Tripod/gi);

will make every "tripod," "TRIpod," and "TripOD" into "Tripod."
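A short sketch of the case-insensitive substitution, using a made-up string that contains the three spellings mentioned above:

```perl
use strict;
use warnings;

my $x = "tripod, TRIpod, and TripOD";

# g replaces every occurrence; i ignores letter case while searching.
my $count = ($x =~ s/tripod/Tripod/gi);

print "$x\n";                       # prints "Tripod, Tripod, and Tripod"
print "$count replacements made\n";
```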

Removing gum from clothing is a three-step process. First make the gum brittle by holding plastic-wrapped ice cubes against it. Then chip off as much gum as you can with a butter knife. Third, spray on cheap hairspray to lift off the residue.


 
