Softpanorama
May the source be with you, but remember the KISS principle ;-)
One needs to be very careful with regular expressions and avoid overcomplexity like the plague. In complex regular expressions surprises abound for the uninitiated -- Perl provides endless ways to shoot yourself in the foot!
This material is adapted from the O'Reilly Network article "Five Habits for Successful Regular Expressions".
Consider the following regular expression to match a U.S. phone number:
\(?\d{3}\)? ?\d{3}[-.]\d{4}
This regex matches phone numbers like "(314)555-4000". Ask yourself if the regex would match "314-555-4000" or "555-4000". The answer is no in both cases. Writing this pattern on one line conceals both flaws and design decisions. The area code is required and the regex fails to account for a separator between the area code and prefix.
Spreading the pattern out over several lines makes the flaws more visible and the necessary modifications easier.
In Perl this would look like:
/
    \(?      # optional parentheses
    \d{3}    # area code required
    \)?      # optional parentheses
    [-\s.]?  # separator is either a dash, a space, or a period.
    \d{3}    # 3-digit prefix
    [-.]     # another separator
    \d{4}    # 4-digit line number
/x
The rewritten regex now has an optional separator after the area code so that it matches "314-555-4000." The area code is still required. However, a new programmer who wants to make the area code optional can quickly see that it is not optional now, and that a small change will fix that.
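As a minimal sketch of that small change, the area code and the separator that follows it can be wrapped in an optional non-capturing group. The variable name $phone_re, the sample strings, and the ^ and $ anchors are assumptions added for illustration; they are not part of the original article.

```perl
use strict;
use warnings;

# Hypothetical variant of the regex above: the area code, its optional
# parentheses, and the following separator are grouped and made optional.
my $phone_re = qr/
    ^            # start of string (assumption: test whole strings)
    (?:          # start of the optional area-code group
        \(?      # optional opening parenthesis
        \d{3}    # area code
        \)?      # optional closing parenthesis
        [-\s.]?  # optional separator: dash, whitespace, or period
    )?           # the whole group may be absent
    \d{3}        # 3-digit prefix
    [-.]         # separator: dash or period
    \d{4}        # 4-digit line number
    $            # end of string
/x;

for my $number ("(314)555-4000", "314-555-4000", "555-4000", "55-4000") {
    my $verdict = $number =~ $phone_re ? "matched" : "no match";
    print "$number: $verdict\n";
}
```

Because the pattern is laid out one element per line, the change is confined to the two lines that open and close the group.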
There are three levels of testing, each adding a higher level of reliability to your code. First, you need to think hard about what you want to match and whether you can deal with false matches. Second, you need to test the regex on example data. Third, you need to formalize the tests into a test suite.
Deciding what to match is a trade-off between making false matches and missing valid matches. If your regex is too strict, it will miss valid matches. If it is too loose, it will generate false matches. Once the regex is released into live code, you probably will not notice either way. Consider the phone regex example above; it would match the text "800-555-4000 = -5355". False matches are hard to catch, so it's important to plan ahead and test.
Sticking with the phone number example, if you are validating a phone number on a web form, you may settle for ten digits in any format. However, if you are trying to extract phone numbers from a large amount of text, you might want to be more exact to avoid an unacceptable number of false matches.
When thinking about what you want to match, write down example cases. Then write some code that tests your regular expression against the example cases. Any complicated regular expression is best written in a small test program, as the example below demonstrates:
In Perl:
#!/usr/bin/perl
use strict;
use warnings;

my @tests = (
    "314-555-4000",
    "800-555-4400",
    "(314)555-4000",
    "314.555.4000",
    "555-4000",
    "aasdklfjklas",
    "1234-123-12345",
);

foreach my $test (@tests) {
    if ( $test =~ m/
            \(?      # optional parentheses
            \d{3}    # area code required
            \)?      # optional parentheses
            [-\s.]?  # separator is either a dash, a space, or a period.
            \d{3}    # 3-digit prefix
            [-\s.]   # another separator
            \d{4}    # 4-digit line number
        /x ) {
        print "Matched on $test\n";
    }
    else {
        print "Failed match on $test\n";
    }
}
Running the test script exposes yet another problem in the phone number regex: it matched "1234-123-12345". Include tests that you expect to fail as well as those you expect to match.
Ideally, you would incorporate these tests into the test suite for your entire program. Even if you do not have a test suite already, your regular expression tests are a good foundation for a suite, and now is the perfect opportunity to start on one. Even if now is not the right time (really, it is!), you should make a habit to run your regex tests after every modification. A little extra time here could save you many headaches.
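One possible way to formalize the test program above is the core Test::More module. This is only a sketch: the variable name $phone_re and the test descriptions are illustrative assumptions, and the known false match is marked as TODO so that it is recorded without failing the run.

```perl
use strict;
use warnings;
use Test::More;

# The phone regex from the test program above, stored once so that the
# program and its tests can share a single definition.
my $phone_re = qr/
    \(?      # optional parentheses
    \d{3}    # area code required
    \)?      # optional parentheses
    [-\s.]?  # separator is either a dash, a space, or a period
    \d{3}    # 3-digit prefix
    [-\s.]   # another separator
    \d{4}    # 4-digit line number
/x;

# Inputs that should match.
like($_, $phone_re, "matches $_")
    for "314-555-4000", "800-555-4400", "(314)555-4000", "314.555.4000";

# Inputs that should be rejected.
unlike($_, $phone_re, "rejects $_")
    for "555-4000", "aasdklfjklas";

# Known problem: the unanchored regex finds a phone number inside this
# string. A TODO block reports the failure without breaking the suite.
TODO: {
    local $TODO = "regex is not anchored yet";
    unlike("1234-123-12345", $phone_re, "rejects 1234-123-12345");
}

done_testing();
```

When the regex is later tightened, the TODO test starts passing and Test::More reports it as an unexpected success, which is a reminder to remove the TODO marker.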
The alternation operator (|) has a low
precedence. This means that it often alternates over more than the programmer
intended. For example, a regex to extract email addresses out of a mail file
might look like:
^CC:|To:(.*)
The above attempt is incorrect, but the bugs often go unnoticed. The intent of the above regex is to find lines starting with "CC:" or "To:" and then capture any email addresses on the rest of the line.
Unfortunately, the regex doesn't actually capture anything from lines starting with "CC:" and may capture random text if "To:" appears in the middle of a line. In plain English, the regular expression matches lines beginning with "CC:" and captures nothing, or matches any line containing the text "To:" and then captures the rest of the line. Usually, it will capture plenty of addresses and nobody will notice the failings.
If that were the real intent, you should add parentheses to say it explicitly, like this:
(^CC:)|(To:(.*))
However, the real intent of the regex is to match lines starting with "CC:" or "To:" and then capture the rest of the line. The following regex does that:
^(CC:|To:)(.*)
This is a common and hard-to-catch bug. If you develop the
habit of wrapping your alternations in parentheses (or non-capturing
parentheses -- (?:)) you can avoid this error.
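The difference is easy to see on a few sample lines. In this sketch the helper names capture_buggy and capture_fixed and the sample header lines are hypothetical; the fixed pattern uses a non-capturing group so that the rest of the line stays in the first capture.

```perl
use strict;
use warnings;

# Buggy pattern: parsed as (^CC:) OR (To:(.*)).
sub capture_buggy {
    my ($line) = @_;
    my ($rest) = $line =~ /^CC:|To:(.*)/;
    return $rest;
}

# Fixed pattern: the alternation is confined to a non-capturing group.
sub capture_fixed {
    my ($line) = @_;
    my ($rest) = $line =~ /^(?:CC:|To:)(.*)/;
    return $rest;
}

my @lines = (
    'CC: alice@example.com',
    'To: bob@example.com',
    'Subject: Reply To: everyone',
);

for my $line (@lines) {
    printf "%-30s buggy=[%s] fixed=[%s]\n",
        $line,
        capture_buggy($line) // "undef",
        capture_fixed($line) // "undef";
}
```

The buggy pattern captures nothing from the "CC:" line and picks up " everyone" from the subject line, while the fixed pattern captures the address from both header lines and ignores the subject line.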
Most people avoid using the lazy quantifiers *?,
+?, and ??, even though they are easy to understand
and make many regular expressions easier to write.
Lazy quantifiers match as little text as possible while
still aiding the success of the overall match. If you write foo(.*?)bar,
the quantifier will stop matching the first time it sees "bar", not the last
time. This may be important if you are trying to capture "###" in the text "foo###bar+++bar".
A regular quantifier would have captured "###bar+++".
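The example from the text can be checked directly. The variable names below are illustrative only.

```perl
use strict;
use warnings;

my $text = "foo###bar+++bar";

my ($lazy)   = $text =~ /foo(.*?)bar/;   # stops at the first "bar"
my ($greedy) = $text =~ /foo(.*)bar/;    # runs on to the last "bar"

print "lazy:   $lazy\n";     # prints "lazy:   ###"
print "greedy: $greedy\n";   # prints "greedy: ###bar+++"
```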
Let's say you want to capture all of the phone numbers from an HTML file. You could use the phone number regular expression example we discussed earlier in this article. However, if you know that the file contains all of the phone numbers in the first column of a table, you can write a much simpler regex using lazy quantifiers:
<tr><td>(.+?)</td>
Many beginning regular expression programmers avoid lazy quantifiers and use negated character classes instead. They write the above code as:
<tr><td>([^>]+)</td>
That works in this case, but leads to trouble if the text
you are trying to capture contains common characters from your delimiter (in
this case, </td>). If you use lazy quantifiers, you will spend
less time kludging character classes and produce clearer regular expressions.
Lazy quantifiers are most valuable when you know the structure surrounding the text you want to capture.
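A small sketch of where the two approaches diverge, assuming a hypothetical table row whose first cell contains inline markup. Both patterns here are written with a closing </td> delimiter, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical table row: the phone number cell contains a <b> element.
my $row = '<tr><td><b>314</b>-555-4000</td><td>Office</td></tr>';

# Lazy quantifier: take as little as possible up to the first </td>.
my ($lazy) = $row =~ m{<tr><td>(.+?)</td>};

# Negated character class: cannot cross the ">" of the <b> tag.
my ($negated) = $row =~ m{<tr><td>([^>]+)</td>};

print "lazy:    ", $lazy    // "no match", "\n";
print "negated: ", $negated // "no match", "\n";
```

The lazy version still returns the whole cell, markup included, while the negated class fails as soon as the cell contains a character that also appears in the delimiter.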
Perl allows you to use any character that is neither alphanumeric nor
whitespace as a delimiter. If you switch to a new delimiter, you can
avoid having to escape the forward slashes when you are trying to match URLs
or HTML tags such as "http://" or "<br />".
For example:
/http:\/\/(\S)*/
could be rewritten as:
m#http://(\S)*#
Common delimiters are #, !, and |. If you use square brackets, angle
brackets, or curly braces, the opening and closing brackets must match.
With any delimiter other than the forward slash, the match operator needs
its explicit m prefix. Here are some common uses of delimiters:
m#...#        m!...!         m{...}
s|...|...|    s[...][...]    s<...>/.../
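The following sketch shows the same URL pattern with three delimiter styles, plus a substitution with bracketing delimiters. The sample URL and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $url = "http://www.example.com/index.html";

# Default delimiters: every slash in the pattern must be escaped.
my ($host_a) = $url =~ /http:\/\/([^\/\s]+)/;

# Alternative delimiters need the explicit m prefix; no escaping required.
my ($host_b) = $url =~ m#http://([^/\s]+)#;
my ($host_c) = $url =~ m{http://([^/\s]+)};

# Substitution with bracketing delimiters: each part gets its own pair.
(my $secure = $url) =~ s{^http://}{https://};

print "$host_a $host_b $host_c\n";
print "$secure\n";
```

All three matches capture the same host name; only the readability of the pattern changes.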
For simple string matching, you should usually use the index function instead of a regular expression.
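A minimal sketch of the index approach next to its regex equivalent. The sample line and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $line   = "Error: disk /dev/sda1 is full";
my $needle = "/dev/sda1";

# index returns the zero-based position of the substring, or -1 if absent.
my $pos = index($line, $needle);
print "found at offset $pos\n" if $pos >= 0;

# The regex equivalent needs \Q...\E so that characters such as "." or "/"
# in the needle are treated literally rather than as regex syntax.
print "regex agrees\n" if $line =~ /\Q$needle\E/;
```

With index there is no pattern syntax to quote, so the literal search cannot be broken by special characters in the search string.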
Sometimes it makes sense to use regular expressions instead of substr. One such task is extracting the components of a date, for example:
$cur_date = '20060325';
($year, $month, $day) = $cur_date =~ /(\d{4})(\d\d)(\d\d)/;
I was just reminded of a small thing that is easy to forget: regular expressions that have markers of line start (^) and/or line end ($) can be much faster than those that don't have these markers. With a line start marker the regexp engine typically needs to attempt the match/substitution at only one position, whereas without such markers it may have to retry the match/substitution at every character of the string. This matters most when the match fails.
In practice, it's surprising how much difference this can make, especially when using complex regular expressions over large data sets.
P.S.: I understand that it is not always possible to use these markers, but I think that they can be used much more often than they are.
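One way to check this claim on your own data is a rough timing comparison. The sketch below is an assumption-laden illustration, not a rigorous benchmark: the helper seconds_for, the test string, and the run count are all hypothetical, and the actual numbers depend on the Perl version and on what its regex optimizer does with a given pattern.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical worst case: a long string with no run of eight digits,
# so neither pattern can ever match.
my $no_date = "20060-32-5" x 500;

# Time $runs executions of a code reference, in wall-clock seconds.
sub seconds_for {
    my ($code, $runs) = @_;
    my $start = time;
    $code->() for 1 .. $runs;
    return time - $start;
}

my $runs       = 1_000;
my $anchored   = seconds_for(sub { $no_date =~ /^(\d{4})(\d\d)(\d\d)$/ }, $runs);
my $unanchored = seconds_for(sub { $no_date =~ /(\d{4})(\d\d)(\d\d)/ },   $runs);

printf "anchored:   %.4f s\n", $anchored;
printf "unanchored: %.4f s\n", $unanchored;
```

The anchored pattern can give up after one attempt at the start of the string, while the unanchored one is free to retry at later positions, so any difference should show up in the second figure.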
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: September 05, 2009