Below are several examples of real-life pattern matching, some very simple, some more complicated. They show the power and flexibility of regular expressions.
use FileHandle;
undef $/;                             # slurp mode: <$fh> returns the whole file at once
my $fh = new FileHandle("$ARGV[0]");
my @words = (<$fh> =~ m"\b(\w+)\b"g); # list context plus 'g' returns every word
This is a simple example. We open the file named by the first argument to the script (for instance, by typing 'perl5 script.p filename'). Because $/ is undefined, <$fh> returns the entire file as one string. Then, in line 4, the pattern 'm"\b(\w+)\b"g' iterates over that string, pulling out every word and storing the list in the array @words.
Now let's expand on this a little. Suppose we want to get all the words in a file that start with the letter 't':
my @words = grep(m"^t", (<$fh> =~ m"\b(\w+)\b"g));
Or equivalently:
foreach $word (<$fh> =~ m"\b(\w+)\b"g){
push (@words, $word) if ($word =~ m"^t");
}
In each case, the regular expression is evaluated in list context with the 'g' modifier, and therefore it passes back a list of words to the context in which it was called. In the first case, this was the function grep; in the second case, a foreach loop.
After this list is passed back, the m"^t" clause keeps only the words that start with 't': grep returns them directly, while the foreach loop pushes them onto @words. This could be used, for example, to pick matching words out of any list.
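The filtering step can be tried without a file. The following is a minimal sketch, assuming a made-up sample string ($text) in place of the file contents; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for the contents of a file.
my $text = "the cat sat on the mat, then it went to town\ntoday";

# List context plus the 'g' modifier returns every captured word.
my @all_words = ($text =~ m"\b(\w+)\b"g);

# grep keeps only the words that begin with a lowercase 't'.
my @t_words = grep(m"^t", @all_words);

print join(' ', @t_words), "\n";   # prints: the the then to town today
```

Note that m"^t" is case-sensitive; adding the 'i' modifier would also keep words such as 'Today'.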
Suppose now that you have a file of the form:
Picked up nuts and bolts 10:00AM
Sawed wood 11:00AM
Sanded 12:30PM
in which the description is followed by the time at which the deed occurred and you want to turn this into a hash that tells you what you did at what time. There are three steps here.
1) read in the file.
2) extract the times and events.
3) create a hash.
With the use of regular expressions, step #2 becomes fairly straightforward. Our regular expression will look something like this:
m"^(.*?)(<regexp for time>)\s*$"mg
Here, matching the comment is easy. We are given that every line is a comment/time pair, consisting of a comment first and a time second. Hence, matching the comment becomes a simple matter of putting a '.*?' right after the beginning-of-line anchor.
Furthermore, we use the 'm' modifier, so that '^' matches at the beginning of the string or just after any newline, and '$' matches just before any newline or at the end of the string. This lets one pattern be applied to every line of a multi-line string.
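The effect of the 'm' modifier is easiest to see side by side. This is a small sketch on a made-up three-line string; the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical three-line string.
my $text = "first one\nsecond two\nthird three";

# Without 'm', '^' matches only at the very start of the string.
my @without_m = ($text =~ /^(\w+)/g);

# With 'm', '^' also matches just after each newline.
my @with_m = ($text =~ /^(\w+)/mg);

# With 'm', '$' matches just before each newline and at the end of the string.
my @line_ends = ($text =~ /(\w+)$/mg);

print "without m: @without_m\n";   # prints: without m: first
print "with m:    @with_m\n";      # prints: with m:    first second third
print "line ends: @line_ends\n";   # prints: line ends: one two three
```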
Now we need only to find what the regular expression for 'time' is. We could roughly think of it as this:
m"(\d{1,2}:\d{2}\s*[AP]M)";
The first \d{1,2}: matches the hour and the colon ('10:'), the second \d{2} matches the two-digit minutes, and [AP]M matches AM or PM. Since \d accepts any digit, this expression also happens to match 99:99PM, but the chances of such a string occurring are fairly slight, so we deem the risk acceptable. (In more 'bulletproof' cases, we would have to consider this. See 'Mastering Regular Expressions' for more detail on how to match only 0-11, or 0-23, or 0-59, i.e. 'time numbers'.)
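For the 'bulletproof' case, one possible tightening is sketched below. It is not taken from 'Mastering Regular Expressions'; it simply assumes a 12-hour clock with hours 1-12 and minutes 00-59, and the name $time_re and the test strings are hypothetical.

```perl
use strict;
use warnings;

# Hours 1-12 (optionally zero-padded), minutes 00-59, then AM or PM.
# The \b anchors stop the pattern from matching inside '13:45PM'.
my $time_re = qr/\b(?:1[0-2]|0?[1-9]):[0-5]\d\s*[AP]M\b/;

# Hypothetical candidates: the first three are valid times, the rest are not.
my @candidates = ('10:00AM', '12:30PM', '9:05 AM', '99:99PM', '13:45PM', '11:60AM');

my @valid = grep(/$time_re/, @candidates);

print "$_\n" for @valid;   # prints 10:00AM, 12:30PM and 9:05 AM, one per line
```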
Anyway, we can now iterate through our file with this expression and make the hash. The code is below:
1 use FileHandle;
2 use strict;
3 undef $/;                                # slurp mode: read the whole file at once
4 my $fd = new FileHandle("$ARGV[0]");
5 my $line = <$fd>; my %commenthash;
6 while ($line =~ m"^(.*?)(\d{1,2}:\d{2}\s*[AP]M)\s*$"mg)
7 {
8 my $comment = $1; my $time = $2;         # description first, time second
9 $commenthash{$time} = $comment;          # key the hash on the time
10 }
The loop in lines 6 through 10 does most of the work, building the hash by taking the results of the regular expression (line 8) and storing them as time/comment pairs (line 9). You can then do whatever you like with the data.
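The same loop can be exercised without a file. The sketch below assumes the three sample lines are held in a string ($log, a hypothetical name) and adds a '\s*' before the time group so that the stored comment has no trailing space; that small change is an assumption of this sketch.

```perl
use strict;
use warnings;

# The sample lines from the text, held in a string instead of a file.
my $log = "Picked up nuts and bolts 10:00AM\nSawed wood 11:00AM\nSanded 12:30PM\n";

my %commenthash;
# $1 is the description, $2 is the time; 'm' lets ^ and $ work line by line.
while ($log =~ /^(.*?)\s*(\d{1,2}:\d{2}\s*[AP]M)\s*$/mg) {
    $commenthash{$2} = $1;
}

# Print the events ordered by their time string.
for my $time (sort keys %commenthash) {
    print "$time => $commenthash{$time}\n";
}
```

Keying the hash on the time means two events logged at the same time would overwrite each other; a hash of arrays would avoid that.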
You are probably aware of HTML tags. They are things like
<H3 FOLDED_ADD_DATE="....">culture</H3>
or
<TITLE>my bookmarks</TITLE>.
In general, we cannot match all of these with one regular expression, since they can be nested, like:
<DL><p>
<A ...> </A>
</DL><p>
We could, however, write a recursive subroutine to do so.
With a simple pattern, the best we can do is pick a list of tags to match, and then use that list to match, assuming that the tags are not nested. This is usually a safe assumption.
Now, the general form of a tag is something like this:
<I> .... </I>
or
<A HREF = ...> </A>
In other words, the opening tag consists of a '<', then a string, followed by either nothing or a space and other text describing the tag, plus a '>'. The tag is closed by a '</STRING>', where STRING is the same string as before.
To match:
<B> text </B>
we could say:
m"<B>(.*?)</B>"
Assuming, of course, that this tag isn't nested. How do we generalize this so it matches any tag? Well, let's say we wanted to match bold or italic ('B' or 'I'). Furthermore, we don't know whether these tags carry extra text after the name (assume we don't know if <B description> is possible). Then we could use the following pattern to match the opening tag:
m"<(B|I)(\s.*?)?>"
This says to match first either B or I, and then to match (zero or one times) the combination of a whitespace character plus the minimum number of characters needed to reach the closing '>'. Why this complexity? Consider: if we say m"<(B|I).*?>", then:
<BODY>
will match. Hence, it is essential that the pattern encode the facts that:
1) there may be no more text after the tag name.
2) if there is any more text, it will be a space followed by any number of characters (up to but not including the next >).
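The difference between the loose and the careful pattern can be checked directly. This is a short sketch over a hypothetical list of opening tags; the anchors '^' and '$' are added here only so that each list element is tested as a whole.

```perl
use strict;
use warnings;

# Hypothetical opening tags to test against both patterns.
my @tags = ('<B>', '<I>', '<B class="x">', '<BODY>', '<IMG src="a.gif">');

# Loose pattern: anything may follow the B or I, so BODY and IMG slip through.
my @loose = grep(m/^<(B|I).*?>$/, @tags);

# Careful pattern: after B or I comes either '>' or whitespace plus attributes.
my @careful = grep(m/^<(B|I)(\s.*?)?>$/, @tags);

print scalar(@loose), " loose matches\n";       # prints: 5 loose matches
print scalar(@careful), " careful matches\n";   # prints: 3 careful matches
print join(' ', @careful), "\n";                # prints: <B> <I> <B class="x">
```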
In pictures, our expression m"<(B|I)(\s.*?)?>" works like Figure 9.9.
If we now take (B|I) and substitute it with $patterns, where $patterns = 'B|I', we get:
m"<($patterns)(\s.*?)?>"
This matches the opening tag <B>. To match the whole enchilada:
m"<($patterns)(\s.*?){0,1}>.*?</\1>"sg;
Here, \1 is whatever tag name the first group, ($patterns), matched. Hence, this matches the whole element from opening tag to closing tag (in this case, the 'B' tag), for example:
<B>hold</B>
For our code example, let's keep things simple and consider only two possible tags to match: <B> and <I>. Furthermore, let's swap them: bold text becomes italic, and vice versa.
However, let's write our code in such a way that it is generalizable. The code follows:
1 use FileHandle; undef $/;
2 my $fd = new FileHandle("$ARGV[0]");
3 my $line = <$fd>;
4 my (%substituteHash) = ('B' => 'I', 'I' => 'B');
5 my $patterns = join('|', keys(%substituteHash)); # makes B|I (key order may vary)
6 $line =~ s{(<) ($patterns)((\s.*?)?>) # opening tag (<B> or <I>)
7 (.*?) # text in between
8 (</) \2 (>) # closing tag (</B> or </I>)
9 }{$1$substituteHash{$2}$3$5$6$substituteHash{$2}$7}sgx; # substitute
It's lines 5-9 which are the killers. Let's look at them closely. The first thing we notice is that we have parenthesized everything: (<), ($patterns), etc. This way, we ensure that everything we match is saved and can be accounted for when we substitute back in.
Secondly, notice the 'sgx' modifiers on the end. We need to use the 's' modifier in the case that we have something like:
<B>
text here
</B>
We need the 'x', or else the expression would become unreadable, and finally we need the 'g' so that the substitution is applied to every match, not just the first.
Finally, notice the '$1...$7' variables in the replacement. This is us plunking back in the data which we have matched. In particular, $substituteHash{$2} takes whatever tag we found and converts it from, say, 'B' to 'I'.
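The swap can also be written with fewer capture groups. The following is a compact sketch of the same idea, not the listing above: it assumes a made-up input string ($html), captures only the tag name, the optional attribute text and the enclosed text, and writes the angle brackets literally in the replacement.

```perl
use strict;
use warnings;

my %substituteHash = ('B' => 'I', 'I' => 'B');
my $patterns = join('|', sort keys %substituteHash);   # B|I

# Hypothetical input: two swappable elements (one spanning lines) and a <BODY> tag.
my $html = "<B>bold</B> and <I>italic\ntext</I> and <BODY>";

# $1 = tag name, $2 = optional attribute text, $3 = enclosed text.
# 's' lets '.' cross newlines; 'g' applies the swap to every element.
(my $swapped = $html) =~ s{<($patterns)((?:\s.*?)?)>(.*?)</\1>}
                          {<$substituteHash{$1}$2>$3</$substituteHash{$1}>}sg;

print "$swapped\n";
```

Because the substitution runs in a single pass, a tag that has just been turned from B into I is not matched again and swapped back. The <BODY> tag is left alone for the reason discussed earlier.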
This example shows a lot about the crafting of regular expressions. The next, and final example goes even further.
The final example matches URLs. You know the ones I'm talking about:
http://members.aol.com/tlyco/KITH/index.html
ftp://ftp.x.org
Suppose that you've got a file of text containing them (for simplicity's sake, let's suppose that they aren't split between lines) and you want to extract them. Well, the first thing we need to do is come up with a regular expression to match these. Let's look at what we need to match:
1) 'ftp:' or a 'http:' followed by '//'.
2) an internet address (members.aol.com, 128.101.22.1)
3) several, optional paths. '/tlyco/KITH/index.html'.
This translates into something that looks like this:
m"((?:ftp|http)://(?:\w+\.){2,5}\w*(/\S*)?)"g
This did take a little trial and error. We use (?:) so that those groups don't create backreferences, and the (?:ftp|http) is self-explanatory, but the (?:\w+\.){2,5}\w* needs a little explaining, as does the (/\S*)?.
(?:\w+\.){2,5}\w*: Each one of the \w+\. is meant to match 'members.' or 'aol.'. However, this isn't enough to match the whole internet address. Why? Because the internet address has a trailing group of letters. (?:\w+\.){2,5} does not match 'members.aol.com'; it matches 'members.aol.'. We need to add a '\w*' to match the last bit, the 'com' part.
(/\S*)?: In English, this says "match a slash followed by as many non-space characters as you can find, and do it zero or one times." A trailing path is allowed on an http or ftp address, but it is not required; hence the question mark.
Note that this pattern is not perfect. If you have backslashed spaces in an http URL in particular, this won't work. Nor will it work if you have an http daemon running on a different port (http://site:8080, for example). However, it is, as we say, 'close enough': if you want to improve it to handle such cases, go right ahead.
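One way to handle the port case is sketched below. This is only one possible improvement, under stated assumptions: the host must contain at least one dot, the port is an optional ':digits', and the path stops at whitespace, quotes, or angle brackets. The name $url_re, the sample text, and www.example.com are hypothetical.

```perl
use strict;
use warnings;

# Scheme, dotted host, optional :port, optional path that stops at
# whitespace, a double quote, or an angle bracket.
my $url_re = qr{((?:ftp|http)://(?:\w+\.)+\w+(?::\d+)?(?:/[^\s"<>]*)?)};

# Hypothetical sample text mixing a quoted link and two bare URLs.
my $text = 'See <A HREF="http://members.aol.com/tlyco/KITH/index.html">KITH</A>, '
         . 'ftp://ftp.x.org and http://www.example.com:8080/docs today.';

# List context plus 'g' returns every captured URL.
my @urls = ($text =~ /$url_re/g);

print "$_\n" for @urls;
```

Since the path excludes the double quote, nothing has to be trimmed from the end of a match. Host names containing hyphens are still not handled, because \w does not match '-'.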
Let's now make a loop which extracts all http and ftp URLs from a given file, for example a bookmarks file:
use FileHandle;
undef $/;
my $fd = new FileHandle("$ARGV[0]");
my $line = <$fd>;
my @tags;
while ($line =~ m"((?:ftp|http)://(?:\w+\.){2,5}\w*(/\S*){0,1})"g){
my $tag = $1;
chop($tag); # drop the last character (the closing double quote)
push(@tags, $tag);
}
Here, we chop off the last character, since in a bookmarks file these URLs sit inside double-quoted strings, so the \S* picks up the closing quote. We shall approach this problem more directly next, when we consider matching a double-quoted string.
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: September 05, 2009