Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Softpanorama Search

Introduction to Perl for Unix System Administrators

(Perl without excessive complexity)

by Dr Nikolai Bezroukov


Prev | Up | Contents | Down | Next

5.4. Non-greedy Matching

Again, please remember that the * and + quantifiers are greedy. That means that in idioms like .* and .+ they match as many characters as possible between two anchors that you provide. This may not always be the behavior that you need. You can create non-greedy components by following the quantifier with a ?.

$_="To be or not to be";

For example the pattern {T.*e} will gobbles up all the characters of the string between the first 'To' and  last 'be', and the match is successful.

Let's assume that we need to parse the file specification written in Unix-style with regular slashes and separate the path and the filename. For example here is an example of fully qualified filename that we need to parse:

$_ = '/user/nick/mydata/phones.dat';
The regular expression {^/.*/} will match the full path of the file, because greediness guarantees that the last "/" in pattern will match the last "/" in fully qualified filename. This can be seen in the following small script:
$_ = '/user/nick/mydata/phones.dat';
m{/.*/};
print "full path=$&, filename=$'";
Try yourself and see what this program will display. You can see that the * quantifier is greedy. It matched the whole path as we wanted. But what if you want just to extract the top level directory only. Here the ? modifier to make it  non-greedy will help:

$_ = '/user/nick/mydata/phones.dat';
m{/(.+?)/.*/};
print "top level directory=$1, filename=$'";

Take a common example of C comments (this example is also discussed in Perl FAQ and in the Perl documentation). Say you want to match the bold text in the following:

/* this is a comment */ $a=1; /*another comment */

Try to think of a greedy solution here. We want to match a '/*' followed by all the text up to and including the next '*/'. If we try:

m{/\*.*\*/};

That with match the whole line. More correct solution would be to use non-greedy matching:

m{/\*(.*?)\*/};

which still isn't very readable expression, but at least it does the job.

Now we can understand that non greedy matching of patterns delimited with .* or .+ are very similar to search. And as such it's much more safer solution in many cases similar to the case discussed above.

Now lets discuss some implicit connections between various non greedy iterators.

One can see that the following regular expressions are equal:

  1. *? and {0,}?
  2. +? and {1,}?
  3. ?? and {0,1}?

Here are some more examples of minimal matchers in action, along with the what would be matched with non-greedy ones:

$_ = 'This is the time for all good men to come to the aid of their party';

/the(.*?)the/;

Problem

You have a pattern with a greedy quantifier like *, +, ?, or {}, and you want to stop it from being greedy.

A classic case of this is the naОve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi, actually deletes everything from the first open TT tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even effectively", completely changing the meaning of the sentence!

Solution

Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and {} into *?, +?, ??, and {}?, respectively.

Discussion

Perl has two sets of quantifiers: the maximal ones *, +, ?, and {} (sometimes called greedy) and the minimal ones *?, +?, ??, and {}? (sometimes called stingy). For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains" whereas /(r.*?s)/ matches "rl is".

With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the "or more" portion of that description. Thus /foo.*bar/ matches from the first "foo" up to the last "bar" in the string, rather than merely the next "bar", as some might expect. To make any of the regular expression repetition operators prefer stingy matching over greedy matching, add an extra ? . So *? matches zero or more times, but rather than match as much as it possibly can the way * would, it matches as little as possible.

# greedy pattern
s/<.*>//gs;                     # try to remove tags, very badly

# non-greedy pattern
s/<.*?>//gs;                    # try to remove tags, still rather badly

This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is not an acceptable replacement for a real parser. See Recipe 20.6 for the right way to do this.

Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern amidst other elements will always match the shortest amount of text between occurrences of BEGIN and END. Imagine the pattern /BEGIN(.*?)END/. If matched against the string "BEGIN and BEGIN and END", $1 would contain "and BEGIN and". This is probably not what you want.

Imagine if we were trying to pull out everything between bold-italic pairs:

<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>

A pattern to find only text between bold-italic HTML pairs, that is, text that doesn't include them, might appear to be this one:

m{ <b><i>(.*?)</i></b> }sx

You might be surprised to learn that the pattern doesn't do that. Many people incorrectly understand this as matching a "<b><i>" sequence, then something that's not "<b><i>", and then "</i></b>", leaving the intervening text in $1. While often it works out that way due to the input data, that's not really what it says. It just matches the shortest leftmost substring that satisfies the entire pattern. In this case, that's the entire string. If the intention were to extract only stuff between "<b><i>" and its corresponding "</i></b>", with no other bold-italic tags in between, it would be incorrect.

If the string in question is just one character, a negated class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. But the general way to say "match BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN and END is as follows (this also stores the intervening part in $1):

/BEGIN((?:(?!BEGIN).)*)END/

Applying this to the HTML-matching code, we end up with something like:

m{ <b><i>(  (?: (?!</b>|</i>). )*  ) </i></b> }sx

or perhaps:

m{ <b><i>(  (?: (?!</[ib]>). )*  ) </i></b> }sx

Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed really matters, such as:

m{
    <b><i> 
    [^<]*  # stuff not possibly bad, and not possibly the end.
    (?:
 # at this point, we can have '<' if not part of something bad
     (?!  </?[ib]>  )   # what we can't have
     <                  # okay, so match the '<'
     [^<]*              # and continue with more safe stuff
    ) *
    </i></b>
 }sx

See Also

The non-greedy quantifiers in the "Regular Expressions" section of perlre (1), and in the "the rules of regular expression matching" section of Chapter 2 of Programming Perl

Prev | Up | Contents | Down | Next



Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Disclaimer:

Last modified: September 05, 2009