Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)


Introduction to Perl for Unix System Administrators

(Perl without excessive complexity)

by Dr Nikolai Bezroukov



5.3. More Complex Perl Regular Expressions

 

Alternative delimiters

You need to use a backslash to escape any special character that you want to match literally.

For example:

$path =~ m/C:\\WINDOWS\\COMMAND/;

This tries to match C:\WINDOWS\COMMAND in $path. One additional special character is the symbol you use as the delimiter of the regular expression. As we saw, regular expressions are usually delimited by slashes, for example:

m/Hello/;

This is a heritage of the Unix vi editor, and it does not look nice from the point of view of lexical structure. But Perl allows the use of alternative pattern delimiters (a delimiter marks the beginning and end of a given pattern). A slash character inside a slash-delimited pattern needs to be escaped (prefixed with \), which is inconvenient when there are a lot of slashes, as in Unix full path names:

m/\/root\/home\/\.kshrc/

You can make the example above more readable by using an alternate pattern delimiter:

m"/root/home/\.kshrc" # here the double quote serves as a pattern delimiter

or better

m{/root/home/\.kshrc} # here { and } are delimiters

Note that if a left bracket is used as the starting delimiter, then the ending delimiter must be the matching right bracket. Both the match and substitution operators let you use variable interpolation. You can take advantage of this by keeping the path in a single-quoted string, so that the slashes do not have to be escaped. For instance:

$file = '/root/home/random.dat';
m/$file/;

If you choose the single quote as your delimiter character, then no variable interpolation is performed on the pattern. However, you still need to use the backslash character to escape any of the meta-characters discussed below. With any other delimiter the pattern is treated like a double-quoted string, so variables are interpolated into it:

$_ = "Hello world";
$regex = 'Hello';
if (m/$regex/) { # this is the same as m/Hello/
    print "There is a 'Hello' in the string $_\n";
}

Actually { } is probably the most readable alternative, because any decent editor (including Emacs, vi, vim, MultiEdit) can jump between matching opening and closing brackets:

$variable =~ m{ ... }; # this works well with vi or emacs because the brackets bounce

$variable =~ m( ... ); # this can also be useful

The same applies to substitution, for example:

s{ pattern to find }{ replacement text }sg;

This ability to find the matching bracket is useful when we deal with multi-line regular expressions that use the extended syntax described below.
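As a small sketch of the point above, the three matches below are equivalent and differ only in their delimiters. The path and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical path used only for illustration.
my $path = '/root/home/.kshrc';

# The same pattern written with three different delimiters.
my $with_slashes = ($path =~ m/\/root\/home\/\.kshrc/) ? 1 : 0;
my $with_braces  = ($path =~ m{/root/home/\.kshrc})    ? 1 : 0;
my $with_hash    = ($path =~ m#/root/home/\.kshrc#)    ? 1 : 0;

# Substitution with paired delimiters: s{pattern}{replacement}.
(my $moved = $path) =~ s{^/root/home}{/home/root};

print "$with_slashes $with_braces $with_hash $moved\n";
```

Only the slash-delimited version needs the extra backslashes; the other two read like the path itself.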

Operator =~ works with strings only

It would be nice to be able to match against an array too, but this is not the case. If you try something such as:

@buffer =~ m/yahoo/; # Is this the way to search for a string in an array? No.

the array will be converted to a scalar (the number of elements). If we assume that the array has 10 elements, you will be doing something like:

'10' =~ m/yahoo/;

The right way to solve this problem is to use the grep function, as in the example below:

grep(m/variable/, @buffer);

In scalar context grep returns the number of elements that matched. In list context it returns the list of elements that matched.
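The two contexts can be compared side by side. This is a minimal sketch; the contents of @buffer are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical list of visited sites.
my @buffer = ('www.yahoo.com', 'www.perl.org', 'mail.yahoo.com');

# List context: grep returns the elements that matched.
my @hits = grep { m/yahoo/ } @buffer;

# Scalar context: grep returns the number of matching elements.
my $count = grep { m/yahoo/ } @buffer;

print "$count matches: @hits\n";
```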

Special Cases of Matching Operator

If the match pattern evaluates to the empty string, the last successfully matched pattern is used. So, if you see a statement like

 if (//) {print;}

in a Perl program, look for the previous regular expression operator to see what the pattern really is. The substitution operator also uses this interpretation of the empty pattern (but never for the substitution part, which is a string, not a regular expression).

One of the more common uses of a regex is to find a substring in a string, but remember that in simple cases the index function is simpler and better. Remember also that quantifiers are greedy: starting from the leftmost position where a match is possible, each quantifier takes as much as it can:

$regex = "a*a";
$_ = "abracadabra";
if (m/$regex/) { print "Found $regex in $_\n"; }

When matching lines in a file you can print the matched lines along with their line numbers using the special variable $.

$target = "yahoo";
open(INPUT, "< visited_sites.dat") or die "Cannot open visited_sites.dat: $!";
while (<INPUT>) {
     if (/$target/o) {
         print "Site $target was visited: $. $_";
     }
}
close(INPUT);

The $. special variable keeps track of the record number. Every time the diamond operator reads a line, this variable is incremented.

Please note that this example would be better programmed using the index function.
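A sketch of the index-based variant follows. To keep it self-contained it scans a small in-memory list of lines (an assumption standing in for visited_sites.dat) and counts the lines by hand instead of relying on $.

```perl
use strict;
use warnings;

my $target = 'yahoo';
# Hypothetical log lines standing in for visited_sites.dat.
my @lines = ("www.perl.org\n", "www.yahoo.com\n", "www.cpan.org\n");

my @found;
my $line_no = 0;
for my $line (@lines) {
    $line_no++;
    # index returns -1 when the substring is absent.
    push @found, $line_no if index($line, $target) >= 0;
}
print "Found on line(s): @found\n";
```

Because index does a plain substring search, characters such as '.' or '(' in $target need no escaping.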

Backreferences and the replacement string

So the question arises: what additional capabilities make patterns superior to string functions in complex situations? The answer is that patterns have so-called matching memory or pattern memory: a set of special variables that are assigned values during the matching operation, for the pattern as a whole and for each component of the pattern enclosed in parentheses. Pattern memory is often called backreferences. This memory persists after the execution of a particular match statement. You can think of backreferences as a special kind of assignment statement.

Each time you use parentheses '()' in a regex, Perl assumes that you want to assign the result of matching to a special variable with a name like $1, $2, $3, and so on. Naturally, $1 refers to the first matched subpattern, $2 to the second, and $3 to the third. These variables can be accessed directly, by name, or indirectly by assigning the result of the matching expression to an array.

You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program

$_ = "a=5";
m/(\w+)\s*=\s*(\d+)/;
print("$1, $2\n");

will display

a, 5

This is a simplified example of how one can process Unix-style configuration files, which often contain such statements. You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. If you want to find all the words in a string, you need to use the /g match option with a loop that runs until the match operator returns false.
$_ = "word1 word2 word3";

while (/(\w+)/g) {
    print("$1\n");
}

Naturally, this program will display

word1
word2
word3

because each iteration prints exactly one new match. As you can see, with the g option the pattern remembers its position in the string, and the loop extracts the matching parts one by one. A more interesting approach to a similar problem is to use an array on the left side of the assignment statement:
$_ =  "word1 word2 word3";
@matches = /(\w+)/g;
print("@matches\n");
The program will display
word1 word2 word3
To help you find out what matched and what did not, Perl has several auxiliary built-in variables with really horrible names: $& holds the matched string, $` holds the text before the match, and $' holds the text after the match.

For example:

$text = "this matches 'THIS' not 'THAT'";

$text =~ m"('TH..')";

print "$` $'\n";

Here Perl saves and later prints the substring "this matches " as $` and " not 'THAT'" as $'. The matched text 'THIS' (including the quotes) is saved in $1 and in $&. Note that a regular expression matches the first occurrence in the string: 'THIS' was matched because it came first. With the default regexp behavior, 'THIS' will always be the string that is matched (you can change this default with modifiers, see below).

If you need to keep the values stored in the pattern memory, make sure to assign them to other variables. Pattern memory is local to the enclosing block and lasts only until the next successful match.

Here are some more examples:

$text = 'word1 word2 word3';

($word1, $word3) = ($text =~ m"(\w+).*\b(\w+)");

Notice, however, that the assignment yields useful values only when the text string matches. (The \b keeps the greedy .* from swallowing all but the last character of the final word.) When the text string does not match, $word1 and $word3 are left undefined. Try the example above with the string "1999 2000 2001" to see the result. So, what happens to the special variables if your regular expression does not match at all? Nothing is assigned to them, and they keep their values (so the values from the previous successful match, if any, would be used).

Backreferences are not set if a regular expression fails

This is a frequent Perl 'gotcha'. Built-in variables like $1 do not change if the regular expression fails. Some people think this is a bug, others consider it a feature. Either way, the point becomes painfully obvious when you consider the following code.

$_ = 'Perl bugs bite';

/\w+ (\w+) \w+/; # sets $1 to be "bugs".

$_ = 'Another match another bug';

/(^a.*\s)/; # ^a.*\s will not match: the string starts with an uppercase 'A'

print $1; # Surprise! "bugs" will be printed!

In this case, $1 is the string 'bugs', since the second match expression failed! This Perl behavior can cause hours of searching for a bug. So, consider yourself warned. More to the point, always check that a match was successful before using its backreferences. You can use one of the following three constructions, listed in order of preference.

1) if clause. This is the simplest and the most safe method to use.

if (/(^a.*\s)/) { $matched = $1; }

else { print "matching failed"; }

2) direct assignment: Since you can assign the result of a match directly to a list, you can take advantage of the fact that the variables are left undefined if the match fails. For example:

($match1, $match2) = /(\w+).*(\w+)/;

3) Shell style short circuiting method. This is a Perl idiom similar in style to Unix shell practice but it's not very useful.

($scalarName =~ m"(regular expression)") && ($match = $1);

Although the first method is the cleanest, any one will do the job. In any case your pattern matching code should protect against reading backreferences that were not set by the current match.
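The difference between the unguarded and the guarded form can be seen in a short sketch. It reuses the strings from the example above; the variable names $stale and $matched are made up for illustration.

```perl
use strict;
use warnings;

$_ = 'Perl bugs bite';
/\w+ (\w+) \w+/;              # succeeds, $1 is now "bugs"

$_ = 'Another match another bug';

# Unguarded: the match fails, so $1 still holds the old value.
/(^a.*\s)/;
my $stale = $1;

# Guarded: read $1 only when the match succeeded.
my $matched = /(^a.*\s)/ ? $1 : undef;

print "stale: $stale\n";
print "matched: ", (defined $matched ? $matched : 'nothing'), "\n";
```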

Using Backreferences in the Regular expression

Perl provides a very interesting possibility -- backreferences are available in the matching pattern itself. In other words, if you put parentheses around a group of characters, you can use this group of characters later in the regular expression or substitution string. But there is an important  syntactic difference -- if you want to use the backreferences in the matching pattern  you need to use the syntax \1, \2, etc. If you want to use the backreferences in substitution string you use regular $1, $2, etc.

Perl is Perl and there are some irregularities in Perl regular expressions ;-) Here are some examples:

$_ = 'Hello world';

s/(\w+) (\w+)/$2 $1/; # This makes the string 'world Hello'.

The first example demonstrates a simple substitution. Now let's try using a backreference in the matching pattern itself:

if (/(A)(B)\2\1/) { print "Hello ABBA";}

The example is pretty artificial, but it illustrates the key concept well. There are 4 steps in matching the string 'ABBA':

1) (A) matches the letter 'A', which is saved into \1 and $1.

2) (B) matches the string 'B', which is stored into \2 and $2.

3) \2 then matches the second 'B' in the string, because it is equal to 'B'.

4) \1 matches the final 'A'.

Note: The replacement part of a substitution is a string, not a pattern. Variables such as $1 are interpolated into it, but regular expression meta-characters have no special meaning there.
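A less artificial use of \1 is finding accidentally doubled words. The sketch below is one possible way to do it; the sample sentence is invented.

```perl
use strict;
use warnings;

my $text = 'This is is a test of the the parser';

# \1 inside the pattern refers to whatever group 1 just matched.
my @doubled;
while ($text =~ m/\b(\w+)\s+\1\b/g) {
    push @doubled, $1;
}

# In the replacement part use $1, not \1.
(my $fixed = $text) =~ s/\b(\w+)\s+\1\b/$1/g;

print "doubled: @doubled\n";
print "$fixed\n";
```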


 Counting the number of matches

Perl provides several ways to specify how many times a given component must be present for the match to succeed. You can specify both the minimum and the maximum number of repetitions.

Quantifiers

Quantifier Description
{n} The component must be present n times.
{n,} The component must be present at least n times.
{n,m} The component must be present at least n times and no more than m times.

One can see that the old quantifiers that we already know (*, + and ?) can be expressed via the new ones:

  1. * is the same as {0,}
  2. + is the same as {1,}
  3. ? is the same as {0,1}

For example, the pattern m/^\w+/ will match "You" and "The" but not "" or " The". In order to account for leading whitespace, which may or may not be present at the beginning of a line, you need to use the asterisk (*) quantifier in conjunction with the \s symbolic character class in the following way:

m/^\s*\w+/;

Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The pattern /b*/ will match any string, even one without any b characters.
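The bounded form {n,m} and the empty-match pitfall can both be checked with a few test strings. This is only a sketch; the strings are arbitrary.

```perl
use strict;
use warnings;

# {n,m} bounds the number of repetitions.
my @results;
for my $s ('ab', 'aab', 'aaab', 'aaaab') {
    # Two or three "a" characters, then "b", anchored at both ends.
    push @results, ($s =~ m/^a{2,3}b$/ ? 'yes' : 'no');
}
print "@results\n";

# * is the same as {0,}, so it can succeed on zero occurrences.
my $empty_ok = ('xyz' =~ m/b*/) ? 1 : 0;
print "b* matched 'xyz': $empty_ok\n";
```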

At times, you may need to match an exact number of components. The following match statement will be true only if at least three whitespace-separated fields are present at the start of the $_ variable:

$_ = '131.1.1.1 - joejerk [21/Jan/2000:09:50:50 -0500] "GET http://216.1.1.1/xxxgirls/bigbreast.gif HTTP/1.0" 200 51500';
m/^(\S+\s+){2}(\S+)/; # get the user name of the offender

In this example, we are interested in getting exactly the third field, which corresponds to the user id in HTTP logs. The quantified group skips the first two fields and the next group captures the third, so after the match $2 contains this id ($1 holds only the last repetition of the quantified group).

The same ideas can be used for processing date and time in the HTTP logs.
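For instance, counted quantifiers can pin down each component of the timestamp. The log line below is a hypothetical one in the same format, and the variable names are assumptions made for the sketch.

```perl
use strict;
use warnings;

# Hypothetical log line in the same format as above.
my $line = '131.1.1.1 - joe [21/Jan/2000:09:50:50 -0500] "GET /index.html HTTP/1.0" 200 51500';

# {2}, {3} and {4} fix the number of characters in each component.
my ($day, $month, $year, $time);
if ($line =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2})\s}) {
    ($day, $month, $year, $time) = ($1, $2, $3, $4);
    print "$day $month $year at $time\n";
}
```

Using m{...} here avoids escaping the slashes that separate day, month and year.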

Extended syntax and option /x

The x suffix lets you put whitespace and comments inside a regular expression. It is typically used together with the extension constructs described below, which add several important matching capabilities.

The extensions have the general form (?<special character(s)><text>). Here, <special character(s)> represent the extension, and <text> is the text that the extension acts on.

The most useful of the extensions listed below is grouping without creating a backreference.

Five Extension Components

Extension Description
(?# TEXT) This extension lets you add comments to your regular expression. The TEXT value is ignored.
(?:...) This extension lets you add parentheses to your regular expression without causing a pattern memory position to be used.
(?=...) This extension lets you match values without including them in the $& variable.
(?!...) This extension lets you specify what should not follow your pattern. For instance, /blue(?!bird)/ means that "bluebox" and "bluesy" will be matched but not "bluebird".
(?sxi) This extension lets you specify an embedded option in the pattern rather than adding it after the last delimiter. This is useful if you are storing patterns in variables and using variable interpolation to do the matching.
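Three of the extensions from the table can be tried on the "blue" words mentioned there. This is a minimal sketch; the variable names are made up.

```perl
use strict;
use warnings;

my @words = qw(bluebox bluesy bluebird);

# (?!bird): "blue" must not be followed by "bird".
my @not_birds = grep { /blue(?!bird)/ } @words;

# (?=...): the look-ahead is checked but is not part of $&.
my $matched = '';
$matched = $& if 'bluebird' =~ /blue(?=bird)/;

# (?i) embeds the case-insensitive option in the pattern itself.
my $pattern = '(?i)BLUE';
my $ci = ('bluebox' =~ /$pattern/) ? 1 : 0;

print "@not_birds | $matched | $ci\n";
```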

By far the most useful feature of extended mode, in my opinion, is the ability to add comments directly inside your regular expressions. For example, instead of the regular expression:

# Match an assignment like a=b;. $1 will be the name of the variable
# (the first word). $2 will be the right part.
m/^\s*(\w+)\s*=\s*(.*);/;

We can write

m/^\s*      # leading spaces, if any
(\w+)       # get the first word
\s*=\s*     # match = with white space before and after ignored
(.*)        # right part
;           # final semicolon
/x;

Here we moved each group to a separate line. This improves readability and gives us the opportunity to add comments.

But you can go too far and kill with kindness. Here is an example of an over-commented regular expression that is more difficult to read than the one-line version:

m/
    (?# This pattern will match any Unix style assignment in a configuration file.)
    (?# Results are put into $1 and $2 if the match is successful.)

    ^      (?# Anchor this match to the beginning of the string)           

    \s*    (?# skip over any whitespace characters)
           (?# we use the * because there may be none)

    (\w+)  (?# Match the first word, we know it's)
           (?# the first word because of the anchor)
           (?# above. Place the matched word into)
           (?# pattern memory.)

    \W+    (?# Match at least one non-word)
           (?# character, there may be more than one)

    (\w+)  (?# Match another word, put into pattern)
           (?# memory also.)

    \s*    (?# skip over any whitespace characters)
           (?# use the * because there may be none)

    $      (?# Anchor this match to the end of the)
           (?# string. Because both ^ and $ anchors)
           (?# are present, the entire string will)
           (?# need to match the pattern. A)
           (?# sub-string that fits the pattern will)
           (?# not match.) 
/x;
In general, if you do not resort to excesses like the example above, the commented version should be much easier to maintain.

Extensions also let you group components without affecting pattern memory. For example,

m/(?:Operator|Unix)+/;

matches one or more consecutive occurrences of the strings Operator and Unix, in any order. The pattern memory will not be affected.
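The effect on buffer numbering can be seen in a small sketch. The input string is invented for the example.

```perl
use strict;
use warnings;

my $text = 'UnixOperatorUnix root';

# (?:...) groups the alternation for + without using a capture buffer,
# so $1 is the whole run and $2 is the word that follows it.
my ($prefix, $word);
if ($text =~ m/^((?:Operator|Unix)+)\s+(\w+)/) {
    ($prefix, $word) = ($1, $2);
}
print "$prefix / $word\n";
```

Had the inner group been written with plain parentheses, the following word would have ended up in $3 instead of $2.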

At times, you might like to include a pattern component in your pattern without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:

A:B:C

and you want to find all operators in /etc/passwd file and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example:

while (<>) {
    push(@array, $&) if m/^\w+(?=\s+Operator\s+)/;
}

print("@array\n");
Let's look at the pattern with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the pattern because the pattern is part of the if statement modifier. Adding comments in that location would make the comments hard to format.

So we can use a different tactic and put the pattern in a variable:

$pattern = '^\w+     (?# Match the first word in the string)

            (?=\s+   (?# Use a look-ahead assertion to match)
                     (?# one or more whitespace characters)

            Operator (?# text to match but not to include)

            \s+      (?# one or more whitespace characters)
            )        (?# close the look-ahead assertion)';

while (<>) {
    push(@array, $&) if m/$pattern/xo;
}

print("@array\n");
Here we used a variable to hold the pattern and then used variable interpolation in the pattern with the match operator. To speed things up we use the o modifier, which tells Perl to compile the regular expression only once.

The last extension that we'll discuss is the zero-width negative assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not an operator. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.

There are many ways of matching any value.

If the first method you try doesn't work, try breaking the value into smaller components and match each boundary.

If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.

Nested Backreferences

As backreferences are implicit assignments they can be nested. Let's discuss parsing of date format in HTTP logs.

m{(\[(\d)*\])};

Here, the outermost parentheses capture the whole thing: 'softly slowly surely subtly'. The innermost parentheses capture a string beginning with an 's' and ending with 'ly', followed by spaces. Hence, the inner group first captures 'softly', throws it away, then captures 'slowly', throws it away, then captures 'surely', and finally captures 'subtly'.
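The behavior described for the 'softly slowly surely subtly' string can be reproduced with a sketch like the one below. The pattern ((s\w*ly\s*)*) is an assumption chosen to fit that description, not necessarily the pattern the text had in mind.

```perl
use strict;
use warnings;

my $text = 'softly slowly surely subtly';

# Groups are numbered by their opening parenthesis:
# $1 is the outer group, $2 the inner one.
my ($outer, $inner);
if ($text =~ m/((s\w*ly\s*)*)/) {
    ($outer, $inner) = ($1, $2);
}
print "outer: '$outer'\n";
print "inner: '$inner'\n";
```

The inner buffer keeps only what its last repetition matched, which is why the earlier words are "thrown away".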

The first two examples are fairly straightforward. '[0-9]' matches the digit '1' in 'this has a digit (1) in it'. '[A-Z]' matches the capital 'A' in 'this has a capital letter (A) in it'. The last example is a little bit trickier. Since there is only one 'an' in the pattern, the only characters that can possibly match are the last four 'an A'.

However, by asking for the pattern 'an [^A]' we have distinctly told the regular expression to match 'a', then 'n', then a space, and finally a character that is NOT an 'A'. Hence, this does not match. If the pattern was 'match an A not an e', then this would match, since the first 'an' would be skipped, and the second matched! Likewise:

$scalarName = "This has a tab(\t)or a newline in it so it matches";

$scalarName =~ m"[\t\n]"; # Matches either a tab or a newline.

# matches since the tab is present.

This example illustrates some of the fun things that can be done with matching and wildcarding. The same escape sequences that are interpolated in a " " string are also interpolated in a regular expression, including inside a character class denoted by brackets ([\t\n]). Here, "\t" matches a tab, and "\n" matches a newline.

Backreferences in array context

There are several cases:

The first case is scalar context, no modifiers. This is not very interesting: as was discussed above, the result is simply true (1) or false. A much more interesting case is matching in array context, no modifiers. Here the operator matches at the first position where the regular expression can match, and simply returns the backreferences in a form that is quickly accessible. For example:

($variable, $equals, $value) = ($line =~ m"(\w+)\s*(=)\s*(\w+)");

This takes the first reference (\w+) and makes it $variable, the second reference (=) and makes it $equals, and the third reference (\w+) and makes it $value.

Another interesting case is Matching in array context, 'g' modifier. This takes the regular expression, applies it as many times as it can be applied, and then stuffs the results into an array that consists of all possible matches. For example:

$line = '1.2 3.4 beta 5.66';

@matches = ($line =~ m"(\d*\.\d+)"g);

will make '@matches' equal to '(1.2, 3.4, 5.66)'. The 'g' modifier does the iteration, matching 1.2 first, 3.4 second, and 5.66 third. Likewise:

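use FileHandle;

undef $/; # slurp mode: read the whole file at once

my $FD = new FileHandle("file");

@comments = (<$FD> =~ m"/\*(.*?)\*/"sg);

will make an array of all the comments in the file read through $FD. The s modifier lets a comment span several lines, and the g modifier collects every comment rather than just the first.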

Matching in scalar context, 'g' modifier

Finally, if you use the matching operator in scalar context, you get a behavior that is entirely different from anything else (in the regular expression world, and even the Perl world). This is that 'iterator' behavior we talked about. If you say:

$line = "BEGIN <data> BEGIN <data2> BEGIN <data3>";

while ($line =~ m"BEGIN(.*?)(?=BEGIN|$)"sg){

push(@blocks, $1);

}

On successive iterations of the while loop this captures ' <data> ', ' <data2> ' and ' <data3>', and stuffs them into @blocks:

BEGIN <data>(%)BEGIN <data2> BEGIN <data3>

BEGIN <data> BEGIN <data2>(%)BEGIN <data3>

BEGIN <data> BEGIN <data2> BEGIN <data3>

We have indicated with '(%)' where the second and third iterations start their matching. Note the use of (?=) in this example too! It is essential for matching correctly: without it, the BEGIN that ends one block would be consumed, and the next iteration would start in the wrong place.

Precedence in Regular Expressions

Pattern components have an order of precedence just as operators do. If you see the following pattern:
m/a|b+/
it's hard to tell if the pattern should be
 m/(a|b)+/  # match any sequence of  "a" and "b" characters 
             # in any order.
or
m/a|(b+)/   # match either the "a" character or the "b" character
            # repeated one or more times.
The order of precedence is shown in the table below. By looking at the table, you can see that quantifiers have a higher precedence than alternation. Therefore, the second interpretation is correct.

 The Pattern Component Order of Precedence

Precedence Level Component
1 Parentheses
2 Quantifiers
3 Sequences and Anchors
4 Alternation

 

You can use parentheses to affect the order in which components are evaluated because they have the highest precedence. However, you need to use the (?:...) extension for such grouping, or you will be affecting the pattern memory.
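The two readings of a|b+ can be told apart with a quick experiment. This is a sketch; the test string 'abba' is arbitrary.

```perl
use strict;
use warnings;

my $s = 'abba';

# The quantifier binds tighter than alternation: a|b+ means a|(?:b+).
my ($first) = $s =~ m/^(a|b+)/;

# Grouping changes this: (?:a|b)+ repeats the whole alternation.
my ($all) = $s =~ m/^((?:a|b)+)/;

print "$first $all\n";
```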

The quotemeta function

Now let's introduce one new thing: both the matching and the substitution operators perform variable interpolation both in the pattern and substitution strings, for example:

$variable =~ m"$scalar";

then $scalar will be interpolated, that is, replaced by its value. There is a caveat here. Any special characters will be acted upon by the regular expression engine, and may cause syntax errors. Hence if $scalar is:

$scalar = "({";

Then saying something like:

$variable =~ m"$scalar";

is equivalent to saying: $variable =~ m"({"; which is a runtime syntax error. If you say:

$scalar = quotemeta('({');

instead, $scalar becomes '\(\{', and the match is then equivalent to:

$variable =~ m"\(\{";

Then, you will match the string '({' as you would like.
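A short sketch of both ways to quote a string for use in a pattern: the quotemeta function and the equivalent \Q...\E escape inside the pattern. The sample text is invented.

```perl
use strict;
use warnings;

my $scalar = '({';
my $text   = 'open ({ here';

# quotemeta puts a backslash before every non-word character.
my $quoted = quotemeta($scalar);

my $found_quoted = ($text =~ m/$quoted/)     ? 1 : 0;
# \Q...\E applies the same escaping inside the pattern.
my $found_inline = ($text =~ m/\Q$scalar\E/) ? 1 : 0;

print "$quoted $found_quoted $found_inline\n";
```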

You can use an array in a regex (it will be converted to a string with elements separated by spaces, as in a print statement with a double-quoted string), but this is tricky and rarely used:

$variable =~ m/@arrayName/; # this equals m/elem1 elem2/;

Here, this is equal to m/elem1 elem2/. If the special variable $" was set to '|', this would be equal to m/elem1|elem2/, which matches either 'elem1' or 'elem2' in a string. This works for special characters too.

For example:

$_ = "AAA BBB AAA";
print "Found bbb\n" if  m/bbb/i;
This program finds a match even though the pattern uses lowercase and the string uses uppercase because the /i option was used, telling Perl to ignore the case. The result from a global pattern match (option g) can be assigned to an array variable or used inside a loop.

As we already know the substitution operator has all options used in the matching operator plus several more.  One interesting option is the capability to evaluate the replacement pattern as an expression instead of a string. You could use this capability to find all numbers in a file and multiply them by a given percentage. Or you could repeat matched strings by using the string repetition operator.
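Both ideas can be sketched with the e option. The price list and the 10 percent figure are made up for the example; integer arithmetic is used to keep the output exact.

```perl
use strict;
use warnings;

# Hypothetical price list; raise every number by 10 percent.
my $prices = 'disk 100, tape 50';
(my $raised = $prices) =~ s/(\d+)/$1 * 110 \/ 100/ge;

# Repeat each matched string with the repetition operator x.
(my $doubled = 'ab-cd') =~ s/(\w+)/$1 x 2/ge;

print "$raised\n$doubled\n";
```

Note that the division sign has to be escaped because the slash is also the delimiter here; s{...}{...}ge would avoid that.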

In Perl 4, if back quotes were used as delimiters, the replacement string was executed as a DOS or UNIX command and the output of the command was used as the replacement text. Perl 5 treats back quotes as ordinary delimiters and does not execute the replacement.

s modifier

Without modifiers, a dot ('.') matches anything but a newline. Sometimes this is helpful. Sometimes it is very frustrating, especially if you have data that spans multiple lines. Consider the following case:

$line =

'BLOCK1:

<text here1>

END BLOCK

BLOCK2:

<text here2>

END BLOCK';

Now suppose you want to match the text between blocks <text here[0-9]>:

$line =~ m{

BLOCK(\d+)

(.*?)

END\ BLOCK # Note backslash. Space will be ignored otherwise

}x;

This does not work. Since the wildcard ('.') matches every character EXCEPT a newline, the regular expression hits a dead end when it gets to the first newline and it STOPS MATCHING RIGHT THERE.

Sometimes, as in this case, it is helpful to have the wildcard ('.') match EVERYTHING, including the newline. This is what the 's' modifier does. (The \s class is not affected: it matches newlines as well as tabs and spaces with or without this modifier.)

It tells Perl not to assume that the string you are working on is one line long. The above then does work with an 's' on the end of the regular expression:

$line =~ m{

BLOCK(\d+)

(.*?)

END\ BLOCK # Note backslash. Space will be ignored otherwise

}sx;

With the 's' on the end, this now works.

m modifier

The 'm' modifier complements the 's' modifier. In other words, it says 'treat the string as multiple lines, rather than one line'.

This makes '^' and '$' match not only at the beginning and end of the string (respectively): ^ also matches right after any newline, and $ also matches right before any newline. In the example,

$line = 'a

b

c';

$line =~ m"^(.*)$"m;

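the 'm' modifier lets the pattern match the first line and makes the backreference '$1' become 'a'. Without it the match fails, because '.' does not match a newline and '$' matches only at the end of the string.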
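The effect of the two modifiers on the same pattern can be compared directly. This is a minimal sketch using a three-line string.

```perl
use strict;
use warnings;

my $line = "a\nb\nc";

# With /m, ^ and $ match at every line boundary; /g collects all lines.
my @lines = ($line =~ m/^(.*)$/mg);

# Without /m (and without /s) the pattern cannot match this string at all.
my $plain = ($line =~ m/^(.*)$/) ? 1 : 0;

# With /s alone, the dot crosses newlines and $1 is the whole string.
my ($whole) = $line =~ m/^(.*)$/s;

print scalar(@lines), " lines; plain=$plain; whole has ", length($whole), " chars\n";
```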

o modifier

The 'o' modifier tells Perl to compile the regular expression only once.

This modifier is helpful when you have a long, long expression that contains interpolated variables. Consider, when you say something like:

$line =~ m"<very long expression>";

where '<very long expression>' is a paragraph long, or even pages long, and is built from variables. As it stands, each time that Perl hits this regular expression, it has to interpolate the variables and check whether the pattern must be compiled again. This takes time, and the more complicated the pattern you need to match, the longer your regular expression will be.

In Jeffrey Friedl's book, there is an expression that matches email addresses which comes out to 6598 bytes long! Without the 'o' modifier it would be sunk, but if you compile it only once, it is usable.

However, there is one caveat you should be aware of. If you say:

$line =~ m"$regex"o;

you make a promise to Perl that $regex will not change. If it does, Perl will not notice your change. Hence,

$regex = 'b';

while ('bbbbb' =~ m"$regex"o) { $regex = 'c'; }

is actually an infinite loop in Perl. $regex changes, but the change is not reflected in the regular expression. (This does not, however, limit you to only one regexp per program. Each expression with 'o' is compiled separately, the first time it is used.)

e modifier

The 'e' modifier evaluates the second part of s/// as a complete 'mini-Perl program' rather than as a string.

The 'e' modifier for s/// is pretty cool, but also very involved. You can do pretty heavy wizardry with it. We'll just mention it briefly here, with an example. Let's say that you wanted to substitute all of the letters in the following string with their corresponding ASCII number:

$string = 'hello';

$string =~ s{ ( \w ) } # we save the $1.

{ord($1). " ";}egx;

This example turns $string into '104 101 108 108 111 '. Each character was taken in turn here and run through the 'ord' function, which turned it into a number. Needless to say, this can do pretty powerful stuff in a short amount of time. It also runs the risk of being extremely obscure.

We suggest you use this logic as a last resort, when all of your other 'bag of tricks' has failed. Its use can sometimes hide a cleaner way of doing things. Is the above really clear? Or is this better:

$string = turnToAscii($string);

sub turnToAscii{

my ($string) = @_;

my ($return, @letters);

@letters = split(//, $string);

foreach $letter (@letters) {

    $letter = ord($letter) . " " if ($letter =~ m"\w");

}

$return = join('', @letters);

$return;

}

This latter example is explicit and easily maintainable. However, it is also over 10 times as long and also a few times slower, so a judgment call has to be made on when to use 'e'.

g modifier in loops

There is one significant difference in how the 'g' modifier works in loops. As was seen before, the 'g' modifier in a substitution means that every single instance of a regular expression is replaced. However, this is meaningless in the context of matching. Perl uses the 'g' modifier as an iterator: when you match once with '$string =~ m" "g', Perl remembers where that match ended. This means that the next match continues from where you left off. When Perl hits the end of the string and the match fails, the iterator is reset:

$line = "hello stranger hello friend hello sam";

while ($line =~ m"hello (\w+)"sg){

    print "$1\n";

}

This outputs

stranger

friend

sam

and then quits, because the inherent iterator comes to the end of the string.
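The iterator's position can be inspected with the built-in pos function. The sketch below records it after every successful match of the loop above.

```perl
use strict;
use warnings;

my $line = 'hello stranger hello friend hello sam';

# pos() reports where the next /g match on $line will start.
my @positions;
while ($line =~ m/hello (\w+)/g) {
    push @positions, pos($line);
}
# After the final failed match the position is reset to undef.
my $reset = defined pos($line) ? 0 : 1;

print "@positions reset=$reset\n";
```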

There is one caveat here. If you are using the 'g' modifier, then ANY modification to the variable being matched via assignment causes this iterator to be reset.

$line = "hello";

while ($line =~ m"hello"sg){

   $line = $line.'.';

}

This is an infinite loop!





Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: September 04, 2009