Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Overview of regular expressions in Perl


The Hello World Example

Again, I would like to stress that the regular expression mini-language is a separate language that has no direct connection with the Perl language itself.

As it is a new language, using the famous "Hello world" program as the first program seems appropriate. As a remnant of Perl's shell/AWK legacy, a regular expression is lexically a special type of literal (similar to a double-quoted literal). It is usually enclosed in slashes, and the source string (where matching occurs) is specified on the left side of the special =~ operator (the matching operator). The simplest case is to search for a substring in a string, as with the index function.

Regexes operate on strings. No arrays, please.

This can be accomplished as shown below. The following expression is true if the string Hello appears anywhere in the variable $sentence.

$sentence = "Hello world";
if ($sentence =~ /Hello/) {...} # true if the string Hello appears in the variable $sentence

Regular expressions (also called regex or RE) are case sensitive, so if

$sentence = "hello world";

then the above match will fail.

The operator !~ can be used for spotting a non-match. In the above example

$sentence !~ /Hello/

is true if the string Hello does not appear in $sentence.
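The two operators can be tried side by side. The following is a minimal sketch; the helper name contains_hello is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# contains_hello is a hypothetical helper used only in this sketch.
sub contains_hello {
    my ($sentence) = @_;
    return $sentence =~ /Hello/ ? 1 : 0;   # case-sensitive substring match
}

for my $sentence ("Hello world", "hello world") {
    if ($sentence !~ /Hello/) {            # !~ is true when there is no match
        print "'$sentence' does not contain 'Hello'\n";
    } else {
        print "'$sentence' contains 'Hello'\n";
    }
}
```

The second string fails the match only because of the lowercase h.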

Two types of regex

There are two main uses for regular expressions in Perl: matching (the m// operator) and substitution (the s/// operator).

Two Binding Operators (=~ and !~)
$_ is the default operand for regular expressions

The binding operators let you bind the regular expression operators to a variable other than $_. There are two forms of the binding operator: the regular =~ and its negation  !~. For example:

$my_string = "The graph has many leaves";
if ($my_string =~ m/graph/) {print("The source string contains the word 'graph'.\n");}
$my_string =~ s/graph/tree/;
print("\$my_string = $my_string\n");
In this example each of the regular expression operators applies to the $my_string variable instead of $_.  We can make this example slightly more complex by trying to capture the results of the matching and substitution, respectively:
$my_string    = "The graph has many leaves";
$match        = ($my_string =~ m/graph/);
$substitution = ($my_string =~ s/graph/tree/);

print("\$match         = $match\n");
print("\$substitution  = $substitution\n");
print("\$my_string     = $my_string\n");

This program displays the following:

$match         = 1
$substitution  = 1
$my_string     = The tree has many leaves

The other useful feature of this example is that it shows you how to obtain the return values of the regular expression operators. If you don't need the return values, you can simply omit the assignment, as the first example in this section does. That example ends by displaying:

$my_string = The tree has many leaves

We could use a conditional as

if ($sentence =~ /world/){
	print "there is a substring 'world' somewhere in the sentence: $sentence\n";
}

which would print out a message if we had either of the following

$sentence = "Hello world";
$sentence = "Disneyworld in Orlando";

Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop.  In this case you  can write something like:

while (<>) { # get "Hello world" from the input stream

	if (/world/) {
		print "There is a word 'world' in the sentence $_\n";
	}
}

As we have already seen, the $_ variable is the default operand for many Perl built-in functions and operators (tr, split, etc.).
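The same implicit use of $_ can be tried without an input stream. This is a minimal sketch that assumes the lines are already in a list; the helper name lines_with_world is hypothetical.

```perl
use strict;
use warnings;

# lines_with_world is a hypothetical helper used only in this sketch.
# grep sets $_ to each element in turn, so /world/ needs no =~ operator.
sub lines_with_world {
    my @lines = @_;
    return grep { /world/ } @lines;
}

my @hits = lines_with_world("Hello world", "Disneyworld in Orlando", "Goodbye");
print "$_\n" for @hits;   # the statement modifier 'for' also sets $_
```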

Regular Expressions Metacharacters

The problem with regex is that there are plenty of special characters (metacharacters), many of which denote a set of acceptable characters. That gives regex a lot of power and at the same time makes them appear very complicated, at least at the very beginning.

It's best to build up your skills slowly. Creating complex regexes can be considered a kind of art form (like the Japanese art of flower arrangement, or chess problems). The most important special characters include three metacharacters and three modifiers (also metacharacters, but they affect the interpretation of the preceding character).

Here is a slightly more complex example of our traditional Hello World program for regular expressions:

$sentence = "Hello world";

if ($sentence =~ /^Hello/) { # true if the sentence starts with "Hello"
	print "The string $sentence starts with Hello\n";
}

The match operation returns true (1) or false (the empty string) depending on whether
a given regular expression matched the string

The metacharacter ^ means the beginning of the string. So the regex /^Hello/ will match only if the word Hello comes first in the string and there are no blanks before it. Try modifying the string and see the results yourself.
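One way to run that experiment is sketched below. The helper name starts_with_hello and the three sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# starts_with_hello is a hypothetical helper used only in this sketch.
sub starts_with_hello {
    my ($sentence) = @_;
    return $sentence =~ /^Hello/ ? 1 : 0;   # ^ anchors the match at position 0
}

for my $sentence ("Hello world", " Hello world", "Say Hello") {
    printf "%-15s => %s\n", "'$sentence'",
        starts_with_hello($sentence) ? "match" : "no match";
}
```

Only the first string matches: the second has a leading blank and the third has Hello in the wrong position.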

In the following sections of this chapter I will try to collect several typical regexes. (Remember that to use them in your scripts you need to enclose them in /.../ slashes.)

Here is a more complete list of metacharacters:

  • . -- matches any character except newline (character grouping [^\n]); it matches any character, including newline, when you use the s modifier, as in m"(.*)"s (see modifiers)
  • \d -- matches a digit (character grouping [0-9])
  • \D -- matches a non-digit (character grouping [^0-9])
  • \w -- matches a word character (character grouping [a-zA-Z0-9_]; underscore is counted as a word character here)
  • \W -- matches a non-word character (character grouping [^a-zA-Z0-9_])
  • \s -- matches a 'space' character (character grouping [ \t\n\r\f]: space, tab, newline, carriage return, form feed)
  • \S -- matches a 'non-space' character (character grouping [^ \t\n\r\f])
  • $ -- anchor that matches the 'end of line', if placed at the end of a regular expression
  • ^ -- anchor that matches the 'beginning of line', if placed at the beginning of a regular expression
  • \b, \B -- anchors that match a word boundary (\b) or the lack of a word boundary (\B)

    Examples

    It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work, and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex regex in one giant step.

    Here are a few examples:

    $a = '404 - - ';

    $a =~ m/40\d/; # matches 400, 401, 403, etc.

    Here we take a fragment of an HTTP log record and try to match the return code. Note that you can match any part of the integer, not only the whole integer. A similar idea works for reals, but generally reals have a much more complex syntax:

    $target='simple_real  22.33';

    $target=~/\d+\.\d*/;

    Note: the regex /\d+\.\d*/ isn't general enough to match all the real numbers permissible in Perl or any other programming language. This is actually a pretty difficult problem, given all of the formats that programming languages usually support, and here regular expressions are of limited use: a lexical analyzer is a better tool.
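The limits of this pattern are easy to see on a few inputs. The sketch below anchors the pattern with ^ and $ so that the whole string must match; the helper name is_simple_real and the sample values are assumptions. For comparison it also calls looks_like_number from the core Scalar::Util module, which applies Perl's own numeric rules.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# is_simple_real is a hypothetical helper: the pattern from the text,
# anchored so that the whole string has to match.
sub is_simple_real {
    my ($text) = @_;
    return $text =~ /^\d+\.\d*$/ ? 1 : 0;
}

for my $candidate ("22.33", "0.", ".5", "1e10", "-3.14") {
    printf "%-6s regex: %-3s  looks_like_number: %s\n",
        $candidate,
        is_simple_real($candidate)    ? "yes" : "no",
        looks_like_number($candidate) ? "yes" : "no";
}
```

The regex accepts only the first two candidates, which is the sense in which it covers just a subset of real numbers.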

    Now let's try to match words. The simplest regular expression that matches a single word is \w+. Here are a couple of examples:

    $target='hello world'; $target =~ m{(\w+)\s+(\w+)}; # regex for detecting two words separated by white space

    $target='A = b'; $target =~ /(\w+)(\s*)=(\s*)(\w+)/; # another way to ignore white space in matching

    Here are more examples of simple regular expressions that might be reused in other contexts:

    /t.t/		# t followed by any character followed by t
    	
    /^131/		# 131 at the beginning of a line
    /0$/		# 0 at the end of a line
    /\.txt$/	# .txt at the end of a line
    /^newfile\.\w*$/	# newfile. followed by zero or more word characters
    		# This will match newfile.txt, newfile.pl, newfile., etc.
    /^.*marker/     # head of the string up to and including the word "marker"
    /marker.*$/	# tail of the string starting from 'marker' till the end (up to newline)
    /^$/		# An empty line
    

    Now let's add complexity by introducing classes of characters.

    They can be sets or ranges and should be put inside square brackets. A - (minus) indicates "between" and a ^ after [ means "not":

    /[abcde]/		# Either a or b or c or d or e
    /[a-e]/			# same thing ("-" denote range here)
    /[a-z]/			# Anything from a to z inclusive
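    /[^a-z]/		# Any character that is not a lower case letter
    
    /[a-zA-Z]/ 		# Any letter
    /\w/	 		# Any letter, digit or underscore (a superset of the above)
    
    /[a-z]+/		# Any non-empty sequence of lower case letters
    /[01]/			# Either "0" or "1"
    /[^0-9a-zA-Z]/    	# Any character that is not a letter or digit (like \W, but also matches underscore)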

    If you need to match a word whose length is unknown, you probably should not use * or *? because a zero-length word makes no sense; use + instead.
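The difference shows up on the empty string. A minimal sketch, with hypothetical helper names and both patterns anchored so that the whole string is tested:

```perl
use strict;
use warnings;

# Hypothetical helpers used only in this sketch.
sub is_word_star { my ($text) = @_; return $text =~ /^\w*$/ ? 1 : 0; }   # zero or more
sub is_word_plus { my ($text) = @_; return $text =~ /^\w+$/ ? 1 : 0; }   # one or more

for my $text ("perl", "") {
    printf "%-6s  \\w*: %d  \\w+: %d\n", "'$text'",
        is_word_star($text), is_word_plus($text);
}
```

The empty string is accepted as a "word" by \w* but rejected by \w+.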

    Now let's introduce two so-called anchors, special characters that tell the regex engine that the match should start or end at a certain position of the string. The two most common anchors are ^ and $.

    For example to match the first word on the line we can use the following regex :

    /^\w+/;

    Several additional examples:

    /0/		# zero: "0"
    /0*/		# zero or more zeros		
    /0+/		# one or more zeros
    /0*0/		# same as above
    /\d/		# any digit but only one
    /\d+/           # any integer
    /\d+\.\d*/      # a subset of real numbers. Please note that 0. is a real number
    /\d+\.\d+\.\d+\.\d+/ # IP addresses
    		# (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
    /\d+\.\d+\.\d+\.255/ # IP addresses ending with 255
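One possible way to tighten the IP address pattern is sketched below. It is an illustration under stated assumptions, not a complete validator: the helper name is_ipv4 is hypothetical, and it uses the {1,3} quantifier (one to three repetitions), which has not been introduced in the text so far. The range check on each group is done in ordinary Perl code rather than in the regex.

```perl
use strict;
use warnings;

# is_ipv4 is a hypothetical helper used only in this sketch.
sub is_ipv4 {
    my ($text) = @_;
    # In list context the match returns the four captured groups.
    my @octets = $text =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return 0;
    # Reject any group above 255; the regex alone cannot express this easily.
    return (grep { $_ > 255 } @octets) ? 0 : 1;
}

for my $candidate ("131.0.0.100", "1000.1000.1000.1000", "256.1.1.1") {
    printf "%-20s %s\n", $candidate, is_ipv4($candidate) ? "looks valid" : "rejected";
}
```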

    At this point you can probably benefit from doing several exercises. Anyway, I would like to remind about several metacharacters for reference:

    \n		# A newline
    \t		# A tab
    \w		# Any alphanumeric (word) character.
    		# The same as [a-zA-Z0-9_]
    \W		# Any non-word character.
    		# The same as [^a-zA-Z0-9_]
    \d		# Any digit. The same as [0-9]
    \D		# Any non-digit. The same as [^0-9]
    \s		# Any whitespace character: space,
    		# tab, newline, etc
    \S		# Any non-whitespace character
    \b		# A word boundary, outside [] only
    \B		# No word boundary

    The characters $, |, [, ], {, }, (, ), \, /, ^, and several others have a special meaning in regular expressions and should be preceded by a backslash when you want to match them literally, for example:

    \|		# Vertical bar
    \[		# An open square bracket
    \)		# A closing parenthesis
    \*		# An asterisk
    \^		# A caret symbol
    \/		# A slash
    \\		# A backslash
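When the text to be matched literally sits in a variable, the backslashes do not have to be added by hand: Perl's built-in quotemeta function and the \Q...\E escape inside a pattern do it. A minimal sketch follows; the helper name contains_literal and the sample strings are assumptions.

```perl
use strict;
use warnings;

# contains_literal is a hypothetical helper used only in this sketch.
sub contains_literal {
    my ($text, $needle) = @_;
    return $text =~ /\Q$needle\E/ ? 1 : 0;   # \Q...\E escapes metacharacters in $needle
}

my $needle = "a.b";
print quotemeta($needle), "\n";                           # the escaped form of the string
print "unescaped: ", ("axb" =~ /$needle/ ? 1 : 0), "\n";  # . acts as a metacharacter here
print "escaped:   ", contains_literal("axb", $needle), "\n";
```

Without escaping, the dot in $needle matches the x in "axb"; with \Q...\E it matches only a real dot.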

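    A more complete example:

    use Socket;                              # provides the AF_INET constant (2 on most systems)
    $ip_addr =~ /\d+\.\d+\.\d+\.\d+/;        # find a dotted quad in $ip_addr
    @component = split(/\./, $&);            # $& holds the text of the last match
    $hex_addr = pack('C4', @component);      # four octets as a packed binary address
    ($name,$aliases,$type,$len,@addrs) = gethostbyaddr($hex_addr, AF_INET);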

    How to Create Complex Regex

    Complex patterns are constructed from simple regular expressions using metacharacters.

    Metacharacters are characters that have an additional meaning above and beyond their literal meaning. For example, the period character can have two meanings in a pattern. First, it can be used to match a period character in the searched string: this is its literal meaning. Second, it can be used to match any character in the searched string except for the newline character: this is its meta-meaning. Two kinds of components can be used to construct complex patterns:

    Metacharacters and anchors

    Metacharacters differ in their behavior. Some constructs can match zero characters of a particular class, but most require at least one such character. Examples that we already know are ., \d, \w and \s.

    Substrings matched by those metacharacters always have positive width. Or, to put it differently, the regular expression engine 'eats' characters in the process of matching.

    The second group does not eat any characters, which means that they do not require any character to be present. This subclass is usually called anchors. The most important anchors, which we have already seen, are ^, $, \b and \B.

    Anchors don't match a character, they match a condition. In other words, '^cat' will match a string with 'cat' at the beginning of it, but the ^ itself doesn't match any given character.

    For example, assuming that the string is in the $src variable, you can match a single word character like this:

    $src =~ m/\w/;

    You can use the + quantifier to say that the match should succeed only if the component is matched one or more times, for example:

    $src =~ m/\w+/;

    If the value of $src was "AAA BBB", then m/\w+/ would match the "AAA" in the string. If $src was empty, contained only whitespace, or contained only non-word characters, the match would fail and a false value would be returned.

    Metacharacters in Character Classes

    The character class [0123456789] or, shorter, [0-9] defines the class of decimal digits, and [0-9a-fA-F] defines the class of hexadecimal digits. You should use a dash to define a range of consecutive characters. Character classes let you match any of a range of characters. You can use variable interpolation inside the character class, but you must be careful when doing so. You can use metacharacters inside character classes but not as endpoints of a range. For example, you can do the following:

    $_ = "\tAAA";
    print "matched" if m/[\d\s]/;

    which will display

    matched

    because the value of $_ includes the tab character.

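    Most metacharacters that appear inside the square brackets that define a character class are used in their literal sense: characters such as ., *, + and ? lose their meta-meaning there. The exceptions are backslash sequences such as \d and \s (as in the example above), a ^ immediately after the opening bracket, and a - between two characters. This may be confusing, but that's how it is.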
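A short experiment makes the literal behavior visible. In this sketch the helper name has_quantifier_char is hypothetical; inside the brackets the characters . * + ? stand only for themselves.

```perl
use strict;
use warnings;

# has_quantifier_char is a hypothetical helper used only in this sketch.
# Inside [...] the characters . * + ? are literal, not metacharacters.
sub has_quantifier_char {
    my ($text) = @_;
    return $text =~ /[.*+?]/ ? 1 : 0;
}

for my $text ("abc", "a+b", "end.") {
    printf "%-6s %s\n", $text, has_quantifier_char($text) ? "match" : "no match";
}
```

If the dot kept its meta-meaning inside the class, "abc" would match as well; it does not.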

    Alternation

    Alternation is the way to tell Perl that you wish to match one of two or more patterns. In other words, the regular expression:

    /^foreach|^for|^while/

    tells Perl "look for a line beginning with the string 'foreach' or the string 'for' or the string 'while'."

    The ( | ) syntax splits a regular expression into sections, and each section is tried independently. Alternation always tries to match the first item in the parentheses. If it doesn't match, the second pattern is then tried, and so on.

    This is called leftmost matching, and misunderstanding of this property of regular expressions accounts for many bugs. Let's switch the order of the items, so the example becomes:

    $line = 'foreach $i (@n) { $sum+=$i;}';

    $line =~ /^for|^foreach/;

    In this case the string foreach will never be matched, as for will match before it. This is so common a mistake that I would recommend putting the longest string first in such cases.
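The effect of the ordering can be observed by capturing what the alternation actually matched. A minimal sketch with hypothetical helper names; the capturing parentheses are added only so that the matched keyword can be returned.

```perl
use strict;
use warnings;

# Hypothetical helpers used only in this sketch.
sub keyword_short_first {
    my ($line) = @_;
    return $line =~ /^(for|foreach)/ ? $1 : undef;   # 'for' is tried first
}

sub keyword_long_first {
    my ($line) = @_;
    return $line =~ /^(foreach|for)/ ? $1 : undef;   # longest alternative first
}

my $line = 'foreach $i (@n) { $sum+=$i;}';
print "short first: ", keyword_short_first($line), "\n";
print "long first:  ", keyword_long_first($line), "\n";
```

With the short alternative first, only the for prefix of foreach is matched.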

    A related technique, the ? quantifier (zero or one occurrence), is helpful when you don't know whether a word will be followed by a delimiter or an end-of-line character, or whether a word is plural, as in:

    for $line ('words', 'word') {

       $line =~ /word(s?)/;

    }

    A useful option for matching is i (ignore case). For example:

    $line =~ /word(s?)/i; # will match "word" or "words" independent of case

    Substitutions

    As well as identifying substrings that match regular expressions, Perl can make substitutions based on those matches. The way to do this is to use the s operator, which mimics the way substitution is done in the vi text editor. If the target string is omitted, the substitution is assumed to take place on the $_ variable.

    To replace an occurrence of regular expression h.*?o by string "Privyet" in the string $sentence we use the expression

    $sentence =~ s/h.*?o/Privyet/;
    

    and to do the same thing with the $_ variable, just write the right side of the previous statement:

    s/h.*?o/Privyet/;
    

    The regular expression is called the matching pattern and the second argument is called the substitution string. The result of a substitution operator is the number of substitutions made, so in this case it is either false (no substitution) or 1 (true).

    The result of a substitution operator is the number of substitutions made

    This example only replaces the first occurrence of the string, and it may be that there will be more than one such string we want to replace. To make a global substitution the last slash is followed by a g option as follows:

    s/h.*?o/Privyet/g

    Here the target is the $_ variable. The expression returns the number of substitutions made (a false value if none).

    We may also want to make replacements case insensitive with the option i (for "ignore case"). The expression

    s/h.*?o/Privyet/gi

    will force the regex engine to ignore case. Note that case will be ignored only in matching: the substitution string will be inserted exactly as you specified.

    Option i (ignore case) is very useful for both matching and substitution
    Option g (global) is very useful for substitution
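The combined effect of the two options, and the returned count, can be checked with a short script. This is a sketch only; the helper name replace_all_ignoring_case and the sample string are assumptions.

```perl
use strict;
use warnings;

# replace_all_ignoring_case is a hypothetical helper used only in this sketch.
sub replace_all_ignoring_case {
    my ($text) = @_;
    my $count = ($text =~ s/h.*?o/Privyet/gi);   # number of substitutions, or false
    return ($text, $count || 0);
}

my ($new_text, $count) = replace_all_ignoring_case("hello Hello");
print "$new_text ($count substitutions)\n";
```

Without the g option only the first hello would be replaced, and without i the capitalized Hello would be left alone.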

    The matching pattern will let you determine if $_ contains a word but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components inside parentheses. For example:

    m/(\w+)/;

    By doing this, you force Perl to store the matched string into the $1 variable. The $1 variable can be considered as pattern memory or a backreference. We will discuss backreferences in more detail later.
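A small sketch of capturing follows. The helper name first_word is hypothetical; the second half reuses the "A = b" pattern idea from the examples section and relies on the fact that a match in list context returns the captured groups.

```perl
use strict;
use warnings;

# first_word is a hypothetical helper used only in this sketch.
sub first_word {
    my ($text) = @_;
    return $text =~ /(\w+)/ ? $1 : undef;   # $1 is only meaningful after a successful match
}

print first_word("  Hello, world"), "\n";

# In list context the match returns the captured groups directly.
my ($left, $right) = "A = b" =~ /(\w+)\s*=\s*(\w+)/;
print "$left is assigned $right\n";
```

Testing the match before reading $1 matters, because after a failed match $1 still holds the value from an earlier successful one.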

    The substitution operator can be used to delete any substring. In this case the replacement string should be omitted. For example to remove the substring "Nick"  from the $_ variable, you could write: s/Nick//;

    There is an additional option that applies to the replacement string: the /e option, which causes the replacement to be evaluated as a Perl expression, with the result used as the replacement text. Because the replacement is then code, variables in it are evaluated even if single quotes are used as delimiters.
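A minimal sketch of the e option is shown below; the helper name double_numbers and the sample sentence are assumptions. The replacement part is the Perl expression $1 * 2, which is evaluated for every match because of the g option.

```perl
use strict;
use warnings;

# double_numbers is a hypothetical helper used only in this sketch.
sub double_numbers {
    my ($text) = @_;
    $text =~ s/(\d+)/$1 * 2/ge;   # e: evaluate the replacement as code; g: every match
    return $text;
}

print double_numbers("3 apples and 10 pears"), "\n";
```

Without the e option the same statement would insert the literal text of the replacement, with $1 interpolated, instead of the computed product.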

    As with the index function, you can use variables in both the matching pattern and the substitution string. For instance:

    $_ = <>; # let's assume that the line read is "Nick Bezroukov"
    $regex  = "Nick";
    $replacement_string = "Nickolas";
    $result = s/$regex/$replacement_string/;
     

    Here is a slightly more complex case of replacement:

    #alert udp $SITE_DHCP 63 -> any any (msg:"POLICY TFTP to DCHP segment"; classtype:attempted-admin; sid:235; rev:60803;)
    $new="classtype:$ARGV[0];";

    while(<>) {
       $line=$_;
       $line=~s[classtype\:.*?\;][$new];
       print $line;
    }

     

    In the Nick example above, the program changes the $_ variable by performing the replacement, and the $result variable will be equal to 1, the number of substitutions made. For plain strings a similar capability is available with the built-in function substr (which returns the substring that was replaced rather than a count):

           $result = substr($_, index($_, 'Nick'), length('Nick'), 'Nickolas');
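One caveat with the substr approach is that index returns -1 when the substring is absent, so the position should be checked before replacing. The following sketch wraps that check in a hypothetical helper, replace_first_literal, whose name and argument order are assumptions.

```perl
use strict;
use warnings;

# replace_first_literal is a hypothetical helper used only in this sketch.
sub replace_first_literal {
    my ($text, $old, $new) = @_;
    my $pos = index($text, $old);                # -1 if $old does not occur in $text
    return $text if $pos < 0;                    # nothing to replace
    substr($text, $pos, length($old), $new);     # four-argument substr replaces in place
    return $text;
}

print replace_first_literal("Nick Bezroukov", "Nick", "Nickolas"), "\n";
print replace_first_literal("Somebody else",  "Nick", "Nickolas"), "\n";
```

Unlike s///, this treats both strings literally, so no metacharacters need escaping.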

    Comments

    Perl has several extensions of regex syntax that can be used only if you add a special suffix. By far the most useful feature of extended mode, in my opinion, is the ability to add comments directly inside your patterns. This is achieved by using the option x after the pattern. For example, would you rather see a pattern that looks like this:

    # Match an IP address like 131.0.0.100
    m/^\s*(\d+)\.(\d+)\.(\d+)\.(\d+)$/;

    or one that looks like this:

    m/  ^      (?# Anchor to the beginning of the string)
        \s*    (?# Skip over whitespace characters, if any)
        (\d+)  (?# Match the first group of digits)
        \.     (?# Match the first dot)
        (\d+)  (?# Match the second group of digits)
        \.     (?# Match the second dot)
        (\d+)  (?# Match the third group of digits)
        \.     (?# Match the third dot)
        (\d+)  (?# Match the fourth group of digits)
        $      (?# Anchor to the end of the string)
    /x;
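With the x option, ordinary # comments that run to the end of the line are also allowed inside the pattern, which many people find easier to read than the (?# ...) form. The sketch below is one possible variant; the helper name ip_octets is hypothetical, and the optional trailing whitespace is an addition that is not in the pattern above.

```perl
use strict;
use warnings;

# ip_octets is a hypothetical helper used only in this sketch.
# In list context it returns the four captured groups, or an empty list.
sub ip_octets {
    my ($text) = @_;
    return $text =~ m/
        ^ \s*       # start of string, optional leading whitespace
        (\d+) \.    # first group of digits and a dot
        (\d+) \.    # second group of digits and a dot
        (\d+) \.    # third group of digits and a dot
        (\d+)       # fourth group of digits
        \s* $       # optional trailing whitespace, end of string
    /x;
}

my @octets = ip_octets("  131.0.0.100");
print join(" ", @octets), "\n";
```

The pattern delimiter must not appear inside such a comment, because Perl finds the end of the pattern before it looks at the comments.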




    Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


    Last modified: March 12, 2019