Softpanorama
May the source be with you, but remember the KISS principle ;-)
Regular expressions are a mini programming language for parsing text strings. Perl was the first scripting language to popularize the power of regular expressions, which before Perl were largely limited to the Unix world.
In addition to powerful search capabilities, Perl adds the capability to replace the parts of the string that were matched. A good command of regular expressions is a valuable skill beyond Perl. You can use them in editors (for example vim – never use vi, vim is much better), and they are also shared by many other Unix utilities (such as egrep). Some Unix utilities, like GNU grep, have a Perl-compatible mode for regular expressions (option -P in GNU grep).
Regex is one of the most useful features of Perl (but, contrary to the common advocacy line, definitely not the most useful feature). Thanks to myriad enhancements, regex has become a pretty powerful (and pretty obscure) non-procedural notation for parsing strings. Regular expressions are more flexible than the string functions that we already studied, and they complement the procedural facilities that we already discussed. Generally, everything that is achievable via regular expressions can be programmed using string functions, but in certain cases a regular expression provides a much more compact solution.
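To make the comparison concrete, here is a minimal sketch that extracts a file extension twice: once with the string functions rindex and substr, and once with a regular expression. The file name and the variable names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

my $file = 'report.final.txt';   # hypothetical sample input

# String functions: find the last dot, then take everything after it
my $pos = rindex($file, '.');
my $ext_by_functions = $pos >= 0 ? substr($file, $pos + 1) : '';

# Regular expression: a literal dot followed by non-dot characters at the end
my ($ext_by_regex) = $file =~ /\.([^.]+)$/;
$ext_by_regex = '' unless defined $ext_by_regex;

print "functions: $ext_by_functions, regex: $ext_by_regex\n";
```

Both approaches give the same answer here; which one reads better depends on the task and on how comfortable the reader is with regex notation.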
Regular expressions look like random line noise to the Perl uninitiated. The number of arbitrary symbols that represent features seems to be infinite. There is no knowledge base to draw upon that would tell a newcomer that $ means "from the end" and that ^ means "from the beginning." But, on the positive side, if you use Unix or Cygwin utilities in Windows, this knowledge is (with some minor exceptions) transferable to other utilities, including grep, find, vi, awk and sed.
The origin of this non-procedural (functional) notation is Unix editors, shells and utilities, so for Unix users regular expressions are quite a natural extension of the existing procedural string manipulation facilities that we already discussed. Everybody else is in a much less fortunate position. If you have never used Unix, then the closest relatives of regular expressions are the so-called masks in DOS/Windows (*.*, *.tx?, etc.) and formats in Fortran, PL/1 and C. For example, the mask *.*, which in DOS and Windows denotes all files in the current directory, is actually a primitive regular expression. All decent text editors (and the best HTML editors) support searching with regular expressions too. As we already mentioned, regular expression functionality is traditionally a strong point of Unix command line shells, and many Unix utilities accept regular expressions.
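As a small illustration of the two anchors just mentioned, the following sketch filters a list of strings with ^ and $. The sample strings and variable names are made up for the example.

```perl
use strict;
use warnings;

my @lines = ('perl rocks', 'learning perl', 'perl');   # hypothetical sample data

# ^ anchors the pattern at the beginning of the string
my @starts = grep { /^perl/ } @lines;

# $ anchors the pattern at the end of the string
my @ends = grep { /perl$/ } @lines;

# Both anchors together: the whole string must be exactly 'perl'
my @exact = grep { /^perl$/ } @lines;

print "starts: @starts\n";
print "ends:   @ends\n";
print "exact:  @exact\n";
```

The same anchors behave the same way in grep, sed and awk, which is the transferability mentioned above.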
It is important to understand that regular expressions in Perl are a language within the language: as soon as you are inside a regular expression, normal Perl rules no longer apply. You should forget about Perl lexical and syntactic rules inside a regular expression; it's a different animal.
What we see in Perl 5 is not the result of design; it is the result of a simple mechanism evolving, with problems solved as they were detected and usefulness gradually improved. Now the mechanism is not that simple, and as such it presents difficulties for newcomers. That's why we introduce regular expressions late and will proceed slowly. The flip side of regular expressions is that they can notoriously misbehave if you don't have enough experience with them. So it's very important to practice, starting with simple regular expressions and gradually mastering the general principles involved in their construction.
Perl has two basic regular expression operators: m (for matching) and s (for substitution). Both are applicable only to scalars (strings).
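A minimal sketch of both operators applied to a scalar follows. The sample string and variable names are hypothetical; the binding operator =~ connects the scalar to the pattern.

```perl
use strict;
use warnings;

my $text = 'The quick brown fox';   # hypothetical sample string

# m// tests whether the pattern matches the scalar on the left of =~
my $found = ($text =~ m/quick/) ? 1 : 0;

# In list context, a match returns the text captured by the parentheses
my ($first_word) = $text =~ m/^(\w+)/;

# s/// replaces the matched part; here we work on a copy of the original
(my $changed = $text) =~ s/brown/red/;

print "found=$found first=$first_word changed=$changed\n";
```

Copying the string before substituting keeps $text intact; applying s/// directly to $text would modify it in place.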
In Perl 5 regex has an extended, more readable form (enabled by the x modifier), which helps when debugging complex regexes. It should be used as the only notation for all more or less complex regexes.
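The sketch below shows the same pattern written compactly and then in the readable form, assuming the readable form meant here is the one enabled by the x modifier, which lets whitespace and comments appear inside the pattern. The date string and variable names are hypothetical.

```perl
use strict;
use warnings;

my $date = '2009-09-07';   # hypothetical sample input

# Compact form: correct, but hard to scan once patterns grow
my $compact_ok = ($date =~ /^(\d{4})-(\d{2})-(\d{2})$/) ? 1 : 0;

# Same pattern with the x modifier: whitespace is ignored, comments allowed
my ($year, $month, $day) = $date =~ /
    ^
    (\d{4})  -    # year: four digits, then a literal hyphen
    (\d{2})  -    # month: two digits, then a literal hyphen
    (\d{2})       # day: two digits
$/x;

print "year=$year month=$month day=$day\n";
```

One caveat with this form: a literal space or # inside the pattern must be escaped, and the comments must not contain the pattern delimiter.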
There is a good book, Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly. Paradoxically, Friedl's book is also a good example of what not to do with regular expressions, and it convincingly demonstrates that attempts to replace lex and yacc with regular expressions are doomed to fail.
You need to know where to stop, and this is as important as the knowledge of regular expressions itself. Understanding the limits of applicability of regular expressions is as important as understanding their power.
Unfortunately, many Perl gurus are addicted to complexity and set a bad example. Some of Friedl's examples (the double word problem, matching comments in C, and several others) are perfect examples of what not to do with regular expressions.
For example, in the case of double words, converting the text into an array of words with a pipe and then checking the stream for two identical adjacent words is a better and much cleaner solution.
In the case of comments, one should try flex or procedural string functions. Actually, in both cases a regex-based solution can be more complex than a solution using string functions.
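The double-word check described above can be sketched in a few lines. This version splits a string inside Perl rather than using an external pipe, which is an assumption made to keep the example self-contained; the sample text and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample text containing two doubled words
my $text = 'This is is a test of the the double word check';

# Break the text into an array of words on whitespace
my @words = split ' ', $text;

# Walk the array and compare each word with the one before it
my @doubles;
for my $i (1 .. $#words) {
    push @doubles, $words[$i] if lc $words[$i] eq lc $words[$i - 1];
}

print "repeated: @doubles\n";
```

The comparison is case-insensitive via lc, so "The the" would also be reported; punctuation handling is left out of this sketch.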
Overview of literature
There are multiple tutorials on the web devoted to regular expressions. Most of them suffer from overcomplexity. Very few aim to provide a simple introduction that covers, say, 80% of the functionality and leaves the remaining 20% aside.
Among those that I can recommend:
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: September 07, 2009