|
Softpanorama |
May the source be with you, but remember the KISS principle ;-)
Softpanorama Search
|
Articles
Perl.com Apocalypse 5 Interesting discussion of shortcomings of the Regex culture
Regex culture has gone wrong in a variety of ways, but it's not my intent to assign blame--there's plenty of blame to go around, and plenty of things that have gone wrong that are nobody's fault in particular. For example, it's nobody's fault that you can't realistically complement a character set anymore. It's just an accident of the way Unicode defines combining characters. The whole notion of character classes is mutating, and that will have some bearing on the future of regular expression syntax.
Given all this, I need to warn you that this Apocalypse is going to be somewhat radical. We'll be proposing changes to certain "sacred" features of regex culture, and this is guaranteed to result in future shock for some of our more conservative citizens. Do not be alarmed. We will provide ways for you to continue programming in old-fashioned regular expressions if you desire. But I hope that once you've thought about it a little and worked through some examples, you'll like most of the changes we're proposing here.
So although the RFCs did contribute greatly to my thinking for this Apocalypse, I'm going to present my own vision first for where regex culture should go, and then analyze the RFCs with respect to that vision.
First, let me enumerate some of the things that are wrong with current regex culture.
I'm sure there are other problems, but that'll do for starters. Let's look at each of these in more detail.
Most of the other problems stem from trying to deal with a rich history. Now there's nothing wrong with history per se, but those of us who are doomed to repeat it find that many parts of history are suboptimal and contradictory. Perl has always tried to err on the side of incorporating as much history as possible, and sometimes Perl has succeeded in that endeavor.
Cultural continuity has much to be said for it, but what can you do when the culture you're trying to be continuous with is itself discontinuous? As it says in Ecclesiastes, there's a time to build up, and a time to tear down. The first five versions of Perl mostly built up without tearing down, so now we're trying to redress that omission.
Regular expressions were invented by computational linguists
who love to write examples like /aa*b*(cd)*ee/. While these are
conducive to reasoning about pattern matching in the abstract, they aren't so
good for pattern matching in the concrete. In real life, most atoms are longer
than "a" or "b". In real life, tokens are more
recognizable if they are separated by whitespace. In the abstract, /a+/
is reducible to /aa*/. In real life, nobody wants to repeat a 15
character token merely to satisfy somebody's idea of theoretical purity. So we
have shortcuts like the + quantifier to say "one or more".
Now, you may rightly point out that + is
something we already have, and we already introduced /x to allow
whitespace, so why is this bullet point here? Well, there's a lot of inertia in
culture, and the problem with /x is that it's not the default, so
people don't think to turn it on when it would probably do a lot of good. The
culture is biased in the wrong direction. Whitespace around tokens should be the
norm, not the exception. It should be acceptable to use whitespace to separate
tokens that could be confused. It should not be considered acceptable to define
new constructs that contain a plethora of punctuation, but we've become
accustomed to constructs like (?<=...) and (??{...})
and [\r\n\ck\p{Zl}\p{Zp}], so we don't complain. We're frogs who
are getting boiled in a pot full of single-character morphemes, and we don't
notice.
Huffman invented a method of data compaction in which common characters are represented by a small number of bits, and rarer characters are represented by more bits. The principle is more general, however, and language designers would do well to pay attention to the "other" Perl slogan: Easy things should be easy, and hard things should be possible. However, we haven't always taken our own advice. Consider those two regex constructs we just saw:
(?<=...)
(??{...})
Which one do you think is likely to be the most common in everyday use? Guess which one is longer...
There are many examples of poor Huffman coding in current regexes. Consider these:
(...)
(?:...)
Is it really the case that grouping is rarer than capturing? And by two gobbledygooky character's worth? Likewise there are many constructs that are the same length that shouldn't be:
(?:...)
(?#...)
Grouping is much more important than the ability to embed a comment. Yet they're the same length currently.
A lot of our Huffman troubles came about because we were
trying to shoehorn new capabilities into an old syntax without breaking
anything. The (?...) construct succeeded at that goal, but it was
new wine in old wineskins, as they say. More successful was the *?
minimal matching hack, but it's still symptomatic of the problem that we only
had three characters to choose from that would have worked at that point in the
grammar. We've pretty nearly exhausted the available backslash sequences.
The waterbed theory of linguistic complexity says that if you push down one place, it goes up somewhere else. If you arbitrarily limit yourself to too few metacharacters, the complexity comes out somewhere else. So it seems obvious to me that the way out of this mess is to grab a few more metacharacters. And the metacharacters I want to grab are...well, we'll see in a moment.
Consider these constructs:
(??{...})
(?{...})
(?#...)
(?:...)
(?i:...)
(?=...)
(?!...)
(?<=...)
(?<!...)
(?>...)
(?(...)...|...)
These all look quite similar, but some of them do radically
different things. In particular, the (?<...) does not mean the
opposite of the (?>...). The underlying visual problem is the
overuse of parentheses, as in Lisp. Programs are more readable if different
things look different.
In linguistics, the notion of end-weight is the idea that people tend to prefer sentences where the short things come first and the long things come last. That minimizes the amount of stuff you have to remember while you're reading or listening. Perl violates this with regex modifiers. It's okay when you say something short like this:
s/foo/bar/g
But when you say something like we find in RFC 360:
while ($text =~ /name:\s*(.*?)\n\s*
children:\s*(?:(?@\S+)[, ]*)*\n\s*
favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}
it's not until you read the /sigx at the end that
you know how to read the regex. This actually causes problems for the Perl 5
parser, which has to defer parsing the regular expression till it sees the
/x, because that changes how whitespace and comments work.
The /s modifier in the previous example changes
the meaning of the . metacharacter. We could, in fact, do away with
the /s modifier entirely if we only had two different
representations for "any character", one of which matched a newline, and one
which didn't. A similar argument applies to the /m modifier. The
whole notion of something outside the regex changing the meaning of the regex is
just a bit bogus, not because we're afraid of context sensitivity, but because
we need to have better control within the regex of what we mean, and in this
case the context supplied outside the regex is not precise enough. (Perl 5 has a
way to control the inner contexts, but it uses the self-obfuscating (?...)
notation.)
Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.
Without knowing the context, you cannot know what the pattern
// will do. It might match a null string, or it might match the
previously successful match.
The local operator behaves differently inside
regular expressions than it does outside.
It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:
/
| foo
| bar
| baz
/x
Even when it's intentional, it may not look intentional:
(a|b|c|)
That's hard to read because it's difficult to make the absence of something visible.
It's too easy to confuse the multiple meanings of dot. Or the
multiple meanings of ^, and $. And the opposite of
\A is frequently not \Z, but \z. Tell me
again, when do I say \1, and when do I say $1? Why are
they different?
Speaking of \1, backreferences have a number of
shortcomings. The first is actually getting ahold of the right backreference.
Since captures are numbered from the beginning, you have to count, and you can
easily count wrong. For many purposes it would be better if you could ask for
the last capture, or the one before that. Or perhaps if there were a way to
restart the numbering part way through...
Another major problem with backreferences is that you can't
easily modify one to search for a variant. Suppose you match an opening
parenthesis, bracket, or curly. You'll like to search for everything up to the
corresponding closing parenthesis, bracket, or curly, but there's no way to
transmogrify the opening version to the closing version, because the backref
search is hardwired independently of ordinary variable matching. And that's
because Perl doesn't instantiate $1 soon enough. And that's because
Perl relies on variable interpolation to get subexpressions into regexes. Which
leads us to...
Since regexes undergo an interpolation pass before they're
compiled, anything you interpolate is forced to be treated as a regular
expression. Often that's not what you want, so we have the klunky
\Q$string\E mechanism to hide regex metacharacters. And that's because...
The problem with \Q$string\E arises because of
the fundamental mistake of using interpolation to build regexes instead of
letting the regex control how it treats the variables it references. Regexes
aren't strings, they're programs. Or, rather, they're strings only in the sense
that any piece of program is a string. Just as you have to work to eval a string
as a program, you should have to work to eval a string as a regular expression.
Most people tend to expect a variable in a regular expression to match its
contents literally. Perl violates that expectation. And because it violates that
expectation, we can't make $1 synonymous with \1. And
interpolated parentheses throw off the capture count, so you can't easily use
interpolation to call subrules, so we invented (??{$var}) to get
around that. But then you can't actually get at the parentheses captured by the
subrule. The ramifications go on and on.
Historically, regular expressions were considered a very
low-level language, a kind of glorified assembly language for the regex engine.
When you're only dealing with ASCII, there is little need for abstraction, since
the shortest way to say [a-z] is just that. With the advent of the
eighth bit, we started getting into a little bit of trouble, and POSIX started
thinking about names like [:alpha:] to deal with locale
difficulties. But as with the problem of conciseness, the culture was still
biased away from naming abstractly anything that could be expressed concretely.
However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.
It's difficult to deal with any subset of Unicode without
naming it. These days, if you see [a-z] in a program, it's probably
an outright bug. It's much better to use a named character property so that your
program will work right in areas that don't just use ASCII.
Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:
(??{$rule})
That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
Make that "no support" in Perl, unless you include assignment
to a list. This is just a part of the bias against naming things. Instead we are
forced to number our capturing parens and count. That works okay for the
top-level regular expression, when we can do list assignment or assign $1
to $foo. But it breaks down as soon as you start trying to use
nested regexes. It also breaks down when the capturing parentheses match more
than once. Perl handles this currently by returning only the last match. This is
slightly better than useless, but not by much.
For many of the reasons we've mentioned, it's difficult to make regexes refer to each other, and even if you do, it's almost impossible to get the nested information back out of them. And there are entire classes of parsing problems that are not solvable without recursive definitions.
Even if it were easier for regexes to refer to other regexes, we'd still have the problem that those other regexes aren't organized in any meaningful way. They might be off in variables that come and go at the whim of the surrounding context.
When we have an organized system of parsing rules, we call it a grammar. One advantage of having a grammar is that you can optimize based on the assumption that the rules maintain their relationship to each other. For instance, if you think of grammar rules as a funny kind of subroutine, you can write an optimizer to inline some of the subrules--but only if you know the subrule is fixed in the grammar.
Without support for grammar classes, there's no decent way to think of deriving one grammar from another. And if you can't derive one grammar from another, you can't easily evolve your language to handle new kinds of problems.
If we want to have variant grammars for Perl dialects, then what about regex dialects? Can regexes be extended either at compile time or at run time? Perl 5 has some rudimentary overloading magic for rewriting regex strings, but that's got the same problems as source filters for Perl code; namely that you just get the raw regex source text and have to parse it yourself. Once again the fundamental assumption is that a regex is a funny kind of string, existing only at the behest of the surrounding program.
Do we think of regexes as a real, living language?
Let's face it, in the culture of computing, regex languages are mostly considered second-class citizens, or worse. "Real" languages like C and C++ will exploit regexes, but only through a strict policy of apartheid. Regular expressions are our servants or slaves; we tell them what to do, they go and do it, and then they come back to say whether they succeeded or not.
At the other extreme, we have languages like Prolog or Snobol where the pattern matching is built into the very control structure of the language. These languages don't succeed in the long run because thinking about that kind of control structure is rather difficult in actual fact, and one gets tired of doing it constantly. The path to freedom is not to make everyone a slave.
However, I would like to think that there is some happy medium between those two extremes. Coming from a C background, Perl has historically treated regexes as servants. True, Perl has treated them as trusted servants, letting them move about in Perl society better than any other C-like language to date. Nevertheless, if we emancipate regexes to serve as co-equal control structures, and if we can rid ourselves of the regexist attitudes that many of us secretly harbor, we'll have a much more productive society than we currently do. We need to empower regexes with a sense of control (structure). It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.
Perl 5 started to give regexes more control of their own
destiny with the "grab" construct, (?>...), which tells the regex
engine that when it fails to match the rest of the pattern, it should not
backtrack into the innards of the grab, but skip back to before it. That's a
useful notion, but there are problems. First, the notation sucks, but you knew
that already. Second, it doesn't go far enough. There's no way to backtrack out
of just the current grouping. There's no way to backtrack out of just the
current rule. Both of these are crucial for giving first-class status to the
control flow of regexes.
Notionally, a regex is an organization of assertions that either succeed or fail. Some assertions are easily expressed in traditional regex language, while others are more easily expressed in a procedural language like Perl.
The natural (but wrong) solution is to try to reinvent Perl expressions within regex language. So, for instance, I'm rejecting those RFCs that propose special assertion syntax for numerics or booleans. The better solution is to make it easier to embed Perl assertions within regexes.
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Disclaimer:
Last updated: August 15, 2009