Softpanorama |
May the source be with you, but remember the KISS principle ;-)
Regular expressions are string sequences formed from letters, numbers, and a set of special operators. Regular expressions include simple strings, but they also allow you to search for a pattern that is more than a simple string match. awk accepts the same regular expressions as the egrep command, discussed in Chapter 14.
Table 14-1 shows the special symbols that you can use to form regular expressions.
To use a regular expression as a string-matching pattern, you enclose it in slashes.
To illustrate how you can use regular expressions, consider a file containing the inventory of items in a stationery store. The file inventory includes a one-line record for each item. Each record contains the item name, how many are on hand, how much each costs, and how much each sells for:
pencils 108 .11 .15
markers 50 .45 .75
memos 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50
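If you want to search for the price of markers, but you cannot remember whether you called them "marker" or "markers," you could use the regular expression
/markers?/
as the pattern. The ? makes the preceding s optional, so the pattern matches either spelling. (Because a pattern matches anywhere in the line, /marker/ alone would also find both. Note that /marker*/ is looser than it looks: the * applies only to the final r, so it matches any line containing "marke".)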
To find out how many books you have on hand, you could use the pattern
/^books/
to find entries that contain "books" only at the beginning of a line. This would match the record for books, but not the one for notebooks.
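The difference between an anchored and an unanchored pattern can be seen by running both against the inventory records shown earlier. This is a minimal sketch; the shell variable names (inventory, unanchored, anchored) are chosen here for illustration and are not part of the original text.

```shell
# Sample data: the inventory records from the text.
inventory='pencils 108 .11 .15
markers 50 .45 .75
memos 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50'

# Unanchored: "books" may appear anywhere in the line, so notebooks matches too.
unanchored=$(printf '%s\n' "$inventory" | awk '/books/')

# Anchored with ^: only lines that begin with "books" match.
anchored=$(printf '%s\n' "$inventory" | awk '/^books/')

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"
```

With no action given, awk prints every matching line, so the first command prints two records and the second prints only the record for books.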
However, suppose you want to find all the items that sell for 75 cents. You want to match .75, but only when it is in the fourth field (selling price). Then you need more than a string match using regular expressions. You need to make a comparison between the content of a particular field and a pattern. The next section discusses the comparison operators that make this possible.
The preceding section dealt with string matches in which a match occurs when the target string occurs anywhere in a line. Sometimes, though, you want to compare a string or pattern with another string, for example a particular field or a variable. You can compare two strings in various ways, including whether one contains the other, whether they are identical, or whether one precedes the other in alphabetical order.
You use the tilde (~) operator to test whether a string matches a pattern. For example,
$2 ~ /^15/
checks whether field 2 begins with 15. This pattern matches if field 2 begins with 15 regardless of what the rest of the field may contain. It is a test for matching, not identity. If you wish to test whether field 2 contains precisely the string 15 and nothing else, you could use
$2 ~ /^15$/
You can test for nonmatching strings with !~. This is similar to ~, but it matches if the string on the left does not match the pattern on the right.
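The following sketch applies ~ and !~ to individual fields of the inventory file, including the earlier question of which items sell for 75 cents. The "folders" record is a hypothetical addition, included only so that /^15/ and /^15$/ give different results; the shell variable names are likewise illustrative.

```shell
# Inventory records from the text, plus one hypothetical record (folders)
# whose quantity begins with 15 but is not exactly 15.
inventory='pencils 108 .11 .15
markers 50 .45 .75
memos 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50
folders 150 .20 .35'

# Field 2 begins with 15: matches both 15 and 150.
starts_with_15=$(printf '%s\n' "$inventory" | awk '$2 ~ /^15/ { print $1 }')

# Field 2 is exactly 15: anchored at both ends.
exactly_15=$(printf '%s\n' "$inventory" | awk '$2 ~ /^15$/ { print $1 }')

# Field 4 (selling price) is exactly .75; the dot is escaped so it is literal.
sells_for_75=$(printf '%s\n' "$inventory" | awk '$4 ~ /^\.75$/ { print $1 }')

# Field 2 does NOT begin with 15.
not_15=$(printf '%s\n' "$inventory" | awk '$2 !~ /^15/ { print $1 }')

printf '%s\n' "$starts_with_15" "$exactly_15" "$sells_for_75" "$not_15"
```

Restricting the match to $4 is what keeps notebooks (which cost .75 but sell for 1.00) out of the 75-cent list.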
You can use the == operator to check whether two strings are identical, rather than whether one contains the other. For example,
$1==$3
checks to see whether the value of field 1 is equal to the value of field 3.
Do not confuse == with =. The former (==) tests whether two strings are identical. The single equal sign (=) assigns a value to a variable. For example,
$1=15
sets the value of field 1 equal to 15. It would be used as part of an action statement. On the other hand,
$1==15
compares the value of field 1 to the number 15. It could be a pattern statement.
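A short sketch of the difference, using field 2 of two inventory records rather than field 1 so that the values are numeric. The shell variable names are illustrative. The last command shows what can happen when = is typed where == was intended: the assignment itself becomes the pattern, its value (15) is nonzero and therefore true, and every record is both modified and printed.

```shell
records='notebooks 15 .75 1.00
books 10 1.00 1.50'

# == in a pattern: select records whose second field equals 15.
compared=$(printf '%s\n' "$records" | awk '$2 == 15 { print $1 }')

# = in an action: overwrite field 2 of every record, then print the record.
assigned=$(printf '%s\n' "$records" | awk '{ $2 = 15; print }')

# = used by mistake as a pattern: assigns 15, which is true for every record.
mistaken=$(printf '%s\n' "$records" | awk '$2 = 15')

printf '%s\n' "$compared" "$assigned" "$mistaken"
```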
The != operator tests whether the values of two expressions are not equal. For example,
$1 != "pencils"
is a pattern that matches any line where the first field is not "pencils."
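As a brief illustration, this pattern can be run against the inventory records from earlier in the section. The shell variable names are assumptions made for the example.

```shell
inventory='pencils 108 .11 .15
markers 50 .45 .75
memos 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50'

# Print the name of every item except pencils.
not_pencils=$(printf '%s\n' "$inventory" | awk '$1 != "pencils" { print $1 }')

printf '%s\n' "$not_pencils"
```

Unlike !~, the != operator compares the whole field with the whole string, so a hypothetical item named "pencils2" would still be printed.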
COMPARING THE ORDER OF TWO STRINGS
You can compare two strings according to their alphabetical order using the standard comparison operators, <, >, <=, and >=. The strings are compared character by character, according to standard ASCII alphabetical order, so that:
"regular" < "relational"
Remember that in the ASCII character code, all uppercase letters precede all lowercase letters.
You can use string comparison patterns in a program to put names in alphabetical order, or to match any record with a last name past a certain name. For example, the following matches lines in which the second field follows "Johnson" in alphabetical order:
$2 > "Johnson"
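The sketch below checks these ordering rules directly. The list of names is hypothetical, and LC_ALL=C is set as a precaution, on the assumption that some awk implementations order strings according to the current locale rather than plain ASCII.

```shell
# Hypothetical first-name/last-name records.
names='Mary Smith
Al Jones
Bo Adams
Cy Johnson'

# Each comparison prints 1 when true. Parentheses keep < from being
# misread inside the print statement.
ordering=$(LC_ALL=C awk 'BEGIN {
    print ("regular" < "relational")   # g precedes l at the third character
    print ("Zebra" < "apple")          # uppercase precedes lowercase in ASCII
}')

# Last names that follow "Johnson" in ASCII order.
after_johnson=$(printf '%s\n' "$names" | LC_ALL=C awk '$2 > "Johnson" { print $2 }')

printf '%s\n' "$ordering" "$after_johnson"
```

"Jones" follows "Johnson" because the two names first differ at the third character, where n comes after h. "Johnson" itself is not printed, since > is a strict comparison.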
Table 14-2 shows further examples of comparison patterns.
COMPOUND PATTERNS
Compound patterns are combinations of patterns, joined with the logical operators && (and), || (or), and ! (not). These can be very useful when searching for a complex pattern in a database or in a program.
For example, here is a small but useful program that works on a text file formatted for troff. A common mistake in troff files is to forget to close a display begun with .DS (Display Start) with a matching .DE (Display End). This program tests whether you have a .DE for each .DS:
/^\.DS/ && display==1 {print "Missing DE before line " NR}
/^\.DS/ && display==0 {display=1}
/^\.DE/ && display==0 {print "Extra DE at line " NR}
/^\.DE/ && display==1 {display=0; dcount++}
END {print "Found " dcount " matched displays"}
In the preceding program, "display" is a flag that is set when awk reads a .DS and is cleared when it finds the next .DE. Because awk treats every variable as zero (or the empty string) until it is assigned, display is initially equal to 0; it is set to 1 when a line beginning with .DS is encountered. The ^ anchor keeps each pattern from matching a .DS or .DE that merely appears somewhere inside a line of text. If a line begins with .DS when the flag is already set, then you have found a missing .DE. The "dcount" variable is a counter that tells you how many matched displays are in the file. The END pattern is a special pattern that is matched after the last line is read. (This is discussed in the "BEGIN and END" section later in this chapter.) You could easily add to this program to test for .TS/.TE and other text formatting pairs.
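To see the checker at work, the sketch below runs a version of it on a small, made-up troff fragment containing one unclosed display and one stray .DE. The sample text and the names troff_sample, report, and matched are assumptions for this example; the patterns are anchored with ^ so that only lines beginning with a request are tested.

```shell
# Hypothetical troff input: the second .DS is never closed, and the
# final .DE has no matching .DS.
troff_sample='.DS
first display
.DE
.DS
second display, never closed
.DS
third display
.DE
.DE'

report=$(printf '%s\n' "$troff_sample" | awk '
/^\.DS/ && display == 1 { print "Missing DE before line " NR }
/^\.DS/ && display == 0 { display = 1 }
/^\.DE/ && display == 0 { print "Extra DE at line " NR }
/^\.DE/ && display == 1 { display = 0; matched++ }
END { print "Found " matched " matched displays" }
')

printf '%s\n' "$report"
# Prints:
#   Missing DE before line 6
#   Extra DE at line 9
#   Found 2 matched displays
```

The order of the rules matters: each pair tests the flag before the rule that changes it, so a single input line never triggers both rules of a pair.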
Range Patterns
You have seen how to make comparisons between strings, how to search for complex strings using regular expressions, and how to create compound patterns using logical operators. awk provides another way to specify a pattern that can be particularly powerful: the range pattern. The syntax for a range pattern is
pattern1, pattern2
This will match any line after a match to the first pattern and before a match to the second pattern, including the starting and ending lines. In other words, from the line where the first pattern is found, every line will match through the line where the second pattern is found.
If you have a database file in which at least one of the fields is arranged in order, a range pattern is a very easy way to pull out part of the database. For example, if you have a list of customers sorted by customer number, a range pattern can select all the entries between two customer numbers. The following pattern matches all lines from the line beginning with 200 through the line beginning with 299; the ^ anchors are needed so that a 200 or 299 elsewhere in a line does not start or end the range:
/^200/,/^299/
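A sketch of a range pattern on a hypothetical customer list sorted by customer number (the names and the trailing year column are invented for the example). It also runs the pattern without anchors, which begins the range too early because "200" appears inside the year 2005 on an earlier line.

```shell
# Hypothetical customer records: number, name, year joined.
customers='100 Adams 1998
120 Young 2005
200 Baker 2011
250 Clark 2014
299 Davis 2019
300 Evans 2020'

# Anchored: the range runs from the line beginning with 200
# through the line beginning with 299, inclusive.
anchored=$(printf '%s\n' "$customers" | awk '/^200/,/^299/ { print $1 }')

# Unanchored: "200" inside "2005" opens the range at customer 120.
unanchored=$(printf '%s\n' "$customers" | awk '/200/,/299/ { print $1 }')

printf 'anchored: %s\n' "$(printf '%s' "$anchored" | tr '\n' ' ')"
printf 'unanchored: %s\n' "$(printf '%s' "$unanchored" | tr '\n' ' ')"
```

If the first pattern matches again after the range has closed, a new range begins, so the approach is most predictable on sorted data.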
Numeric Patterns
All of the string-matching patterns in the previous section also work for numeric patterns, except for regular expressions and the string-matching tilde operator. You don't have to specify whether you are dealing with strings or numeric patterns; awk uses the context to decide which is appropriate. Probably the most commonly used numeric patterns are the comparisons, especially those comparing the value of a field to a number or to another field. You use the comparison operators to do this.
Compound patterns, formed with the operators && (and), || (or), and ! (not), are useful for numeric variables as well as string variables. This is an example of a compound pattern:
$1 < 10 && $2 <= 30
This matches if field 1 is less than 10 and field 2 is less than or equal to 30.
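The same kind of compound numeric pattern can be tried on the inventory file. Because field 1 there holds names, this sketch compares field 2 (quantity on hand) and field 4 (selling price) instead; the thresholds and variable names are arbitrary choices for illustration. The second command forces a string comparison by concatenating an empty string, to show how the result changes when awk does not treat the field as a number.

```shell
inventory='pencils 108 .11 .15
markers 50 .45 .75
memos 24 .53 .75
notebooks 15 .75 1.00
erasers 200 .12 .15
books 10 1.00 1.50'

# Numeric comparison: fewer than 50 on hand AND selling for 1.00 or less.
low_stock=$(printf '%s\n' "$inventory" | awk '$2 < 50 && $4 <= 1.00 { print $1 }')

# String comparison: ($2 "") is a string, so "108" sorts before "50"
# character by character even though 108 is the larger number.
as_strings=$(printf '%s\n' "$inventory" | LC_ALL=C awk '($2 "") < "50" { print $1 }')

printf '%s\n' "$low_stock"
printf '%s\n' "$as_strings"
```

Fields that look like numbers are compared numerically against numeric constants, which is why pencils (108 on hand) is excluded by the first pattern but included by the second.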
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). Copyright of original materials belongs to the respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: August 14, 2009