Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

AWK Regular Expressions

Regular expressions are string sequences formed from letters, numbers, and a set of special operators. They include simple strings, but they also let you search for patterns that are more than a simple string match. awk accepts the class of regular expressions called "extended regular expressions".

They are mostly compatible with Perl regex (see Dr. Nikolai Bezroukov. Perl for system administrators. Ch. 5. Regular Expressions).

The so-called POSIX extended regular expressions are defined in IEEE POSIX 1003.2, which is worth reading to understand the full range of capabilities they provide. Besides AWK, extended regular expressions are also used in egrep, so most system administrators know them well. They are also available in bash starting from version 3.0 (see String Operations in Shell). They differ from, and are more expressive than, "basic regular expressions" (the flavor that grep and sed use by default). As Donald Knuth joked, Unix can be defined as an operating system with three incompatible flavors of regular expressions.

Examples of extended regular expressions are

To use a regular expression as a string-matching pattern in AWK, enclose it in slashes. For example, to find lines that start with the word books, you could use the pattern

/^books/

Without a so-called "action block", this pattern finds and prints every line that has the string "books" at its beginning. It matches a record starting with "books", but not one starting with "notebooks". Here is a list of the metacharacters supported by AWK:

Expression Name Description
Letters, numbers, most punctuation Ordinary character Matches itself.
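. Period (dot) Matches any single character. In awk this includes the newline character, which matters only when a record can contain a newline (for example, after RS has been changed).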
* Asterisk Matches any number of occurrences of the preceding simple expression, including none.
? Question mark Matches zero or one occurrence of the preceding simple expression.
+ Plus sign Matches one or more occurrences of the preceding simple expression.
{i,j} Interval expression Matches a more restricted number of instances of the preceding simple expression; for example, ab{3}c  matches only abbbc, while ab{2,3}c  matches abbc  or abbbc, but not abc  or abbbbc. Basic regular expression interval expressions are delimited by escaped braces. To match a literal expression that has the form of an interval expression using an extended regular expression, escape the left brace. For example, \{2,3}  matches the explicit string {2,3}.
(expr) Subexpression Matches expr, causing extended regular expression operators to treat it as a unit; for example, a(bc)?d  matches ad  or abcd  but not abcbcd, abcbcbcd, or other similar strings. Basic regular expression subexpressions are delimited by escaped parentheses. To match a literal parenthesized expression using an extended regular expression, escape the left parenthesis. For example, \(abc)  matches the explicit string (abc).
[chars] Bracket expression Matches a single instance of any one of the characters within the brackets. Ranges of characters can be abbreviated by using a hyphen. For example, [0-9a-z]  matches any single digit or lowercase letter. Within brackets, all characters are ordinary characters except the hyphen (when used in a range abbreviation) and the circumflex (when used as the first character inside the brackets).
^ Circumflex When used at the beginning of an expression (or a subexpression), matches the beginning of a line (anchors the expression to the beginning of the line). When used as the first character inside brackets, excludes the bracketed characters from being matched. Otherwise, has no special properties.
$ Dollar sign When used at the end of an expression, matches the end of a line (anchors the expression to the end of the line). Otherwise, has no special properties.
\char Backslash Except within a bracket expression, escapes the next character to permit matching on explicit instances of characters that usually are extended regular expression operators.
expr expr ... Concatenation Matches any string that matches all of the concatenated expressions in sequence.
expr|expr ... Vertical bar (alternation) Separates multiple extended regular expressions; matches any of the bar-separated expressions.

awk permits you to specify multiple alternative extended regular expressions simultaneously by separating the individual expressions with a vertical bar. For example:

% awk '/[Bb]lack|[Ww]hite/ {print NR ":", $0}' .Xdefaults
55: sm.pointer_foreground:  black
56: sm.pointer_background:  white

An asterisk ( * ) acts on the simple regular expression immediately preceding it, causing that expression to match any number of occurrences of a matching pattern, even none. When an asterisk follows a period, the combination indicates a match on any sequence of characters, even none. A period and an asterisk always match as much text as possible.

An asterisk matches any number of instances of the preceding regular expression (both basic and extended). To limit the number of instances that a particular extended regular expression will match, use a plus sign ( + ) or a question mark ( ? ). The plus sign requires at least one instance of its matching pattern. The question mark refuses to accept more than one instance. The following chart illustrates the matching characteristics of the asterisk, plus sign, and question mark:

Regular Expression    Matching Strings
ab?c                  ac, abc
ab*c                  ac, abc, abbc, abbbc, ...
ab+c                  abc, abbc, abbbc, ...
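The chart above can be checked directly from the shell. The following is a minimal sketch, not a definitive test: the helper name matching_strings and the list of candidate strings are assumptions made for this illustration, and each expression is anchored with ^ and $ so that it has to match the whole line.

```shell
# Candidate strings taken from the chart above.
candidates='ac
abc
abbc
abbbc'

# Print, on one line, every candidate that matches the anchored regex ^(re)$.
# The helper name matching_strings is hypothetical, chosen for this sketch.
matching_strings() {
    printf '%s\n' "$candidates" |
        awk -v re="^($1)\$" '$0 ~ re { out = out sep $0; sep = " " } END { print out }'
}

optional=$(matching_strings 'ab?c')       # zero or one "b"
any_number=$(matching_strings 'ab*c')     # zero or more "b"
at_least_one=$(matching_strings 'ab+c')   # one or more "b"

printf 'ab?c: %s\n' "$optional"
printf 'ab*c: %s\n' "$any_number"
printf 'ab+c: %s\n' "$at_least_one"
```

The regular expression is passed in with -v and used as a dynamic regex ($0 ~ re), which keeps the awk program itself identical for all three operators.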

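You can also specify more restrictive numbers of instances of the regular expression with an interval expression. In basic regular expressions the braces are escaped, and there are three forms: expr\{i\} matches exactly i occurrences of expr, expr\{i,\} matches at least i occurrences, and expr\{i,j\} matches from i to j occurrences. In the extended regular expressions used by awk the braces are written without backslashes; note that some older awk implementations do not support interval expressions.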

Using the subexpression delimiters, you can save up to nine basic regular expression subexpression patterns on a line. Counting from left to right on the line, the first pattern saved is placed in the first holding space, the second pattern is placed in the second holding space, and so on.

The back-reference character sequence \n (where n is a digit from 1 to 9) matches the nth saved pattern. Consider the following basic regular expression:

\(A\)\(B\)C\2\1

This expression matches the string ABCBA. You can nest patterns to be saved in holding spaces. Whether the enclosed patterns are nested or in a series, n refers to the nth occurrence, counting from the left, of the subexpression delimiters. You can also use \n back-reference expressions in replacement strings. Note that back-references belong to basic regular expressions as used by tools such as sed and grep; awk regular expressions do not support them in a pattern.

A period in an expression matches any single character (in awk this includes the newline character, although with the default record separator a record never contains one). To restrict the characters to be matched, place the characters inside brackets ( [ ] ). Each string of bracketed characters is a single-character expression that matches any one of the bracketed characters. Except for the circumflex ( ^ ) and the hyphen used in a range, regular expression operators within brackets are interpreted literally, without special meaning. The circumflex excludes the bracketed characters if it is the first character in the brackets; otherwise, it has no special meaning.

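Bracket expressions can include three special types of expressions called classes: character classes such as [:alpha:] or [:digit:], equivalence classes such as [=a=], and collating symbols such as [.ch.]. Support for them varies between awk implementations.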

Here is a relevant quote from Chapter 11 The awk Programming Language

awk scripts consist of patterns and procedures:

pattern   {  procedure  }

Both are optional. If pattern is missing, { procedure } is applied to all lines; if { procedure } is missing, the matched line is printed.

A pattern can be any of the following:

/regular expression/
relational expression
pattern-matching expression
BEGIN
END

Except for BEGIN and END, patterns can be combined with the Boolean operators || (or), && (and), and ! (not). A range of lines can also be specified using comma-separated patterns:

pattern,pattern

In other words, an AWK statement consists of a pattern (which can be a regular expression) and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:

[pattern] [{ action }]
[pattern] [{ action }]
...
function name(args) { ... }
...

An action consists of one or more awk statements, enclosed in curly braces ('{' and '}'). Each statement specifies one thing to be done. The statements are separated by newlines or semicolons.

The curly braces around an action must be used even if the action contains only one statement, or even if it contains no statements at all. However, if you omit the action entirely, you can omit the curly braces as well. An omitted action is equivalent to '{ print $0 }'.

/foo/  { }  # match foo, do nothing - empty action
/foo/       # match foo, print the record - omitted action
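The difference between an omitted action and an empty action is easy to see on a few lines of input. This is a small sketch; the three input lines are invented for the illustration.

```shell
# Hypothetical input, invented for this sketch.
lines='foo 1
bar 2
foo 3'

# Omitted action: awk behaves as if the action were { print $0 }.
omitted_action=$(printf '%s\n' "$lines" | awk '/foo/')

# Empty action: the pattern still matches, but nothing is done.
empty_action=$(printf '%s\n' "$lines" | awk '/foo/ { }')

printf 'omitted action:\n%s\n' "$omitted_action"
printf 'empty action: [%s]\n' "$empty_action"
```

The first command prints the two lines containing foo; the second prints nothing, so the brackets in the last line of output are empty.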

AWK users can reuse Perl regular expression tutorials (which are plentiful and much more detailed than awk manuals), as Perl is essentially a generalization of AWK and most regular expression metasymbols are compatible in both implementations.

/sheep/;   # True if the current line contains "sheep"

Pattern matching is more like string searching. Without anchors, the position where the match occurs can be anywhere inside the string.

Special Variables

AWK sets special variables, some of which reflect the environment, some the current input record ( $1 , $2 , $3 , and so on), and some the result of the most recent call to match() (RSTART and RLENGTH).

Variable Description
$0 The contents of the current record.
$n The contents of field n of the input record. In awk you can also modify the entire record by assigning to the built-in variable $0.
ARGC The number of elements in the ARGV array, that is, the command name plus the arguments given to awk. This variable is modifiable. Does not count flags preceded by minus signs, the script file name (if any), or variable assignments made with the -v flag.
ARGV An array from ARGV[0] to ARGV[ARGC-1] containing the command name followed by the arguments given to awk. The elements of this array are modifiable. Does not include flags preceded by minus signs, the script file name (if any), or variable assignments made with the -v flag.
CONVFMT The conversion format for numbers (by default, %.6g).
ENVIRON A modifiable array containing the current set of environment variables; accessible by ENVIRON["name"], where "name" is a variable or literal containing the name of the environment variable. Changing an element in this array does not affect the environment passed to commands that awk spawns by redirection, piping, or the system() function.
FILENAME The name of the current input file. If no input file was named, FILENAME  contains a single minus sign. Inside a BEGIN  action, FILENAME  is undefined. Inside an END  action, FILENAME  reflects the last file read.
FNR The number of the current record within the current file. Differs from NR  if multiple files are being processed and the current file is not the first file read.
FS The character or expression used for a field separator. By default, any amount of white space. In awk, a field separator can be a multi-character regular expression with several alternatives. For example, the following statement defines either a comma followed by any amount of white space or at least one white-space character as the field separator:
FS = ",[ \t]*|[ \t]+"
NF The number of fields in the current record.
NR The number of the current record, counted sequentially from the beginning of the first file read. Differs from FNR  if multiple files are being processed and the current file is not the first file read.
OFMT The format specification for numbers on output (by default, %.6g).
OFS The output field separator; or string inserted between fields when the data is written. By default, a space character.
ORS The character used for the output record separator (the character between records when the data is written). By default, a newline character.
RLENGTH The length of the string matched by match(); set to -1 if no match.
RS Input character used for a record separator.
RSTART The index (position within the string) of the first character matched by match(); set to 0 if no match.
SUBSEP The separator for multiple subscripts in array elements (by default \034, the ASCII FS character).
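Two of the entries in the table are easier to grasp with a concrete run: the regular-expression field separator shown for FS, and the RSTART and RLENGTH values set by match(). The sketch below uses an invented input record and an invented string; only the FS value is taken from the table.

```shell
# Hypothetical record mixing comma and white-space separators.
record='alpha, beta gamma,delta'

# FS is the regular expression from the table: a comma followed by optional
# white space, or a run of white space.
split_fields=$(printf '%s\n' "$record" | awk '
    BEGIN { FS = ",[ \t]*|[ \t]+" }
    {
        printf "NF=%d:", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }')

# match() returns the position of the leftmost match and sets RSTART and RLENGTH.
match_info=$(awk 'BEGIN {
    s = "price: 0.75 USD"
    if (match(s, /[0-9]+\.[0-9]+/)) {
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    }
}')

printf '%s\n' "$split_fields"
printf '%s\n' "$match_info"
```

Because RSTART and RLENGTH describe the match as an offset and a length, substr(s, RSTART, RLENGTH) recovers the matched text itself.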

Suppose, however, that you want to find all the items that sell for 75 cents. You want to match .75, but only when it is in the fourth field (selling price). Then you need more than a string match using regular expressions. You need to make a comparison between the content of a particular field and a pattern. The next section discusses the comparison operators that make this possible.
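To make the problem concrete, here is a sketch that contrasts a whole-line match with a match restricted to the fourth field. The item list and its column layout (name, cost, unit, selling price) are assumptions invented for this illustration.

```shell
# Hypothetical inventory: name, cost, unit, selling price.
items='pencils 0.75 box 2.50
erasers 1.10 each 0.75
paper 3.75 ream 4.00'

# A plain regular expression matches .75 anywhere in the line.
anywhere=$(printf '%s\n' "$items" | awk '/\.75/ { print $1 }')

# Restricting the match to field 4 selects only the selling price.
# Note that /\.75$/ would also accept a price such as 1.75.
fourth_field=$(printf '%s\n' "$items" | awk '$4 ~ /\.75$/ { print $1 }')

printf 'matched anywhere:\n%s\n' "$anywhere"
printf 'matched in field 4:\n%s\n' "$fourth_field"
```

The first command reports all three items, while the second reports only erasers.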

Complex (compound) patterns

Compound patterns are combinations of patterns, joined with the logical operators && (and), || (or), and ! (not). These can be very useful when searching for a complex pattern in a database or in a program.
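As a short sketch of the three logical operators, the commands below filter a made-up log; the log lines are invented for the illustration.

```shell
# Hypothetical log lines.
log='error: disk full
warning: disk almost full
error: network down
info: ok'

# && and !: lines that start with "error" and do not mention "network".
errors_not_network=$(printf '%s\n' "$log" | awk '/^error/ && !/network/')

# ||: lines that start with either "error" or "warning".
errors_or_warnings=$(printf '%s\n' "$log" | awk '/^error/ || /^warning/')

printf '%s\n' "$errors_not_network"
printf '%s\n' "$errors_or_warnings"
```

Each regular expression between slashes is tested against the whole record, and the logical operators combine the individual results.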

Comparing Strings

The preceding section dealt with string matches in which a match occurs when the target string occurs anywhere in a line. Sometimes, though, you want to compare a string or pattern with another string, for example a particular field or a variable. You can compare two strings in various ways, including whether one contains the other, whether they are identical, or whether one precedes the other in alphabetical order.

You use the tilde (~) sign to perform pattern matching on a string using extended regular expressions. For example,

$2 ~ /^15/

checks whether field 2 begins with 15. This pattern matches if field 2 begins with 15 regardless of what the rest of the field may contain. It is a test for matching, not identity. If you wish to test whether field 2 contains precisely the string 15 and nothing else, you could use

$2 ~ /^15$/

You can test for nonmatching strings with !~. This is similar to ~, but it is true if the string on the left does not match the regular expression on the right.
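The three tests just described can be compared side by side. In this sketch the two-column input is invented; the patterns are the ones from the text, plus an unanchored /15/ for the !~ case.

```shell
# Hypothetical two-column data: a label and a number.
data='a 15
b 150
c 315
d 42'

# Field 2 begins with 15.
begins_with=$(printf '%s\n' "$data" | awk '$2 ~ /^15/ { print $1 }')

# Field 2 is exactly 15.
exactly=$(printf '%s\n' "$data" | awk '$2 ~ /^15$/ { print $1 }')

# Field 2 does not contain 15 anywhere.
no_match=$(printf '%s\n' "$data" | awk '$2 !~ /15/ { print $1 }')

printf 'begins with 15: %s\n' "$(printf '%s' "$begins_with" | tr '\n' ' ')"
printf 'exactly 15: %s\n' "$exactly"
printf 'does not contain 15: %s\n' "$no_match"
```

Record c (315) shows why the anchors matter: it contains 15 but neither begins with it nor equals it.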

You can use the == operator to check whether two strings are identical, rather than whether one contains the other. For example,

$1==$3

checks to see whether the value of field 1 is equal to the value of field 3.

Do not confuse == with =. The former (==) tests whether two strings are identical. The single equal sign (=) assigns a value to a variable. For example,

$1=15

sets the value of field 1 equal to 15. It would be used as part of an action statement. On the other hand,

$1==15

compares the value of field 1 to the number 15. This is a relational expression rather than a regular expression, and it can be used as a pattern.
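The following sketch shows both operators at work, and also what happens if = is typed where == was intended. The three input rows are invented for the illustration.

```shell
# Hypothetical rows with three fields each.
rows='15 x 15
15 y 20
7 z 7'

# Comparison: select rows where field 1 equals field 3.
equal_fields=$(printf '%s\n' "$rows" | awk '$1 == $3 { print $2 }')

# Assignment inside an action: field 1 is overwritten, then the record is printed.
assigned=$(printf '%s\n' "$rows" | awk '{ $1 = 15; print }')

# Pitfall: = used as a pattern assigns 15 to field 1; the result (15) is
# non-zero, so the pattern is true for every record.
pitfall=$(printf '%s\n' "$rows" | awk '$1 = 15')

printf '%s\n' "$equal_fields"
printf '%s\n' "$assigned"
printf '%s\n' "$pitfall"
```

The last two commands produce the same output, which is why a mistyped = in a pattern silently rewrites and prints every record instead of selecting some of them.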

Similarly the != operator tests whether the values of two expressions are not equal. For example,

$1 != "pencils"

That pattern matches any line where the first field is not "pencils."

COMPARING TWO STRINGS

You can compare two strings according to their alphabetical order using the standard comparison operators, <, >, <=, and >=. The strings are compared character by character, according to standard ASCII alphabetical order, so that:

"regular" < "relational"

Remember that in the ASCII character code, all uppercase letters precede all lowercase letters.

You can use string comparison patterns in a program to put names in alphabetical order, or to match any record with a last name past a certain name. For example, the following matches lines in which the second field follows "Johnson" in alphabetical order:

$2 > "Johnson"
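Here is a sketch of that comparison on an invented list of names. LC_ALL=C is set as a precaution so that the comparison follows plain ASCII order regardless of the locale and awk implementation in use; that setting is an assumption of this example, not something the text requires.

```shell
# Hypothetical list: a record number and a last name.
names='1 Adams
2 Johnson
3 Jones
4 Smith
5 adams'

# Names that sort after "Johnson" in ASCII order.
after_johnson=$(printf '%s\n' "$names" | LC_ALL=C awk '$2 > "Johnson" { print $2 }')

printf '%s\n' "$after_johnson"
```

The lowercase adams is selected along with Jones and Smith, because in ASCII every lowercase letter follows every uppercase letter.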

Range patterns

You have seen how to make comparisons between strings, how to search for complex strings using regular expressions, and how to create compound patterns using logical operators. awk provides another way to specify a pattern that can be particularly powerful: the range pattern. The syntax for a range pattern is

pattern1, pattern2

This matches every line from a match of the first pattern through a match of the second pattern, including the starting and ending lines. In other words, from the line where the first pattern is found, every line will match through the line where the second pattern is found.

If you have a database file in which at least one of the fields is arranged in order, a range pattern is a very easy way to pull out part of the database. For example, if you have a list of customers sorted by customer number, a range pattern can select all the entries between two customer numbers. The following pattern selects all lines from the line beginning with 200 through the line beginning with 299 (the circumflex anchors each expression to the beginning of the line; without it, 200 or 299 would match anywhere in a line):

/^200/,/^299/
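A range selection of this kind can be tried on a small sorted list. In the sketch below the customer list is invented, and both expressions are anchored with ^ so that they are tested only against the beginning of each line.

```shell
# Hypothetical customer list, sorted by customer number.
customers='100 Ames
200 Baker
250 Clark
299 Davis
300 Evans'

# Print from the first line beginning with 200 through the next line beginning with 299.
selected=$(printf '%s\n' "$customers" | awk '/^200/,/^299/')

printf '%s\n' "$selected"
```

Both boundary lines are included in the output. If the second pattern never matches, the range stays open and every line up to the end of the input is printed.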

Numeric patterns

All of the string-matching patterns in the previous section also work for numeric patterns, except for regular expressions and the string-matching tilde operator. You don't have to specify whether you are dealing with strings or numbers; awk uses the context to decide which is appropriate. Probably the most commonly used numeric patterns are the comparisons, especially those comparing the value of a field to a number or to another field. You use the comparison operators to do this.

Compound patterns, formed with the operators && (and), || (or), and ! (not), are useful for numeric variables as well as string variables. This is an example of a compound pattern:

$1 < 10 && $2 <= 30

This matches if field 1 is less than 10 and field 2 is less than or equal to 30.
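A quick sketch of this compound test, run on invented pairs of numbers chosen so that each half of the condition fails at least once:

```shell
# Hypothetical pairs of numbers.
pairs='5 30
5 31
10 20
9 10'

# Keep records where field 1 is below 10 and field 2 is at most 30.
selected_pairs=$(printf '%s\n' "$pairs" | awk '$1 < 10 && $2 <= 30')

printf '%s\n' "$selected_pairs"
```

The second record fails the test on field 2 and the third fails the test on field 1, so only the first and last records are printed. Because the fields look like numbers, awk compares them numerically rather than as strings.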



Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: February 10, 2021