Softpanorama |
May the source be with you, but remember the KISS principle ;-)
String operators allow you to manipulate the contents of a variable without resorting to AWK or Perl.
There are three kinds of variable substitution: pattern matching, substitution, and command substitution.
I'll talk about the first two and leave command substitution for another article.
Please note that a double-quoted string in the shell is not the same as a double-quoted string in Perl: backslash escapes such as \n are not interpreted inside shell double quotes. If you want Perl-style interpretation of escape sequences, use $'string'.
#!/bin/bash
# String expansion.
# Introduced with version 2 of Bash.
# Strings of the form $'xxx'
#+ have the standard escaped characters interpreted.
echo $'Ringing bell 3 times \a \a \a'
# May only ring once with certain terminals.
echo $'Three form feeds \f \f \f'
echo $'10 newlines \n\n\n\n\n\n\n\n\n\n'
echo $'\102\141\163\150' # Bash
# Octal equivalent of characters.
exit 0
There are two ways to get the length of a string. The simplest one is ${#varname}, which returns the length of the value of the variable as a character string. For example, if filename has the value fred.c, then ${#filename} would have the value 6. The other operator (${#array[*]}) has to do with array variables.
stringZ=abcABC123ABCabc
echo ${#stringZ} # 15
echo `expr length $stringZ` # 15
echo `expr "$stringZ" : '.*'` # 15
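The array form mentioned above can be sketched as follows. The array name files and its contents are made up for illustration; the point is only that ${#files[*]} counts elements while ${#files[0]} measures one element as a string.

```shell
#!/bin/bash
# Hypothetical array, used only for illustration.
files=(fred.c barney.h "wilma notes.txt")

count=${#files[*]}       # number of elements in the array
first_len=${#files[0]}   # length of the first element, as a string

echo "$count"       # 3
echo "$first_len"   # 6
```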

stringZ=abcABC123ABCabc
#       |------|
echo `expr match "$stringZ" 'abc[A-Z]*.2'` # 8
echo `expr "$stringZ" : 'abc[A-Z]*.2'` # 8
stringZ=abcABC123ABCabc
echo `expr index "$stringZ" C12` # 6
# C position.
echo `expr index "$stringZ" 1c` # 3
# 'c' (in #3 position) matches before '1'.
This is the near equivalent of strchr() in C.
The shell uses zero-based indexing. When substring expansion of the form ${param:offset[:length]} is used,
an `offset' that evaluates to a number less than zero counts back from the end of the expanded value of $param.
When a negative `offset' begins with a minus sign, however, unexpected things can happen. Consider
a=12345678
echo ${a:-4}
intending to print the last four characters of $a. The problem is that ${param:-word} already has a well-defined meaning: expand to word if the expanded value of param is unset or null, and $param otherwise.
To use negative offsets that begin with a minus sign,
separate the minus sign and the colon with a space.
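A minimal sketch of the difference, reusing the value of a from the text. The variable names on the left are invented for the example, and the parenthesised form is an alternative spelling that bash also accepts.

```shell
#!/bin/bash
a=12345678

default_form=${a:-4}       # a is set, so this is just $a
last_four=${a: -4}         # space before the minus: substring expansion
also_last_four=${a:(-4)}   # parentheses work as well

echo "$default_form"     # 12345678
echo "$last_four"        # 5678
echo "$also_last_four"   # 5678
```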
If the $string parameter is "*" or "@", then this extracts the positional parameters, [1] starting at $position.
stringZ=abcABC123ABCabc
# 0123456789.....
# 0-based indexing.
echo ${stringZ:0} # abcABC123ABCabc
echo ${stringZ:1} # bcABC123ABCabc
echo ${stringZ:7} # 23ABCabc
echo ${stringZ:7:3} # 23A
# Three characters of substring.
If the $string parameter is "*" or "@", then this extracts a maximum of $length positional parameters, starting at $position.
echo ${*:2} # Echoes second and following positional parameters.
echo ${@:2} # Same as above.
echo ${*:2:3} # Echoes three positional parameters, starting at second.

stringZ=abcABC123ABCabc
#       123456789......
#       1-based indexing.
echo `expr substr $stringZ 1 2` # ab
echo `expr substr $stringZ 4 3` # ABC
There are two kinds of pattern matching available: matching from the left and matching from the right. The operators, with their functions and an example, are shown in the following table:
| Operator | Function | Example |
| ${foo#t*is} | Deletes the shortest possible match from the left | export foo="this is a test"; echo ${foo#t*is} prints " is a test" |
| ${foo##t*is} | Deletes the longest possible match from the left | export foo="this is a test"; echo ${foo##t*is} prints " a test" |
| ${foo%t*st} | Deletes the shortest possible match from the right | export foo="this is a test"; echo ${foo%t*st} prints "this is a " |
| ${foo%%t*st} | Deletes the longest possible match from the right | export foo="this is a test"; echo ${foo%%t*st} prints an empty line |
NOTE
While the # and % identifiers may not seem obvious, they have a convenient mnemonic. The # key is on the left side of the $ key on the keyboard and operates from the left. The % key is on the right of the $ key and operates from the right.
These operators can be used to do a variety of things. For example, the following script changes the extension of all .html files to .htm.
#!/bin/bash
# quickly convert html filenames for use on a dossy system
# only handles file extensions, not filenames
for i in *.html; do
  if [ -f "${i%l}" ]; then
    echo "${i%l} already exists"
  else
    mv "$i" "${i%l}"
  fi
done
Another kind of variable mangling you might want to employ is substitution. There are four substitution operators in bash, as shown in the following table.
| Operator | Function | Example |
| ${foo:-bar} | If $foo exists and is not null, return $foo. If it doesn't exist or is null, return bar. | export foo=""; echo ${foo:-one} prints "one"; echo $foo then prints an empty line |
| ${foo:=bar} | If $foo exists and is not null, return $foo. If it doesn't exist or is null, set $foo to bar and return bar. | export foo=""; echo ${foo:=one} prints "one"; echo $foo then also prints "one" |
| ${foo:+bar} | If $foo exists and is not null, return bar. If it doesn't exist or is null, return a null. | export foo="this is a test"; echo ${foo:+bar} prints "bar" |
| ${foo:?"error message"} | If $foo exists and is not null, return its value. If it doesn't exist or is null, print the error message. If no error message is given, print parameter null or not set. Note: In a non-interactive shell, this will abort the current script. In an interactive shell, this will just print the error message. | export foo="one"; for i in foo bar baz; do eval echo \${$i:?}; done prints "one", then "bash: bar: parameter null or not set" and "bash: baz: parameter null or not set" |
NOTE
The colon (:) in the above operators can be omitted. Doing so changes the behavior of the operator to test only for existence of the variable. This will cause the creation of a variable in the case of ${foo=bar}.
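The effect of dropping the colon is easiest to see with a variable that is set but empty. The following is a small sketch; the variable names and the word fallback are invented for the example.

```shell
#!/bin/bash
unset unset_var
empty_var=""

with_colon=${empty_var:-fallback}     # null counts as "missing"
without_colon=${empty_var-fallback}   # set (even if empty), so no fallback
unset_case=${unset_var-fallback}      # truly unset: fallback applies

echo "[$with_colon] [$without_colon] [$unset_case]"
# [fallback] [] [fallback]
```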
These operators can be used in a variety of ways. A good example would be to give a default value to a variable normally read from the command-line arguments, when no such arguments are given. This is shown in the following script:
#!/bin/bash
export INFILE=${1-"infile"}
export OUTFILE=${2-"outfile"}
cat "$INFILE" > "$OUTFILE"
Hopefully, this gives you something to think about and to play with until the next article. If you're interested in more hints about bash (or other stuff I've written about), please take a look at my home page. If you've got questions or comments, please drop me a line.
Korn shell's pattern-matching operators.
| Operator | Meaning |
|---|---|
| ${variable#pattern} | If the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest. |
| ${variable##pattern} | If the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest. |
| ${variable%pattern} | If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest. |
| ${variable%%pattern} | If the pattern matches the end of the variable's value, delete the longest part that matches and return the rest. |
These can be hard to remember, so here's a handy mnemonic device: # matches the front because number signs precede numbers; % matches the rear because percent signs follow numbers.
The expression ${DIRSTACK:-$PWD} evaluates to $DIRSTACK if it is non-null or $PWD (the current directory) if it is null.
| Operator | Substitution |
|---|---|
| ${varname:-word} | If varname exists and isn't null, return its value; otherwise return word. |
| Purpose: | Returning a default value if the variable is undefined. |
| Example: | ${count:-0} evaluates to 0 if count is undefined. |
| ${varname:=word} | If varname exists and isn't null, return its value; otherwise set it to word and then return its value. |
| Purpose: | Setting a variable to a default value if it is undefined. |
| ${varname:?message} | If varname exists and isn't null, return its value; otherwise print varname: followed by message, and abort the current command or script. Omitting message produces the default message parameter null or not set. |
| Purpose: | Catching errors that result from variables being undefined. |
| ${varname:+word} | If varname exists and isn't null, return word; otherwise return null. |
| Purpose: | Testing for the existence of a variable. |
| Example: | ${count:+1} returns 1 (which could mean "true") if count is defined. |
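The four operators in the table can be tried out in one short script. This is only a sketch: it reuses the count variable from the table's examples, while the variable missing and the message text are assumptions made up here.

```shell
#!/bin/bash
unset count

echo "${count:-0}"      # 0; count itself is still unset
echo "${count:+1}"      # empty line, because count is unset

: "${count:=0}"         # assigns 0 to count as a side effect
echo "$count"           # 0
echo "${count:+1}"      # 1, now that count is set and non-null

# ${missing:?message} aborts a non-interactive shell when missing is
# unset or null, so it is run in a subshell to let the sketch continue.
unset missing
( : "${missing:?is not set}" ) 2>/dev/null || echo "missing was rejected"
```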
The first two of these operators are ideal for setting defaults for command-line arguments in case the user omits them. We'll use the first one in our first programming task.
If we used #*/ instead of ##*/, the expression would have the incorrect value dave/pete/fred/bob, because the shortest instance of "anything followed by a slash" at the beginning of the string is just a slash (/).
The construct ${variable##*/} is actually equivalent to the UNIX utility basename(1). basename takes a pathname as argument and returns the filename only; it is meant to be used with the shell's command substitution mechanism (see below). basename is less efficient than ${variable##*/} because it runs in its own separate process rather than within the shell.
Another utility, dirname(1), does essentially the opposite of basename: it returns the directory prefix only. It is equivalent to the Korn shell expression ${variable%/*} and is less efficient for the same reason.
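The correspondence can be checked directly. The sketch below uses a made-up path and compares the two expansions against the external utilities.

```shell
#!/bin/bash
# Hypothetical path, for illustration only.
variable=/usr/local/share/doc/notes.txt

name=${variable##*/}    # like basename: notes.txt
dir=${variable%/*}      # like dirname:  /usr/local/share/doc

echo "$name"
echo "$dir"

# The external utilities give the same answers for this path.
[ "$name" = "$(basename "$variable")" ] && echo "basename agrees"
[ "$dir" = "$(dirname "$variable")" ] && echo "dirname agrees"
```

The equivalence holds for ordinary paths like this one. It breaks down at the edges: for a name with no slash in it, dirname prints a dot, while ${variable%/*} returns the name unchanged.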
The bash shell has many features that are sufficiently obscure you almost never see them used. One of the problems is that the man page offers no examples.
Here I'm going to show how to use some of these features to do the sorts of simple string manipulations that are commonly needed on file and path names.
In traditional Bourne shell programming you might see references to the basename and dirname commands. These perform simple string manipulations on their arguments. You'll also see many uses of sed and awk or perl -e to perform simple string manipulations.
Often these manipulations need to be performed on lists of filenames and paths. There are many specialized programs that are conventionally included with Unix to perform these sorts of utility functions: tr, cut, paste, and join. Given a filename like /home/myplace/a.data.directory/a.filename.txt which we'll call $f you could use commands like:
dirname $f
basename $f
basename $f .txt
... to see output like:
/home/myplace/a.data.directory
a.filename.txt
a.filename
Notice that the GNU version of basename takes an optional parameter. This is handy for specifying a filename "extension" like .tar.gz which will be stripped off of the output. Note that basename and dirname don't verify that these parameters are valid filenames or paths. They simply perform simple string operations on a single argument. You shouldn't use wild cards with them -- since dirname takes exactly one argument (and complains if given more) and basename takes one argument and an optional one which is not a filename.
Despite their simplicity these two commands are used frequently in shell programming because most shells don't have any built-in string handling functions -- and we frequently need to refer to just the directory or just the file name parts of a given full file specification.
Usually these commands are used within the "back tick" shell operators like TARGETDIR=`dirname $1`. The "back tick" operators are equivalent to the $(...) construct. This latter construct is valid in Korn shell and bash -- and I find it easier to read (since I don't have to squint at my screen wondering which direction the "tick" is slanted).
Although the basename and dirname commands embody the "small is beautiful" spirit of Unix -- they may push the envelope towards the "too simple to be worth a separate program" end of simplicity.
Naturally you can call on sed, awk, TCL or perl for more flexible and complete string handling. However this can be overkill -- and a little ungainly.
So, bash (which long ago abandoned the "small is beautiful" principle and went the way of emacs) has some built-in syntactic candy for doing these operations. Since bash is the default shell on Linux systems, there is no reason not to use these features when writing scripts for Linux.
The bash man page is huge. It contains a complete reference to the "readline" libraries and how to write a .inputrc file (which I think should all go in a separate man page) -- and a rundown of all the csh "history" or bang! operators (which I think should be replaced with a simple statement like: "Most of the bang! tricks that work in csh work the same way in bash").
However, buried in there is a section on Parameter Substitution which tells us that $foo is really a shorthand for ${foo} which is really the simplest case of several ${foo:operators} and similar constructs.
Are you confused, yet?
Here's where a few examples would have helped. To understand the man page I simply experimented with the echo command and several shell variables. This is what it all means:
We can use these expressions:
Note that the last two depend on the assignment made in the second one
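As an illustration of expressions of this kind, here is a minimal sketch. The value of foo and the variable names path, file, base and ext are assumptions chosen for the example; the last two are derived from file, the second assignment.

```shell
#!/bin/bash
foo=/tmp/my.dir/filename.tar.gz   # hypothetical value

path=${foo%/*}      # /tmp/my.dir       (like dirname)
file=${foo##*/}     # filename.tar.gz   (like basename)
base=${file%%.*}    # filename          (longest ".*" removed from the end)
ext=${file#*.}      # tar.gz            (shortest "*." removed from the front)

echo "$path" "$file" "$base" "$ext"
```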
Here we notice two different "operators" being used inside the parameters (curly braces). Those are the # and the % operators. We also see them used as single characters and in pairs. This gives us four combinations for trimming patterns off the beginning or end of a string:
It's important to understand that these use shell "globbing" rather than "regular expressions" to match these patterns. Naturally a simple string like "txt" will match sequences of exactly those three characters in that sequence -- so the difference between "shortest" and "longest" only applies if you are using a shell wild card in your pattern.
A simple example of using these operators comes in the common question of copying or renaming all the *.txt to change the .txt to .bak (in MS-DOS' COMMAND.COM that would be REN *.TXT *.BAK).
This is complicated in Unix/Linux because of a fundamental difference in the programming APIs. In most Unix shells the expansion of a wild card pattern into a list of filenames (called "globbing") is done by the shell -- before the command is executed. Thus the command normally sees a list of filenames (like "foo.txt bar.txt etc.txt") where DOS (COMMAND.COM) hands external programs a pattern like *.TXT.
Under Unix shells, if a pattern doesn't match any filenames the parameter is usually left on the command line literally. Under bash this is a user-settable option. In fact, under bash you can disable shell "globbing" if you like -- there's a simple option to do this. It's almost never used -- because commands like mv and cp won't work properly if their arguments are passed to them in this manner.
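The two bash settings alluded to here are, as far as I can tell, the nullglob shell option and the noglob setting (set -f). The sketch below shows their effect in a scratch directory; the directory, the file a.txt and the variable names are all made up for the demonstration.

```shell
#!/bin/bash
# Work in an empty scratch directory so no file can match the pattern.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

shopt -u nullglob
default_result=$(echo *.nosuchext)    # pattern is left as typed

shopt -s nullglob
nullglob_result=$(echo *.nosuchext)   # pattern expands to nothing

shopt -u nullglob
set -f                                # turn globbing off entirely
touch a.txt
noglob_result=$(echo *.txt)           # stays "*.txt" even though a.txt exists
set +f

echo "[$default_result] [$nullglob_result] [$noglob_result]"
# [*.nosuchext] [] [*.txt]

cd / && rm -rf "$scratch"
```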
However here's a way to accomplish a similar result:
for i in *.txt; do cp "$i" "${i%.txt}.bak"; done
... obviously this is more typing. If you tried to create a shell function or alias for it -- you have to figure out how to pass it parameters. Certainly the following seems simple enough:
function cp-pattern { for i in $1; do cp "$i" "${i%$1}$2"; done; }
... but that doesn't work like most Unix users would expect. You'd have to pass this command a pair of specially chosen, and quoted arguments like:
cp-pattern '*.txt' .bak
... note how the second pattern has no wild cards and how the first is quoted to prevent any shell globbing. That's fine for something you might just use yourself -- if you remember to quote it right. It's easy enough to add a check for the number of arguments and to ensure that there is at least one file that exists in the $1 pattern. However it becomes much harder to make this command reasonably safe and robust. Inevitably it becomes less "unix-like" and thus more difficult to use with other Unix tools.
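One way to sidestep the quoting problem is to pass two plain suffixes instead of a pattern. The following is only a sketch under that assumption; the function name copy_suffix and its argument order are invented here, and it includes the argument-count and file-exists checks mentioned above.

```shell
#!/bin/bash
# copy_suffix OLD NEW: copy every file ending in OLD to the same name
# ending in NEW. The function name and argument order are made up here.
copy_suffix() {
    if [ "$#" -ne 2 ]; then
        echo "usage: copy_suffix old-suffix new-suffix" >&2
        return 2
    fi
    local old=$1 new=$2 f
    for f in *"$old"; do
        [ -f "$f" ] || continue          # skip the unmatched literal pattern
        cp -- "$f" "${f%"$old"}$new"     # quoted, so $old is taken literally
    done
}
```

It would be called as copy_suffix .txt .bak, with nothing that needs protecting from the shell.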
I generally just take a whole different approach. Rather than trying to use cp to make a backup of each file under a slightly changed name I might just make a directory (usually using the date and my login ID as a template) and use a simple cp command to copy all my target files into the new directory.
Another interesting thing we can do with these "parameter expansion" features is to iterate over a list of components in a single variable.
For example, you might want to do something to traverse over every directory listed in your path -- perhaps to verify that everything listed therein is really a directory and is accessible to you.
Here's a command pair that will echo each directory named on your path on its own line:
p=$PATH
until [ "$p" = "$d" ]; do d=${p%%:*}; p=${p#*:}; echo "$d"; done
... obviously you can replace the echo $d part of this command with anything you like.
Another case might be where you'd want to traverse a list of directories that were all part of a path. Here's a command pair that echoes each directory from the root down to the "current working directory":
p=$(pwd)
until [ "$p" = "$d" ]; do p=${p#*/}; d=${p%%/*}; echo "$d"; done
... here we've reversed the assignments to p and d so that we skip the root directory itself -- which must be "special cased" since it appears to be a "null" entry if we do it the other way. The same problem would have occurred in the previous example -- if the value assigned to $PATH had started with a ":" character.
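The same trimming technique can be wrapped in a function that does not depend on the previous value of d. This is a sketch rather than a drop-in replacement; list_entries is a hypothetical helper name and the sample list is made up.

```shell
#!/bin/bash
# List the entries of a colon-separated value, one per line.
# list_entries is a hypothetical helper name.
list_entries() {
    local rest=$1 entry
    while [ -n "$rest" ]; do
        entry=${rest%%:*}            # text before the first colon
        echo "$entry"
        case $rest in
            *:*) rest=${rest#*:} ;;  # drop the entry and its colon
            *)   rest= ;;            # that was the last entry
        esac
    done
}

list_entries "/usr/local/bin:/usr/bin:/bin"
```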
Of course, it's important to realize that this is not the only, or necessarily the best, method to parse a line or value into separate fields. Here's an example that uses the old IFS variable (the "internal field separator" in the Bourne and Korn shells as well as bash) to parse each line of /etc/passwd and extract just two fields:
cat /etc/passwd | ( \
IFS=: ; while read lognam pw id gp fname home sh; \
do echo $home \"$fname\"; done \
)
Here we see the parentheses used to isolate the contents in a subshell -- such that the assignment to IFS doesn't affect our current shell. Setting the IFS to a "colon" tells the shell to treat that character as the separator between "words" -- instead of the usual "whitespace" that's assigned to it. For this particular function it's very important that IFS consist solely of that character -- usually it is set to "space," "tab," and "newline."
After that we see a typical while read loop -- where we read values from each line of input (from /etc/passwd) into seven variables per line. This allows us to use any of these fields that we need from within the loop. Here we are just using the echo command -- as we have in the other examples.
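A variant of the same loop that needs neither the subshell nor cat is sketched below. It assumes bash; print_homes is a made-up function name, and the sample file with its two accounts is invented so the example does not depend on the real /etc/passwd.

```shell
#!/bin/bash
# Print the home directory and full name from a passwd-format file.
# print_homes is a hypothetical helper name.
print_homes() {
    # Putting IFS=: in front of read limits the change to that one
    # command, so the rest of the shell keeps its usual IFS.
    while IFS=: read -r lognam pw id gp fname home sh; do
        echo "$home \"$fname\""
    done < "$1"
}

# Sample data in /etc/passwd format; both accounts are invented.
sample=$(mktemp)
cat > "$sample" <<'EOF'
alice:x:1000:1000:Alice Example:/home/alice:/bin/bash
bob:x:1001:1001:Bob Example:/home/bob:/bin/sh
EOF

print_homes "$sample"
rm -f "$sample"
```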
My point here has been to show how we can do quite a bit of string parsing and manipulation directly within bash -- which will allow our shell scripts to run faster with less overhead and may be easier than some of the more complex sorts of pipes and command substitutions one might have to employ to pass data to the various external commands and return the results.
Many people might ask: Why not simply do it all in perl? I won't dignify that with a response. Part of the beauty of Unix is that each user has many options about how they choose to program something. Well written scripts and programs interoperate regardless of what particular scripting or programming facility was used to create them. Issue the command file /usr/bin/* on your system and you may be surprised at how many Bourne and C shell scripts there are in there.
In conclusion I'll just provide a sampler of some other bash parameter expansions:
This one just uses the shell "null command" (the : command) to evaluate the expression. If the variable doesn't exist or has a null value -- this will print the string to the standard error file handle and exit the script with a return code of one.
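A sketch of the idiom being described, assuming it is the ${variable:?message} form used as an argument to the null command. CONFIG_DIR, check_config and the message text are made-up names for the example.

```shell
#!/bin/bash
# CONFIG_DIR is a made-up variable name for this sketch.
check_config() {
    # ":" does nothing itself; its argument is expanded first, and the
    # expansion fails if CONFIG_DIR is unset or null.
    : "${CONFIG_DIR:?must be set before running this script}"
    echo "using $CONFIG_DIR"
}

# Run each check in a subshell so a failure does not end this script.
( unset CONFIG_DIR; check_config ) 2>/dev/null || echo "check failed"
( CONFIG_DIR=/etc/myapp; check_config )
```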
Oddly enough -- while it is easy to redirect the standard error of processes under bash -- there doesn't seem to be an easy portable way to explicitly generate a message or redirect output to stderr. The best method I've come up with is to use the /proc/ filesystem (process table) like so:
function error { echo "$*" > /proc/self/fd/2; }
... self is always a set of entries that refers to the current process -- and self/fd/ is a directory full of the currently open file descriptors. Under Unix and DOS every process is given the following pre-opened file descriptors: stdin, stdout, and stderr.
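For comparison, the standard Bourne-shell redirection >&2 duplicates file descriptor 2 without needing /proc, so it also works on systems that have no /proc filesystem. A minimal sketch, keeping the function name error from the text; the variable captured is invented for the demonstration.

```shell
#!/bin/bash
# Same idea as the error function above, but using the shell's own
# descriptor duplication instead of the /proc filesystem.
error() {
    echo "$*" >&2
}

error "something went wrong"               # appears on stderr
captured=$(error "only on stderr" 2>&1)    # 2>&1 shows where it went
echo "captured: $captured"
```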
Bash supports a surprising number of string manipulation operations. Unfortunately, these tools lack a unified focus. Some are a subset of parameter substitution, and others fall under the functionality of the UNIX expr command. This results in inconsistent command syntax and overlap of functionality, not to mention confusion.
stringZ=abcABC123ABCabc
#       =======
echo `expr match "$stringZ" '\(.[b-c]*[A-Z]..[0-9]\)'` # abcABC1
echo `expr "$stringZ" : '\(.[b-c]*[A-Z]..[0-9]\)'` # abcABC1
echo `expr "$stringZ" : '\(.......\)'` # abcABC1
# All of the above forms give an identical result.

stringZ=abcABC123ABCabc
#                ======
echo `expr match "$stringZ" '.*\([A-C][A-C][A-C][a-c]*\)'` # ABCabc
echo `expr "$stringZ" : '.*\(......\)'` # ABCabc
stringZ=abcABC123ABCabc
# |----|
# |----------|
echo ${stringZ#a*C} # 123ABCabc
# Strip out shortest match between 'a' and 'C'.
echo ${stringZ##a*C} # abc
# Strip out longest match between 'a' and 'C'.
stringZ=abcABC123ABCabc
# ||
# |------------|
echo ${stringZ%b*c} # abcABC123ABCa
# Strip out shortest match between 'b' and 'c', from back of $stringZ.
echo ${stringZ%%b*c} # a
# Strip out longest match between 'b' and 'c', from back of $stringZ.
#!/bin/bash
# cvt.sh:
# Converts all the MacPaint image files in a directory to "pbm" format.
# Uses the "macptopbm" binary from the "netpbm" package,
#+ which is maintained by Bryan Henderson (bryanh@giraffe-data.com).
# Netpbm is a standard part of most Linux distros.
OPERATION=macptopbm
SUFFIX=pbm # New filename suffix.
if [ -n "$1" ]
then
directory=$1 # If directory name given as a script argument...
else
directory=$PWD # Otherwise use current working directory.
fi
# Assumes all files in the target directory are MacPaint image files,
# + with a ".mac" suffix.
for file in "$directory"/* # Filename globbing.
do
  filename=${file%.*c} # Strip ".mac" suffix off filename
  #+ ('.*c' matches everything
  #+ between '.' and 'c', inclusive).
  $OPERATION "$file" > "$filename.$SUFFIX"
  # Redirect conversion to new filename.
  rm -f "$file" # Delete original files after converting.
  echo "$filename.$SUFFIX" # Log what is happening to stdout.
done
exit 0
stringZ=abcABC123ABCabc
echo ${stringZ/abc/xyz} # xyzABC123ABCabc
# Replaces first match of 'abc' with 'xyz'.
echo ${stringZ//abc/xyz} # xyzABC123ABCxyz
# Replaces all matches of 'abc' with 'xyz'.
stringZ=abcABC123ABCabc
echo ${stringZ/#abc/XYZ} # XYZABC123ABCabc
# Replaces front-end match of 'abc' with 'XYZ'.
echo ${stringZ/%abc/XYZ} # abcABC123ABCXYZ
# Replaces back-end match of 'abc' with 'XYZ'.
A Bash script may invoke the string manipulation facilities of awk as an alternative to using its built-in operations.
#!/bin/bash
# substring-extraction.sh
String=23skidoo1
# 012345678 Bash
# 123456789 awk
# Note different string indexing system:
# Bash numbers first character of string as '0'.
# Awk numbers first character of string as '1'.
echo ${String:2:4} # position 3 (0-1-2), 4 characters long
# skid
# The awk equivalent of ${string:pos:length} is substr(string,pos,length).
echo | awk '
{ print substr("'"${String}"'",3,4) # skid
}
'
# Piping an empty "echo" to awk gives it dummy input,
#+ and thus makes it unnecessary to supply a filename.
exit 0
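The quoting in the awk call above can be avoided by handing the shell value to awk with -v, and the dummy input by doing the work in a BEGIN block. A small sketch of that variant, reusing String from the script; the awk variable s and the shell variable result are names chosen here.

```shell
#!/bin/bash
String=23skidoo1

# -v copies the shell value into an awk variable before the program runs,
# and a BEGIN block needs no input at all.
result=$(awk -v s="$String" 'BEGIN { print substr(s, 3, 4) }')
echo "$result"   # skid
```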
For more on string manipulation in scripts, refer to Section 9.3 and the relevant section of the expr command listing.
| [1] | This applies to either command line arguments or parameters passed to a function. |
# More to the point, thanks to the way ksh works, you can do this:
# make an array, words, local to the current function
typeset -A words
# read a full line
read line
# split the line into words
echo "$line" | read -A words
# Now you can access the line either word-wise or string-wise - useful if you want to, say, check for a command as the Nth parameter,
# but also keep formatting of the other parameters...
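In bash the closest counterpart is read -a (lower case), fed from a here-string rather than a pipe. The sketch below assumes bash; the contents of line are made up for illustration.

```shell
#!/bin/bash
line="copy   --force  src.txt  dest.txt"   # made-up input line

# -a fills an indexed array with the whitespace-separated words.
read -r -a words <<< "$line"

echo "${#words[@]}"   # 4
echo "${words[1]}"    # --force
echo "$line"          # the original spacing is still available here
```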