
Softpanorama Bulletin
Vol 23, No.05 (May, 2011)


Perl string operations

Explicit conversion to string
Concatenation operator (dot)
Functions and operators to manipulate strings
Related functions (split, join, chr, oct, hex, etc).
Implementation of some additional useful functions (trim and scan)
Summary
Additional Reading

There are two ways of performing string operations in Perl -- procedural and non-procedural. Here we will discuss the procedural capabilities of Perl. The non-procedural (regular-expression-based) capabilities will be discussed in Chapter 5.

Procedural string handling in Perl is often simpler, more reliable and more easily debugged than other methods (debugging a complex regular expression is difficult -- especially for users without a couple of years of practice in this sport ;-). I strongly recommend that beginners use procedural string operations as widely as possible, unless the non-procedural capabilities are definitely a better fit for a particular task. In such cases you can often borrow a similar regular expression from a book or another Perl script and gradually modify it until it does what you expect. Use of non-greedy modifiers is highly recommended for simplifying regular expressions, as they are essentially a non-procedural notation for a sequential search for a substring in a larger string and as such are less prone to errors.

Notes:

Several string manipulation functions (tr, split, etc.) work on the default input variable $_ when no other argument is supplied. This is the so-called default scalar variable -- another interesting albeit slightly strange Perl creature -- useful and dangerous at the same time. The idea is that some operators and built-in functions read or change the value of this variable as a side effect of their invocation.
There are also strange omissions -- the classic head, tail and truncate functions are not implemented for strings (a crippled version of tail is available as chop and a crippled version of truncate as chomp, but generally it's better to use a regular expression for this task in Perl). If you need a better set of functions you can use a replication of the functions from REXX. See StringRexx - Perl implementation of Rexx string functions, available from search.cpan.org.
The function Perl uses to convert numbers to strings is effectively sprintf with the format "%.15g" on most platforms (the number of digits comes from the C library, so it can differ on unusual builds).

Please note that any string in Perl can be converted into an array and back (for example using the split and join functions), so if a given operation is simpler on arrays it might make sense to perform such a conversion and then convert the resulting array back to a string. This trick is also useful when working with words.
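As a small sketch of this string-to-array round trip (the sample sentence and the variable names are illustrative, not from the original text), the following reverses the order of the words in a line:

```perl
use strict;
use warnings;

my $line = "the quick brown fox";

# split with a single-space string splits on runs of whitespace
my @words = split ' ', $line;

# work on the array, then glue the pieces back into one string
my $reversed = join ' ', reverse @words;

print "$reversed\n";    # fox brown quick the
```

The reversal itself is a one-word array operation; doing the same directly on the string with substr and index would take a loop.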

All in all, Perl's capabilities for working with strings are among the best of any scripting language, and that is very important, as many problems can be viewed as (or converted to) a problem of string manipulation.

Explicit conversion to string

As we have already learned, the " (double quote) is not a string delimiter in Perl -- it is an operator that concatenates everything within its scope, performing variable substitution. Scalars (simple variables and elements of arrays and hashes) are substituted, and so are whole arrays, whose elements are joined with spaces; whole hashes are not. If you use some complex expression, chances are that it will be recognized correctly by the syntax analyzer, but you had better test this to be sure.

The result is interpreted as a string, so double quotes can be used to convert an operand to a string -- a surrogate of type casting.

The simplest example of using double quotes to force string conversion would be:

if ("$b" eq "$a") {
   print "String '$a' equals to '$b'";
} else {
   print "String '$a' not equal '$b'";
}

If the case of letters during comparison does not matter the best way is to perform preliminary explicit conversion either to lower case or to upper case using functions uc() (converts to upper case) or lc() (converts to lower case). Both functions convert operand to string. For example previous example can be generalized to:

if (lc($b) eq lc($a)) {
   print "Case insensitive string '$a' equals to string '$b'";
} else {
   print "string '$a' not equal to '$b'";
}

This is a useful idiom in Perl that helps to ensure correct comparison of strings that can be either in lower or in upper case (for example user input): string comparisons are case sensitive and often you do not know what case will be used (for example in answers to "Do you want to continue (yes/no)?", why would anybody want to distinguish "Yes", "yes", or even "yES"?).

Mainframe tradition dictates converting everything to upper case (the printers of the first mainframes did not have lowercase characters), but you can be more modern and use the Unix style, which dictates that everything be converted to lowercase :-).

As a side note I would like to mention that from the point of view of the language designer this possibility actually makes second set of comparison operators used in Perl redundant -- it would be simpler and less error prone to require explicit conversion by usage of quotes if string comparison is used and use one set of operators. But it's too late... Also unless the interpreter is smart that involves additional (and mostly redundant) processing.

Still, the solution adopted in Perl is problematic, especially for users who use several other languages in addition to Perl. Using the wrong comparison operator is one of the most common Perl errors: this proved to be a bad design decision similar to allowing the usage of an assignment in the if statement in C (like in if (i=1) { ... } ).

Concatenation operator (dot)

As we already discussed, the dot symbol denotes the concatenation operator in Perl. The operator takes two scalars and combines them into one scalar. Both the left and the right operand are converted to strings. For example:

$line = $line . "\n"; # add a newline to the string  
$line ="$line\n"; # same thing

Like double quotes the concatenation operator can be used for casting a numeric value into string. The following (rather unrealistic) example demonstrates that when a numeric value that contains non-significant digits (0.00 in this particular case) is converted to string all non-significant digits are lost:

$a=0.00 . '';
if ( 0.00 . '' ) {
   print "$a is True\n";
} else {
   print "$a is False\n"; # Will print "0 is False"
}

A numeric literal in the program text is first converted to a double-precision floating-point value and only later converted back to a string. In the statement $a=0.00 . ''; the literal 0.00 becomes the number 0, which is then converted to the string "0" and concatenated with the empty string. The result is the string "0", which is false, and that is what the print statement shows.

All non-significant digits are lost if we convert a string to numeric representation and then back to string

Compare with:

$a="0.00"; 
if ($a) {
   print "$a is true\n" # will print "0.00 is true"
}

In the latter example the string 0.00 will never be converted to numeric and as such will be considered true in the if statement.

It is important to know that the concatenation operator enforces a scalar context, and this means that in case of an array the number of elements in the array will be substituted. For example (note that the array @ARGV represents all arguments passed on the command line to the script):

print "Arguments:".@ARGV."\n"; # a very unpleasant error.

The intent seems to be to print all command line arguments, but the Perl interpretation is quite different. Please run this example and find out what will be printed. One of the possible correct solutions is:

print map { "$_\n" } @ARGV; # map function provides an implicit foreach loop.
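The difference between the contexts can be seen side by side in the following sketch. It uses a hypothetical array @args as a stand-in for @ARGV so that the result does not depend on how the script is invoked:

```perl
use strict;
use warnings;

my @args = ('-v', 'input.txt', 'output.txt');    # stand-in for @ARGV

# The dot forces scalar context: the array yields its element count
my $count_msg = "Arguments: " . @args;           # "Arguments: 3"

# Interpolation inside double quotes joins the elements with spaces
my $list_msg = "Arguments: @args";               # "Arguments: -v input.txt output.txt"

# An explicit join makes the separator visible and adjustable
my $join_msg = "Arguments: " . join(', ', @args);

print "$count_msg\n$list_msg\n$join_msg\n";
```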

The x Operator

This operator is called the string repetition operator and is used to repeat a string. All you have to do is put a string on the left side of the x and a number on the right side. Like this:

"----|" x 5 # This is the same as "----|----|----|----|----|"

You can use any string as the source including strings that contain newline, for example:

print ("Hello\n" x 5);
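Because the right operand can be any numeric expression, the repetition operator combines naturally with length. A minimal sketch (the report title is made up for illustration):

```perl
use strict;
use warnings;

my $title     = "Report";
my $underline = '=' x length($title);    # as many "=" as the title has characters
my $indent    = ' ' x 4;                 # four spaces
my $ruler     = "----|" x 3;             # "----|----|----|"

print "$indent$title\n$indent$underline\n$ruler\n";
```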

Functions and operators to manipulate strings

Contrary to popular belief and Perl hype, regular expressions are not the "universal opener", and in many cases a procedural solution is more transparent, more easily debugged and more modifiable in the future. Perl has a full assortment of PL/1 string operations including substr, length, index and tr (translate). These functions are very powerful and could probably accommodate 80% of the operations that you would ever want to perform on strings. If someone hates regular expressions for religious reasons, he/she can do almost everything without them (kind of a vegetarian diet, I think ;-)

There are also several other string manipulation functions that cover important special cases (split, chop, chomp, uc, lc, unpack etc.). They are not well designed and we will discuss them after the major functions.

Perl also has powerful array-related string functions like grep, sort, etc. We will discuss them in the part of this chapter devoted to arrays (3.2).

Please remember that one interesting idiosyncrasy of Perl is the concept of the so-called default argument, denoted as $_. If you do not supply an argument to certain functions, they will operate on the default argument $_.

Getting Substring: substr

Substr is a classic string manipulation function that, as far as I know, was first introduced in PL/1 in the early 1960s. This is the most important function for manipulating strings in Perl. One needs to understand it to be an effective Perl programmer.

Complete understanding of the substr function is really important in order to become an effective Perl programmer

Like PL/1, Perl provides the substr function (substring) to extract parts of a scalar (e.g. a string). In the typical case of substr invocation you specify three arguments (an optional fourth, the replacement string, is described further below):

String to be used
Starting position of the substring that you want to extract (can be negative)
Length of the substring

For example if you wanted to get the first character of a string:

 $name = "Nick";

 $initial = substr($name,0,1);
                     |   | |______ length
                     |   |________ starting position
                     |____________ name of the string

The second argument can be negative -- in this case the offset of the starting position will be calculated from the end of the string, not from the start of the string:

$last=substr($name, length($name)-1,1); # the last character 

$last= substr($name, -1,1); # same as above

Like in PL/1 and REXX omitting the last argument means that all characters till the end of the string (tail) will be taken.

$last=substr($name, -1) ; # same as above
$tail=substr($name,10); # all characters of the string starting from position 10

If you want, you can also use the substr function to replace any fragment of the string -- like in PL/1 and REXX, substr can be used on the left side of the assignment statement (such functions are called pseudofunctions or lvalue functions):

substr($name,0,1)=uc(substr($name,0,1)); # capitalize the first symbol like in ucfirst.

This pseudofunction or lvalue capability of substr is very useful. For example, we can also chop off the last character of the scalar $name:

substr($name, -1) = '';  # will truncate the string $name by one letter

This is actually more flexible than the chop function, as the number of characters we need to chop can be a variable:

substr($name, -$k) = ''; # will truncate the string $name by $k letters

Here we used negative subscript to count backwards from the end of the string. You can achieve the same result using negative value of length parameter in substr function, which will be interpreted as length of the string minus this offset:

$name=substr($name, 0,-2);  # will truncate the string $name by two letters

Here you can see that in Perl substr functions the negative third argument was interpreted in a similar way to negative second argument -- as offset from length of the string (length($name)-2) .
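The combinations of positive and negative offsets and lengths discussed so far can be summarized in one short sketch (the variable names other than $name are illustrative):

```perl
use strict;
use warnings;

my $name = "bezroukov";                  # 9 characters, offsets 0..8

my $first3 = substr($name, 0, 3);        # "bez": three characters from the start
my $last3  = substr($name, -3);          # "kov": from the third character from the end
my $middle = substr($name, 3, -3);       # "rou": skip 3 at the start, leave 3 off the end

my $copy = $name;
substr($copy, -3) = '';                  # lvalue form: cut the last three characters

print "$first3 $last3 $middle $copy\n";  # bez kov rou bezrou
```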

If you note that substr($name,0,0) is the very beginning of the string it is clear that you can add prefix to the string using substr:

$name='bezroukov';
substr($name, 0,0)='Nick '; # will add the first name to the last

Another interesting idiom is the conversion of the first letter to upper case (a kind of generalized uc function, as we can convert not only the first letter but any number of letters in any part of the string). For example:

substr($name,0,3) =~ tr/a-z/A-Z/; # convert the first three letters to uppercase

The important difference between substr in PL/1 and REXX and substr in Perl is that the Perl version can substitute a new string for the deleted one. This is a semi-useful generalization, as it borders on overcomplexity and there is a nice simple idiom using regular expressions for substitution (with search):

s/search_string/replacement_string/;

(see Chapter 5).

Here is the man entry for substr that describes this possibility (the bold is mine -NNB):

substr EXPR,OFFSET,LEN,REPLACEMENT

substr EXPR,OFFSET,LEN

substr EXPR,OFFSET

Extracts a substring out of EXPR and returns it. First character is at offset 0, or whatever you've set $[ to (but don't do that).
If OFFSET is negative (or more precisely, less than $[), starts that far from the end of the string.

If LEN is omitted, returns everything to the end of the string. If LEN is negative, leaves that many characters off the end of the string.

If you specify a substring that is partly outside the string, the part within the string is returned. If the substring is totally outside the string a warning is produced.

You can use the substr() function as an lvalue, in which case EXPR must itself be an lvalue. If you assign something shorter than LEN, the string will shrink, and if you assign something longer than LEN, the string will grow to accommodate it. To keep the string the same length you may need to pad or chop your value using sprintf().

An alternative to using substr() as an lvalue is to specify the replacement string as the 4th argument. This allows you to replace parts of the EXPR and return what was there before in one operation, just as you can with splice().

As you can see, Perl's substr can take a fourth argument -- the "replacement string". This duplicates the ability to use substr on the left side of an assignment statement and as such might well be considered an example of useless or harmful "innovation". For example:

$string='abba';
substr($string,1,2)='vv'; # produced "avva".

is equivalent to

$string='abba';
substr($string,1,2,'vv'); # same thing

One marginally useful example where the fourth argument makes some sense is inserting a substring at a certain position of the string, for example:

$a="world";
$b=substr($a,0,0,"Hello "); # note that the length can be different
print $a; # will print "Hello world"
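The one thing the four-argument form does that the lvalue form does not is return the text that was replaced. A minimal sketch with made-up strings:

```perl
use strict;
use warnings;

my $greeting = "Hello world";

# Replace the first five characters and capture what was there before
my $old = substr($greeting, 0, 5, "Goodbye");

print "$old -> $greeting\n";    # Hello -> Goodbye world
```

The replacement is longer than the fragment it replaces, so the string grows, exactly as described in the man entry quoted above.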

For some reason unknown to me, the substr function does not affect the default variable $_:

$a='abba';
$_='';
substr($a,0,1);
print "$_\n"; # Will not print the first letter of the string

In some cases you can use sprintf (see below) instead of substr. It is convenient, for example, for putting variables in predefined places in a dynamically generated command. For example, there are some difficulties in working with UNIX permissions, as they are octal and can be mangled if Perl prints them in decimal, so using sprintf in this case is simpler:

$perm=0755;
$string = sprintf ("/bin/chmod %o $target/*", $perm);
`$string`;

We will discuss sprintf in more detail below.

String searching: index and rindex functions

There are two functions for searching for a substring in a string -- index and rindex. Let's quote the Perl man page for a more precise definition:

index STR,SUBSTR,POSITION

index STR,SUBSTR

The index function searches for one string within another, but without the wildcard-like behavior of a full regular-expression pattern match. It returns the position of the first occurrence of SUBSTR in STR at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string. The return value is based at 0 (or whatever you've set the $[ variable to--but don't do that). If the substring is not found, returns one less than the base, ordinarily -1.

The index function searches its first operand (the string) for its second operand (the substring) and returns the offset of the first occurrence found. The rindex function returns the offset of the last occurrence found. Both return -1 if the substring is not found, which looks logical as -1 is the index of the last character of the string and that's where unsuccessful matching stops.

index always returns the offset counted from the beginning of the string, even if the third argument is present. If the substring is not found the result is -1 (zero corresponds to the first letter of the string).

As one can see, the index function is not greedy -- it finds the first substring in the string that matches the argument and stops at this point. Therefore it is often simpler to use it for finding relevant parts of a string than regular expressions (although for this purpose you can specify non-greedy matching in regular expressions too).

Let's assume the variable $string contains the value abracadabra. Here are some examples:

$string="abracadabra";
print index ($string, "ab")."\n";     # will print 0 (the 1st letter has index 0)
print index ($string, "abc")."\n";    # will print -1
print index ($string, "ab", 2)."\n";  # will print 7 (starting pos is 2)
print index ($string, "ra", 3)."\n";  # will print 9
print rindex ($string, "ab");         # will print 7 (last "ab" in the string)
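The third argument makes it easy to walk through every occurrence of a substring: restart the search one position past the previous hit until index returns -1. A minimal sketch (the names @positions and $pos are illustrative):

```perl
use strict;
use warnings;

my $string = "abracadabra";
my @positions;

# Find every "a": each search starts just after the previous match
my $pos = index($string, "a");
while ($pos != -1) {
    push @positions, $pos;
    $pos = index($string, "a", $pos + 1);
}

print "@positions\n";    # 0 3 5 7 10
```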

If you have a string that contains double quotes and want to interpolate variables in it, it's more convenient to use the qq operator instead of double quotes, like in

$a=qq(<font face="$font" color="$color">);

This is the best way to avoid errors connected with forgetting to escape all double quotes in such strings. Compare example above with

$a="<font face=\"$font\" color=\"$color\">";

Also please remember that a double backslash in double quoted literals represents just one backslash.

$path="C:\\SFU\\bin";
if ( index($path,"\\SFU\\") >-1 ) {
   print "The directory belongs to Microsoft Services for Unix\n";
}

Often one needs to extract the file name at the end of the path. You might do this by searching for the last slash using the rindex function and then using substr to return the substring. For example:

$fullname = qq(C:/WINDOWS/TEMP/SOME.DAT);
$d=index($fullname,':'); # offset of the colon after the drive letter
$drive=substr($fullname, 0, $d);
$p = rindex($fullname, '/') + 1; # index of the first letter after the last slash
$fname = substr($fullname, $p); # note that we use 2 arguments

print("File $fname is on the drive $drive\n");

Note that in the example above we used a special form of substr invocation -- if the third parameter (the length) is not supplied to substr, it simply returns the substring that starts at the position specified by the second parameter and runs to the end of the string. By omitting the third argument we can avoid errors when we miscalculate the length of the substring.

The important innovation of the Perl implementation of the index function in comparison with PL/1 and REXX is that you can specify the starting position of the search. Unlike in substr, a negative position is not counted from the end of the string: a position before the beginning of the string is treated as the beginning.

Another important difference is that if the substring is not found index returns -1, not 0. This is a pretty logical design decision, as it corresponds to the index of the last element of the string.

Assembling string from components, sprintf function

The sprintf function is not often considered a string manipulation function in Perl textbooks. It is usually studied along with the printf function in the chapter devoted to I/O operations. But in reality it is an important string manipulation function that overlaps with substr, concatenation (dot), length and other string processing functions. It is indispensable when you need to assemble a string from one or several numeric values, or from a mixture of numeric values (each using a particular format) and strings.

The sprintf function accepts two arguments:

FORMAT -- a string that serves as a template for the result of assembly using the classic C printf format specification. The format tags follow this prototype:
```
%[flags][width][.precision][length]specifier
```
Important: If the format string is exhausted before the list of arguments is exhausted, the interpretation of the format string is not restarted from the very beginning.
LIST -- comma-delimited list of values or variables

Here is the basic information about the format string from the sprintf man page. For more information see the sprintf() man page. Perl's sprintf() permits the following classic specifiers:

%c a character with the given number
%s a substring of the string or string justified in a certain manner. For example
- %-3s A string left-justified in a field at least 3 characters wide
- %.3s A string, but at most 3 characters of that string are printed
%d a signed integer, in decimal
%u an unsigned integer, in decimal
%o an unsigned integer, in octal
%x an unsigned integer, in hexadecimal
%e a floating-point number, in scientific notation
%f a floating-point number, in fixed decimal notation
%g a floating-point number, in %e or %f notation

Perl also supports several additional useful specifiers:

%X like %x, but using upper-case letters
%E like %e, but using an upper-case "E"
%G like %g, but with an upper-case "E" (if applicable)
%b an unsigned integer, in binary
%B like %b, but using an upper-case "B" with the # flag
%p a pointer (outputs the Perl value's address in hexadecimal)
%n special: *stores* the number of characters output so far

Like C Perl permits the following flags between the % and the specifiers letter:

space -- prefix positive number with a space
+ -- prefix positive number with a plus sign
- -- left-justify within the field
0 -- use zeros, not spaces, to right-justify
# -- prefix non-zero octal with "0", non-zero hex with "0x"
number -- minimum field width
.number -- "precision": digits after decimal point for floating-point, max length for string, minimum length for integer
l -- interpret integer as C type "long" or "unsigned long"
h -- interpret integer as C type "short" or "unsigned short"

There is also one Perl-specific flag v (vector flag ):

This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string. Perl applies the format to each integer in turn, then joins the resulting strings with a separator (a dot . by default). This can be useful for displaying ordinal values of characters in arbitrary strings:

printf "%vd", "AB\x{100}"; # prints "65.66.256"

printf "version is v%vd\n", $^V; # Perl's version

Put an asterisk * before the v to override the string to use to separate the numbers:

printf "address is %*vX\n", ":", $addr; # IPv6 address

printf "bits are %0*v8b\n", " ", $bits; # random bitstring

You can also explicitly specify the argument number to use for the join string using something like *2$v; for example:

printf '%*4$vX %*4$vX %*4$vX', @addr[1..3], ":"; # 3

Perl also supports several macros and aliases to existing format items for backward compatibility. Among them:

%i a synonym for %d
%D a synonym for %ld
%U a synonym for %lu
%O a synonym for %lo
%F a synonym for %f

Where a number would appear in the flags, an asterisk ("*") may be used instead, in which case Perl uses the next item in the parameter list as the given number (that is, as the field width or precision). If a field width obtained through "*" is negative, it has the same effect as the "-" flag: left-justification.

Function sprintf is extremely convenient if you want to assemble a formatted string from several numeric variables or mixture of string and numeric variable. For example:

my $timestamp = sprintf( "%4d/%02d/%02d %2d:%02d:%02d",
                         $yr,$mon,$day, $h, $m, $s);

In that example, with $yr=2012, $mon=1, $day=19, $h=8, $m=0 and $s=8, the variable $timestamp will be initialized with the value "2012/01/19  8:00:08" (the %2d format pads the hour with a space, not with a zero).

Please note that the format string uses a leading zero in the format component %02d, which repeats several times. The leading zero means that if the number has fewer digits than the specified width, it will be filled from the left with leading zeroes as needed to reach that width (in this example two positions).
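Width, justification and precision can be combined in one format to produce aligned columns. The following sketch uses made-up data rows; only the sprintf call itself is the point:

```perl
use strict;
use warnings;

# Hypothetical rows: item name, quantity, unit price
my @rows = ( [ "apples", 3, 0.5 ], [ "watermelon", 12, 3.25 ] );

my @lines;
for my $row (@rows) {
    my ($item, $qty, $price) = @$row;
    # %-12s  left-justified string, at least 12 wide
    # %4d    right-justified integer, at least 4 wide
    # %8.2f  fixed-point number, 8 wide, 2 digits after the point
    push @lines, sprintf("%-12s|%4d|%8.2f", $item, $qty, $price);
}

print "$_\n" for @lines;
# apples      |   3|    0.50
# watermelon  |  12|    3.25
```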

Specifying currency values

In case of currency values you typically need to render the number with a certain number of places after the decimal point, for example two in the case of dollars and cents. You can accomplish this using the "%.2f" format:

my $money = sprintf "%.2f", 5.9997; # "6.00"

This rounds the value to two decimal places. Unless the calculation itself is specified in rounded amounts, you are better off keeping numbers in memory with all of the available accuracy, rounding off only when you need to output them as strings.

Often for large monetary values a special "money format" is needed, in which every group of three digits before the decimal point is marked with a comma. In other words, commas are used to distinguish dollars, thousands of dollars and millions of dollars. sprintf has no specifier for this. In particular, the %v flag mentioned above does not produce it: %v formats the ordinal value of each character of a string and joins the results with a separator, so it is suited to version strings and addresses, not to grouping digits. Commas have to be inserted in a separate step after sprintf has done the rounding.

The %v examples given in the description of the vector flag apply here unchanged. Note only that to capture the formatted result in a variable you must call sprintf, not printf: printf prints the string and returns a success flag.
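One procedural way to get the comma-separated money format is to let sprintf do the rounding and then insert the commas with substr. This is only a sketch: the helper name commify is hypothetical, and it assumes plain decimal amounts (no exponents, no locale-specific separators):

```perl
use strict;
use warnings;

# Hypothetical helper: format a number as dollars and cents with commas
sub commify {
    my ($amount) = @_;
    my $text = sprintf("%.2f", $amount);              # round first
    my $stop = substr($text, 0, 1) eq '-' ? 1 : 0;    # do not go past a leading minus
    my $pos  = index($text, '.');                     # start at the decimal point
    while ($pos - 3 > $stop) {
        $pos -= 3;
        substr($text, $pos, 0) = ',';                 # zero-length lvalue inserts a comma
    }
    return $text;
}

print commify(1234567.891), "\n";    # 1,234,567.89
print commify(-1234.5), "\n";        # -1,234.50
print commify(999.5), "\n";          # 999.50
```

The loop works from right to left, so the offsets to the left of the insertion point stay valid after each comma is added.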

Length of the string: function length

Built-in function length returns the length of a string.

$password = <>;
chomp($password); # remove the trailing newline so that it is not counted
if ( length($password) < 8 ) {
   print "bad password chosen, please use a longer one";
}

In this example, the function length counts the number of characters in the scalar variable $password.

Regrettably length cannot be used as a pseudo-function (on the left side of the assignment statement) like substr, although it would be useful to be able to truncate a string using this shorthand. Instead one needs to use the substr function, like in the following example:

$a=substr($a,0,$n); # truncation of the string $a to $n letters

If no scalar is specified, the length function returns length of $_. Note that length requires scalar argument and contrary to proclaimed Perl philosophy it will not work as you might expect with array or hash. To find out how many elements array or hash have one might use scalar(@array) and scalar(keys %hash) respectively.
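The three kinds of "size" mentioned above can be put next to each other in a short sketch (the data and variable names are made up for illustration):

```perl
use strict;
use warnings;

my @list = ("one", "two", "three");
my %age  = (alice => 30, bob => 25);

my $n_elems     = scalar(@list);              # number of array elements: 3
my $n_keys      = scalar(keys %age);          # number of hash keys: 2
my $len_first   = length($list[0]);           # characters in one element: 3
my $total_chars = length(join('', @list));    # characters in all elements: 11

print "$n_elems $n_keys $len_first $total_chars\n";    # 3 2 3 11
```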

Translate function (tr)

The tr function (actually this is an operator ;-) allows character-by-character translation with several enhancements. It takes two arguments: the source character set and the target character set. The syntax is rather strange and belongs to the "Perl warts", as it does not fit well into the general framework of string manipulation functions. That can be explained by the fact that the tr operator is derived from the UNIX tr utility. The UNIX sed utility uses y for this operation -- it is supported as a synonym for tr.

The string to be modified is not supplied as a parameter, but is taken from the $_ variable, for example:

 tr/a/z/; # change all "a" into "z"

The following statement replaces each of the digits 2 through 8 with 9. This can sometimes be a useful parsing or data scrambling technique. Of course this encoding is not really helpful, but it will do for illustration purposes.

Unless it is bound to another variable with =~, tr modifies the content of the variable $_.

The function returns the number of substitutions made, not the translated string as we might expect.

$_='Test string 123456789123456789123456789';
$k=tr/2345678/9/; # $k will contain the number of substitutions made

Unlike index and substr the tr function returns not the translated string,
but the number of substitutions made.
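Since tr changes its operand in place and returns only a count, the usual way to keep the original string is to translate a copy. A minimal sketch (the variable names are illustrative):

```perl
use strict;
use warnings;

my $original = "Hello, World";

my $upper = $original;                    # work on a copy
my $count = ($upper =~ tr/a-z/A-Z/);      # returns how many characters were translated

print "$original -> $upper ($count changed)\n";
# Hello, World -> HELLO, WORLD (8 changed)
```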

If you specify more than one character in the match character list, you can translate multiple characters at a time. For example:

tr/0123456789/9999999999/; # replace all digits with 9

translates all digits into the 9 character. If the replacement list of characters is shorter than the source list of characters, the last character in the replacement list is repeated as often as needed. That means that we can rewrite the statement above as:

tr/0123456789/9/; # same as above

If the same character is listed more than once in the source list, so that more than one replacement character is given for it (this is a stupid idea because the arguments are sets, but it can happen if the sets are generated automatically and the corresponding check is not in place), only the first is used. The rest of the replacement list is ignored. For instance:

tr/999/123/;

results in all characters "9" in the string being converted to the character "1". So it's equal to

tr/999/1/;

The translation operator doesn't perform variable interpolation, for example:

$from_set="0123456789";
$to_set  ="ABCDEFGHIJ";
tr/$from_set/$to_set/; # does not work

The translation operator doesn't perform variable interpolation.
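When the character sets are only known at run time, the workaround described in the Perl documentation is to build the whole tr statement as a string and run it through eval. The sketch below reuses $from_set and $to_set from the example above; $text is a made-up sample. It assumes the sets come from a trusted source and contain no slash or backslash characters, since eval executes whatever code the string contains:

```perl
use strict;
use warnings;

my $from_set = "0123456789";
my $to_set   = "ABCDEFGHIJ";
my $text     = "call 555-0199";

# The sets are interpolated into the source text of the statement;
# \$text is escaped so that the variable itself is seen by eval
eval "\$text =~ tr/$from_set/$to_set/; 1" or die $@;

print "$text\n";    # call FFF-ABJJ
```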

The translation operator has several useful options: you can delete matched characters, replace repeated characters with a single character, and translate only characters that don't match the character list (see the table below).

Historically the translate function is considered to be one of the pattern matching operators. That is untrue, but as you will see the syntax is derived from the (also pretty strange) match and substitute operators that we will study in Ch. 5. At the same time the translation function operates on character sets, not on regular expressions. The delimiter can vary, but slashes are most commonly used (slashes are also used in Perl 5 for regular expressions). Most of the special regular expression codes are not applicable.

However, like in regular expressions the dash is used to mean "between". This statement converts $_ to upper case.

 tr/a-z/A-Z/; # again this is not the best way to do it. Use uc() instead

Please note that Perl 4 did not have lc and uc functions. Therefore the tr function was often used to convert case. If you see this idiom in the script that probably means that the script was initially written for Perl 4. The example above that converts all digits to 9 can be rewritten as

tr/0-9/9/; # the shortest way to replace all digits with 9

Deleting characters from the source set

If the target set contains no characters and you use the modifier d, the operation deletes the characters of the source set:

tr/.,;://d;

You do need to specify an empty second set with option d, or the function does not work as expected:

# cat test
   $test='test ';
   print "Before test 1: |$test|\n";
   $test=~tr/ / /d;
   print "After: test 1: |$test|\n";
   $test='test ';
   print "Before test 2: |$test|\n";
   $test=~tr/ //d;
   print "After test 2:  |$test|\n";
# perl test
Before test 1: |test |
After: test 1: |test |
Before test 2: |test |
After test 2:  |test|

Counting characters using tr

If the new set is empty and there is no d option, then the new set is assumed to be equal to the old one and the function does not modify the source string -- it only returns the number of matched characters, so it can be used for counting characters from the specified set in the string.

For example, the statements below count the number of dots in the variable $_ (the dot is a special character in regular expressions, but not in tr) and store the result in the variable $total.

$_="131.1.1.1";
$total = tr/.//;

Another, slightly more complex example counts characters from a whole set:

$k=tr/0-9//; # counts number of digits in the string $_

You can specify set not only directly, but using the idea of complement set operation:

$k=tr/0-9//c; # will count all non-digit characters
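
The three counting forms can be checked side by side. This is a small sketch that binds tr to a named variable instead of $_; the variable names are illustrative:

```perl
use strict;
use warnings;

my $ip = "131.1.1.1";

my $dots       = ($ip =~ tr/.//);      # counts dots
my $digits     = ($ip =~ tr/0-9//);    # counts digits
my $non_digits = ($ip =~ tr/0-9//c);   # counts everything that is not a digit

print "dots=$dots digits=$digits non-digits=$non_digits\n";
# prints: dots=3 digits=6 non-digits=3
print "$ip\n";   # the string itself is unchanged: 131.1.1.1
```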

If you use tr to parse the string into lexical elements then you may not want repeated characters after transliteration. In this case one can use option s, which squeezes each run of identical translated characters into one. This permits easy building of primitive lexical parsers:

tr/0-9/9/s; tr/a-zA-Z_/A/s; # each run of digits becomes 9, each run of letters and underscores becomes A

Normally, if the match list is longer than the replacement list, the last character in the replacement list is used as the replacement for the extra characters. However, when the d option is used, the matched characters are simply deleted.

If the replacement list is empty, then no translation is done. The operator will still return the number of characters that matched, though. This is useful when you need to know how often a given letter appears in a string. This feature also can compress repeated characters using the s option.

Options

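Here is the list of the classic options (Perl 5.14 and later also accept r, which returns the translated copy and leaves the original string unchanged):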

Option	Description
`c`	This option complements the source character set. In other words, the translation is done for every character that does not match the source character set.
`d`	This option deletes any character in the source character set that does not have a corresponding character in the target character set.
`s`	This option reduces repeated sequences of the same character in the output to a single instance of that character.

For example, ROT13 is a simple substitution cipher that is sometimes used for distributing offensive jokes and other potentially objectionable materials on Usenet. This is a Caesar cipher with the value of key equal to 13 (A->N, B->O etc.). Using the tr function for decoding ROT13 is an interesting example because the target set is constructed by concatenation of disjoint character subranges n-z and a-m (or N-Z and A-M for upper case). Note that the ranges are written without square brackets, because tr would treat brackets as ordinary characters:

tr/a-zA-Z/n-za-mN-ZA-M/;
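
A quick way to check a ROT13 table is to apply it twice and confirm that the original text comes back. The sketch below uses the sets written without brackets, since tr treats [ and ] as ordinary characters; the variable names are illustrative:

```perl
use strict;
use warnings;

my $plain = "Hello, World!";

# Copy first, then translate the copy, so $plain stays intact.
(my $coded = $plain) =~ tr/a-zA-Z/n-za-mN-ZA-M/;
(my $back  = $coded) =~ tr/a-zA-Z/n-za-mN-ZA-M/;

print "$coded\n";   # prints: Uryyb, Jbeyq!
print "$back\n";    # prints: Hello, World!
```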

UNIX programmers may be familiar with using the tr utility to convert lowercase characters to uppercase characters, or vice versa. Do not do that -- Perl 5 has the lc() and uc() functions for this purpose.

For complex transliterations the tr/// syntax is bad. One of the problems is that the notation doesn't actually show which characters correspond, so you have to count characters. For example:

tr/abcdefghijklmnopqrstuvwxyz/VCCCVCCCVCCCCCVCCCCCVCCCCC/

But in Perl there is a way to make this example more readable using different delimiters:

    tr[abcdefghijklmnopqrstuvwxyz]
      [VCCCVCCCVCCCCCVCCCCCVCCCCC]

If the first string contains duplicates, then the first corresponding character is used, not the last:

       tr/aeioua-z/VVVVVC/
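
The effect of the duplicate rule is easy to see on a sample word. In this sketch (the word and variable names are illustrative) the vowels keep their first mapping to V, and every other letter falls through to the last replacement character C:

```perl
use strict;
use warnings;

my $word = "education";

# a, e, i, o, u appear twice in the source set (once explicitly,
# once inside a-z); only their first mapping, to V, is used.
(my $shape = $word) =~ tr/aeioua-z/VVVVVC/;

print "$shape\n";   # prints: VCVCVCVVC
```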

Truncating last characters in strings: functions chop and chomp

Those two are special and very limited functions that do no credit to the Perl language designers. They do just one thing and were not generalized to cover important similar situations. Those two functions are:

chop -- chops the last character off the end of a scalar and returns it.
chomp -- a conditional chop which removes a newline (or whatever string is defined by $/) from the end of the string if it is present.

Chop function

The built-in function chop removes the last character from a scalar (or from every element of an array) and returns it. Why just one, and why I cannot chop, say, ten characters is a mystery to me (the analogous array function pop is just as limited; to remove several trailing elements you need splice, see part 3.2 of this chapter). I suspect that this is more an "optimization trick" than anything else. Of course chop can be imitated with the substr function, but the case is rather frequent and deserves a special "short-cut" function that does not allocate a new string but performs the operation "in place".

The most popular use is for comparing a string with a line read from a file, which arrives with the newline as part of the input (chomp is a safer and better function here). For example:

echo OK | perl -e '$a =<>;
if ($a eq "OK"){
   print "equal\n"
}else{
   print "non equal\n";
}'

This prints "non equal", because $a still ends with the newline that was read from the input. To compensate we first need to remove the last character (newline) with chop($a); but it's better to use chomp (see below).

Because scalars are converted between string and numeric representation on demand, chop can even be used for integer division of a non-negative whole number by 10, for example:

$n = 128;
chop($n); # $n is now 12 (128 divided by ten!); chop itself returns '8'

The function returns the character that was dropped, and there is no way to make it return the chopped string, which is somewhat unfortunate and can lead to errors like:

$n = chop($n); # the string truncated by one character was probably expected here

This is an error, because it returns just the last character of the string (the character that was chopped) not the truncated string. To return the truncated string one should use substr($string, 0, -1).

At the same time, when the order in which you process the characters of a string is not important, chop can be used for processing the string character by character (from the end), the same way as substr($string,$i,1) is usually used in the forward direction.
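
A minimal sketch of this backward scan (variable names are illustrative). Note that the loop consumes the string, so work on a copy if the original is still needed:

```perl
use strict;
use warnings;

my $s = "abc";
my @seen;

# chop returns the removed character and shortens $s in place,
# so the loop ends when the string becomes empty.
while (length $s) {
    push @seen, chop $s;
}

print join(",", @seen), "\n";   # prints: c,b,a
```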

Conditionally truncating characters: chomp

Chomp is another "Perl wart". It usually removes the newline if one exists. Like chop, chomp works both for scalars and arrays. This is essentially a very limited version of the trim function as it is known in REXX. Trim as used in REXX is a function that removes repeated first and/or last characters in a string (be it newline or blanks or whatever). Chomp works only with one trailing string, the one defined by $/ (a plain string, not a regular expression). The documentation describes it this way:

This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module). It returns the total number of characters removed from all its arguments. It's often used to remove the newline from the end of an input record when you're worried that the final record may be missing its newline. When in paragraph mode ($/ = ""), it removes all trailing newlines from the string. When in slurp mode ($/ = undef) or fixed-length record mode ($/ is a reference to an integer or the like, see perlvar) chomp() won't remove anything. If VARIABLE is omitted, it chomps $_. Example:
while (<>) {
	chomp;	# avoid \n on last field
	@array = split(/:/);
	# ...
}
You can actually chomp anything that's an lvalue, including an assignment:
chomp($cwd = `pwd`);
chomp($answer = <STDIN>);

If you chomp a list, each element is chomped, and the total number of characters removed is returned.

The function chomp is a conditional chop which is usually used for getting rid of newlines on the ends of Perl input. Let's say you define the 'special character' to be "\n" (a newline). Then statements such as the following remove the trailing newline from $example:

$example = "This has a line with a newline at the end\n";
chomp($example);

In other words, chomp gets rid of the newline only, not of any last character like chop. If the string does not contain a newline at the end it will remain unchanged:

$example = "This doesn't have a newline";
chomp($example);

That makes chomp safer than chop.

Actually it does not need to be a newline -- newline is simply the default value of the special variable $/ (the input record separator), which contains the string that you want to be removed. It can be set to any value you want, as in:

$/ = "/"; $path = "/This/is/a/path/"; chomp($path); $/="\n";
print ($path); # will print '/This/is/a/path'

Please note that you need to restore the value of $/, unless you want to break a lot of scripts. And yes, it's ugly and should be just chomp("/This/is/a/path/","/"), but Perl is a pretty irregular language.
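
One way to avoid restoring $/ by hand is to change it with local inside a block, so that the old value comes back automatically when the block ends. A small sketch using the same path as above:

```perl
use strict;
use warnings;

my $path = "/This/is/a/path/";

{
    local $/ = "/";   # temporary value, visible only until the block ends
    chomp($path);     # removes one trailing "/"
}

print "$path\n";      # prints: /This/is/a/path
# Here $/ is "\n" again, so later reads and chomps behave normally.
```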

Manipulating Case: uc() and lc(), ucfirst(), lcfirst()

Function uc() returns an uppercase version of the string that you give it. For example, if you say something like:

$name = uc("Hello");
print $name; # this will print 'HELLO'

Function ucfirst() returns a capitalized version of the string:

$name = ucfirst("hello");
print $name; # prints "Hello";

Given the absence of head, tail and truncate functions in Perl, the presence of ucfirst looks arbitrary and probably redundant, as it can be implemented using substr. To increase the usefulness of the function it would be wise to generalize it so that it can capitalize not only the first letter but any substring, specified by second and third arguments.

Symmetrically lc() and lcfirst() return lowercased versions of strings. lc returns all lowercase. Function lcfirst() makes the first character uncapitalized -- sometimes useful for names, but again this is a very limited application and probably function needs some generalization.

One frequent use of ucfirst and lc is to get a capitalized word:

$word=ucfirst(lc($word));

This combination of ucfirst with lc is useful for other string formatting tasks. For example, let's assume that we need to format a string as a title (with each word starting with a capital letter). Here is a very simple solution for this problem:

@words=split(/\s+/,$title);
foreach $w (@words) {
   $w=ucfirst(lc($w)); # we are using side effect of foreach loop: $w is an alias for the array element
} 
$title=join(' ',@words);

Usually articles like "a" and "the" are not capitalized in titles so we can modify the code to accomplish this in the following way:

@words=split(/\s+/,lc($title));
foreach $w (@words) {  
   next if ($w eq 'a' || $w eq 'the');    
   $w=ucfirst($w);
} 
$title=join(' ',@words);

The same effect in a slightly more compact way can be achieved using map instead of foreach loop. This modification we leave as an exercise for the reader.

Related functions

Function split is discussed in array operations because it is essentially a string parsing function (see Split function). It breaks up a string based on some delimiter (which can be a regular expression). In list context it returns a list of the things that were found. In scalar context it returns the number of things found.

As the most interesting cases involve using regular expressions it is discussed in Chapter 5 ( 5.5 Perl Split function).

Function chr(NUMBER) returns the character represented by NUMBER in the ASCII table. For instance, chr(65) returns the letter A.

Function join(STRING, ARRAY) -- Returns a string that consists of all of the elements of ARRAY joined together using STRING as a delimiter. For instance, join(">>", ("AA", "BB", "cc")) returns "AA>>BB>>cc".

Function hex(EXPR) returns the decimal value of an expression interpreted as a hex string. If EXPR is omitted, uses $_. The hex function can handle strings with or without a leading 0x or 0X:

$x = hex("0xa2"); # $x is 162
$x = hex("a2"); # $x is 162
$x = hex(0xa2); # $x is 354 (!) -- the literal 0xa2 is the number 162, and hex("162") is 354

Function oct(EXPR) returns the decimal value of an expression interpreted as an octal string (a string starting with 0x is interpreted as hex). If EXPR is omitted, uses $_:

$x = oct ("042"); # $x is 34
$x = oct ("42"); # $x is 34
$x = oct ("0x42"); # $x is 66
$x = oct (042); # $x is 28 (!)
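
These conversion functions combine naturally with chr and join. A small sketch (the input data and variable names are illustrative) that decodes a list of hex strings into text and converts a Unix permission string to a number:

```perl
use strict;
use warnings;

# Hex strings with and without the 0x prefix, in either letter case.
my @codes = map { hex($_) } qw(48 0x65 6c 6C 6f);
my $word  = join('', map { chr($_) } @codes);

# Always pass oct a string: a bare literal 0755 is already converted
# by the compiler and would be converted a second time.
my $perm = oct("0755");

print "$word\n";   # prints: Hello
print "$perm\n";   # prints: 493
```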

Implementation of some additional useful functions (trim and scan)

The first frequent operation that is not among the built-in functions of Perl is trim, with its variants ltrim and rtrim -- removal of blanks from both ends of the string, or from the left or right end only. You can think about it as a generalization of chomp. You can implement it with regular expressions, for example:

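sub trim {
   s/^\s+|\s+$//g for @_;   # strip both ends of every argument, in place
   return;
}
sub ltrim
{
   s/^\s+// for @_;         # strip leading blanks only
   return;
}
sub rtrim
{
   s/\s+$// for @_;         # strip trailing blanks only
   return;
}

This implementation accepts one or several strings and applies the same operation to each. The arguments are modified in place, so call it as trim($a, $b) rather than $a = trim($a).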
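
The routines above work on their arguments in place. Where the original must be kept, a copy-returning variant is sometimes handier. A minimal sketch (the name trimmed is illustrative, it is not a Perl built-in):

```perl
use strict;
use warnings;

# Returns a trimmed copy; the caller's variable is left untouched
# because the argument is copied into $s first.
sub trimmed {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

my $raw   = "  two words \n";
my $clean = trimmed($raw);

print "|$clean|\n";   # prints: |two words|
```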

The second function that is often useful is scan. This function removes the first word from the string passed as an argument and returns that word as the result. If there are no words in the string the function should return an empty string.

sub scan {
   if ($_[0] =~ s/^\s*(\S+)\s*//) {   # also works when the word is the last one in the string
      return $1;
   } else {
      return '';
   } # if
}

You can generalize it to multiple arguments the way the previous function was implemented. That is left as an exercise for the reader.

Another useful function, the implementation of which can be a useful exercise for the reader, is subword. It should work like substr but count words instead of characters:

subword(string, n[, length])

Examples

$this_string=subword("Where is this string",3,2); # returns the string "this string"
$third_word = subword("Where is this string",3);

Summary

Perl has an impressive array of string manipulating functions that can supplement its regular expression-based string manipulation capabilities. Novices should probably avoid overusing regular expressions until they become more confident in understanding the associated semantics. When the task maps cleanly onto classic string functions like substr and index, using them also leads to clearer programs that are easier to modify and maintain.

Several important points:

Remember that substr and splice also accept negative offsets that count back from the end (index and rindex do not; they take only a non-negative starting position).
```
substr($string, -2, 2); # two last symbols of the string
```
Remember that substr can be used as lvalue. Moreover it can be used with regular expressions:
```
substr($s, -10) =~ s/ /./g;
```
There is no function that chops the first character of a string in Perl, although there is one that chops the last. To remove the first character you can use the substr function, but in most cases you are better off scanning the string and extracting character by character in typical C style.
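
The three points can be tried out together. This is a small sketch with illustrative data; the last statement uses the four-argument form of substr, which replaces the selected part (here with an empty string) and returns what was there before:

```perl
use strict;
use warnings;

my $s = "softpanorama";
my $tail = substr($s, -2, 2);        # negative offset: last two characters

my $t = "a b c d e f g";
substr($t, -5) =~ s/ /./g;           # lvalue substr: edit only the last five characters

my $first = substr($s, 0, 1, '');    # remove and return the first character

print "$tail\n";    # prints: ma
print "$t\n";       # prints: a b c d e.f.g
print "$first\n";   # prints: s
print "$s\n";       # prints: oftpanorama
```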

Questions

1. What will the following fragment print ?

$name='softpanorama';
if ( index($name, 's') > -1 ) {
   printf("String '%s' has 's' in it\n", $name);
}

2. What will the following fragment print:

$string='softpanorama'; 
@c = split(//, $string);
print "$c[0]$c[4]$c[2]$s[-3]$s[3]\n";

3. What will the following fragment print?

$str1='remember';
$str2='Perl';
$str3='warts';
$left = $str1 . " " . $str2 . " " . $str3;
$right = "$str1 $str2 $str3";

if ($left == $right) {
  print "strings are equal\n";
} else {
  print "strings are unequal\n"
}

Additional Reading

Strings as a datatype
Extensions
- String::Rexx - Perl implementation of Rexx string functions - search.cpan.org
- String::Strip - Perl extension for fast, commonly used, string operations - search.cpan.org
Programming:Perl Strings - Wikibooks, collection of open-content textbooks
Perl Notes
sprintf - perldoc.perl.org
Perl sprintf Function -- same man page with some additional examples
Perl String & Regular Expression Operations
Python & Perl Basic String Operations
Main Page - BioPerl

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Last modified: March 12, 2019
