Softpanorama |
May the source be with you, but remember the KISS principle ;-)
A good introduction to Perl lexical structure can be found in InformIT's "Perl's Building Blocks Numbers and Strings Literals", which is a reprint of a chapter from Sams Teach Yourself Perl in 24 Hours, 2nd Edition.
The lexical structure of a programming language is the set of rules that specify the lexical elements (tokens) from which any program can be constructed. These are the lowest level rules, and on the intuitive level they correspond to the first pass of the interpreter through the program.
On this level we specify such entities (tokens) as identifiers, whitespace, statement delimiters (how one program statement is separated from the next), constants, comments and delimiters. Comments are usually discarded during lexical analysis of the program.
This short chapter deals with the lexical structure of the language.
Perl lexical structure is closer to the lexical structure of Unix shell languages than to traditional high level languages and is pretty complex. Due to its shell heritage, Perl lexical structure differs in several important ways from the lexical structure of C-style languages like Java. For example some literals (double quoted) are further preprocessed and actually behave more like built-in functions -- actually they are special built-in functions. Similarity with shells is visible in other areas too (overuse of prefixes for variables). At the same time Perl has the most powerful and flexible facilities for specifying text literals of all languages that I know.
Perl is a case-sensitive language. This means that language words (called identifiers) should be typed with a consistent capitalization of letters. There are several types of identifiers:
language keywords (words that have a fixed meaning in the language and often start a statement)
variable names
function names
The if keyword, for example, must be typed in lowercase, and not If or IF. Similarly, $i and $I are two distinct variable names (as we will see, Perl uses $ as a prefix to some variables).
Whitespace consists of elements that can serve as identifier separators but are otherwise ignored, much like comments. Whitespace symbols include spaces, tabs, and newlines.
Like any normal language, Perl ignores whitespace that surrounds "lexical tokens" in programs. As we already know, a "lexical token" is a keyword, variable name, number, function name, etc. For example, the following three statements are equivalent:
$i=1;
$i = 1;
$i =
1;
At the same time you cannot use whitespace inside a token. If you place a space, tab or newline within a token, you will break it up into two tokens:
$i=5E2; # 5E2 is a single numeric token ( 5 *100 equal to 500)
$i=5 E2; # here 5E2 was split into two tokens by a space (syntax error).
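A minimal sketch of the whitespace rule, assuming a script run with strict and warnings enabled (the variable names $x, $y, $z and $n are only illustrative):

```perl
use strict;
use warnings;

# The same assignment written with different amounts of whitespace.
my $x=1;
my $y = 1;
my $z =
    1;

# 5E2 is one numeric token (5 * 10**2); a space inside it would split it in two.
my $n = 5E2;

print "x=$x y=$y z=$z n=$n\n";
```

All three assignments store the same value, and $n holds 500.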
Because you can use whitespace freely in your program, it is possible to indent your programs in a neat fashion (to beautify the program). That makes the code easy to read and understand. Special programs called beautifiers can be used for this purpose. Programming editors often can beautify code automatically too.
Comments are usually considered a special case of whitespace, and many languages treat an internal comment as a space. Perl does not have internal comments (comments that start and end on the same line and have language constructs before and after them). It has simplified "to the end of the line" comments that start with "#". This type of comment is well known in the Unix world because it is used in shells.
Perl does not have a dedicated multi-line comment syntax either. The usual way to comment out a block of code in Perl is to put # at the beginning of each line (a POD block such as =pod ... =cut is sometimes used as a workaround). For example
# this is a one-line comment
$i = 0; # this is an end line comment
# $i=0; # this is a statement that is commented out (note two # in this line)
|
Perl has rather inflexible and limited comments. The desire to preserve compatibility with Unix shells dictated the use of #, but here Perl robbed itself of an important symbol (that could be used, for example, for casting scalars into numeric). This difference creates problems for heavy users of C-like languages, and one can write a preprocessor that permits usage of C and C++ style comments. In any case, if you are a C programmer you need to check your scripts for wrong comments... |
Like most scripting languages, Perl doesn't specify the size and range of numeric literals. And to a certain extent there is no numeric data type in Perl. No byte, short and integer types are defined in the language (integer arithmetic was added in later versions of Perl but we will not discuss it here). In classic Perl all arithmetic operations are performed by default on double precision (64 bit) IEEE 754 floating-point numbers, which give about 15 significant digits in most calculations. You can change the default with the pragma use integer, but I saw very few scripts that use this feature.
If you need more control you probably need a different language.
The default base is ten, but you can specify numbers in octal notation (they start with zero) or hexadecimal notation (they have the prefix 0x). Underscore can be used in big numbers for readability.
12 # small integer in decimal notation
192_168_10_10 # underscore can be used in big numbers for readability
1123141.111 # all numbers are represented in floating point so it's ok
3.14e+2 # explicit floating point
3.14E2 # yet another floating point number
Base 16 (hexadecimal) numeric literals are possible.
0x1F # hexadecimal
0x1f # same as above
Octal numbers are also supported (a leading zero should be used):
037 # octal -- not decimal
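The notations above can be checked with a short script. This is only a sketch; the variable names are mine and are not used elsewhere in this chapter:

```perl
use strict;
use warnings;

my $dec   = 192_168_10_10;   # underscores are ignored by the parser
my $float = 3.14e+2;         # scientific notation
my $hex   = 0x1F;            # hexadecimal literal
my $oct   = 037;             # octal literal because of the leading zero

printf "%d %g %d %d\n", $dec, $float, $hex, $oct;
```

Both 0x1F and 037 denote the decimal value 31, which is why a leading zero on a decimal number is a classic source of confusion.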
Perl has a very flexible lexical syntax for literals. There are three types of static string literals:
Single quoted,
Double quoted
Back quoted (this is a special case of double quoted literal)
There are also four additional functions that can generate literals of these three types. And there is also a special multiline literal called a HERE document.
The single quote (') indicates that the text is to be used verbatim with minimum interpretation. A single quoted literal may span more than one line; the newlines then simply become part of the string. There are only two C-style escape sequences acceptable in single quoted string literals:
\\ -- backslash
\' -- single quote
Typical idioms:
print '\''; # you can't just put ' in the single quote literal
print 'C:\\WINDOWS\\COMMAND\\COMMAND.COM'; # backslashes are doubled
print '"'; # single quote-double quote-single quote
print 'this is "new" example'; # double quotes used inside
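To see how little is interpreted inside single quotes, here is a small sketch (the variable names are illustrative) that measures the resulting strings:

```perl
use strict;
use warnings;

my $quote = '\'';            # \' gives one character: '
my $path  = 'C:\\WINDOWS';   # \\ gives a single backslash
my $raw   = 'a\nb';          # \n is not an escape here: a, \, n, b

printf "%d %d %d\n", length($quote), length($path), length($raw);
```

The lengths are 1, 10 and 4: only \\ and \' are treated specially, everything else is taken verbatim.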
Double quoted literals are essentially expressions or special functions, not atomic entities. Double quotes are not the only possible delimiters: the qq() function (see below) is another way to specify them. In that case the initial character can be different and the last character should match it.
Unless you use $, @ or a backslash in the text of the literal, they are generally equal to single quoted literals. Like single quoted literals they can span more than one line. If a literal contains any $ symbol it will be additionally processed (variables will be replaced by their values -- the operation called variable interpolation in Perl, see below). For now I would just like to state that the script
$v="Hello world"; # note that variables start with $
print "$v\n"; # $v will be expanded and new line added to the end
print "v=$v\n"; # this is typical debugging statement for printing variable $v
print '$v\n'; # no interpolation in single quotes: prints the literal text $v\n
The first print statement will produce the same output as our first "hello world" program.
Generally it's more convenient to use them instead of single quoted literals because you can embed a newline character with the \n escape sequence (this is not possible in single quoted literals). Some examples:
print "'"; # doublequote-singlequote-doublequote -- you do not need backslash here
print "Nick's house\n"; # singlequote is OK inside double quoted literal
But this is not the only possibility -- see the qw(), qx(), qq(), and q{} functions below. The number of escape sequences in this type of literal is larger (see the table).
Table: Additional Escape Sequences for double-quoted literals
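As an illustration, here are a few of the escape sequences that double quoted literals accept. This is a sketch of sequences I am sure of, not a full list, and the variable names are only for illustration:

```perl
use strict;
use warnings;

my $tabbed  = "a\tb";            # \t is a tab: three characters in total
my $letter  = "\x41";            # hexadecimal character code: 'A'
my $octal   = "\101";            # octal character code: also 'A'
my $shout   = "\Uperl\E rocks";  # \U ... \E upper-cases the enclosed text
my $capital = "\uperl";          # \u upper-cases only the next character

print "$letter $octal $shout $capital\n";
```

None of these sequences is interpreted inside single quotes.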
Here is an example of an error in hex values:
$hexerror = '\x0b\x0f'; # error ! value is \x0b\x0f (single quotes should not be used)
It is also possible to use octal data in Perl, for example $octalData = "\07\04\00\01"; (here \01 is the character with octal code 1), but I do not recommend it. Usually it's better to use hexadecimal notation instead.
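The difference between the two quoting styles for binary data is easy to check by looking at string lengths. A minimal sketch ($wrong and $right are illustrative names):

```perl
use strict;
use warnings;

my $wrong = '\x0b\x0f';   # eight literal characters: \ x 0 b \ x 0 f
my $right = "\x0b\x0f";   # two characters with codes 11 and 15

printf "%d %d\n", length($wrong), length($right);
```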
The double quotes force macro substitution (for some reason called interpolation in Perl) of any scalar variables -- variables that start with $ (dollar sign)
$a="Hello";
$b="world";
print "$a $b"; # it will print the value of $a and the value of $b
This is yet another way to print the famous phrase "Hello world" in Perl. Details of processing double quoted literals are in Gory details of parsing quoted constructs.
A very typical mistake connected with this feature is putting an e-mail address (or a group of e-mail addresses) in double quotes or backquotes, for example
` cat letter | mailx -s test myself@mydomain.my`;
Here @mydomain will be interpreted as an array, with very undesirable results. The correct form should be
`cat letter | mailx -s test myself\@mydomain.my`;
or
$to_addr='myself@mydomain.my'; # single quotes: no interpolation, so @ needs no backslash
`cat letter | mailx -s test $to_addr`;
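The same effect can be seen without running any mail program. In this sketch the empty array @mydomain is declared only so that the unescaped version compiles under strict; it is a hypothetical stand-in for whatever array the script might happen to have:

```perl
use strict;
use warnings;

our @mydomain = ();                   # hypothetical (empty) array

my $broken  = "myself@mydomain.my";   # @mydomain is interpolated and vanishes
my $escaped = "myself\@mydomain.my";  # backslash protects the @
my $single  = 'myself@mydomain.my';   # single quotes never interpolate

print "$broken\n$escaped\n$single\n";
```

The first string ends up as myself.my, while the other two keep the full address.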
Backquoted literals
Back quoted literals are similar to double-quoted literals (interpolation is performed), but the result is considered a script that needs to be executed by the standard shell. Yes, it will be executed -- and that provides the programmer with a lot of non-trivial possibilities. But it's too early now to cover this item. We will discuss this type of literal later, but here is one example:
$my_homedir=`/bin/ls -l ~`;
Dynamically delimited literals
Perl also has an alternative syntax for the literals discussed above. I would call them dynamically delimited literals. In this case you specify the delimiter as the first character after the name of a special compile-time function (for brackets the closing delimiter needs to be the symmetrical bracket). Among these functions (all depicted with {} as delimiters):
q{} -- equivalent of single quotes, for example q{Hello worlds}
qq{} -- equivalent of double quotes, for example qq{Hello $word} # interpolation is possible
qw{} -- quoted word list (words are separated by whitespace, not commas), for example qw{Mn Ts Wn Th Fr St Sn}
qx{} -- equivalent of backticks, for example: qx{echo hello world}
qr{} -- compiled regular expression, for example qr{world}
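Here is a short sketch of these functions in use. It leaves out qx{} so that nothing is passed to the shell, and the variable names are only illustrative:

```perl
use strict;
use warnings;

my $word = 'world';

my $plain   = q{Hello $word};          # like single quotes: no interpolation
my $interp  = qq{Hello $word};         # like double quotes: $word is expanded
my @days    = qw{Mn Ts Wn Th Fr St Sn};  # list of seven words
my $pattern = qr{wor+ld};              # compiled regular expression
my $matched = ($interp =~ $pattern) ? 1 : 0;

# Any delimiter can be used; brackets nest.
my $nested  = q(balanced (nested) parens);

print "$plain | $interp | ", scalar(@days), " | $matched | $nested\n";
```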
Here is a relevant quote from the perlop man page:
While we usually think of quotes as literal values, in Perl they function as operators, providing various kinds of interpolating and pattern matching capabilities. Perl provides customary quote characters for these behaviors, but also provides a way for you to choose your quote character for any of them. In the following table, a {} represents any pair of delimiters you choose. Non-bracketing delimiters use the same character fore and aft, but the 4 sorts of brackets (round, angle, square, curly) will all nest.
| Customary | Generic | Meaning | Interpolates |
| '' | q{} | Literal | no |
| "" | qq{} | Literal | yes |
| `` | qx{} | Command | yes (unless '' is delimiter) |
| | qw{} | Word list | no |
| // | m{} | Pattern match | yes (unless '' is delimiter) |
| | qr{} | Pattern | yes (unless '' is delimiter) |
HERE documents are also a heritage from ksh; they permit insertion of arbitrary text fragments into a Perl script. We will discuss them later.
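Assuming the text fragments mentioned here are the HERE documents named earlier in this chapter, a minimal sketch looks like this (END is an arbitrary terminator chosen for the example):

```perl
use strict;
use warnings;

my $name = 'world';

# Terminator in double quotes: the text is interpolated.
my $text = <<"END";
Hello $name
second line
END

# Terminator in single quotes: the text is taken verbatim.
my $raw = <<'END';
Hello $name
END

print $text, $raw;
```

The terminator must stand alone at the start of its line, and the final newline before it is part of the string.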
|
Perl literals are one of the strongest parts of the language. They are very flexible, and here the programmer is served much better than in any other language that I know |
There are four types of identifiers in Perl. They are distinguished by the first character. Perl uses a variation of the idea that was first successfully used in Fortran and in more developed form in PL/1 (as well as probably several other languages) -- data type is determined by the first letter. Some variations of this idea are usually called Hungarian notation. Prefixes make Perl code look strange, even bizarre, to those who are not used to it, but this is a legitimate solution and it is not that uncommon a feature of programming languages. Sadly enough Perl does not use the extensions developed by PL/1, where you can specify a set of letters that by default enforce a particular type.
All in all that means that most variables in Perl have prefixes. Words without prefixes (barewords in Perl-speak) are used for file handles. Other than that they are considered to be literals, much like single quoted literals.
Table
| Type | Prefix | Examples | Comment |
| Scalar | $ | $number = 123.45; | |
| Array | @ | @a = (1,2,3,4,5); $a[1] = 0; | Individual members of an array are considered to be scalars |
| Hash | % | %ip = (mail_server => '128.101.1.1', dns_server => '131.1.1.1'); $ip{"dns_server"} = '131.10.10.10'; | Individual members of a hash are considered to be scalars |
| Handle | none | open(IN,$path); | IN is a file handle |
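The rows of the table can be tried out as a small script. This is a sketch using lexical (my) variables; note that the IP addresses are quoted, because an unquoted sequence of numbers with several dots is not a plain number in Perl:

```perl
use strict;
use warnings;

my $number = 123.45;                        # scalar
my @a      = (1, 2, 3, 4, 5);               # array
$a[1] = 0;                                  # one element is a scalar: $ prefix
my %ip = (mail_server => '128.101.1.1',     # hash
          dns_server  => '131.1.1.1');
$ip{"dns_server"} = '131.10.10.10';         # one value is a scalar: $ prefix

print "$number | @a | $ip{dns_server}\n";
```

The prefix follows what is being accessed: the whole array is @a, but a single element is $a[1].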
The most common type of identifier in Perl is the scalar, and one needs to get used to the fact that it should be prefixed with '$' (dollar sign). The variable name is formed by rules close to the rules in C and other high-level languages (you can use a-z, A-Z, 0-9 and _ ). Any identifier should start with a letter. In Unix tradition variable names in Perl are case sensitive, so $a and $A are different.
|
The most popular type of identifier in Perl is scalar, prefixed with $ |
That means that all variables in Perl need to have prefixes. There is no way to change the type of prefix required for, say, scalars like it is possible in Fortran or PL/1.
Examples:
$i=5; # i is an identifier. $i means a scalar variable
@digits =(0,1,2,3,4,5,6,7,8,9); # @digits is an array that is initialized with 10 values.
@digits =(0..9); # same as above. Range 0..9 will generate all values automatically
print $ip{'www.yahoo.com'}; # will print 204.71.200.68
$ip{'www.yahoo.com'} = '204.71.200.75'; # set new value for key www.yahoo.com
Perl conventions lessen the probability of a conflict between keywords and identifiers, but do not eliminate it completely. One should avoid using typical keywords like if, else, etc. as variable names even when they will be prefixed with $ or another special character.
A Perl script consists of a sequence of declarations and statements. The only things that need to be declared in Perl are report formats and subroutines. Like in most scripting languages, variables are usually declared implicitly -- the first appearance adds the variable name to the dictionary. By default variables have global scope -- the whole script.
All statements should end with a semicolon. Like in C, statements can be grouped with { }. There is a Perl beautifier. Use it.
So called my variables are different and are more like what you expect from a high-level language. A declaration can be put anywhere a statement can, but has no effect on the execution of the primary sequence of statements -- declarations all take effect at compile time. Typically all the declarations are put at the beginning or the end of the script. However, if you're using lexically-scoped private variables created with my(), you'll have to make sure your format or subroutine definition is within the same block scope as the my if you expect to be able to access those private variables.
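A minimal sketch of the difference between a global variable and a my variable, assuming strict mode (the names $total and $inner_seen are hypothetical):

```perl
use strict;
use warnings;

our $total = 1;            # global (package) variable
my $inner_seen;

{
    my $total = 10;        # private to this block, hides the global one
    $total++;
    $inner_seen = $total;  # value seen inside the block
}

# Outside the block the name $total refers to the global again.
print "inside: $inner_seen, outside: $total\n";
```

The increment inside the block touches only the private copy, so the global keeps its value.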
Like in C and PL/1, statements in Perl must be terminated with a semicolon.
| Statements in Perl must be terminated with a semicolon. Checking this fact before compiling a new Perl script or after changing something in an old one can save a lot of time... |
A unique feature of Perl statements is that any simple statement may optionally be followed by a conditional modifier. There are 4 possible modifiers: if, unless, while and until.
The semantics of these suffixes is similar to the semantics of regular conditional statements (see below on conditional statements). So
$a=0 if ($a<0);
is equal to
if ($a<0) { $a=0;} # see discussion of conditionals in Ch.5
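Here is a sketch of statement modifiers in action, using if, unless, while and until, which are the modifiers I am certain exist (the variable names are illustrative):

```perl
use strict;
use warnings;

my $v = -5;
$v = 0 if ($v < 0);          # executed because the condition is true

my $w = 3;
$w = 0 unless ($w < 0);      # executed because the condition is false

my $i = 0;
$i++ while ($i < 10);        # repeated as long as the condition holds

my $j = 10;
$j-- until ($j <= 0);        # repeated until the condition becomes true

print "$v $w $i $j\n";
```

The modifier form works only for a single simple statement; for several statements the block form with curly brackets is needed.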
A sequence of statements delimited by curly brackets is called a compound statement or block.
Like in Unix scripting languages, double quoted literals are a special kind of expression in which substitution of variables is performed, for example when assigning values to a variable. For some reason this macro substitution is called interpolation in Perl.
Typical examples:
$a = 'This is a string'; # a scalar assigned 'This is a string'
$b = 11.00; # simple scalar assigned 11 (non significant zeros will be dropped during conversion to double float, so it is like $b = 11;)
$c = '21.0'; # a string that is a well formed number
$d = "this item cost is $c"; # interpolation will be performed, substituting $c for 21.0
$e = 'this is $a'; # no interpolation in single quoted literals
$f = ''; # just empty string.
Both single and double quoted literals can span more than one line, but a long literal is usually easier to read when it is split into pieces joined with the concatenation operator ("." -- dot).
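A sketch of these assignments with a long message built from pieces joined by the dot operator (variable names are mine):

```perl
use strict;
use warnings;

my $n = 11.00;                        # stored as the number 11
my $c = '21.0';                       # a string that looks like a number
my $d = "this item cost is $c";       # interpolation
my $e = 'this is $c';                 # no interpolation

# A long literal split over several lines with the dot operator.
my $long = 'first part of a long message, ' .
           "cost is $c, " .
           'end';

print "$n\n$d\n$e\n$long\n";
```

Note that $c keeps its trailing zero because it is a string, while $n prints as 11.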
Perl has a set of operators that is more or less typical for high level languages. Programmers who know C should have the least difficulty. But there are a couple of idiosyncrasies.
The first significant difference from C is that when Perl needs to distinguish between operations on numbers and on text strings, it introduces two sets of operators. For example there are two sets of comparison operators -- one for numbers and one for strings: "==" in Perl means numeric comparison and "eq" string comparison. Only the first will work correctly for numeric comparisons like testing for zero.
| Operator type | Symbols used | Example |
| String comparison | gt, lt, eq, cmp,ne | '9' gt '10', '1.1' ne '1' |
| dot operator | . (dot) | $a = 1; $b =2; print $a . $b # prints 12 |
| numeric operators | +,-,/,*,**, %(mod) | |
| subscript | [] | |
$a = $a + 4; # Add 4 to $a and store the result $a. Can be written as $a +=4
$a = $a - 4; # Subtract 4 from $a and store the result in $a.
# Can be written as $a-=4;
$a = $a * 2; # Multiply $a by 2. Can be written as $a *=2
$a = $a / 2; # Divide $a by 2. Can be written as $a /=2
$a = $a ** 3; # Raise $a to the cube
$a = $a % 2; # Remainder of $a divided by 2 (integer operation)
++$a; # Increment $a and then return the value of the expression
$a++; # Return the current value of $a and then increment it
--$a; # Decrement $a and then return the value of the expression
$a--; # Return the current value $a and then decrement it
Here are examples where the operands will be treated as strings (and will first be converted to strings no matter what):
$a = $b . $c; # Concatenate $b and $c
$a = $b x $c; # $b is repeated $c times
There are marginally useful shorthands similar to C (do not overuse them -- they can hamper the comprehension of the program without saving much space: $a=$a+$b is not much longer than $a+=$b):
$a = $b; # Assign $b to $a
$a += $b; # Add $b to $a
$a -= $b; # Subtract $b from $a
$a .= $b; # Append $b onto $a
Other operators can be found on the perlop manual page. One needs to understand that double quoted literals in Perl are essentially a function that converts scalars to a string, but at the same time both types of comparison operators (for example "==" and "eq") dictate the type of their left and right operands, so if the left or right operand is of the wrong type it will be converted before comparison:
if (0.0 == 0) { } # true (no conversion performed)
if ("0.0" eq "0") { } # false (no conversion performed, strings are unequal as they have different length)
if ("0.0" == "0") { } # true (left and right operands will be converted to numeric value, which is zero for both)
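The three comparisons can be verified directly. In this sketch each result is stored as 1 or 0 so that it can be printed (the names $r1 to $r3 are illustrative):

```perl
use strict;
use warnings;

my $r1 = (0.0 == 0)     ? 1 : 0;   # numeric comparison of two numbers
my $r2 = ("0.0" eq "0") ? 1 : 0;   # string comparison of two strings
my $r3 = ("0.0" == "0") ? 1 : 0;   # both strings converted to numbers

print "$r1 $r2 $r3\n";
```

Only the string comparison fails, because the two strings differ even though they denote the same number.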
Until recently Perl had only one representation of numbers -- double float. In most cases it works OK. In more complex cases, when you need additional precision, that design decision leads to trouble. Later the pragma use integer was introduced, which tells the interpreter to use integer arithmetic for numeric operations.
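The effect of the integer pragma is limited to the block in which it appears, as this small sketch shows (variable names are illustrative):

```perl
use strict;
use warnings;

my $float_div = 7 / 2;      # default arithmetic: 3.5

my $int_div;
{
    use integer;            # integer arithmetic inside this block only
    $int_div = 7 / 2;       # fractional part is dropped: 3
}

print "$float_div $int_div\n";
```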
It is very important to understand that in Perl the operators gt, lt, eq, cmp and ne presuppose conversion of both operands to strings before evaluation:
if ('b' gt 'a') {print "yes, 'b' is greater than 'a'\n"; }
That rule can be a source of errors if you by accident use a numeric operand with string comparison, for example
$a='9.0'; # $a contains the three character string '9.0'
if ($a ne 9.0 ) {print "not equal";} # it's not equal
First 9.0 will be converted to a number (9) and then this number will be converted into the string "9". After that the string '9.0' will be compared with the string '9' and they are not equal. Thus the message "not equal" will always get printed.
Likewise:
$a='10';
if ($a lt '2') { print "left is less" } ;
This evaluates to true since '10' is smaller than '2' when compared as strings. String comparison is done from left to right, symbol by symbol, until the first non-equal symbol is found:
| String/Symbol number | 0 | 1 | 2 |
| '10' | 1* | 0 | |
| '2' | 2* | | |
* -- non-equal symbols, as '1' is less than '2'
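The practical consequence shows up most often when sorting. The sketch below uses the built-in sort with the string operator cmp and its numeric counterpart <=> (sort and <=> are not covered in this chapter, so treat this as a preview):

```perl
use strict;
use warnings;

my @nums = (10, 9, 2, 33);

my @as_strings = sort { $a cmp $b } @nums;   # symbol by symbol comparison
my @as_numbers = sort { $a <=> $b } @nums;   # numeric comparison

print "@as_strings\n@as_numbers\n";
```

Compared as strings, 10 comes before 2 and 33 before 9, exactly as the table above suggests.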
If you compare a numeric literal (a number) with a string using string comparison, then the numeric literal will first be converted into a number (discarding all trailing zeros) and then this number will be converted to a string. For example:
$a='1';
if ($a eq 1.0) { print "$a is equal to 1.0\n"; }
is true because 1.0 will first be converted to its numeric representation, which will be converted back to a string, resulting in the string "1". After that we will compare two strings that are equal.
|
Perl uses "==" for numeric comparison and "eq" for string comparison. Both left and right operands are forcefully converted into the required representation before comparison. This is a source of errors that are very hard to find. |
The '.' symbol denotes the concatenation operator in Perl. The operator takes two scalars, and combines them together in one scalar:
$sentence = $sentence . '.';
appends the string '.' to the end of $sentence. Adding zeros with concatenation can be used for multiplication by ten:
$i= $i.'0'; # here we essentially multiplied $i by 10
A scalar is interpreted as a number if it is part of an array subscript, is an operand of a numeric comparison operator, or is an argument of a built-in function requiring a number. If it does not represent a valid numeric string, any leading numeric part is used, and zero is used if there is none. No conversion error is reported (with warnings enabled you get only a warning).
If the string represents a "well formed" number it will be converted into numeric value without any problems. For example:
$sum = "111.00" + 12;
The '+' turns the string 111.00 into a number, so $sum becomes 123.
But if the string does not start with anything that looks like a number, zero is used.
I would like to remind you that Perl recognizes floating point numbers and hexadecimal numbers, and that underscore can be used to make large numbers more readable, but the underscore representation cannot be used in strings that are converted to numbers:
| No conversion error will ever be reported. In most cases the value 0 (zero) will be used, like in $a=1+'one'; A very unpleasant mistake is connected with the fact that a string with underscores does not represent a well formed number. That means that '1_000_000' will not be converted to 1000000: conversion stops at the first underscore and the result is 1 -- an unpleasant surprise. |
Note: underscore is not accepted in string literals and can be used only for numbers without quotes. For example, the string "1_000_000" is not a well formed number, and during conversion to a number only the leading 1 will be used:
if (1000000 == "1_000_000") {} # false, as the right operand will be converted to 1, not 1000000
Likewise with:
$value = "non_number" + 12;
conversion of string "non_number" to a number will result in zero. That means that $value will be assigned 12.
That also means that in case $i is non-numeric the index zero will be used in the statement
print $a[$i];
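The conversions discussed in this section can be checked with the sketch below. It switches off the 'numeric' warning category, which would otherwise complain about each malformed string; the variable names are illustrative, and hex() is the built-in function for converting hexadecimal strings:

```perl
use strict;
use warnings;
no warnings 'numeric';            # silence "isn't numeric" warnings for this demo

my $sum   = "111.00" + 12;        # well formed number
my $value = "non_number" + 12;    # no leading number: treated as 0
my $under = "1_000_000" + 0;      # conversion stops at the first underscore
my $hexs  = "0x1F" + 0;           # conversion stops after the leading 0
my $hexn  = hex("0x1F");          # explicit conversion of a hex string

my @arr  = ('first', 'second');
my $idx  = 'abc';
my $elem = $arr[$idx];            # non-numeric subscript is treated as 0

print "$sum $value $under $hexs $hexn $elem\n";
```

On the Perl versions I am aware of, "1_000_000" converts to 1 rather than to a million, so data read from files should be cleaned of underscores before arithmetic.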
Like in regular languages, statements in Perl are executed according to the flow of control. One can achieve implicit loop behavior similar to sed and awk scripts by using the -n or -p switch. We will discuss them later.
While superficially similar to C/C++, Perl has a richer lexical and syntax structure, with elements inspired by Unix shells.
Some difficulty for novices might be the concept of a typeless language, especially the fact that type is defined by the operator used, with an associated implicit conversion to numeric or text representation depending on the operator. For example the comparison operator "==" forces both left and right operands to be converted to numbers, while the operator "eq" forces both left and right operands to be converted to strings.
Scalars are the most popular type of variable in Perl, and one can think about them as strings with an optional numeric representation when it makes sense and zero otherwise. That means that any string in Perl can be converted to a number. Perl is one of very few languages where the operator determines the type of the operands (and implicit conversion, if necessary), much like in assembler.
Please be very careful and check your program for typical errors before submitting it to the compiler. That significantly simplifies life.
|
Please always use '-wc' (warning flag + compile_only flag) to check the initial draft of your scripts with the Perl interpreter. It might help you to find some tricky bugs on the syntax level instead of digging them out as runtime errors. |
Be especially careful with numeric comparisons that involve variables that are strings. You need to be very careful not to shoot yourself in the foot with implicit conversion.
|
Scalars are the most popular type of variable in Perl and one can think about them as strings with optional numeric representation when it makes sense (zero otherwise). That means that any string can be interpreted as a number, if the operator requires a number. |
A strong point of Perl is a very flexible string literal mechanism. By using the appropriate quotation mechanism one can avoid errors typical in other languages with less flexible string literal syntax. Perl has a rich set of operators and flexible semantics of assignment.
Perl uses prefixes to identify the type of variables, much like early languages (Fortran-66).
Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. Submit comments This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License(OPL). Site uses AdSense so you need to be aware of Google privacy policy. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
Last modified: September 05, 2009