Lexical analysis


Lexical analysis is either a subroutine of the parser or a separate pass of the compiler; it converts the text representation of the program (a sequence of characters) into a sequence of lexical units for a particular language (tokens). A program which performs lexical analysis is called a lexical analyzer, lexer or scanner.

A token is a string of characters, categorized according to the rules of the lexical grammar for a particular language (such as identifiers, integers, floating-point numbers, text literals, delimiters (especially the statement delimiter), operators, comments, etc.). A token usually consists of two main parts: the token type (its category) and the token value (the lexeme itself, or a reference to it).

In addition to recognizing tokens, lexical analyzers usually have some implicit definition of "white space" -- the set of characters that can separate words in the program. Typically blanks, tabs and newlines are treated as whitespace. "Internal" comments (such as /* ... */) are also treated as whitespace.

The lexical analyzer also detects some errors in the text representation of the program. For example, it can issue a warning about unmatched parentheses within a statement (most languages balance opening and closing brackets within a statement). Also, a statement usually starts with a word and can't start with a delimiter such as "=", ":", etc.

Consider this simple statement:

sum=i+2;

After processing by the lexical analyzer it is converted into the following stream of tokens:

lexeme    token type
------    -------------------
sum       Identifier
=         Assignment operator
i         Identifier
+         Addition operator
2         Number
;         Statement delimiter

Lexemes are usually represented not by themselves (as strings), but by references to compiler symbol tables (a compiler can have multiple such tables: one for identifiers, another for constants, yet another for reserved words like if, then, else, for, while, etc.).
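As a rough sketch of what such a representation might look like (the type and function names below are hypothetical, not taken from any particular compiler), a token can be stored as a small record that holds the token type plus an index into the appropriate table instead of the lexeme string itself:

/* A minimal sketch of token representation with symbol-table references.
   All names here are hypothetical and for illustration only. */
#include <stdio.h>
#include <string.h>

typedef enum { IDENT, NUMBER, ASSIGN, PLUS, SEMICOLON } TokenType;

typedef struct {
    TokenType type;    /* token category */
    int       index;   /* index into the identifier table, or -1 when the
                          token carries no value (operators, delimiters) */
} Token;

#define MAX_IDS 256
static char id_table[MAX_IDS][32];   /* naive identifier table, no length checks */
static int  id_count = 0;

/* Return the index of a lexeme in the identifier table, adding it on first use. */
static int intern(const char *lexeme) {
    for (int i = 0; i < id_count; i++)
        if (strcmp(id_table[i], lexeme) == 0)
            return i;
    strcpy(id_table[id_count], lexeme);
    return id_count++;
}

int main(void) {
    /* The statement "sum=i+2;" becomes a stream of (type, index/value) pairs. */
    Token stream[] = {
        { IDENT,     intern("sum") },
        { ASSIGN,    -1            },
        { IDENT,     intern("i")   },
        { PLUS,      -1            },
        { NUMBER,    2             },  /* small constants kept inline here;
                                          a constant table could be used instead */
        { SEMICOLON, -1            }
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("(%d, %d)\n", (int)stream[i].type, stream[i].index);
    return 0;
}

With this layout the parser compares small integers instead of strings, and two occurrences of the same identifier map to the same table slot.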

While the lexical elements of modern languages can typically be defined by regular expressions, it does not make much sense to use a regular-expression-based generator to split the text into tokens. It is simple enough to program a lexical analyzer directly in any high-level language, using built-in functions such as tr (translate), which was first introduced in the System/360 instruction set. The classic hand-written lexer uses the so-called "loop-and-switch" algorithm. The initial prototype should actually be generated by flex or a similar tool; after that you can enhance it manually. This is the path taken in many real-life projects. For example, the gcc front end uses a hand-written (ad hoc) scanner.

To improve speed, the lexer usually detects the end of identifiers and numeric literals using the very efficient tr (translate) instruction, which is available on most modern computer architectures. The idea of the "loop-and-switch" lexical scanner algorithm can be illustrated by the following skeleton program:

while ((ch = read_ch()) != EOF ) {
  if (ch <= ' ') continue;   /* skip "white space" (can be integrated into read_ch) */
  err_line = cur_line;
  err_col  = cur_col;
  switch (ch) {   /* this is the "switch" part of the algorithm */
    case EOF: return eof;
    case '+': return plus;
    case '-': return minus;
    case '*': return times;
    case '/': return slash;
    case '=': return eql;
    case '(': return lparen;
    case ')': return rparen;
    case ',': return comma;
    case ';': return semicolon;
    case '.': return period;
    case ':':
      ch = read_ch();
      return (ch == '=') ? becomes : nul;
    case '<':
      ch = read_ch();
      if (ch == '>') return neq;
      if (ch == '=') return leq;
      put_back(ch);
      return lss;
    case '>':
      ch = read_ch();
      if (ch == '=') return geq;
      put_back(ch);
      return gtr;
    default:
      if (isdigit(ch)) {
        num = 0;
        do {  /* no checking for overflow! */
          num = 10 * num + ch - '0';
          ch = read_ch();
        } while ( ch != EOF && isdigit(ch));
        put_back(ch);
        return number;
      }
      if (isalpha(ch)) {
        Entry *entry;
        id_len = 0;
        do {
          if (id_len < MAX_ID) {
            id[id_len] = (char)ch;
            id_len++;
          }
          ch = read_ch();
        } while ( ch != EOF && isalnum(ch));
        id[id_len] = '\0';
        put_back(ch);
        entry = find_htab(keywords, id);
        return entry ? (Symbol)get_htab_data(entry) : ident;
      }
 
      error("getsym: invalid character '%c'", ch);
      return nul;
  } /* switch */
} /* while */
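One detail the skeleton glosses over is comment handling. A possible approach, sketched below with hypothetical names (raw_ch() stands in for the real character source), is to filter out C-style block comments inside read_ch() itself, so that the switch never sees them:

/* Hypothetical sketch: a character-level filter that discards C-style
   block comments so that the scanner above never sees them.
   raw_ch() is assumed to return the next raw character or EOF. */
#include <stdio.h>

static int raw_ch(void) { return getchar(); }   /* stand-in for real input */

int read_ch(void) {
    int ch = raw_ch();
    while (ch == '/') {                 /* possible start of a comment */
        int next = raw_ch();
        if (next != '*') {              /* not a comment: push back, return '/' */
            ungetc(next, stdin);
            return ch;
        }
        /* inside a block comment: skip until the closing delimiter */
        int prev = 0;
        while ((next = raw_ch()) != EOF) {
            if (prev == '*' && next == '/')
                break;
            prev = next;
        }
        ch = raw_ch();                  /* first character after the comment */
    }
    return ch;
}

int main(void) {                        /* echo the input with comments removed */
    int ch;
    while ((ch = read_ch()) != EOF)
        putchar(ch);
    return 0;
}

An unterminated comment shows up here as hitting EOF inside the inner loop, which is exactly the place where the "runaway comment" warning mentioned below could be issued.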

A good lexical analyzer not only reads a stream of characters, discards comments, identifies the lexemes in the stream, and categorizes them into tokens; it also performs various diagnostic functions and fills the compiler symbol table. The diagnostic functions performed by the lexer are very important, and they are one reason why hand-written scanners are much preferable to generated ones. Among the important types of warnings the lexer can issue we can mention:

  1. Warnings about unclosed multiline literals (based on the convention that a literal is confined to a single line, or to at most N lines).
  2. Unclosed (runaway) comments (if the language supports multiline /* ... */ style comments).
  3. Unbalanced parentheses within a statement (see the sketch after this list).
  4. Warnings about variables which are used only once in the program.
  5. If possible, it can also disambiguate ambiguous symbols like "=" (when it is used both for assignment and as the conditional "equal" operation; this is a design flaw and is not recommended for languages created from scratch). Of course it is better to use different lexemes for these (like "=" and "==").
  6. Moving function declarations and variable declarations to the beginning of the program. For this purpose the lexical stream for functions and declarations can be written to a separate location (a cache) and then processed first by subsequent passes. This makes a full lexical pass preferable to using the lexer as a subroutine called by the parser. (Please note that the efficiency cost of additional passes is not that important: you can always make a multipass compiler almost as efficient as a single-pass one by using coroutines. What is important is ease of debugging, and here multipass compilers are preferable.)
  7. The lexer can also take care of simple parameterless macros like %PI=3.14 and of include statements.
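As a rough illustration of item 3 (all names below are hypothetical, and for brevity the check is done per character on a statement string; a real lexer would update the same counter per token), the lexer can keep a running parenthesis depth for the current statement and complain at the statement delimiter:

/* Hypothetical sketch: detecting unbalanced parentheses within a statement.
   The depth counter is updated while scanning and checked at the end. */
#include <stdio.h>

static void check_statement_parens(const char *stmt, int line) {
    int depth = 0;
    for (const char *p = stmt; *p; p++) {
        if (*p == '(')
            depth++;
        else if (*p == ')')
            depth--;
        if (depth < 0) {                           /* ')' before a matching '(' */
            printf("line %d: unexpected ')'\n", line);
            return;
        }
    }
    if (depth > 0)                                 /* '(' never closed */
        printf("line %d: %d unclosed '(' in statement\n", line, depth);
}

int main(void) {
    check_statement_parens("sum = (i + 2;", 1);    /* warns: 1 unclosed '('  */
    check_statement_parens("sum = i + 2);", 2);    /* warns: unexpected ')'  */
    check_statement_parens("sum = (i + 2);", 3);   /* silent: balanced       */
    return 0;
}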

One important feature of a good lexical analyzer is its ability to help the parser. For this purpose it can provide lookahead functions and rearrange the text to simplify subsequent parsing, for example by moving declarations and functions ahead of the rest of the program.
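One-token lookahead can be implemented by buffering the most recently scanned symbol. The sketch below uses hypothetical names (peek_sym, next_sym) and a stub getsym() that stands in for the loop-and-switch scanner above, so that the example is self-contained:

/* Hypothetical sketch: one-token lookahead on top of a getsym() scanner. */
#include <stdio.h>

typedef int Symbol;
enum { ident = 1, becomes, plus, number, semicolon, eof };

/* Stub scanner: replays the token stream of "sum=i+2;". */
static Symbol getsym(void) {
    static const Symbol stream[] = { ident, becomes, ident, plus,
                                     number, semicolon, eof };
    static int pos = 0;
    return stream[pos] == eof ? eof : stream[pos++];
}

static Symbol buffered;
static int    have_buffered = 0;

/* Return the next symbol without consuming it. */
static Symbol peek_sym(void) {
    if (!have_buffered) {
        buffered = getsym();
        have_buffered = 1;
    }
    return buffered;
}

/* Return the next symbol and consume it. */
static Symbol next_sym(void) {
    if (have_buffered) {
        have_buffered = 0;
        return buffered;
    }
    return getsym();
}

int main(void) {
    printf("peek: %d\n", peek_sym());   /* looks at 'sum' without consuming it */
    while (next_sym() != eof)           /* consumes the whole stream */
        ;
    return 0;
}

The parser can call peek_sym() to decide which production to use and next_sym() to actually consume the token; the same pattern extends naturally to a small queue if more than one token of lookahead is needed.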

The lexer can be implemented as a coroutine of the parser, or as a separate pass. The lexer can also populate the symbol table of the compiler (especially for simple languages where all variables are global and the complications connected with nesting variables inside blocks and procedures are absent).

The general rule is that you need to make the lexical analyzer as helpful to the parser as possible. That's why it should be hand-coded and not automatically generated. A little ingenuity at the lexical level pays tremendous dividends later. Using a full lexical pass instead of calling the lexer as a subroutine is preferable, as it permits rearranging the lexical stream to simplify parsing (for example, moving function definitions upstream). Also, severe errors at the lexical level can be detected and translation aborted early. In this sense, lexical analysis serves as a firewall between the program representation and the syntax parser, and severe problems in the text representation of the program should stop translation immediately.
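A minimal sketch of such rearrangement, under the assumption that the lexical pass can tell (for example, from a leading keyword) whether the current token belongs to a declaration; all names below are hypothetical:

/* Hypothetical sketch: a lexical pass that routes declaration tokens and
   other tokens into separate buffers, so declarations are parsed first. */
#include <stdio.h>

#define BUF_SIZE 1000

static const char *decl_buf[BUF_SIZE];  static int decl_n = 0;
static const char *body_buf[BUF_SIZE];  static int body_n = 0;

/* Route one lexeme depending on whether we are inside a declaration.
   In a real lexer the flag would be set when a declaration keyword or
   type name is seen and cleared at the end of the declaration. */
static void emit(const char *lexeme, int in_declaration) {
    if (in_declaration && decl_n < BUF_SIZE)
        decl_buf[decl_n++] = lexeme;
    else if (body_n < BUF_SIZE)
        body_buf[body_n++] = lexeme;
}

int main(void) {
    emit("sum", 1); emit(":", 1); emit("int", 1); emit(";", 1);   /* declaration */
    emit("sum", 0); emit("=", 0); emit("i", 0);   emit("+", 0);
    emit("2", 0);   emit(";", 0);                                 /* statement   */

    /* The parser would consume the declaration buffer first, then the body. */
    for (int i = 0; i < decl_n; i++) printf("%s ", decl_buf[i]);
    printf("| ");
    for (int i = 0; i < body_n; i++) printf("%s ", body_buf[i]);
    printf("\n");
    return 0;
}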

If you are designing your own language, it is important to follow some rules of lexical structure design:

A language designer should avoid ambiguity in the design of the lexical syntax of a language.

  1. Reserved words are a good idea to avoid ambiguity between user symbols and language command words.

    There should not be too many reserved words.

  2. Don't allow spaces inside tokens.
  3. Different kinds of tokens should look different at the left end (initial character).
  4. Avoid ambiguous combinations of tokens that would require long look-ahead.
  5. Avoid "noise" such as ugly special characters. These make programs hard to read even if they save a few keystrokes.
  6. Decide whether identifiers are case-sensitive. In other words, are upper- and lower-case letters equivalent?

Dr. Nikolai Bezroukov



Old News

[Dec 28, 2010] Free Compiler Construction Tools (Lexers, Parser Generators, Optimizers)

thefreecountry.com

re2c

This program, re2c, generates a scanner from regular expressions. It supposedly generates scanners that are faster than a flex scanner (see elsewhere on this page). However, note that the scanner is not a complete scanner: you will need to supply some interface code (to provide the input stream to the scanner), which also means that you have a certain amount of flexibility in determining how the scanner gets its input (it can be from a file, or just some buffer you created on the fly).

Quex - A Mode Oriented Directly Coded Lexical Analyzer Generator

A Mode Oriented Directly Coded Lexical Analyzer Generator Quex, or Queχ (depending on which part of the site you read), produces a directly coded lexical analyzer engine with pre- and post- conditions rather than the table-driven created by the Lex/Flex family (see elsewhere on this page). Features include inheritable "lexer modes" that provide transition control and indentation events, a general purpose token class, a token queue that allow tokens to be communicated without returning from the lexical analyzer function, line and column numbering, generation of transition graphs, etc. You will need to install Python before using this lexical analyser generator. It generates C++ code. It is released under the GNU LGPL with additional restrictions; see the documentation for details. Windows, Mac OS X, Solaris and Linux are supported.

Gardens Point Scanner Generator
The Gardens Point Scanner Generator, GPLEX, accepts a lex-like input specification to create a C# lexical scanner. The scanner produced is thread-safe and all scanner state is maintained within the scanner instance. Note that the input program does not support the complete POSIX lex specifications. The scanner uses the generic types defined in C# 2.0.

[Dec 27, 2010] Writing a Compiler in C# Lexical Analysis - All Your Base Are Belong To Us

The following is the primary method of our lexical analyzer. (The rest of its implementation was omitted for brevity.)

public void Advance()
{
    EatWhitespace();
 
    if (IsAtEnd)
    {
        _done = true;
        return;
    }
    char nextChar = NextChar();
    if (Syntax.IsSymbol(nextChar))
    {
        //This token is going to be a symbol. There are
        //three special look-ahead cases for '<=', '>=', 
        //and '!='.
        if ((new[] { '<', '>', '!' }.Contains(nextChar))
            && LookAhead() == '=')
        {
            NextChar();//Eat the '='
            _currentToken = new Token(
                TokenType.Symbol, nextChar + "=");
        }
        else
        {
            _currentToken = new Token(
                TokenType.Symbol, nextChar.ToString());
        }
    }
    else if (Syntax.IsNumber(nextChar))
    {
        //This token is going to be an integer constant.
        string intConst = nextChar.ToString();
        intConst += EatWhile(Syntax.IsNumber);
 
        int result;
        if (!int.TryParse(intConst, out result))
        {
            throw new CompilationException(
            "Int const must be in range [0,2147483648), " + 
            "but got: " + intConst, _currentLine);
        }
 
        _currentToken = new Token(
            TokenType.IntConst, intConst);
    }
    else if (Syntax.IsCharOrdinalStart(nextChar))
    {

        char marker = NextChar();
        if (marker == '\\')
        {
            string code = EatWhile(Syntax.IsNumber);
            if (code.Length != 3)
            {
                throw new CompilationException(
                "Expected: \\nnn where n are decimal digits",
                _currentLine);
            }
            int value = int.Parse(code);
            if (value >= 256)
            {
                throw new CompilationException(
                "Character ordinal is out of range [0,255]",
                _currentLine);
            }
            _currentToken = new Token(
                TokenType.IntConst, value.ToString());
        }
        else
        {
            _currentToken = new Token(
                TokenType.IntConst, ((int)marker).ToString());
        }
        NextChar();//Swallow the end of the character ordinal
    }
    else if (Syntax.IsStringConstantStart(nextChar))
    {
        //This token is going to be a string constant.
        string strConst = EatWhile(
            c => !Syntax.IsStringConstantStart(c));
        NextChar();//Swallow the end of the string constant
        _currentToken = new Token(
            TokenType.StrConst, strConst);
    }
    else if (Syntax.IsStartOfKeywordOrIdent(nextChar))
    {

        string keywordOrIdent = nextChar.ToString();
        keywordOrIdent += EatWhile(
                          Syntax.IsPartOfKeywordOrIdent);
        if (Syntax.IsKeyword(keywordOrIdent))
        {
            _currentToken = new Token(
                TokenType.Keyword, keywordOrIdent);
        }
        else
        {
            _currentToken = new Token(
                TokenType.Ident, keywordOrIdent);
        }
    }
    else
    {
        throw new CompilationException(
            "Unexpected character: " + nextChar, _currentLine);
    }
}

There are five interesting cases here from which five different token types can be generated:

  1. A symbol [TokenType.Symbol], which may contain two characters, explaining the need for additional look-ahead with '<', '>', and '!'. Note that the additional look-ahead may fail if the symbol is placed at the end of the file, but this is not a legal language construct, anyway.
  2. A numeric constant [TokenType.IntConst]: we currently allow only integer constants, as the language doesn't have floating-point support.
  3. A character ordinal constant such as 'H' or '\032': these are translated to numeric constants as in #2.
  4. A literal string constant [TokenType.StrConst] such as "Hello World": note that '"' is not a legal character within a literal string constant. We leave it for now as a language limitation.
  5. A keyword or an identifier [TokenType.Keyword or TokenType.Ident], matching the previously shown regular expression.

This lexical analyzer is rather "dumb": it does not record identifier information anywhere, and it doesn't provide access to anything but the current token. It turns out that we don't need anything else for the current Jack syntax; formally speaking, it is almost an LL(1) language, i.e. most of its language constructs can be parsed with only one look-ahead token. The single LL(2) exception is subroutine calls within expressions, and we'll craft a special case in the parser to work around this limitation.

For the "Hello World" program above, this lexical analyzer will produce the following sequence of tokens:

<keyword, class> <ident, Main> <symbol, {>
<keyword, function> <keyword, void> <ident, main>
<symbol, (> <symbol, )> <keyword, do>
<ident, System> <symbol, .> <ident, print>
<symbol, (> <symbol, "> <strconst, Hello World!>
<symbol, )> <symbol, ;> …

Next time, some parsing.

Dragon Lexical Analyzer « Python recipes « ActiveState Code

This is an example of a classic "loop and switch" lexical analyzer in Python.

I remember thinking the old Dragon C code was pretty good, but not so much now. Rather than allocate a token struct in each loop, lexan() returned just the token tag after writing the token's value to a global. It's understandable from a speed standpoint, but it does make the code harder to follow.

Pattern Matching vs Lexical Grammar Specification

Xah Lee, 2008-05-01

This page is a personal note on the relation of pattern matching and lexical grammar specification, and their application in formal logic. Pattern matching here can mean textual pattern matching such as regex or list structure and datatype pattern matching such as in Mathematica.

A Need For Lexical Grammar Spec Lang

In computing, there's pattern matching. e.g. those in Mathematica, for pattern matching list structure and datatypes ( Mathematica patterns ) , and those in perl, for pattern matching strings. These pattern matching domain-specific languages are designed to match patterns, but i have realized, that they are rather unfit, when used as a mean to specify textual forms. In short, what i came to understand is that pattern matching facilities (may it be regex for strings or Mathematica's for symbolic expressions), although powerful, but is not capable of specifying formal grammar, even very simple ones. (this seems obvious now, when stated as above, but it was a realization for me.)

Often, when writing doc or in verbal communication, on computer programing or doing math with Mathematica, you are tempted to use a pattern to communicate a particular form a expression may have. For example, you would like to say that email address has the form xyz, where xyz is a perl regex:

[A-z0-9]+@[A-z]+\.com

Similarly, you might use regex to communicate the form of a url, domain name, file path. In math (especially in formal logic or in computer algebra systems), i often want to say that certain function has the form xyz, where xyz is a Mathematica pattern that indicates the function's parameters, output, and their types. (For example: a computer language expression analogous to traditional notation 「f:ℂ²→ℝ³」). However, in practice, using regex or Mathematica's patterns for the purpose of specifying a form, is often unsatisfactory. Typically, using patterns only gives a general idea, but isn't a correct specification of the form. For the purpose of giving a general idea, verbal English description is almost as good. Almost.

Recommended Links

Lexical analysis - Wikipedia, the free encyclopedia

Lexical Analysis

LEXICAL ANALYSIS



