May the source be with you, but remember the KISS principle ;-) Home Switchboard Unix Administration Red Hat TCP/IP Networks Neoliberalism Toxic Managers (slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

# Compilers Algorithms

 News Algorithms Best books Recommended Links University Courses Papers FAQs People Lexical analysis Recursive Descent Parsing Parsing Regular expressions Code  generation Tree-based code optimization Peephole optimization Performance and Benchmarks Decompilation ACM Program Analysis and Transformations Pipes coroutines and continuations C compilers Generative programming methods Graph Analysis DSL (domain specific languages) Defensive programming LEX&YACC XPL PL/360 Prolog Bit Tricks Humor Etc

### Introduction

As I used to explain students during the first lecture of my  "Compiler construction" course, this topic of computer science has much wider practical significance than one can expect just from the name of course.  First of all, from my point of view,  this is the most exiting part of CS curriculum that touch a lot of important areas and some very interesting algorithms.

Ability to construct and use languages is what makes humans humans, and the ability to create your own language is simply amazing. In a way writing compiler from a specialized language is just an extension of if this immanent human ability. But pretty amazing one. Please don't miss this opportunity. I think that compiler writing is an eye-opening experience for any interested student of programming, and I deeply regret that compiler construction courses are de-emphasized in current university curriculums. But at this slightly skeptical site, I am not obliged to conform to the current programming fashion, whatever it might be.

That why this page tries to bring compiler construction to your attention as the central and most important part of CS curriculum.  This course has some implicit links to scripting languages as the latter (and first of all Perl and Python) are powerful compiler construction tools that greatly simplify writing of  lexical scanner and syntax analyzer of any specialized complier. They allow to produce higher quality product in less time. Speed of compiling in such cases is not a primary concern and in many cases you will be surprised how fast are supposedly inefficient methods (such as writing compiler as s set of distinct passes, connected via files written to the disk)  on modern computers.  But a quality of diagnostic is of primary concern  and is easier achieve then you lexical parser and syntax analyzer are written (or generated) in scripting language.  Also overemphasizing of syntax analysis in such courses will do no good to students (and this the grave deficiency of most "compiler construction textbooks"). Recursive decent is usually adequate as you control the language ( Niklaus Wirth was probably the first who understood this fact; althouth later he slipped into software verification fundamentalism, which killed him as compiler writer ;-)

The real beef in writing the compiler is in creating an flexible intermediate representation of the program  suitable for code generation (often structured similarly to the syntax tree) and generating the code from it. The hierarchical structure of Unix file system facilitates its use it in compiler writing for storing the intermediate parsing tree with templates from which you can generate final code by just traversing this tree in a right order. And having elements of intermediate representation written as files  greatly simplifies debugging.

Code generation actually does not necessary needs to be machine code or even assembler language. You can generate code in any language you want, including shell, Perl, Python and Java.

Another advantage of taking this course that it introduces you to what is called "dual language programming" -- programming your software system in two languages (for example, TCL+C, Python+C, JPython+ Java, Shell+Perl, Groovy+Java). For many problems this approach is more productive than any currently fashionable programming paradigm. When you create interpreter from the scripting (for example TCL or Python) in lower level compiled language (such as C), you might preserve the capability to call programs that adhere to certain calling rules, but are written in low level language from the higher level language. In case of TCL this was one of the design goals for the language and its interpreter  (LUA is similar in this respect).  In such cases you still have access to lower level language in which you can write parts of your software that are difficult or very inefficient to express in higher level language.  This is a tremendous advantage, as you can mix and match them in creating your software product.

### Knowledge of compiler construction is the key to understanding of any particular programming language, especially a complex one

You can't understand compiled and scripting programming language without understand how particular data types are represented in computers, and how particular constructs are implemented, at least of high level. Such notions  as references, identifier tables (and related notion of namespaces),  structures (which can hold iether values or references, including references to functions (in this case of OO languages such structures are called objects), garbage collection, passing parameters by value vs by reference, etc. Nicklaus Wirth reportedly joked that, because Europeans pronounce his name properly, while Americans pronounce it as "nickel's worth", he is called by name in Europe and called by value in America.

There is another important lesson which you can learn by studying compiler construction and methods of translating particular constructs into the instruction of iether real of virtual machine: writing compilers is hard and raising the level of language is equal to the rising of the level of abstraction at which you are viewing the problem. Not all aspects of the problem can be viewed from high level of abstraction. Ability to switch from one level of abstraction to another (lower of higher) is one of the traits of talented software architects.

When you have no clue about the methods used for the implementation of particular constructs, you can't write good programs. And the higher is the levle of the language the more difficult is to maintain this understanding. For example, very few people understand how OO constructs are implemented.  Unfortunately many programming books try to obscure this aspect, do not provide any details of the implementation or the history of emergence of this feature (which is very important for understanding),  and try to treat such constructs as something "given".

If you have only vague understanding of the way the particular construct is implemented in compiler/interpreter, you will spend a lot of time "struggling with the language" -- attempt to understand what can be called "language bureaucracy" instead of solving the problem.

Another problem is that modern language became way to complex. They are essentially incomprehensible for humans and most programmers are able to learn only a subset. In this sense the knowledge of how particular feature compiled in object code or interpreted is the only way really understand some features of the language.

I suspect that this drive to "overcomplexity" while lead to some important  successes in creating VHL languages (Python now is among top ten algorithmic languages used; in October 2017 is occupied fifth position in TIOBE Index) to a certain extent is a self-defeating approach -- the resulting language becomes too "top heavy" infused with esoteric constructs that few use or need,  and interpreter became too complex to maintain. At this point the language and interpreter both need powerful financially rich sponsor to continue development. If such sponsor bails out (like O'Reilly did in case of Perl) the language stagnates as maintaining the interpreter and libraries is no longer fun and few developers are ready to deal with such a level of complexity.

"Dual language programming approach" might be more productive and allow on both levels use simpler languages (for example in pair TCL+C,  TCL is really much  simpler then, say Perl or Python).  Also the design of a particular language compiler or interpreter is heavily influenced by the current flavor of "religious fundamentalism" which is endemic in this area.  Previously those were "structured programming" and "verification" frenzy. Now it is OO, althouth the religious fervor now is going down in this area, as it is clear that OO is unable to produce the benefits expected and never will, while contibuting to excessive complexity of the language and its  imlementation in comparison with classic Algor-60 style procedural languages (Rees Re OO)

(Jonathan Rees had a really interesting response to Why Arc isn't Especially Object-Oriented, which he has allowed me to reproduce here.)

... ... ...

Because OO is a moving target, OO zealots will choose some subset of this menu by whim and then use it to try to convince you that you are a loser.

(I say regular utilities written for Linux  maintenance tasks written in Python in OO style -- that was a horror show; or more correctly "art for the sake of the art").  Excessive attention to religious fundamentalism aspect of programming language culture  is really detrimental to the quality of implementation (and speed of execution) of many classes of programs.  Also languages that too subscribed to particular fashion tend to disappear when the fashion dies out. The destiny of Pascal (only version with Borland Delphi extensions is still used) and Smalltalk are very important  lessons in this respect.  While OO flag is still carries by C++, Java, Python and Ruby  all those language allow procedural programming as well.

Actually it is better to learn language that allow you use regular procedural programming along with OO, such as C++, Perl and Python than languages in which  your are forced into straight jacket of OO.  When I see book on "Algorithms and data structures" which use OO in sorting or searching algorithms I do not know whether I should laugh or cry. In any case this is simply disgusting. KISS principle is clearly violated here: not all areas need or can benefit from OO approach. Moreover a rational part of OO approach can be modeled in non-OO languages as well.   Recent growth of popularity of Javascript suggest that alternative approaches can be as good or even better (Javascript uses Prototype Based Object Model, which is distinct and somewhat superior of Simula67-inspired class-based OO model used in C++, Java, Python, and similar languages)

At the same time it is important to understand that the programming language and its compiler or interpreter are only a small part of the total "language ecosystem". And the quality of this ecosystem and especially the quality of libraries (especially the part of libraries that provide interface to underling OS such as Linux)  and debugger are as important or even more important  as the quality of the language.  As Knuth pointed out long ago, in evaluating the quality of the language, you need to see the larger picture --  not the language as such,  but  the whole programming ecosystem that exists around the language and which include the debugger,  IDE, libraries, books, Web sites, groups, etc.

### The idea of creating abstract, specific area oriented,  machine and associated language  -- compiler construction as programming paradigm

Compiler design and related set of classic algorithms provides a pretty flexible software architecture that can be called "abstract machine" architecture. Sometimes using this architecture and adapting it to a particular task can make design more transparent and more easily debugged. While separating lexer, parser and semantic routines is very powerful way of structuring many types of programs, especially text conversion/transformation programs, the real power of this approach is in identifying the key operations of some "abstract machine" specific to particular domain and the language that allow you to operated with those (now defined) high level operations and data structures.

In other words structuring a large program as if it were a compiler for some new specialized language helps to solve problem that are more difficult or even impossible to solve other way. It can adds some elegance in architecture of many types of text processing and interactive programs. BTW coroutines were first introduced as a way to simplify writing compilers (Melvin Conway was a pioneer of this approach; see also his Conway Law)

An approach when an input to the program can be treated as some formal language and program transforms it to another language now got some traction in XML world. But in no way it is limited to XML and the idea has much wider applicability.  In a perverted way it is also present in OO languages, which can be viewed as simplistic implementation of "compiler-compiler" approach to problem solving.  Rational ideas related to creation of a specialized data types set of abstract operations on them (disguised as objects) and  are buried in superficial things such as inheritance issues, the machinery of working with hierarchical name spaces and related "visibility" problems,  integration of exceptions into this framework, as well as semi-relations desire to outlaw alternative approaches, typical for OO zealots.

In other words, elements of compiler technology, especially a notion of scanner-parser-generator troika are applicable to a wide range of programming tasks. Of course to use it you need to understand how compiler is structured and what algorithms are used. That's why complier construction course is so crucial. And again, while difficult (especially if overloaded is redundant formalisms) it is really existing. In some way, creating a new language is the way human solve complex problems.

That's is of the reasons I created this page as the part of my career was devoted to compiler writing and teaching compiler construction. It is now so long in the past that I've already forgotten most of the things I used to know :-(. and that's sad. This page reminds me that some time in the past I knew lot more than now, and probably was a better programmer.

But the key ideas remain applicable to my current work. And  I can attest that that fact that by accident  I learned the most about programming at this early stage of  my career by writing compilers, made me better programmer then my more talented colleagues, who did have of such an opportunity. Without this stage I probably would have never manage to reach my present level. I also would be more susceptible to fads (which are abundant in programming with each decade producing a couple of "new and exciting" technologies, which are solidly forgotten in the next decade ;-).  As compilers are one of the most complex types of programs, it makes you skeptical about any claim that somebody somehow invented a "silver bullet". moreover it allows you to see that some of the prophets of those new fads are simply charlatans who do not know much about the problem that arise in writing large and complex programming systems.

On the other hand it's really unfortunate that many students today study only a couple of  complex and boring languages (C++ and Java; sometimes one scripting language such as Python) and never learn assembler language. The latter is not only much simpler than any high-level language, but also is a necessary prerequisite for any decent  compiler course.  It allow you to understand how computer operates and ability to work on multiple levels of abstraction and switch from one level of another with  ease is a sign of a top software architect. This is both a native ability and acquired skill, so without learning assembler language programming the native ability might lie dormant.

Many tricks with bit arithmetic that complier writers widely use, and first of all treating words as set of bits capable to implement set operations, as well as superimposition of different types on a single memory location; thus redefining the memory area,  are best implemented in assembler language. That's why I still think that a student who has not taken an assembly course at the university is a slightly crippled student and IMHO only the most gifted students can overcome an overdose of OO, Java, and other fancy topics in the current university curriculum ;-).  I would like to stress it again, that understanding of low level architecture of computer including, but not limited to its instruction set, is critical for any top programmers because one of the key features of top programmers is the ability to switch the level of abstraction they operate from high to low and vise-versa. You can't think on low level of computer architecture details if you never learned them.  If you never studied one or more computer architectures and programmed in assembler.  That also means that high level language skills of such people are in many ways deficient.

If a student has the real desire to obtain an intimate knowledge of the computer architecture, then compiler construction is the way to go. You will need to understand instruction set,  programming conventions, learn efficiently manipulate registers bitfields, work with different and sometimes pretty complex programming structures and so on.  It you look at Knuth Books (for example volume 3 of  The Art of Computer Programming) and compare implementation of some sorting algorithms in MIX and high level language, you will see that many important algorithms can be expressed in  assembler almost as compactly as in higher level languages.

This multitude of abstraction levels and ability to create you own set of abstractions for solving particular problem is really what computer science are all about and in compiler writing many algorithms-related topics arise naturally and stimulate deeper learning of algorithms and data structures. We seldom learn things we do not need ;-).

To make a parallel with math, I see compiler writing as closer to a pure computer science, as opposed to applied computer science. It's not accidental the famous  The Art of Computer Programming  by Donald Knuth initially was  intended to be a book about writing compilers. The upper layers of software: applications, GUI, etc. probably can be called applied computer science. I mean, a biologist programming a genome decoding software is mainly 'using' the computer technology, for example regular expressions engine.  But if he get deep into it, he/she might modify or enhance regular expression engine to suit his/her needs better and achieve higher productivity. Generic regular expression language in this case is just a starting point, not the end of the journey.

Thus compiler writing is more about computer and algorithms by themselves and as such are IMHO more attractive to those who is interested in those areas.

### The key idea of "compiler construction as a programming paradigm" movement is the creation of a set of domain specific operations and data types as well as the language that is capable operating on them

The key point here is that many programming problems are better solved  by creating a specialized language, working with a set of abstract operations (you can call it "abstract, domain specific machine", as the term "virtual machine" is already taken).  The validity of this approach was demonstrated many time in very different contexts.

That why I would like to call compiler construction a programming paradigm, not just a pretty obscure and semi-forgotten area of computer science.  Moreover compiler construction was and probably still is the source of many neat ideas in programming in general. Please remember that coroutines -- the fundamental idea of modern programming were first invented by Conway (see M. E. Conway, Design of a Separable Transition-Diagram Compiler, Comm, A.C.M., 6(7), pp. 396-408. 1963) for the explicit purpose of writing a compiler and only much later had found its way to other areas of programming (first to simulation languages, then to operating systems via pipe concept that was pioneered in Unix, then to high level languages such as Modula and now to scripting languages such as Python and Ruby). BTW Conway wrote his compiler in assembler.  The famous PL/C compiler by Richard W. Conway and Thomas R. Wilcox was written in assembler too.

That suggests that assembler language was and still is constructive to viewing the problem in a different light, than when you use just a high-level language and that mixed language approach to programming might be better than a monolithic approach.  Currently the most acceptable way to write a compiler for a novice is to use combination of TCL and C with manually written lexical analyzer (in C) and a parser and code generator mainly in TCL (if the language will not be still born, it is always possible to rewrite the parser in C later for efficiency).

Programming language is just a tool and in no way one should use the only tool for a complex project. I would reiterate the value of studying assembler as a necessary step in obtaining solid programming education and would like to remind a reader that one probably will never be able to find a better book to study this important programming techniques than the first volume of TAOCP, in which Donald Knuth used a simple artificial assembler for a hypothetical computer called MIX.

### Some suggestions

For those who would like to try this approach I have several suggestions:

• Be skeptical about excessive formalism and too complex formal models (including but not limited to attribute grammars). They are useful in moderation but many instructors tend to stress them too much because they do not know better and that 's what is required to get published. Be skeptical about your instructor and use additional books to see bigger picture. If the subject became incomprehensible, it's often the fault of the instructor. A good book or two can save your butt: use it.

• Get some old books, the books written before current OO overhype and fashion for obscure mathematic notations.  Some of them are less then a dollar on Amazon (plus shipping). There are several books which implemented full compiler from some small language in C or C++. Those are very valuable teaching tools not only for compiler writing for programming too as compiler writers belong to the top echelon of computer programmers hierarchy. Those guys are not your regular commercial software writers who publish books about how to write Excel formulas or DevOps  ;-)

Generally be skeptical about authors who want to treat compiler as a task for which OO is the next best thing since sliced bread.  Especially dangerous are "obfuscators" -- authors who try to make simple things complex and complex impossible to understand. Often using as an instrument mathematical notation of some sort. The essence of mathematic is elegance and help in understanding, not the obscurity. This is true about such notations as BNF.

The same as abuse of mathematical formalisms can be said about abuse of OO. Compilers are not GUI interfaces and too much zeal in applying OO to this domain is counterproductive (but in moderation and in structuring of namespace OO language might be a useful tool).  Explicit operations with pointers and coroutines are a really useful tools in compiler writing so languages without them such as Java are inferior to languages that have them (C, C++, Perl).  Yes, you can write compiler in languages like Pascal or Java but it is more difficult and you need to use some unnatural constructs. Availability of bit operations is also important.

• Read Comp.compilers Usenet group and ask questions. You may get a very good answers that will help.  Also reading discussions will help you to understand many features of languages like C, C++, C#, Java much deeper. Often the most illuminating discussion about the usability and value of the a particular language feature is a discussion of compiler writers as they are the only one who understand the tradeoffs between expressive power of the language, convenience of programmer, potential difficult to find bugs and headaches of a compiler writer to implement the feature.  That does not mean the opinions of compiler writers are always rational (for example Wirth often went too far in simplification of the language and killed the future for it).

• It's better to use a hand-written lexical analyzers than generated (the initial version can be generated by any tool that produce "loop and switch" type of code).  IMHO all this abstract automata topics are a little bit detached from reality.  You can (and probably should generate the initial prototype using flex or similar tool but after that you need to write a simple "loop and switch" scanner manually. Hand-written loop and switch scanner is still a simple, easily understandable program. It  can be made more flexible than any generated scanner and can have better error diagnostic capabilities. What is more important is that this approach gives you an important leverage in structuring your syntax analyzer. For example you can use various tricks including complex look-ahead approaches in scanner to simplify parsing.

• Learn recursive decent parsing first. It is the most transparent and can be well mated to code generation. Use http://www.citeseer.com to find code samples and papers that discuss this technology.  Pascal was specifically structured as language suitable for recursive decent parsing. Look at Wirth's Oberon0 compiler in his Compiler Construction book. Only after you mastered the concepts inherent in recursive decent parsing and its connection to code generation you can start playing with "shift and reduce" algorithms. Bison is a good playground for that.

• You need to distinguish "language in the large" -- structure of the language and "language in the small" (arithmetic, Boolean and other expressions). Different methods can be used to parse those languages. For example to improve the quality of diagnostic (modern compilers sucks big way in comparison with handcrafted masterpieces such as PL/C or IBM PL/1 Debugging Compiler)  for example recursive decent can be used for "language at large" and LR grammars can be used for parsing expressions.  Semi-forgotten Floyd-Evans language provides an interesting experimental tool for this.

• Tree-like representation of data at intermediate stages based on subset of XML  has some advantages. In many cases you can generate a simplified version of XML (for example XHTML with an appropriate stylesheet) as an intermediate representation of your data stream.  First of all this instantly makes your representation readable and that is a tremendous advantage for any complex problem. XML is less readable than XHTML with stylesheet, but there are very good XML editors  which help to visualize what you are generating.  Moreover this hierarchical representation is very flexible and can be easily adapted to a very wide variety of problem domains. Here you can also benefit from a current trend to use XML as an intermediate representation of data streams and even hide compiler-based approach from incompetent bosses.

• If you need to optimize your code, peephole optimization is a very simple and nice method to improve the quality of the generated output.  The peephole is a small moving window on the target program. Instructions are optimized only considering the instructions in the peephole. Peephole optimization is applied over the while target program, moving the peephole window. Several passes can be used if necessary.  It can be adapted to intermediate XML or XHTML+stylesheet  representations and tasks different  then code generation. Actually peephole optimization can be productively used for any XML-based intermediate representation of the program.  Some people even tried to used XSLT to generate code in an application software project like "template driven code generator". That's a questionable approach, but for the narrow task of peephole optimization might be OK.  See Soumen Sarkar experience of using XSLT and the book Program Generators with XML and Java for more information.

• Use a scripting language. suitable choices are Ruby and Python, to lesser extent Perl (the problems with Perl is that it does not have a clean interface to C as a part of your implementation. The use of scripting language bridges the semantic gap between an application specification and its implementation, offers economies of scale when implementing repetitive concepts, and can result in code that is readable, more reliable, compact, easy to maintain, and concisely documents the overall structure of the application.

• Use multipass design. Intermediate representation of each pass written as a file is important debugging tool. This is especially important on the code generation phaze.

• Learn regular expressions. This is a very useful tool in compiler writing. Perl has very good integration of regular expression into the language so you might benefit from experimenting with using Perl at code generation phase.  but Python while weaker is also adequate.

• If you want to generate native doe generate assembler. Generating binary code is often fun but outside simple problems and imbedded software it too difficult to debug to justify the effort.

Dr. Nikolai Bezroukov

 Top Visited

Your browser does not support iframes.

Switchboard Latest Past week Past month

## Old News ;-)

To preserve bandwidth for humans as opposed to robots older News (1999-2003) now are moved in a separate subdirectory. Sorry for any inconvenience.

 2011 2010 2009 2008 2007 2006 2005 2004 1999-2003

#### [Nov 23, 2015] Knuth Recent News

##### So far only MMIX can be ordered in ePub and PDF formats

Announcing the first Art of Computer Programming eBooks

For many years I've resisted temptations to put out a hasty electronic version of The Art of Computer Programming, because the samples sent to me were not well made.

But now, working together with experts at Mathematical Sciences Publishers, Addison-Wesley and I have launched an electronic edition that meets the highest standards. We've put special emphasis into making the search feature work well. Thousands of useful "clickable" cross-references are also provided --- from exercises to their answers and back, from the index to the text, from the text to important tables and figures, etc.

Note: However, I have personally approved ONLY the PDF versions of these books. Beware of glitches in the ePUB and Kindle versions, etc., which cannot be faithful to my intentions because of serious deficiencies in those alternative formats. Indeed, the Kindle edition in particular is a travesty, an insult to Addison-Wesley's proud tradition of quality.

Readers of the Kindle edition should not expect the mathematics to make sense! Maybe the ePUB version is just as bad, or worse --- I don't know, and I don't really want to know. If you have been misled into purchasing any part of eTAOCP in an inferior format, please ask the publisher for a replacement.

The first fascicle can be ordered from Pearson's InformIT website, and so can Volumes 1, 2, 3, and 4A.

MMIXware

Hooray: After fifteen years of concentrated work with the help of numerous volunteers, I'm finally able to declare success by releasing Version 1.0 of the software for MMIX. This represents the most difficult set of programs I have ever undertaken to write; I regard it as a major proof-of-concept for literate programming, without which I believe the task would have been too difficult.

Version 0.0 was published in 1999 as a tutorial volume of Springer's Lecture Notes in Computer Science, Number 1750. Version 1.0 has now been published as a thoroughly revised printing, available both in hardcopy and as an eBook. I hope readers will enjoy such things as the exposition of a computer's pipeline, which is discussed via analogy to the activites in a high tech automobile repair shop. There also is a complete implementation of IEEE standard floating point arithmetic in terms of operations on 32-point integers, including original routines for floating point input and output that deliver the maximum possible accuracy. The book contains extensive indexes, designed to enhance the experience of readers who wish to exercise and improve their code-reading skills.

#### The MMIX Supplement

I'm pleased to announce the appearance of an excellent 200-page companion to Volumes 1, 2, and 3, written by Martin Ruckert. It is jam-packed with goodies from which an extraordinary amount can be learned. Martin has not merely transcribed my early programs for MIX and recast them in a modern idiom using MMIX; he has penetrated to their essence and rendered them anew with elegance and good taste.

His carefully checked code represents a significant contribution to the art of pedagogy as well as to the art of programming.

#### [Jan 07, 2013] On the translation languages from left to right, Knuth, D.E

Scanned version of Knuth, D.E., On the translation of languages from left to right, Information and Control 8, 607-639 (1965) Here are pages 607-616.

#### [Jan 07, 2013] WikiBook Compiler construction

##### This is a Wikipedia book, a collection of Wikipedia articles that can be easily saved, rendered electronically, and ordered as a printed book
###### Wikipedia
Introduction
Compiler construction
Compiler
Interpreter
History of compiler writing
Lexical analysis
Lexical analysis
Regular expression
Regular expression examples
Finite-state machine
Preprocessor
Syntactic analysis
Parsing
Symbol table
Abstract syntax
Abstract syntax tree
Context-free grammar
Terminal and nonterminal symbols
Left recursion
Backus–Naur Form
Extended Backus–Naur Form
TBNF
Top-down parsing
Recursive descent parser
Tail recursive parser
Parsing expression grammar
LL parser
LR parser
Parsing table
Simple LR parser
Canonical LR parser
GLR parser
LALR parser
Recursive ascent parser
Parser combinator
Bottom-up parsing
Chomsky normal form
CYK algorithm
Simple precedence grammar
Simple precedence parser
Operator-precedence grammar
Operator-precedence parser
Shunting-yard algorithm
Chart parser
Earley parser
The lexer hack
Scannerless parsing
Semantic analysis
Attribute grammar
L-attributed grammar
LR-attributed grammar
S-attributed grammar
ECLR-attributed grammar
Intermediate language
Control flow graph
Basic block
Call graph
Data-flow analysis
Use-define chain
Live variable analysis
Reaching definition
Static single assignment form
Dominator
C3 linearization
Intrinsic function
Aliasing
Alias analysis
Array access analysis
Pointer analysis
Escape analysis
Shape analysis
Loop dependence analysis
Program slicing
Code optimization
Compiler optimization
Peephole optimization
Copy propagation
Constant folding
Sparse conditional constant propagation
Common subexpression elimination
Partial redundancy elimination
Global value numbering
Strength reduction
Bounds-checking elimination
Inline expansion
Return value optimization
Unreachable code
Redundant code
Superoptimization
Loop optimization
Induction variable
Loop fission
Loop fusion
Loop inversion
Loop interchange
Loop-invariant code motion
Loop nest optimization
Manifest expression
Polytope model
Loop unwinding
Loop splitting
Loop tiling
Loop unswitching
Interprocedural optimization
Whole program optimization
Lazy evaluation
Partial evaluation
Profile-guided optimization
Automatic parallelization
Loop scheduling
Vectorization
Superword Level Parallelism
Code generation
Code generation
Name mangling
Register allocation
Chaitin's algorithm
Rematerialization
Sethi-Ullman algorithm
Data structure alignment
Instruction selection
Instruction scheduling
Software pipelining
Trace scheduling
Just-in-time compilation
Bytecode
Dynamic compilation
Dynamic recompilation
Object file
Code segment
Data segment
.bss
Literal pool
Relocation
Library
Static build
Architecture Neutral Distribution Format
Development techniques
Bootstrapping
Compiler correctness
Jensen's Device
Man or boy test
Cross compiler
Source-to-source compiler
Tools
Compiler-compiler
PQCC
Compiler Description Language
Comparison of regular expression engines
Comparison of parser generators
Lex
Flex lexical analyser
Quex
Ragel
Yacc
Berkeley Yacc
ANTLR
GNU bison
Coco/R
GOLD
JavaCC
JetPAG
Lemon Parser Generator
LALR parser generator
ROSE compiler framework
SableCC
Scannerless Boolean Parser
Spirit Parser Framework
S/SL programming language
SYNTAX
Syntax Definition Formalism
TREE-META
Frameworks supporting the polyhedral model
Case studies
GNU Compiler Collection
Java performance
Literature
Compilers: Principles, Techniques, and Tools
Principles of Compiler Design
The Design of an Optimizing Compiler

#### [Dec 04, 2011] Comp.compilers Re Need an interesting topic for an undergraduate project on Compilers

 From: Christophe de Dinechin Newsgroups: comp.compilers Date: Sun, 4 Sep 2011 23:42:48 -0700 (PDT) Organization: Compilers Central References: 11-08-006 Keywords: courses Posted-Date: 06 Sep 2011 22:12:39 EDT

On Aug 6, 7:28 pm, amit karmakar <amit.codenam...@gmail.com> wrote:
> I would like to have some suggestions as to what *new* and
> *innovative* project i can do which are based on compiler design.

Innovation in compilers can happen at a number of levels :

1. Parsing techniques, grammars, etc. Very active research a while
back, considered (erroneously methinks) as dead by most today, who
happily use flex/bison and don't think twice about it.

2. Language design. One of the active areas these days is "domain
specific languages" or DSLs, i.e. languages designed for one specific
need. Often using "meta-programming" techniques (programs that
generate programs)

3. Type systems, proofs, program validation. Languages like Haskell
use type inference, so that you don't have to specify types yourself
most of the time. C++ recently gained the "auto" keyword for types.
DSLs pose a new class of interesting problems in that space.

4. Intermediate representations, code generation and optimization
frameworks. The king of this hill these days IMO is LLVM. But there
are a number of contenders. If you are interested in optimizations,
that's the right place to look at.

5. Runtime support : garbage collectors, just-in-time code generation,
parallel execution, use of new hardware such as GPUs, 

6. Support for innovative hardware, hardware generation, hardware/
software co-design, etc. If you are more into silicon, this is a very

My own pet project, XLR (http://xlr.sf.net) offers a number of
innovations in the first three of these areas. It is a language
designed to grow with the user, i.e. the objective is to make it as
easy to add language constructs as it is to add, say, functions or
classes in other languages.

Regarding parsing, it generates a parse tree made of exactly 8 nodes :
integer, real, text and name/symbol represent leaves of the tree,
infix, prefix, postfix and block represent inner nodes. This makes it
possible to write programs in a very natural-looking way, yet with an
internal program representation that is easy to manipulate. This is
the foundation of XL meta-programming / DSL capabilities.

To validate that, XLR has practically no built-in constructs. It has
constructs to connect to LLVM primitives, constructs to connect to C
code, and a pair of "rewrite" constructs, notably ->, to transform one
tree shape into another. For example :

extern bool puts(text);
(x:integer - y:integer):integer -> opcode Sub

repeat 0, B -> true
repeat N, B -> B; repeat N-1, B

repeat 25,
puts "Hello"

You can check the code generated for the above with xlr -tcode -O3
tests/09.Compiler/optimized-repeat-loop.xl. LLVM actually turns it
into a sequence of 25 calls to puts, you can hardly do better.

The most active area of research for XLR these days is its type
system. In order to generate efficient code, an Haskell-like type
inference mechanism is in place. But the standard type inference
algorithms must be extended, because there are a few additional
transformations compared to lambda calculus (not just "alpha" and
"beta"), and the closest there is to a type is the shape of a tree
(e.g. "if X then Y else Z").

Since it uses LLVM, it is also an interesting way to learn a little
about LLVM, but it's not intended as an LLVM tutorial.

So if you are interested in experimenting with "growing a language" in
a text-based framework, XLR is the right way to go. There are other
projects that are more advanced e.g. if you want to build the IDE at
the same time, see for example JetBrain's Meta Programming System. But
they are not as strong in language development per-se, I believe.

#### [Dec 04, 2011] Roslyn Project Overview

Microsoft Developer Network

October 2011

Traditionally, compilers are black boxes -- source code goes in one end, magic happens in the middle, and object files or assemblies come out the other end. As compilers perform their magic, they build up deep understanding of the code they are processing, but that knowledge is unavailable to anyone but the compiler implementation wizards and it is promptly forgotten after the translated output is produced.

For decades, this world view has served us well, but it is no longer sufficient. Increasingly we rely on integrated development environment (IDE) features such as IntelliSense, refactoring, intelligent rename, "Find all references," and "Go to definition" to increase our productivity. We rely on code analysis tools to improve our code quality and code generators to aid in application construction. As these tools get smarter, they need access to more and more of the deep code knowledge that only compilers possess. This is the core mission of the Roslyn project: opening up the black boxes and allowing tools and end users to share in the wealth of information compilers have about our code. Instead of being opaque source-code-in and object-code-out translators, through the Roslyn project, compilers become services-APIs that you can use for code related tasks in your tools and applications.

The transition to compilers as services dramatically lowers the barrier to entry for creating code focused tools and applications. It creates many opportunities for innovation in areas such as meta-programming, code generation and transformation, interactive use of the C# and VB languages, and embedding of C# and VB in domain specific languages.

The Microsoft "Roslyn" CTP previews the new language object models for code generation, analysis, and refactoring, and the upcoming support for scripting and interactive use of C# and Visual Basic. This document is meant to be a conceptual overview of the Roslyn project. Further details can be found in the walkthroughs and samples included in the Roslyn CTP.

Summary

Project Roslyn exposes a set of Compiler APIs, Scripting APIs, Workspace APIs, and Services APIs that provides rich information about your source code and that has full fidelity with the C# and Visual Basic languages. The transition to compilers as a service dramatically lowers the barrier to entry for creating code focused tools and applications. It creates many opportunities for innovation in areas such as meta-programming, code generation and transformation, interactive use of the C# and VB languages, and embedding of C# and VB in domain specific languages.

#### [Oct 27, 2011] Writing a Programming Language in Perl

###### Oct 25, 2011
programmer99 has asked for the wisdom of the Perl Monks concerning the following question:
Hello. This is my first post, so please forgive me for any mistakes I make.

I have been programming for about 16 years now and Perl has always been one of my absolute favorite languages. I have been wondering how I can make a Programming language using Perl, only Perl and nothing but Perl. How would I go about doing this?

I have never created a language before. I know the basics of what a compiler is and how it works. But I don't know how to make one. After I make the language (which I don't know how to do) how do I make the language work with the compiler?

I truly do not want to read another book, as these can be extremely unhelpful. There really aren't any good tutorials that I can find. Can anyone give me some kind of tutorial or a point in the right direction? Thanks

P.S.- No mistakes that I can find! I'm good at this :-)

BrowserUk (Pope)

See Exploring Programming Language Architecture in Perl by Bill Hails. for one of the most complete explorations of doing exactly what the title suggests.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". >

In the absence of evidence, opinion is indistinguishable from prejudice.

BUK's suggestion is quite good. IIRC, it's a fun manuscript.

I've been meaning to suggest Marpa to you for awhile. While still marked experimental it should be excellent for this kind of thing. I've wanted to play with it myself for a long time but tuits being what they are...

Marpa would be used, I'd expect, to convert a language spec/grammar to Perl and then you could use any number of strategies to freeze the resulting code (in memory, file, some other specialized cache) until the original source changed, at which point it could be "recompiled." Doing it this way should afford an easy path for testing and building up a grammar piece by piece. If you don't write tests as you go, you will be sorry down the road. If you do write tests, you'll be able to breeze through wrong turns and accidentally introduced bugs with confidence.

programmer99
Ok, I have a few questions (if you don't mind). What do you mean by grammar? How will Marpa help me? Thanks.

As I say, tuits in the round are the issue and without the time to cook up some working example I'd just be glossing the various documentation and BNF/ParseRec stuffage. I think a simple example can be cooked up but it's also not entirely trivial or my forte (hint, hint to others who might have something to show). This is quite interesting and I've been waiting to try some Marpa though. If I can grab an hour tonight I'll revisit this otherwise, the weekend, otherwise... uh, what were we talking about?

ikegami

The grammar of a language defines its syntax. Parsers (e.g. Marpa) take a sequence of bytes, characters or tokens, check if it conforms to the grammar, and assigns meaning to the components of the sequence as per the grammar.

For example, the job of the parser is to receive "a+b*c" and return "a multiplication of ( an addition of ( identifier a ) and (identifier b ) ) and ( identifier c )".

ikegami

Marpa - Parse any Language You Can Describe in BNF

Hopefully, it can do a bit more than that, because associativity cannot be described in BNF.

davies

Have a look at Jonathan Worthington's slides (http://www.jnthn.net/papers/2011-yapc-russia-compiler.pdf). The code is next door on his web site. But as this was a public talk, there may be a recording of it on YAPC.tv. My spies tell me that the talk was outstanding, writing a complete language (well, almost) from next to nothing in under an hour.

Regards,

John Davies

ikegami

If you're interested in parsing and compiling, consider getting your hands on Compilers: Principles, Techniques, and Tools. You won't find any Perl in it, though.

Corion (Saint)

In my opinion a very good book is Compiler Construction (pdf warning) by Niklaus Wirth. I've only used the German edition from 1986, but it introduces a small language and how to build a parser and compiler for it. It only teaches you the technique of recursive descent. Recursive Descent is a technique that is currently enjoying a comeback because it is easier with it to produce meaningful error messages than it is when using lex and yacc or other parsing tools. The German book has 115 pages, the English and more recent edition (from 1996) has 130 pages. This seems to me a good starting point in the sense that little previous knowledge is needed and it gets you a feel for implementing some language with Pascal-like syntax. Actually, in the English PDF, it's an Oberon-like syntax, but that's still close enough.

When I first read the book, I skipped over the chapters 1 to 4, as they concern themselves with the theoretical background of the languages that the approach can handle. Looking back over the chapters, I'm not sure whether it was really necessary to skip over them, as they don't seem as dry to me as they did back then. On the other hand, I now have much more experience and also learned the theory of regular languages and context-free languages in university.

Browsing through the PDF, I see that it also mentions bottom-up (yacc-style) parsing and gives a rough outline of it.

andal:

The work of Crenshaw at http://compilers.iecc.com/crenshaw/ appears to be simpler. But YMMV.

spx2
There's a lot of theory about "bla grammars bla bla translators bla ... context-free context-dependent bla bla etc".
You can just throw that Dragon Book out the window and study the grammar of some small language like Basic or JavaScript's grammar to learn stuff
(here you can find a lot of grammars to many many languages).
This is not rocket science, it's engineering.
So I totally agree with you about not wanting to read 9999 books on the subject, you need to build something.
I'm not in any way a compiler/language expert but I do feel that you want to get down to business
so.. here's Parse::Yapp, here's an article , here's another article , here's some benchmarks and comparisons between different parser generators for Perl,
and another comparison , and another article. If you have further questions, I'm sure the monastery has lots of monks with knowledge of Parse::Yapp
Now go write a language and good luck !
hbm (Pilgrim:)
For ideas, you might look at Brainf*ck, which has a simple grammar of: <>+-[],.

Speaking of which, this is worth revisiting for anyone interested or that missed it the first time around: Ook interpreter.

sundialsvc4

Creating a programming language from scratch is an interesting exercise. I not only did that, but sold umpteen thousand copies of a system that used uses it. (So there...) But I dunno if I would implement the entire language in Perl. If you do it, I'd love to see it.

JavaFan
I'd start by first thinking what your new language should do. Is it going to be a domain specific language? Or a general purpose one? If the latter, what is its strength going to be? Is it going to be procedial? functional? OO? Block based? List based? Something else?
Second, make some trade-offs. Do you want to write a simple compiler/interpreter? (That probably means a simple syntax, not many features in your language). Does the compiler have to be run fast? Or do you want to give much power to the user of the language (resulting in a more complex compiler)?

Only then I'd worry about the implementation.

Oh, and my step 0 would be is to ponder about "if I really haven't a fucking clue on how to do that, is my disdain to read a book justified"?

spx2:
well, just look over Parse::Yapp, that would be your entry point.
JavaFan (Abbot)
In a way, I have already thought this out. The issue I am having now how to get started with the actual code.

So, your question really is, I know what I want to do, but I'm not telling you what it is, yet I ask from you how to get started?

Is that how you ask for directions in an unknown city as well? You ask I want to go somewhere, in which direction do I go?

Anonymous Monk
LOL *cough* *cough* aXML *cough* *cough*
Anonymous Monk
Parsing Strings and Trees with Parse::Eyapp (An Introduction to Compiler Construction
• http://search.cpan.org/dist/HOP-Lexer/lib/HOP/Lexer/Article.pod Lexing Without Grammars: When Regular Expressions Suck
• http://hop.perl.plover.com/Examples/ALL/calculator
• http://hop.perl.plover.com/Examples/ALL/regex-parser
• http://hop.perl.plover.com/linogram/
• #### [Dec 29, 2010] Making C compiler generate obfuscated code

##### The thread contains a very interesting discussion...
###### Dec 07, 2010 | Comp.compilers

About my little attempt to hack Tiny C compiler's codegenerator so it produces obfuscated code:
http://blogs.conus.info/node/58

Martin Ward <martin@gkc.org.uk> :

On Thursday 16 Dec 2010 at 15:23, Joshua Cranmer <Pidgeot18@gmail.com> wrote:

> I'm not sure what work has been done on deoptimization (perhaps anyone
> with the Hex-Rays decompiler could tell us?), but some of the
> optimization techniques seem relatively easy to reverse.
> > From a purely theoretical standpoint, obfuscation that adds
> non-executing code is going to be less difficult to reverse engineer
> than obfuscation that does the same thing, just... less obviously.

A major theoretical result in this area is the paper "On the (Im)possibility of Obfuscating Programs" by Boaz Barak et al, published in: CRYPTO '01 Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology. Boaz Barak gives a non-technical description of the results and their meaning here:

The FermaT program transformation system can interactively transform code into semantically equivalent forms: including restructuring, simplification, dataflow analysis , SSA construction, slicing and, in simple cases, abstraction to a specification. The FermaT engine and graphical front end runs on Windows and Linux and can be downloaded from here:

or:

FermaT's primary commercial application is migrating assembler to structured and maintainable C and COBOL, so the "deobfuscation" transformations are geared towards removing the "clever tricks" introduced by assembler programmers to save a byte here and a cycle there: or just because they were

-- Martin

STRL Reader in Software Engineering and Royal Society Industry Fellow martin@gkc.org.uk http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4 G.K.Chesterton web site: http://www.cse.dmu.ac.uk/~mward/gkc/ Mirrors: http://www.gkc.org.uk and http://www.gkc.org.uk/gkc

#### [Dec 29, 2010] Compiler construction by Prof. O. Nierstrasz

##### Fall Semester 2008
 01Intro.pdf 17-Sep-2008 09:49 2.8M 02Lexical.pdf 17-Sep-2008 16:38 853K 03Parsing.pdf 01-Oct-2008 09:55 1.7M 04ParsingPractice.pdf 17-Sep-2008 16:39 757K 05SemanticAnalysis.pdf 20-Oct-2008 14:49 1.2M 06IntermediateRepresentation.pdf 17-Sep-2008 16:40 456K 07CodeGeneration.pdf 28-Oct-2008 20:52 1.1M 08IntroSSA.pdf 09-Dec-2008 14:29 1.0M 09Optimization.pdf 09-Dec-2008 14:29 446K 10PyPy.pdf 19-Nov-2008 10:57 1.0M 11PEGs.pdf 29-Oct-2008 09:52 472K 12DSL.pdf 02-Dec-2008 22:20 1.5M 12DSL.zip 02-Dec-2008 22:26 21M 13ProgramTransformation.pdf 26-Nov-2008 12:08 2.3M
Apache

Compiler Construction

#### [Dec 28, 2010] CS 375 Compilers Lecture Notes

Permission is granted for individuals to make copies of these notes for personal use, or for instructors to make copies for classroom use.

Note: Many of these pages use math symbols such as: &forall &exist . Microsoft Internet Explorer will not display the math symbols, but Firefox will.

28. Floating Point Numbers

29. IEEE Floating Point Standard

30. Floating Point Examples

78.

80.

94. Auxiliary C Code

95. Auxiliary C Code ...

96. Example

361. Expression Trees to English

364. ATN in Lisp

366. Grammar Compiler

371. Physics Queries

#### [Dec 27, 2010] Let's Build a Compiler

It's available in two formats, plain text, and with printer control characters so it will print reasonably on an Epson printer.

#### [Apr 6, 2009] LEPL 2.3

LEPL is a recursive descent parser library written in Python. It is based on parser combinator libraries popular in functional programming, but also exploits Python language features. Operators provide... a friendly syntax, and the consistent use of generators supports full backtracking and resource management. Backtracking implies that a wide variety of grammars are supported; appropriate memoisation ensures that even left-recursive grammars terminate

#### [Feb 21, 2008] Project details for Bare Bones interpreter

freshmeat.net

BareBones is an interpreter for the "Bare Bones" programming language defined in Chapter 11 of "Computer Science: An Overview", 9th Edition, by J. Glenn Brookshear.

Release focus: Minor feature enhancements

Changes:

Author:
Eric Smith [contact developer]

#### [Jan 22, 2008] freshmeat.net Project details for tinyap

tinyap is a recursive descent parser with backup that outputs an abstract syntax tree (AST). Unlike in most parsers, the grammar is data. Tinyap uses an AST that represents a grammar to parse its input text. The factory default for the grammar is tinyap's grammar description language itself, so one can parse a grammar description and directly use the parse output to parse some other text written in the described language. Tinyap also features a plugin mechanism for grammars, which allows for dynamic modular grammars. Finally, it provides an interface to walk down the ASTs and to write external plugins to visit the nodes.

Release focus: Initial freshmeat announcement

#### [Aug 9, 2007] freshmeat.net Project details for Parsing.py Parser Generator

The Parsing module is a pure-Python module that implements an LR(1) parser generator, as well as CFSM and GLR parser drivers. From an algorithmic perspective, this is one of the most advanced parser generators in existence.

Release focus: Initial freshmeat announcement

Changes:
Python 2.4 is now supported, in addition to Python 2.5.

Author:
Jason Evans [contact developer]

#### Using Prolog to Compile Things - comp.compilers Google Groups

Newsgroups: comp.compilers From: f...@cs.mu.OZ.AU (Fergus Henderson) Date: 1999/05/21 Subject: Re: Using Prolog to Compile Things Reply to author | Forward | Print | Individual message | Show original | Report this message | Find messages by this author

Nick Roberts <nickrobe...@callnetuk.com> writes:
>Has anyone on this ng experience or knowledge of the use of Prolog to
>implement a native-code compiler for a typical high-level imperative
>language? I am toying with the idea of using Prolog for the lexer, the
>parser, the intermediate (library) code generator, and the end-code
>generator (and even, in effect, for linking!), i.e. the 'whole shebang'.
I have written a couple of compilers in Prolog. In many ways, Prolog
is an excellent language for writing compilers, but it does have some

Much of the task of compilation is manipulating trees of various
kinds, and this is a task which Prolog and other logic or functional
languages do very well.

Another advantage of Prolog is the use of unbound variables and
partially instantiated data structures. Often during code generation
you may want to make several passes over some data structure (e.g. the
parse tree), filling in the values of different attributes with each
pass. In Prolog you can leave uninitialized attributes as unbound
variables, and have each pass bind the appropriate variables. This
contrasts with some languages such as ML or Mercury where you would
normally initialize the attributes with dummy values and then each
pass would make a copy of the parse tree in order to set the new
attributes.

Prolog does have some significant disadvantages. One is that Prolog
is often quite inefficient. For example, it's very difficult to get a
lexer written in Prolog to run fast.

Another disadvantage is that Prolog doesn't have records with named
fields. If you want access fields by name, then you need to write a
bunch of named access predicates, which is quite tedious. (Existing
Prolog implementations generally don't do inlining, so using named
access predicates also exacerbates the efficiency problems.)

Another disadvantage, probably more important than the previous two,
is that Prolog has very little in the way of static checking. This
works OK for small projects, but makes things very difficult when
writing large systems. If you plan to write 5000 lines of code or
more, then I would very strongly recommend using a language with
static type checking and a proper module system (some Prolog systems
do have module systems, but they are not standardized and vary
significantly between different Prolog systems). And Prolog's support
for unbound variables and nondeterminism, although useful, can also
cause a lot of problems if you accidentally forget to initialize a
variable or if you accidentally introduce nondeterminism. Other logic
programming languages such as Mercury and Visual Prolog (which,
despite the name, is quite different from Prolog) do a lot better
here.

If you do use Prolog, then I strongly recommend that you document the
types, modes, and determinism of the predicates in your program very
carefully. This is especially important if you ever want anyone other
than yourself to maintain the program. But compilers are usually not
small projects, so I think you would probably be better off choosing a
language which has more support for static checking, such as Mercury.

Of course, I'm one of the developers of Mercury, so my opinion in that
respect is no doubt biased ;-). But Mercury was developed with my
experience of implementing compilers in Prolog in mind, and so it was
designed to address many of Prolog's faults in that area. The Mercury
compiler is written in Mercury, so it has certainly been stress-tested
in that area.

> I'm particularly interested in the idea of using Prolog's natural
> searching abilities to search for truly optimal code.
I think this is a mirage. It's pretty easy to express searching
algorithms in any language. But finding _feasible_ algorithms to
produce "truly optimal code" is going to be difficult in any language.
Prolog's searching abilities won't really help much here.
--
Fergus Henderson <f...@cs.mu.oz.au>
WWW: <http://www.cs.mu.oz.au/~fjh>
PGP: finger f...@128.250.37.3

[It's my impression that in too many cases the only way to find perfectly
optimal code would be to enumerate and check an impractically large set
of possibilities. -John]

#### Compiling Little Languages in Python - Aycock (ResearchIndex)

Abstract: "Little languages" such as configuration files or HTML documents are commonplace in computing. This paper divides the work of implementing a little language into four parts, and presents a framework which can be used to easily conquer the implementation of each. The pieces of the framework have the unusual property that they may be extended through normal object-oriented means, allowing features to be added to a little language simply by subclassing parts of its compiler.

Compiler Transformations for High-Performance Computing DAVID F. BACON SUSAN L. GRAHAM, AND OLIVER J. SHARP

1.2 A Brief History of SETL Information about SETL compiler project

This perspective, as hinted in the first quotation above, sprang from the strong perception in the late 1960s that there was a need for a set-oriented language capable of expressing concisely the kind of set-intensive algorithm that kept arising in studies of compiler optimization, such as those by Allen, Cocke, Kennedy, and Schwartz [5,43,6,41,7,10,130,11,8,9,131,178,179,180,12,132,42]. Programming Languages and their Compilers [44], published early in 1970, devoted more than 200 pages to optimization algorithms. It included many of the now familiar techniques such as redundant code elimination and strength reduction, dealt extensively with graphs of control flow and their partitioning into intervals'', and showed how to split nodes in an irreducible flow graph to obtain a reducible one. Many workers in the 1970s and 80s besides those just mentioned identified SETL, directly or indirectly, as a language whose implementation was greatly in need of solutions to difficult compiler optimization problems [80,84,85,143,82,86,123,83,148,149,174,208,3,209]. SETL, while still far from the celestial sphere of pure mathematics, was nonetheless seen as occupying a very high orbit relative to other languages. It was SETL's distance from pure machines that made optimizing its implementations so important and at the same time so difficult.

The synergy between the study of code optimization and the high-level set language used for expressing optimization algorithms led to the SETL compiler project [186,177], which was itself an abundant source of optimization problems. The SETL project produced, among other things, the SETL optimizer [178,88,77], a 24,000-line prototype written in SETL. Unfortunately, on the machines of the day, it was too large to apply to itself. This was a pity because not only is SETL a language which could benefit greatly from a good optimizer, it is also one whose semantic simplicity makes it particularly amenable to the flow-tracing techniques of machine-independent code optimization. The absence of pointers alone circumvents the issue of aliasing, a huge advantage in this kind of analysis.

The sort of data flow (definition-use) information obtainable from analysis of control flow graphs, and more generally from Schwartz's value flow'' tracing [177,178,144] that could follow objects when they were stored in aggregates and later extracted, was useful in all sorts of ways. It sustained copy optimization [175,178,179], where the redundant copying of an object could be suppressed when the only subsequent use of the object also modified it, perhaps incrementally. Value flow analysis provided a dependency framework wherein the types of many variables and expressions could be deduced by a transitive closure process starting from the manifest types of literals and other forms [196]. This typefinding process in turn enabled the discovery of relationships of set membership and inclusion [180], which was itself a prelude to automatic data structure choice, because the way an object is used has a profound influence on how it should be implemented. Weiss and Schonberg [208,209] later showed how to do type inference even in the presence of infinite sets of possible types arising from actions such as x := {x}''.

Data structure representations had their own sublanguage, the DSRL, which served to annotate, but not otherwise modify, SETL programs coded at an appropriately abstract level. The DSRL was designed to permit a smooth transition from Schwartz's two-level programming regime, in which programmers supplied representational details, to a more fully developed system in which a sophisticated optimizer made the selections [175,56,174,88,77]. An important concept in the DSRL was that of base sets, which were implicitly defined objects that could in principle allow much representational sharing among the objects conceived by the programmer.

Value flow analysis, type inference, copy optimization, and deeper determinations of relationships such as set membership or inclusion between variables preparatory to automatic data structure selection all embody an approach to program analysis described by Sintzoff [189] and called abstract interpretation by Cousot and Cousot [47] or symbolic execution in Muchnick and Jones [149, p. xv]. The essence of this model is that any program P with well-defined semantics can be projected onto a more abstract program A capturing salient properties of objects in P in a manner susceptible of analysis. For example, the sign of a product can be deduced from the signs of its multiplicands without knowing their specific values. Similarly, result types for known operators can usually be gleaned at compile time from operand types regardless of the actual run-time values of those operands. In abstract interpretation, the abstract program A is exercised at P's compile time to discover desired properties of objects in P. The symbols in A combine and recombine according to an algebra appropriate for their purpose. If that algebra has been designed with feasible goals in mind, the exercise will converge. It is typical to ensure this termination by taking advantage of the fact that any set generated by inductive definitions (such as data flow equations) can be defined as the lattice-theoretic least fixed point of a monotone function. This often allows global properties to be inferred from local ones by a straightforward process of transitive closure.

The power and generality of abstract interpretation moved Paige and his colleagues to undertake an ambitious study of program transformations, which ultimately led to the APTS project [33,161,126,162]. The first of the main three transformations used in APTS is dominated convergence [32] for computing fixed points of Tarski [195,48] sequences (fi(1) : i = 0, 1, 2, ...for deflationary, monotone f) with reasonable efficiency. The second is finite differencing [157,164,158], which is a set-theoretic analogue of strength reduction that allows some expensive set operations within loops to be reduced to incremental updates by locating fixed points more quickly through the construction and maintenance of program invariants. The third transformation is real-time simulation [159,30,160,34,166] of an associative memory on an ordinary random-access memory (or with slight additional restrictions a mere pointer-access memory), which effectively automates the tedious programming activity of choosing efficient basings for sets.

Chung Yung has recently used finite differencing in a technique he calls destructive effect analysis, which seeks to incrementalize the copying of aggregates, in his purely functional programming language, EAS, a packaging of the typed -calculus as extended with homogeneous sets [210,211,212].

Transformational programming can be regarded as a formalization of Dijkstra's stepwise refinement [57,58]. As Bloom and Paige [28] point out, the transformational methodology is able to do much more than merely optimize code, or translate a SETL-like language into a C-like one. By helping the algorithm designer reason about time and space complexity in syntactic terms rather than only by means of low-level counting arguments, this technology has actually played a significant role in the invention of several new algorithms with greatly reduced asymptotic complexity compared to previous solutions [165,163,32,31,28,34,38,93], while it has rendered the algorithms themselves more perspicuous both to their inventors and to their students.

The next phase in the development of APTS will seek to improve both its reliability and its performance. Currently, all program transformations in APTS are proved correct (meaning-preserving) by hand, which is slow and error-prone. The hope is to integrate a meta-level proof verifier along the lines of Etna [36], an outgrowth of Cantone, Ferro, and Omodeo's work on fast decision procedures for fragments of finite set theory [37]. Alternatively, the model for an integrated verifier might be the SETL-like NAP [126] system, itself implemented in the SETL derivative Cantor [127,124]. Verification of assertions in, say, Hoare logic [113] would increase confidence in automatically applied transformations. Davis and Schwartz [50] showed how mechanical verification systems could extend themselves with new proof methods without violating soundness or changing the set of statements that could be proved.

The main existing impediment to the speed of APTS is the fact that its database of program property relationships, which is dynamically deduced using a static database of inference rules, must be recomputed after each application of a program transformation. What is being sought is an incremental rule database system that can be used to regenerate the relationship records efficiently after each rewriting operation [161]. Ultimately, it should be possible to apply APTS to itself for further large gains in speed [162], and to take advantage of the technique of partial evaluation [122] to realize a production-grade transformational system.

Recently, Goyal and Paige [94] have revisited the copy optimization problem for SETL and other high-level languages that exemplify Hoare's ideal of a pointer-free style of programming. By taking the well-known technique of dynamic reference counting to achieve lazy'' copying, and combining that with static liveness determination based on Schwartz's value flow analysis, they are able to optimize the placement of copy operations and of om assignments, the latter serving to decrement the reference counts of objects known to have no subsequent uses. They also prove the correctness of their alias propagation analysis and code transformations using formal semantics and abstract interpretation.

Goyal [91] has obtained a dramatic improvement in the algorithmic complexity of computing intra-procedural may-alias relations, again using dominated convergence and finite differencing. In his dissertation [92], he develops set-theoretic languages which can express both abstract specifications and low-level implementations in a form which uses a data structure selection method based on a novel type system to preserve the computational transparency that is necessary in order for statements about program efficiency to be meaningful. This is a cornerstone of the general transformational methodology.

Reliable software implementation using domain-specific languages by Diomidis Spinellis This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

The safety and reliability of products, processes, equipment, and installations is often critically affected by the software that controls their design and operation [WCD+98]. Software engineering stands out among other engineering disciplines because of its tight, haphazard, and fluid coupling with other elements of its operating environment. Mechanical, civil, or electrical engineers can base their designs on standardised and well understood specifications and constraints, and can design their implementations based on materials and components of known modalities and properties. In contrast to that, software engineers have to rely on specifications often expressed in the notation most suited to the application domain, design an artefact whose application changes significantly the environment it was designed for [Leh91], and rely on implementors whose output quality can vary up to a factor of 10 [SEG68]. These problems are of particular importance for safety critical applications where human life can be put at risk from malfeasant software designs and implementations.

Our approach for overcoming those problems is a software process architecture based on domain specific languages (DSLs). These allow the specification of critical parts of a software subsystem in the most appropriate formalism for the application domain, thus bridging the gap between the domain expert specifications and the software implementation, providing maintainability, and -- once a particular DSL has been correctly implemented -- ensuring consistent quality throughout the system's lifecycle. In the following paragraphs we illustrate this concept by detailing the usage of DSLs in the design and implementation of a civil engineering CAD application. The application is typical for the class described so far: it embodies a large amount of domain knowledge and its failures can lead to reliability and safety problems.

• ### BYACC which produces Perl/Tcl parsers

I wanted a set of Tcl/Perl yacc tools that I knew were based on the latest version of byacc that I knew of. I also wanted a minimal diff against that version of byacc. To that end, I've taken the latest versions of tyacc/pyacc that I found:
     tyacc-0.9.tar.gz
perl-byacc1.8.2.tar.gz
perl5-byacc-patches-0.5.tar
and applied the diff's to FreeBSD's 4.8-RELEASE yacc (which is really byacc.) Each of tyacc/pyacc/p5yacc is a separate tool -- removing the command line option specifying what language (Tcl/Perl) to generate simplified the diff's enormously. Basing the changes on FreeBSD's version of yacc did remove some changes that might be important to some people. The following features were removed:
• portability #ifdef's for STDC (porting back to MSDOS/Amiga/etc.) might need to be re-done
• portability for number of bits in a word
The Perl versions contain no new functionality or bug fixes. However, the Tcl version has been improved:
• yacc error recovery now works
• restored the default action for a production
• included working versions of YYABORT/YYREJECT/YYACCEPT/YYERROR
• references to "" and "$n" are always expanded without a leading "$" in actions
• use "\$" to indicate a literal "$" in actions
• removed the parse state/tree lists from the generated y.tab.tcl
• added a working calculator example using tcLex

#### [Mar 14, 2005] Key open-source programming tool due for overhaul Tech News on ZDNet

Almost all open-source software is built with GCC, a compiler that converts a program's source code--the commands written by humans in high-level languages such as C--into the binary instructions a computer understands. The forthcoming GCC 4.0 includes a new foundation that will allow that translation to become more sophisticated, said Mark Mitchell, the GCC 4 release manager and "chief sourcerer" of a small company called CodeSourcery.

"The primary purpose of 4.0 was to build an optimization infrastructure that would allow the compiler to generate much better code," Mitchell said.

Compilers are rarely noticed outside the software development community, but GCC carries broad significance. For one thing, an improved GCC could boost performance for the open-source software realm--everything from Linux and Firefox to OpenOffice.org and Apache that collectively compete with proprietary competitors from Microsoft, IBM and others.

For another, GCC is a foundation for an entire philosophy of cooperative software development. It's not too much of a stretch to say GCC is as central an enabler to the free and open-source programming movements as a free press is to democracy.

GCC, which stands for GNU Compiler Collection, was one of the original projects in the Gnu's Not Unix effort. Richard Stallman launched GNU and the accompanying Free Software Foundation in the 1980s to create a clone of Unix that's free from proprietary licensing constraints.

The first GCC version was released in 1987, and GCC 3.0 was released in 2001. A company called Cygnus Solutions, an open-source business pioneer acquired in 1999 by Linux seller Red Hat, funded much of the compiler's development.

But improving GCC isn't a simple matter, said Evans Data analyst Nicholas Petreley. There have been performance improvements that came from moving from GCC 3.3 to 3.4, but at the expense of backwards-compatibility: Some software that compiled fine with 3.3 broke with 3.4, Petreley said.

RedMonk analyst Stephen O'Grady added that updating GCC shouldn't compromise its ability to produce software that works on numerous processor types.

"If they can achieve the very difficult goal of not damaging that cross-platform compatibility and backwards-compatibility, and they can bake in some optimizations that really do speed up performance, the implications will be profound," O'Grady said.

What's coming in 4.0
GCC 4.0 will bring a foundation to which optimizations can be added. Those optimizations can take several forms, but in general, they'll provide ways that the compiler can look at an entire program.

For example, the current version of GCC can optimize small, local parts of a program. But one new optimization, called scalar replacement and aggregates, lets GCC find data structures that span a larger amount of source code. GCC then can break those objects apart so that object components can be stored directly in fast on-chip memory rather than in sluggish main memory.

"Optimization infrastructure is being built to give the compiler the ability to see the big picture," Mitchell said. The framework is called Tree SSA (static single assignment).

However, Mitchell said the optimization framework is only the first step. Next will come writing optimizations that plug into it. "There is not as much use of that infrastructure as there will be over time," Mitchell said.

One optimization that likely will be introduced in GCC 4.1 is called autovectorization, said Richard Henderson, a Red Hat employee and GCC core programmer. That feature economizes processor operations

#### [Jan 20, 2005] TACC Single Processor Optimization on the IBM Power4 System

Code optimization is an iterative process requiring an investment of time, energy, and thought on the part of the programmer. Performance tuning is recommended in the following situations:

1. production codes that are widely distributed and often used in the research community;
2. projects with limited allocation (to maximize available compute hours).

It is hardly worthwhile to optimize programs that will only be used once or twice, but it is important to optimize often-used application code. In addition, the lack of availability of compute hours is an incentive to decrease the consumption rate of the allocation, and hence will call for increasing the efficiency of one's production code.

## 2004

#### [Dec 10, 2004] Comp.compilers SUMMARY Static analyzers for detecting programming errors

static analyzers for detecting program errors cpdas@marlin.jcu.edu.au (1991-11-21) SUMMARY: Static analyzers for detecting programming errors cpdas@groper.jcu.edu.au (1991-12-05)

Newsgroups: comp.compilers From: cpdas@groper.jcu.edu.au (David Spuler) Keywords: errors, analysis, summary, bibliography Organization: Compilers Central References: 91-11-087 Date: Thu, 5 Dec 91 20:04:26 +1100

I received quite a large number of replies to my request for information

1) my work/ideas and

Hope it's useful.

David Spuler

MY OWN WORK
===========
Here is a summary of my own bug lists and references on the use of static
analyzers for detecting program errors. Myself, I implemented a full C
checker (better than Lint) as my 4th year Honours project - hence there is an
Honours thesis. Naturally, the bug lists and references have a C bias.

C Bugs
======
Expression Errors
-----------------
assignment instead of equality (if(x=y) versus if(x==y) )
confusing &, | and &&,|| operators
null effect statements (x<<1; instead of x<<=1)
order of evaluation errors (a[i++]=i)
side effects to 2nd/3rd operand of ||, && and ?:

Lexical Errors
==============
Multibyte character constants (novices use 'abc' instead of "abc")

Flow of Control Errors
======================
unreachable statements
use of uninitialized local variables
function return anomalies - return statements with and without expression
function has no return statement on some paths
constants in conditional tests (if(1>0)...)
accidental empty loops (bad semicolon) for(...); { }
missing break in switch statement

Library Usage Errors
====================
bad formats (and argument type) to printf/scanf/etc

Preprocessor Errors
===================
missing braces #define swap(i,j) temp =i; i=j; j=temp;
precedence - need brackets around whole expresion
precedence - need brackets around macro formal parameters
side effects in macro arguments

Declarations
=============
unused local variables
inner scope redeclarations hiding outer name
global problems - inconsistent use of fns etc - Lint does all this stuff

REFERENCES (papers from my Thesis bibliography)
(some aren't highly relevant to static checking, but the titles tell all)
=============================================================================
Spuler, D.A. "Check: A Better Checker for C", Honours Thesis, Dept of Computer
Science, James Cook University of North Queensland, 4811. AUSTRALIA

"Validation, Verification, and Testing of Computer Software",
ACM Computing Surveys, Vol. 14, No. 2, pp. 159-192.

R. E. Berry and B. A. E. Meekings (1985),
"A Style Analysis of C Programs",
Communications of the ACM, Vol. 28, No. 1, pp. 80-88.

R. E. Fairley (1978),
"Static Analysis and Dynamic Testing of Computer Software",
IEEE Computer, Vol. 11, No. 4, pp. 14-23.

L. D. Fosdick and L. J. Osterweil (1976),
"Data Flow Analysis In Software Reliability",
ACM Computing Surveys, Vol. 8, No. 3, pp. 305-330.

M. T. Harandi and J. Q. Ninq (1990), "Knowledge-Based Program Analysis",
IEEE Software, January 1990, pp. 74-81.

W. Harrison (1977),
"Compiler Analysis of Value Ranges for Variables",
IEEE Transactions on Software Engineering, Vol. 3, No. 3, pp. 243-250.

W. S. Hecht and J. D. Ullman (1975),
"A Simple Algorithm for Global Data Flow Analysis Problems",
SIAM Journal of Computing, Vol. 4, pp. 528-532.

T. R. Hopkins (1980), "PBASIC - A Verifier for BASIC",
Software Practice and Experience, Vol. 10, No. 3, pp. 175-181.

W. E. Howden (1982), "Validation of Scientific Programs",
ACM Computing Surveys, Vol. 14, No. 2, pp. 193-227.

W. E. Howden (1990), "Comments Analysis and Programming Errors",
IEEE Transactions on Software Engineering, Vol. 16, No. 1, pp. 72-81.

S. C. Johnson (1978), "Lint, a C Program Checker", Bell Laboratories,
Murray Hill, NJ, Computer Science Technical Report, July 1978 (also
appearing as part of the documentation for many versions/variants of the
UNIX operating system, as in: Ultrix Programmers Manual - Supplementary
Documents, 1983).

W. L. Johnson (1986), Intention-Based Diagnosis of Novice Programming
Errors, Pitman.

J. Katzenelson and A. Strominger (1987), "Debugging Programs that use
Macro-Oriented Data Abstractions", Software Practice and Experience, Vol.
17, No. 2, pp. 79-103.

J. C. King (1976), "Symbolic Execution and Program Testing",
Communications of the ACM, Vol. 19, No. 7, pp. 385-394.

D. E. Knuth (1971), "An Empirical Study of FORTRAN Programs",
Software Practice and Experience, Vol. 1, pp 105-133.

S. Lauesen (1979), "Debugging Techniques", Software Practice and Experience,
Vol. 9, pp. 51-63.

T. E. Lindquist and J. R. Jenkins (1988), "Test-Case Generation with IOGen",
IEEE Software, January 1988, pp. 72-79.

P. G. Moulton and M. E. Muller (1967), "DITRAN - A Compiler Emphasizing
Diagnostics", Communications of the ACM, Vol. 10, No. 1, pp. 45-52.

S. S. Muchnick and N. D. Jones (1981),
Program Flow Analysis: Theory and Applications, Prentice-Hall.

L. J. Osterweil and L. D. Fosdick (1976), "DAVE - A Validation, Error
Detection and Documentation System for Fortran Programs", Software
Practice and Experience, Vol. 6, No. 4, pp. 473-486.

K. A. Redish and W. F. Smyth (1986), "Program Style Analysis: A Natural
By-Product of Program Compilation", Communications of the ACM, Vol. 29,
No. 2, pp. 126-133.

B. G. Ryder (1974), "The PFORT Verifier",
Software Practice and Experience, Vol. 4, No. 4, pp. 359-377.

R. E. Seviora (1987), "Knowledge-Based Program Debugging Systems", IEEE
Software, Vol. 4, No. 3, pp. 20-32.

N. Suzuki and K. Ishihata (1977), "Implementation of an Array Bound
Checker", Fourth ACM Symposium on Principles of Programming Languages, pp.
132-143.

C. Wilson and L. J. Osterweil (1985),
"Omega - A Data Flow Analysis Tool for the C Programming Language",
IEEE Transactions on Software Engineering, Vol. 11, No. 9, pp. 832-838.

============================
RESPONSES FROM OTHERS
============================

>From @yonge.csri.toronto.edu:hugh@redvax Thu Dec 5 15:06:51 1991

I too am interested in static analysis tools. I have produced a C
compiler (intended to be commercial) that does extensive lint-like
checking.

Over a decade ago Fosdick and Osterweil (sp?) at Colorado produced a
program to detect "data-flow anomalies" in FORTRAN programs. I thought
that work was very interesting. At the time, I prototyped a similar
system (theirs was so slow that they recommended it be used only once in
the life of a program; mine was fast enough that it would not have

Do you know of the PFORT Verifier? I think that it is available for free
from Bell Labs. I think that it is sort of a Lint for FORTRAN,
emphasizing checking for FORTRAN 77 standard conformance.

Again, a decade ago, there was some work at eliminating runtime range
checks from Pascal programs. Clearly, this could be turned around into
compile-time warnings, perhaps without being annoying. I think Welsh of
Queens University of Belfast wrote one of the better papers (dim
recollection).

Anyway, I would be interested in your bug lists and references.

Hugh Redelmeier
{utcsri, yunexus, uunet!attcan, utzoo, scocan}!redvax!hugh
When all else fails: hugh@csri.toronto.edu
+1 416 482-8253

===================================================

>From jackson@larch.lcs.mit.edu Tue Nov 26 03:23:07 1991

David,

Raymie Stata, a fellow graduate in my research group passed me your
request from the net.

I am completing a thesis on a bug detection scheme I have invented called
Aspect. I am attaching some blurb from a paper I have just written. I'm
including the bibliography too, which may give you some useful references.

If you're interested, I'd be happy to tell you more. There is one paper
published that describes the state of Aspect about 1 year ago; it's
Aspect: an Economical Bug Detector', International Conf. On Software
Engineering, May 1991.

I'd also be very interested in any references or ideas that you have.

Regards,

--Daniel Jackson

Aspect is an annotation language for detecting bugs in imperative
programs. The programmer annotates a procedure with simple assertions
that relate abstract components (called aspects') of the pre- and
post-states. A checker has been implemented that can determine
efficiently whether the code satisfies an assertion. If it does not,
there is a bug in the code (or the assertion is wrong) and an error
message is displayed. Although not all bugs can be detected, no
spurious bugs are reported....

The purpose of a compiler is not just to make it easier to write good
programs but also to make it harder to write bad ones. Catching errors
during compilation saves testing. It also spares the greater cost of
discovering the error later when it is harder to fix.

Programming errors can be divided into two classes. {\em Anomalies} are
flaws that are apparent even to someone who has no idea what the
program is supposed to do: uninitialized variables, dead code,
infinite loops, etc. {\em Bugs}, on the other hand, are faults only with
respect to some intent. An anomaly detector can at best determine
that a program does something right; a bug detector is needed to tell
whether it does the right thing.

Aspect detects bugs with a novel kind of dataflow annotation.
Annotating the code is extra work for the programmer, but
it is mitigated by two factors. First, some sort of redundancy is
inevitable if bugs, rather than just anomalies, are to be caught.
Moreover, Aspect assertions may be useful documentation: they are
generally much shorter and more abstract than the code they accompany.
Second, no mimimal annotation is demanded; the programmer can choose
to annotate more or less according to the complexity of the code or
the importance of checking it.

A procedure's annotation relates abstract components of objects called
aspects'. The division of an object into aspects is a kind of data
abstraction; the aspects are not fixed by the object's representation
but are chosen by the programmer.

Each assertion of the annotation states that an aspect of the
post-state is obtained from some aspects of the pre-state. The
checker examines the code to see if such dependencies are plausible.
If there is no path in the code that could give the required
dependencies, there must be an error: the result aspect was computed
without adequate information. An error message is generated saying
which abstract dependency is missing.

...

Many compilers, of course, perform some kind of anomaly analysis, and
a variety of clever techniques have been invented (see, e.g.
\cite{carre}). Anomaly detection has the great advantage that it
comes free to the programmer. Aspect might enhance existing methods
with a more precise analysis that would catch more anomalies (using
annotations of the built-in procedures alone). But there will always
be a fundamental limitation: most errors are bugs and not anomalies.

The Cesar/Cecil system \cite{cesar}, like Aspect, uses annotations to
detect bugs. Its assertions are path expressions that constrain the
order of operations. Errors like failing to open a file before
reading it can be detected in this way. Flavor analysis \cite{flavor}
is a related technique whose assertions make claims about how an
object is used: that an integer is a sum in one place in the code and
a mean in another, for instance. Both techniques, however, report
spurious bugs in some cases: an error may be signalled for a path that
cannot occur. Aspect, on the other hand, is sound: if an error is
reported, there is a bug (or the assertion is wrong).

Type checking may also be viewed as bug detection when there is name
equality or data abstraction. Aspect is more powerful for two
reasons. First, since procedures with the same type signature usually
have different annotations, Aspect can often tell that the wrong
procedure has been called even when there is no type mismatch.
Second, type systems are usually immune to changes of state and so, in
particular, cannot catch errors of omission. Even models that
classify side-effects, such as FX \cite{FX}, do not constrain the
order of operations like Aspect.

The version of Aspect described here advances previous work
\cite{icse} by incorporating alias analysis. It can handle multi-level
pointers, whose precise analysis is known to be intractable
\cite{Landi}. The alias scheme adopted is most similar to
\cite{larus}, but it is less precise and cannot handle cyclic and
recursive structures.

...

\bibitem[BC85]{carre}
Jean-Francois Bergeretti \& Bernard A. Carre,
Information-Flow and Data-Flow Analysis of while-Programs',
ACM Trans. Programming Languages and Systems, Vol. 7, No. 1, January 1985.

\bibitem[CWZ90]{chase}
David R. Chase, Mark Wegman \& F. Kenneth Zadeck,
Analysis of Pointers and Structures',
Proc. SIGPLAN '88 Conf. Prog. Lang. Design \& Implementation, 1990.

\bibitem[GL88]{FX}
David K. Gifford \& John M. Lucassen,
Polymorphic Effect Systems',
ACM Conf. Principles of Programming Languages, 1988.

\bibitem[How89]{flavor}
William E. Howden,
Validating Programs Without Specifications',
Proc. 3d ACM Symposium on Software Testing, Analysis and Verification (TAV3),
Key West, Florida, Dec. 1989.

\bibitem[Jac91]{icse}
Daniel Jackson,
Aspect: An Economical Bug-Detector',
Proc 13th Int. Conf. on Software Engineering,
Austin, Texas, May 1991.

\bibitem[Lan91]{Landi}
William Landi and Barbara G. Ryder,
Pointer-induced Aliasing: A Problem Classification',
ACM Conf. Principles of Programming Languages, 1991.

\bibitem[LH88]{larus}
James R. Larus and Paul N. Hilfinger,
Detecting Conflicts Between Structure Accesses'
ACM Conf. Programming Language Design and Implementation,
June 1988.

\bibitem[OO89]{cesar}
Kurt M. Olender \& Leon J. Osterweil,
Cesar: A Static Sequencing Constraint Analyzer',
Proc. 3d ACM Symposium on Software Testing, Analysis and Verification (TAV3),
Key West, Florida, Dec. 1989.

===================================================

>From adv5@shakti.ernet.in Tue Nov 26 03:17:02 1991

I am a member of a quality control team in Citicorp Overseas Software Ltd.

I do a lot of desk work, to test code validity.

Till now I have used manual methods for static analysis of code, using
tables of states of variables, and basically sweating it out. I wish to
know if you have more information on known bugs or pitfalls in various
constructs of a language.

I will dig out some information on static analysers, and mail them to you

===================================================
>From @bay.csri.toronto.edu:norvell@csri.toronto.edu Tue Nov 26 02:04:23 1991

I'd be very interested in your list of references. Unfortunately I can't
find my own list of references to give you in return. Perhaps I didn't
type it in yet. I _can_ send you an unpublished paper (complete except
for bibliography) on detecting dataflow anomalies in procedural languages.
There is also an experimental program that implements the ideas of the
paper. The paper can be sent in postscript or dvi and the program in in
Turing -- a Pascal type language.

Theo Norvell norvell@csri.toronto.edu
U of Toronto norvell@csri.utoronto.ca

===================================================
>From ae2@cunixa.cc.columbia.edu Sun Nov 24 21:40:04 1991

Greetings!

interest in this particular field as my final project has to do with the
implemetation of expetional handlin in "Small C", a compiler designed on
the IBM 8088 and it's sole interest is educational, something equivalent
to minix. I would greatly appreciate if you could help me in finding
sources that dwell on this subject, anything that would be related to
errors and how one might deal with them would be relavant.
--Amiran
ae2@cunixa.cc.columbia.edu
===================================================

>From paco@cs.rice.edu Sun Nov 24 04:42:03 1991

The Convex Application Compiler (TM?) apparently does a pretty good job.
Any interprocedural analyzer has to catch a lot of errors just to avoid
crashing. They also do pointer tracking, array section analysis, and
generally just all-out analysis and optimization. Bob Metzger
<metzger@convex.com> et al. have a paper on the pointer tracking in the
proceedings of Supercomputing '91, which was held last week.

It is alleged that Convex has sold machines, or at least copies of their
compiler, just for use in static error checking.

Paul Havlak

===================================================
>From jvitek@csr.UVic.CA Sat Nov 23 11:54:49 1991

You might be interested in looking at abstract interpretation. It is a
technique related to data-flow analysis (you can express data-flow anlyses
as abstract interpreation problems) and is quite popular in among the
functional programming crowd.

There is a number of paper (even books!) on the subject, if you are
interested I can provide references.

===================================================

>From twinsun!twinsun.com!eggert@CS.UCLA.EDU Fri Nov 22 07:17:14 1991

Here's a few references that I've written:

Paul Eggert,
Toward special-purpose program verification,
.i "Proc. ACM SIGSOFT International Workshop on Formal Methods
in Software Development",
Napa, CA (9-11 May 1990);
.i "Software engineering notes"
.b 15 ,
4 (September 1990), 25-29.

Paul Eggert,
An Example of Detecting Software Errors Before Execution,
.i "Proc. workshop on effectiveness of testing and proving methods,"
Avalon, CA
(11-13 May 1982), 1-8.

Paul Eggert,
Detecting Software Errors Before Execution,
UCLA Computer Science Department report CSD-810402 (April 1981).

The last reference contains an extensive survey of the pre-1980 literature.
I'd be interested in seeing your list of references as well,
when you have the time.

===================================================
>From beck@cs.cornell.edu Fri Nov 22 05:31:04 1991

I'm interested in hearing your list of bugs. I have given some thought to
the detection and propagation of error information in a program using
dependence information. Propagation of errors uses the idea that any
computation on a path which leads only to an error condition is dead
unless a side-effect intervenes, and any code after the error is
unreachable. Thus one can actually integrate program errors into control
flow information as an unstructured jump to the end of the program. In an
optimizing compiler, one might be tempted to say that any statement which
can be scheduled after the last side effect before the error is dead.
Thus, in this fragment:

1: x := 2
2: print "hi there"
3: y := z/0

statement 1 is actually dead, but statement 2 is not.

Micah Beck
--

#### [Nov 14, 2004] Two Benchmark Tests for BCPL Style Coroutines

This directory contains various implementions of two benchmark programs to test the efficiency of BCPL style coroutines when implemented in a variety of different languages.

Download cobench.tgz (or cobench.zip) and un-tar (or un-zip) it to form the directory Cobench. The enter one of the diectories: bcpl, natbcpl, mcpl, java, c and New Jersey SML and type the command:
make. This should compile and run the benchmark, provided you are on a Pentium machine running Linux. You may find you have to re-install the BCPL Cintcode system and you may have to edit one or more of the Makefiles.

There are directories for each language and implementation style. They are currently as follows:

bcpl The definitive BCPL implementation run using Cintcode
natbcpl BCPL compiled (very naively) in machine code
c Implementation in C using a coroutine library involving
setjmp, longjump and two instructions of inline assembler.
nat-c Implemetation in C using a hand written coroutine library
in assembler (for the Pentium).
java Implementation in Java using a coroutine library based on
threads using wait, notify and synchronisation.
javadep An incorrect implementation in Java attempting to use the
deprecated mesthod suspend and resume.
mcpl An interpretive implementation in MCPL
smlnj An implementation in Standard ML using New Jersey Continuations.

All these implementations can be run under Linux (provided the relevant compilers have been installed). They are typically run use make as follows:

make Same as make help
make help Display the help infomation
make bench Run cobench
make bencht Run cobench with tracing
make sim Run cosim
make simt Run cosim with tracing
make test Run cotest
make clean Delete any files that can be reconstructed

The program cotest tests where the BCPL style coroutines have been implemented correctly.

The benchmark: cobench

This benchmark program creates a source coroutine, 500 copy coroutines and a sink coroutine. It then causes the source coroutine to transmit 10,000 numbers through all the copy coroutines to be thrown away by the sink coroutine. The numbers are sent one at a time, and all transmission from one coroutine to the next uses a coroutine implementation of Occam channels.

Reading and writing are done (in the BCPL version) using coread and cowrite, respectively. These are defined as follows:

{ LET cptr = !ptr
TEST cptr
THEN { !ptr := 0 // Clear the channel word
RESULTIS resumeco(cptr, currco)
}
ELSE { !ptr := currco // Set channel word to this coroutine
RESULTIS cowait() // Wait for value from cowrite
}
}

AND cowrite(ptr, val) BE
{ LET cptr = !ptr
TEST cptr
THEN { !ptr := 0
callco(cptr, val) // Send val to coread
}
ELSE { !ptr := currco
callco(cowait(), val)
}
}

In both functions, ptr points to a channel word which hold zero (=FALSE) if no coroutine is waiting on the channel. The first coroutine to call coread (or cowrite) puts its own identity into the channel word and then waits. When the second coroutine reaches the corresponding cowrite (or coread) call, it can see the identity of the other coroutine that is waiting.

If the writing coroutine arrives at the communication point second, it picks the identity of the reading coroutine out of the channel word, clears the channel word, and then transmits the value to the waiting
read coroutine using a simple call of callco.

If the reading coroutine arrives at the communication point second, it gives its own identity to the waiting write coroutine then waits for the value to be sent. A little thought will show you why coread uses
resumeco to transmit its own identity to the write coroutine.

If you are not familiar with BCPL coroutines, you can read about them in the BCPL manual that is available via my home page.

This directory currently holds four implementations of the benchmark: bcpl, natbcpl, java and c. When run on a 1 GHz Pentium III they take the following times:

BCPL Cintcode byte stream interpreter: 4.78 secs

BCPL Native 386 code (unoptimised): 0.81 secs

BCPL under Cintpos using an interpreter implemented
in C and all debugging aids on turned on: 10 secs

MCPL Mintcode byte stream interpreter: 10.51 secs

Java j2sdk1.4.0 JIT (using threads): 191.59 secs

Java j2sdk1.4.0 JIT (using threads, suspend and resume): 2029.44 secs

C using setjmp and longjmp + inline assembler: 1.36 secs

C using colib written in assembler 0.61 secs

New Jersey SML using continuations: 4.37 secs

BCPL Cintcode on a 75Mhz SH3 (HP 620LX Handheld) 115.52 secs

The Java implementation was done by Eben Upton and the first of the C versions using setjmp, longjmp and inline assembler by Jeremy Singer.

The benchmark: cosim

This benchmark is a discrete event simulator that simulates messages passing through a network of nodes. The nodes are numbered 1 to 500 and are each initially given a message to process. The time to process a message is randomly distributed between 0 and 999 units of simulated time. Once a message has been processed it is sent to another node selected at random. The transmission delay is assumed to be ABS(i-j) where i and j are the source and destination node numbers. Nodes can only process one message at a time and so messages may have to be queued. Initially each node is assumed to have just received a message. The simulation starts a time 0 and finishes at time
1,000,000. The result, 501907, is the total count of messages that have been completely processed by the finishing time. This benchmark executes 435,363,350 Cintcode instructions and performs 2,510,520 coroutine changes. The timing results are as follows:

BCPL Cintcode byte stream interpreter: 9.02 secs

BCPL Native 386 code (unoptimised): 1.11 secs

BCPL under Cintpos using an interpreter implemented
in C and all debugging aids on turned on: 17 secs

MCPL Mintcode byte stream interpreter: ??.?? secs

Java j2sdk1.4.0 JIT (using threads, wait and notify): 43.13 secs

Java j2sdk1.4.0 JIT (using threads, suspend and resume): 512.03 secs

C using setjmp and longjmp + inline assembler: 0.80 secs

C using colib written in assembler 0.58 secs

New Jersey SML using continuations: 4.08 secs

BCPL Cintcode on a 75Mhz SH3 (HP 620LX Handheld) 121.22 secs

Martin Richards
20 July 2004

#### [Nov 12, 2004] C Coroutines

co_create, co_call, co_resume, co_delete, co_exit_to, co_exit - C coroutine management

The coro library implements the low level functionality for coroutines. For a definition of the term coroutine see The Art of Computer Programming by Donald E. Knuth. In short, you may think of coroutines as a very simple cooperative multitasking environment where the switch from one task to another is done explicitly by a function call. And, coroutines are fast. Switching from one coroutine to another takes only a couple of assembler instructions more than a normal function call.

This document defines an API for the low level handling of coroutines i.e. creating and deleting coroutines and switching between them. Higher level functionality (scheduler, etc.) is not covered.

#### Comp.compilers Re Assembly verses a high-level language.

From comp.compilers

 Newsgroups: comp.compilers From: macrakis@osf.org (Stavros Macrakis) In-Reply-To: tomviper@ix.netcom.com's message of Mon, 20 Nov 1995 03:53:54 GMT Keywords: performance, assembler Organization: OSF Research Institute References: 95-11-166 Date: Wed, 22 Nov 1995 21:21:12 GMT

tomviper@ix.netcom.com (Tom Powell ) writes:

How come programs written in assembly are so much faster than any
other high-level language. I know that it is a low-level language
and that it "speaks" directly to the hardware so it is faster, but
why can't high-level languages compile programs just as fast as
assembly programs?

First of all, assembler is not an "other" high-level language. It is
the low-level language par excellence, lower-level even than C :-).

There are several reasons that assembly programs can be faster than compiled programs:

The assembly programmer can design data structures which take maximum advantage of the instruction set. To a certain extent, you can do this in languages like C if you're willing to write code that is specific to the architecture. But there are some instructions which are so specialized that it is very hard for compilers to recognize that they're the best way to do things; this is mostly true in CISC architectures.

The assembly programmer typically can estimate which parts of the program need the most optimization, and apply a variety of tricks which it would be a bad idea to apply everywhere, because they make the code larger. I don't know of any compilers that allow "turning up" or "turning down" optimization for code fragments, although most will allow it for compilation modules.

The assembly programmer can sometimes use specialized runtime structures, such as for instance reserving some registers globally for things that are often used, or designing special conventions for register use and parameter passing in a group of procedures. Another example is using the top of the stack as a local, unbounded stack without respecting frame conventions.

Some control structures are not widely supported by commonly-usedhigher-level languages, or are too general. For instance, coroutines are provided by very few languages. Many languages now provide threads, which are a generalization of coroutines, but often have more overhead.

The assembly programmer is sometimes willing to do global analysis which most compilers currently don't do.

Finally, the assembly programmer is more immediately aware of the cost of operations, and thus tends to choose more carefully as a function of cost. As language level rises, the cost of a given operation generally becomes less and less predictable.

All this said, there is no guarantee than an assembly program will be faster than a compiled program. A program of a given functionality will take longer to develop in assembler than in a higher-level language, so less time is available for design and performance tuning. Re-design is particularly painful in assembler since many decisions are written into the code. In many programs, large improvements can be made in performance by improving algorithms rather than coding; assembler is a disadvantage here since coding time is larger, and flexibility is less. Finally, it is harder to write reliable assembly code than reliable higher-level language code; getting a core dump faster is not much use.

Compiler writers have tried, over time, to incorporate some of these advantages of assembler. The "coalescing" style of compiler in particular in many ways resembles the work of a good assembly programmer: design your data structures and inner loops together, and early on in the design process. Various kinds of optimization and global analysis are done by compilers, but in the absence of application knowledge, it is hard to bound their runtime. (Another thread in this group talked about the desirability of turning optimization up very high in some cases.)

-s

#### Compilers and Compiler Generators an introduction with C++ © P.D. Terry, Rhodes University, 1996 NEW (12 August 2004)

A complete revision of this book, using C# and Java and the versions of Coco/R for those languages, will be published by Pearson Education in about October 2004.

Further details of the date will be published here in due course.

A feature of the new edition is that it demonstrates the use of Coco/R to implement compilers for the JVM and CLR platforms.

In the meantime you can preview the contents of the "Resource Kit" which is currently under construction. This contains the preface and table of contents and additional material that will not appear in the published book. When complete it will also contain the source code for the case studies in the book and distributions of Coco/R for C# and Java.

#### Let's Build a Compiler by Jack Crenshaw

This fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction. You can read the parts on-line or download them in a ZIP file.

#### Slashdot Low Level Virtual Machine 1.3 Released

"VM" in LLVM name does not do it justice (Score:4, Interesting)
by truth_revealed (593493) on Saturday August 14, @10:13AM ( #9966950)

The casual Slashdot reader may roll his/her eyes when they see yet another Virtual Machine - but this project is much more than that. It's a complete compiler infrastructure project that will one day surpass GCC. Why? Because it's around ten times easier to understand and written in a modern language (C++) rather than C. An expert C++ programmer could start contributing code to LLVM in under a month; whereas an equivalent learning curve for GCC is at least a year. Writing new compiler passes or complete language front ends for LLVM is very straight-forward, for example. The LLVM AST has the advantage of having strong typechecking and not arcane tree macros as in GCC. LLVM is not burdened with the legal or philosophical wranglings of GCC where they do NOT WANT their compiler to be the backend of a commercial compiler and try their damnedest to obscure and change the programming API from release to release. The GCC "Toy" example language has not worked in many releases for this very reason.
GCC recently voted down using C++ in their core code. Perhaps LLVM at the very least will drag GCC into the modern age due to competition.

Speaking as a GCC maintainer, I call bullshit (Score:5, Informative)
by devphil (51341) on Saturday August 14, @02:55PM (#9968783)
(http://www.devphil.com/)

The very best trolls always start with a grain of truth. (LLVM is much easier to understand than GCC. The GCC infrastructure is very baroque, dating from a time when assuming the presence of an ANSI C bootstrap compiler was too much. One of the major LLVM guys has presented his toolchain work at the annual GCC Summit, and maintains close communication with the rest of the GCC team -- and we wish him well. All very true; no GCC hacker would say any less.)

The trolls then move on into wild exaggerations and complete lies. Such as:

and try their damnedest to obscure and change the programming API from release to release.

Pure malicious bullshit. RMS doesn't want proprietary backends to be able to read the GCC IR, and so we don't ever write it out in a machine-readable format. But we've never gone out of our way to obfuscate the internal API.

GCC recently voted down using C++ in their core code.

Again, a complete lie. We asked RMS whether we could make use of C++ in parts of the compiler. While a skilled and brilliant C and LISP hacker, RMS is a reactionary fuckhead when it comes to anything other than C or LISP. In his opinion, all GNU programs should be written in C, and only C, ever.

There was no vote.

Re:Speaking as a GCC maintainer, I call bullshit (Score:3, Informative)
by devphil (51341) on Sunday August 15, @02:16PM (#9974914)
(http://www.devphil.com/)

You're not completely right, and not completely wrong. The politics are exceedingly complicated, and I regret it every time I learn more about them.

RMS doesn't have dictatorial power over the SC, nor a formal veto vote.

He does hold the copyright to GCC. (Well, the FSF holds the copyright, but he is the FSF.) That's a lot more important that many people realize.

Choice of implementation language is, strictly speaking, a purely technical issue. But it has so many consequences that it gets special attention.

The SC specifically avoids getting involved in technical issues whenever possible. Even when the SC is asked to decide something, they never go to RMS when they can help it, because he's so unaware of modern real-world technical issues and the bigger picture. It's far, far better to continue postponing a question than to ask it, when RMS is involved, because he will make a snap decision based on his own bizarre technical ideas, and then never change his mind in time for the new decision to be worth anything.

He can be convinced. Eventually. It took the SC over a year to explain and demonstrate that Java bytecode could not easily be used to subvert the GPL, therefore permitting GCJ to be checked in to the official repository was okay. I'm sure that someday we'll be using C++ in core code. Just not anytime soon.

As for forking again... well, yeah, I personally happen to be a proponent of that path. But I'm keenly aware of the damange that would to do GCC's reputation -- beyond the short-sighted typical /. viewpoint of "always disobey every authority" -- and I'm still probably underestimating the problems.

#### Full-Text Searching & the Burrows-Wheeler Transform

• Here's an indexing method that lets you find any character sequence in the source text using a structure that can fit the entire source text and index into less space than the text alone.

#### Regular Expression Mining

• Regular expressions are convenient devices for identifying patterns in collections of strings.

#### C/C++ Compiler Optimization

• Squeezing the maximum execution speed from C/C++ compilers requires an understanding of optimization switches.

#### Shared Source CLI Essentials by Ted Neward , David Stutz, Geoff Shilling

• Paperback: 378 pages
• Publisher: O'Reilly & Associates; Book and CD-ROM edition (March 2003)
• ISBN: 059600351X
• Average Customer Review: Based on 3 reviews.
• Best source for .NET implementation details, October 28, 2003
 Reviewer: Rick Byers from Toronto, ON Canada

This book is the best and most concentrated source of information I've found for understanding how the .NET CLR is implemented (comparable only to Chris Brumme's blog). Even if you never actually build the SSCLI, this book combined with the SSCLI source code can provide a solid understanding of what's going on behind the scenes in the commercial CLR. I have found this level of understanding to be absolutely necessary in understanding and diagnosing some types of unusual behaviour or performance characteristics of .NET.

If you're not using the SSCLI on a UNIX machine and have a solid understanding of the Win32 API, you can probably safely skip the last chapter on the PAL as it is somewhat anti-climatic. However, coming from a UNIX programming background myself, I found it to be of value in solidifying my understanding of Win32 specific functionality (eg. structured exception handling) and how its used by the SSCLI.

Obviously this book is a must-read for anyone that is actually experimenting with the SSCLI, but I also consider it essential for anyone that wants to fully understand how the commercial version of .NET works.

Magnificent!, April 26, 2003

 Reviewer: Jason Whittington from Tucson, AZ USA

As someone who has spent a fair amount of time toying with and writing about managed code I have to say that I am in awe of the wisdom and clarity contained in this book. "SSCLI Essentials" transcends its subject matter (a research platform unlikely to be used much outside of academia) to be one of the best books I've ever read on Virtual Execution concepts. Java, the CLR, Smalltalk, and all other such environments ultimately have to solve the same problem (How to turn source code into executing machine instructions?). This book uses the SSCLI as a backdrop for exploring decades of VM research and explaining the historical forces influencing how and why this particular implementation (and by implication, Microsoft's commercial CLR) works.

The resulting volume is concise, fascinating, and thorough. Given the increasing importance of virtual environments in the computing world today I think most all working developers (including Java developers!) owe it to themselves to read this book. Even if you never plan to install or use the SSCLI codebase you'll benefit from Dave and friends' lucid explanation of the issues facing modern VM environments and how one particularly popular platform chooses to solve them.

#### [May 22, 2004] lcc, A Retargetable Compiler for ANSI C

Note: Due to the volume of this page previous years of "Old News" was moved to the page Compilers Chronicle 1999-2003 ; Recommended Links section of the page was also moved to a separate page

### Sites

• Complier A very extensive collection of links [as of Dec 11, 2007]

## Ebooks

Compiler Construction by Niklaus Wirth

131 pages

ISBN 0-201-40353-6

Zürich, November 2005

Compiler Construction - Wikibooks, collection of open-content textbooks

Parsing Techniques - A Practical Guide by Dick Grune and Ceriel J.H. Jacobs. Originally published by Ellis Horwood, Chichester, England, 1990; ISBN 0 13 651431 6 A small but and too formalistic book about parsing

This 320-page book treats parsing in its own right, in greater depth than is found in most computer science and linguistics books. It offers a clear, accessible, and thorough discussion of many different parsing techniques with their interrelations and applicabilities, including error recovery techniques. Unlike most books, it treats (almost) all parsing methods, not just the popular ones. See Preface + Introduction and/or Table of Contents for a quick impression.

The authors also participates in writing of a newer book Modern Compiler Design John Wiley & Sons, Ltd., pp. 736 + xviii, 2000; ISBN 0 471 97697 0

Structure and Interpretation of Computer Programs: Written by Hal Abelson, Jerry Sussman and Julie Sussman, this book is the very famous "Wizard Book", a computer science text used in the introductory courses at MIT. Download it from http://mitpress.mit.edu/sicp/full-text/book/book.html

Optimizing C++: Written by Steve Heller, this book focus specially in what its title says: optimization in C++. Though it's not as ground breaking as Abrash's Black Book, it's worthy reading for some interesting algorithms and techniques. Download it from http://www.steveheller.com/opt/

Compiler Construction using Flex and Bison: A small (102 pages) boot by Anthony A. Aaby. Explains step by step the way to program compilers that use Flex and Bison (including very detailed information about both tools). Download it from http://cs.wwc.edu/~aabyan/464/Book/

Compilers and Compiler Generators: http://scifac.ru.ac.za/compilers/

## WEB-based materials of University Courses

• Let's Build a Compiler by Jack Crenshaw. This fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction. You can read the parts on-line or download them in a ZIP file.

• Compiler Theory and Design a decent intro e-book. Copyright 2000 Brigham Young University. As I mentioned before, lexical analyzers are more conveniently to program by hand and that permits to implement some additional nice things like checking the length of comments, quoted literals, etc, so the corresponding chapter can probably be skipped or just browsed. Chapter on syntax analysis explains recursive decent which is a pretty good method for the "language at large" (although for "language at small" like arithmetic expressions and Boolean expressions one can use something different with a better diagnostic capabilities). Code generation is explained using a pseudo-assembler without using a separate pass for address generation. The problems of compilation OO languages are covered although it's not that necessary for an intro course. Suggested by Yavor Kolarov <kolarov@net-bg.net>
• Compilers and Compiler Generators -- free electromic book by © P.D. Terry, Rhodes University, 1996
• Parsing Techniques - A Practical Guide by Dick Grune and Ceriel J.H. Jacobs. Originally published by Ellis Horwood, Chichester, England, 1990; ISBN 0 13 651431 6 A small but a good book about parsing, explaining all in an almost academic way.

This 320-page book treats parsing in its own right, in greater depth than is found in most computer science and linguistics books. It offers a clear, accessible, and thorough discussion of many different parsing techniques with their interrelations and applicabilities, including error recovery techniques. Unlike most books, it treats (almost) all parsing methods, not just the popular ones. See Preface + Introduction and/or Table of Contents for a quick impression.

• CS502 Information -- Purdue Univ. Compiling and Programming Systems (Hosking)
• Compilers -- UTexas. Compilers for programming languages, Pascal, PowerPC machine code. Syllabus. Assignments. Study guides. Links to related materials. By Gordon S. Novak Jr, University of Texas at Austin.
• Structure of Programming Languages II: Types, Modules, Classes -- Syntactic and semantic features of modern programming languages and their impact on language design. A rational reconstruction of imperative languages such as Pascal, Ada, and Modula-3 and functional languages such as scheme and ML. Syllabus and lecture notes. By Wolfgang Schreiner, Research Institute for Symbolic Computation, Johannes Kepler University, Linz, Austria.
• Code Optimizations- A Quantitative Approach
This course will concentrate on state-of-the-art code optimization techniques used in compilers for modern processors, with a quantitative framework for designing and evaluating these techniques. Students will be introduced to a wide range of restructuring transformations and optimization techniques for improving the performance of compiler-generated code. August 12-16, 1996. WICS and Stanford University.
• CPTR 464 Compiler Design
• CMP SCI 610-410(491A) - Compiler Techniques
• COMP 412 Lecture Notes Keith Cooper Rice university
• Compiler Construction Course -- Computer Science Department, Mary Washington College
• CS 4410 - Compilers -- College of Computing, Georgia Institute of Technology
• CST 8152 Compilers -- Algonquin College Computer Studies
• C431: Assemblers and Compilers -- Spring 1997 -- Math and Computer Science Department, Indiana University South Bend
• Iowa State: The Teaching About Programming Languages Project
Information about the teaching of the concepts of programming languages, especially undergraduate survey courses and courses about programming language semantics.
• Uppsala University
• Rensselaer Polytechnic
• Princeton University
• Joseph Bergin - Pace University, New York
• Languages and Compilers - notes
• Lecture 1 Introduction

## Usenet and FAQs

The comp.compilers newsgroup -- this is "must" group to read. It contains a wealth of material unachibale in ant WEB page. In addition the site contains several useful resources. Among them:

You can ask questions and often get a highly qualified answer: for example:

mark watson <mark.watson@shaw.ca> wrote:

>I suspect that one doesn't exist, but is there any visualizer or
>converter for the .v file that YACC produces?

If the .v file contains the stack state machine (push-down automaton)
description, current Bison has a way to visualize it.

Hans Aberg * Anti-spam: remove "remove." from email address.
... ... ...

Hans Aberg wrote:
>
> Can somebody give a reference to a description of the LALR(1)
> algorithm that Bison uses?
>
> The file state.h of the Bison sources says that first, a
> non-deterministic finite state machine that parses the specified
> grammar is created; it is then converted into a deterministic finite
> state machine.

I don't know what Bison uses, but your description sounds like the algorithm I first came across in:
Syntax Analysis and Software Tools K.John Gough Addison-Wesley 1988 chapter 9 Bottom-up parsing
The references there point to D.E.Knuth (1965) On the translation of languages from left to right
Information and control v8 pp323-350 which apparently covers both this and the "usual" form of the algorithm.

compilers/construction-tools Catalog of Compiler Construction Products - Thirteenth Issue

compilers/faq comp.compilers monthly message and Frequently Asked Questions

## ACM

SIGPLAN -- a Special Interest Group of ACM that focuses on Programming Languages. In particular, SIGPLAN explores implementation and efficient use. Pretty closed and SygPlan Notices (see below) has high subscription price. IMHO its best day are in the past.

ACM SIGPLAN Web Page at ACM -- ACM SIGPLAN Notices is an informal monthly publication of the Special Interest Group on Programming Languages (SIGPLAN) of ACM.

## DSL

DSLAnnotatedBibliography

First quarter, 1999

• A. van Deursen, P. Klint and C. Verhoef. Research Issues in Software Renovation. In J.-P. Finance (editor) Fundamental Approaches to Software Engineering, FASE99, LNCS, Springer-Verlag, 1999, pp. 1-21. http://www.cwi.nl/~arie/papers/etaps99.ps
• M.G.J. van den Brand, P. Klint & P. Olivier, Compilation and Memory Management of ASF+SDF. In Stefan Jahnichen (ed.) Proceedings of the 8th International Conference on Compiler Construction CC'99, Volume 1675 of Lecture Notes in Computer Science, pages 198--213, Springer, 1999.
• M.G.J. van den Brand, H.A de Jong, P. Klint & P. Olivier, Efficient Annotated Terms. Software Practice and Experience 30(3), 259--291, 2000. http://www.cwi.nl/~markvdb/papers/at.ps

Second quarter, 1999

Third quarter, 1999

• T.B. Dinesh, Magne Haveraaen, and Jan Heering. An algebraic programming style for numerical software and its optimization. ACM CoRR Preprint Server, 1999. Submitted to Scientific Programming. http://xxx.lanl.gov/abs/cs.SE/9903002
• A. van Deursen, S. Woods, and A. Quilici. Program Plan Recognition for Year 2000 Tools. Science of Computer Programming, 1999. To appear. http://www.cwi.nl/~arie/scp99.ps
• A. van Deursen and T. Kuipers. Building Documentation Generators. In Proceedings International Conference on Software Maintenance. IEEE Computer Society, 1999. pp. 40-49. http://www.cwi.nl/~arie/papers/icsm99.pdf

Fourth quarter, 1999

• A. van Deursen and L. Moonen. Understanding COBOL Systems using Inferred Types. In S. Woods (ed.) 7th Int. Workshop on Program Comprehension. IEEE Computer Society, 1999, pp. 74-83. Full version submitted to Science of Computer Programming. http://www.cwi.nl/~arie/papers/iwpc99.pdf
• Jan Heering and Paul Klint, Semantics of programming languages: A tool-oriented approach, ACM CoRR Preprint Server cs.PL/9911001 v2, 26 November 1999. Submitted to ACM SIGPLAN Notices. http://www.cwi.nl/~jan/semantics/semantics.html
• S. Bunnig, P. Forbrig, R. Lammel, en N. Seemann. A programming language for design patterns. In Proceedings of the GI-Jahrestagung 1999, Informatik '99, Reihe Informatik aktuell, Springer 1999. http://www.cwi.nl/~ralf/atps99.ps
• A. van Deursen, P. Klint, and J. Visser. Domain-Specific Languages -- An Annotated Bibliography. 1999. CWI, Technical Report, 1999. To appear. Submitted for publication. http://www.cwi.nl/~arie/dsl-project/annotbib/
• R. Oudejans en J. Visser Reverse Engineering and Code Generation -- Feasibility Study. CWI / First Result Consulting, December 1999.

First Quarter, 2000

Second Quarter, 2000

Third Quarter, 2000

Fourth Quarter, 2000

First Quarter, 2001

## Memory management

The Memory Management Reference

## Graph analysis

Abstract: . A structuring algorithm for arbitrary control flow graphs is presented. Graphs are structured into functional, semantical and structural equivalent graphs, without code replication or introduction of new variables. The algorithm makes use of a set of generic high-level language structures that includes different types of loops and conditionals. Gotos are used only when the graph cannot be structured with the structures in the generic set. This algorithm is adequate for the control flow analysis required when decompiling programs, given that a pure binary program does not contain information on the high-level structures used by the initial high-level language program (i.e. before compilation). The algorithm has been implemented as part of the dcc decompiler, an i80286 decompiler of DOS...

......only by the code generator, and no labels are placed on the graph on the earlier control flow analysis phase. 7 Previous Work Most structuring algorithms have concentrated on the removal of goto statements from control flow graphs at the expense of introduction of new boolean variables [8, 29, 22, 28, 6, 13], code replication [17, 27, 29], the use of multilevel exits [7, 24], or the use of a set of high-level structures not available in commonly used languages [25]. None of these methods are applicable to a decompiler because: the introduction of new boolean variables modifies the semantics of the ......

13. A.M. Erosa and L.J. Hendren. Taming control flow: A structured approach to eliminating goto statements. In Proceedings of the International Conference on Computer Languages, Universit'e Paul Sabatier, Toulouse, France, May 1994. IEEE Computer Society.

Taming control flow: A structured approach to.goto statements from control flow graphs at the expense of......potentially unstructured control-flow graphs.
WolfPack v1.0 Control Flow Graphs...WolfPack v1.0 Control Flow Graphs Table of Contents......extension code for control flow graphs found in version 4.0...

Rastislav Bodik and Sadun Anik. Path-sensitive value-flow analysis.
In 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, California}, pages 237--251. ACM, January 1998.
http://www.cs.wisc.edu/~bodik/research/popl98.ps

Uwe Assmann. How to Uniformly Specify Program Analysis and Transformation. In P. Fritzson, editor, Compiler Construction (CC). Springer, 1996. http://i44s11.info.uni-karlsruhe.de:80/~assmann/

Rastislav Bodik and Sadun Anik. Path-sensitive value-flow analysis. In 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, California}, pages 237--251. ACM, January 1998. http://www.cs.wisc.edu/~bodik/research/popl98.ps

Choose a Format for Viewing the Dataflow Web Pages

Catalog of Compiler Construction Tools

Archive of comp.compilers

## Program Analysis and Transformations

• VALUE LATTICE STATIC ANALYSIS by William Brew and Maggie Johnson. William and Maggie examine "value lattice," a new approach to static analysis that finds the most dangerous defects that tend to slip through testing. Additional resources include lattice.txt (listings).

## XML Complilers

Soumen Sarkar XSLT is typically used in document transform (XML to WML or HTML) scenario and this misses the point that it could be used very flexibly in "template driven code generation". Here is how the author describes his experience:

Soumen Sarkar, has successfully used XSLT to achieve source code generation in his Java application software project at ATOGA Systems Inc.. Out of approximately 2,480 Java files in the project 1, 520 were generated. In other words, 61% of the Java code is generated. SQL and XML files are also generated. He has the following publications on this topic:

Model Driven Programming, published in XML journal of SYS-CON media, August 2002, Volume 3, Issue 8. The article proposed an extension of Model View Controller (MVC) architecture and showed how to implement in software development projects.

http://www.sys-con.com/xml/pdf/trees.pdf

Code generation using XML based document transformation, a technical article published in J2EE expert site theserverside.com, November 2001. The URL is http://www.theserverside.com/resources/article.jsp?l=XMLCodeGen .
Application of this technique to Atoga NMS development resulted in 60% code
generation.

http://www.theserverside.com/resources/articles/XMLCodeGen/xmltransform.pdf

ACE: CoSy Compilation System
Instruction-Set Simulation and Tracing
Runtime Code Generation (RTCG)
Faculty Research Interests : Greg Morrisett
ANDF Reference Guide
Language Prototyping: An Algebraic Specification Approach
Dynamic Typing in Polymorphic Languages
Catalog of Free Compilers and Interpreters: introduction
Catalog of Compiler Construction Tools
http://www.csd.uu.se/~hakanm/pli.html
The Compiler Connection
Graphs to color
dag
CPU Info Center
HAKMEM -- CONTENTS -- DRAFT, NOT YET PROOFED
ICCD '95
The comp.compilers newsgroup
Catalog of Compiler Construction Tools
Catalog of Free Compilers and Interpreters: introduction
Compiler Internet Resource List
re2c
Lazy code motion handouts
Dynamic Typing in a Statically Typed Language.
Error Repair in Shift-Reduce Parsers
Mark Leone's Research
ACM Symposium: PLDI '97 Call for papers

Compiler Jobs

## Humor

 Top 10 reasons computers are male
===========================

10. They have a lot of data but are still clueless.
9.  A better model is always just around the corner.
8.  They look nice and shiny until you bring them home.
7.  It is always necessary to have a backup.
6.  They'll do whatever you say if you push the right buttons.
5.  The best part of having either one is the games you can play.
4.  In order to get their attention, you have to turn them on.
3.  The lights are on but nobody's home.
2.  Big power surges knock them out for the night.
1.  Size does matter

Here's the quid pro quo:

Top 10 reasons compilers must be female:
========================================

10. Picky, picky, picky.
9.  They hear what you say, but not what you mean.
8.  Beauty is only shell deep.
7.  When you ask what's wrong, they say "nothing".
6.  Can produce incorrect results with alarming speed.
5.  Always turning simple statements into big productions.
4.  Smalltalk is important.
3.  You do the same thing for years, and suddenly it's wrong.
2.  They make you take the garbage out.
1.  Miss a period and they go wild`

## Etc

Society

Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy

Quotes

War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes

Bulletin:

Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law

History:

Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

 You can use PayPal to to buy a cup of coffee for authors of this site

Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.