Compilers Algorithms

Introduction

As I used to explain to students during the first lecture of my "Compiler construction" course, this topic of computer science has much wider practical significance than one would expect just from the name of the course. First of all, from my point of view, this is the most exciting part of the CS curriculum: it touches a lot of important areas and some very interesting algorithms.

The ability to construct and use languages is what makes humans human, and the ability to create your own language is simply amazing. In a way, writing a compiler for a specialized language is just an extension of this innate human ability -- but a pretty amazing one. Please don't miss this opportunity. I think that compiler writing is an eye-opening experience for any interested student of programming, and I deeply regret that compiler construction courses are de-emphasized in current university curricula. But at this slightly skeptical site, I am not obliged to conform to the current programming fashion, whatever it might be.

That is why this page tries to bring compiler construction to your attention as the central and most important part of the CS curriculum. The course has some implicit links to scripting languages, as the latter (first of all Perl and Python) are powerful compiler construction tools that greatly simplify writing the lexical scanner and syntax analyzer of any specialized compiler. They allow you to produce a higher quality product in less time. Speed of compilation in such cases is not a primary concern, and in many cases you will be surprised how fast supposedly inefficient methods (such as writing the compiler as a set of distinct passes connected via files written to disk) are on modern computers. But the quality of diagnostics is of primary concern, and it is easier to achieve when your lexical scanner and syntax analyzer are written (or generated) in a scripting language. Also, overemphasizing syntax analysis in such courses does students no good (and this is the grave deficiency of most "compiler construction" textbooks). Recursive descent is usually adequate as long as you control the language (Niklaus Wirth was probably the first who understood this fact, although later he slipped into software verification fundamentalism, which killed him as a compiler writer ;-)
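
To make the recursive descent point concrete, here is a minimal sketch in Python (the grammar and all names are illustrative, not taken from any particular course) of a hand-written recursive descent parser for a toy expression grammar:

# Toy grammar:
#   expr   ::= term (('+' | '-') term)*
#   term   ::= factor (('*' | '/') factor)*
#   factor ::= NUMBER | '(' expr ')'
import re

def tokenize(text):
    for number, op in re.findall(r'\s*(?:(\d+)|(.))', text):
        yield ('NUM', int(number)) if number else ('OP', op)

class Parser:
    def __init__(self, text):
        self.tokens = list(tokenize(text)) + [('EOF', None)]
        self.pos = 0
    def peek(self):
        return self.tokens[self.pos]
    def take(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok
    def expr(self):                          # one procedure per nonterminal
        node = self.term()
        while self.peek() in (('OP', '+'), ('OP', '-')):
            node = (self.take()[1], node, self.term())
        return node
    def term(self):
        node = self.factor()
        while self.peek() in (('OP', '*'), ('OP', '/')):
            node = (self.take()[1], node, self.factor())
        return node
    def factor(self):
        kind, value = self.take()
        if kind == 'NUM':
            return value
        if value == '(':
            node = self.expr()
            self.take()                      # consume ')'
            return node
        raise SyntaxError('unexpected token %r' % (value,))

print(Parser('2*(3+4)').expr())              # ('*', 2, ('+', 3, 4))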

The real beef in writing a compiler is in creating a flexible intermediate representation of the program suitable for code generation (often structured similarly to the syntax tree) and in generating code from it. The hierarchical structure of the Unix file system makes it convenient for storing the intermediate parse tree, along with templates from which you can generate the final code by simply traversing this tree in the right order. And having the elements of the intermediate representation written out as files greatly simplifies debugging.

Code generation does not necessarily need to produce machine code or even assembler language. You can generate code in any language you want, including shell, Perl, Python and Java.
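
As a small illustration of that point (a sketch only; the tuple-based tree format is the one produced by the toy parser above), here is a code generator that walks a parse tree and emits a line of shell instead of machine code:

def emit_shell(node):
    if isinstance(node, int):                # leaf: numeric literal
        return str(node)
    op, left, right = node                   # inner node: (operator, lhs, rhs)
    return '(%s %s %s)' % (emit_shell(left), op, emit_shell(right))

tree = ('*', 2, ('+', 3, 4))                 # e.g. the tree produced above
print('echo $(( %s ))' % emit_shell(tree))   # echo $(( (2 * (3 + 4)) ))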

Another advantage of taking this course is that it introduces you to what is called "dual language programming" -- programming your software system in two languages (for example, TCL+C, Python+C, JPython+Java, Shell+Perl, Groovy+Java). For many problems this approach is more productive than any currently fashionable programming paradigm. When you implement an interpreter for a scripting language (for example TCL or Python) in a lower level compiled language (such as C), you can preserve the capability to call, from the higher level language, programs that adhere to certain calling conventions but are written in the low level language. In the case of TCL this was one of the design goals for the language and its interpreter (Lua is similar in this respect). In such cases you still have access to the lower level language, in which you can write the parts of your software that are difficult or very inefficient to express in the higher level language. This is a tremendous advantage, as you can mix and match the two levels in creating your software product.
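
A minimal sketch of the dual-language idea, using Python's ctypes module to call a routine from the C runtime (the library lookup is platform-dependent; this assumes a POSIX system with a findable libc):

import ctypes
import ctypes.util

# Load the C runtime and call strlen() from Python; on other platforms
# the shared library name will differ (this is only a sketch).
libc = ctypes.CDLL(ctypes.util.find_library('c'))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b'dual language programming'))   # 25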

Knowledge of compiler construction is the key to understanding any particular programming language,
especially a complex one

You can't understand a compiled or scripting programming language without understanding how particular data types are represented in the computer and how particular constructs are implemented, at least at a high level. This includes such notions as references, identifier tables (and the related notion of namespaces), structures (which can hold either values or references, including references to functions; in OO languages such structures are called objects), garbage collection, passing parameters by value vs. by reference, etc. Niklaus Wirth reportedly joked that, because Europeans pronounce his name properly, while Americans pronounce it as "nickel's worth", he is called by name in Europe and called by value in America.
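
To give one concrete example of those notions (a toy sketch, all names mine), identifier tables with nested namespaces usually boil down to something as simple as a chain of dictionaries:

class SymbolTable:
    def __init__(self, parent=None):
        self.symbols = {}              # name -> attributes for this scope
        self.parent = parent           # enclosing scope, or None for globals
    def define(self, name, **attrs):
        self.symbols[name] = attrs
    def lookup(self, name):
        scope = self
        while scope is not None:       # walk outward through enclosing scopes
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise KeyError('undeclared identifier: %s' % name)

globals_ = SymbolTable()
globals_.define('pi', type='real', value=3.14159)
locals_ = SymbolTable(parent=globals_)
locals_.define('i', type='integer')
print(locals_.lookup('pi'))            # found in the enclosing (global) scope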

There is another important lesson which you can learn by studying compiler construction and the methods of translating particular constructs into the instructions of either a real or a virtual machine: writing compilers is hard, and raising the level of a language amounts to raising the level of abstraction at which you view the problem. Not all aspects of a problem can be viewed from a high level of abstraction. The ability to switch from one level of abstraction to another (lower or higher) is one of the traits of talented software architects.

When you have no clue about the methods used for the implementation of particular constructs, you can't write good programs. And the higher the level of the language, the more difficult it is to maintain this understanding. For example, very few people understand how OO constructs are implemented. Unfortunately, many programming books try to obscure this aspect, do not provide any details of the implementation or of the history of the emergence of a feature (which is very important for understanding), and treat such constructs as something "given".
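
For instance, class-based method dispatch is far less mysterious once you see it written out by hand. The following stripped-down sketch (illustrative only; real implementations use vtables of function pointers rather than dictionaries) shows roughly what a "virtual call" lowers to:

# Each "class" is a table of methods plus a link to its superclass;
# each "object" carries a pointer to its class. A method call is a
# lookup along the inheritance chain followed by an ordinary call.
Animal = {'super': None, 'speak': lambda self: '...'}
Dog    = {'super': Animal, 'speak': lambda self: 'woof'}

def new(cls, **fields):
    return {'class': cls, **fields}

def send(obj, selector, *args):
    cls = obj['class']
    while cls is not None:             # dynamic method lookup
        if selector in cls:
            return cls[selector](obj, *args)
        cls = cls['super']
    raise AttributeError(selector)

rex = new(Dog, owner='me')
print(send(rex, 'speak'))              # woof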

If you have only a vague understanding of the way a particular construct is implemented in the compiler or interpreter, you will spend a lot of time "struggling with the language" -- trying to understand what can be called "language bureaucracy" instead of solving the problem.

Another problem is that modern languages have become way too complex. They are essentially incomprehensible to humans, and most programmers are able to learn only a subset. In this sense, knowledge of how a particular feature is compiled into object code or interpreted is the only way to really understand some features of the language.

I suspect that this drive to overcomplexity, while it has led to some important successes in creating VHL languages (Python is now among the top ten algorithmic languages in use; in October 2017 it occupied fifth position in the TIOBE Index), is to a certain extent a self-defeating approach -- the resulting language becomes too "top heavy", infused with esoteric constructs that few use or need, and the interpreter becomes too complex to maintain. At this point both the language and the interpreter need a powerful, financially rich sponsor to continue development. If such a sponsor bails out (like O'Reilly did in the case of Perl), the language stagnates, as maintaining the interpreter and libraries is no longer fun and few developers are ready to deal with such a level of complexity.

"Dual language programming approach" might be more productive and allow on both levels use simpler languages (for example in pair TCL+C,  TCL is really much  simpler then, say Perl or Python).  Also the design of a particular language compiler or interpreter is heavily influenced by the current flavor of "religious fundamentalism" which is endemic in this area.  Previously those were "structured programming" and "verification" frenzy. Now it is OO, althouth the religious fervor now is going down in this area, as it is clear that OO is unable to produce the benefits expected and never will, while contibuting to excessive complexity of the language and its  imlementation in comparison with classic Algor-60 style procedural languages (Rees Re OO)

(Jonathan Rees had a really interesting response to Why Arc isn't Especially Object-Oriented, which he has allowed me to reproduce here.)

... ... ...

Because OO is a moving target, OO zealots will choose some subset of this menu by whim and then use it to try to convince you that you are a loser.

(I saw regular utilities for Linux maintenance tasks written in Python in OO style -- that was a horror show, or more correctly "art for art's sake".) Excessive attention to the religious fundamentalism aspect of programming language culture is really detrimental to the quality of implementation (and speed of execution) of many classes of programs. Also, languages that subscribe too heavily to a particular fashion tend to disappear when the fashion dies out. The fates of Pascal (only the version with Borland Delphi extensions is still used) and Smalltalk are very important lessons in this respect. While the OO flag is still carried by C++, Java, Python and Ruby, all those languages allow procedural programming as well.

Actually, it is better to learn a language that allows you to use regular procedural programming along with OO, such as C++, Perl or Python, than a language in which you are forced into the straitjacket of OO. When I see a book on "algorithms and data structures" which uses OO in sorting or searching algorithms, I do not know whether I should laugh or cry. In any case it is simply disgusting. The KISS principle is clearly violated here: not all areas need, or can benefit from, the OO approach. Moreover, the rational part of the OO approach can be modeled in non-OO languages as well. The recent growth in popularity of JavaScript suggests that alternative approaches can be as good or even better (JavaScript uses a prototype-based object model, which is distinct from, and in some respects superior to, the Simula 67-inspired class-based OO model used in C++, Java, Python, and similar languages).
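
The prototype model is easy to demystify as well. A toy sketch (mine, not JavaScript's actual semantics): objects have no class at all, just slots plus a link to another object to which failed lookups are delegated:

class Proto:
    def __init__(self, proto=None, **slots):
        self.proto = proto             # the object we delegate to
        self.slots = dict(slots)
    def get(self, name):
        obj = self
        while obj is not None:         # walk the prototype chain
            if name in obj.slots:
                return obj.slots[name]
            obj = obj.proto
        raise AttributeError(name)

point = Proto(x=0, y=0, coords=lambda self: (self.get('x'), self.get('y')))
p1 = Proto(proto=point, x=3)           # overrides x, inherits y and coords
print(p1.get('coords')(p1))            # (3, 0)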

At the same time it is important to understand that the programming language and its compiler or interpreter are only a small part of the total "language ecosystem". The quality of this ecosystem, especially the quality of the libraries (in particular the part of the libraries that provides the interface to the underlying OS, such as Linux) and of the debugger, is as important as, or even more important than, the quality of the language itself. As Knuth pointed out long ago, in evaluating the quality of a language you need to see the larger picture -- not the language as such, but the whole programming ecosystem that exists around the language, which includes the debugger, IDE, libraries, books, Web sites, groups, etc.

The idea of creating an abstract, domain-oriented machine and an associated language
 -- compiler construction as a programming paradigm

Compiler design and the related set of classic algorithms provide a pretty flexible software architecture that can be called an "abstract machine" architecture. Sometimes using this architecture and adapting it to a particular task can make a design more transparent and more easily debugged. While separating the lexer, parser and semantic routines is a very powerful way of structuring many types of programs, especially text conversion/transformation programs, the real power of this approach is in identifying the key operations of some "abstract machine" specific to a particular domain, and the language that allows you to operate with those (newly defined) high level operations and data structures.

In other words, structuring a large program as if it were a compiler for some new specialized language helps to solve problems that are more difficult or even impossible to solve any other way. It can add some elegance to the architecture of many types of text processing and interactive programs. BTW, coroutines were first introduced as a way to simplify writing compilers (Melvin Conway was a pioneer of this approach; see also his Conway's Law).
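
In a modern scripting language the coroutine idea comes almost for free. Here is a sketch (illustrative names only) of compiler passes written as Python generators and connected like a Unix pipeline, each pass pulling tokens from the previous one on demand:

import re

def scanner(text):                       # pass 1: split into raw lexemes
    for match in re.finditer(r'\d+|[A-Za-z_]\w*|\S', text):
        yield match.group()

def classify(lexemes):                   # pass 2: attach token classes
    for tok in lexemes:
        if tok.isdigit():
            yield ('NUM', int(tok))
        elif tok[0].isalpha() or tok[0] == '_':
            yield ('ID', tok)
        else:
            yield ('OP', tok)

def drop_semicolons(tokens):             # pass 3: a trivial filter
    for kind, value in tokens:
        if (kind, value) != ('OP', ';'):
            yield (kind, value)

pipeline = drop_semicolons(classify(scanner('x = 42;')))
print(list(pipeline))                    # [('ID', 'x'), ('OP', '='), ('NUM', 42)]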

An approach in which the input to a program is treated as a formal language that the program transforms into another language has now gained some traction in the XML world. But it is in no way limited to XML, and the idea has much wider applicability. In a perverted way it is also present in OO languages, which can be viewed as a simplistic implementation of the "compiler-compiler" approach to problem solving. The rational ideas related to the creation of specialized data types and a set of abstract operations on them (disguised as objects) are buried under superficial things such as inheritance issues, the machinery of working with hierarchical namespaces and the related "visibility" problems, the integration of exceptions into this framework, as well as the semi-religious desire to outlaw alternative approaches that is typical of OO zealots.

In other words, elements of compiler technology, especially the notion of the scanner-parser-generator troika, are applicable to a wide range of programming tasks. Of course, to use them you need to understand how a compiler is structured and what algorithms are used. That's why a compiler construction course is so crucial. And again, while difficult (especially if overloaded with redundant formalisms), it is really exciting. In some ways, creating a new language is how humans solve complex problems.

That is one of the reasons I created this page, as part of my career was devoted to compiler writing and teaching compiler construction. It is now so long in the past that I've already forgotten most of the things I used to know :-(, and that's sad. This page reminds me that some time in the past I knew a lot more than I do now, and probably was a better programmer.

But the key ideas remain applicable to my current work. And I can attest that the fact that, by accident, I learned the most about programming at this early stage of my career by writing compilers made me a better programmer than my more talented colleagues, who did not have such an opportunity. Without this stage I would probably never have managed to reach my present level. I would also be more susceptible to fads (which are abundant in programming, with each decade producing a couple of "new and exciting" technologies that are solidly forgotten in the next decade ;-). As compilers are one of the most complex types of programs, writing them makes you skeptical about any claim that somebody has somehow invented a "silver bullet". Moreover, it allows you to see that some of the prophets of those new fads are simply charlatans who do not know much about the problems that arise in writing large and complex programming systems.

On the other hand, it is really unfortunate that many students today study only a couple of complex and boring languages (C++ and Java; sometimes one scripting language such as Python) and never learn an assembler language. The latter is not only much simpler than any high-level language, but is also a necessary prerequisite for any decent compiler course. It allows you to understand how the computer operates, and the ability to work on multiple levels of abstraction and switch from one level to another with ease is a sign of a top software architect. This is both a native ability and an acquired skill, so without learning assembler language programming the native ability might lie dormant.

Many tricks with bit arithmetic that compiler writers widely use -- first of all, treating words as sets of bits capable of implementing set operations, as well as the superimposition of different types on a single memory location, thus redefining the memory area -- are best learned in assembler language. That's why I still think that a student who has not taken an assembly course at the university is a slightly crippled student, and IMHO only the most gifted students can overcome an overdose of OO, Java, and other fancy topics in the current university curriculum ;-). I would like to stress again that understanding the low level architecture of the computer, including but not limited to its instruction set, is critical for any top programmer, because one of the key features of top programmers is the ability to switch the level of abstraction they operate at from high to low and vice versa. You can't think in terms of low level computer architecture details if you never learned them -- if you never studied one or more computer architectures and programmed in assembler. That also means that the high level language skills of such people are in many ways deficient.
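
The words-as-sets trick carries over to any language with bitwise operators; here is a small sketch of it in Python (the register numbers are made up), the way a compiler might represent sets of live registers:

def make_set(*members):
    word = 0
    for m in members:
        word |= 1 << m                      # add member m to the bit set
    return word

live_in_block1 = make_set(0, 3, 5)          # registers 0, 3 and 5 are live
live_in_block2 = make_set(3, 4)

both   = live_in_block1 & live_in_block2    # set intersection in one AND
either = live_in_block1 | live_in_block2    # set union in one OR
print(bool(both & (1 << 3)))                # True: register 3 is in both sets
print(bin(either))                          # 0b111001, i.e. the set {0, 3, 4, 5}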

If a student has a real desire to obtain an intimate knowledge of computer architecture, then compiler construction is the way to go. You will need to understand the instruction set and programming conventions, learn to manipulate registers and bitfields efficiently, work with different and sometimes pretty complex data structures, and so on. If you look at Knuth's books (for example volume 3 of The Art of Computer Programming) and compare the implementation of some sorting algorithms in MIX and in a high level language, you will see that many important algorithms can be expressed in assembler almost as compactly as in higher level languages.

This multitude of abstraction levels, and the ability to create your own set of abstractions for solving a particular problem, is really what computer science is all about, and in compiler writing many algorithm-related topics arise naturally and stimulate deeper learning of algorithms and data structures. We seldom learn things we do not need ;-).

To make a parallel with math, I see compiler writing as closer to pure computer science, as opposed to applied computer science. It is not accidental that the famous The Art of Computer Programming by Donald Knuth was initially intended to be a book about writing compilers. The upper layers of software -- applications, GUI, etc. -- can probably be called applied computer science. I mean, a biologist programming genome decoding software is mainly 'using' computer technology, for example the regular expression engine. But if he or she gets deep into it, he or she might modify or enhance the regular expression engine to better suit their needs and achieve higher productivity. The generic regular expression language in this case is just a starting point, not the end of the journey.

Thus compiler writing is more about computers and algorithms in themselves, and as such it is, IMHO, more attractive to those who are interested in those areas.

The key idea of the "compiler construction as a programming paradigm" movement is the creation of a set of domain specific operations and data types, as well as a language that is capable of operating on them

The key point here is that many programming problems are better solved by creating a specialized language working with a set of abstract operations (you can call it an "abstract, domain specific machine", as the term "virtual machine" is already taken). The validity of this approach has been demonstrated many times in very different contexts.
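
A toy sketch of such an "abstract, domain specific machine" (the domain here -- trivial text clean-up -- and all names are invented for illustration): the domain operations are defined once in a dispatch table, and a "program" is just data that drives them:

OPS = {
    'strip':   lambda text, _: text.strip(),
    'lower':   lambda text, _: text.lower(),
    'replace': lambda text, args: text.replace(*args),
}

def run(program, text):
    for opcode, args in program:
        text = OPS[opcode](text, args)   # each step is one domain operation
    return text

program = [
    ('strip', None),
    ('lower', None),
    ('replace', ('compilers', 'interpreters')),
]
print(run(program, '  Writing Compilers Is Fun  '))
# -> writing interpreters is fun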

That is why I would like to call compiler construction a programming paradigm, not just a pretty obscure and semi-forgotten area of computer science. Moreover, compiler construction was, and probably still is, the source of many neat ideas in programming in general. Please remember that coroutines -- a fundamental idea of modern programming -- were first invented by Conway (see M. E. Conway, Design of a Separable Transition-Diagram Compiler, Comm. ACM, 6(7), pp. 396-408, 1963) for the explicit purpose of writing a compiler, and only much later found their way into other areas of programming (first to simulation languages, then to operating systems via the pipe concept pioneered in Unix, then to high level languages such as Modula and now to scripting languages such as Python and Ruby). BTW, Conway wrote his compiler in assembler. The famous PL/C compiler by Richard W. Conway and Thomas R. Wilcox was written in assembler too.

That suggests that assembler language was, and still is, conducive to viewing a problem in a different light than when you use just a high-level language, and that a mixed language approach to programming might be better than a monolithic approach. Currently the most acceptable way for a novice to write a compiler is to use a combination of TCL and C, with a manually written lexical analyzer (in C) and a parser and code generator mainly in TCL (if the language is not stillborn, it is always possible to rewrite the parser in C later for efficiency).

A programming language is just a tool, and in no way should one use only a single tool for a complex project. I would reiterate the value of studying assembler as a necessary step in obtaining a solid programming education, and I would like to remind the reader that one will probably never find a better book for studying this important programming technique than the first volume of TAOCP, in which Donald Knuth used a simple artificial assembler for a hypothetical computer called MIX.

Some suggestions

For those who would like to try this approach I have several suggestions:

Dr. Nikolai Bezroukov


NEWS CONTENTS

Old News ;-)

To preserve bandwidth for humans as opposed to robots, older news (1999-2003) has been moved to a separate subdirectory. Sorry for any inconvenience.


[Nov 23, 2015] Knuth Recent News

So far only MMIX can be ordered in ePub and PDF formats

Announcing the first Art of Computer Programming eBooks

For many years I've resisted temptations to put out a hasty electronic version of The Art of Computer Programming, because the samples sent to me were not well made.

But now, working together with experts at Mathematical Sciences Publishers, Addison-Wesley and I have launched an electronic edition that meets the highest standards. We've put special emphasis into making the search feature work well. Thousands of useful "clickable" cross-references are also provided --- from exercises to their answers and back, from the index to the text, from the text to important tables and figures, etc.

Note: However, I have personally approved ONLY the PDF versions of these books. Beware of glitches in the ePUB and Kindle versions, etc., which cannot be faithful to my intentions because of serious deficiencies in those alternative formats. Indeed, the Kindle edition in particular is a travesty, an insult to Addison-Wesley's proud tradition of quality.

Readers of the Kindle edition should not expect the mathematics to make sense! Maybe the ePUB version is just as bad, or worse --- I don't know, and I don't really want to know. If you have been misled into purchasing any part of eTAOCP in an inferior format, please ask the publisher for a replacement.

The first fascicle can be ordered from Pearson's InformIT website, and so can Volumes 1, 2, 3, and 4A.

MMIXware

Hooray: After fifteen years of concentrated work with the help of numerous volunteers, I'm finally able to declare success by releasing Version 1.0 of the software for MMIX. This represents the most difficult set of programs I have ever undertaken to write; I regard it as a major proof-of-concept for literate programming, without which I believe the task would have been too difficult.

Version 0.0 was published in 1999 as a tutorial volume of Springer's Lecture Notes in Computer Science, Number 1750. Version 1.0 has now been published as a thoroughly revised printing, available both in hardcopy and as an eBook. I hope readers will enjoy such things as the exposition of a computer's pipeline, which is discussed via analogy to the activities in a high tech automobile repair shop. There also is a complete implementation of IEEE standard floating point arithmetic in terms of operations on 32-bit integers, including original routines for floating point input and output that deliver the maximum possible accuracy. The book contains extensive indexes, designed to enhance the experience of readers who wish to exercise and improve their code-reading skills.

The MMIX Supplement

I'm pleased to announce the appearance of an excellent 200-page companion to Volumes 1, 2, and 3, written by Martin Ruckert. It is jam-packed with goodies from which an extraordinary amount can be learned. Martin has not merely transcribed my early programs for MIX and recast them in a modern idiom using MMIX; he has penetrated to their essence and rendered them anew with elegance and good taste.

His carefully checked code represents a significant contribution to the art of pedagogy as well as to the art of programming.

[Jan 07, 2013] On the translation of languages from left to right, Knuth, D.E.

Scanned version of Knuth, D.E., On the translation of languages from left to right, Information and Control 8, 607-639 (1965). Here are pages 607-616.

[Jan 07, 2013] WikiBook Compiler construction

This is a Wikipedia book, a collection of Wikipedia articles that can be easily saved, rendered electronically, and ordered as a printed book
Wikipedia
Introduction
Compiler construction
Compiler
Interpreter
History of compiler writing
Lexical analysis
Lexical analysis
Regular expression
Regular expression examples
Finite-state machine
Preprocessor
Syntactic analysis
Parsing
Lookahead
Symbol table
Abstract syntax
Abstract syntax tree
Context-free grammar
Terminal and nonterminal symbols
Left recursion
Backus–Naur Form
Extended Backus–Naur Form
TBNF
Top-down parsing
Recursive descent parser
Tail recursive parser
Parsing expression grammar
LL parser
LR parser
Parsing table
Simple LR parser
Canonical LR parser
GLR parser
LALR parser
Recursive ascent parser
Parser combinator
Bottom-up parsing
Chomsky normal form
CYK algorithm
Simple precedence grammar
Simple precedence parser
Operator-precedence grammar
Operator-precedence parser
Shunting-yard algorithm
Chart parser
Earley parser
The lexer hack
Scannerless parsing
Semantic analysis
Attribute grammar
L-attributed grammar
LR-attributed grammar
S-attributed grammar
ECLR-attributed grammar
Intermediate language
Control flow graph
Basic block
Call graph
Data-flow analysis
Use-define chain
Live variable analysis
Reaching definition
Three-address code
Static single assignment form
Dominator
C3 linearization
Intrinsic function
Aliasing
Alias analysis
Array access analysis
Pointer analysis
Escape analysis
Shape analysis
Loop dependence analysis
Program slicing
Code optimization
Compiler optimization
Peephole optimization
Copy propagation
Constant folding
Sparse conditional constant propagation
Common subexpression elimination
Partial redundancy elimination
Global value numbering
Strength reduction
Bounds-checking elimination
Inline expansion
Return value optimization
Dead code
Dead code elimination
Unreachable code
Redundant code
Jump threading
Superoptimization
Loop optimization
Induction variable
Loop fission
Loop fusion
Loop inversion
Loop interchange
Loop-invariant code motion
Loop nest optimization
Manifest expression
Polytope model
Loop unwinding
Loop splitting
Loop tiling
Loop unswitching
Interprocedural optimization
Whole program optimization
Adaptive optimization
Lazy evaluation
Partial evaluation
Profile-guided optimization
Automatic parallelization
Loop scheduling
Vectorization
Superword Level Parallelism
Code generation
Code generation
Name mangling
Register allocation
Chaitin's algorithm
Rematerialization
Sethi-Ullman algorithm
Data structure alignment
Instruction selection
Instruction scheduling
Software pipelining
Trace scheduling
Just-in-time compilation
Bytecode
Dynamic compilation
Dynamic recompilation
Object file
Code segment
Data segment
.bss
Literal pool
Overhead code
Link time
Relocation
Library
Static build
Architecture Neutral Distribution Format
Development techniques
Bootstrapping
Compiler correctness
Jensen's Device
Man or boy test
Cross compiler
Source-to-source compiler
Tools
Compiler-compiler
PQCC
Compiler Description Language
Comparison of regular expression engines
Comparison of parser generators
Lex
Flex lexical analyser
Quex
Ragel
Yacc
Berkeley Yacc
ANTLR
GNU bison
Coco/R
GOLD
JavaCC
JetPAG
Lemon Parser Generator
LALR parser generator
ROSE compiler framework
SableCC
Scannerless Boolean Parser
Spirit Parser Framework
S/SL programming language
SYNTAX
Syntax Definition Formalism
TREE-META
Frameworks supporting the polyhedral model
Case studies
GNU Compiler Collection
Java performance
Literature
Compilers: Principles, Techniques, and Tools
Principles of Compiler Design
The Design of an Optimizing Compiler

[Dec 04, 2011] Comp.compilers Re Need an interesting topic for an undergraduate project on Compilers

From: Christophe de Dinechin <[email protected]>
Newsgroups: comp.compilers
Date: Sun, 4 Sep 2011 23:42:48 -0700 (PDT)
Organization: Compilers Central
References: 11-08-006
Keywords: courses
Posted-Date: 06 Sep 2011 22:12:39 EDT

On Aug 6, 7:28 pm, amit karmakar <[email protected]> wrote:
> I would like to have some suggestions as to what *new* and
> *innovative* project i can do which are based on compiler design.


Innovation in compilers can happen at a number of levels :


1. Parsing techniques, grammars, etc. Very active research a while
back, considered (erroneously methinks) as dead by most today, who
happily use flex/bison and don't think twice about it.


2. Language design. One of the active areas these days is "domain
specific languages" or DSLs, i.e. languages designed for one specific
need. Often using "meta-programming" techniques (programs that
generate programs)


3. Type systems, proofs, program validation. Languages like Haskell
use type inference, so that you don't have to specify types yourself
most of the time. C++ recently gained the "auto" keyword for types.
DSLs pose a new class of interesting problems in that space.


4. Intermediate representations, code generation and optimization
frameworks. The king of this hill these days IMO is LLVM. But there
are a number of contenders. If you are interested in optimizations,
that's the right place to look at.


5. Runtime support : garbage collectors, just-in-time code generation,
parallel execution, use of new hardware such as GPUs, 


6. Support for innovative hardware, hardware generation, hardware/
software co-design, etc. If you are more into silicon, this is a very
interesting area to learn about.


My own pet project, XLR (http://xlr.sf.net) offers a number of
innovations in the first three of these areas. It is a language
designed to grow with the user, i.e. the objective is to make it as
easy to add language constructs as it is to add, say, functions or
classes in other languages.


Regarding parsing, it generates a parse tree made of exactly 8 nodes :
integer, real, text and name/symbol represent leaves of the tree,
infix, prefix, postfix and block represent inner nodes. This makes it
possible to write programs in a very natural-looking way, yet with an
internal program representation that is easy to manipulate. This is
the foundation of XL meta-programming / DSL capabilities.


To validate that, XLR has practically no built-in constructs. It has
constructs to connect to LLVM primitives, constructs to connect to C
code, and a pair of "rewrite" constructs, notably ->, to transform one
tree shape into another. For example :


extern bool puts(text);
(x:integer - y:integer):integer -> opcode Sub


repeat 0, B -> true
repeat N, B -> B; repeat N-1, B


repeat 25,
puts "Hello"


You can check the code generated for the above with xlr -tcode -O3
tests/09.Compiler/optimized-repeat-loop.xl. LLVM actually turns it
into a sequence of 25 calls to puts, you can hardly do better.


The most active area of research for XLR these days is its type
system. In order to generate efficient code, an Haskell-like type
inference mechanism is in place. But the standard type inference
algorithms must be extended, because there are a few additional
transformations compared to lambda calculus (not just "alpha" and
"beta"), and the closest there is to a type is the shape of a tree
(e.g. "if X then Y else Z").


Since it uses LLVM, it is also an interesting way to learn a little
about LLVM, but it's not intended as an LLVM tutorial.


So if you are interested in experimenting with "growing a language" in
a text-based framework, XLR is the right way to go. There are other
projects that are more advanced e.g. if you want to build the IDE at
the same time, see for example JetBrain's Meta Programming System. But
they are not as strong in language development per-se, I believe.

[Dec 04, 2011] Roslyn Project Overview

Microsoft Developer Network

October 2011

Traditionally, compilers are black boxes -- source code goes in one end, magic happens in the middle, and object files or assemblies come out the other end. As compilers perform their magic, they build up deep understanding of the code they are processing, but that knowledge is unavailable to anyone but the compiler implementation wizards and it is promptly forgotten after the translated output is produced.

For decades, this world view has served us well, but it is no longer sufficient. Increasingly we rely on integrated development environment (IDE) features such as IntelliSense, refactoring, intelligent rename, "Find all references," and "Go to definition" to increase our productivity. We rely on code analysis tools to improve our code quality and code generators to aid in application construction. As these tools get smarter, they need access to more and more of the deep code knowledge that only compilers possess. This is the core mission of the Roslyn project: opening up the black boxes and allowing tools and end users to share in the wealth of information compilers have about our code. Instead of being opaque source-code-in and object-code-out translators, through the Roslyn project, compilers become services -- APIs that you can use for code related tasks in your tools and applications.

The transition to compilers as services dramatically lowers the barrier to entry for creating code focused tools and applications. It creates many opportunities for innovation in areas such as meta-programming, code generation and transformation, interactive use of the C# and VB languages, and embedding of C# and VB in domain specific languages.

The Microsoft "Roslyn" CTP previews the new language object models for code generation, analysis, and refactoring, and the upcoming support for scripting and interactive use of C# and Visual Basic. This document is meant to be a conceptual overview of the Roslyn project. Further details can be found in the walkthroughs and samples included in the Roslyn CTP.

Summary

Project Roslyn exposes a set of Compiler APIs, Scripting APIs, Workspace APIs, and Services APIs that provides rich information about your source code and that has full fidelity with the C# and Visual Basic languages. The transition to compilers as a service dramatically lowers the barrier to entry for creating code focused tools and applications. It creates many opportunities for innovation in areas such as meta-programming, code generation and transformation, interactive use of the C# and VB languages, and embedding of C# and VB in domain specific languages.

[Oct 27, 2011] Writing a Programming Language in Perl

Oct 25, 2011
programmer99 has asked for the wisdom of the Perl Monks concerning the following question:
Hello. This is my first post, so please forgive me for any mistakes I make.

I have been programming for about 16 years now and Perl has always been one of my absolute favorite languages. I have been wondering how I can make a Programming language using Perl, only Perl and nothing but Perl. How would I go about doing this?

I have never created a language before. I know the basics of what a compiler is and how it works. But I don't know how to make one. After I make the language (which I don't know how to do) how do I make the language work with the compiler?

I truly do not want to read another book, as these can be extremely unhelpful. There really aren't any good tutorials that I can find. Can anyone give me some kind of tutorial or a point in the right direction? Thanks

P.S.- No mistakes that I can find! I'm good at this :-)

BrowserUk (Pope)

See Exploring Programming Language Architecture in Perl by Bill Hails. for one of the most complete explorations of doing exactly what the title suggests.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". >

In the absence of evidence, opinion is indistinguishable from prejudice.

Your Mother (Abbot)

BUK's suggestion is quite good. IIRC, it's a fun manuscript.

I've been meaning to suggest Marpa to you for awhile. While still marked experimental it should be excellent for this kind of thing. I've wanted to play with it myself for a long time but tuits being what they are...

Marpa would be used, I'd expect, to convert a language spec/grammar to Perl and then you could use any number of strategies to freeze the resulting code (in memory, file, some other specialized cache) until the original source changed, at which point it could be "recompiled." Doing it this way should afford an easy path for testing and building up a grammar piece by piece. If you don't write tests as you go, you will be sorry down the road. If you do write tests, you'll be able to breeze through wrong turns and accidentally introduced bugs with confidence.

programmer99
Ok, I have a few questions (if you don't mind). What do you mean by grammar? How will Marpa help me? Thanks.

Your Mother (Abbot)

As I say, tuits in the round are the issue and without the time to cook up some working example I'd just be glossing the various documentation and BNF/ParseRec stuffage. I think a simple example can be cooked up but it's also not entirely trivial or my forte (hint, hint to others who might have something to show). This is quite interesting and I've been waiting to try some Marpa though. If I can grab an hour tonight I'll revisit this otherwise, the weekend, otherwise... uh, what were we talking about?

ikegami

The grammar of a language defines its syntax. Parsers (e.g. Marpa) take a sequence of bytes, characters or tokens, check if it conforms to the grammar, and assigns meaning to the components of the sequence as per the grammar.

For example, the job of the parser is to receive "a+b*c" and return "a multiplication of ( an addition of ( identifier a ) and (identifier b ) ) and ( identifier c )".

ikegami

Marpa - Parse any Language You Can Describe in BNF

Hopefully, it can do a bit more than that, because associativity cannot be described in BNF.

davies

Have a look at Jonathan Worthington's slides (http://www.jnthn.net/papers/2011-yapc-russia-compiler.pdf). The code is next door on his web site. But as this was a public talk, there may be a recording of it on YAPC.tv. My spies tell me that the talk was outstanding, writing a complete language (well, almost) from next to nothing in under an hour.

Regards,

John Davies

ikegami

If you're interested in parsing and compiling, consider getting your hands on Compilers: Principles, Techniques, and Tools. You won't find any Perl in it, though.

Corion (Saint)

In my opinion a very good book is Compiler Construction (pdf warning) by Niklaus Wirth. I've only used the German edition from 1986, but it introduces a small language and how to build a parser and compiler for it. It only teaches you the technique of recursive descent. Recursive Descent is a technique that is currently enjoying a comeback because it is easier with it to produce meaningful error messages than it is when using lex and yacc or other parsing tools. The German book has 115 pages, the English and more recent edition (from 1996) has 130 pages. This seems to me a good starting point in the sense that little previous knowledge is needed and it gets you a feel for implementing some language with Pascal-like syntax. Actually, in the English PDF, it's an Oberon-like syntax, but that's still close enough.

When I first read the book, I skipped over the chapters 1 to 4, as they concern themselves with the theoretical background of the languages that the approach can handle. Looking back over the chapters, I'm not sure whether it was really necessary to skip over them, as they don't seem as dry to me as they did back then. On the other hand, I now have much more experience and also learned the theory of regular languages and context-free languages in university.

Browsing through the PDF, I see that it also mentions bottom-up (yacc-style) parsing and gives a rough outline of it.

andal:

The work of Crenshaw at http://compilers.iecc.com/crenshaw/ appears to be simpler. But YMMV.

spx2
There's a lot of theory about "bla grammars bla bla translators bla ... context-free context-dependent bla bla etc".
You can just throw that Dragon Book out the window and study the grammar of some small language like Basic or JavaScript's grammar to learn stuff
(here you can find a lot of grammars to many many languages).
This is not rocket science, it's engineering.
So I totally agree with you about not wanting to read 9999 books on the subject, you need to build something.
I'm not in any way a compiler/language expert but I do feel that you want to get down to business
so.. here's Parse::Yapp, here's an article , here's another article , here's some benchmarks and comparisons between different parser generators for Perl,
and another comparison , and another article. If you have further questions, I'm sure the monastery has lots of monks with knowledge of Parse::Yapp
so your questions will receive an answer.
Now go write a language and good luck !
hbm (Pilgrim)
For ideas, you might look at Brainf*ck, which has a simple grammar of: <>+-[],.
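
For illustration, here is a minimal Brainfuck interpreter sketch in Python (not taken from the thread, where the target language is Perl); it shows how little machinery that eight-symbol grammar actually needs:

import sys

def bf(code, data_size=30000):
    # Pre-compute matching brackets so loops can jump in O(1).
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i

    tape, ptr, pc = [0] * data_size, 0, 0
    while pc < len(code):
        c = code[pc]
        if c == '>':   ptr += 1
        elif c == '<': ptr -= 1
        elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '.': sys.stdout.write(chr(tape[ptr]))
        elif c == ',': tape[ptr] = ord(sys.stdin.read(1) or '\0')
        elif c == '[' and tape[ptr] == 0: pc = jumps[pc]
        elif c == ']' and tape[ptr] != 0: pc = jumps[pc]
        pc += 1

bf('++++++++[>++++++++<-]>+.')           # prints 'A' (ASCII 65)
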
Your Mother

Speaking of which, this is worth revisiting for anyone interested or that missed it the first time around: Ook interpreter.

sundialsvc4

Creating a programming language from scratch is an interesting exercise. I not only did that, but sold umpteen thousand copies of a system that uses it. (So there...) But I dunno if I would implement the entire language in Perl. If you do it, I'd love to see it.

JavaFan
I'd start by first thinking what your new language should do. Is it going to be a domain specific language? Or a general purpose one? If the latter, what is its strength going to be? Is it going to be procedural? functional? OO? Block based? List based? Something else?
Second, make some trade-offs. Do you want to write a simple compiler/interpreter? (That probably means a simple syntax, not many features in your language). Does the compiler have to be run fast? Or do you want to give much power to the user of the language (resulting in a more complex compiler)?

Only then I'd worry about the implementation.

Oh, and my step 0 would be is to ponder about "if I really haven't a fucking clue on how to do that, is my disdain to read a book justified"?

spx2:
well, just look over Parse::Yapp, that would be your entry point.
JavaFan (Abbot)
In a way, I have already thought this out. The issue I am having now is how to get started with the actual code.

So, your question really is, I know what I want to do, but I'm not telling you what it is, yet I ask from you how to get started?

Is that how you ask for directions in an unknown city as well? You ask I want to go somewhere, in which direction do I go?

Anonymous Monk
LOL *cough* *cough* aXML *cough* *cough*
Anonymous Monk
Parsing Strings and Trees with Parse::Eyapp (An Introduction to Compiler Construction)
  • http://search.cpan.org/dist/HOP-Lexer/lib/HOP/Lexer/Article.pod Lexing Without Grammars: When Regular Expressions Suck
  • http://hop.perl.plover.com/Examples/ALL/calculator
  • http://hop.perl.plover.com/Examples/ALL/regex-parser
  • http://hop.perl.plover.com/linogram/
    [Dec 29, 2010] Making C compiler generate obfuscated code

    The thread contains a very interesting discussion...
    Dec 07, 2010 | Comp.compilers

    About my little attempt to hack Tiny C compiler's codegenerator so it produces obfuscated code:
    http://blogs.conus.info/node/58

    Martin Ward <[email protected]> :

    On Thursday 16 Dec 2010 at 15:23, Joshua Cranmer <[email protected]> wrote:

    > I'm not sure what work has been done on deoptimization (perhaps anyone
    > with the Hex-Rays decompiler could tell us?), but some of the
    > optimization techniques seem relatively easy to reverse.
    > > From a purely theoretical standpoint, obfuscation that adds
    > non-executing code is going to be less difficult to reverse engineer
    > than obfuscation that does the same thing, just... less obviously.

    A major theoretical result in this area is the paper "On the (Im)possibility of Obfuscating Programs" by Boaz Barak et al, published in: CRYPTO '01 Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology. Boaz Barak gives a non-technical description of the results and their meaning here:

    http://www.math.ias.edu/~boaz/Papers/obf_informal.html

    The FermaT program transformation system can interactively transform code into semantically equivalent forms: including restructuring, simplification, dataflow analysis , SSA construction, slicing and, in simple cases, abstraction to a specification. The FermaT engine and graphical front end runs on Windows and Linux and can be downloaded from here:

    http://www.gkc.org.uk/fermat.html

    or:

    http://www.cse.dmu.ac.uk/~mward/fermat.html

    FermaT's primary commercial application is migrating assembler to structured and maintainable C and COBOL, so the "deobfuscation" transformations are geared towards removing the "clever tricks" introduced by assembler programmers to save a byte here and a cycle there: or just because they were

    -- Martin

    STRL Reader in Software Engineering and Royal Society Industry Fellow [email protected] http://www.cse.dmu.ac.uk/~mward/ Erdos number: 4 G.K.Chesterton web site: http://www.cse.dmu.ac.uk/~mward/gkc/ Mirrors: http://www.gkc.org.uk and http://www.gkc.org.uk/gkc


    [Dec 29, 2010] Compiler construction by Prof. O. Nierstrasz

    Fall Semester 2008
    01Intro.pdf 17-Sep-2008 09:49 2.8M
    02Lexical.pdf 17-Sep-2008 16:38 853K
    03Parsing.pdf 01-Oct-2008 09:55 1.7M
    04ParsingPractice.pdf 17-Sep-2008 16:39 757K
    05SemanticAnalysis.pdf 20-Oct-2008 14:49 1.2M
    06IntermediateRepresentation.pdf 17-Sep-2008 16:40 456K
    07CodeGeneration.pdf 28-Oct-2008 20:52 1.1M
    08IntroSSA.pdf 09-Dec-2008 14:29 1.0M
    09Optimization.pdf 09-Dec-2008 14:29 446K
    10PyPy.pdf 19-Nov-2008 10:57 1.0M
    11PEGs.pdf 29-Oct-2008 09:52 472K
    12DSL.pdf 02-Dec-2008 22:20 1.5M
    12DSL.zip 02-Dec-2008 22:26 21M
    13ProgramTransformation.pdf 26-Nov-2008 12:08 2.3M


    Compiler Construction

    [Dec 28, 2010] CS 375 Compilers Lecture Notes

    Copyright © 2009 by Gordon S. Novak Jr.

    Permission is granted for individuals to make copies of these notes for personal use, or for instructors to make copies for classroom use.

    Note: Many of these pages use math symbols such as ∀ and ∃. Microsoft Internet Explorer will not display the math symbols, but Firefox will.

    Index

    1. CS 375, Compilers: Class Notes

    2.

    3. Course Topics

    4. Pascal Test Program

    5. Introduction

    6. Machine Language

    7. Assembly Language

    8. High-Level Language

    9. Compilers

    10. Sequential Phases of a Compiler

    11. Data Flow through the Compiler

    12. Line Handler

    13. Lexical Analyzer

    14. Syntactic Analyzer

    15. Semantic Analysis

    16. Lexical Analysis

    17. Character Classes

    18. Implementation of Character Classes

    19. Hand-written Lexical Analyzer

    20. Example Lexical Analyzer

    21. Flowchart for Parsing Identifier

    22. Lexical Language Design

    23. Token Data Structure

    24. Example Token Data Structure

    25. Number Conversion

    26. Simple Number Scanner

    27. Lexical Analyzer Output

    28. Floating Point Numbers

    29. IEEE Floating Point Standard

    30. Floating Point Examples

    31. Errors

    32. Error Messages

    33. Formal Syntax

    34. Grammar

    35. Language Generation

    36. Parsing

    37. Ambiguity

    38. Notation

    39. Phrase Structure Grammar

    40. Chomsky Hierarchy

    41. Recognizing Automaton

    42. Chomsky Language Hierarchy

    43. Regular Languages

    44. Example Regular Language

    45. lex

    46. Regular Expressions

    47. Lex Specifications

    48. Sample lex Specification

    49. C for Lex Sample

    50. lex.yy.c

    51. Comments on Sample lex

    52. Translation Section

    53. Lex Conventions

    54. The Lookahead Operator

    55. Auxiliary Procedures

    56. Parser Overview

    57. Context Free Languages

    58. Context Sensitive Languages

    59. Derivations

    60. Language Generated by a Grammar

    61. Ambiguity and Left Recursion

    62. Parsing

    63. Top-down Parser

    64. Bottom-up Parsing

    65. Chart Parser

    66. Augmented Transition Network Grammars

    67. Augmented Transition Networks

    68. Context Free Parser

    69. Semantics Influences Parsing

    70. Arithmetic Expressions

    71. Example of Operator Precedence

    72. Operator Precedence

    73. Operator Precedence Parsing

    74. Operator Precedence Parser

    75. Examples

    76. Stack Handling in C

    77. Basic Routines

    78.

    79. Operator Precedence Parser

    80.

    81. Additional Considerations

    82. Recursive Descent Parser

    83. Bottom-up Table-driven (LR) Parsing

    84. The LR Parsing Algorithm

    85. Shift-Reduce Parser

    86. Example Parsing Table

    87. A Parse of id * id + id

    88. Synthesized Translation

    89. Using yacc

    90. y.tab.c

    91. Yacc Specifications

    92. Example: Desk Calculator

    93. Yacc: Pascal Subset

    94. Auxiliary C Code

    95. Auxiliary C Code ...

    96. Example

    97. Examples ...

    98. Examples ...

    99. Hints for yacc

    100. File trivb.tree

    101. The Semantic Actions

    102. Supporting C Routines

    103. Example

    104. Comments on the Example

    105. Parsing Action Conflicts

    106. Resolving Shift/Reduce Conflicts

    107. Error Productions

    108. Error Handling

    109. Parsing Techniques

    110. Looping Statements

    111. Symbol Table

    112. Symbol Table Organization

    113. Symbol Table Organizations

    114. Binary Tree Symbol Table

    115. AVL Trees

    116. Hash Table

    117. Hash Functions

    118. Indexed Buckets

    119. Nested Procedures

    120. Tree of Symbol Tables

    121. Stack Symbol Table

    122. Symbol Table Entry

    123. Kinds of Symbols

    124. Kinds of Symbols ...

    125. Looking up ID in Symbol Table

    126. Variable Declarations

    127. Identifier List etc.

    128. Data Addressing

    129. Storage Allocation

    130. Alignment and Padding

    131. Installing Variables in Symbol Table

    132. Record Declarations

    133. Symbol Table Structures for Record

    134. Array Declarations

    135. Symbol Table Structures for Array

    136. Type Checking, Coercion, and Inference

    137. Structure References

    138. Structure References....

    139. Record References

    140. Array References

    141. Array References in Pascal

    142. Does Array Order Matter?

    143. Example of Structure Declaration

    144. Example of Structure Reference

    145. Pointer Reference

    146. Types as Tree Structures

    147. Dynamic Type Checking

    148. Static Type Checking

    149. Strong Typing

    150. Type Equivalence

    151. Type Signatures

    152. Polymorphic Procedures

    153. Table for Labels

    154. Intermediate Code

    155. Quadruples

    156. Triples

    157. Reverse Polish Notation

    158. Trees and Reverse Polish

    159. Converting a Tree to RPN

    160. Executing Reverse Polish

    161. Executing RPN

    162. RPN as Intermediate Code

    163. Code Generation

    164. Loading Process

    165. Absolute File

    166. Initializing BSS Storage

    167. Banked Memory

    168. Location Counter

    169. Example of Assembly Listing

    170. Backpatching

    171. Link Editing Process

    172. Relocatable Code

    173. Finding Relocatable Modules

    174. Assigning Absolute Addresses

    175. Absolute Addresses

    176. Link Editor

    177. Form of Relocatable Code

    178. Static Allocation

    179. Dynamically Linked Library

    180. Run-Time Support

    181. Operations by Subroutine

    182. Special Subroutines

    183. Memory Management

    184. Returning Memory

    185. Heap Memory Management

    186. Garbage Collection

    187. Garbage Collection

    188. Mark-And-Sweep Garbage Collection

    189. Mark-and-Sweep ...

    190. Copying Garbage Collection

    191. Reference Counting

    192. Reference Counting...

    193. Garbage Collection Is Expensive

    194. Compiled Procedure

    195. Subroutine Call Is Expensive

    196. Activations and Control Stack

    197. Environment

    198. Run-time Memory Organization

    199. PowerPC Stack Frame Layout

    200. SPARC Stack Frame Layout

    201. SPARC Stack Frame Layout...

    202. Global Variable References

    203. Global Variables in Algol, Pascal, PL/I

    204. Block-structured Symbol Table

    205. Run-time Stack for Algol

    206. Code Generation

    207. Code Generation

    208. Code Generation

    209. Running Generated Code

    210. Code Generation for Statements

    211. Arithmetic Expressions

    212. Basic Expression Algorithm

    213. Trace of Expression Algorithm

    214. Arithmetic Expression Algorithm

    215. Register Management

    216. Simple Register Allocation

    217. Heuristic for Expressions

    218. Improving Register Allocation

    219. Register Allocation

    220. Example of Code Generation

    221. Example (2)

    222. Example (3)

    223. Example (4)

    224. Reusing Register Contents

    225. Register Targeting

    226. SPARC Processor

    227. Load/Store Instructions

    228. Load/Store with Calculated Address

    229. Literals

    230. Integer Arithmetic Instructions

    231. Compare and Branch

    232. Floating Point

    233. Intrinsic Functions

    234. Function Calls

    235. Volatile Registers

    236. Details of Function Call

    237. IF Statement Generation

    238. IF Statement Optimization

    239. Array References

    240. Easy Array References

    241. Better Array References

    242. Pointer References

    243. switch Statement

    244. switch Statement Compiled

    245. switch Statement Compiled -O

    246. Table Lookup

    247. Table Lookup Compiled

    248. Table Lookup Compiled -O

    249. Parameter Passing

    250. Macros

    251. In-line Compilation

    252. Optimization

    253. Correctness of Optimization

    254. Optional Optimization

    255. Local and Global Optimization

    256. Easy Optimization Techniques

    257. Constant Folding

    258. Peephole Optimization

    259. Loop Unrolling

    260. Partial Evaluation

    261. Partial Evaluation

    262. Example

    263. Simple Partial Evaluator

    264. Simple Partial Evaluator...

    265. Examples

    266. Examples

    267. Binding-Time Analysis

    268. Futamura Projections

    269. Interpreter

    270. Specialization

    271. Parameterized Programs

    272. Pitfalls of Partial Evaluation

    273. Pitfalls ...

    274. Program Analysis

    275. Basic Block

    276. Finding Basic Blocks

    277. Relations and Graphs

    278. Graph Notations

    279. Bit Vector Representations

    280. Boolean Matrix Representation of Graph

    281. Dominators

    282. Intervals

    283. Definition and Reference of Variables

    284. Data Flow Analysis for a Block

    285. Availability of Expressions

    286. Data Flow Analysis for an Interval

    287. Busy Variables

    288. Variable Uses and Register Assignment

    289. Register Allocation by Graph Coloring

    290. Overview of Global Optimization

    291. gcc Compiler Optimization Options

    292. gcc Optimizations

    293. Loop Transformations

    294. Strip Mining

    295. Induction Variable Transformation

    296. Finite Differencing

    297. Example: Computing Squares

    298. General Case

    299. Finite Differencing for Set Operations

    300. Memoization

    301. Hardware Assistance

    302. PowerPC Features

    303. SPARC Features

    304. Hardware Trends

    305. Object-oriented Programming

    306. Access to Objects

    307. Domain Analysis

    308. Internal Implementation is Hidden

    309. Encapsulation with OOP

    310. Object-oriented Programming Terminology

    311. Terminology ...

    312. Implementation of Objects

    313. Are Basic Types Objects?

    314. Inheritance and Class Structure

    315. Message Sending

    316. Dynamic Method Lookup

    317. Static Method Lookup

    318. Multiple Inheritance

    319. Improving OOP Efficiency

    320. Smalltalk

    321. Smalltalk Code

    322. ThingLab

    323. ThingLab Examples

    324. Good Features of OOP

    325. Unfortunate Features of OOP

    326. Why OOP Is Not Enough

    327. Top Ten Lies About OOP

    328. Aspect-Oriented Programming

    329. Lisp

    330. History of Lisp

    331. Advantages of Lisp

    332. Lisp Interaction

    333. Function Definition

    334. List Structure

    335. Abstract Syntax Tree

    336. Binding Lists

    337. Substitution

    338. Copying and Substitution Functions

    339. Substitution in C

    340. Loop Unrolling

    341. Instantiating Design Patterns

    342. Pattern Matching

    343. Pattern Matching

    344. Transformation by Patterns

    345. Transformation Patterns

    346. Program Transformation using Lisp

    347. Dot Matching

    348. Looping Patterns

    349. Code Expansion by Looping Patterns

    350. More Complex Rules

    351. Multi-Level Patterns

    352. Use of Multi-Level Patterns

    353. Function Inlining

    354. Program Transformation

    355. Pattern Optimization Examples

    356. Examples ...

    357. Examples ...

    358. Examples ...

    359. Paul Graham:

    360. English

    361. Expression Trees to English

    362. Generating English

    363. Parsing English

    364. ATN in Lisp

    365. Parsing Functions

    366. Grammar Compiler

    367. Access to Database

    368. Restaurant Database Grammar

    369. Restaurant Queries

    370. Physics Problems

    371. Physics Queries

    Writing a Compiler in C# Lexical Analysis - All Your Base Are Belong To Us

    [Dec 27, 2010] Writing Your Own Toy Compiler Using Flex, Bison and LLVM (gnuu.org)

    1. Introduction
    2. Getting Started
    3. Step 1. Lexical Analysis with Flex
    4. Step 2. Semantic Parsing with Bison
    5. Generating Flex and Bison Code
    6. Step 3. Assembling the AST with LLVM
    7. Building and Running Our Toy Language
    8. Conclusion
    9. View All

    [Dec 27, 2010] Let's Build a Compiler

    Read the tutorial on-line

    Download the tutorial

    It's available in two formats, plain text, and with printer control characters so it will print reasonably on an Epson printer.

    [Apr 6, 2009] LEPL 2.3

    LEPL is a recursive descent parser library written in Python. It is based on parser combinator libraries popular in functional programming, but also exploits Python language features. Operators provide a friendly syntax, and the consistent use of generators supports full backtracking and resource management. Backtracking implies that a wide variety of grammars are supported; appropriate memoisation ensures that even left-recursive grammars terminate.

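    For readers who have not met the combinator style that LEPL uses, here is a minimal Python sketch of the idea (illustrative only, not LEPL's actual API): a parser is a function from (text, position) to a list of (value, new position) results, and returning several results is what makes backtracking cheap.

    # Minimal parser-combinator sketch (illustrative, not the LEPL API).
    # A parser maps (text, pos) -> list of (value, new_pos); an empty list means
    # failure, and multiple results are what give backtracking "for free".

    def lit(s):
        def parse(text, pos):
            if text.startswith(s, pos):
                return [(s, pos + len(s))]
            return []
        return parse

    def alt(*parsers):                       # ordered choice with backtracking
        def parse(text, pos):
            results = []
            for p in parsers:
                results.extend(p(text, pos))
            return results
        return parse

    def seq(*parsers):                       # sequence, threading the position
        def parse(text, pos):
            results = [((), pos)]
            for p in parsers:
                results = [(vals + (v,), q)
                           for vals, i in results for v, q in p(text, i)]
            return results
        return parse

    # Toy grammar:  expr ::= term '+' expr | term ,  term ::= 'a' | 'b'
    term = alt(lit('a'), lit('b'))
    def expr(text, pos):
        return alt(seq(term, lit('+'), expr), term)(text, pos)

    print(expr('a+b+a', 0))    # the first result consumes the whole input

    LEPL layers operator overloading on top of functions like these, plus the memoisation mentioned above so that even left-recursive grammars terminate.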

    [Feb 21, 2008] Project details for Bare Bones interpreter

    freshmeat.net

    BareBones is an interpreter for the "Bare Bones" programming language defined in Chapter 11 of "Computer Science: An Overview", 9th Edition, by J. Glenn Brookshear.

    Release focus: Minor feature enhancements

    Changes:
    Identifiers were made case-insensitive. A summary of the language was added to the README file.

    Author:
    Eric Smith [contact developer]

    [Jan 22, 2008] freshmeat.net Project details for tinyap

    tinyap is a recursive descent parser with backup that outputs an abstract syntax tree (AST). Unlike in most parsers, the grammar is data. Tinyap uses an AST that represents a grammar to parse its input text. The factory default for the grammar is tinyap's grammar description language itself, so one can parse a grammar description and directly use the parse output to parse some other text written in the described language. Tinyap also features a plugin mechanism for grammars, which allows for dynamic modular grammars. Finally, it provides an interface to walk down the ASTs and to write external plugins to visit the nodes.

    Release focus: Initial freshmeat announcement
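
    The "grammar is data" idea is easy to prototype in a scripting language. In the rough Python sketch below (my own illustration, not tinyap itself), the grammar is an ordinary dictionary and a single generic recursive-descent matcher interprets it, so a different grammar can be loaded at run time without generating any parser code.

    # Rough sketch of "the grammar is data" (not tinyap): the grammar is a plain
    # dict of rules and one generic matcher interprets it by recursive descent.
    import re

    GRAMMAR = {
        # rule -> list of alternatives; each alternative is a list of items;
        # an item is another rule name or a regex terminal written as ('re', pattern)
        'expr': [['term', ('re', r'\+'), 'expr'], ['term']],
        'term': [[('re', r'[0-9]+')]],
    }

    def match(grammar, rule, text, pos=0):
        """Return (ast, new_pos) on success, or None."""
        for alternative in grammar[rule]:
            children, p = [], pos
            for item in alternative:
                if isinstance(item, tuple):              # terminal: regex
                    m = re.compile(item[1]).match(text, p)
                    if not m:
                        break
                    children.append(m.group(0)); p = m.end()
                else:                                    # nonterminal: recurse
                    sub = match(grammar, item, text, p)
                    if sub is None:
                        break
                    children.append(sub[0]); p = sub[1]
            else:
                return (rule, children), p               # whole alternative matched
        return None

    print(match(GRAMMAR, 'expr', '1+2+3'))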

    [Dec 11, 2007] Compiler

    A very extensive collection of links

    [Aug 9, 2007] freshmeat.net Project details for Parsing.py Parser Generator

    The Parsing module is a pure-Python module that implements an LR(1) parser generator, as well as CFSM and GLR parser drivers. From an algorithmic perspective, this is one of the most advanced parser generators in existence.

    Release focus: Initial freshmeat announcement

    Changes:
    Python 2.4 is now supported, in addition to Python 2.5.

    Author:
    Jason Evans [contact developer]

    Using Prolog to Compile Things - comp.compilers Google Groups

    Newsgroups: comp.compilers From: f...@cs.mu.OZ.AU (Fergus Henderson) Date: 1999/05/21 Subject: Re: Using Prolog to Compile Things

    Nick Roberts <nickrobe...@callnetuk.com> writes:
    >Has anyone on this ng experience or knowledge of the use of Prolog to
    >implement a native-code compiler for a typical high-level imperative
    >language? I am toying with the idea of using Prolog for the lexer, the
    >parser, the intermediate (library) code generator, and the end-code
    >generator (and even, in effect, for linking!), i.e. the 'whole shebang'.
    I have written a couple of compilers in Prolog. In many ways, Prolog
    is an excellent language for writing compilers, but it does have some
    important disadvantages.

    Much of the task of compilation is manipulating trees of various
    kinds, and this is a task which Prolog and other logic or functional
    languages do very well.

    Another advantage of Prolog is the use of unbound variables and
    partially instantiated data structures. Often during code generation
    you may want to make several passes over some data structure (e.g. the
    parse tree), filling in the values of different attributes with each
    pass. In Prolog you can leave uninitialized attributes as unbound
    variables, and have each pass bind the appropriate variables. This
    contrasts with some languages such as ML or Mercury where you would
    normally initialize the attributes with dummy values and then each
    pass would make a copy of the parse tree in order to set the new
    attributes.
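
    That point about filling in attributes pass by pass is worth restating in a more conventional language. In the rough Python sketch below (my own illustration, not from the post), attribute slots start out as None, playing the role of Prolog's unbound variables, and each pass fills them in place so the tree is never copied.

    # Attribute slots start as None and later passes fill them in place,
    # so no pass has to copy the parse tree.
    import itertools

    class Node:
        def __init__(self, op, *kids):
            self.op, self.kids = op, list(kids)
            self.type = None     # filled by pass 1
            self.reg = None      # filled by pass 2

    def pass1_types(node):
        if node.op == 'num':
            node.type = 'int'
        else:
            for k in node.kids:
                pass1_types(k)
            node.type = 'int' if all(k.type == 'int' for k in node.kids) else 'float'
        return node

    def pass2_registers(node, counter=None):
        counter = counter or itertools.count()
        for k in node.kids:
            pass2_registers(k, counter)
        node.reg = 'r%d' % next(counter)
        return node

    tree = Node('+', Node('num'), Node('num'))
    pass1_types(tree)
    pass2_registers(tree)
    print(tree.type, tree.reg)    # -> int r2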

    Prolog does have some significant disadvantages. One is that Prolog
    is often quite inefficient. For example, it's very difficult to get a
    lexer written in Prolog to run fast.

    Another disadvantage is that Prolog doesn't have records with named
    fields. If you want to access fields by name, then you need to write a
    bunch of named access predicates, which is quite tedious. (Existing
    Prolog implementations generally don't do inlining, so using named
    access predicates also exacerbates the efficiency problems.)

    Another disadvantage, probably more important than the previous two,
    is that Prolog has very little in the way of static checking. This
    works OK for small projects, but makes things very difficult when
    writing large systems. If you plan to write 5000 lines of code or
    more, then I would very strongly recommend using a language with
    static type checking and a proper module system (some Prolog systems
    do have module systems, but they are not standardized and vary
    significantly between different Prolog systems). And Prolog's support
    for unbound variables and nondeterminism, although useful, can also
    cause a lot of problems if you accidentally forget to initialize a
    variable or if you accidentally introduce nondeterminism. Other logic
    programming languages such as Mercury and Visual Prolog (which,
    despite the name, is quite different from Prolog) do a lot better
    here.

    If you do use Prolog, then I strongly recommend that you document the
    types, modes, and determinism of the predicates in your program very
    carefully. This is especially important if you ever want anyone other
    than yourself to maintain the program. But compilers are usually not
    small projects, so I think you would probably be better off choosing a
    language which has more support for static checking, such as Mercury.

    Of course, I'm one of the developers of Mercury, so my opinion in that
    respect is no doubt biased ;-). But Mercury was developed with my
    experience of implementing compilers in Prolog in mind, and so it was
    designed to address many of Prolog's faults in that area. The Mercury
    compiler is written in Mercury, so it has certainly been stress-tested
    in that area.

    > I'm particularly interested in the idea of using Prolog's natural
    > searching abilities to search for truly optimal code.
    I think this is a mirage. It's pretty easy to express searching
    algorithms in any language. But finding _feasible_ algorithms to
    produce "truly optimal code" is going to be difficult in any language.
    Prolog's searching abilities won't really help much here.
    --
    Fergus Henderson <f...@cs.mu.oz.au>
    WWW: <http://www.cs.mu.oz.au/~fjh>
    PGP: finger f...@128.250.37.3

    [It's my impression that in too many cases the only way to find perfectly
    optimal code would be to enumerate and check an impractically large set
    of possibilities. -John]

    Informatik II Languages, Compilers, and Theory Lecture 1: Introduction and Overview

    [PS] Program Analysis in Prolog John Hannan The Pennsylvania State ...

    Prolog3-1New

    Compiling Little Languages in Python - Aycock (ResearchIndex)

    Abstract: "Little languages" such as configuration files or HTML documents are commonplace in computing. This paper divides the work of implementing a little language into four parts, and presents a framework which can be used to easily conquer the implementation of each. The pieces of the framework have the unusual property that they may be extended through normal object-oriented means, allowing features to be added to a little language simply by subclassing parts of its compiler.

    Compiler Transformations for High-Performance Computing DAVID F. BACON SUSAN L. GRAHAM, AND OLIVER J. SHARP

    1.2 A Brief History of SETL Information about SETL compiler project

    This perspective, as hinted in the first quotation above, sprang from the strong perception in the late 1960s that there was a need for a set-oriented language capable of expressing concisely the kind of set-intensive algorithm that kept arising in studies of compiler optimization, such as those by Allen, Cocke, Kennedy, and Schwartz [5,43,6,41,7,10,130,11,8,9,131,178,179,180,12,132,42]. Programming Languages and their Compilers [44], published early in 1970, devoted more than 200 pages to optimization algorithms. It included many of the now familiar techniques such as redundant code elimination and strength reduction, dealt extensively with graphs of control flow and their partitioning into ``intervals'', and showed how to split nodes in an irreducible flow graph to obtain a reducible one. Many workers in the 1970s and 80s besides those just mentioned identified SETL, directly or indirectly, as a language whose implementation was greatly in need of solutions to difficult compiler optimization problems [80,84,85,143,82,86,123,83,148,149,174,208,3,209]. SETL, while still far from the celestial sphere of pure mathematics, was nonetheless seen as occupying a very high orbit relative to other languages. It was SETL's distance from pure machines that made optimizing its implementations so important and at the same time so difficult.

    The synergy between the study of code optimization and the high-level set language used for expressing optimization algorithms led to the SETL compiler project [186,177], which was itself an abundant source of optimization problems. The SETL project produced, among other things, the SETL optimizer [178,88,77], a 24,000-line prototype written in SETL. Unfortunately, on the machines of the day, it was too large to apply to itself. This was a pity because not only is SETL a language which could benefit greatly from a good optimizer, it is also one whose semantic simplicity makes it particularly amenable to the flow-tracing techniques of machine-independent code optimization. The absence of pointers alone circumvents the issue of aliasing, a huge advantage in this kind of analysis.

    The sort of data flow (definition-use) information obtainable from analysis of control flow graphs, and more generally from Schwartz's ``value flow'' tracing [177,178,144] that could follow objects when they were stored in aggregates and later extracted, was useful in all sorts of ways. It sustained copy optimization [175,178,179], where the redundant copying of an object could be suppressed when the only subsequent use of the object also modified it, perhaps incrementally. Value flow analysis provided a dependency framework wherein the types of many variables and expressions could be deduced by a transitive closure process starting from the manifest types of literals and other forms [196]. This typefinding process in turn enabled the discovery of relationships of set membership and inclusion [180], which was itself a prelude to automatic data structure choice, because the way an object is used has a profound influence on how it should be implemented. Weiss and Schonberg [208,209] later showed how to do type inference even in the presence of infinite sets of possible types arising from actions such as ``x := {x}''.
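
    A toy version of that typefinding idea fits in a dozen lines of Python (my own illustration, far cruder than the SETL analysis): seed the analysis with the manifest types of literals and propagate type sets along assignments until a fixed point is reached.

    # Toy typefinding by propagation: literal types are known, and assignments
    # propagate sets of possible types along the dependency graph to a fixed point.

    types = {'one': {'int'}, 'pi': {'float'}, 'name': {'string'}}   # manifest types

    # target <- union of the types of its sources
    assignments = [('x', ['one']), ('y', ['x', 'pi']), ('z', ['name']), ('x', ['y'])]

    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for target, sources in assignments:
            new = set().union(*(types.get(s, set()) for s in sources))
            if not new <= types.get(target, set()):
                types[target] = types.get(target, set()) | new
                changed = True

    print(types['x'])    # {'int', 'float'}: x may hold either after the last assignment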

    Data structure representations had their own sublanguage, the DSRL, which served to annotate, but not otherwise modify, SETL programs coded at an appropriately abstract level. The DSRL was designed to permit a smooth transition from Schwartz's two-level programming regime, in which programmers supplied representational details, to a more fully developed system in which a sophisticated optimizer made the selections [175,56,174,88,77]. An important concept in the DSRL was that of base sets, which were implicitly defined objects that could in principle allow much representational sharing among the objects conceived by the programmer.

    Value flow analysis, type inference, copy optimization, and deeper determinations of relationships such as set membership or inclusion between variables preparatory to automatic data structure selection all embody an approach to program analysis described by Sintzoff [189] and called abstract interpretation by Cousot and Cousot [47] or symbolic execution in Muchnick and Jones [149, p. xv]. The essence of this model is that any program P with well-defined semantics can be projected onto a more abstract program A capturing salient properties of objects in P in a manner susceptible of analysis. For example, the sign of a product can be deduced from the signs of its multiplicands without knowing their specific values. Similarly, result types for known operators can usually be gleaned at compile time from operand types regardless of the actual run-time values of those operands. In abstract interpretation, the abstract program A is exercised at P's compile time to discover desired properties of objects in P. The symbols in A combine and recombine according to an algebra appropriate for their purpose. If that algebra has been designed with feasible goals in mind, the exercise will converge. It is typical to ensure this termination by taking advantage of the fact that any set generated by inductive definitions (such as data flow equations) can be defined as the lattice-theoretic least fixed point of a monotone function. This often allows global properties to be inferred from local ones by a straightforward process of transitive closure.
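
    The sign-of-a-product example in that paragraph is the classic illustration. A minimal abstract interpreter over a sign domain might look like the following Python sketch (my own, not from the SETL project): expressions are evaluated on the abstract values '-', '0', '+' and '?' instead of concrete numbers.

    # Minimal abstract interpretation over the sign domain {'-', '0', '+', '?'}.

    def sign_of(n):
        return '+' if n > 0 else '-' if n < 0 else '0'

    def mul(a, b):                       # abstract multiplication is exact
        if a == '0' or b == '0':
            return '0'
        if '?' in (a, b):
            return '?'
        return '+' if a == b else '-'

    def add(a, b):                       # abstract addition loses precision
        if a == b:
            return a
        if '0' in (a, b):
            return a if b == '0' else b
        return '?'                       # '+' plus '-' could be anything

    def abstract_eval(expr, env):
        """expr is a nested tuple such as ('*', ('var', 'x'), ('num', -3))."""
        op = expr[0]
        if op == 'num':
            return sign_of(expr[1])
        if op == 'var':
            return env[expr[1]]
        left, right = abstract_eval(expr[1], env), abstract_eval(expr[2], env)
        return mul(left, right) if op == '*' else add(left, right)

    # the sign of x * (-3) is known to be '-' whenever x is known to be '+'
    print(abstract_eval(('*', ('var', 'x'), ('num', -3)), {'x': '+'}))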

    The power and generality of abstract interpretation moved Paige and his colleagues to undertake an ambitious study of program transformations, which ultimately led to the APTS project [33,161,126,162]. The first of the main three transformations used in APTS is dominated convergence [32] for computing fixed points of Tarski [195,48] sequences (f^i(1) : i = 0, 1, 2, ... for deflationary, monotone f) with reasonable efficiency. The second is finite differencing [157,164,158], which is a set-theoretic analogue of strength reduction that allows some expensive set operations within loops to be reduced to incremental updates by locating fixed points more quickly through the construction and maintenance of program invariants. The third transformation is real-time simulation [159,30,160,34,166] of an associative memory on an ordinary random-access memory (or with slight additional restrictions a mere pointer-access memory), which effectively automates the tedious programming activity of choosing efficient basings for sets.
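
    Finite differencing is easiest to see on the standard squares example (compare the "Example: Computing Squares" entry in the outline above). Below is a rough Python sketch of the before and after, my own illustration; the set-theoretic version replaces the arithmetic invariant with a set-valued one, but the mechanism is the same: maintain an invariant incrementally instead of recomputing the expression in every iteration.

    # Finite differencing on the classic example: the repeated computation i*i
    # inside a loop is replaced by incremental updates of the invariant sq == i*i.

    def squares_naive(n):
        return [i * i for i in range(n)]          # recomputes i*i each time

    def squares_differenced(n):
        out, sq, delta = [], 0, 1                 # invariant: sq == i*i, delta == 2*i + 1
        for i in range(n):
            out.append(sq)
            sq += delta                           # (i+1)^2 = i^2 + (2*i + 1)
            delta += 2
        return out

    assert squares_naive(10) == squares_differenced(10)
    print(squares_differenced(6))                 # [0, 1, 4, 9, 16, 25]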

    Chung Yung has recently used finite differencing in a technique he calls destructive effect analysis, which seeks to incrementalize the copying of aggregates, in his purely functional programming language, EAS, a packaging of the typed lambda-calculus as extended with homogeneous sets [210,211,212].

    Transformational programming can be regarded as a formalization of Dijkstra's stepwise refinement [57,58]. As Bloom and Paige [28] point out, the transformational methodology is able to do much more than merely optimize code, or translate a SETL-like language into a C-like one. By helping the algorithm designer reason about time and space complexity in syntactic terms rather than only by means of low-level counting arguments, this technology has actually played a significant role in the invention of several new algorithms with greatly reduced asymptotic complexity compared to previous solutions [165,163,32,31,28,34,38,93], while it has rendered the algorithms themselves more perspicuous both to their inventors and to their students.

    The next phase in the development of APTS will seek to improve both its reliability and its performance. Currently, all program transformations in APTS are proved correct (meaning-preserving) by hand, which is slow and error-prone. The hope is to integrate a meta-level proof verifier along the lines of Etna [36], an outgrowth of Cantone, Ferro, and Omodeo's work on fast decision procedures for fragments of finite set theory [37]. Alternatively, the model for an integrated verifier might be the SETL-like NAP [126] system, itself implemented in the SETL derivative Cantor [127,124]. Verification of assertions in, say, Hoare logic [113] would increase confidence in automatically applied transformations. Davis and Schwartz [50] showed how mechanical verification systems could extend themselves with new proof methods without violating soundness or changing the set of statements that could be proved.

    The main existing impediment to the speed of APTS is the fact that its database of program property relationships, which is dynamically deduced using a static database of inference rules, must be recomputed after each application of a program transformation. What is being sought is an incremental rule database system that can be used to regenerate the relationship records efficiently after each rewriting operation [161]. Ultimately, it should be possible to apply APTS to itself for further large gains in speed [162], and to take advantage of the technique of partial evaluation [122] to realize a production-grade transformational system.

    Recently, Goyal and Paige [94] have revisited the copy optimization problem for SETL and other high-level languages that exemplify Hoare's ideal of a pointer-free style of programming. By taking the well-known technique of dynamic reference counting to achieve ``lazy'' copying, and combining that with static liveness determination based on Schwartz's value flow analysis, they are able to optimize the placement of copy operations and of om assignments, the latter serving to decrement the reference counts of objects known to have no subsequent uses. They also prove the correctness of their alias propagation analysis and code transformations using formal semantics and abstract interpretation.

    Goyal [91] has obtained a dramatic improvement in the algorithmic complexity of computing intra-procedural may-alias relations, again using dominated convergence and finite differencing. In his dissertation [92], he develops set-theoretic languages which can express both abstract specifications and low-level implementations in a form which uses a data structure selection method based on a novel type system to preserve the computational transparency that is necessary in order for statements about program efficiency to be meaningful. This is a cornerstone of the general transformational methodology.

    Reliable software implementation using domain-specific languages by Diomidis Spinellis This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

    The safety and reliability of products, processes, equipment, and installations is often critically affected by the software that controls their design and operation [WCD+98]. Software engineering stands out among other engineering disciplines because of its tight, haphazard, and fluid coupling with other elements of its operating environment. Mechanical, civil, or electrical engineers can base their designs on standardised and well understood specifications and constraints, and can design their implementations based on materials and components of known modalities and properties. In contrast to that, software engineers have to rely on specifications often expressed in the notation most suited to the application domain, design an artefact whose application changes significantly the environment it was designed for [Leh91], and rely on implementors whose output quality can vary up to a factor of 10 [SEG68]. These problems are of particular importance for safety critical applications where human life can be put at risk from malfeasant software designs and implementations.

    Our approach for overcoming those problems is a software process architecture based on domain specific languages (DSLs). These allow the specification of critical parts of a software subsystem in the most appropriate formalism for the application domain, thus bridging the gap between the domain expert specifications and the software implementation, providing maintainability, and -- once a particular DSL has been correctly implemented -- ensuring consistent quality throughout the system's lifecycle. In the following paragraphs we illustrate this concept by detailing the usage of DSLs in the design and implementation of a civil engineering CAD application. The application is typical for the class described so far: it embodies a large amount of domain knowledge and its failures can lead to reliability and safety problems.

    Dave Bodenstab's Home Page

  • BYACC which produces Perl/Tcl parsers

    I wanted a set of Tcl/Perl yacc tools based on the latest version of byacc that I knew of. I also wanted a minimal diff against that version of byacc. To that end, I've taken the latest versions of tyacc/pyacc that I found:
         tyacc-0.9.tar.gz
         perl-byacc1.8.2.tar.gz
         perl5-byacc-patches-0.5.tar
    and applied the diff's to FreeBSD's 4.8-RELEASE yacc (which is really byacc.) Each of tyacc/pyacc/p5yacc is a separate tool -- removing the command line option specifying what language (Tcl/Perl) to generate simplified the diff's enormously. Basing the changes on FreeBSD's version of yacc did remove some changes that might be important to some people. The following features were removed: The Perl versions contain no new functionality or bug fixes. However, the Tcl version has been improved:

    My versions of the Perl/Tcl yacc's can be downloaded here.

    [May 25, 2005] Prolog Parsers

    A study of the efficiency of various parsing techniques in Prolog, e.g. top-down, bottom-up, oracles, prediction, left-corner, chart. Also a comparison with Lisp. Examples in Prolog and Lisp.

    [Mar 14, 2005] Key open-source programming tool due for overhaul Tech News on ZDNet

    Almost all open-source software is built with GCC, a compiler that converts a program's source code--the commands written by humans in high-level languages such as C--into the binary instructions a computer understands. The forthcoming GCC 4.0 includes a new foundation that will allow that translation to become more sophisticated, said Mark Mitchell, the GCC 4 release manager and "chief sourcerer" of a small company called CodeSourcery.

    "The primary purpose of 4.0 was to build an optimization infrastructure that would allow the compiler to generate much better code," Mitchell said.

    Compilers are rarely noticed outside the software development community, but GCC carries broad significance. For one thing, an improved GCC could boost performance for the open-source software realm--everything from Linux and Firefox to OpenOffice.org and Apache that collectively compete with proprietary competitors from Microsoft, IBM and others.

    For another, GCC is a foundation for an entire philosophy of cooperative software development. It's not too much of a stretch to say GCC is as central an enabler to the free and open-source programming movements as a free press is to democracy.

    GCC, which stands for GNU Compiler Collection, was one of the original projects in the Gnu's Not Unix effort. Richard Stallman launched GNU and the accompanying Free Software Foundation in the 1980s to create a clone of Unix that's free from proprietary licensing constraints.

    The first GCC version was released in 1987, and GCC 3.0 was released in 2001. A company called Cygnus Solutions, an open-source business pioneer acquired in 1999 by Linux seller Red Hat, funded much of the compiler's development.

    But improving GCC isn't a simple matter, said Evans Data analyst Nicholas Petreley. There have been performance improvements that came from moving from GCC 3.3 to 3.4, but at the expense of backwards-compatibility: Some software that compiled fine with 3.3 broke with 3.4, Petreley said.

    RedMonk analyst Stephen O'Grady added that updating GCC shouldn't compromise its ability to produce software that works on numerous processor types.

    "If they can achieve the very difficult goal of not damaging that cross-platform compatibility and backwards-compatibility, and they can bake in some optimizations that really do speed up performance, the implications will be profound," O'Grady said.

    What's coming in 4.0
    GCC 4.0 will bring a foundation to which optimizations can be added. Those optimizations can take several forms, but in general, they'll provide ways that the compiler can look at an entire program.

    For example, the current version of GCC can optimize small, local parts of a program. But one new optimization, called scalar replacement of aggregates, lets GCC find data structures that span a larger amount of source code. GCC then can break those objects apart so that object components can be stored directly in fast on-chip memory rather than in sluggish main memory.

    "Optimization infrastructure is being built to give the compiler the ability to see the big picture," Mitchell said. The framework is called Tree SSA (static single assignment).

    However, Mitchell said the optimization framework is only the first step. Next will come writing optimizations that plug into it. "There is not as much use of that infrastructure as there will be over time," Mitchell said.

    One optimization that likely will be introduced in GCC 4.1 is called autovectorization, said Richard Henderson, a Red Hat employee and GCC core programmer. That feature economizes processor operations

    [Feb 2, 2005] Compilers - Contents P.D. Terry, Rhodes University, 1996

    [Jan 20, 2005] TACC Single Processor Optimization on the IBM Power4 System

    Code optimization is an iterative process requiring an investment of time, energy, and thought on the part of the programmer. Performance tuning is recommended in the following situations:

    1. production codes that are widely distributed and often used in the research community;
    2. projects with limited allocation (to maximize available compute hours).

    It is hardly worthwhile to optimize programs that will only be used once or twice, but it is important to optimize often-used application code. In addition, the lack of availability of compute hours is an incentive to decrease the consumption rate of the allocation, and hence will call for increasing the efficiency of one's production code.

    2004

    [Dec 10, 2004] Comp.compilers SUMMARY Static analyzers for detecting programming errors

    static analyzers for detecting program errors [email protected] (1991-11-21) SUMMARY: Static analyzers for detecting programming errors [email protected] (1991-12-05)



    Newsgroups: comp.compilers From: [email protected] (David Spuler) Keywords: errors, analysis, summary, bibliography Organization: Compilers Central References: 91-11-087 Date: Thu, 5 Dec 91 20:04:26 +1100

    I received quite a large number of replies to my request for information
    about compiler-like static analysis tools. This article is a summary of:

    1) my work/ideas and
    2) responses I received.

    Hope it's useful.

    David Spuler

    MY OWN WORK
    ===========
    Here is a summary of my own bug lists and references on the use of static
    analyzers for detecting program errors. Myself, I implemented a full C
    checker (better than Lint) as my 4th year Honours project - hence there is an
    Honours thesis. Naturally, the bug lists and references have a C bias.

    C Bugs
    ======
    Expression Errors
    -----------------
    assignment instead of equality (if(x=y) versus if(x==y) )
    confusing &, | and &&,|| operators
    null effect statements (x<<1; instead of x<<=1)
    order of evaluation errors (a[i++]=i)
    side effects to 2nd/3rd operand of ||, && and ?:

    Lexical Errors
    ==============
    Nested comments
    Multibyte character constants (novices use 'abc' instead of "abc")

    Flow of Control Errors
    ======================
    unreachable statements
    use of uninitialized local variables
    dead assignments
    function return anomalies - return statements with and without expression
    function has no return statement on some paths
    constants in conditional tests (if(1>0)...)
    accidental empty loops (bad semicolon) for(...); { }
    missing break in switch statement

    Library Usage Errors
    ====================
    bad formats (and argument type) to printf/scanf/etc

    Preprocessor Errors
    ===================
    missing braces #define swap(i,j) temp =i; i=j; j=temp;
    precedence - need brackets around whole expression
    precedence - need brackets around macro formal parameters
    side effects in macro arguments

    Declarations
    =============
    unused local variables
    inner scope redeclarations hiding outer name
    global problems - inconsistent use of fns etc - Lint does all this stuff
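
    A couple of the expression errors in the list above can be caught even by a crude line-oriented scan. The Python toy below is only a sketch of the idea, nothing like a real checker such as Lint or the author's Check (both cited in the references below), which work on a parsed program rather than on raw text.

    # Toy line-oriented checker for two of the bug classes listed above:
    # assignment instead of equality, and the accidental-empty-loop semicolon.
    import re

    CHECKS = [
        (re.compile(r'\bif\s*\([^=!<>]*[^=!<>]=[^=][^)]*\)'),
         'possible assignment instead of equality in if-condition'),
        (re.compile(r'\b(for|while)\s*\(.*\)\s*;'),
         'possible accidental empty loop body (semicolon after loop header)'),
    ]

    def check(lines):
        for lineno, line in enumerate(lines, 1):
            for pattern, message in CHECKS:
                if pattern.search(line):
                    print('%d: %s: %s' % (lineno, message, line.strip()))

    check(['if (x = y) z = 1;',
           'for (i = 0; i < n; i++);',
           'if (x == y) z = 1;'])       # only the first two lines are flagged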

    REFERENCES (papers from my Thesis bibliography)
    (some aren't highly relevant to static checking, but the titles tell all)
    =============================================================================
    Spuler, D.A. "Check: A Better Checker for C", Honours Thesis, Dept of Computer
    Science, James Cook University of North Queensland, 4811. AUSTRALIA

    W. R. Adrion, M. A. Branstad and J. C. Cherniavsky (1990),
    "Validation, Verification, and Testing of Computer Software",
    ACM Computing Surveys, Vol. 14, No. 2, pp. 159-192.

    R. E. Berry and B. A. E. Meekings (1985),
    "A Style Analysis of C Programs",
    Communications of the ACM, Vol. 28, No. 1, pp. 80-88.

    R. E. Fairley (1978),
    "Static Analysis and Dynamic Testing of Computer Software",
    IEEE Computer, Vol. 11, No. 4, pp. 14-23.

    L. D. Fosdick and L. J. Osterweil (1976),
    "Data Flow Analysis In Software Reliability",
    ACM Computing Surveys, Vol. 8, No. 3, pp. 305-330.

    M. T. Harandi and J. Q. Ning (1990), "Knowledge-Based Program Analysis",
    IEEE Software, January 1990, pp. 74-81.

    W. Harrison (1977),
    "Compiler Analysis of Value Ranges for Variables",
    IEEE Transactions on Software Engineering, Vol. 3, No. 3, pp. 243-250.

    W. S. Hecht and J. D. Ullman (1975),
    "A Simple Algorithm for Global Data Flow Analysis Problems",
    SIAM Journal of Computing, Vol. 4, pp. 528-532.

    T. R. Hopkins (1980), "PBASIC - A Verifier for BASIC",
    Software Practice and Experience, Vol. 10, No. 3, pp. 175-181.

    W. E. Howden (1982), "Validation of Scientific Programs",
    ACM Computing Surveys, Vol. 14, No. 2, pp. 193-227.

    W. E. Howden (1990), "Comments Analysis and Programming Errors",
    IEEE Transactions on Software Engineering, Vol. 16, No. 1, pp. 72-81.

    S. C. Johnson (1978), "Lint, a C Program Checker", Bell Laboratories,
    Murray Hill, NJ, Computer Science Technical Report, July 1978 (also
    appearing as part of the documentation for many versions/variants of the
    UNIX operating system, as in: Ultrix Programmers Manual - Supplementary
    Documents, 1983).

    W. L. Johnson (1986), Intention-Based Diagnosis of Novice Programming
    Errors, Pitman.

    J. Katzenelson and A. Strominger (1987), "Debugging Programs that use
    Macro-Oriented Data Abstractions", Software Practice and Experience, Vol.
    17, No. 2, pp. 79-103.

    J. C. King (1976), "Symbolic Execution and Program Testing",
    Communications of the ACM, Vol. 19, No. 7, pp. 385-394.

    D. E. Knuth (1971), "An Empirical Study of FORTRAN Programs",
    Software Practice and Experience, Vol. 1, pp 105-133.

    S. Lauesen (1979), "Debugging Techniques", Software Practice and Experience,
    Vol. 9, pp. 51-63.

    T. E. Lindquist and J. R. Jenkins (1988), "Test-Case Generation with IOGen",
    IEEE Software, January 1988, pp. 72-79.

    P. G. Moulton and M. E. Muller (1967), "DITRAN - A Compiler Emphasizing
    Diagnostics", Communications of the ACM, Vol. 10, No. 1, pp. 45-52.

    S. S. Muchnick and N. D. Jones (1981),
    Program Flow Analysis: Theory and Applications, Prentice-Hall.

    L. J. Osterweil and L. D. Fosdick (1976), "DAVE - A Validation, Error
    Detection and Documentation System for Fortran Programs", Software
    Practice and Experience, Vol. 6, No. 4, pp. 473-486.

    K. A. Redish and W. F. Smyth (1986), "Program Style Analysis: A Natural
    By-Product of Program Compilation", Communications of the ACM, Vol. 29,
    No. 2, pp. 126-133.

    B. G. Ryder (1974), "The PFORT Verifier",
    Software Practice and Experience, Vol. 4, No. 4, pp. 359-377.

    R. E. Seviora (1987), "Knowledge-Based Program Debugging Systems", IEEE
    Software, Vol. 4, No. 3, pp. 20-32.

    N. Suzuki and K. Ishihata (1977), "Implementation of an Array Bound
    Checker", Fourth ACM Symposium on Principles of Programming Languages, pp.
    132-143.

    C. Wilson and L. J. Osterweil (1985),
    "Omega - A Data Flow Analysis Tool for the C Programming Language",
    IEEE Transactions on Software Engineering, Vol. 11, No. 9, pp. 832-838.

    ============================
    RESPONSES FROM OTHERS
    ============================

    From @yonge.csri.toronto.edu:hugh@redvax Thu Dec 5 15:06:51 1991

    I too am interested in static analysis tools. I have produced a C
    compiler (intended to be commercial) that does extensive lint-like
    checking.

    Over a decade ago Fosdick and Osterweil (sp?) at Colorado produced a
    program to detect "data-flow anomalies" in FORTRAN programs. I thought
    that work was very interesting. At the time, I prototyped a similar
    system (theirs was so slow that they recommended it be used only once in
    the life of a program; mine was fast enough that it would not have
    noticeably slowed a compiler that had the checks added).

    Do you know of the PFORT Verifier? I think that it is available for free
    from Bell Labs. I think that it is sort of a Lint for FORTRAN,
    emphasizing checking for FORTRAN 77 standard conformance.

    Again, a decade ago, there was some work at eliminating runtime range
    checks from Pascal programs. Clearly, this could be turned around into
    compile-time warnings, perhaps without being annoying. I think Welsh of
    Queens University of Belfast wrote one of the better papers (dim
    recollection).

    Anyway, I would be interested in your bug lists and references.

    Hugh Redelmeier
    {utcsri, yunexus, uunet!attcan, utzoo, scocan}!redvax!hugh
    When all else fails: [email protected]
    +1 416 482-8253

    ===================================================

    From [email protected] Tue Nov 26 03:23:07 1991

    David,

    Raymie Stata, a fellow graduate in my research group passed me your
    request from the net.

    I am completing a thesis on a bug detection scheme I have invented called
    Aspect. I am attaching some blurb from a paper I have just written. I'm
    including the bibliography too, which may give you some useful references.

    If you're interested, I'd be happy to tell you more. There is one paper
    published that describes the state of Aspect about 1 year ago; it's
    `Aspect: an Economical Bug Detector', International Conf. On Software
    Engineering, May 1991.

    I'd also be very interested in any references or ideas that you have.

    Regards,

    --Daniel Jackson

    About the Aspect Bug Detector:

    Aspect is an annotation language for detecting bugs in imperative
    programs. The programmer annotates a procedure with simple assertions
    that relate abstract components (called `aspects') of the pre- and
    post-states. A checker has been implemented that can determine
    efficiently whether the code satisfies an assertion. If it does not,
    there is a bug in the code (or the assertion is wrong) and an error
    message is displayed. Although not all bugs can be detected, no
    spurious bugs are reported....

    The purpose of a compiler is not just to make it easier to write good
    programs but also to make it harder to write bad ones. Catching errors
    during compilation saves testing. It also spares the greater cost of
    discovering the error later when it is harder to fix.

    Programming errors can be divided into two classes. {\em Anomalies} are
    flaws that are apparent even to someone who has no idea what the
    program is supposed to do: uninitialized variables, dead code,
    infinite loops, etc. {\em Bugs}, on the other hand, are faults only with
    respect to some intent. An anomaly detector can at best determine
    that a program does something right; a bug detector is needed to tell
    whether it does the right thing.

    Aspect detects bugs with a novel kind of dataflow annotation.
    Annotating the code is extra work for the programmer, but
    it is mitigated by two factors. First, some sort of redundancy is
    inevitable if bugs, rather than just anomalies, are to be caught.
    Moreover, Aspect assertions may be useful documentation: they are
    generally much shorter and more abstract than the code they accompany.
    Second, no minimal annotation is demanded; the programmer can choose
    to annotate more or less according to the complexity of the code or
    the importance of checking it.

    A procedure's annotation relates abstract components of objects called
    `aspects'. The division of an object into aspects is a kind of data
    abstraction; the aspects are not fixed by the object's representation
    but are chosen by the programmer.

    Each assertion of the annotation states that an aspect of the
    post-state is obtained from some aspects of the pre-state. The
    checker examines the code to see if such dependencies are plausible.
    If there is no path in the code that could give the required
    dependencies, there must be an error: the result aspect was computed
    without adequate information. An error message is generated saying
    which abstract dependency is missing.
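
    The dependency idea behind such a check is easy to prototype crudely. The Python sketch below is my own simplification (it is not the Aspect annotation language or checker): for straight-line code it records which inputs each variable is derived from and reports any assertion whose declared sources could not all have reached the result.

    # Crude sketch of a dependency check in the spirit described above (not Aspect):
    # compute which inputs each variable is transitively derived from, then verify
    # that every declared dependency is actually realised by the code.

    def derived_from(code, inputs):
        """code is a list of (target, [sources]); returns var -> set of inputs."""
        deps = {v: {v} for v in inputs}
        for target, sources in code:                  # straight-line, in order
            deps[target] = set().union(*(deps.get(s, set()) for s in sources))
        return deps

    # assertion: the new balance is supposed to be obtained from balance and amount
    assertions = {'balance': {'balance', 'amount'}}

    # buggy withdraw: the programmer forgot to use the amount
    code = [('fee', ['amount']),
            ('balance', ['balance'])]                 # should have been ['balance', 'amount']

    deps = derived_from(code, inputs={'balance', 'amount'})
    for var, required in assertions.items():
        missing = required - deps.get(var, set())
        if missing:
            print('possible bug: %s computed without using %s'
                  % (var, ', '.join(sorted(missing))))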

    ...

    Many compilers, of course, perform some kind of anomaly analysis, and
    a variety of clever techniques have been invented (see, e.g.
    \cite{carre}). Anomaly detection has the great advantage that it
    comes free to the programmer. Aspect might enhance existing methods
    with a more precise analysis that would catch more anomalies (using
    annotations of the built-in procedures alone). But there will always
    be a fundamental limitation: most errors are bugs and not anomalies.

    The Cesar/Cecil system \cite{cesar}, like Aspect, uses annotations to
    detect bugs. Its assertions are path expressions that constrain the
    order of operations. Errors like failing to open a file before
    reading it can be detected in this way. Flavor analysis \cite{flavor}
    is a related technique whose assertions make claims about how an
    object is used: that an integer is a sum in one place in the code and
    a mean in another, for instance. Both techniques, however, report
    spurious bugs in some cases: an error may be signalled for a path that
    cannot occur. Aspect, on the other hand, is sound: if an error is
    reported, there is a bug (or the assertion is wrong).

    Type checking may also be viewed as bug detection when there is name
    equality or data abstraction. Aspect is more powerful for two
    reasons. First, since procedures with the same type signature usually
    have different annotations, Aspect can often tell that the wrong
    procedure has been called even when there is no type mismatch.
    Second, type systems are usually immune to changes of state and so, in
    particular, cannot catch errors of omission. Even models that
    classify side-effects, such as FX \cite{FX}, do not constrain the
    order of operations like Aspect.

    The version of Aspect described here advances previous work
    \cite{icse} by incorporating alias analysis. It can handle multi-level
    pointers, whose precise analysis is known to be intractable
    \cite{Landi}. The alias scheme adopted is most similar to
    \cite{larus}, but it is less precise and cannot handle cyclic and
    recursive structures.

    ...

    \bibitem[BC85]{carre}
    Jean-Francois Bergeretti \& Bernard A. Carre,
    `Information-Flow and Data-Flow Analysis of while-Programs',
    ACM Trans. Programming Languages and Systems, Vol. 7, No. 1, January 1985.

    \bibitem[CWZ90]{chase}
    David R. Chase, Mark Wegman \& F. Kenneth Zadeck,
    `Analysis of Pointers and Structures',
    Proc. SIGPLAN '88 Conf. Prog. Lang. Design \& Implementation, 1990.

    \bibitem[GL88]{FX}
    David K. Gifford \& John M. Lucassen,
    `Polymorphic Effect Systems',
    ACM Conf. Principles of Programming Languages, 1988.

    \bibitem[How89]{flavor}
    William E. Howden,
    `Validating Programs Without Specifications',
    Proc. 3d ACM Symposium on Software Testing, Analysis and Verification (TAV3),
    Key West, Florida, Dec. 1989.

    \bibitem[Jac91]{icse}
    Daniel Jackson,
    `Aspect: An Economical Bug-Detector',
    Proc 13th Int. Conf. on Software Engineering,
    Austin, Texas, May 1991.

    \bibitem[Lan91]{Landi}
    William Landi and Barbara G. Ryder,
    `Pointer-induced Aliasing: A Problem Classification',
    ACM Conf. Principles of Programming Languages, 1991.

    \bibitem[LH88]{larus}
    James R. Larus and Paul N. Hilfinger,
    `Detecting Conflicts Between Structure Accesses'
    ACM Conf. Programming Language Design and Implementation,
    June 1988.

    \bibitem[OO89]{cesar}
    Kurt M. Olender \& Leon J. Osterweil,
    `Cesar: A Static Sequencing Constraint Analyzer',
    Proc. 3d ACM Symposium on Software Testing, Analysis and Verification (TAV3),
    Key West, Florida, Dec. 1989.

    ===================================================

    From [email protected] Tue Nov 26 03:17:02 1991

    I am a member of a quality control team in Citicorp Overseas Software Ltd.

    I do a lot of desk work, to test code validity.

    Till now I have used manual methods for static analysis of code, using
    tables of states of variables, and basically sweating it out. I wish to
    know if you have more information on known bugs or pitfalls in various
    constructs of a language.

    I will dig out some information on static analysers, and mail them to you

    Pankaj Fichadia.

    ===================================================
    From @bay.csri.toronto.edu:[email protected] Tue Nov 26 02:04:23 1991

    I'd be very interested in your list of references. Unfortunately I can't
    find my own list of references to give you in return. Perhaps I didn't
    type it in yet. I _can_ send you an unpublished paper (complete except
    for bibliography) on detecting dataflow anomalies in procedural languages.
    There is also an experimental program that implements the ideas of the
    paper. The paper can be sent in postscript or dvi and the program is in
    Turing -- a Pascal type language.

    Theo Norvell [email protected]
    U of Toronto [email protected]

    ===================================================
    From [email protected] Sun Nov 24 21:40:04 1991

    Greetings!

    I have read your note about error detection in compilers. I have a great
    interest in this particular field as my final project has to do with the
    implementation of exceptional handling in "Small C", a compiler designed on
    the IBM 8088 and its sole interest is educational, something equivalent
    to minix. I would greatly appreciate it if you could help me in finding
    sources that dwell on this subject; anything that would be related to
    errors and how one might deal with them would be relevant.
    Many Thanks In Advance
    --Amiran
    [email protected]
    ===================================================

    From [email protected] Sun Nov 24 04:42:03 1991

    The Convex Application Compiler (TM?) apparently does a pretty good job.
    Any interprocedural analyzer has to catch a lot of errors just to avoid
    crashing. They also do pointer tracking, array section analysis, and
    generally just all-out analysis and optimization. Bob Metzger
    <[email protected]> et al. have a paper on the pointer tracking in the
    proceedings of Supercomputing '91, which was held last week.

    It is alleged that Convex has sold machines, or at least copies of their
    compiler, just for use in static error checking.

    Paul Havlak
    Graduate Student

    ===================================================
    From [email protected] Sat Nov 23 11:54:49 1991

    You might be interested in looking at abstract interpretation. It is a
    technique related to data-flow analysis (you can express data-flow analyses
    as abstract interpretation problems) and is quite popular among the
    functional programming crowd.

    There are a number of papers (even books!) on the subject; if you are
    interested I can provide references.

    Regards, Jan Vitek/ Graduate Student/ University of Victoria/ BC/ Canada
    ===================================================

    From [email protected] Fri Nov 22 07:17:14 1991

    Here's a few references that I've written:

    Paul Eggert,
    Toward special-purpose program verification,
    Proc. ACM SIGSOFT International Workshop on Formal Methods
    in Software Development,
    Napa, CA (9-11 May 1990);
    Software Engineering Notes, Vol. 15, No. 4 (September 1990), pp. 25-29.

    Paul Eggert,
    An Example of Detecting Software Errors Before Execution,
    .i "Proc. workshop on effectiveness of testing and proving methods,"
    Avalon, CA
    (11-13 May 1982), 1-8.

    Paul Eggert,
    Detecting Software Errors Before Execution,
    UCLA Computer Science Department report CSD-810402 (April 1981).

    The last reference contains an extensive survey of the pre-1980 literature.
    I'd be interested in seeing your list of references as well,
    when you have the time.

    ===================================================
    From [email protected] Fri Nov 22 05:31:04 1991

    I'm interested in hearing your list of bugs. I have given some thought to
    the detection and propagation of error information in a program using
    dependence information. Propagation of errors uses the idea that any
    computation on a path which leads only to an error condition is dead
    unless a side-effect intervenes, and any code after the error is
    unreachable. Thus one can actually integrate program errors into control
    flow information as an unstructured jump to the end of the program. In an
    optimizing compiler, one might be tempted to say that any statement which
    can be scheduled after the last side effect before the error is dead.
    Thus, in this fragment:

    1: x := 2
    2: print "hi there"
    3: y := z/0

    statement 1 is actually dead, but statement 2 is not.

    Micah Beck
    --

    [Nov 14, 2004] Two Benchmark Tests for BCPL Style Coroutines

    This directory contains various implementations of two benchmark programs to test the efficiency of BCPL style coroutines when implemented in a variety of different languages.

    Download cobench.tgz (or cobench.zip) and un-tar (or un-zip) it to form the directory Cobench. Then enter one of the directories: bcpl, natbcpl, mcpl, java, c and New Jersey SML and type the command:
    make. This should compile and run the benchmark, provided you are on a Pentium machine running Linux. You may find you have to re-install the BCPL Cintcode system and you may have to edit one or more of the Makefiles.

    There are directories for each language and implementation style. They are currently as follows:

    bcpl The definitive BCPL implementation run using Cintcode
    natbcpl BCPL compiled (very naively) in machine code
    c Implementation in C using a coroutine library involving
    setjmp, longjmp and two instructions of inline assembler.
    nat-c Implementation in C using a hand written coroutine library
    in assembler (for the Pentium).
    java Implementation in Java using a coroutine library based on
    threads using wait, notify and synchronisation.
    javadep An incorrect implementation in Java attempting to use the
    deprecated methods suspend and resume.
    mcpl An interpretive implementation in MCPL
    smlnj An implementation in Standard ML using New Jersey Continuations.
    haskell An untested approximation to the benchmark in Haskell.

    All these implementations can be run under Linux (provided the relevant compilers have been installed). They are typically run using make as follows:

    make Same as make help
    make help Display the help information
    make bench Run cobench
    make bencht Run cobench with tracing
    make sim Run cosim
    make simt Run cosim with tracing
    make test Run cotest
    make clean Delete any files that can be reconstructed

    The program cotest tests whether the BCPL style coroutines have been implemented correctly.

    The benchmark: cobench

    This benchmark program creates a source coroutine, 500 copy coroutines and a sink coroutine. It then causes the source coroutine to transmit 10,000 numbers through all the copy coroutines to be thrown away by the sink coroutine. The numbers are sent one at a time, and all transmission from one coroutine to the next uses a coroutine implementation of Occam channels.

    Reading and writing are done (in the BCPL version) using coread and cowrite, respectively. These are defined as follows:

    AND coread(ptr) = VALOF
    { LET cptr = !ptr
    TEST cptr
    THEN { !ptr := 0 // Clear the channel word
    RESULTIS resumeco(cptr, currco)
    }
    ELSE { !ptr := currco // Set channel word to this coroutine
    RESULTIS cowait() // Wait for value from cowrite
    }
    }

    AND cowrite(ptr, val) BE
    { LET cptr = !ptr
    TEST cptr
    THEN { !ptr := 0
    callco(cptr, val) // Send val to coread
    }
    ELSE { !ptr := currco
    callco(cowait(), val)
    }
    }

    In both functions, ptr points to a channel word which holds zero (=FALSE) if no coroutine is waiting on the channel. The first coroutine to call coread (or cowrite) puts its own identity into the channel word and then waits. When the second coroutine reaches the corresponding cowrite (or coread) call, it can see the identity of the other coroutine that is waiting.

    If the writing coroutine arrives at the communication point second, it picks the identity of the reading coroutine out of the channel word, clears the channel word, and then transmits the value to the waiting
    read coroutine using a simple call of callco.

    If the reading coroutine arrives at the communication point second, it gives its own identity to the waiting write coroutine then waits for the value to be sent. A little thought will show you why coread uses
    resumeco to transmit its own identity to the write coroutine.

    If you are not familiar with BCPL coroutines, you can read about them in the BCPL manual that is available via my home page.
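
    The overall shape of cobench is easy to mimic with Python generators; the sketch below (my own, for comparison with the languages timed below) reproduces only the pipeline structure, with 500 copy stages and 10,000 items as in the benchmark. Python generators are asymmetric coroutines, so this does not model the callco/resumeco channel rendezvous of the BCPL version.

    # Rough Python sketch of the cobench pipeline: a source, 500 copy stages, a sink.

    def source(n):
        for i in range(n):
            yield i

    def copy(upstream):
        for value in upstream:
            yield value

    def sink(upstream):
        count = 0
        for _ in upstream:
            count += 1
        return count

    def cobench(copies=500, items=10000):
        stage = source(items)
        for _ in range(copies):
            stage = copy(stage)          # chain the copy coroutines
        return sink(stage)               # pull everything through the pipeline

    print(cobench())                     # -> 10000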

    This directory currently holds four implementations of the benchmark: bcpl, natbcpl, java and c. When run on a 1 GHz Pentium III they take the following times:

    BCPL Cintcode byte stream interpreter: 4.78 secs

    BCPL Native 386 code (unoptimised): 0.81 secs

    BCPL under Cintpos using an interpreter implemented
    in C and all debugging aids on turned on: 10 secs

    MCPL Mintcode byte stream interpreter: 10.51 secs

    Java j2sdk1.4.0 JIT (using threads): 191.59 secs

    Java j2sdk1.4.0 JIT (using threads, suspend and resume): 2029.44 secs

    C using setjmp and longjmp + inline assembler: 1.36 secs

    C using colib written in assembler 0.61 secs

    New Jersey SML using continuations: 4.37 secs

    BCPL Cintcode on a 75Mhz SH3 (HP 620LX Handheld) 115.52 secs

    The Java implementation was done by Eben Upton and the first of the C versions using setjmp, longjmp and inline assembler by Jeremy Singer.


    The benchmark: cosim

    This benchmark is a discrete event simulator that simulates messages passing through a network of nodes. The nodes are numbered 1 to 500 and are each initially given a message to process. The time to process a message is randomly distributed between 0 and 999 units of simulated time. Once a message has been processed it is sent to another node selected at random. The transmission delay is assumed to be ABS(i-j) where i and j are the source and destination node numbers. Nodes can only process one message at a time and so messages may have to be queued. Initially each node is assumed to have just received a message. The simulation starts at time 0 and finishes at time
    1,000,000. The result, 501907, is the total count of messages that have been completely processed by the finishing time. This benchmark executes 435,363,350 Cintcode instructions and performs 2,510,520 coroutine changes. The timing results are as follows:

    BCPL Cintcode byte stream interpreter: 9.02 secs

    BCPL Native 386 code (unoptimised): 1.11 secs

    BCPL under Cintpos using an interpreter implemented
    in C and all debugging aids on turned on: 17 secs

    MCPL Mintcode byte stream interpreter: ??.?? secs

    Java j2sdk1.4.0 JIT (using threads, wait and notify): 43.13 secs

    Java j2sdk1.4.0 JIT (using threads, suspend and resume): 512.03 secs

    C using setjmp and longjmp + inline assembler: 0.80 secs

    C using colib written in assembler: 0.58 secs

    New Jersey SML using continuations: 4.08 secs

    BCPL Cintcode on a 75MHz SH3 (HP 620LX Handheld): 121.22 secs

    Martin Richards
    20 July 2004
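
    The cosim description above is enough to rebuild the simulator's bookkeeping in plain sequential C. The sketch below is mine, not Richards' code: it drops the coroutine-per-node structure and uses the C library rand(), so it will not reproduce the 501907 figure, but it shows the event-list skeleton (pending events ordered by time, per-node busy time, ABS(i-j) transmission delay).

      /* Sketch only: simplified event-list version of the cosim structure. */
      #include <stdio.h>
      #include <stdlib.h>

      #define NODES   500
      #define ENDTIME 1000000L

      typedef struct event {                  /* "message arrives at node"     */
          long time;
          int  node;
          struct event *next;
      } event;

      static event *queue = NULL;             /* pending events, time-ordered  */
      static long  busy_until[NODES + 1];     /* when each node becomes free   */
      static long  done = 0;                  /* messages completely processed */

      static void schedule(long t, int node)  /* O(n) sorted insert: fine here */
      {
          event *e = malloc(sizeof *e);
          event **p = &queue;
          e->time = t;
          e->node = node;
          while (*p && (*p)->time <= t)
              p = &(*p)->next;
          e->next = *p;
          *p = e;
      }

      int main(void)
      {
          for (int i = 1; i <= NODES; i++)    /* every node starts with one    */
              schedule(0, i);                 /* message at time 0             */

          while (queue && queue->time <= ENDTIME) {
              event *e = queue;
              queue = e->next;
              int  i = e->node;
              long start  = e->time > busy_until[i] ? e->time : busy_until[i];
              long finish = start + rand() % 1000;     /* processing: 0..999   */
              free(e);
              busy_until[i] = finish;                  /* node busy until then */
              if (finish > ENDTIME)                    /* not done in time     */
                  continue;
              done++;
              int j;
              do j = 1 + rand() % NODES; while (j == i);    /* another node    */
              schedule(finish + labs((long)(i - j)), j);    /* ABS(i-j) delay  */
          }
          printf("messages completely processed: %ld\n", done);
          return 0;
      }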

    [Nov 12, 2004] C Coroutines

    co_create, co_call, co_resume, co_delete, co_exit_to, co_exit - C coroutine management

    The coro library implements the low level functionality for coroutines. For a definition of the term coroutine see The Art of Computer Programming by Donald E. Knuth. In short, you may think of coroutines as a very simple cooperative multitasking environment where the switch from one task to another is done explicitly by a function call. And, coroutines are fast. Switching from one coroutine to another takes only a couple of assembler instructions more than a normal function call.

    This document defines an API for the low level handling of coroutines i.e. creating and deleting coroutines and switching between them. Higher level functionality (scheduler, etc.) is not covered.
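
    The man page above names the coro library's own entry points (co_create, co_call, co_resume, co_delete, co_exit_to, co_exit). As a portable stand-in for the create/switch/finish lifecycle it describes, here is a minimal sketch using POSIX ucontext (makecontext/swapcontext) rather than the coro API itself; the names task and task_stack are purely illustrative.

      /* Sketch only: ucontext used as a stand-in for the coro library API. */
      #include <stdio.h>
      #include <ucontext.h>

      static ucontext_t main_ctx, task_ctx;
      static char task_stack[64 * 1024];

      static void task(void)                      /* the coroutine body       */
      {
          for (int i = 0; i < 3; i++) {
              printf("task step %d\n", i);
              swapcontext(&task_ctx, &main_ctx);  /* explicit switch back     */
          }
          /* falling off the end passes control to uc_link (main_ctx below) */
      }

      int main(void)
      {
          getcontext(&task_ctx);                  /* "create": set up context */
          task_ctx.uc_stack.ss_sp   = task_stack;
          task_ctx.uc_stack.ss_size = sizeof task_stack;
          task_ctx.uc_link          = &main_ctx;
          makecontext(&task_ctx, task, 0);

          for (int k = 0; k < 3; k++)             /* "call": switch into task */
              swapcontext(&main_ctx, &task_ctx);
          return 0;                               /* task left suspended      */
      }

    Note that swapcontext also saves and restores the signal mask, so it is heavier than the couple of assembler instructions the coro library needs; the sketch shows the structure of explicit switching, not its cost.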

    Comp.compilers Re Assembly verses a high-level language.

    From comp.compilers

    Newsgroups: comp.compilers
    From: [email protected] (Stavros Macrakis)
    In-Reply-To: [email protected]'s message of Mon, 20 Nov 1995 03:53:54 GMT
    Keywords: performance, assembler
    Organization: OSF Research Institute
    References: 95-11-166
    Date: Wed, 22 Nov 1995 21:21:12 GMT

    [email protected] (Tom Powell ) writes:

    How come programs written in assembly are so much faster than any
    other high-level language. I know that it is a low-level language
    and that it "speaks" directly to the hardware so it is faster, but
    why can't high-level languages compile programs just as fast as
    assembly programs?

    First of all, assembler is not an "other" high-level language. It is
    the low-level language par excellence, lower-level even than C :-).

    There are several reasons that assembly programs can be faster than compiled programs:

    The assembly programmer can design data structures which take maximum advantage of the instruction set. To a certain extent, you can do this in languages like C if you're willing to write code that is specific to the architecture. But there are some instructions which are so specialized that it is very hard for compilers to recognize that they're the best way to do things; this is mostly true in CISC architectures.

    The assembly programmer typically can estimate which parts of the program need the most optimization, and apply a variety of tricks which it would be a bad idea to apply everywhere, because they make the code larger. I don't know of any compilers that allow "turning up" or "turning down" optimization for code fragments, although most will allow it for compilation modules.

    The assembly programmer can sometimes use specialized runtime structures, such as for instance reserving some registers globally for things that are often used, or designing special conventions for register use and parameter passing in a group of procedures. Another example is using the top of the stack as a local, unbounded stack without respecting frame conventions.

    Some control structures are not widely supported by commonly-used higher-level languages, or are too general. For instance, coroutines are provided by very few languages. Many languages now provide threads, which are a generalization of coroutines, but often have more overhead.

    The assembly programmer is sometimes willing to do global analysis which most compilers currently don't do.

    Finally, the assembly programmer is more immediately aware of the cost of operations, and thus tends to choose more carefully as a function of cost. As language level rises, the cost of a given operation generally becomes less and less predictable.
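
    As a small, concrete instance of the first point above (instruction-set-specific operations that portable code rarely reaches), compare a portable bit count with a compiler builtin. The example is mine, not Macrakis's, and the builtin assumes GCC or Clang.

      /* Illustration only; __builtin_popcount is a GCC/Clang builtin. */
      #include <stdio.h>

      /* Portable bit count: any C compiler accepts it, but few will
         recognize the loop as a single machine instruction. */
      static int popcount_portable(unsigned x)
      {
          int n = 0;
          while (x) { n += x & 1u; x >>= 1; }
          return n;
      }

      int main(void)
      {
          unsigned v = 0xF0F0u;
          printf("portable: %d\n", popcount_portable(v));
      #if defined(__GNUC__) || defined(__clang__)
          /* With a suitable -march this may compile to a single POPCNT
             instruction: the kind of instruction-set knowledge an
             assembly programmer applies by hand. */
          printf("builtin:  %d\n", __builtin_popcount(v));
      #endif
          return 0;
      }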

    All this said, there is no guarantee that an assembly program will be faster than a compiled program. A program of a given functionality will take longer to develop in assembler than in a higher-level language, so less time is available for design and performance tuning. Re-design is particularly painful in assembler since many decisions are written into the code. In many programs, large improvements can be made in performance by improving algorithms rather than coding; assembler is a disadvantage here since coding time is larger, and flexibility is less. Finally, it is harder to write reliable assembly code than reliable higher-level language code; getting a core dump faster is not much use.

    Compiler writers have tried, over time, to incorporate some of these advantages of assembler. The "coalescing" style of compiler in particular in many ways resembles the work of a good assembly programmer: design your data structures and inner loops together, and early on in the design process. Various kinds of optimization and global analysis are done by compilers, but in the absence of application knowledge, it is hard to bound their runtime. (Another thread in this group talked about the desirability of turning optimization up very high in some cases.)

    -s

    Compilers and Compiler Generators an introduction with C++ © P.D. Terry, Rhodes University, 1996 NEW (12 August 2004)

    A complete revision of this book, using C# and Java and the versions of Coco/R for those languages, will be published by Pearson Education in about October 2004.

    Further details of the date will be published here in due course.

    A feature of the new edition is that it demonstrates the use of Coco/R to implement compilers for the JVM and CLR platforms.

    In the meantime you can preview the contents of the "Resource Kit" which is currently under construction. This contains the preface and table of contents and additional material that will not appear in the published book. When complete it will also contain the source code for the case studies in the book and distributions of Coco/R for C# and Java.

    Let's Build a Compiler by Jack Crenshaw

    This fifteen-part series, written from 1988 to 1995, is a non-technical introduction to compiler construction. You can read the parts on-line or download them in a ZIP file.

    Compilers short notes and links by Rashid Bin Muhammad. Department of Computer Science,
    Kent State University. See also his interesting Home Page

    Slashdot Low Level Virtual Machine 1.3 Released

    "VM" in LLVM name does not do it justice (Score:4, Interesting)
    by truth_revealed (593493) on Saturday August 14, @10:13AM ( #9966950)

    The casual Slashdot reader may roll his/her eyes when they see yet another Virtual Machine - but this project is much more than that. It's a complete compiler infrastructure project that will one day surpass GCC. Why? Because it's around ten times easier to understand and written in a modern language (C++) rather than C. An expert C++ programmer could start contributing code to LLVM in under a month; whereas an equivalent learning curve for GCC is at least a year. Writing new compiler passes or complete language front ends for LLVM is very straight-forward, for example. The LLVM AST has the advantage of having strong typechecking and not arcane tree macros as in GCC. LLVM is not burdened with the legal or philosophical wranglings of GCC where they do NOT WANT their compiler to be the backend of a commercial compiler and try their damnedest to obscure and change the programming API from release to release. The GCC "Toy" example language has not worked in many releases for this very reason.
    GCC recently voted down using C++ in their core code. Perhaps LLVM at the very least will drag GCC into the modern age due to competition.

    Speaking as a GCC maintainer, I call bullshit (Score:5, Informative)
    by devphil (51341) on Saturday August 14, @02:55PM (#9968783)
    (http://www.devphil.com/)

    The very best trolls always start with a grain of truth. (LLVM is much easier to understand than GCC. The GCC infrastructure is very baroque, dating from a time when assuming the presence of an ANSI C bootstrap compiler was too much. One of the major LLVM guys has presented his toolchain work at the annual GCC Summit, and maintains close communication with the rest of the GCC team -- and we wish him well. All very true; no GCC hacker would say any less.)

    The trolls then move on into wild exaggerations and complete lies. Such as:

    and try their damnedest to obscure and change the programming API from release to release.

    Pure malicious bullshit. RMS doesn't want proprietary backends to be able to read the GCC IR, and so we don't ever write it out in a machine-readable format. But we've never gone out of our way to obfuscate the internal API.

    GCC recently voted down using C++ in their core code.

    Again, a complete lie. We asked RMS whether we could make use of C++ in parts of the compiler. While a skilled and brilliant C and LISP hacker, RMS is a reactionary fuckhead when it comes to anything other than C or LISP. In his opinion, all GNU programs should be written in C, and only C, ever.

    There was no vote.

    Re:Speaking as a GCC maintainer, I call bullshit (Score:3, Informative)
    by devphil (51341) on Sunday August 15, @02:16PM (#9974914)
    (http://www.devphil.com/)

    You're not completely right, and not completely wrong. The politics are exceedingly complicated, and I regret it every time I learn more about them.

    RMS doesn't have dictatorial power over the SC, nor a formal veto vote.

    He does hold the copyright to GCC. (Well, the FSF holds the copyright, but he is the FSF.) That's a lot more important than many people realize.

    Choice of implementation language is, strictly speaking, a purely technical issue. But it has so many consequences that it gets special attention.

    The SC specifically avoids getting involved in technical issues whenever possible. Even when the SC is asked to decide something, they never go to RMS when they can help it, because he's so unaware of modern real-world technical issues and the bigger picture. It's far, far better to continue postponing a question than to ask it, when RMS is involved, because he will make a snap decision based on his own bizarre technical ideas, and then never change his mind in time for the new decision to be worth anything.

    He can be convinced. Eventually. It took the SC over a year to explain and demonstrate that Java bytecode could not easily be used to subvert the GPL, therefore permitting GCJ to be checked in to the official repository was okay. I'm sure that someday we'll be using C++ in core code. Just not anytime soon.

    As for forking again... well, yeah, I personally happen to be a proponent of that path. But I'm keenly aware of the damage that would do to GCC's reputation -- beyond the short-sighted typical /. viewpoint of "always disobey every authority" -- and I'm still probably underestimating the problems.

    Full-Text Searching & the Burrows-Wheeler Transform

  • Here's an indexing method that lets you find any character sequence in the source text using a structure that can fit the entire source text and index into less space than the text alone.

    Regular Expression Mining

  • Regular expressions are convenient devices for identifying patterns in collections of strings.

    C/C++ Compiler Optimization

  • Squeezing the maximum execution speed from C/C++ compilers requires an understanding of optimization switches.

    Shared Source CLI Essentials by Ted Neward , David Stutz, Geoff Shilling

    [May 22, 2004] lcc, A Retargetable Compiler for ANSI C

    Note: Due to the volume of this page, previous years of "Old News" were moved to the page Compilers Chronicle 1999-2003; the Recommended Links section of the page was also moved to a separate page

    Recommended Links


    Softpanorama Recommended

    Top articles

    Sites

    Top Links


    Ebooks

    Compiler Construction by Niklaus Wirth

    This is a slightly revised version of the book published by Addison-Wesley in 1996

    131 pages

    ISBN 0-201-40353-6

    Zürich, November 2005

    Compiler Construction - Wikibooks, collection of open-content textbooks

    Parsing Techniques - A Practical Guide by Dick Grune and Ceriel J.H. Jacobs. Originally published by Ellis Horwood, Chichester, England, 1990; ISBN 0 13 651431 6. A small but too formalistic book about parsing

    This 320-page book treats parsing in its own right, in greater depth than is found in most computer science and linguistics books. It offers a clear, accessible, and thorough discussion of many different parsing techniques with their interrelations and applicabilities, including error recovery techniques. Unlike most books, it treats (almost) all parsing methods, not just the popular ones. See Preface + Introduction and/or Table of Contents for a quick impression.

    The authors also participated in writing a newer book, Modern Compiler Design, John Wiley & Sons, Ltd., pp. 736 + xviii, 2000; ISBN 0 471 97697 0

    Structure and Interpretation of Computer Programs: Written by Hal Abelson, Jerry Sussman and Julie Sussman, this book is the very famous "Wizard Book", a computer science text used in the introductory courses at MIT. Download it from http://mitpress.mit.edu/sicp/full-text/book/book.html

    Optimizing C++: Written by Steve Heller, this book focuses specifically on what its title says: optimization in C++. Though it's not as groundbreaking as Abrash's Black Book, it's worth reading for some interesting algorithms and techniques. Download it from http://www.steveheller.com/opt/

    Compiler Construction using Flex and Bison: A small (102-page) book by Anthony A. Aaby. Explains step by step how to write compilers that use Flex and Bison (including very detailed information about both tools). Download it from http://cs.wwc.edu/~aabyan/464/Book/

    Compilers and Compiler Generators: http://scifac.ru.ac.za/compilers/


    WEB-based materials of University Courses

    See also: A List of Courses in Principles and Implementation of Programming Languages


    Papers


    Usenet and FAQs

    The comp.compilers newsgroup -- this is a "must read" group. It contains a wealth of material unavailable on any Web page. In addition, the site contains several useful resources. Among them:

    You can ask questions and often get a highly qualified answer. For example:

    mark watson <[email protected]> wrote:

    >I suspect that one doesn't exist, but is there any visualizer or
    >converter for the .v file that YACC produces?

    If the .v file contains the stack state machine (push-down automaton)
    description, current Bison has a way to visualize it.

    Hans Aberg * Anti-spam: remove "remove." from email address.
    ... ... ...

    Hans Aberg wrote:
    >
    > Can somebody give a reference to a description of the LALR(1)
    > algorithm that Bison uses?
    >
    > The file state.h of the Bison sources says that first, a
    > non-deterministic finite state machine that parses the specified
    > grammar is created; it is then converted into a deterministic finite
    > state machine.

    I don't know what Bison uses, but your description sounds like the algorithm I first came across in:
    Syntax Analysis and Software Tools K.John Gough Addison-Wesley 1988 chapter 9 Bottom-up parsing
    The references there point to D.E.Knuth (1965) On the translation of languages from left to right
    Information and control v8 pp323-350 which apparently covers both this and the "usual" form of the algorithm.

    compilers/construction-tools Catalog of Compiler Construction Products - Thirteenth Issue

    compilers/faq comp.compilers monthly message and Frequently Asked Questions

    compilers/free/* ...


    People


    ACM

    SIGPLAN -- a Special Interest Group of ACM that focuses on Programming Languages. In particular, SIGPLAN explores implementation and efficient use. Pretty closed, and SIGPLAN Notices (see below) has a high subscription price. IMHO its best days are in the past.

    ACM SIGPLAN Web Page at ACM -- ACM SIGPLAN Notices is an informal monthly publication of the Special Interest Group on Programming Languages (SIGPLAN) of ACM.

    SIGPLAN - Conferences


    DSL

    DSLAnnotatedBibliography

    First quarter, 1999

    Second quarter, 1999

    Third quarter, 1999

    Fourth quarter, 1999


    First Quarter, 2000

    Second Quarter, 2000

    Third Quarter, 2000

    Fourth Quarter, 2000


    First Quarter, 2001


    Tools


    Parsing


    Memory management

    The Memory Management Reference


    Graph analysis

    Structuring Decompiled Graphs - Cristina Cifuentes (1996)

    Abstract: . A structuring algorithm for arbitrary control flow graphs is presented. Graphs are structured into functional, semantical and structural equivalent graphs, without code replication or introduction of new variables. The algorithm makes use of a set of generic high-level language structures that includes different types of loops and conditionals. Gotos are used only when the graph cannot be structured with the structures in the generic set. This algorithm is adequate for the control flow analysis required when decompiling programs, given that a pure binary program does not contain information on the high-level structures used by the initial high-level language program (i.e. before compilation). The algorithm has been implemented as part of the dcc decompiler, an i80286 decompiler of DOS...

    [...] only by the code generator, and no labels are placed on the graph on the earlier control flow analysis phase. (Section 7, Previous Work:) Most structuring algorithms have concentrated on the removal of goto statements from control flow graphs at the expense of introduction of new boolean variables [8, 29, 22, 28, 6, 13], code replication [17, 27, 29], the use of multilevel exits [7, 24], or the use of a set of high-level structures not available in commonly used languages [25]. None of these methods are applicable to a decompiler because: the introduction of new boolean variables modifies the semantics of the [...]

    13. A.M. Erosa and L.J. Hendren. Taming control flow: A structured approach to eliminating goto statements. In Proceedings of the International Conference on Computer Languages, Universit'e Paul Sabatier, Toulouse, France, May 1994. IEEE Computer Society.

    Control Flow Graphs for Java Bytecode Neal Glew, Cornell...
    CS-553 Compiler Project - Control Flow Graphs
    Introduction to Control Flow Graphs

    Control Flow Graphs for Data Analysis

    Taming control flow: A structured approach to eliminating goto statements
    WolfPack v1.0 Control Flow Graphs

    Rastislav Bodik and Sadun Anik. Path-sensitive value-flow analysis. In 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, San Diego, California, pages 237-251. ACM, January 1998. http://www.cs.wisc.edu/~bodik/research/popl98.ps

    Uwe Assmann. How to Uniformly Specify Program Analysis and Transformation. In P. Fritzson, editor, Compiler Construction (CC). Springer, 1996. http://i44s11.info.uni-karlsruhe.de:80/~assmann/


    A Practical Framework for Demand-Driven Interprocedural Data Flow Analysis - Duesterwald, Gupta, Soffa (ResearchIndex)

    Skeletal Parallelism Homepage

    Choose a Format for Viewing the Dataflow Web Pages

    Resources for Programming Language Research

    What's New on the Language Research Pages

    EAPLS Home Page -- European Association for Programming Languages and Systems

    WWW Virtual Library - Software Engineering

    Catalog of Compiler Construction Tools

    Archive of comp.compilers

    SEL-HPC Article Archive

    Reading list on parallel languages

    The Teaching About Programming Languages Project

    A List of Courses in Principles and Implementation of Programming Languages



    Tree-based Code optimization


    Performance and Benchmarks


    Program Analysis and Transformations


    Code generation


    C compilers


    XML Compilers

    Soumen Sarkar: XSLT is typically used in document-transformation scenarios (XML to WML or HTML), and this misses the point that it can be used very flexibly for template-driven code generation. Here is how the author describes his experience:

    Soumen Sarkar has successfully used XSLT for source code generation in his Java application software project at ATOGA Systems Inc. Out of approximately 2,480 Java files in the project, 1,520 were generated. In other words, 61% of the Java code is generated. SQL and XML files are also generated. He has the following publications on this topic:

    Model Driven Programming, published in XML Journal of SYS-CON Media, August 2002, Volume 3, Issue 8. The article proposed an extension of the Model View Controller (MVC) architecture and showed how to implement it in software development projects.

    http://www.sys-con.com/xml/pdf/trees.pdf

    Code generation using XML based document transformation, a technical article published on the J2EE expert site theserverside.com, November 2001. The URL is http://www.theserverside.com/resources/article.jsp?l=XMLCodeGen . Application of this technique to Atoga NMS development resulted in 60% code generation.

    http://www.theserverside.com/resources/articles/XMLCodeGen/xmltransform.pdf
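
    The idea of template-driven code generation does not depend on XSLT; any language that can read a model and stamp out text will do. Here is a minimal C sketch (my own, with purely illustrative names) that emits a Java-style class from a tiny in-memory model:

      /* Illustration only: template-driven code generation in miniature. */
      #include <stdio.h>

      struct field { const char *type, *name; };

      /* The "model": in a real system this would be read from an XML file. */
      static const char *class_name = "Customer";
      static const struct field fields[] = {
          { "String", "name" },
          { "int",    "accountId" },
      };

      int main(void)
      {
          int n = sizeof fields / sizeof fields[0];
          printf("public class %s {\n", class_name);
          for (int i = 0; i < n; i++)                       /* data members */
              printf("    private %s %s;\n", fields[i].type, fields[i].name);
          for (int i = 0; i < n; i++)                       /* getters      */
              printf("    public %s get_%s() { return %s; }\n",
                     fields[i].type, fields[i].name, fields[i].name);
          printf("}\n");
          return 0;
      }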

    Postscript Versions of Selected Papers

    Soot a Java Bytecode Analysis and Transformation Framework

    Computer Science Department Web Servers

    ACE: CoSy Compilation System
    Instruction-Set Simulation and Tracing
    Runtime Code Generation (RTCG)
    Faculty Research Interests : Greg Morrisett
    ANDF Reference Guide
    Language Prototyping: An Algebraic Specification Approach
    Dynamic Typing in Polymorphic Languages
    Catalog of Free Compilers and Interpreters: introduction
    Catalog of Compiler Construction Tools
    http://www.csd.uu.se/~hakanm/pli.html
    The Compiler Connection
    Graphs to color
    dag
    CPU Info Center
    HAKMEM -- CONTENTS -- DRAFT, NOT YET PROOFED
    ICCD '95
    The comp.compilers newsgroup
    Catalog of Compiler Construction Tools
    Catalog of Free Compilers and Interpreters: introduction
    Compiler Internet Resource List
    re2c
    Lazy code motion handouts
    Dynamic Typing in a Statically Typed Language.
    Error Repair in Shift-Reduce Parsers
    Mark Leone's Research
    ACM Symposium: PLDI '97 Call for papers

    Architectures and Compilers to Support Reconfigurab

    Compiler Jobs

    Humor

     Top 10 reasons computers are male
      ===========================
      
       10. They have a lot of data but are still clueless.
       9.  A better model is always just around the corner.
       8.  They look nice and shiny until you bring them home.
       7.  It is always necessary to have a backup.
       6.  They'll do whatever you say if you push the right buttons.
       5.  The best part of having either one is the games you can play.
       4.  In order to get their attention, you have to turn them on.
       3.  The lights are on but nobody's home.
       2.  Big power surges knock them out for the night.
       1.  Size does matter
      
       Here's the quid pro quo:
      
       Top 10 reasons compilers must be female:
       ========================================
      
       10. Picky, picky, picky.
       9.  They hear what you say, but not what you mean.
       8.  Beauty is only shell deep.
       7.  When you ask what's wrong, they say "nothing".
       6.  Can produce incorrect results with alarming speed.
       5.  Always turning simple statements into big productions.
       4.  Smalltalk is important.
       3.  You do the same thing for years, and suddenly it's wrong.
       2.  They make you take the garbage out.
       1.  Miss a period and they go wild


    Etc



    Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

    FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

    This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

    You can use PayPal to buy a cup of coffee for the authors of this site

    Disclaimer:

    The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense, so you need to be aware of Google's privacy policy. If you do not want to be tracked by Google, please disable JavaScript for this site. This site is perfectly usable without JavaScript.

    Last modified: September 14, 2020