Perl and Parsing 5: Rewind

By Jeffrey Kegler on September 22, 2010 11:45 PM

The Rise and Fall and Rise of the Left

On 28 February 2006, the Golden Age of Right Parsing ended. The End of the Age was like a rewind of the Beginning. The Golden Age of Right Parsing began when a right parser replaced the hand-written recursive descent parser in the C compiler. It ended, almost three decades later, when the world's most visible C compiler replaced its right parser with a hand-written recursive descent implementation.

Right parsing was taken out of mathematical notation on the pages of academic journals and put on the map in 1978. That's when Steven Johnson rewrote Dennis Ritchie's C compiler to make it easier to understand and modify. This "Portable C compiler" used yacc, a new tool that Johnson had written based on some papers by Don Knuth and Frank DeRemer. Knuth had invented LR parsing, and DeRemer described a wrinkle on it that he called LALR. It was said that LALR would actually make right parsing practical. The folks at Bell Labs decided to give it a go.

The result was one of the most influential implementation decisions in the history of programming. In a previous post, I described how LALR parsing rode on the backs of yacc and the Portable C Compiler into total dominance over parsing mindshare. LALR came to define production-quality parsing.

In 1978, there had been two C compilers, both at AT&T's Bell Labs -- Ritchie's original, and Johnson's Portable C Compiler, which was about to replace it. Neither was a commercial product -- AT&T was a regulated monopoly, banned from selling hardware or software.

By 2006, C was the most important systems programming language in a world which had become far more dependent on computer systems. There were now many C compilers, several of them of great commercial importance. Arguably, one of these was the most visible production-quality C compiler. This no longer came from AT&T. The leader among C compilers in 2006 was GCC, the GNU foundation's flagship accomplishment.

GCC can be said to have clearly taken this lead by 1994, when BSD UNIX had switched from the Portable C Compiler to GCC. This was a revolution of another kind, but as far as the parsing algorithm went, it was "Hail to the new boss... same as the old boss". GCC parsing was firmly in the LALR tradition.

But on 28 February 2006, the nearly 3 decades of LALR dominance were over. That's the date of the change log for GCC 4.1. It's a long document, and the relevant entry is deep inside it. It's one of the shorter change descriptions.

In fact, it's exactly this one sentence:

The old Bison-based C and Objective-C parser has been replaced by a new, faster hand-written recursive descent parser.

For me, that's a little like reading page 62,518 of the Federal Register and spotting a one-line notice:

Effective immediately, Manhattan belongs to the Indians.

The Once and Future Parser

Hand-written recursive descent was the origin and remains the soul of left parsing. In fact, hand-written recursive descent was the first real parsing algorithm. Recursive descent is a very natural method if you're a modern programmer. Basically, for every left-hand-side symbol in your grammar, you write a subroutine to parse it. If the rule has right hand side symbols, you call the subroutines for them, in order. When all the recursive calls come back to the top, that's your parse.

Would that parsing were really that simple. But a real beauty of recursive descent (perhaps the ultimate secret behind its comeback) is that it is an idea that allows easy elaboration. For example, suppose I say (as happens typically to be the case) that some symbols are on the left hand side of more than one rule of the grammar. A competent modern programmer will quickly think up hacks that let him determine which rule to use. Someone clever is likely to reinvent lookahead, even if he'd never heard of recursive descent before. If you're a programmer brought up on recursive subroutines, the logic of recursive descent makes sense to you.

"Folk parsers" -- parsers written by programmers with little patience for theory and a "just do it" philosophy -- almost always end up as variations on recursive descent. Those of you with contempt for parsing theory may take as one of your best arguments the current re-emergence of hand-written recursive descent as the method of choice for serious production-quality parsing. It's like we theorists might as well have spent the last 50 years surfing.

I use the term "Folk parsing". It is quite descriptive, but it's also misleading. The appeal of recursive descent is not just to working programmers. Recursive descent has always had a following among academics.

And recursive descent did not emerge from the primal mists. It had an inventor, Ned Irons, who was one of my teachers at Yale. Ned wrote a paper which was "[t]he first to describe a full parser. It is essentially a full backtracking recursive descent left-corner parser" [1]. I take its publication date (1961) as the date of the invention of both recursive descent and parsing.

"You see, in the old days ..."

Some previous papers had described hacks for parsing arithmetic expressions. These seem to have been primarily motivated by the need to parse FORTRAN assignment statements.

FORTRAN in those days was not much more than a glorified macro language. FORTRAN III was parsed line by line, the way you parse configuration files now. Except you probably use much more sophisticated techniques to parse your config files. Many of the simpler techniques, things modern programmers take for granted, weren't known in 1961.

It can be hard to take ourselves back to those days in the history of programming. Concepts we consider obvious were then either partly formed, unknown, or completely unsuspected. In 1961, BNF had just been invented and LISP and IPL were the only languages with stacks. Only LISP (1958) had them as a built-in feature. FORTRAN III had a feature called subroutines, but there were no stacks. A FORTRAN III subroutine puts its return value into a fixed, global location. Recursive subroutine calls were not allowed.

From a parsing point of view, FORTRAN does seem to have had the most complex language feature of the time. The right hand side of FORTRAN's assignment statements allowed arithmetic expressions. These were relatively full-featured arithmetic expressions.

All this was taking place within the limits of an 80 column punched card. (Actually, 72 columns if you were prudent and had sequence numbers on the cards. Otherwise there's hell to pay if you drop the deck.) Parsing those expressions was FORTRAN's claim to fame, one which the inventors weren't about to let you overlook. The language was named after it. FORTRAN is a block acronym for "Formula Translating".

This "formula translating" was accomplished with methods we'd now consider gruesome hacks. True operator-precedence parsing does not seem to have been invented until shortly after Ned invented recursive descent.

Not only do users of modern languages take the availability of recursion and the use of a stack as a given, they expect their languages to have a fully recursive grammar. Statements contain blocks. Blocks contain statements.

A recursive language seems to presuppose a real parsing algorithm. Brute force and trickery might have been enough to parse an arithmetic expression that fit into 72 columns. (80 if you like to live dangerously.) But you can't parse a modern recursive language if you don't use recursion.

So recursive descent, the first real parsing algorithm, needed to be invented first, before the first recursive language, right? You'd think so. But, as we'll see in my next post in this series, that ain't the way it happened.

[1] Grune & Jacobs, Parsing Techniques: A Practical Guide - Second Edition. The history in this post takes as its authority their annotated bibliography, which is online .

2 comments

Tagged as:

Marpa Perl Parsing

2 Comments

Jan Dubois | September 23, 2010 12:49 PM

FORTRAN statements were not limited to 80 characters; you could use continuation lines by putting any character except a space or '0' in column 6 of a new card to continue the statement from the previous card. There were compiler specific limitations on the number of allowed continuation cards, but the limit was usually larger than 1.

Not sure if this was in FORTRAN III, but definitely in FORTRAN IV.

Jeffrey Kegler replied to comment from Jan Dubois | September 23, 2010 1:43 PM

I spoke loosely. The strictest limit was not the number of characters allowed, but FORTRAN's internal tables. The strictest limit (according to the FORTRAN II manual) seems to have been the LAMBDA table. You had to keep lambda under 400, where (not counting subscripts), the initial value was 3, each exponentiation, variable or constant counted 1, each multiply/divide counted 2, each add/subtract counted 3, each function call counted 3, every open paren counted 4, and every function argument counted 4.

You're right about continuation cards -- up to nine were allowed. Again I was speaking loosely. FORTRAN actually only allowed 66 meaningful characters per card. You were NOT allowed to use 80 characters. Whether or not you sequenced your cards, FORTRAN ignored columns 73-80, reserving the space for less reckless programmers. It also reserved columns 1-6 for its own line numbers and other info. This leaves 66 characters per card. 10x66=660. All this is from page 8 of the October 1958 FORTRAN II reference manual, which is online. I leave it as an exercise to the reader to write a FORTRAN expression which uses all 660 characters without breaking any of the limits described on pages 48-50.

About Jeffrey Kegler

I blog about Perl, with a focus on parsing and Marpa, my parsing algorithm based on Jay Earley's.

More info »

Ocean of Awareness