Thursday, February 10, 2011

YACC and the Perils of Operator Precedence

Background

A little while ago I received a query about a very simple grammar example which behaved in an unexpected fashion. The issue was that the parser worked correctly in its original form.  However, there was a common substring in two of the productions. When the common substring was replaced by a new non-terminal, the resulting parser rejected certain clearly legal sentences. The question was: surely these two grammars should be equivalent?

I dug around with the example and was forcibly reminded of some "features" of YACC-like parser generators that I had long forgotten.  So here is an even simpler grammar that also behaves paradoxically.  It is also a tutorial on how the operator precedence algorithm really works.

For those whose memory of shift/reduce parsing theory is even rustier than mine, I have added a page called the Shift/Reduce Parsing Refresher. It is blatantly lifted from Appendix-9 of the GPPG documentation at http://gppg.codeplex.com/.

Introduction

The parsers generated by GPPG are similar in their functionality to those produce by the traditional YACC tools. That is, as well as using LALR(1) states to guide the parse they use operator precedence. Operator Precedence allows convenient handling of such things as expression grammars with large numbers of operators.

When programming languages are specified for their users, that is, for programmers, it is common to specify the semantics of expressions by specifying the relative precedence and associativity of the operators. Thus it is tempting to use this same mechanism for defining the grammar for implementors, and this is what YACC-like tools try to do.

There is another way of defining the way a parser will work with an expression grammar: by defining a number of different syntax non-terminals, traditionally given names like Expression, Simple Expression, Term, Factor and Primary. Going down this second path has the disadvantage that two ways of specifying the same semantic may then be required, because the multiple non-terminal method seems to be less understandable for programmers. Unfortunately, there are disadvantages in using the operator precedence functionality of YACC.

This episode of the blog is concerned with understanding some of the subtleties.

The Breathtakingly Simple Grammar

The running example grammar is very simple. A single operator, and only two tokens.
Goal  : '+' Exp ;
Exp  : 'a' | Exp '+' Exp ;
As it turns out, this grammar specifies a regular language, the same infinite set of strings that are generated by the regular expression -
(\+a)+     or, if you prefer, the set {+a, +a+a, +a+a+a, ... }
What could possible go wrong with such a simple grammar? And if something can go wrong, how about a more complicated case with (say) 17 levels of precedence?

Preliminary Examination

The example grammar is actually ambiguous. The shortest valid string with an ambiguous parse is "+a+a+a", which may be parsed either as though it was meant to be understood as "+((a+a)+a)" or as "+(a+(a+a))".

No big deal, we didn't specify whether the binary "+" was left or right associative.

However, before we go on let's check: send this grammar to GPPG, and it generates a parser, but also a warning which with the /verbose options reads -
Shift/Reduce conflict
    Shift "'+'" : State-6 -> State-5
    Reduce 4 : Exp -> Exp, '+', Exp
This tells us that in state-6 of the automaton there are two possible actions when the next symbol is '+' : shifting the plus sign and going to state-5, or reducing by the production
Exp : Exp '=' Exp 
This is nice, we see that the shift choice would lead to the right-associative semantic for the binary '+', while the reduce choice would lead to the left-associative semantic.

Using YACC Disambiguation - Associativity

The reason that the shift/reduce conflict message is only a warning is because YACC and its look-alikes always break such conflicts by shifting, in preference to reducing. In the case of the running example, this would lead to the right-associative interpretation. The warning is simply a heads-up to the user, so they may check if the "prefer shifting" default gives the correct behavior.
It is possible to make the warning go away by explicitly stating the associativity of the tokens. In the example we may begin with -
%token 'a'
%left '+'
GPPG no longer gives a warning, since we have specified that we want '+' to be left associative. Let us see how this actually works. If GPPG is run with the /report option the part of the html report with the details for state-6 (the state that previously had the shift/reduce conflict)is -
State-6
Kernel Items
  4 Exp : Exp '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . '+' Exp
Parser Actions
  '+' reduce using rule 4 (Exp)
  EOF reduce using rule 4 (Exp) 
Recall that the dot in the kernel items denotes the point in the rule that the recognition has reached. Unusually in this case the two items in the kernel both belong to the same production rule, rule-4. The first item has the dot at the end, indicating that this production has been fully recognized, and may be reduced. The lookahead set that support the reduction is shown within braces. The second item has the dot internally, and shows that a following '+' would continue with the recognition of this production rule. The parser action table shows that with the %left declaration the first item (reduce by rule-4)is selected no matter whether the lookahead is another plus sign or the end of file.
Alternatively, if the declarations say -
%token 'a'
%right '+'
then the same state will have the report -
State-6
Kernel Items
  4 Exp : Exp '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . '+' Exp
Parser Actions
  '+' shift, and go to state 5
  EOF reduce using rule 4 (Exp) 
which is the same behavior as with no associativity specified, but without the warning message.
This seems to have solved the problem, with a clear way to choose the desired behavior, and that pesky warning suppressed.

Using YACC Disambiguation - Operator Precedence

As well as specifying the associativity, YACC allows for the relative precedence of operators to be controlled. The simple rule is that all the tokens associated with a particular %token, %left or %right declaration have the same precedence, but those in later groups take precedence over those in earlier declarations. Thus if we had not only a '+' operator, but also a '*' operator then in conventional arithmetic grammars the multiply operator would occur in a later group than the additive operator.

How Operator Precedence Actually Works

Operator precedence is used to silently resolve shift/reduce conflicts by comparing the precedence of the completed rule with the precedence of the lookahead token. Here are the rules for determining precedence of each rule and each token -
  • All token precedences are greater than zero.
  • The precedence of a token is determined by the position of the declaration group in which it occurs. Groups declared later in the definitions section have higher precedence.
  • The precedence of a production is that given by the "%prec TokenName" declaration, if there is one.
  • Otherwise, the precedence of a production is that of the rightmost terminal symbol in the right-hand-side, if there are any terminal symbols in the right-handside.
  • Otherwise the production has zero precedence.
Having established the precedences, here are the rules by which they are compared -
  • If the precedence of the production is higher than the precedence of the lookahead token, then reduce.
  • Otherwise, if the precedence of the lookahead token is higher than the precedence of the production, then shift.
  • If the precedences are equal and the associativity of the lookahead token is left then reduce.
  • If the precedences are equal and the associativity of the lookahead token is right then shift.
These rules are applied during the generation of the parsing tables, not at parser runtime.
Finally, here are the rules that GPPG uses to decide when to issue conflict diagnostics -
  • If an automaton state has two or more productions that can be reduced, that is, two or more items with the “dot” at the end, then issue a reduce/reduce conflict warning.
  • If an automaton state has a reduction and also possible shift actions, then the conflicts are resolved as detailed above. However, if the conflict is resolved in favor of shifting because the production has zero precedence, then issue a shift/reduce conflict warning.
The conflict-reporting rules are similar for most YACC-like parser generators.

So What is the Problem?

The operator precedence facilities of these kinds of parser generators are designed for operator-based grammars, and are reasonably convenient for this purpose. However, like most simple ideas they have a dark side of unintended consequences.
The most striking of these is the failure of one of the most fundamental properties of context free grammars. Consider our original grammar -
Goal  : '+' Exp ;
Exp  : 'a' | Exp '+' Exp ;
Notice that the pattern "'+' Exp" occurs in two places. We might therefore be tempted to define a new non-terminal symbol for this pattern. Call it "Rest", say. Now we have the new grammar -
Goal  : Rest ;
Rest  : '+' Exp ;
Exp  : 'a' | Exp Rest ;
and this grammar defines precisely the same language as the previous case.
If we send this grammar to GPPG, it generates a parser and, as before, also generates a warning. In this case the warning reads -
Shift/Reduce conflict
  Shift "'+'" : State-5 -> State-4
  Reduce 5 : Rest -> '+', Exp
Following our experience with the original, unfactorised version of this grammar we might try to declare the '+' operator %right. As it turns out, this choice leads to a parser which rejects the simple (and legal) input "+a+a".
What is going on here? How is it possible that the parser is rejecting a legal input?
To explain this weird result we need to look at the parser's state-5, the site of the shift/reduce conflict. Here are the kernel items in this state -
State-5
Kernel Items
  5 Rest: '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . Rest
According to our rules, in this state the precedence of the rule-5 is the same as '+'.  Since we have defined '+' as having %right precedence the parser reduces, and the input reduces to the goal symbol before the EOF is reached.

The Message in all of this

So to return to the question in the background section, are these two grammars equivalent? Well, yes the grammars are equivalent, but the two grammars do not lead to equivalent parsers.

The take-home message in all of this is that operator precedence is a fragile construct. It is normally safe, and sometimes even helpful to factor out common sub-expressions in production right-hand-sides. However two things happen when this abstraction mechanism is used. Firstly, kernel item-sets may become merged, introducing new shift/reduce conflicts; secondly, if the abstraction removes the "last non-terminal in the right hand side" then the precedence of the rule may be modified and parsing behavior changed.

Some folks think that this is a sufficiently serious problem that they shun the use of operator precedence entirely and advocate instead the use of multi-level grammars.

Like the man said "yer pays yer money, and yer takes yer pick".

Tuesday, February 1, 2011

What it's all About

I plan to use this blog to talk about Compilers, including the ones that I wrote in the past or am writing now; Software Tools, including the ones that I wrote or maintain.  I may also indulge in an old-fashioned rant from time to time.

Most of my software that is in the public domain is available on CodePlex http://www.codeplex.com/ although there are some other versions available somewhere inside the QUT website.  Where both are available, the CodePlex versions are the current ones, and are under source code control (I use TortoiseSVN by choice).

Most of the software has been released under the Gardens Point name, chosen for the particular bend in the Brisbane River where the main campus of Queensland University of Technology is situated.  And yes that is the same river that flooded in January 2011, inundating about 11 thousand houses.

What Software Tools?
The software tools that I maintain on CodePlex are:
  • Gardens Point LEX (gplex).  This is a lexical scanner generator which produces C# as output.  The tool has lots of options, and can create scanners that deal with the whole Unicode character set.
  • Gardens Point Parser Generator (gppg).  This is a parser generator which produces C# as output.  The tool has an input language very close to traditional YACC, and produces an LALR(1) parser with the usual disambiguation rules.
  • PE Reader/Writer API (perwapi).  This is a reader/writer for Microsoft .NET PE files.  It has been used by a number of compilers including GPCP and GPM-CLR.  The tool does not deal with the complete world of the latest version of the .NET framework.  I tend to do enhancements/bug-fixes on an on-demand basis.
  • There is also a prototype tree rewriter-generator, based on Bottom Up Tree Rewriting theory.  This is almost ready for release, and I intend to blog about its virtues and challenges quite a bit.
What Compilers?
  • Gardens Point Component Pascal (gpcp).  This is the demonstration compiler that I wrote in year-2000 for the initial release of the .NET framework.  It also formed the running example for my book Compiling for the .NET Common Language Runtime, Prentice-Hall 2002.  There is also a JVM version of this compiler, but it lags behind the CLR version by a couple of releases.
  • Gardens Point Modula-2 (CLR version).  GPM was a compiler family that was built at QUT over about a ten year period.  Initially a commercial product, it was available on a total of about eleven machine architecture/operating system combinations.  Only a freeware Linux/x86 version appears to be still in use, and development stopped in about 1995.  However, a CLR version was created in 2003, as an example of how to write a CLR compiler for a non-typesafe, non-object oriented language.  GPM-CLR is effectively a new backend for the GPM frontend that all versions of GPM used in 1995.  As a side-effect of this open source experiment the previously non-open-source front-end is public.  I use GPM-CLR myself when I need to run some dusty old Modula-2 program.
Stuff in the Works
  • I am currently writing a C# version-3 backend for the QUT C# typechecker.  The typechecker is a framework for research on type-system extensions to C#.  It was used for Andrew Craik's excellent PhD thesis.  The backend is a platform for exploring issues to do with code-generation for .NET.  It is, if you wish, filling in all the bits that are missing from my Compiling for the .NET Common Language Runtime. Some were missing because they hadn't been invented in 2001, and others due to sheer ignorance on my part.
  • As well as the bottom-up rewriter-generator mentioned earlier, I am interested in plain ordinary code selection again (as opposed to code generation for virtual machines such as JVM and CLR).  Code for the SnapDragon is a gleam in the eye.