John Gough on Software Tools: YACC and the Perils of Operator Precedence

Background

A little while ago I received a query about a very simple grammar example which behaved in an unexpected fashion. The issue was that the parser worked correctly in its original form. However, there was a common substring in two of the productions. When the common substring was replaced by a new non-terminal, the resulting parser rejected certain clearly legal sentences. The question was: surely these two grammars should be equivalent?

I dug around with the example and was forcibly reminded of some "features" of YACC-like parser generators that I had long forgotten. So here is an even simpler grammar that also behaves paradoxically. It is also a tutorial on how the operator precedence algorithm really works.

For those whose memory of shift/reduce parsing theory is even rustier than mine, I have added a page called the Shift/Reduce Parsing Refresher. It is blatantly lifted from Appendix-9 of the GPPG documentation at http://gppg.codeplex.com/.

Introduction

The parsers generated by GPPG are similar in their functionality to those produce by the traditional YACC tools. That is, as well as using LALR(1) states to guide the parse they use operator precedence. Operator Precedence allows convenient handling of such things as expression grammars with large numbers of operators.

When programming languages are specified for their users, that is, for programmers, it is common to specify the semantics of expressions by specifying the relative precedence and associativity of the operators. Thus it is tempting to use this same mechanism for defining the grammar for implementors, and this is what YACC-like tools try to do.

There is another way of defining the way a parser will work with an expression grammar: by defining a number of different syntax non-terminals, traditionally given names like Expression, Simple Expression, Term, Factor and Primary. Going down this second path has the disadvantage that two ways of specifying the same semantic may then be required, because the multiple non-terminal method seems to be less understandable for programmers. Unfortunately, there are disadvantages in using the operator precedence functionality of YACC.

This episode of the blog is concerned with understanding some of the subtleties.

The Breathtakingly Simple Grammar

The running example grammar is very simple. A single operator, and only two tokens.

Goal : '+' Exp ;
Exp : 'a' | Exp '+' Exp ;

As it turns out, this grammar specifies a regular language, the same infinite set of strings that are generated by the regular expression -

(\+a)+ or, if you prefer, the set {+a, +a+a, +a+a+a, ... }

What could possible go wrong with such a simple grammar? And if something can go wrong, how about a more complicated case with (say) 17 levels of precedence?

Preliminary Examination

The example grammar is actually ambiguous. The shortest valid string with an ambiguous parse is "+a+a+a", which may be parsed either as though it was meant to be understood as "+((a+a)+a)" or as "+(a+(a+a))".

No big deal, we didn't specify whether the binary "+" was left or right associative.

However, before we go on let's check: send this grammar to GPPG, and it generates a parser, but also a warning which with the /verbose options reads -

Shift/Reduce conflict
Shift "'+'" : State-6 -> State-5
Reduce 4 : Exp -> Exp, '+', Exp

This tells us that in state-6 of the automaton there are two possible actions when the next symbol is '+' : shifting the plus sign and going to state-5, or reducing by the production

Exp : Exp '=' Exp

This is nice, we see that the shift choice would lead to the right-associative semantic for the binary '+', while the reduce choice would lead to the left-associative semantic.

Using YACC Disambiguation - Associativity

The reason that the shift/reduce conflict message is only a warning is because YACC and its look-alikes always break such conflicts by shifting, in preference to reducing. In the case of the running example, this would lead to the right-associative interpretation. The warning is simply a heads-up to the user, so they may check if the "prefer shifting" default gives the correct behavior.

It is possible to make the warning go away by explicitly stating the associativity of the tokens. In the example we may begin with -

%token 'a'
%left '+'

GPPG no longer gives a warning, since we have specified that we want '+' to be left associative. Let us see how this actually works. If GPPG is run with the /report option the part of the html report with the details for state-6 (the state that previously had the shift/reduce conflict)is -

State-6
Kernel Items
4 Exp : Exp '+' Exp .
-lookahead: { '+', EOF }
4 Exp : Exp . '+' Exp
Parser Actions
'+' reduce using rule 4 (Exp)
EOF reduce using rule 4 (Exp)

Recall that the dot in the kernel items denotes the point in the rule that the recognition has reached. Unusually in this case the two items in the kernel both belong to the same production rule, rule-4. The first item has the dot at the end, indicating that this production has been fully recognized, and may be reduced. The lookahead set that support the reduction is shown within braces. The second item has the dot internally, and shows that a following '+' would continue with the recognition of this production rule. The parser action table shows that with the %left declaration the first item (reduce by rule-4)is selected no matter whether the lookahead is another plus sign or the end of file.

Alternatively, if the declarations say -

%token 'a'
%right '+'

then the same state will have the report -

State-6
Kernel Items
4 Exp : Exp '+' Exp .
-lookahead: { '+', EOF }
4 Exp : Exp . '+' Exp
Parser Actions
'+' shift, and go to state 5
EOF reduce using rule 4 (Exp)

which is the same behavior as with no associativity specified, but without the warning message.

This seems to have solved the problem, with a clear way to choose the desired behavior, and that pesky warning suppressed.

Using YACC Disambiguation - Operator Precedence

As well as specifying the associativity, YACC allows for the relative precedence of operators to be controlled. The simple rule is that all the tokens associated with a particular %token, %left or %right declaration have the same precedence, but those in later groups take precedence over those in earlier declarations. Thus if we had not only a '+' operator, but also a '*' operator then in conventional arithmetic grammars the multiply operator would occur in a later group than the additive operator.

How Operator Precedence Actually Works

Operator precedence is used to silently resolve shift/reduce conflicts by comparing the precedence of the completed rule with the precedence of the lookahead token. Here are the rules for determining precedence of each rule and each token -

All token precedences are greater than zero.
The precedence of a token is determined by the position of the declaration group in which it occurs. Groups declared later in the definitions section have higher precedence.
The precedence of a production is that given by the "%prec TokenName" declaration, if there is one.
Otherwise, the precedence of a production is that of the rightmost terminal symbol in the right-hand-side, if there are any terminal symbols in the right-handside.
Otherwise the production has zero precedence.

Having established the precedences, here are the rules by which they are compared -

If the precedence of the production is higher than the precedence of the lookahead token, then reduce.
Otherwise, if the precedence of the lookahead token is higher than the precedence of the production, then shift.
If the precedences are equal and the associativity of the lookahead token is left then reduce.
If the precedences are equal and the associativity of the lookahead token is right then shift.

These rules are applied during the generation of the parsing tables, not at parser runtime.

Finally, here are the rules that GPPG uses to decide when to issue conflict diagnostics -

If an automaton state has two or more productions that can be reduced, that is, two or more items with the “dot” at the end, then issue a reduce/reduce conflict warning.
If an automaton state has a reduction and also possible shift actions, then the conflicts are resolved as detailed above. However, if the conflict is resolved in favor of shifting because the production has zero precedence, then issue a shift/reduce conflict warning.

The conflict-reporting rules are similar for most YACC-like parser generators.

So What is the Problem?

The operator precedence facilities of these kinds of parser generators are designed for operator-based grammars, and are reasonably convenient for this purpose. However, like most simple ideas they have a dark side of unintended consequences.

The most striking of these is the failure of one of the most fundamental properties of context free grammars. Consider our original grammar -

Goal : '+' Exp ;
Exp : 'a' | Exp '+' Exp ;

Notice that the pattern "'+' Exp" occurs in two places. We might therefore be tempted to define a new non-terminal symbol for this pattern. Call it "Rest", say. Now we have the new grammar -

Goal : Rest ;
Rest : '+' Exp ;
Exp : 'a' | Exp Rest ;

and this grammar defines precisely the same language as the previous case.

If we send this grammar to GPPG, it generates a parser and, as before, also generates a warning. In this case the warning reads -

Shift/Reduce conflict
Shift "'+'" : State-5 -> State-4
Reduce 5 : Rest -> '+', Exp

Following our experience with the original, unfactorised version of this grammar we might try to declare the '+' operator %right. As it turns out, this choice leads to a parser which rejects the simple (and legal) input "+a+a".

What is going on here? How is it possible that the parser is rejecting a legal input?

To explain this weird result we need to look at the parser's state-5, the site of the shift/reduce conflict. Here are the kernel items in this state -

State-5
Kernel Items
5 Rest: '+' Exp .
-lookahead: { '+', EOF }
4 Exp : Exp . Rest

According to our rules, in this state the precedence of the rule-5 is the same as '+'. Since we have defined '+' as having %right precedence the parser reduces, and the input reduces to the goal symbol before the EOF is reached.

The Message in all of this

So to return to the question in the background section, are these two grammars equivalent? Well, yes the grammars are equivalent, but the two grammars do not lead to equivalent parsers.

The take-home message in all of this is that operator precedence is a fragile construct. It is normally safe, and sometimes even helpful to factor out common sub-expressions in production right-hand-sides. However two things happen when this abstraction mechanism is used. Firstly, kernel item-sets may become merged, introducing new shift/reduce conflicts; secondly, if the abstraction removes the "last non-terminal in the right hand side" then the precedence of the rule may be modified and parsing behavior changed.

Some folks think that this is a sufficiently serious problem that they shun the use of operator precedence entirely and advocate instead the use of multi-level grammars.

Like the man said "yer pays yer money, and yer takes yer pick".

John Gough on Software Tools

Thursday, February 10, 2011

YACC and the Perils of Operator Precedence