John Gough on Software Tools: Shift-Reduce Parser Refresher

This page is an almost verbatim extract from appendix-9 of the GPPG documentation, which may be downloaded from the GPPG page on CodePlex.
http://gppg.codeplex.com/
Although this page refers specifically to GPPG, it is applicable to any YACC-like parser generator.

Introduction

A grammar is a set of rules that, for a particular symbol alphabet, determines which ordered sequences of symbols form valid sentences in the language that the grammar specifies.

Parsers are automata that recognize the valid sentences of a given grammar. Given an arbitrary symbol sequence as input, a parser must answer yes or no according to the legality or otherwise of that sequence.

Phrase-structured grammars declare a set of syntactic rules that recognize fragments of a sentence as belonging to particular syntactic categories. In phrase structured grammars for natural languages the syntactic categories correspond to concepts such as "noun", "verb" and ``adjectival clause of reason''. In a programming language syntactic categories might be "assignment statement", "variable declaration" and so forth.

Among all classes of phrase-structured grammars the most useful for formal languages are the context-free grammars. The syntactic rules of a context free grammar have the special form

N : S ;

where N is a syntactic category, and S is a sequence of zero or more symbols each of which may denote either an alphabet symbol or a syntactic category. For this example rule we say "N derives S" or equivalently "S may be reduced to N". We call such rules productions. Such rules are context-free because the sequence S may always be reduced to N, without regard for the context, that is, without regard for what comes before or after the input fragment that corresponds to S.

There is a special member of the set of syntactic categories that we designate as the goal symbol G. We may generate all the sentences of a language by starting with the goal symbol and progressively replacing any syntactic categories with the right-hand-side of a corresponding production rule. When we have a sequence of only alphabet symbols, that sequence is a sentence in the language. We say the particular choice of substitution steps has derived the sentence in the language. Each intermediate state of the sequence as the substitution steps are carried out is called a sentential form. Until the final substitution is made the sequence is not a sentence, but at all stages it has the form of a sentence subject only to the expansion of any remaining syntactic placeholders.

Because of this concept of a derivation terminating when there are only alphabet symbols remaining we call the alphabet symbols terminal symbols of the grammar. Conversely, the syntactic categories are non-terminal symbols.

There are two parsing strategies for context-free grammars. One is to start with the goal symbol and to try to find a sequence of derivation steps that matches up with the input sequence. Parsers that work this way as called top-down parsers. "Recursive descent" is the most widely used top-down parsing technique.

The second parsing strategy, bottom-up parsing, attempts to match fragments of the input sequence with the right-hand-sides of production rules. When a fragment is matched it is replaced by the single, non-terminal symbol of the rule's left-hand side. This substitution step is called reduction.

Shift-reduce parsers, as generated by GPPG, or any other YACC-like parser generator, implement the matching of production right-hand sides with the help of a "push-down" stack structure. The input is read symbol by symbol, and at each step a decision is made whether to push the (terminal) symbol on the stack, or to recognize the top n symbols on the stack as matching the right-hand-side of a particular production. If a symbol is pushed on the stack, we have performed a shift. If a right-hand-side of length n is recognized we reduce by popping the topmost n symbols from the stack and push the corresponding left-hand-side non-terminal symbol. If this process leads to the stack holding just the goal symbol a sentence has been recognized.

It is one of the classic results of computer science that the set of sequences of symbols that indicate the applicability of each reduction rule belong to a regular language. That is: the viable prefixes of each reduction may be recognized by a finite state machine (FSA).

The specification for a GPPG parser declares the production rules of the grammar. Specifications usually also declare computations that are performed step by step as each pattern is recognized and the reduction performed. These semantic actions compute attributes of the production left-hand-side symbol in terms of the attribute values of the production right-hand-side symbols.

A GPPG-generated parser does not actually maintain a stack of terminal and non-terminal symbols. Instead is has three stacks to guide its operation. The first stack is a stack of parser states, which are integer values. These values are the states of the FSA that decides whether to shift or reduce, and if reducing, by which production. The second stack is the stack of attribute values for the values that would be on the symbol stack, if there was one. The third stack is a stack of text spans corresponding to the symbols that would be on the symbol stack, if there was one. This last stack is useful for semantic actions that require the retrieval of position information or text strings.

How Shift-Reduce Parsing Works

At each step in the parsing process a shift-reduce parser examines the state on the top of the state stack, and the identity of the next input symbol. The next input symbol, fetched from the associated lexical scanner, is called the lookahead symbol.

Each state of the FSA corresponds to a particular finite set of partially recognized production right-hand-sides. These partially recognized rules are called production items of the grammar. Production items are just right-hand-sides annotated with a "dot" to show how many of the symbols of the rule have been seen so far. The /report option of GPPG generates an html file that shows all of the items for each state of the automaton. Many state have a single kernel item, but others may have several. For an example of kernel items see the discussion in "Precedence" section of the complete GPPG documentation.

The runtime representation of each parser state stores a list of actions to be taken in that state for each possible lookahead symbol. There is also a list showing the next state value after pushing all the possible terminal or non-terminal symbols.

For any particular combination of top-of-stack state and lookahead symbol the parsing engine may decide to shift, reduce or signal an error.

Shift Action

If a shift is chosen the FSA makes a transition from the current state to a new state. The new state is pushed on the state stack. If the scanner has placed any symbol attribute information in the yylval field this is pushed onto the semantic value stack. Finally, if the scanner has placed location information in the yylloc field this is pushed onto the location stack.

Reduce Action

If the parser chooses a reduction by a particular production, then the following steps are taken. First, if the production specifies any semantic action then this computation is performed. If the chosen production has n symbols in its right-hand-side, then the semantic action may use the topmost n values of the semantic value and location stacks.

Following the execution of the semantic action, the topmost n elements are popped from all three stacks. The left-hand-side of the chosen reduction must now be "pushed" onto the stack. The preceeding popping of the stack will have exposed a new top-of-stack state, and the information of that state determines what the new state will be following the push of the production left-hand-side non-terminal symbol. At the same time any semantic value and location information computed by the semantic action are pushed onto the semantic value and location stacks.

Error Action

If there is no possible reduction or shift action for a given state, lookahead
symbol combination then the yyerror method is called.

Loop Forever

After a shift or reduce action, if the new top of stack state corresponds to the goal symbol, and the lookahead symbol is "end of input" the parser returns true. Otherwise, if the lookahead symbol has been consumed by a shift a new input symbol is fetched.

The parser continues by making a new shift-reduce decision based on the new top-of-state stack state value and the (possibly new) lookahead symbol.

What Can Go Wrong?

Not every context-free grammar can be successfully recognized by a GPPG parser. This final section considers some of the things that can prevent successful generation of a parser.

First, it must be said that some grammars cannot be recognized by any parser that does not back-track. Furthermore, there are grammars that define sentences that have multiple derivations, that is, sentences that are ambiguous. Finally, there are grammars which inherently require more than one symbol of lookahead to make correct parsing decisions.

In practice, when a grammar is submitted to GPPG, or any other shift-reduce parser-generator, the tool may notify the user of various conflicts. These arise when the states of the parser have more than one possible action for some state-lookahead combination. There are two kind of conflict: a shift-reduce conflict is notified if a particular state has both
a possible shift action and a possible reduction. The occurrence of a shift-reduce conflict is not fatal, and by default GPPG parsers always shift in preference to reducing. Shift-reduce conflicts are notified to the user so that it may be checked that the default, shift action does lead to a correct result.

The second kind of conflict is a reduce-reduce conflict. Such a conflict arises if there are two or more production items in the state that signify a possible reduction action. Reduce-reduce conflicts tend to be more serious, and it is often necessary to modify a grammar to remove such conflicts to obtain correct behavior.

GPPG has advanced facilities for conflict reporting, and special help for diagnosing the cause of such conflicts. These are aimed at helping the user to successfully remove such conflicts. Consult the complete GPPG documentation for the details..

There is also a facility to produce parsers that emit a trace of the parser actions for each input. For a suitable small input sample, this allows a step by step understanding of the behavior of the parser, particularly when used in conjunction with the human-readable state tables produced by the /report option.