Thursday, March 22, 2012

Using Reference Semantic Value Types with GPPG

All the examples that ship with GPPG, and all the parsers in the distribution, use .NET value types for the semantic value type.  For very simple situations, the use of the %union construct is convenient.  In particular, if the types of the non-terminal symbols have been declared in the *.y file, then the parser will automatically select the appropriate field from the struct that implements the "union" type.

The name of the semantic value type may also be declared using the construct -
%YYSTYPE typename
where typename is the name of the type.  When a reference type is used, it is most commonly an abstract class type, with the actual values being instances of concrete subtypes of this class.

The use of such a family of class types is almost always the correct approach when the purpose of the parser is to construct an abstract syntax tree (AST).  The following is yet another version of the calculator example, two versions of which ship with the GPPG distribution.  In this version, the parser builds an AST for each of the expressions.  These tree representations of the expressions are evaluated on demand, much in the fashion of C# lambda expressions.

The first thing that has to be said about using a reference YYSTYPE is that the value stack in the parser will, by default, hold nothing but null values. You will be swiftly reminded of this if you try to access any fields of the top-of-stack value.  The idea is that the semantic actions of the parser will build the tree during the parse.  The semantic actions will therefore be calls to factory methods that create new tree nodes and hook the sub-trees together.  Here is the grammar definition. It is derived from the grammars of the calculator programs that ship with GPPG.
The structure of the RealTree.Node type is hinted at in the names of the semantic action methods.  Node is an abstract class, and there are exactly three derived classes - Binary, Unary and Leaf.  Each object of node type contains a tag, which is a value from the NodeTag enumeration.  The purpose of these tags is to indicate whether (say) a Binary node represents an addition, subtraction, multiplication and so on.

All of the rest of the code is in another file, named "RealTreeHelper.cs".  This file contains the definitions of the Node classes, and the user-written part of the Parser class, including a simple handwritten scanner.  In particular, the parser class contains the static factory methods that invoke the node constructors.
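The class family and the factory methods described above might look roughly like the following. This is only a sketch, not the contents of RealTreeHelper.cs: the members of NodeTag, the property names, the NodeFactory class and the method names MakeBinary, MakeUnary, MakeLiteral and MakeName are all assumptions made for illustration.

```csharp
using System;

// Hypothetical tag set; the real NodeTag enumeration may differ.
public enum NodeTag { Literal, Name, Plus, Minus, Times, Divide, Negate }

// Abstract base of the family. Every node carries a tag.
public abstract class Node {
    public NodeTag Tag { get; }
    protected Node(NodeTag tag) { Tag = tag; }
}

public sealed class Leaf : Node {
    public double Value { get; }   // meaningful when Tag == NodeTag.Literal
    public char Name { get; }      // meaningful when Tag == NodeTag.Name
    public Leaf(double value) : base(NodeTag.Literal) { Value = value; }
    public Leaf(char name) : base(NodeTag.Name) { Name = name; }
}

public sealed class Unary : Node {
    public Node Operand { get; }
    public Unary(NodeTag tag, Node operand) : base(tag) { Operand = operand; }
}

public sealed class Binary : Node {
    public Node Left { get; }
    public Node Right { get; }
    public Binary(NodeTag tag, Node left, Node right) : base(tag) {
        Left = left;
        Right = right;
    }
}

// Stand-in for the user-written part of the Parser class (assumed name).
// Each factory simply invokes the matching node constructor.
public static class NodeFactory {
    public static Node MakeBinary(NodeTag tag, Node lhs, Node rhs) {
        return new Binary(tag, lhs, rhs);
    }
    public static Node MakeUnary(NodeTag tag, Node child) {
        return new Unary(tag, child);
    }
    public static Node MakeLiteral(double value) { return new Leaf(value); }
    public static Node MakeName(char name) { return new Leaf(name); }
}
```

With factories of this shape, a semantic action for an addition production would amount to a single call such as MakeBinary(NodeTag.Plus, lhs, rhs), with the two operands taken from the value stack.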

The goal symbol of the grammar is "list".  This is simply a list of statements, and has an error recovery action so that all errors will return to a state that is expecting the start of another statement.

At runtime the parser maintains a list of 26 variables, named 'a' to 'z', which may be referred to by their case-insensitive names. All of these registers may be cleared, or their values printed to the console.  The interesting productions are those that assign expressions to variables, or evaluate an expression to a floating point double value and print it to the console.  The scanner creates a Leaf node value whenever it scans a literal value or a variable name.  These leaf nodes will have a tag value of NodeTag.literal or of the corresponding name tag, respectively.
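A register table of this kind could be sketched as below. The Registers class and its members are hypothetical, and the class is generic only so that the sketch stands alone; in the calculator the slots would hold Node values.

```csharp
using System;

// Hypothetical table of 26 slots, addressed by a case-insensitive letter.
public sealed class Registers<T> where T : class {
    private readonly T[] slots = new T[26];

    // Map 'a'..'z' or 'A'..'Z' to an index in 0..25.
    private static int IndexOf(char name) {
        char lower = char.ToLowerInvariant(name);
        if (lower < 'a' || lower > 'z')
            throw new ArgumentOutOfRangeException(nameof(name));
        return lower - 'a';
    }

    // An unassigned register reads as null.
    public T this[char name] {
        get { return slots[IndexOf(name)]; }
        set { slots[IndexOf(name)] = value; }
    }

    // Reset every register to null.
    public void Clear() {
        Array.Clear(slots, 0, slots.Length);
    }
}
```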

The expression productions are fairly standard, and create an abstract syntax tree for the recognized expression.  An interesting case is the second production for statement.  In this case the expression is evaluated immediately, so that the semantic value that is produced is a literal value Leaf node, rather than an expression tree value.

The complete program example consists of the grammar (*.y) file, the RealTreeParser.cs file that GPPG produces from it, and the RealTreeHelper.cs file that contains all the handwritten code. All three are zipped together, and available for download from the CodePlex GPPG page.  The whole needs to be compiled together with the ShiftReduceParserCode.cs library.

Here is a snippet from the helper file --

The Node base class defines two abstract methods: Eval and Unparse.  Eval recursively traverses the given tree, returning a floating point double. Unparse recursively traverses the given tree, constructing a string that gives a fully parenthesized representation of the expression.  The difference between the traversals is just this: during Unparse, when the traversal reaches a name leaf it prints the name of the variable. When Eval reaches a name leaf, the traversal evaluates the tree rooted at the variable.

Suppose we have entered the statements -
a = 42; b = 24; c = a + b;
then the print command will write
'c' = (a+b)
while the eval c command will write 
result = 66.
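The two traversals can be sketched as follows. This is a minimal illustration rather than the code from the helper file: the static Vars dictionary stands in for the parser's variable table, operators are represented by a plain char instead of a NodeTag, the Unary class is omitted, and the Demo class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public abstract class Node {
    // Hypothetical stand-in for the parser's table of variables.
    public static readonly Dictionary<char, Node> Vars = new Dictionary<char, Node>();

    public abstract double Eval();
    public abstract string Unparse();
}

public sealed class Leaf : Node {
    private readonly bool isName;
    private readonly char name;
    private readonly double value;

    public Leaf(double value) { this.value = value; }
    public Leaf(char name) { this.name = name; isName = true; }

    // A name leaf evaluates the tree currently bound to the variable.
    // (An unassigned variable makes the dictionary lookup throw.)
    public override double Eval() {
        return isName ? Vars[name].Eval() : value;
    }

    // A name leaf prints the variable's name, not its value.
    public override string Unparse() {
        return isName ? name.ToString() : value.ToString(CultureInfo.InvariantCulture);
    }
}

public sealed class Binary : Node {
    private readonly char op;
    private readonly Node lhs, rhs;

    public Binary(char op, Node lhs, Node rhs) {
        this.op = op; this.lhs = lhs; this.rhs = rhs;
    }

    public override double Eval() {
        double l = lhs.Eval(), r = rhs.Eval();
        switch (op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            case '/': return l / r;
            default: throw new InvalidOperationException("unknown operator " + op);
        }
    }

    // Fully parenthesized form of the sub-expression.
    public override string Unparse() {
        return "(" + lhs.Unparse() + op + rhs.Unparse() + ")";
    }
}

public static class Demo {
    // Builds the trees for: a = 42; b = 24; c = a + b;
    public static Node BuildC() {
        Node.Vars['a'] = new Leaf(42.0);
        Node.Vars['b'] = new Leaf(24.0);
        Node.Vars['c'] = new Binary('+', new Leaf('a'), new Leaf('b'));
        return Node.Vars['c'];
    }
}
```

On the tree returned by Demo.BuildC(), Unparse() yields the string (a+b) and Eval() yields 66, which matches the session shown above.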

It is easy to construct an evaluation that fails by trying to evaluate a circular definition, so it is necessary to guard against this.  During evaluation each node is marked when traversal reaches the node, and unmarked when the recursion leaves. This detects circular definitions, and prevents stack overflows during evaluation. Here is a simple example -
a = 42; a = a + 1; eval a
fails because of the circularity of the second statement. On the other hand -
a = 42; a = eval (a+1); eval a
succeeds, and returns 43, as expected. In this case the second command evaluates (a+1) and deposits the new value 43 as a literal in variable a.  In the circular version the second command overwrites the initial value of a with the expression tree (a+1).
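The mark-and-unmark guard can be sketched like this. It is an illustration under stated assumptions, not the code from the helper file: the field name marked, the EvalCore method, the choice of InvalidOperationException, the static Vars dictionary and the Demo class are all hypothetical, and Binary handles only addition to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;

public abstract class Node {
    // Hypothetical stand-in for the parser's table of variables.
    public static readonly Dictionary<char, Node> Vars = new Dictionary<char, Node>();

    private bool marked;

    // Mark on entry, unmark on exit. Reaching a node that is
    // already marked means the definition refers back to itself.
    public double Eval() {
        if (marked) throw new InvalidOperationException("circular definition");
        marked = true;
        try { return EvalCore(); }
        finally { marked = false; }
    }

    protected abstract double EvalCore();
}

public sealed class Leaf : Node {
    private readonly bool isName;
    private readonly char name;
    private readonly double value;

    public Leaf(double value) { this.value = value; }
    public Leaf(char name) { this.name = name; isName = true; }

    protected override double EvalCore() {
        return isName ? Vars[name].Eval() : value;
    }
}

// Addition only, to keep the sketch short.
public sealed class Binary : Node {
    private readonly Node lhs, rhs;
    public Binary(Node lhs, Node rhs) { this.lhs = lhs; this.rhs = rhs; }

    protected override double EvalCore() {
        return lhs.Eval() + rhs.Eval();
    }
}

public static class Demo {
    // a = 42; a = a + 1; eval a
    public static bool CircularFails() {
        Node.Vars['a'] = new Leaf(42.0);
        Node.Vars['a'] = new Binary(new Leaf('a'), new Leaf(1.0)); // tree replaces 42
        try { Node.Vars['a'].Eval(); return false; }
        catch (InvalidOperationException) { return true; }
    }

    // a = 42; a = eval (a+1); eval a
    public static double EagerSucceeds() {
        Node.Vars['a'] = new Leaf(42.0);
        Node tree = new Binary(new Leaf('a'), new Leaf(1.0));
        Node.Vars['a'] = new Leaf(tree.Eval()); // store the value, not the tree
        return Node.Vars['a'].Eval();
    }
}
```

Unmarking in a finally block means that a failed evaluation leaves no node marked, so later commands still see clean trees.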

In any case, the purpose of the example is to demonstrate the use of a family of classes as the semantic value type for parsers that build abstract syntax trees.  Enjoy!