Sunday, December 25, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-2)

[This is a revised edition of the part-2 post.  An error in the LookaheadHelper code has been corrected, and the example code simplified.  The revised code is also available on CodePlex.]

In the last post a simple method for looking forward in the token stream of a GPPG parser was described.  The key concept was to save the scanner input buffer position, read tokens to resolve the syntactic ambiguity, then restore the input position and continue.

There are several limitations to this simple-minded approach.  Firstly, the example relied on being able to restore the scanner state precisely when the input position is backed up.  In cases where the scanner has several different start states, this means that extra information needs to be saved and restored.  Secondly, in the case that long lookaheads occur, all of the work done by the scanner is lost when the input is backed up.  If the scanner creates objects, including instances of the yylloc and yylval types, then this work will be repeated.

In this blog post a different technique is shown.  In this version the scanner defines and produces objects that encapsulate the result of any lookahead-requested call of yylex.  Each of these objects contains three data items: the integer-valued token result, the semantic value yylval, and the location value yylloc.  These objects, as well as being delivered to the decision code that is doing the lookahead, are also added to an input buffer queue.  When backup is performed, the queue is simply handed back to the scanner.  Subsequent calls of yylex dequeue the saved values, until the queue is exhausted and normal scanning resumes.

In order to give a more realistic setting, a modification of the grammar fragment from part-1 ensures that we really do need an unbounded lookahead.  Instead of expressions (and the left hand sides of assignments) starting with a bare identifier, they begin with a designator.  The designator may have member selection '.' and array indexing.  Hence we must scan forward until we hit either a colon or a "dotdot", implying we have a case selector; or an "assign" symbol, implying that we are still in the same case statement sequence.  Here is the business part of the new grammar.

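In outline it looks something like the following - a sketch only, with illustrative token and non-terminal spellings (DUMMY, DOTDOT, ASSIGN and so on) rather than a faithful copy of the CodePlex example:

case_element_seq
    : case_element
    | case_element_seq DUMMY case_element    /* DUMMY injected by the lookahead */
    ;
case_element
    : case_label_list ':' stat_list_for_case
    ;
stat_list_for_case
    : statement ';'                          { DoCaseLookahead(); }
    | stat_list_for_case statement ';'       { DoCaseLookahead(); }
    ;
case_label_list
    : case_label
    | case_label_list ',' case_label
    ;
case_label
    : expression
    | expression DOTDOT expression           /* subrange label "x..y" */
    ;
statement
    : designator ASSIGN expression
    | /* ... other statement forms ... */
    ;
designator
    : IDENT
    | designator '.' IDENT                   /* member selection */
    | designator '[' expression ']'          /* array indexing */
    ;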

This grammar fragment, as in part-1, shows the location of the semantic action that is used to trigger the lookahead ad hoc code.  It also shows the location of the dummy symbol that is injected to terminate the current case_element phrase.

Then there is the infrastructure that is used to buffer and unbuffer the lexical data.  Here is the declaration of the class ScanObj.  This can be included inline in the *.lex file, or in a separate LookaheadHelper file.  The following is from a separate file.

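A minimal sketch of the class follows, assuming the default GPPG location type LexLocation; "SemanticValueType" stands in for whatever yylval type the grammar declares:

public class ScanObj
{
    public int token;                    // the token ordinal returned by yylex
    public SemanticValueType yylval;     // the semantic value for the token
    public LexLocation yylloc;           // the location value for the token

    public ScanObj(int token, SemanticValueType yylval, LexLocation yylloc)
    {
        this.token = token;
        this.yylval = yylval;
        this.yylloc = yylloc;
    }
}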

And here is the scanner prolog that handles the queue within the scanner's Scan method.

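In outline it might look like this.  The first few lines are the prolog proper; the two queue-management methods, which can live in the LookaheadHelper file, follow.  The names tokenQueue, SaveScannerState and GetAndEnqueue follow the text; everything else is illustrative:

// The prolog: these lines run at the start of every call of Scan.
// (Requires a scanner field:  Queue<ScanObj> tokenQueue;
//  and:  using System.Collections.Generic;)
if (tokenQueue != null && tokenQueue.Count > 0)
{
    ScanObj saved = tokenQueue.Dequeue();    // replay a buffered token
    yylval = saved.yylval;
    yylloc = saved.yylloc;
    return saved.token;
}

// Called from the parser before a lookahead scan begins.
public void SaveScannerState(int nextTokenToSave)
{
    Queue<ScanObj> newQueue = new Queue<ScanObj>();
    if (nextTokenToSave != 0)      // save the parser's current lookahead token
        newQueue.Enqueue(new ScanObj(nextTokenToSave, yylval, yylloc));
    if (tokenQueue != null)        // leftover tokens from a previous lookahead
        while (tokenQueue.Count > 0)
            newQueue.Enqueue(tokenQueue.Dequeue());
    tokenQueue = newQueue;
}

// Fetch the next token, remembering it in the queue for later replay.
public int GetAndEnqueue()
{
    int token = yylex();           // note: this may itself replay buffered tokens
    tokenQueue.Enqueue(new ScanObj(token, yylval, yylloc));
    return token;
}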

The prolog code is inserted in the *.lex file right after the first "%%" marker, and before the first lexical production.

We can now draw attention to some of the features of the previous code snippet.  Firstly, this code is a very general example.  The enqueuing and dequeuing code allows for the possibility that there may be several ad hoc lookahead locations in the same parser, and that one ad hoc lookahead may consume the enqueued tokens of a previous lookahead.  Note particularly the code that initializes the new token buffer: if the existing token buffer queue is not empty, the leftover tokens in the queue must be added to the end of the newly constructed queue.  There are other structures that achieve the same effect.

Note also that the call to SaveScannerState takes an argument called nextTokenToSave.  In part-1 we made use of the fact that for this particular example the NextToken value would be empty when the ad hoc lookahead was called.  This allowed us to overwrite NextToken to inject the dummy token without any further care.  In general, this is not the case, so the code given above saves the argument if it is non-zero, so that it will be refetched either immediately, or immediately after the dummy token has been consumed.

This infrastructure is shared for all such ad hoc lookaheads, requiring only customization for the various instantiations of the yylloc and yylval types, and the user-chosen scanner namespace and class names.

The bodies of the semantic actions that are called by the parser all conform to the same pattern:
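A sketch of the shape, with invented method names:

private void DoXxxLookahead()
{
    Scanner scnr = (Scanner)this.Scanner;
    scnr.SaveScannerState(NextToken);    // initialize the push-back queue
    NextToken = XxxLookaheadScan(scnr)
        ? (int)Tokens.DUMMY              // terminate the current phrase
        : 0;                             // force a refetch, now from the queue
}
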
For our running example the method would be DoCaseLookahead, and the body of the lookahead would be the method CaseLookaheadScan.

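The decision code might read as follows - the token names COLON, DOTDOT and ASSIGN are placeholders for the grammar's real token ordinals:

private bool CaseLookaheadScan(Scanner scnr)
{
    for ( ; ; )
    {
        // Fetch via GetAndEnqueue, so every token is buffered for replay.
        switch ((Tokens)scnr.GetAndEnqueue())
        {
            case Tokens.COLON:
            case Tokens.DOTDOT:
                return true;     // a case selector follows: inject the dummy
            case Tokens.ASSIGN:
                return false;    // still in the current statement sequence
            case Tokens.EOF:
                return false;    // let the parser deal with the error
            default:
                break;           // keep scanning
        }
    }
}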

Note carefully that the lookahead scan must use GetAndEnqueue to fetch tokens, rather than calling yylex directly.  This is to ensure that all the tokens get safely enqueued in the push-back buffer.

Summary

This may look like a lot more code than the example in part-1.  However, most of the code is infrastructure, which is shared.  Furthermore, this method may be used even with overlapping lookaheads, and in cases where the scanner changes start state during a lookahead sequence.  For almost all scenarios this will be more efficient, since the semantic and location value objects created during a lookahead will not be wasted.

Is there anything this second version cannot do?  Well, there is one limitation: grammars where the parser and scanner are tightly linked, with the parser changing the scanner start state by calling the scanner's BEGIN method, cannot use this method.  This is a very good reason to avoid such program structures.

The only really tricky thing about this method is figuring out how to write a correct "ExampleLookaheadScan" method.  In our present case the occurrence of particular tokens in the lookahead is definitive.

In the final part of this sequence I shall take a really difficult example that arises in the parsing of C#.  A statement sequence may have a local-variable-declaration or a statement-expression, among other possibilities.  Both of these may start with an identifier and it requires an unbounded lookahead to disambiguate the two.  In this case there are no disjoint sets of tokens that separate the two cases and stronger measures are required. 

The complete code of the example in the current blog entry will go up on the gppg page on CodePlex in the next few days.

Tuesday, December 13, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-1)

Gardens Point Parser Generator (GPPG) and other similar parser generators are limited to a one-symbol lookahead.  However, many programming languages have a few places in their grammars that require some extra help.  In some cases it is possible to get around the issue with a little transformation of the grammar, but in other cases it is necessary to resort to stronger measures.  Of course if you own the language that you are parsing, the simplest response is to modify the language to make the problem go away, and no-one gets hurt. If somebody else owns the language then the toolkit gets a bit more exercise.

This blog deals with one of those stronger methods, where it is not possible to state in advance the number of symbols that have to be read to decide between two continuations.

Our running example is a grammar for a Pascal-like language which has a problem with the CASE statement, roughly equivalent to switch in C-like languages. Here is the base grammar with most of the irrelevant bits missing ...
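A reconstruction of its shape - the non-terminal names here are illustrative, and the full grammar is in the CodePlex example:

case_statement
    : CASE expression OF case_element_seq END
    ;
case_element_seq
    : case_element
    | case_element_seq case_element
    ;
case_element
    : case_label_list ':' statement_list
    ;
case_label_list
    : IDENT
    | case_label_list ',' IDENT
    ;
statement_list
    : statement ';'
    | statement_list statement ';'
    ;
statement
    : IDENT ASSIGN expression
    | /* ... other statement forms ... */
    ;
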
So, here is the problem - we are parsing the following string and have just read the semicolon, with the identifier "z" as the lookahead symbol ...
           CASE foo OF x,y : a := b; z ...
How do we know whether the 'z' belongs to the current case_element (maybe the start of another assignment statement) or is the start of a new case_label_list?  Well, we can't tell with one symbol of lookahead.  If we read forward and find a comma or a colon it's a case label; if we find an assignment marker ":=" then it's the start of an assignment. For plausible generalizations of the assignment left-hand-side grammar the number of lookahead symbols is unbounded.

It is for this reason that most Pascal-like grammars have a vertical bar '|' as separator between case elements, and C-family languages use the keyword case for every label.  However the grammar, as given, is not ambiguous.  Read enough lookahead symbols and the problem is solved.

When a slightly filled-out version of this grammar is fed to gppg there is a shift-reduce conflict in one state.  The diagnostic output produced by the /report /verbose options shows that for the state in question, when the lookahead is IDENT, the identifier will be shifted by default, although the reduction "case_label_list ':' statement_list -> case_element" is also valid.  This is the only conflict.

As an aside, note that things could be worse.  Pascal-family languages normally use the semicolon as a separator between statements, rather than as a statement terminator. The base grammar above uses the semicolon as a terminator.  If we simply used it between elements of a statement list then we would have an obvious problem with the string -
         CASE foo OF x,y : a:= b - z ...
Is the "-z" part of the assignment right-hand-side, or the first label of the next case element?

So now we come to the real question: how can we hack on the parser (and scanner) to resolve such problems with the one-symbol lookahead?

The idea is as follows: when the conflict location is reached jump to user-written code that scans forward in the input stream until the matter is decided. Point the parser in the right direction, reset the scanner buffer-position and resume parsing as normal.

The first step is to find a way of inserting the call to the user code in a semantic action.  For our example, whenever a statement list is being parsed in the context of a case statement we should do the ad hoc lookahead.  So we specialize statement_list as stat_list_for_case.

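The specialization looks something like this; the exact form of the call in the semantic action is a sketch, and depends on how the scanner object is made visible to the parser:

case_element
    : case_label_list ':' stat_list_for_case
    ;
stat_list_for_case
    : statement ';'                     { ((Scanner)Scanner).SaveStateAndDoLookahead(this); }
    | stat_list_for_case statement ';'  { ((Scanner)Scanner).SaveStateAndDoLookahead(this); }
    ;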

Now our SaveStateAndDoLookahead method is called every time a new statement is added to the statement list of a case statement.

And what does the method do?  Well, because of some issues with the visibility of various scanner and parser variables we split the code between the scanner and parser.  The scanner code saves scanner state, calls the parser code and restores the scanner state on return.
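In sketch form - the fields saved here (the current character code, the line and column counts, and the buffer position) stand in for whatever state a particular generated scanner actually keeps; see the footnote at the end of this post:

public void SaveStateAndDoLookahead(Parser parser)
{
    int savedChr  = chr;            // the current "lookahead" character
    int savedLine = line;           // the current line count
    int savedCol  = col;            // the current column count
    int savedPos  = buffer.Pos;     // the input buffer position
    parser.DoAdHocLookahead();      // read ahead and point the parser
    buffer.Pos = savedPos;          // back up the input ...
    chr  = savedChr;                // ... and restore the scanner state
    line = savedLine;
    col  = savedCol;
}
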
This code calls the Parser method DoAdHocLookahead. This method does the heavy lifting. [However, see the footnote for a bit more on what "save scanner state" implies.]

DoAdHocLookahead will repeatedly call yylex until a decision may be made.  First of all however note that the conflict only arises if the next symbol fetched is an identifier.  For our simple case we exit the loop as soon as we find a symbol that can belong to an assignment but not to a case label list, or to a case label list but not to an assignment.  Simple.  But how do we "point the parser in the right direction" before parsing resumes?

The simplest way of forcing the parser to do the right thing is to overwrite the parser's NextToken variable. We use a dummy symbol that otherwise does not occur in the grammar, and rewrite the grammar to rely on that symbol to separate the case_element items.
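Putting the pieces together, the method might read as follows - again a sketch, with COMMA, COLON, ASSIGN and DUMMY as placeholder token names:

public void DoAdHocLookahead()
{
    if (Scanner.yylex() != (int)Tokens.IDENT)
        return;                     // the conflict only arises on an identifier
    for ( ; ; )
    {
        switch ((Tokens)Scanner.yylex())
        {
            case Tokens.COMMA:
            case Tokens.COLON:
                NextToken = (int)Tokens.DUMMY;   // a new case_label_list follows
                return;
            case Tokens.ASSIGN:
            case Tokens.EOF:
                return;             // still in the current statement list
            default:
                break;              // keep munching
        }
    }
}

The scanner wrapper restores the buffer position when this method returns, so every token consumed here will be scanned again during normal parsing.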
So that is about all there is to it.  Overwriting the NextToken value in the current version of GPPG requires diving into the code of ShiftReduceParser to make the NextToken variable accessible. The next refresh of the tool will give the variable protected accessibility to facilitate this technique.

The complete code for this example is available at http://gppg.codeplex.com

Part two will deal with a much harder example: one of the classic syntactic ambiguities of C#.

Footnote: What "save scanner state" implies, and various disclaimers.
In the example here the saved state in the scanner is the current "lookahead" character, the current line and column count, and the buffer position. This works for this particular scanner, but others may need more.  In particular, if a scanner has a number of start states, the start state might need to be saved and restored also.  Using the YY_START property is the easiest way of doing this, since it automatically resets the start ordinal and the start state. Of course, for a scanner that has a stack of start ordinals the whole stack needs to be saved and restored.

Furthermore, if the scanner detects and reports lexical errors, errors that are backed-up over may be reported twice.  Thus if you wish to be really picky, you might want to save the high tide on the error buffer and restore that as well.

Alternative strategies will be discussed in the second part.



Wednesday, November 16, 2011

Another Trick for Debugging GPCP Programs

Programs compiled using the CLR version of Gardens Point Component Pascal can be debugged using the Visual Studio debugger, as described in an earlier blog.

This short blog deals with a little trick to exploit one of the "features" of Visual Studio (VS).

When a program being debugged stops at a breakpoint, the local variable window of VS lists the variables that are in scope in the selected frame of the call chain.  Other fields of the window show the declared type of the variable, and its concrete (actual) type. There is also a field that shows the value of the variable.  This is handy for numeric scalars, for example.

However, for other types Visual Studio fills in the value field by calling the virtual ToString() method on the object. The "miranda" option, for types that do not have a ToString of their own, is to use the method inherited from Object, which returns the name of the concrete type. This is what happens by default for Component Pascal objects.

So, the question is, can we make gpcp types display useful value information in the VS value field? The answer is, of course, yes.  But there is a trick to it.  First a couple of things that don't quite work.

Suppose that we have a pointer to record type, with no declared super type.

TYPE Foo = POINTER TO RECORD ... END;

The super type will be ANYREC, which is implemented as System.Object in the CLR version of gpcp.  However, the compiler will not import the inherited System.Object methods so attempts to override the ToString method will fail.

PROCEDURE (f : Foo)ToString() : Sys.String;

will get a "this is not an override, you must use NEW" error.  And if you try

PROCEDURE (f : Foo)ToString() : Sys.String,NEW;

the method actually will be "new", and will not override the virtual Object method. The trick is to make the compiler read in the inherited methods.  The following works:

IMPORT Sys := mscorlib_System, ... ;
...
TYPE Foo = POINTER TO RECORD (Sys.Object) ... END;
...
PROCEDURE (f : Foo)ToString() : Sys.String;
BEGIN ... END ToString;

Using RTS.NativeObject instead of Sys.Object and RTS.NativeString instead of Sys.String works also.

There is another place where the implicit use of ToString() occurs in the .NET framework.  The format methods like String.Format and WriteLine take a format string and an array of Object.  Each object in the array is converted with ToString and substituted into the numbered position in the format string.

Thus to access the format methods of the .NET libraries it is necessary to insert the values into an array of system objects.  The rules are as follows:
  • For record and pointer to record types, declare a ToString method, as described.
  • For scalar objects, like integers or reals, it is necessary to resort to the manual method of calling Sys.Convert.ToString on the scalar value.  C#, which has implicit boxing, hides this away, but Component Pascal's BOX extension does not apply to scalars.
Here is an example of a ToString method for a Coord type which has x and y values and an optional name s for each point.  We assume that the name is an open array of char, RTS.CharOpen.

PROCEDURE (pt : Coord)ToString() : Sys.String;
  CONST format = "{0}{{{1},{2}}}";
  VAR tag : Sys.String;
BEGIN
  IF pt.s # NIL THEN tag := MKSTR(pt.s^) ELSE tag := "anon" END;
  RETURN Sys.String.Format(format,
    tag, Sys.Convert.ToString(pt.x), Sys.Convert.ToString(pt.y));
END ToString;

Note that in this example we are using the fact that there is a Format method that takes three single objects following the string. We could have used the (string, object[]) version instead, and would have had to use the array version if there were more than three objects to fit in the format.

The Coord objects will now display helpfully in VS, and can in turn be directly used in other format strings.

Wednesday, April 27, 2011

Why a Regular Expression Engine is not a Scanner Generator

(Updated 17 November 2011.  Regular expressions corrected to use backslash instead of slash.)

I enjoy getting queries from the users of my software. Sure, it takes up time but very often the enquiry highlights a common misconception, sometimes a point that should be clarified in the documentation, and sometimes (hiss, boo) reports a real bug.

A couple of weeks ago I received a query about GPLEX that brought a couple of important issues into focus. The user was creating a scanner, and was having trouble with the generated scanner not accepting certain apparently legal literal string tokens. The grammar for literal strings had single quote characters ' as delimiters, and escaped single quote characters with the "backslash" character. Naturally enough literal backslashes were also backslash escaped. The user had tried the "obvious" regular expression (RE)
 '(\\.|[^'])*'
which, in words, reads "single quote character followed by zero or more repetitions of EITHER a backslash followed by any character except newline OR any character except single quote, with the whole string terminated by another single quote character".
This did not work on a set of simple test strings, so the user had tried a sequence of progressively more complicated RE without success.

So, the real query was:
  • Why does not the obvious RE work?
  • User had tried the obvious RE against the .NET RE engine and all was ok.
  • Rumour has it that GPLEX uses its own RE-engine, could this be the problem?
Well, GPLEX does have a clean-room RE-engine and hell, I have been known to have bugs in my software, so I figured I had better have a close look. I fired up GPLEX and, sure enough, scanners using the obvious RE did not work correctly.

Step 2 was to fire up my special "tracing" version of GPLEX which creates a scanner that chatters on about what character it has just read, what the next state will be, and when it hits an accept state. Following through the problematic example string
'\\'
immediately showed what the problem was. The scanner was in the accept state when it hit the second single quote, but continued reading characters looking for an even longer valid string. Suddenly the problem is obvious: suppose we name the two alternates of the closure as (A) "any backslash escaped character except newline" and (B) "any character except single quote". Now consider the first backslash character in the example string. It qualifies either as an A or as a B. If it is a B then the next two characters form an escaped single quote (thus NOT terminating the string); but if it is the start of an escaped backslash then the second single quote terminates the string! What the scanner does is to mark the position of the second single quote and continue to munch characters. If the input ends before another single quote is found, the scanner backs up to the marked position, reports a valid string, and continues scanning.

Just for the record the correct regular expression is:
'(\\.|[^'\\])*'
although for most uses it would be best to exclude newlines inside of strings at the RE level to give better diagnostics. In any case, this simple RE worked for the enquirer.

So what do we learn from this whole business? Well, I guess that it is a reminder that regular expressions are still tricky, especially when there are overlapping alternates.  However, this only answers the question in bullet point one. What about point two and point three?

Well, the clue is in the title of this blog. A scanner is not just a regular expression engine. A scanner tries to pick out tokens from a continuing stream of input. It tries to find the longest possible match and backs up when a match fails. By contrast an RE engine tries to match a single pattern, and generally does lots of complicated other things like find sub-patterns within a matching overall pattern. Neither one really does the work of the other, except in almost trivially simple cases.

In this case, I believe that what happened was that what I have been calling the "obvious RE" was extracted from the scanner specification and tried against the .NET RE engine. It worked.

The funny thing is that had the obvious RE been extracted from the scanner spec and tried as a single pattern in GPLEX it would have worked perfectly when tried out one sample string at a time. Each input containing a literal string like the one that put its finger on the error would have munched extra characters until it met the end of the input, backed up and reported a correct string just as the .NET RE engine did.

I am glad that my enquirer did not do this, as it would have caused even more confusion.

Cheers
John

Thursday, March 3, 2011

Debugging GPCP with Visual Studio

Back in the early days of Visual Studio.NET the IDE shipped with a program called DbgCLR, which was found in a GuiDebug directory in the VS distribution.  Since VS-2008, this program is not present.

When run with the default /debug option, GPCP emits a "pdb" file for use by the VS debugger. However, since there is no language integration of current versions of Component Pascal into VS, debugging CP programs with the VS debugger requires some ingenuity.  This blog deals with how I do it.  It is entirely possible that there are other ways of achieving the same effect; like the man said, "your mileage may vary".

The key idea is to trick VS into thinking that you are debugging a C# project, which just happens to jump out into other language modules. In this case we will tell VS that the aim of the project is to compile the runtime support file RTS.cs to produce RTS.dll.

Simple Example, Debugging Hello World

(1) We begin by creating a directory, called DebugHello, say, and place in the directory our source file Hello.cp and the runtime system source RTS.cs. The runtime file is found under
C:\gpcp-CLR\source\libs\CSharp
in my installation.
(2) Start up Visual Studio and select File > New > Project from Existing Code. When the wizard starts, choose Visual C# as the project type, and click Next.
(3) At the next screen of the wizard we point VS at the directory that was created at step (1), choose a name for the project, DebugHello, and set the output type as Class Library. Click Finish.
(4) At this stage we go into the Solution Explorer pane, right click on DebugHello, and select Properties. In the Properties page choose the Applications tab, and set the Assembly name to "RTS". Probably nothing else needs to be set in this tab. We will assume that the default output path bin\debug is left unchanged in the Build tab.
(5) In the Build Events tab we want to create a pre-build event to compile Hello.cp. We could do this with a simple command line call to gpcp, but for multi-source-file compilations we need to control the place where the symbol and output files go so we will use the command
call $(ProjectDir)PreBuild.bat $(ProjectDir)
The content of this batch file is discussed later. Once we have the batch file we can invoke Build > Build Solution. The bin\debug directory will now contain RTS.dll, RTS.pdb, Hello.exe, Hello.pdb.
(6) In the Debug tab we select the Start external program radio button, and use the file browser to select the Hello.exe executable in the output directory. This is all that is required for "Hello World". However, for programs that take input we will also need to set the working directory and the command line arguments in this tab.
We may now open hello.cp in the Solution Explorer pane, place a breakpoint at BEGIN and start debugging!

The PreBuild.bat File

We want to set the working directory so that the symbol file, IL file and list file (if any) are placed in the source directory.  However, we want the output files to go to the bin\debug directory. The source directory is passed as an argument to the batch file by the macro $(ProjectDir) so the batch file reads --
cd %1%
gpcp /bindir=bin\debug hello.cp

A Bigger Example - Debugging GPCP

A (much) larger example would be debugging GPCP itself. This example illustrates a number of extra issues typical of larger programs. Here are the steps equivalent to those of the simple example.

(1) We begin by creating a directory, called DebugGpcp, say, and place in the directory the runtime system source RTS.cs. As well, we copy ALL of the cp sources from the source\gpcp directory of the distribution.
(2) Start up Visual Studio and select File > New > Project from Existing Code. When the wizard starts, choose Visual C# as the project type, and click Next.
(3) At the next screen of the wizard we point VS at the directory that was created at step (1), choose a name for the project, DebugGpcp, and set the output type as Class Library. Click Finish.
(4) At this stage we go into the Solution Explorer pane, right click on DebugGpcp, and select Properties. In the Properties page choose the Applications tab, and set the Assembly name to "RTS". Probably nothing else needs to be set in this tab. We will assume that the default output path bin\debug is left unchanged in the Build tab.
(5) In the Build Events tab we want to create a pre-build event to compile gpcp using the command
call $(ProjectDir)PreBuild.bat $(ProjectDir)
The content of this batch file is discussed later. Once we have the batch file we can invoke Build > Build Solution. The bin\debug directory will now contain RTS.dll, RTS.pdb and lots of other files.
(6) In the Debug tab we select the Start external program radio button, and use the file browser to select the gpcp.exe executable in the output directory. We also set the working directory and the command line arguments to compile the source file that we want GPCP to compile.

The PreBuild.bat File for GPCP

For a project as large as gpcp we want to only compile the files that have been modified, so we will use the CPMake tool here. CPMake compiles only changed files, or those that depend on changed files, and also compiles all the sources in the dependence order. If a complete recompile is wanted the /all option of CPMake does this. Our required batch file is --

cd %1%
CPMake /bindir=bin\debug gpcp
Note that the argument to CPMake is the base name of the program, without file extension.

Extra Libraries

GPCP requires a number of additional libraries, as well as RTS. All of these dynamic link libraries need to be copied into the bin\debug directory in order for the program to work. Here are the libraries that GPCP relies on. Some are standard libraries in the distribution, others are components of GPCP that are implemented in C#.

GPBinFiles.dll
GPFiles.dll
GPTextFiles.dll
MsilAsm.dll
QUT.PERWAPI.dll
QUT.SymbolRW.dll

If we want to step into these libraries it is necessary to copy the corresponding pdb files as well. Since these libraries should remain unchanged during any debugging of GPCP, these files are not checked for recompilation during the debugging process.

And that is all there is to it.

Happy debugging!

Thursday, February 10, 2011

YACC and the Perils of Operator Precedence

Background

A little while ago I received a query about a very simple grammar example which behaved in an unexpected fashion. The parser worked correctly in its original form.  However, there was a common substring in two of the productions; when the common substring was replaced by a new non-terminal, the resulting parser rejected certain clearly legal sentences. The question was: surely these two grammars should be equivalent?

I dug around with the example and was forcibly reminded of some "features" of YACC-like parser generators that I had long forgotten.  So here is an even simpler grammar that also behaves paradoxically.  It is also a tutorial on how the operator precedence algorithm really works.

For those whose memory of shift/reduce parsing theory is even rustier than mine, I have added a page called the Shift/Reduce Parsing Refresher. It is blatantly lifted from Appendix-9 of the GPPG documentation at http://gppg.codeplex.com/.

Introduction

The parsers generated by GPPG are similar in their functionality to those produced by the traditional YACC tools. That is, as well as using LALR(1) states to guide the parse, they use operator precedence. Operator precedence allows convenient handling of such things as expression grammars with large numbers of operators.

When programming languages are specified for their users, that is, for programmers, it is common to specify the semantics of expressions by specifying the relative precedence and associativity of the operators. Thus it is tempting to use this same mechanism for defining the grammar for implementors, and this is what YACC-like tools try to do.

There is another way of defining the way a parser will work with an expression grammar: by defining a number of different syntax non-terminals, traditionally given names like Expression, Simple Expression, Term, Factor and Primary. Going down this second path has the disadvantage that two ways of specifying the same semantics may then be required, because the multiple non-terminal method seems to be less understandable for programmers. Unfortunately, there are also disadvantages in using the operator precedence functionality of YACC.

This episode of the blog is concerned with understanding some of the subtleties.

The Breathtakingly Simple Grammar

The running example grammar is very simple. A single operator, and only two tokens.
Goal  : '+' Exp ;
Exp  : 'a' | Exp '+' Exp ;
As it turns out, this grammar specifies a regular language, the same infinite set of strings that are generated by the regular expression -
(\+a)+     or, if you prefer, the set {+a, +a+a, +a+a+a, ... }
What could possibly go wrong with such a simple grammar? And if something can go wrong, how about a more complicated case with (say) 17 levels of precedence?

Preliminary Examination

The example grammar is actually ambiguous. The shortest valid string with an ambiguous parse is "+a+a+a", which may be parsed either as though it was meant to be understood as "+((a+a)+a)" or as "+(a+(a+a))".

No big deal, we didn't specify whether the binary "+" was left or right associative.

However, before we go on let's check: send this grammar to GPPG, and it generates a parser, but also a warning which, with the /verbose option, reads -
Shift/Reduce conflict
    Shift "'+'" : State-6 -> State-5
    Reduce 4 : Exp -> Exp, '+', Exp
This tells us that in state-6 of the automaton there are two possible actions when the next symbol is '+' : shifting the plus sign and going to state-5, or reducing by the production
Exp : Exp '+' Exp 
This is nice, we see that the shift choice would lead to the right-associative semantic for the binary '+', while the reduce choice would lead to the left-associative semantic.

Using YACC Disambiguation - Associativity

The reason that the shift/reduce conflict message is only a warning is because YACC and its look-alikes always break such conflicts by shifting, in preference to reducing. In the case of the running example, this would lead to the right-associative interpretation. The warning is simply a heads-up to the user, so they may check if the "prefer shifting" default gives the correct behavior.
It is possible to make the warning go away by explicitly stating the associativity of the tokens. In the example we may begin with -
%token 'a'
%left '+'
GPPG no longer gives a warning, since we have specified that we want '+' to be left associative. Let us see how this actually works. If GPPG is run with the /report option, the part of the html report with the details for state-6 (the state that previously had the shift/reduce conflict) is -
State-6
Kernel Items
  4 Exp : Exp '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . '+' Exp
Parser Actions
  '+' reduce using rule 4 (Exp)
  EOF reduce using rule 4 (Exp) 
Recall that the dot in the kernel items denotes the point in the rule that the recognition has reached. Unusually, in this case the two items in the kernel both belong to the same production rule, rule-4. The first item has the dot at the end, indicating that this production has been fully recognized, and may be reduced. The lookahead set that supports the reduction is shown within braces. The second item has the dot internally, and shows that a following '+' would continue with the recognition of this production rule. The parser action table shows that with the %left declaration the first item (reduce by rule-4) is selected no matter whether the lookahead is another plus sign or the end of file.
Alternatively, if the declarations say -
%token 'a'
%right '+'
then the same state will have the report -
State-6
Kernel Items
  4 Exp : Exp '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . '+' Exp
Parser Actions
  '+' shift, and go to state 5
  EOF reduce using rule 4 (Exp) 
which is the same behavior as with no associativity specified, but without the warning message.
This seems to have solved the problem, with a clear way to choose the desired behavior, and that pesky warning suppressed.

Using YACC Disambiguation - Operator Precedence

As well as specifying the associativity, YACC allows for the relative precedence of operators to be controlled. The simple rule is that all the tokens associated with a particular %token, %left or %right declaration have the same precedence, but those in later groups take precedence over those in earlier declarations. Thus if we had not only a '+' operator, but also a '*' operator then in conventional arithmetic grammars the multiply operator would occur in a later group than the additive operator.
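
For example, a conventional arithmetic grammar might declare -

%left '+' '-'     /* declared earlier: lower precedence  */
%left '*' '/'     /* declared later:  higher precedence */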

How Operator Precedence Actually Works

Operator precedence is used to silently resolve shift/reduce conflicts by comparing the precedence of the completed rule with the precedence of the lookahead token. Here are the rules for determining precedence of each rule and each token -
  • All token precedences are greater than zero.
  • The precedence of a token is determined by the position of the declaration group in which it occurs. Groups declared later in the definitions section have higher precedence.
  • The precedence of a production is that given by the "%prec TokenName" declaration, if there is one.
  • Otherwise, the precedence of a production is that of the rightmost terminal symbol in the right-hand-side, if there are any terminal symbols in the right-hand-side.
  • Otherwise the production has zero precedence.
Having established the precedences, here are the rules by which they are compared -
  • If the precedence of the production is higher than the precedence of the lookahead token, then reduce.
  • Otherwise, if the precedence of the lookahead token is higher than the precedence of the production, then shift.
  • If the precedences are equal and the associativity of the lookahead token is left then reduce.
  • If the precedences are equal and the associativity of the lookahead token is right then shift.
These rules are applied during the generation of the parsing tables, not at parser runtime.
Finally, here are the rules that GPPG uses to decide when to issue conflict diagnostics -
  • If an automaton state has two or more productions that can be reduced, that is, two or more items with the “dot” at the end, then issue a reduce/reduce conflict warning.
  • If an automaton state has a reduction and also possible shift actions, then the conflicts are resolved as detailed above. However, if the conflict is resolved in favor of shifting because the production has zero precedence, then issue a shift/reduce conflict warning.
The conflict-reporting rules are similar for most YACC-like parser generators.

So What is the Problem?

The operator precedence facilities of these kinds of parser generators are designed for operator-based grammars, and are reasonably convenient for this purpose. However, like most simple ideas they have a dark side of unintended consequences.
The most striking of these is the failure of one of the most fundamental properties of context free grammars. Consider our original grammar -
Goal  : '+' Exp ;
Exp  : 'a' | Exp '+' Exp ;
Notice that the pattern "'+' Exp" occurs in two places. We might therefore be tempted to define a new non-terminal symbol for this pattern. Call it "Rest", say. Now we have the new grammar -
Goal  : Rest ;
Rest  : '+' Exp ;
Exp  : 'a' | Exp Rest ;
and this grammar defines precisely the same language as the previous case.
If we send this grammar to GPPG, it generates a parser and, as before, also generates a warning. In this case the warning reads -
Shift/Reduce conflict
  Shift "'+'" : State-5 -> State-4
  Reduce 5 : Rest -> '+', Exp
Following our experience with the original, unfactorised version of this grammar we might try to declare the '+' operator %left. As it turns out, this choice leads to a parser which rejects the simple (and legal) input "+a+a".
What is going on here? How is it possible that the parser is rejecting a legal input?
To explain this weird result we need to look at the parser's state-5, the site of the shift/reduce conflict. Here are the kernel items in this state -
State-5
Kernel Items
  5 Rest: '+' Exp .
      -lookahead: { '+', EOF }
  4 Exp : Exp . Rest
According to our rules, in this state the precedence of rule-5 is the same as that of '+'.  Since we have declared '+' to be left associative the parser reduces, and the input reduces to the goal symbol before the EOF is reached.

The Message in all of this

So to return to the question in the background section, are these two grammars equivalent? Well, yes the grammars are equivalent, but the two grammars do not lead to equivalent parsers.

The take-home message in all of this is that operator precedence is a fragile construct. It is normally safe, and sometimes even helpful, to factor out common sub-expressions in production right-hand-sides. However two things happen when this abstraction mechanism is used. Firstly, kernel item-sets may become merged, introducing new shift/reduce conflicts; secondly, if the abstraction removes the rightmost terminal symbol from the right-hand-side then the precedence of the rule may be modified and parsing behavior changed.

Some folks think that this is a sufficiently serious problem that they shun the use of operator precedence entirely and advocate instead the use of multi-level grammars.

Like the man said "yer pays yer money, and yer takes yer pick".

Tuesday, February 1, 2011

What it's all About

I plan to use this blog to talk about Compilers, including the ones that I wrote in the past or am writing now; Software Tools, including the ones that I wrote or maintain.  I may also indulge in an old-fashioned rant from time to time.

Most of my software that is in the public domain is available on CodePlex http://www.codeplex.com/ although there are some other versions available somewhere inside the QUT website.  Where both are available, the CodePlex versions are the current ones, and are under source code control (I use TortoiseSVN by choice).

Most of the software has been released under the Gardens Point name, chosen for the particular bend in the Brisbane River where the main campus of Queensland University of Technology is situated.  And yes that is the same river that flooded in January 2011, inundating about 11 thousand houses.

What Software Tools?
The software tools that I maintain on CodePlex are:
  • Gardens Point LEX (gplex).  This is a lexical scanner generator which produces C# as output.  The tool has lots of options, and can create scanners that deal with the whole Unicode character set.
  • Gardens Point Parser Generator (gppg).  This is a parser generator which produces C# as output.  The tool has an input language very close to traditional YACC, and produces an LALR(1) parser with the usual disambiguation rules.
  • PE Reader/Writer API (perwapi).  This is a reader/writer for Microsoft .NET PE files.  It has been used by a number of compilers including GPCP and GPM-CLR.  The tool does not deal with the complete world of the latest version of the .NET framework.  I tend to do enhancements/bug-fixes on an on-demand basis.
  • There is also a prototype tree rewriter-generator, based on Bottom Up Tree Rewriting theory.  This is almost ready for release, and I intend to blog about its virtues and challenges quite a bit.
What Compilers?
  • Gardens Point Component Pascal (gpcp).  This is the demonstration compiler that I wrote in year-2000 for the initial release of the .NET framework.  It also formed the running example for my book Compiling for the .NET Common Language Runtime, Prentice-Hall 2002.  There is also a JVM version of this compiler, but it lags behind the CLR version by a couple of releases.
  • Gardens Point Modula-2 (CLR version).  GPM was a compiler family that was built at QUT over about a ten year period.  Initially a commercial product, it was available on a total of about eleven machine architecture/operating system combinations.  Only a freeware Linux/x86 version appears to be still in use, and development stopped in about 1995.  However, a CLR version was created in 2003, as an example of how to write a CLR compiler for a non-typesafe, non-object oriented language.  GPM-CLR is effectively a new backend for the GPM frontend that all versions of GPM used in 1995.  As a side-effect of this open source experiment the previously non-open-source front-end is public.  I use GPM-CLR myself when I need to run some dusty old Modula-2 program.
Stuff in the Works
  • I am currently writing a C# version-3 backend for the QUT C# typechecker.  The typechecker is a framework for research on type-system extensions to C#.  It was used for Andrew Craik's excellent PhD thesis.  The backend is a platform for exploring issues to do with code-generation for .NET.  It is, if you wish, filling in all the bits that are missing from my Compiling for the .NET Common Language Runtime. Some were missing because they hadn't been invented in 2001, and others due to sheer ignorance on my part.
  • As well as the bottom-up rewriter-generator mentioned earlier, I am interested in plain ordinary code selection again (as opposed to code generation for virtual machines such as JVM and CLR).  Code for the SnapDragon is a gleam in the eye.