Sunday, December 25, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-2)

[This is a revised edition of the part-2 post.  An error in the LookaheadHelper code has been corrected, and the example code simplified.  The revised code is also available on CodePlex.]

In the last post a simple method for looking forward in the token stream of a GPPG parser was described.  The key concept was to save the scanner input buffer position, read tokens to resolve the syntactic ambiguity, reposition the input position and then continue. 

There are several limitations on this simple-minded approach.  Firstly, the example relied on being able to restore the scanner state precisely when the input position is backed up.  In cases where the scanner has several different start states this means that extra information needs to be saved and restored.  Secondly, in the case that long lookaheads can occur all of the work done by the scanner is lost when the inut is backed up.  If the scanner creates objects, including instances of the yylloc and yylval types, then this work will be repeated.

In this blog post a different technique is shown.  In this version the scanner defines and produces objects that encapsulate the result of any lookahead-requested call of yylex.  These objects contain three data: the integer-valued token result, the semantic value yylval, and the location value yylloc.  These objects, as well as being delivered to the decision code that is doing the lookahead, but is also added to an input buffer queue.  When backup is performed, the queue is simply handed back to the scanner.  Subsequent calls of yylex dequeue the saved values, until the queue is exhausted and normal scanning resumes.

In order to give a more realistic setting a modification of the grammar fragment from part-1 ensures that we do really need an unbounded lookahead.  Instead of expressions (and the left hand side of assignments) starting with a bare identifier they begin with a designator.  The designator may have member selection, '.' and array indexing.  Hence we must scan forward until we hit either a colon or a "dotdot", implying we have a case selector; or an "assign" symbol, implying that we are still in the same case statement sequence.  Here is the business part of the new grammar.


This grammar fragment, as in part-1, shows the location of the semantic action that is used to trigger the lookahead ad hoc code.  It also shows the location of the dummy symbol that is injected to terminate the current case_element phrase.

Then there is the infrastructure that is used to buffer and unbuffer the lexical data.  Here is the declaration of the class ScanObj.  This can be included inline in the *.lex file, or in a separate LookaheadHelper file.  The following is from a separate file.


And here is the scanner prolog that handles the queue within the scanner's Scan method.


The prolog code is inserted in the *.lex file right after the first "%%" marker, and before the first lexical production.

We can now draw attention to some of the features of the previous code snippet.  Firstly, this code is a very general example.  The enqueuing and dequeuing code allows for the possibility that there may be several ad hoc lookahead locations in the same parser, and that one ad hoc lookahead may consume the enqueued tokens of a previous lookahead.  Note particularly the code that initializes the new token buffer. If the existing code buffer queue is not empty, the leftover tokens in the queue must be added to the end of the newly constructed queue.  There are other structures that achieve the same effect.

Note also that the call to SaveScannerState takes an argument called nextTokenToSave.  In part-1 we made use of the fact that for this particular example the NextToken value would be empty when the ad hoc lookahead was called.  This allowed us to overwrite NextToken to inject the dummy token without any further care.  In general, this is not the case, so the code given above saves the argument if it is non-zero, so that it will be refetched either immediately, or immediately after the dummy token has been consumed.

This infrastructure is shared for all such ad hoc lookaheads, requiring only customization for the various instantiations of the yylloc and yylval types, and the user-chosen scanner namespace and class names.

The bodies of the semantic actions that are called by the parser all conform to the same pattern:
For our running example the method would be DoCaseLookahead, and the body of the lookahead would be the method CaseLookaheadScan.


Note carefully that the lookahead scan must use GetAndEnqueue to fetch tokens, rather than calling yylex directly.  This is to ensure that all the tokens get safely enqueued in the push-back buffer.

Summary

This may look like a lot more code than the example in part-1.  However most of the code is infrastructure, which is shared.  Furthermore this method may be used even with overlapping lookaheads, and cases where the scanner changes start state during a lookahead sequence.  For almost all scenarios this will be more efficient, since the semantic and location value objects created during a lookahead will not be wasted.

Is there anything this second version cannot do?  Well, there is one limitation:  Those grammars where the parser and scanner are tightly linked and the parser changes the scanner start state by calling the scanner's BEGIN method cannot use this method.  This is a very good reason to avoid such program structures.

The only really tricky thing about this method is the responsibility for figuring out how to write a correct "ExampleLookaheadScan" method.  In our present case the occurrence of particular tokens in the lookahead is definitive.

In the final part of this sequence I shall take a really difficult example that arises in the parsing of C#.  A statement sequence may have a local-variable-declaration or a statement-expression, among other possibilities.  Both of these may start with an identifier and it requires an unbounded lookahead to disambiguate the two.  In this case there are no disjoint sets of tokens that separate the two cases and stronger measures are required. 

The complete code of the example in the current blog entry will go up on the gppg page on CodePlex in the next few days.

Tuesday, December 13, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-1)

Gardens Point Parser Generator (GPPG) and other similar parser generators are limited to a one-symbol lookahead.  However, many programming languages have a few places in their grammars that require a some extra help.  In some cases it is possible to get around the issue with a little transformation of the grammar, but in other cases it is necessary to resort to stronger measures.  Of course if you own the language that you are parsing, the simplest response is to modifiy the language to make the problem go away, and no-one gets hurt. If somebody else owns the language then the toolkit gets a bit more exercise.

This blog deals with one of those stronger methods, where it is not possible to state in advance the number of symbols that have to be read to decide between two continuations.

Our running example is a grammar for a Pascal-like language which has a problem with the CASE statement, roughly equivalent to switch in C-like languages. Here is the base grammar with most of the irrelevant bits missing ...
 So, here is the problem - we are parsing the following string and have just read the semicolon, with the identifier "z" as the lookahead symbol ...
           CASE foo OF x,y : a := b; z ...
How do we know whether the 'z' belongs to the current case_element (maybe the start of another assignment statement) or is the start of a new case_label_list?  Well, we can't tell with one symbol of lookahead.  If we read forward and find a comma or a colon it's a case label; if we find an assignment marker ":=" then it's the start of an assignment. For plausible generalizations of the asssignment left-hand-side grammar the number of lookahead symbols is unbounded.

It is for this reason that most Pascal-like grammars have vertical bar '|' separator between case elements, and C-family languages use the keyword case for every label.  However the grammar, as given, is not ambiguous.  Read enough lookahead symbols and the problem is solved.

When a slightly filled-out version of this grammar is fed to gppg there is a shift-reduce conflict in one state.  Looking at the diagnostic output from using the /report /verbose options shows that for the state in question, when the lookahead is IDENT, the identifier will be shifted by default, although reduction  "case_label_list ':' statement_list -> case_element" is also valid.  This is the only conflict.

As an aside, note that things could be worse.  Pascal-family languages normally use the semicolon as a separator between statements, rather than as a statement terminator. The base grammar above uses the semicolon as a terminator.  If we simply use it between elements of a statement list then we would have an obvious problem with the string -
         CASE foo OF x,y : a:= b - z ...
Is the "-z" part of the assignment right-hand-side, or the first label of the next case element?

So now we come to the real question: how can we hack on the parser (and scanner) to resolve such problems with the one-symbol lookahead?

The idea is as follows: when the conflict location is reached jump to user-written code that scans forward in the input stream until the matter is decided. Point the parser in the right direction, reset the scanner buffer-position and resume parsing as normal.

The first step is to find a way of inserting the call to the user code in a semantic action.  For our example, whenever a statement list is being parsed in the context of a case statement we should do the ad hoc lookahead.  So we specialize statement_list as stat_list_for_case.


Now our SaveStateAndDoLookahead method is called every time a new statement is added to the statement list of a case statement.

And what does the method do?  Well, because of some issues with the visibility of various scanner and parser variables we split the code between the scanner and parser.  The scanner code saves scanner state, calls the parser code and restores the scanner state on return.
This code calls the Parser method DoAdHocLookahead. This method does the heavy lifting. [However, see the footnote for a bit more on what "save scanner state" implies.]

DoAdHocLookahead will repeatedly call yylex until a decision may be made.  First of all however note that the conflict only arises if the next symbol fetched is an identifier.  For our simple case we exit the loop as soon as we find a symbol that can belong to an assignment but not to a case label list, or to a case label list but not to an assignment.  Simple.  But how do we "point the parser in the right direction" before parsing resumes?

The simplest way of forcing the parser to do the right thing is to overwrite the parser's NextToken variable. We use a dummy symbol that otherwise does not occur in the grammar, and rewrite the grammar to rely on that symbol to separate the case_element items.
So that is about all there is to it.  Overwriting the NextToken value in the current version of GPPG requires diving into the code of ShiftReduceParser to make the NextToken variable accessible. The next refresh of the tool will give the variable protected accessibility to facilitate this technique.

The complete code for this example is available at http://gppg.codeplex.com

Part two will deal with a much harder example: one of the classic syntactic ambiguities of C#.

Footnote: What "save scanner state" implies, and various disclaimers.
In the example here the saved state in the scanner is the current "lookahead" character, the current line and column count, and the buffer position. This works for this particular scanner, but others may need more.  In particular, if a scanner has a number of start states, the start state might need to be saved and restored also.  Using the YY_START property is the easiest way of doing this, since it automatically resets the start ordinal and the start state. Of course, for a scanner that has a stack of start ordinals the whole stack needs to be saved and restored.

Furthermore, if the scanner detects and reports lexical errors, errors that are backed-up over may be reported twice.  Thus if you wish to be really picky, you might want to save the high tide on the error buffer and restore that as well.

Alternative strategies will be discussed in the second part.