tag:blogger.com,1999:blog-12385463025742353632024-03-13T13:58:31.129-07:00John Gough on Software ToolsJohn Gough on Software Tools deals with Programming Language Compilers, and Software Tools used for the recognition of formal languages. It is a practical blog, but can get a bit theoretical in places. Remember: there is nothing so practical as a good theory.k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.comBlogger10125tag:blogger.com,1999:blog-1238546302574235363.post-69665900066036571062013-03-20T23:39:00.000-07:002013-03-20T23:39:04.240-07:00Procedure Types and Managed Execution (part 2)<h3>
Summary</h3>
<div>
This is part two of a short series on the implementation of procedure types and variables on top of the .NET framework, or the Java Virtual Machine. In this part I consider alternative mechanisms for implementing procedure types on each platform.</div>
<h3>
The .NET Framework</h3>
<div>
On the .NET framework the delegate type appears to be the obvious choice. After all, the construct was designed for exactly this job, and as the basis of the .NET event-handling model. However, there is nothing to stop us from using a single-member base class or interface, just as we are (or rather were) forced to do on the JVM. Which is better? </div>
<div>
<br /></div>
<div>
The question of which mechanism is better is not one that may be answered on the basis of inclination or traditional prejudice. What we need is some evidence, and to be sure evidence from a range of different usage patterns. However, let us start with the simple case of simple procedure (static method) with a simple signature.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEvTFT3zNu5O7dvXBswxBRS59JBK-SNkbql7BODcncag6D6k992iIsvewdsRDezyOgKMC0-2nBu7Nlh6qMwnuHwYrwkCSkESGxCQM5oiCg07xuAIGV3_PSnoi8nL1CLKOYnocYq391eLsG/s1600/MM-Net1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEvTFT3zNu5O7dvXBswxBRS59JBK-SNkbql7BODcncag6D6k992iIsvewdsRDezyOgKMC0-2nBu7Nlh6qMwnuHwYrwkCSkESGxCQM5oiCg07xuAIGV3_PSnoi8nL1CLKOYnocYq391eLsG/s1600/MM-Net1.PNG" /></a></div>
Here are three ways of calling the static method <i>Hello</i>, using three different mechanisms for representing procedure values. At line 23 we declare the <i>StringToVoid</i> delegate type, and at line 14 we create a instance of the delegate type that wraps the method <i>Hello</i>.<br />
<br />
At line 26 an interface is declared with a single member "Invoke" of the string to void signature. This interface is implemented by the<i> IWrapHello</i> class, that wraps the <i>Hello</i> method. Essentially the same semantics are achieved by the abstract class <i>S2V</i> and its subclass <i>WrapHello</i> at line 31. The three "function values" are created at lines 14 .. 16, and are each called, in turn, in the following three lines. <br />
<br />
So, a compiler could use any of these three mechanisms to represent types and values for static methods. The amount of generated IL would be very similar each case, and the details of the naming of the wrapper types would be hidden in the implementation. But what are the comparative overheads of these three mechanisms?<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFz9BKbK2vNPdqQ9ve2N7rrN6VUH91wZtqFovYB-g2tVXC2aMwko19h38ghOym4hQ5ZIFUfJG8MHYRsFOPiiw87hJEWKJxleS3GVFC82hfPhsKCbdIOlHmTmfRJRbjT17K6gg4uujpx8gO/s1600/Net-Timing.PNG" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-left: auto; margin-right: auto; text-align: center;"><img border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFz9BKbK2vNPdqQ9ve2N7rrN6VUH91wZtqFovYB-g2tVXC2aMwko19h38ghOym4hQ5ZIFUfJG8MHYRsFOPiiw87hJEWKJxleS3GVFC82hfPhsKCbdIOlHmTmfRJRbjT17K6gg4uujpx8gO/s640/Net-Timing.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">.NET timing for various procedure variable implementations (see text).</td></tr>
</tbody></table>
Four different ways of implementing procedure variables were tried, with two different micro benchmarks. The first test was very simple: the static method that was encapsulated in the variable took a string argument and returned its length. This variable was called in a loop which repeated about 10^6 times. Some idea was to make the body of the wrapped method so simple that small differences in the calling machinery would still be significant. The results are shown in the left column of the table.<br />
<br />
The results seem to indicate that the .NET delegate mechanism has a small time advantage over both the concrete sub-class overriding an abstract method in a single-member class, or a class implementing the single member of an interface. The other thing to note is the overhead of reflective invocation. About two orders of magnitude, for a simple method body. <br />
<br />
Of course, with one must be a little careful with a JIT compiler, as it may do special magic on loops that repeat many, many times. The second example is designed to be slightly more realistic. The code uses a rather unusual coding pattern for implementing a finite state automaton (FSA). The state variable is a procedure variable holding the "nextstate" function. The function is fed the next symbol, in this case a single bit and returns the nextstate function corresponding to the next state of the FSA. The whole thing is attached to a pseudo-random binary sequence generator which repeats every (2^19 -1) steps. The details of generators of this kind are given at Wikipedia <a href="http://en.wikipedia.org/wiki/Linear_feedback_shift_register">Linear_feedback_shift_register</a>. In this case a Galois implementation was used. This generator produces a string of 18 zero bits only once every cycle, which is what the FSA searches for. The state is updated at every step, in an apparently random fashion, in an attempt to defeat any clever prediction strategy of the JIT. The results are shown in the right hand column of the table. It may be seen that our three favored strategies have very comparable results.<br />
<br />
The final column of the table, just for the record, lists the compatibility semantics of the procedure variable containers. All three of our favored mechanisms enforce <i>name</i>-compatibility, which is not what we really want. What we want is compatibility based on equality of <i>Signatures</i> of the encapsulated method. Reflective invocation of the <i>MethodInfo</i> value has the right semantics, but the loss of efficiency is overwhelming.<br />
<h3>
The JVM Framework</h3>
<h4>
The <i>MethodHandle</i> Class, and JSR292</h4>
<div>
Meanwhile, in the Java world, Java SE Version-7 introduced a number of changes designed to benefit all those dynamically typed languages that have been implemented to run on top of the JVM. The process of proposing and implementing the changes took several years, and anyone interested in the process should Google "JSR292" to see the history.</div>
<div>
<br /></div>
<div>
In essence, the JSR 292 changes introduced one new byte code instruction,<i> invokedynamic</i>, and a few new classes in the library. <i>MethodHandle</i> and<i> MethodType</i> are the two new classes of interest here, and both belong to the package<i> java.lang.invoke</i>. </div>
<div>
<br /></div>
<div>
I shall say nothing this time about<i> invokedynamic</i>, although it is an interesting topic for another day. Despite the impetus for the new instruction having come from the dynamically-typed language community, it turns out to be an ideal method for the implementation of the Lambda Abstractions that will launch with Java SE Version-8. However the two new classes are directly relevant to this discussion.</div>
<div>
<br /></div>
<div>
<i>MethodType</i> is a descriptor for a method signature. The object factory for the class is passed arguments with the Class objects for the return type and the parameter types. <i> MethodHandle</i> is wrapper for a method value. Objects of the type may be created by a factory method that is passed a <i>Class</i> object, the name of a method within that class, and a <i>MethodType</i> object. The returned handle encapsulates the matching method. If the <i>MethodInfo</i> object of the target method is known, presumably by reflection, then the corresponding method handle is created by a call to the amusingly named <i>"java.lang.invoke.unreflect".</i></div>
<div>
<i><br /></i></div>
<div>
Here are the corresponding results of the two test programs.</div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUFq6_bb4ShMg17V0j-CE45TuYdZ_FbFzdMQKbm9gVSMMI8X4bh_PBHcXm4S_q121W_4GW2uttiUZW3DT8_Nyj5PlV6DMz7EMmwcfXAHm1g1V38m-fn7QxfPxoVhVAPQtPkNtyARsqoq9v/s1600/Jvm-Timing.PNG" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUFq6_bb4ShMg17V0j-CE45TuYdZ_FbFzdMQKbm9gVSMMI8X4bh_PBHcXm4S_q121W_4GW2uttiUZW3DT8_Nyj5PlV6DMz7EMmwcfXAHm1g1V38m-fn7QxfPxoVhVAPQtPkNtyARsqoq9v/s640/Jvm-Timing.PNG" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">JVM timing for various procedure variable implementations (see text).<br />
<br />
<div style="text-align: left;">
<span style="font-size: small;">All of these measurements, for both frameworks, were performed on a four year old Intel E8400 processor running Windows-7 (32-bit). You may note that the <i>MethodHandle</i> mechanism has low overheads, with similar per-call overhead to the previous implementation methods once the method bodies do any computation. The rather lower overhead of reflective invocation of a<i> java.lang.Class.method</i> compared to that for a <i>System.Reflection.MethodInfo</i> is an interesting piece of trivia to store away in a spare neuron.</span></div>
<div style="text-align: left;">
<span style="font-size: small;"><br /></span></div>
<div style="text-align: left;">
<span style="font-size: small;">The Anonymous interface method row of the table is interesting, as this performs well on the example using the nextstate-function pattern. The gain here is that there is one less method call in the chain. In effect, the anonymous method "inlines" the body of the next-state function rather than <i>calling</i> the next-state function.</span></div>
<div style="text-align: left;">
<span style="font-size: small;"><br /></span></div>
<div style="text-align: left;">
<span style="font-size: small;">A simple table-lookup implementation of the FSA is included here, as a reality check.</span></div>
<div style="text-align: left;">
<span style="font-size: small;"><br /></span></div>
<div style="text-align: left;">
<span style="font-size: small;">So here is a tentative conclusion: The <i>MethodHandle</i> mechanism has comparable overhead to the other implementations of procedure variables. And, as a bonus, it has the sensible structural-compatibility semantics that we would expect procedure types to display.</span></div>
<h3 style="text-align: left;">
<span style="font-size: small;">About the Next Episode</span></h3>
<div style="text-align: left;">
<span style="font-size: small;">In the final episode of this series I shall look a bit closer at the case of method values that encapsulate both a method, and a receiver ("this" value) to invoke the method on. I shall also look at using lambda abstractions to realize similar coding patterns to the use of procedure/method variables.</span></div>
<h4 style="text-align: left;">
<span style="color: red; font-size: small;">Late-Breaking News</span></h4>
<div style="text-align: left;">
<span style="color: red; font-size: small;">My <i>MethodHandle</i>-based FSA benchmark crashes the Hotspot(r) Compiler for Windows X64. It runs fine on the 32-bit X86 Hotspot compiler. On X64 it runs fine with java -Xint, or if the run is too short for Hotspot to attempt heroic optimizations. Watch this space for further developments.</span></div>
<h4>
</h4>
</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br /></div>
k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-6815633196277345062013-03-15T21:09:00.001-07:002013-03-16T19:30:18.113-07:00Procedure Types and Managed Execution<h3>
Summary</h3>
This is the first part of a multi-post exploration of the ways in which procedure variables, function pointers (call them what you will) are implemented in managed execution systems such as .NET and the JVM. Particular implementation details for both .NET and the JVM type systems place limitations on the implementation of this functionality.<br />
<br />
This first post briefly reviews the idea behind procedure types, and the implementation mechanisms on the .NET and JVM platforms. <br />
<br />
The second part looks at various ways to get the same semantics on the (pre-version 7) Java Virtual Machine. This second part also discusses the limitations of both managed execution systems in implementing procedure types.<br />
<br />
A final part will look specifically at how the new functionality in JDK v7 changes the landscape, removing one of the limitations. This final part will also look at implementations based on use of lambda abstractions - available on the .NET platform currently, and coming soon in version-8 of the Java.<br />
<h3>
Background</h3>
Many programming languages have a construct that allows a variable to be declared as containing a value that denotes a particular procedure value. The procedure denoted by the variable may be invoked by<i> calling</i> the procedure value. Such procedure values may be created, by associating the new value with some named procedure, and values may be copied and assigned using the conventional syntax of the particular language.<br />
<br />
Compilers implement these constructs, in the simplest case, by making the procedure value be of pointer size, and using the entry-point-address of the bound procedure as the value. This implementation mechanism is explicit in <i>ANSI C</i>, where such values are called "function pointers". Languages that permit nested procedures to be used as procedure values are a little more complicated, since the called procedure needs to be given access to the stack frames of its enclosing procedures. However, that circumstance is mainly of historical and theoretical interest nowadays.<br />
<br />
Typical machine architectures have two distinct subroutine call instructions, <span style="font-family: Courier New, Courier, monospace;">call</span> and <span style="font-family: Courier New, Courier, monospace;">calli</span>, which respectively call a method by its symbolic name, and indirectly call the method that the entry point address of which has been loaded into some chosen register.<br />
<br />
The calling sequence for a regular call goes something like this -<br />
<ul>
<li>load up the arguments to the call</li>
<li><span style="font-family: Courier New, Courier, monospace;">call</span> the named procedure</li>
</ul>
or in the indirect, procedure variable case -<br />
<ul>
<li>load up the arguments to the call</li>
<li>load the procedure variable value into the chosen register</li>
<li><span style="font-family: Courier New, Courier, monospace;">calli</span> through the chosen register</li>
</ul>
Of course, for a statically typed language, the number and type of the the arguments must in each case match the formal parameters of the procedure ends up being invoked by the call.<br />
<br />
Procedure variables are thus strongly typed, and programming languages that provide procedure variables may allow such procedure types to be named in the usual way. In Component Pascal, the syntax looks like this -<br />
<i> TYPE NoArgToInt = PROCEDURE( ) : INTEGER;</i><br />
Which gives a name to a type of procedures that take no arguments, but return an integer. We may now declare variables of this type -<br />
<i> VAR p1, p2 : NoArgToInt;</i><br />
which declares two variables of the named procedure type. We could just have easily left the procedure type unnamed, and declared the variables as being of some anonymous procedure type -<br />
<i> VAR p1, p2 : PROCEDURE( ) : INTEGER;</i><br />
with the same effect.<br />
<br />
When the compiler emits instructions to construct a new procedure value it will check that any named procedure that is being bound to a new value has the correct argument number and type and conforming return type.<br />
<br />
Now, a little thought will convince the reader that although it may be convenient to have a name for a procedure type, but two procedure types should b e compatible provided that they have the same call signature. Thus such languages have type compatibility rules for procedure values that implement<i> structural compatibility</i>, that is "two procedure types are the same type if they have the same argument number and type, and the same return value (or lack of return value)".<br />
<br />
This observation, regarding structural compatibility, does not even arise in <i>ANSI C</i>. Function pointer types are denoted by an "abstract declarator" that declares only the parameter types and the return type. The abstract declarator for our <i>NoArgToInt</i> type would be -<br />
<div style="text-align: center;">
<span style="font-family: Courier New, Courier, monospace;">int (*)( )</span></div>
that is "pointer to function taking no arguments and returning<span style="font-family: Courier New, Courier, monospace;"> int</span>.<br />
<h3>
Implementing Procedure Types on .NET</h3>
<div>
Implementing procedure types on .NET is fairly straightforward. The built-in <b>delegate type</b> of the framework is <i>almost</i> exactly what we need. Consider the following C# progam fragment -<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvCyqahl3zLGKVkSCOhYUEY_upPiG9M0lMpGEiVL83cEBk1wKCffRwv4sAgJ3NEriXks6BLDaDC9T0vInV87-uyD3FMMrdqUmqRahUBrMde1OSMMeBuGWiozMiUvv7M3Ja8isCWHCPPndA/s1600/ProcTypes1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvCyqahl3zLGKVkSCOhYUEY_upPiG9M0lMpGEiVL83cEBk1wKCffRwv4sAgJ3NEriXks6BLDaDC9T0vInV87-uyD3FMMrdqUmqRahUBrMde1OSMMeBuGWiozMiUvv7M3Ja8isCWHCPPndA/s1600/ProcTypes1.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
At line 20 a new delegate type is defined, with the "no args, returns int" signature. At line 23 a static method <i>Foo</i> is defined, with a compatible signature. A new instance of the delegate type is defined at line 26, taking <i>Foo</i> as its bound method value. The method is invoked via the delegate value at line 27.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
There are a couples of wrinkles about .NET delegates that I am ignoring for the moment. Delegates may be bound to either a static method, as in this example, or bound to an instance method. In the case of an instance method the "this" reference is supplied at delegate creation time. A delegate may thus be thought of as encapsulating two data: a pointer to the method entry point, and a copy of the receiver to which the method is to be applied.</div>
<h4>
Not <i>exactly</i> what we wanted</h4>
<div>
The only real defect of the .NET delegate, as a feature to implement procedure values, is the lack of structural compatibility. Suppose an application depends on two existing libraries, both of which have <i>API</i> calls with parameters of delegate types, and use different names for (say) the <i>NoArgToInt</i> signature. Now the names of the delegate types are baked into the <i>API</i>s, and although our method<i> Foo</i> is compatible with each, the delegate values are not assignment compatible. This is extremely annoying, since it is clear that the same semantic analysis inside the compiler that decides that <i>Foo</i> is compatible with each of the two delegate types can also decide that values of the two types are compatible with each other. (There was a proposal to move to structural compatibility for delegate types in V2 of the framework, but it didn't make it into the release.)</div>
<div>
<br /></div>
<div>
The .NET version of GPCP essentially produces the same machine code for Procedure Types as a C# compiler would for the equivalent delegate-based program. The semantics are not exactly as described in the language report, because of the lack of structural compatability of values.</div>
<h3>
Implementing Procedure Types on the (pre-V7) JVM</h3>
<div>
On the JVM the starting point is even further back, as the JVM has nothing equivalent to a built-in delegate type. Since the platform does not give us the primitive construct that we need, we must build it ourselves. Here is the Java program that we need to construct to get semantics equivalent to the C# example above.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJwkDsu_MkYrK9aGugmIpTzWi_T-2SHHcaFg8VM2hAaKkJBxWx1D1aqcofgW1h3_I_Bj2ONGuzkg5xhAHZwDa4tN-8_Kr3RGcBFxnn1QD4sQS2RqHOu282ICAb8hZRhOADpMGZhjIBNy7v/s1600/ProcValueJVM.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJwkDsu_MkYrK9aGugmIpTzWi_T-2SHHcaFg8VM2hAaKkJBxWx1D1aqcofgW1h3_I_Bj2ONGuzkg5xhAHZwDa4tN-8_Kr3RGcBFxnn1QD4sQS2RqHOu282ICAb8hZRhOADpMGZhjIBNy7v/s1600/ProcValueJVM.PNG" /></a></div>
Each procedure<i> type</i> corresponds to an abstract base class, defining a single abstract method Invoke, with the signature that we wish to encapsulate. Every <i>value</i> of this type has a separate derived class that implements the abstract method by calling the target method, in this case<i> Foo</i>. This is in essence the translation formula that GPCP-JVM uses to implement procedure types and values.<br />
<br />
It may seem that if a program has many, many different procedure value instances that the namespace is going to become cluttered with lots of "single use" sub-classes, and that an anonymous class would be nicer. This is true of course, but the namespace is still cluttered, it is just that the clutter is with names that the compiler chose for you. And, yes, the names will be names that would be illegal in your Java program, so that these names will not clash with anything that the programmer might have named herself.<br />
<br />
It would also be possible to define an interface base type, rather than an abstract type to hold procedure values, with similar results.<br />
<br />
Notice that the Java mechanism has the same problem with compatibility of types as the .NET framework. Two variables are assignment compatible if they are derived from the same abstract base type.<br />
<h4>
Preview of part II.</h4>
<div>
In part two of this series, I shall look at alternative mechanisms for implementing procedure types. In particular there are interface base types (rather than an abstract base class), there is the use of runtime reflection, and the new <i>MethodHandle</i> types that appeared in Java SE version 7. I also will report a performance shoot-out, so that users may make informed choices between the various mechanisms.</div>
<div>
<br /></div>
<div>
<br /></div>
k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-30105812554359106472012-03-22T16:33:00.001-07:002012-03-22T16:35:00.594-07:00Using Reference Semantic Value Types with GPPGAll the examples that ship with <i>GPPG</i>, and parsers in the distribution all use<i> .NET</i> value types for the semantic action type. For very simple situations, the use of the <span style="font-family: 'Courier New', Courier, monospace;">%union</span> construct is convenient. In particular, if the types of the non-terminal symbols have been declared in the *.y file, then the parser will automatically select the appropriate field from the struct that implements the "union" type.<br />
<br />
The name of the semantic action type may be declared using the construct -<br />
<span style="font-family: 'Courier New', Courier, monospace;">%YYSTYPE</span><i> typename</i><br />
Where<i> typename</i> is the name of the type. It is most commonly an abstract class type, with the actual values being concrete subtypes of this class. <br />
<br />
The use of such a family of class types is almost always the correct approach when the purpose of the parser is to construct an<b> abstract syntax tree</b> (<i>AST</i>). The following is a yet another version of the calculator example, two versions of which ship with the <i>GPPG</i> distribution. In this version, the parser builds an <i>AST</i> for each of the expressions. These tree-representations of the expressions are evaluated on demand, much in the fashion of C# lambda expressions.<br />
<br />
The first thing that has to be said about using a reference <i>YYSTYPE</i> is that the value stack in the parser will, by default, contain a stack of <b>null</b> values. You will be swiftly reminded of this if you try to access any fields of the top of stack value. The idea is that the semantic actions of the parser will build the tree during the parse. The semantic actions will therefore be calls to factory methods that create new tree-nodes and hook the sub-trees together. Here is the grammar definition. It is derived from the grammars of the calculator programs that ship with <i>GPPG</i>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh1RP5wk-PGnuMI08H2-H0rZx3bBvgpggx8MnbJa31boI7I0Ggf321R37uazGBNlMay9T-x-17vvHsR2OP6xcXMPipZoRwA75PoiiyxrLOHjVi4nGo6HBJnHWQyz1l3TqUTK7CVWPogdNb/s1600/RealTreeGrammar.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjh1RP5wk-PGnuMI08H2-H0rZx3bBvgpggx8MnbJa31boI7I0Ggf321R37uazGBNlMay9T-x-17vvHsR2OP6xcXMPipZoRwA75PoiiyxrLOHjVi4nGo6HBJnHWQyz1l3TqUTK7CVWPogdNb/s1600/RealTreeGrammar.PNG" /></a></div>
The structure of the<i> RealTree.Node</i> type is hinted at in the names of the semantic action methods. <i>Node</i> is an abstract class, and there are exactly three derived classes - <i>Binary, Unary</i> and <i>Leaf</i>. Each object of node type contains a tag, which is a value from the <i>NodeTag</i> enumeration. The purpose of these tags is to indicated whether (say) a <i>Binary</i> node represents an addition, subtraction, multiplication or so on.<br />
<br />
All of the rest of the code is in another file, named "<span style="font-family: 'Courier New', Courier, monospace;">RealTreeHelper.cs</span>". This file contains the definitions of the <i>Node</i> classes, and the user-written part of the <i>Parser</i> class, including a simple handwritten scanner. In particular, the parser class contains the static factory methods that invoke the node constructors.<br />
<br />
The goal symbol of the grammar is "list". This is simply a list of statements, and has an error recovery action so that all errors will return to a state that is expecting the start of another statement.<br />
<br />
At runtime the parser maintains a list of 26 variables, named 'a' to 'z', which may be referred to by their case-insensitive name. All of these registers may be cleared, or their values printed to the console. The interesting productions are those that assign expressions to variables, or evaluate an expression to a floating point double value and print it to the console. The scanner creates a <i>Leaf</i> node value whenever it scans a literal value, or a variable name. These leaf nodes will have a tag value that is a <i>NodeTag.literal </i>or <i>NodeTag.name</i> respectively.<br />
<br />
The expression productions are fairly standard, and create an abstract syntax tree for the recognized expression. An interesting case is the second production for statement. In this case the expression is evaluated immediately, so that the semantic value that is produced is a literal value <i>Leaf</i> node, rather than an expression tree value.<br />
<br />
The complete program example consists of the grammar (<span style="font-family: 'Courier New', Courier, monospace;">*.y</span>) file, the <span style="font-family: 'Courier New', Courier, monospace;">RealTreeParser.cs</span> file that <i>GPPG</i> produces from this, and the <span style="font-family: 'Courier New', Courier, monospace;">RealTreeHelper.cs</span><span style="font-family: inherit;"> file that contains all the handwritten code. All three are zipped together, and available for download from the CodePlex<i> GPPG</i> page </span><a href="http://gppg.codeplex.com/releases/view/80821">http://gppg.codeplex.com/releases/view/80821 </a> The whole needs to be compiled with the <span style="font-family: 'Courier New', Courier, monospace;">ShiftReduceParserCode.cs</span> library.<br />
<br />
Here is a snippet from the helper file --<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHD18zcKcNfvmPo-pHgb5OcuVLQNfWok2eNa_N0XoSLtFPflLoJ0C2NhU9c5BQU7bCcfQX2mFX4cYeclHAGazB7vHMq69ABtHX0QGNolp79PjdTO0oc-gzw9cbzVZ_HS_dnoUecr0AB_9c/s1600/NodeClass.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHD18zcKcNfvmPo-pHgb5OcuVLQNfWok2eNa_N0XoSLtFPflLoJ0C2NhU9c5BQU7bCcfQX2mFX4cYeclHAGazB7vHMq69ABtHX0QGNolp79PjdTO0oc-gzw9cbzVZ_HS_dnoUecr0AB_9c/s1600/NodeClass.PNG" /></a></div>
<br />
This base class defines two abstract methods:<i> Eval</i> and <i>Unparse</i>. <i>Eval </i>recursively traverses the given tree, returning a floating point double. <i>Unparse</i> recursively traverses the given tree constructing a string that gives a fully parenthesized representation of the expression. The difference between the traversals is just this: during <i>Unparse</i>, when the traversal reaches a name leaf it prints the name of the variable. When<i> Eval </i>reaches a name leaf the traversal evaluates the tree rooted at the variable.<br />
<br />
Suppose we have entered the statements -<br />
<span style="font-family: 'Courier New', Courier, monospace;">a = 42; b = 24; c = a + b;</span><br />
then the <span style="font-family: 'Courier New', Courier, monospace;">print</span> command will write<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">'c' = (a+b)</span><br />
while the <span style="font-family: 'Courier New', Courier, monospace;">eval c</span> command will write<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;">result = 66</span>.<br />
<br />
<span style="font-family: inherit;">It is easy to construct an evaluation that fails by trying to evaluate a circular definition, so it is necessary to guard against this. During evaluation each node is marked when traversal reaches the node, and unmarked when the recursion leaves. This detects circular definitions, and prevents stack overflows during evaluation. Here is simple example -</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">a = 42; a = a + 1; eval a</span><br />
<span style="font-family: inherit;">fails because of the circularity of the second statement. On the other hand -</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">a = 42; a = eval (a+1); eval a</span><br />
<span style="font-family: inherit;">succeeds, and returns 43, as expected. In this case the second command evaluates</span><span style="font-family: 'Courier New', Courier, monospace;"> (a+1)</span><span style="font-family: inherit;"> and deposits the new value 43 as a literal in variable</span><span style="font-family: 'Courier New', Courier, monospace;"> a</span><span style="font-family: inherit;">. In the circular version the second command overwrites the initial value of</span><span style="font-family: 'Courier New', Courier, monospace;"> a</span><span style="font-family: inherit;"> with the <i>expression tree</i></span><span style="font-family: 'Courier New', Courier, monospace;"> (a+1)</span><span style="font-family: inherit;">.</span><br />
<span style="font-family: inherit;"><br /></span><br />
<span style="font-family: inherit;">In any case, the purpose of the example is to demonstrate the use of a family of classes as the semantic value type for parsers that build abstract syntax trees. Enjoy!</span><br />
<span style="font-family: inherit;"><br /></span><br />
<br />
<br />k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-42348902733864421142011-12-25T19:30:00.000-08:002012-02-21T20:39:35.906-08:00Doing ad hoc Lookahead in GPPG Parsers (part-2)[This is a revised edition of the part-2 post. An error in the LookaheadHelper code has been corrected, and the example code simplified. The revised code is also available on CodePlex.]<br />
<br />
In the last post a simple method for looking forward in the token stream of a GPPG parser was described. The key concept was to save the scanner input buffer position, read tokens to resolve the syntactic ambiguity, reposition the input position and then continue. <br />
<br />
There are several limitations on this simple-minded approach. Firstly, the example relied on being able to restore the scanner state precisely when the input position is backed up. In cases where the scanner has several different start states this means that extra information needs to be saved and restored. Secondly, in the case that long lookaheads can occur all of the work done by the scanner is lost when the inut is backed up. If the scanner creates objects, including instances of the<em> yylloc</em> and <em>yylval</em> types, then this work will be repeated.<br />
<br />
In this blog post a different technique is shown. In this version the scanner defines and produces objects that encapsulate the result of any lookahead-requested call of <em>yylex</em>. These objects contain three data: the integer-valued token result, the semantic value <em>yylval</em>, and the location value<em> yylloc</em>. These objects, as well as being delivered to the decision code that is doing the lookahead, but is also added to an input buffer queue. When backup is performed, the queue is simply handed back to the scanner. Subsequent calls of<em> yylex</em> dequeue the saved values, until the queue is exhausted and normal scanning resumes.<br />
<br />
In order to give a more realistic setting a modification of the grammar fragment from part-1 ensures that we do really need an unbounded lookahead. Instead of expressions (and the left hand side of assignments) starting with a bare identifier they begin with a designator. The designator may have member selection, '.' and array indexing. Hence we must scan forward until we hit either a colon or a <em>"dotdot",</em> implying we have a case selector; or an<em> "assign"</em> symbol, implying that we are still in the same case statement sequence. Here is the business part of the new grammar.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-RoGtBLRDjR0/TvfWT_8ScnI/AAAAAAAAABQ/2VJ8cqb7-Us/s1600/BaseGrammar2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://4.bp.blogspot.com/-RoGtBLRDjR0/TvfWT_8ScnI/AAAAAAAAABQ/2VJ8cqb7-Us/s400/BaseGrammar2.PNG" width="330" /></a></div>
<br />
This grammar fragment, as in part-1, shows the location of the semantic action that is used to trigger the lookahead<em> ad hoc</em> code. It also shows the location of the dummy symbol that is injected to terminate the current <em>case_element</em> phrase.<br />
<br />
Then there is the infrastructure that is used to buffer and unbuffer the lexical data. Here is the declaration of the class ScanObj. This can be included inline in the *.lex file, or in a separate <em>LookaheadHelper</em> file. The following is from a separate file.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNHc4GQGsPaza-xFrjloIWK0SGnjv5hCvSAX-2bR0ytEewdHvg1y7EwmjBO6uKCs2juTR3RlcxbjfHF0MZfbkDBvvB-fwcm932eBqDVdHos6ngdRHoctCIcfiWJQCvJQECVEjJGgF3GZdK/s1600/ScannerInfrastructure2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNHc4GQGsPaza-xFrjloIWK0SGnjv5hCvSAX-2bR0ytEewdHvg1y7EwmjBO6uKCs2juTR3RlcxbjfHF0MZfbkDBvvB-fwcm932eBqDVdHos6ngdRHoctCIcfiWJQCvJQECVEjJGgF3GZdK/s1600/ScannerInfrastructure2.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
And here is the scanner prolog that handles the queue within the scanner's<em> Scan</em> method.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-fbN1GAcIOsE/TvfbjHBOWVI/AAAAAAAAABo/ucLF1fO4iIU/s1600/ScannerProlog.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-fbN1GAcIOsE/TvfbjHBOWVI/AAAAAAAAABo/ucLF1fO4iIU/s1600/ScannerProlog.PNG" /></a></div>
<br />
The prolog code is inserted in the *.lex file right after the first "%%" marker, and before the first lexical production.<br />
<br />
We can now draw attention to some of the features of the previous code snippet. Firstly, this code is a very general example. The enqueuing and dequeuing code allows for the possibility that there may be several <em>ad hoc</em> lookahead locations in the same parser, and that one <em>ad hoc</em> lookahead may consume the enqueued tokens of a previous lookahead. Note particularly the code that initializes the new token buffer. If the existing code buffer queue is not empty, the leftover tokens in the queue must be added to the end of the newly constructed queue. There are other structures that achieve the same effect.<br />
<br />
Note also that the call to <em>SaveScannerState</em> takes an argument called <em>nextTokenToSave</em>. In part-1 we made use of the fact that for this particular example the <em>NextToken</em> value would be empty when the <em>ad hoc</em> lookahead was called. This allowed us to overwrite <em>NextToken</em> to inject the dummy token without any further care. In general, this is not the case, so the code given above saves the argument if it is non-zero, so that it will be refetched either immediately, or immediately after the dummy token has been consumed.<br />
<br />
This infrastructure is shared for all such ad hoc lookaheads, requiring only customization for the various instantiations of the yylloc and yylval types, and the user-chosen scanner namespace and class names.<br />
<br />
The bodies of the semantic actions that are called by the parser all conform to the same pattern:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiurAosG3zt0Ds3Cvnq_gTw70KT6TGQtwXNzncdtCHkLsoCUkdjoYkgJN_rc364hVlrUzUkFt4cRogxnOvl-5LZITzZDSv_JhWep4gC2mu4UlyhNOC4FjjUszL4980o9dqwE9bEQ2-SoHPu/s1600/SaveState2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiurAosG3zt0Ds3Cvnq_gTw70KT6TGQtwXNzncdtCHkLsoCUkdjoYkgJN_rc364hVlrUzUkFt4cRogxnOvl-5LZITzZDSv_JhWep4gC2mu4UlyhNOC4FjjUszL4980o9dqwE9bEQ2-SoHPu/s1600/SaveState2.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
For our running example the method would be<em> DoCaseLookahead</em>, and the body of the lookahead would be the method <em>CaseLookaheadScan</em>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjinE1IcZSR4sHawICwhESqMGxNLGODEEsD9jZOkQ5NQ3OvnCCuyJ4VqAPru0cf5YgkDepSRMJIO3hDylkDImuN_ZDlz3qNzhABkSnbsm7VR1zQTPeuTjkbGTrwE1Qsodt9Z7mUFqf6Mspu/s1600/CaseLookaheadScan2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjinE1IcZSR4sHawICwhESqMGxNLGODEEsD9jZOkQ5NQ3OvnCCuyJ4VqAPru0cf5YgkDepSRMJIO3hDylkDImuN_ZDlz3qNzhABkSnbsm7VR1zQTPeuTjkbGTrwE1Qsodt9Z7mUFqf6Mspu/s1600/CaseLookaheadScan2.PNG" /></a></div>
<br /><br />
Note carefully that the lookahead scan must use <em>GetAndEnqueue</em> to fetch tokens, rather than calling <em>yylex</em> directly. This is to ensure that all the tokens get safely enqueued in the push-back buffer.<br />
<br />
<span style="font-size: large;"><strong>Summary</strong></span><br />
<br />
This may look like a lot more code than the example in part-1. However most of the code is infrastructure, which is shared. Furthermore this method may be used even with overlapping lookaheads, and cases where the scanner changes start state during a lookahead sequence. For almost all scenarios this will be more efficient, since the semantic and location value objects created during a lookahead will not be wasted.<br />
<br />
Is there anything this second version cannot do? Well, there is one limitation: Those grammars where the parser and scanner are tightly linked and the parser changes the scanner start state by calling the scanner's BEGIN method cannot use this method. This is a very good reason to avoid such program structures.<br />
<br />
The only really tricky thing about this method is the responsibility for figuring out how to write a correct "<em>ExampleLookaheadScan</em>" method. In our present case the occurrence of particular tokens in the lookahead is definitive. <br />
<br />
In the final part of this sequence I shall take a really difficult example that arises in the parsing of C#. A statement sequence may have a<em> local-variable-declaration</em> or a <em>statement-expression</em>, among other possibilities. Both of these may start with an identifier and it requires an unbounded lookahead to disambiguate the two. In this case there are no disjoint sets of tokens that separate the two cases and stronger measures are required. <br />
<br />
The complete code of the example in the current blog entry will go up on the gppg page on CodePlex in the next few days.k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-4867013899336650822011-12-13T22:51:00.000-08:002011-12-14T18:08:53.656-08:00Doing ad hoc Lookahead in GPPG Parsers (part-1)<em>Gardens Point Parser Generator </em>(GPPG) and other similar parser generators are limited to a one-symbol lookahead. However, many programming languages have a few places in their grammars that require a some extra help. In some cases it is possible to get around the issue with a little transformation of the grammar, but in other cases it is necessary to resort to stronger measures. Of course if you <strong><em>own</em></strong> the language that you are parsing, the simplest response is to modifiy the language to make the problem go away, and no-one gets hurt. If somebody else owns the language then the toolkit gets a bit more exercise.<br />
<br />
This blog deals with one of those stronger methods, where it is not possible to state in advance the number of symbols that have to be read to decide between two continuations.<br />
<br />
Our running example is a grammar for a Pascal-like language which has a problem with the <em>CASE</em> statement, roughly equivalent to <strong>switch</strong> in C-like languages. Here is the base grammar with most of the irrelevant bits missing ...<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyGVw9MiA4_DI9UzW7hIFP2rhLmrhkrVr5pLWA_9HSP1oOUCfPmA21fhf9H9HikWk7ApRsBlaNyZeTKmDeArWzEzP9uk2GzCJC9bvQ76OPR6udVKO5_-oJyVpOmaWsCYJ11PFvyRkMe22V/s1600/BaseGrammar.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="325" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyGVw9MiA4_DI9UzW7hIFP2rhLmrhkrVr5pLWA_9HSP1oOUCfPmA21fhf9H9HikWk7ApRsBlaNyZeTKmDeArWzEzP9uk2GzCJC9bvQ76OPR6udVKO5_-oJyVpOmaWsCYJ11PFvyRkMe22V/s400/BaseGrammar.PNG" width="400" /></a></div>
So, here is the problem - we are parsing the following string and have just read the semicolon, with the identifier "z" as the lookahead symbol ...<br />
<span style="font-family: "Courier New", Courier, monospace;"><strong><span style="font-size: x-small;"> </span><span style="font-size: small;">CASE foo OF x,y : a := b; z</span></strong> ...</span><br />
<span style="font-family: inherit;">How do we know whether the 'z' belongs to the current <em>case_element</em> (maybe the start of another assignment statement) or is the start of a new <em>case_label_list</em>? Well, we can't tell with one symbol of lookahead. If we read forward and find a comma or a colon it's a case label; if we find an assignment marker "<strong>:=</strong>" then it's the start of an assignment. For plausible generalizations of the asssignment left-hand-side grammar the number of lookahead symbols is unbounded.</span><br />
<br />
It is for this reason that most Pascal-like grammars have vertical bar<strong> '|'</strong> separator between case elements, and C-family languages use the keyword <strong>case</strong> for every label. However the grammar, as given, is not ambiguous. Read enough lookahead symbols and the problem is solved.<br />
<br />
When a slightly filled-out version of this grammar is fed to <strong>gppg</strong> there is a shift-reduce conflict in one state. Looking at the diagnostic output from using the /report /verbose options shows that for the state in question, when the lookahead is IDENT, the identifier will be shifted by default, although reduction "<em>case_label_list ':' statement_list <span style="font-family: "Courier New", Courier, monospace;"><strong>-></strong></span> case_element</em>" is also valid. This is the only conflict.<br />
<br />
As an aside, note that things could be worse. Pascal-family languages normally use the semicolon as a <em>separator</em> between statements, rather than as a statement<em> terminator</em>. The base grammar above uses the semicolon as a terminator. If we simply use it between elements of a statement list then we would have an obvious problem with the string -<br />
<strong><span style="font-family: "Courier New", Courier, monospace;"> CASE foo OF x,y : a:= b - z ...</span></strong><br />
<span style="font-family: inherit;">Is the "-z" part of the assignment right-hand-side, or the first label of the next case element?</span><br />
<br />
So now we come to the real question: how can we hack on the parser (and scanner) to resolve such problems with the one-symbol lookahead?<br />
<br />
The idea is as follows: when the conflict location is reached jump to user-written code that scans forward in the input stream until the matter is decided. Point the parser in the right direction, reset the scanner buffer-position and resume parsing as normal.<br />
<br />
The first step is to find a way of inserting the call to the user code in a semantic action. For our example, whenever a statement list is being parsed in the context of a case statement we should do the <em>ad hoc</em> lookahead. So we specialize <em>statement_list</em> as<em> stat_list_for_case</em>.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDRpTVu08kwwEXkqbs8KQWidOErsFJKiPZ1w8VJYi0VP-B4NgtW8shQE7CXF0HuS3PGK8wr1LrCawIct5Qc9Td9IwlesAaB4_4TJ0_fJjSTg7UP7-hNYSOXIVmHqm5wIx-1oSs28ENJWnN/s1600/BaseGrammarMod.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDRpTVu08kwwEXkqbs8KQWidOErsFJKiPZ1w8VJYi0VP-B4NgtW8shQE7CXF0HuS3PGK8wr1LrCawIct5Qc9Td9IwlesAaB4_4TJ0_fJjSTg7UP7-hNYSOXIVmHqm5wIx-1oSs28ENJWnN/s320/BaseGrammarMod.PNG" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Now our <em>SaveStateAndDoLookahead</em> method is called every time a new statement is added to the statement list of a case statement.<br />
<br />
And what does the method do? Well, because of some issues with the visibility of various scanner and parser variables we split the code between the scanner and parser. The scanner code saves scanner state, calls the parser code and restores the scanner state on return.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixUtppCazAaud91UTpz221gWCDSNDkTz3sp3t5hpzDqHsinp3bE9eCwU8_2GSb0lBgCglrdK-6hG-qtSonvUxTX1wkX_N_kLLF-5SV1ZsdiIW7PekuJJxcL3A2ZIrNHaA7suXBOpv14WCq/s1600/SaveState.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixUtppCazAaud91UTpz221gWCDSNDkTz3sp3t5hpzDqHsinp3bE9eCwU8_2GSb0lBgCglrdK-6hG-qtSonvUxTX1wkX_N_kLLF-5SV1ZsdiIW7PekuJJxcL3A2ZIrNHaA7suXBOpv14WCq/s320/SaveState.PNG" width="320" /></a></div>
This code calls the <em>Parser</em> method <em>DoAdHocLookahead</em>. This method does the heavy lifting. [However, see the footnote for a bit more on what "save scanner state" implies.]<br />
<br />
<em>DoAdHocLookahead </em>will repeatedly call <em>yylex</em> until a decision may be made. First of all however note that the conflict only arises if the next symbol fetched is an identifier. For our simple case we exit the loop as soon as we find a symbol that can belong to an assignment but not to a case label list, or to a case label list but not to an assignment. Simple. But how do we "point the parser in the right direction" before parsing resumes?<br />
<br />
The simplest way of forcing the parser to do the right thing is to overwrite the parser's <em>NextToken</em> variable. We use a dummy symbol that otherwise does not occur in the grammar, and rewrite the grammar to rely on that symbol to separate the<em> case_element</em> items. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGGfYdBpVX_ptEDsZ6MdWvdvR06wNSzkkVc4LJhzvEg1YpYZwbJ79OQsJ0b73uao8CUbSX1oDvhr70yfgwdDC-iK2eMZr_rBujXPhmnel8zxwopE8lzUwwYefRSd5ZX4kGF7-VJEpIFwtp/s1600/CaseElementList.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="69" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGGfYdBpVX_ptEDsZ6MdWvdvR06wNSzkkVc4LJhzvEg1YpYZwbJ79OQsJ0b73uao8CUbSX1oDvhr70yfgwdDC-iK2eMZr_rBujXPhmnel8zxwopE8lzUwwYefRSd5ZX4kGF7-VJEpIFwtp/s320/CaseElementList.PNG" width="320" /></a></div>
So that is about all there is to it. Overwriting the <em>NextToken</em> value in the current version of <em>GPPG</em> requires diving into the code of <em>ShiftReduceParser</em> to make the <em>NextToken</em> variable accessible. The next refresh of the tool will give the variable protected accessibility to facilitate this technique.<br />
<br />
The complete code for this example is available at <a href="http://gppg.codeplex.com/">http://gppg.codeplex.com</a><br />
<br />
Part two will deal with a much harder example: one of the classic syntactic ambiguities of C#.<br />
<br />
<strong>Footnote: What "save scanner state" implies, and various disclaimers.</strong><br />
In the example here the saved state in the scanner is the current "lookahead" character, the current line and column count, and the buffer position. This works for this particular scanner, but others may need more. In particular, if a scanner has a number of start states, the start state might need to be saved and restored also. Using the YY_START property is the easiest way of doing this, since it automatically resets the start ordinal <em>and</em> the start state. Of course, for a scanner that has a stack of start ordinals the whole stack needs to be saved and restored.<br />
<br />
Furthermore, if the scanner detects and reports lexical errors, errors that are backed-up over may be reported twice. Thus if you wish to be really picky, you might want to save the high tide on the error buffer and restore that as well.<br />
<br />
Alternative strategies will be discussed in the second part.<br />
<br />
<br />
<br />k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-55053174316402696232011-11-16T19:04:00.000-08:002011-11-16T19:04:02.073-08:00Another Trick for Debugging GPCP ProgramsPrograms compiled using the CLR version of Gardens Point Component Pascal can be debugged using the Visual Studio debugger, as described in an earlier blog.<br />
<br />
This short blog deals with a litte trick to exploit one of the "features" of Visual Studio (VS).<br />
<br />
When a program being debugged stops at a breakpoint, the local variable window of VS lists the<br />
variables that are in scope in the selected frame of the call chain. Other fields of the window show the <em>declared type</em> of the variable, and its concrete (actual) type. There is also a field that shows the <em>value</em> of the variable. This is handy for numeric scalars, for example.<br />
<br />
However, for other types Visual Studio fills in the value field by calling the virtual <em>ToString()</em> method on the object. The "<em>miranda</em>" option, for types that do not have a <em>ToString</em> of their own, is to use the method inherited from Object, which returns the name of the concrete type. This is what happens by default for Component Pascal objects.<br />
<br />
So, the question is, can we make gpcp types display useful value information in the VS value field? The answer is, of course, yes. But there is a trick to it. First a couple of things that don't quite work.<br />
<br />
Suppose that we have a pointer to record type, with no declared super type.<br />
<br />
<span style="font-family: "Courier New", Courier, monospace;">TYPE Foo = POINTER TO RECORD ... END; </span><br />
<br />
The super type will be ANYREC, which is implemented as System.Object in the CLR version of gpcp. However, the compiler will not import the inherited <em>System.Object</em> methods so attempts to override the <em>ToString</em> method will fail.<br />
<br />
<span style="font-family: "Courier New", Courier, monospace;">PROCEDURE (f : Foo)ToString() : Sys.String;</span><br />
<br />
will get a "this is not at override, you must use NEW" error. And if you try<br />
<br />
<span style="font-family: "Courier New", Courier, monospace;">PROCEDURE (f : Foo)ToString() : Sys.String,NEW;</span><br />
<br />
the method actually will be "new", and will not override the virtual <em>Object</em> method. The trick is to make the compiler read in the inherited methods. The following works:<br />
<br />
<span style="font-family: "Courier New", Courier, monospace;">IMPORT Sys := mscorlib_System, ... ;</span><br />
<span style="font-family: "Courier New", Courier, monospace;">...</span><br />
<span style="font-family: "Courier New", Courier, monospace;">TYPE Foo = POINTER TO RECORD (Sys.Object) ... END;</span><br />
<span style="font-family: Courier New;">...</span><br />
<span style="font-family: Courier New;">PROCEDURE (f : Foo)ToString() : Sys.String; </span><br />
<span style="font-family: Courier New;">BEGIN ... END ToString;</span><br />
<br />
<span style="font-family: inherit;">Using <em>RTS.NativeObject</em> instead of <em>Sys.Object</em> and <em>RTS.NativeString</em> instead of <em>Sys.String</em> works also.</span><br />
<br />
There is another place where the implicit use of <em>ToString()</em> occurs in the .NET framework. The format methods like <em>String.Format</em> and <em>WriteLine</em> take a format string and an array of <em>Object</em>. Each object in the array is converted with <em>ToString</em> and substituted into the numbered position in the format string.<br />
<br />
Thus to access the format methods of the .NET libraries it is necessary to insert the values into an array of system objects. The rules are as follows:<br />
<ul><li>For record and pointer to record types, declare a <em>ToString</em> method, as described.</li>
<li>For scalar objects, like integers or reals, it is necessary to resort to the manual method of calling <em>Sys.Convert.ToString</em> on the scalar value. C#, which has implicit boxing, hides this away, but Component Pascal's BOX extension does not apply to scalars.</li>
</ul>Here is an example of a <em>ToString</em> method for a Coord type which has x and y values and an optional name s for each point. We assume that the name is an open array of char, <em>RTS.CharOpen</em>.<br />
<br />
<span lang="EN-AU"><span style="font-family: "Courier New", Courier, monospace;">PROCEDURE (pt : Coord)ToString() : Sys.String;</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> CONST format = "{0}{{{1},{2}}}";</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> VAR tag : Sys.String;</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> BEGIN</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> IF pt.s # NIL THEN tag := MKSTR(pt.s^) ELSE tag := "anon" END;</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> RETURN Sys.String.Format(format, </span><br />
<span style="font-family: "Courier New", Courier, monospace;"> tag, Sys.Convert.ToString(pt.x), Sys.Convert.ToString(pt.y));</span><br />
<span style="font-family: "Courier New", Courier, monospace;"> END ToString;</span></span><br />
Note that in this example we are using the fact that there is a Format method that takes three single objects following the string. We could have used the (string, object[]) version instead, and would have had to use the array version if there were more than three objects to fit in the format.<br />
<br />
The Coord objects will now display helpfully in VS, and can in turn be directly used in other format strings.k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-55826418757424122382011-04-27T02:01:00.000-07:002011-11-16T20:56:33.232-08:00Why a Regular Expression Engine is not a Scanner Generator(Updated 17 November 2011. Regular expressions corrected to use backslash instead of slash.)<br />
<br />
I enjoy getting queries from the users of my software. Sure, it takes up time but very often the enquiry highlights a common misconception, sometimes a point that should be clarified in the documentation, and sometimes (hiss, boo) reports a real bug.<br />
<br />
A couple of weeks ago I received a query about <em>GPLEX</em> that brought a couple of important issues into focus. The user was creating a scanner, and was having trouble with the generated grammar not accepting certain apparently legal literal string tokens. The grammar for literal strings had single quote characters <strong>'</strong> as delimters, and escaped single quote characters with the "backslash" character. Naturally enough literal backslashes were also backslash escaped. The user had tried the "obvious" regular expression (RE)<br />
<span style="font-family: "Courier New", Courier, monospace;"><strong><span style="color: black;"> '<span style="font-family: "Courier New", Courier, monospace;">(</span></span><span style="color: black; font-family: "Courier New", Courier, monospace;">\\.|[^'])*'</span></strong> </span><br />
<span style="font-family: inherit;">which, in words, reads "<em>single quote character followed by zero or more repetitions of EITHER a backslash followed by any character except newline OR any character except single quote, with the whole string terminated by another single quote character</em>".</span><br />
This did not work on a set of simple test strings, so the user had tried a sequence of progressively more complicated RE without success.<br />
<br />
So, the real query was:<br />
<ul><li>Why does not the obvious RE work?</li>
<li>User had tried the obvious RE against the .NET RE engine and all was ok.</li>
<li>Rumour has it that <em>GPLEX</em> uses its own RE-engine, could this be the problem?</li>
</ul>Well, <em>GPLEX</em> does have a clean-room RE-engine and hell I have been known to have bugs in my software, so I figured I better have close look. I fired up <em>GPLEX</em> and, sure enough, grammars using the obvious RE did not work correctly.<br />
<br />
Step 2 was to fire up my special "tracing" version of <em>GPLEX</em> which creates a scanner that chatters on about what character it has just read, what the next state will be, and when it hits an accept state. Following through the problematic example string<br />
<span style="font-family: "Courier New", Courier, monospace;"><strong>'\\'</strong></span><br />
immediately showed what the problem was. The scanner was in the accept state when it hit the second single quote, but continued reading characters looking for an even longer valid string. Suddenly the problem is obvious: suppose we name the two alternates of the closure as (A) "<em>any backslash escaped character except newline</em>" and (B) "<em>any character except single quote</em>". Now consider the first backslash character in the example string. It qualifies either as an A or a B. If it is an A then the next two characters are an escaped single quote (thus NOT terminating the string) else if it is the start of an escaped backslash then the second single quote terminates the string! What the scanner does is to mark the position of the second single quote and continue to munch characters. If the input ends before another single quote is found, the scanner backs up to the marked position, report a valid string, and continues scanning.<br />
<br />
Just for the record the correct regular expression is:<br />
<span style="font-family: "Courier New", Courier, monospace;"><strong>'(\\.|[^'\\])*'</strong></span><br />
although for most uses it would be best to exclude newlines inside of strings at the RE level to give better diagnostics. In any case, this simple RE worked for the enquirer.<br />
<br />
So what do we learn from this whole business? Well, I guess that it is a reminder that regular expressions are still tricky, especially when there are overlapping alternates. However, this only answers the question in bullet point one. What about point two and point three?<br />
<br />
Well, the clue is in the title of this blog. A scanner is not just a regular expression engine. A scanner tries to pick out tokens from a continuing stream of input. It tries to find the longest possible match and backs up when a match fails. By contrast an RE engine tries to match a single pattern, and generally does lots of complicated other things like find sub-patterns within a matching overall pattern. Neither one really does the work of the other, except in almost trivially simple cases.<br />
<br />
In this case, I believe that what happened was that what I have been calling the "obvious RE" was extracted from the scanner specification and tried against the .NET RE engine. It worked.<br />
<br />
The funny thing is that had the obvious RE been extracted from the scanner spec and tried as a single pattern in <em>GPLEX</em> it would have worked perfectly when tried out one sample string at a time. Each input containing a literal string like the one that put its finger on the error would have munched extra characters until it met the end of the input, backed up and reported a correct string just as the .NET RE engine did.<br />
<br />
I am glad that my enquirer did not do this, as it would have caused even more confusion.<br />
<br />
Cheers<br />
Johnk_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-41403799635544304782011-03-03T23:27:00.000-08:002011-03-03T23:59:00.975-08:00Debugging GPCP with Visual StudioBack in the early days of Visual Studio.<em>NET</em> the <em>IDE</em> shipped with a program called <em>DbgCLR</em>, which was found in a <em>GuiDebug</em> directory in the VS distribution. Since VS-2008, this program is not present.<br />
<br />
When run with the default /debug option <em>GPCP</em> emits a "pdb" file for use by the VS debugger. However, since there is no language integration of current versions of Component Pascal into VS debugging CP programs with the VS debugger requires some ingenuity. This blog deals with how I do it. It is entirely possible that there are other ways of achieving the same effect, like the man said "your mileage may vary".<br />
<br />
The key idea is to trick VS into thinking that you are debugging a C# project, which just happens to jump out into other language modules. In this case we will tell VS that the aim of the project is to compile the runtime support file <em>RTS.cs</em> to produce <em>RTS.dll</em>. <br />
<br />
<h2>Simple Example, Debugging Hello World</h2>(1) We begin by creating a directory, called <em>DebugHello</em>, say, and place in the directory our source file <em>Hello.cp</em> and the runtime system source <em>RTS.cs</em>. The runtime file is found under <br />
<span style="font-family: "Courier New", Courier, monospace;">C:\gpcp-CLR\source\libs\CSharp</span> <br />
in my installation.<br />
(2) Start up Visual Studio and select <em>File > New > Project from Existing Code. </em>When the wizard starts, choose Visual C# as the project type,and click <em>Next</em>.<br />
(3) At the next screen of the wizard we point VS at the directory that was created at step(1), choose a name for the project, <em>DebugHello</em>, and set the output type as <em>Class Library</em>. Click <em>Finish</em>.<br />
(4) At this stage we go into the <em>Solution Explorer</em> pane, right click on <em>DebugHello</em>, and select <em>Properties</em>. In the Properties page choose the Applications tab, and set the Assembly name to "<em>RTS</em>". Probably nothing else needs to be set in this tab. We will assume that the default output path <span style="font-family: "Courier New", Courier, monospace;">bin\debug</span> is left unchanged in the <em>Build</em> tab.<br />
(5) In the <em>Build Events</em> tab we want to create a pre-build event to compile <em>Hello.cp. </em>We could do this with a simple command line call to <em>gpcp</em>, but for multi-source-file compilations we need to control the place where the symbol and output files go so we will use the command<br />
<span style="font-family: "Courier New", Courier, monospace;">call $(ProjectDir)PreBuild.bat $(ProjectDir)</span><br />
The content of this batch file is discussed later. Once we have the batch file we can invoke <em>Build > Build Solution</em>. The bin\debug directory will now contain <em>RTS.dll, RTS.pdb, Hello.exe, Hello.pdb</em>.<br />
(6) In the <em>Debug</em> tab we select the <em>Start external program</em> radio button, and use the file browser to select the <em>Hello.exe</em> executable in the output directory. This is all that is required for "Hello World", However for programs that take input we will also need to set the working directory and the command line arguments in this tab.<br />
We may now open hello.cp in the Solution Explorer pane, place a breakpoint at <em>BEGIN </em>and start debugging!<br />
<br />
<h2>The PreBuild.bat File</h2>We want to set the working directory so that the symbol file, IL file and list file (if any) are placed in the source directory. However, we want the output files to go to the bin\debug directory. The source directory is passed as an argument to the batch file by the macro $(ProjectDir) so the batch file reads --<br />
<blockquote><span style="font-family: "Courier New", Courier, monospace;">cd %1%</span><br />
<span style="font-family: "Courier New", Courier, monospace;">gpcp /bindir=bin\debug hello.cp</span></blockquote><h2>A Bigger Example - Debugging GPCP</h2>A (much) larger example would be debugging GPCP itself. This example illustrates a number of extra issues typical of larger programs. Here are the steps equivalent to those of the simple example.<br />
<br />
(1) We begin by creating a directory, called <em>DebugGpcp</em>, say, and place in the directory our source file runtime system source <em>RTS.cs</em>. As well, we copy <em>ALL</em> of the cp sources from the source\gpcp directory of the distribution.<br />
(2) Start up Visual Studio and select <em>File > New > Project from Existing Code. </em>When the wizard starts, choose Visual C# as the project type,and click <em>Next</em>.<br />
(3) At the next screen of the wizard we point VS at the directory that was created at step(1), choose a name for the project, <em>DebugGpcp</em>, and set the output type as <em>Class Library</em>. Click <em>Finish</em>.<br />
(4) At this stage we go into the <em>Solution Explorer</em> pane, right click on <em>DebugGpcp</em>, and select <em>Properties</em>. In the Properties page choose the Applications tab, and set the Assembly name to "<em>RTS</em>". Probably nothing else needs to be set in this tab. We will assume that the default output path bin\debug is left unchanged in the <em>Build</em> tab.<br />
(5) In the <em>Build Events</em> tab we want to create a pre-build event to compile <em>gpcp</em> using the command<br />
<span style="font-family: "Courier New", Courier, monospace;">call $(ProjectDir)PreBuild.bat $(ProjectDir)</span><br />
The content of this batch file is discussed later. Once we have the batch file we can invoke <em>Build > Build Solution</em>. The bin\debug directory will now contain <em>RTS.dll, RTS.pdb </em>and lots of other files.<br />
(6) In the <em>Debug</em> tab we select the <em>Start external program</em> radio button, and use the file browser to select the <em>gpcp.exe</em> executable in the output directory. We also set the working directory and the command line arguments to compile the source file that we want <em>GPCP</em> to compile.<br />
<blockquote></blockquote><h2>The PreBuild.bat File for GPCP</h2>For a project as large as gpcp we want to only compile the files that have been modified, so we will use the <i>CPMake</i> tool here. <i>CPMake</i> compiles only changed files, or those that depend on changed files, and also compiles all the sources in the dependence order. If a complete recompile is wanted the /all option of <em>CPMake</em> does this. Our required batch file is --<br />
<span style="font-family: "Courier New", Courier, monospace;"></span><br />
<blockquote><span style="font-family: "Courier New", Courier, monospace;">cd %1%</span><br />
<span style="font-family: "Courier New", Courier, monospace;">CPMake /bindir=bin\debug gpcp</span></blockquote>Note that the argument to <em>CPMake</em> is the <strong>base name</strong> of the program, without file extension.<br />
<h2>Extra Libraries</h2><em>GPCP</em> requires a number of additional libraries, as well as <em>RTS</em>. All of these dynamic link libraries need to be copied into the bin\debug directory in order for the program to work. Here are the libraries that <em>GPCP</em> relies on. Some are standard libraries in the distribution, others are components of <em>GPCP</em> that are implemented in C#.<br />
<span style="font-family: "Courier New", Courier, monospace;"></span><br />
<blockquote><span style="font-family: "Courier New", Courier, monospace;">GPBinFiles.dll</span><br />
<span style="font-family: "Courier New", Courier, monospace;">GPFiles.dll</span><br />
<span style="font-family: "Courier New", Courier, monospace;">GPTextFiles.dll</span><br />
<span style="font-family: "Courier New", Courier, monospace;">MsilAsm.dll</span><br />
<span style="font-family: "Courier New", Courier, monospace;">QUT.PERWAPI.dll</span><br />
<span style="font-family: "Courier New", Courier, monospace;">QUT.SymbolRW.dll</span></blockquote><br />
If we want to step into these libraries it is necessary to copy the corresponding pdb files as well. Since these libraries should remain unchanged during any debugging of <em>GPCP</em>, these files are not checked for recompilation during the debugging process.<br />
<br />
And that is all there is to it.<br />
<br />
Happy debugging!k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-18336400455655261482011-02-10T22:14:00.000-08:002011-02-10T22:14:16.452-08:00YACC and the Perils of Operator Precedence<h2>Background</h2><em>A little while ago I received a query about a very simple grammar example which behaved in an unexpected fashion. The issue was that the parser worked correctly in its original form. However, there was a common substring in two of the productions. When the common substring was replaced by a new non-terminal, the resulting parser rejected certain clearly legal sentences. The question was: surely these two grammars should be equivalent? </em><br />
<br />
<em>I dug around with the example and was forcibly reminded of some "features" of YACC-like parser generators that I had long forgotten. So here is an even simpler grammar that also behaves paradoxically. It is also a tutorial on how the operator precedence algorithm </em>really<em> works.</em><br />
<br />
<em>For those whose memory of shift/reduce parsing theory is even rustier than mine, I have added a page called the Shift/Reduce Parsing Refresher. It is blatantly lifted from Appendix-9 of the GPPG documentation at </em><a href="http://gppg.codeplex.com/">http://gppg.codeplex.com/</a>.<br />
<br />
<h2>Introduction</h2>The parsers generated by <i>GPPG</i> are similar in their functionality to those produce by the traditional <i>YACC</i> tools. That is, as well as using <i>LALR(1)</i> states to guide the parse they use operator precedence. Operator Precedence allows convenient handling of such things as expression grammars with large numbers of operators.<br />
<br />
<div></div>When programming languages are specified for their users, that is, for <i>programmers</i>, it is common to specify the semantics of expressions by specifying the relative precedence and associativity of the operators. Thus it is tempting to use this same mechanism for defining the grammar for <i>implementors</i>, and this is what <i>YACC</i>-like tools try to do.<br />
<br />
<div></div>There is another way of defining the way a parser will work with an expression grammar: by defining a number of different syntax non-terminals, traditionally given names like <i>Expression, Simple Expression, Term, Factor</i> and <i>Primary</i>. Going down this second path has the disadvantage that two ways of specifying the same semantic may then be required, because the multiple non-terminal method seems to be less understandable for programmers. Unfortunately, there are disadvantages in using the operator precedence functionality of <i>YACC</i>.<br />
<br />
<div></div>This episode of the blog is concerned with understanding some of the subtleties.<br />
<br />
<h2>The Breathtakingly Simple Grammar</h2>The running example grammar is very simple. A single operator, and only two tokens.<br />
<blockquote>Goal : '+' Exp ;<br />
Exp : 'a' | Exp '+' Exp ;</blockquote>As it turns out, this grammar specifies a regular language, the same infinite set of strings that are generated by the regular expression -<br />
<blockquote>(\+a)+ or, if you prefer, the set {+a, +a+a, +a+a+a, ... }</blockquote>What could possible go wrong with such a simple grammar? And if something can go wrong, how about a more complicated case with (say) 17 levels of precedence?<br />
<br />
<h2>Preliminary Examination</h2>The example grammar is actually ambiguous. The shortest valid string with an ambiguous parse is "+a+a+a", which may be parsed either as though it was meant to be understood as "+((a+a)+a)" or as "+(a+(a+a))". <br />
<br />
No big deal, we didn't specify whether the binary "+" was left or right associative.<br />
<br />
However, before we go on let's check: send this grammar to <i>GPPG</i>, and it generates a parser, but also a warning which with the /verbose options reads -<br />
<blockquote><strong>Shift/Reduce conflict</strong><br />
Shift "'+'" : State-6 -> State-5<br />
Reduce 4 : Exp -> Exp, '+', Exp</blockquote>This tells us that in state-6 of the automaton there are two possible actions when the next symbol is '+' : shifting the plus sign and going to state-5, or reducing by the production<br />
<blockquote>Exp : Exp '=' Exp </blockquote>This is nice, we see that the shift choice would lead to the right-associative semantic for the binary '+', while the reduce choice would lead to the left-associative semantic.<br />
<br />
<h2>Using YACC Disambiguation - Associativity</h2>The reason that the shift/reduce conflict message is only a <i>warning</i> is because <i>YACC</i> and its look-alikes always break such conflicts by shifting, in preference to reducing. In the case of the running example, this would lead to the right-associative interpretation. The warning is simply a heads-up to the user, so they may check if the "prefer shifting" default gives the correct behavior.<br />
<div></div>It is possible to make the warning go away by explicitly stating the associativity of the tokens. In the example we may begin with -<br />
<blockquote><span style="font-family: "Courier New", Courier, monospace;">%token 'a'</span><br />
<span style="font-family: "Courier New", Courier, monospace;">%left '+'</span></blockquote><i>GPPG</i> no longer gives a warning, since we have specified that we want '+' to be left associative. Let us see how this actually works. If <i>GPPG</i> is run with the /report option the part of the html report with the details for state-6 (the state that previously had the shift/reduce conflict)is -<br />
<blockquote><b>State-6</b><br />
<b>Kernel Items</b> <br />
4 Exp : Exp '+' Exp . <br />
-lookahead: { '+', EOF } <br />
4 Exp : Exp . '+' Exp <br />
<b>Parser Actions</b><br />
'+' reduce using rule 4 (Exp)<br />
EOF reduce using rule 4 (Exp) </blockquote>Recall that the dot in the kernel items denotes the point in the rule that the recognition has reached. Unusually in this case the two items in the kernel both belong to the same production rule, rule-4. The first item has the dot at the end, indicating that this production has been fully recognized, and may be reduced. The lookahead set that support the reduction is shown within braces. The second item has the dot internally, and shows that a following '+' would continue with the recognition of this production rule. The parser action table shows that with the <span style="font-family: "Courier New", Courier, monospace;">%left</span> declaration the first item (reduce by rule-4)is selected no matter whether the lookahead is another plus sign or the end of file.<br />
<div></div>Alternatively, if the declarations say -<br />
<blockquote><span style="font-family: "Courier New", Courier, monospace;">%token 'a'</span><br />
<span style="font-family: "Courier New", Courier, monospace;">%right '+'</span></blockquote>then the same state will have the report -<br />
<blockquote><b>State-6</b><br />
<b>Kernel Items</b> <br />
4 Exp : Exp '+' Exp . <br />
-lookahead: { '+', EOF } <br />
4 Exp : Exp . '+' Exp <br />
<b>Parser Actions</b><br />
'+' shift, and go to state 5<br />
EOF reduce using rule 4 (Exp) </blockquote>which is the same behavior as with no associativity specified, but without the warning message.<br />
<div></div>This seems to have solved the problem, with a clear way to choose the desired behavior, and that pesky warning suppressed.<br />
<br />
<h2>Using YACC Disambiguation - Operator Precedence</h2>As well as specifying the associativity, <i>YACC</i> allows for the relative precedence of operators to be controlled. The simple rule is that all the tokens associated with a particular <span style="font-family: "Courier New", Courier, monospace;">%token, %left or %right</span> declaration have the same precedence, but those in later groups take precedence over those in earlier declarations. Thus if we had not only a '+' operator, but also a '*' operator then in conventional arithmetic grammars the multiply operator would occur in a later group than the additive operator.<br />
<br />
<h2>How Operator Precedence Actually Works</h2>Operator precedence is used to silently resolve shift/reduce conflicts by comparing the precedence of the completed rule with the precedence of the lookahead token. Here are the rules for determining precedence of each rule and each token -<br />
<ul><li>All token precedences are greater than zero.</li>
<li>The precedence of a token is determined by the position of the declaration group in which it occurs. Groups declared later in the definitions section have higher precedence.</li>
<li>The precedence of a production is that given by the "<span style="font-family: "Courier New", Courier, monospace;">%prec</span> <em>TokenName</em>" declaration, if there is one.</li>
<li>Otherwise, the precedence of a production is that of the rightmost terminal symbol in the right-hand-side, if there are any terminal symbols in the right-handside.</li>
<li>Otherwise the production has zero precedence.</li>
</ul>Having established the precedences, here are the rules by which they are compared -<br />
<ul><li>If the precedence of the production is higher than the precedence of the lookahead token, then reduce.</li>
<li>Otherwise, if the precedence of the lookahead token is higher than the precedence of the production, then shift.</li>
<li>If the precedences are equal and the associativity of the lookahead token is left then reduce.</li>
<li>If the precedences are equal and the associativity of the lookahead token is right then shift.</li>
</ul>These rules are applied during the generation of the parsing tables, not at parser runtime.<br />
<div></div>Finally, here are the rules that <i>GPPG</i> uses to decide when to issue conflict diagnostics -<br />
<ul><li>If an automaton state has two or more productions that can be reduced, that is, two or more items with the “dot” at the end, then issue a reduce/reduce conflict warning.</li>
<li>If an automaton state has a reduction and also possible shift actions, then the conflicts are resolved as detailed above. However, if the conflict is resolved in favor of shifting because the production has zero precedence, then issue a shift/reduce conflict warning.</li>
</ul>The conflict-reporting rules are similar for most <i>YACC</i>-like parser generators.<br />
<br />
<h2>So What is the Problem?</h2>The operator precedence facilities of these kinds of parser generators are designed for operator-based grammars, and are reasonably convenient for this purpose. However, like most simple ideas they have a dark side of unintended consequences.<br />
<div></div>The most striking of these is the failure of one of the most fundamental properties of context free grammars. Consider our original grammar -<br />
<blockquote>Goal : '+' Exp ;<br />
Exp : 'a' | Exp '+' Exp ;</blockquote>Notice that the pattern "'+' Exp" occurs in two places. We might therefore be tempted to define a new non-terminal symbol for this pattern. Call it "Rest", say. Now we have the new grammar -<br />
<blockquote>Goal : Rest ;<br />
Rest : '+' Exp ;<br />
Exp : 'a' | Exp Rest ;</blockquote>and this grammar defines precisely the same language as the previous case. <br />
<div></div>If we send this grammar to <i>GPPG</i>, it generates a parser and, as before, also generates a warning. In this case the warning reads -<br />
<blockquote><strong>Shift/Reduce conflict</strong><br />
Shift "'+'" : State-5 -> State-4<br />
Reduce 5 : Rest -> '+', Exp</blockquote>Following our experience with the original, unfactorised version of this grammar we might try to declare the '+' operator <span style="font-family: "Courier New", Courier, monospace;">%right</span>. As it turns out, this choice leads to a parser which rejects the simple (and legal) input "+a+a".<br />
<div></div>What is going on here? How is it possible that the parser is rejecting a legal input? <br />
<div></div>To explain this weird result we need to look at the parser's state-5, the site of the shift/reduce conflict. Here are the kernel items in this state -<br />
<blockquote><b>State-5</b><br />
<b>Kernel Items</b> <br />
5 Rest: '+' Exp . <br />
-lookahead: { '+', EOF } <br />
4 Exp : Exp . Rest</blockquote>According to our rules, in this state the precedence of the rule-5 is the same as '+'. Since we have defined '+' as having <span style="font-family: "Courier New", Courier, monospace;">%right</span> precedence the parser reduces, and the input reduces to the goal symbol before the EOF is reached.<br />
<br />
<div></div><h2>The Message in all of this</h2>So to return to the question in the background section, are these two grammars equivalent? Well, yes the <i>grammars</i> are equivalent, but the two grammars do not lead to equivalent <i>parsers</i>.<br />
<br />
The take-home message in all of this is that operator precedence is a fragile construct. It is normally safe, and sometimes even helpful to factor out common sub-expressions in production right-hand-sides. However two things happen when this abstraction mechanism is used. Firstly, kernel item-sets may become merged, introducing new shift/reduce conflicts; secondly, if the abstraction removes the "last non-terminal in the right hand side" then the precedence of the <i>rule</i> may be modified and parsing behavior changed.<br />
<br />
Some folks think that this is a sufficiently serious problem that they shun the use of operator precedence entirely and advocate instead the use of multi-level grammars.<br />
<br />
Like the man said "yer pays yer money, and yer takes yer pick".k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0tag:blogger.com,1999:blog-1238546302574235363.post-4597125329076834012011-02-01T17:50:00.000-08:002011-02-01T17:50:18.098-08:00What it's all AboutI plan to use this blog to talk about Compilers, including the ones that I wrote in the past or am writing now; Software Tools, including the ones that I wrote or maintain. I may also indulge in an old-fashioned rant from time to time.<br />
<br />
Most of my software that is in the public domain is available on CodePlex <a href="http://www.codeplex.com/">http://www.codeplex.com/</a> although there are some other versions available somewhere inside the QUT website. Where both are available, the CodePlex versions are the current ones, and are under source code control (I use TortoiseSVN by choice).<br />
<br />
Most of the software has been released under the <em>Gardens Point</em> name, chosen for the particular bend in the Brisbane River where the main campus of Queensland University of Technology is situated. And yes that is the same river that flooded in January 2011, inundating about 11 thousand houses. <br />
<br />
<span style="font-size: large;"><strong>What Software Tools?</strong></span><br />
The software tools that I maintain on CodePlex are:<br />
<ul><li>Gardens Point LEX (gplex). This is a lexical scanner generator which produces C# as output. The tool has lots of options, and can create scanners that deal with the whole Unicode character set.</li>
<li>Gardens Point Parser Generator (gppg). This is a parser generator which produces C# as output. The tool has an input language very close to traditional YACC, and produces an LALR(1) parser with the usual disambiguation rules.</li>
<li>PE Reader/Writer API (perwapi). This is a reader/writer for Microsoft .NET PE files. It has been used by a number of compilers including GPCP and GPM-CLR. The tool does not deal with the complete world of the latest version of the .NET framework. I tend to do enhancements/bug-fixes on an on-demand basis.</li>
<li>There is also a prototype tree rewriter-generator, based on Bottom Up Tree Rewriting theory. This is almost ready for release, and I intend to blog about its virtues and challenges quite a bit.</li>
</ul><span style="font-size: large;"><strong>What Compilers?</strong></span><br />
<ul><li>Gardens Point Component Pascal (gpcp). This is the demonstration compiler that I wrote in year-2000 for the initial release of the .NET framework. It also formed the running example for my book <em>Compiling for the .NET Common Language Runtime</em>, Prentice-Hall 2002. There is also a JVM version of this compiler, but it lags behind the CLR version by a couple of releases.</li>
<li>Gardens Point Modula-2 (CLR version). GPM was a compiler family that was built at QUT over about a ten year period. Initially a commercial product, it was available on a total of about eleven machine architecture/operating system combinations. Only a freeware Linux/x86 version appears to be still in use, and development stopped in about 1995. However, a CLR version was created in 2003, as an example of how to write a CLR compiler for a non-typesafe, non-object oriented language. GPM-CLR is effectively a new backend for the GPM frontend that all versions of GPM used in 1995. As a side-effect of this open source experiment the previously non-open-source front-end is public. I use GPM-CLR myself when I need to run some dusty old Modula-2 program.</li>
</ul><span style="font-size: large;"><strong>Stuff in the Works</strong></span><br />
<ul><li>I am currently writing a C# version-3 backend for the QUT C# typechecker. The typechecker is a framework for research on type-system extensions to C#. It was used for Andrew Craik's excellent PhD thesis. The backend is a platform for exploring issues to do with code-generation for .NET. It is, if you wish, filling in all the bits that are missing from my <em>Compiling for the .NET Common Language Runtime</em>. Some were missing because they hadn't been invented in 2001, and others due to sheer ignorance on my part.</li>
<li>As well as the bottom-up rewriter-generator mentioned earlier, I am interested in plain ordinary code selection again (as opposed to code generation for virtual machines such as JVM and CLR). Code for the SnapDragon is a gleam in the eye.</li>
</ul>k_john_goughhttp://www.blogger.com/profile/05866587102718103047noreply@blogger.com0