Wednesday, March 20, 2013

Procedure Types and Managed Execution (part 2)

Summary

This is part two of a short series on the implementation of procedure types and variables on top of the .NET framework, or the Java Virtual Machine.  In this part I consider alternative mechanisms for implementing procedure types on each platform.

The .NET Framework

On the .NET framework the delegate type appears to be the obvious choice.  After all, the construct was designed for exactly this job, and as the basis of the .NET event-handling model.  However, there is nothing to stop us from using a single-member base class or interface, just as we are (or rather were) forced to do on the JVM.  Which is better? 

The question of which mechanism is better is not one that may be answered on the basis of inclination or traditional prejudice.  What we need is some evidence and, to be sure, evidence from a range of different usage patterns.  However, let us start with the simple case of a simple procedure (a static method) with a simple signature.
Here are three ways of calling the static method Hello, using three different mechanisms for representing procedure values. First we declare the StringToVoid delegate type, and create an instance of the delegate type that wraps the method Hello.

Next, an interface is declared with a single member "Invoke" of the string-to-void signature.  This interface is implemented by the IWrapHello class, which wraps the Hello method.  Essentially the same semantics are achieved by the abstract class S2V and its subclass WrapHello. The three "function values" are created, and each is called in turn.
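(The following sketch reconstructs the three mechanisms with the type names used above; the interface name IStringToVoid is mine, and the original listing differs in detail.)

using System;

// Mechanism 1: a delegate type with the string-to-void signature.
delegate void StringToVoid(string s);

// Mechanism 2: an interface with a single member "Invoke".
interface IStringToVoid { void Invoke(string s); }
class IWrapHello : IStringToVoid {
    public void Invoke(string s) { Program.Hello(s); }
}

// Mechanism 3: an abstract single-member class, overridden for each value.
abstract class S2V { public abstract void Invoke(string s); }
class WrapHello : S2V {
    public override void Invoke(string s) { Program.Hello(s); }
}

class Program {
    public static void Hello(string s) { Console.WriteLine("Hello " + s); }

    static void Main() {
        StringToVoid f1 = new StringToVoid(Hello);  // delegate value
        IStringToVoid f2 = new IWrapHello();        // interface wrapper value
        S2V f3 = new WrapHello();                   // abstract class wrapper value

        f1("world");         // call via the delegate
        f2.Invoke("world");  // call via the interface
        f3.Invoke("world");  // call via the abstract class
    }
}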

So, a compiler could use any of these three mechanisms to represent types and values for static methods.  The amount of generated IL would be very similar in each case, and the details of the naming of the wrapper types would be hidden in the implementation.  But what are the comparative overheads of these three mechanisms?
.NET timing for various procedure variable implementations (see text).
Four different ways of implementing procedure variables were tried, with two different micro benchmarks. The first test was very simple: the static method that was encapsulated in the variable took a string argument and returned its length.  This variable was called in a loop which repeated about 10^6 times.  The idea was to make the body of the wrapped method so simple that small differences in the calling machinery would still be significant.  The results are shown in the left column of the table.

The results seem to indicate that the .NET delegate mechanism has a small time advantage over both a concrete sub-class overriding the abstract method of a single-member class and a class implementing the single member of an interface.  The other thing to note is the overhead of reflective invocation: about two orders of magnitude, for a simple method body.

Of course, one must be a little careful with a JIT compiler, as it may do special magic on loops that repeat many, many times.  The second example is designed to be slightly more realistic. The code uses a rather unusual coding pattern for implementing a finite state automaton (FSA). The state variable is a procedure variable holding the "next-state" function. The function is fed the next symbol, in this case a single bit, and returns the next-state function corresponding to the next state of the FSA.  The whole thing is attached to a pseudo-random binary sequence generator which repeats every (2^19 - 1) steps.  The details of generators of this kind are given in the Wikipedia article Linear_feedback_shift_register. In this case a Galois implementation was used.   This generator produces a string of 18 zero bits only once every cycle, which is what the FSA searches for.  The state is updated at every step, in an apparently random fashion, in an attempt to defeat any clever prediction strategy of the JIT.  The results are shown in the right-hand column of the table.  It may be seen that our three favored strategies have very comparable results.
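(For concreteness, the following is a minimal sketch of this pattern: a self-referential delegate type holds the next-state function, and a Galois LFSR drives it. The run-length automaton is simplified, and the tap mask 0x72000, taps 19, 18, 17 and 14, is my choice of maximal-length polynomial, not necessarily the benchmark's.)

using System;
using System.Diagnostics;

class FsaDemo {
    // A procedure type whose values return the next-state function:
    // each state of the FSA is a value of this delegate type.
    delegate StateFn StateFn(int bit);

    static StateFn[] states;  // states[k]: "k consecutive zeros seen so far"
    static int hits;

    static void Main() {
        const int N = 18;                       // looking for a run of 18 zeros
        states = new StateFn[N + 1];
        for (int k = 0; k < N; k++) {
            int next = k + 1;                   // capture per-state successor
            states[k] = bit => bit == 0 ? states[next] : states[0];
        }
        states[N] = bit => { hits++; return states[0]; };  // run found

        uint lfsr = 1;                          // Galois LFSR, period 2^19 - 1
        StateFn state = states[0];
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < (1 << 19) - 1; i++) {
            int bit = (int)(lfsr & 1);
            lfsr >>= 1;
            if (bit != 0) lfsr ^= 0x72000;      // assumed taps 19,18,17,14
            state = state(bit);                 // indirect call through the variable
        }
        sw.Stop();
        Console.WriteLine("runs found: {0}, elapsed: {1} ms",
            hits, sw.ElapsedMilliseconds);
    }
}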

The final column of the table, just for the record, lists the compatibility semantics of the procedure variable containers.  All three of our favored mechanisms enforce name-compatibility, which is not what we really want.  What we want is compatibility based on equality of the signatures of the encapsulated methods. Reflective invocation of the MethodInfo value has the right semantics, but the loss of efficiency is overwhelming.

The JVM Framework

The MethodHandle Class, and JSR292

Meanwhile, in the Java world, Java SE Version-7 introduced a number of changes designed to benefit all those dynamically typed languages that have been implemented to run on top of the JVM.  The process of proposing and implementing the changes took several years, and anyone interested in the process should Google "JSR292" to see the history.

In essence, the JSR 292 changes introduced one new byte code instruction, invokedynamic, and a few new classes in the library.  MethodHandle and MethodType are the two new classes of interest here, and both belong to the package java.lang.invoke.  

I shall say nothing this time about invokedynamic, although it is an interesting topic for another day.  Despite the impetus for the new instruction having come from the dynamically-typed language community, it turns out to be an ideal method for the implementation of the Lambda Abstractions that will launch with Java SE Version-8.  However the two new classes are directly relevant to this discussion.

MethodType is a descriptor for a method signature. The object factory for the class is passed the Class objects for the return type and the parameter types.  MethodHandle is a wrapper for a method value.  Objects of the type may be created by a factory method that is passed a Class object, the name of a method within that class, and a MethodType object.  The returned handle encapsulates the matching method.  If the java.lang.reflect.Method object of the target method is known, presumably by reflection, then the corresponding method handle is created by a call to the amusingly named "unreflect" method of java.lang.invoke.MethodHandles.Lookup.

Here are the corresponding results of the two test programs.
JVM timing for various procedure variable implementations (see text).

All of these measurements, for both frameworks, were performed on a four-year-old Intel E8400 processor running Windows-7 (32-bit).   You may note that the MethodHandle mechanism has low overheads, with similar per-call overhead to the previous implementation methods once the method bodies do any computation.  The rather lower overhead of reflective invocation of a java.lang.reflect.Method compared to that for a System.Reflection.MethodInfo is an interesting piece of trivia to store away in a spare neuron.

The "Anonymous interface method" row of the table is interesting, as this performs well on the example using the next-state-function pattern.  The gain here is that there is one less method call in the chain.  In effect, the anonymous method "inlines" the body of the next-state function rather than calling it.

A simple table-lookup implementation of the FSA is included here, as a reality check.

So here is a tentative conclusion:  The MethodHandle mechanism has comparable overhead to the other implementations of procedure variables.  And, as a bonus, it has the sensible structural-compatibility semantics that we would expect procedure types to display.

About the Next Episode

In the final episode of this series I shall look a bit closer at the case of method values that encapsulate both a method, and a receiver ("this" value) to invoke the method on.  I shall also look at using lambda abstractions to realize similar coding patterns to the use of procedure/method variables.

Late-Breaking News

My MethodHandle-based FSA benchmark crashes the Hotspot(r) Compiler for Windows X64. It runs fine on the 32-bit X86 Hotspot compiler.   On X64 it runs fine with java -Xint, or if the run is too short for Hotspot to attempt heroic optimizations.  Watch this space for further developments.






Friday, March 15, 2013

Procedure Types and Managed Execution

Summary

This is the first part of a multi-post exploration of the ways in which procedure variables, function pointers (call them what you will) are implemented in managed execution systems such as .NET and the JVM.  Particular implementation details for both .NET and the JVM type systems place limitations on the implementation of this functionality.

This first post briefly reviews the idea behind procedure types, and the implementation mechanisms on the .NET and JVM platforms.

The second part looks at various ways to get the same semantics on the (pre-version 7) Java Virtual Machine.  This second part also discusses the limitations of both managed execution systems in implementing procedure types.

A final part will look specifically at how the new functionality in JDK v7 changes the landscape, removing one of the limitations.  This final part will also look at implementations based on the use of lambda abstractions - available on the .NET platform currently, and coming soon in version 8 of Java.

Background

Many programming languages have a construct that allows a variable to be declared as containing a value that denotes a particular procedure.  The procedure denoted by the variable may be invoked by calling the procedure value.   Such procedure values may be created by associating a new value with some named procedure, and values may be copied and assigned using the conventional syntax of the particular language.

Compilers implement these constructs, in the simplest case, by making the procedure value be of pointer size, and using the entry-point-address of the bound procedure as the value.  This implementation mechanism is explicit in ANSI C, where such values are called "function pointers". Languages that permit nested procedures to be used as procedure values are a little more complicated, since the called procedure needs to be given access to the stack frames of its enclosing procedures.  However, that circumstance is mainly of historical and theoretical interest nowadays.

Typical machine architectures have two distinct subroutine call instructions, here called call and calli (the names used in .NET's intermediate language), which respectively call a method by its symbolic name, and indirectly call the method whose entry-point address has been loaded into some chosen register.

The calling sequence for a regular call goes something like this -
  • load up the arguments to the call
  • call the named procedure
or in the indirect, procedure variable case -
  • load up the arguments to the call
  • load the procedure variable value into the chosen register
  • calli through the chosen register
Of course, for a statically typed language, the number and types of the arguments must in each case match the formal parameters of the procedure that ends up being invoked by the call.

Procedure variables are thus strongly typed, and programming languages that provide procedure variables may allow such procedure types to be named in the usual way.  In Component Pascal, the syntax looks like this -
  TYPE NoArgToInt = PROCEDURE( ) : INTEGER;
which gives a name to a type of procedures that take no arguments but return an integer.  We may now declare variables of this type -
  VAR p1, p2 : NoArgToInt;
which declares two variables of the named procedure type.  We could just as easily have left the procedure type unnamed, and declared the variables as being of some anonymous procedure type -
  VAR p1, p2 : PROCEDURE( ) : INTEGER;
with the same effect.

When the compiler emits instructions to construct a new procedure value it will check that any named procedure being bound to a new value has the correct argument number and types, and a conforming return type.

Now, a little thought will convince the reader that although it may be convenient to have a name for a procedure type, two procedure types should be compatible provided that they have the same call signature.  Thus such languages have type compatibility rules for procedure values that implement structural compatibility, that is: "two procedure types are the same type if they have the same argument number and types, and the same return type (or lack of a return type)".

This observation, regarding structural compatibility, does not even arise in ANSI C.  Function pointer types are denoted by an "abstract declarator" that declares only the parameter types and the return type.  The abstract declarator for our NoArgToInt type would be -
int (*)( )
that is "pointer to function taking no arguments and returning int.

Implementing Procedure Types on .NET

Implementing procedure types on .NET is fairly straightforward.  The built-in delegate type of the framework is almost exactly what we need. Consider the following C# program fragment -
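(What follows is a reconstruction of such a fragment rather than the original listing; the NoArgToInt name comes from the background section above.)

using System;

class DelegateDemo {
    // A delegate type with the "no args, returns int" signature.
    delegate int NoArgToInt();

    // A static method with a compatible signature.
    static int Foo() { return 42; }

    static void Main() {
        NoArgToInt p = new NoArgToInt(Foo);  // bind Foo to a new delegate value
        Console.WriteLine(p());              // invoke Foo via the delegate value
    }
}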


A new delegate type is defined, with the "no args, returns int" signature, and a static method Foo is defined with a compatible signature. A new instance of the delegate type is created, taking Foo as its bound method value, and the method is finally invoked via the delegate value.

There are a couple of wrinkles about .NET delegates that I am ignoring for the moment.  Delegates may be bound either to a static method, as in this example, or to an instance method.  In the case of an instance method the "this" reference is supplied at delegate creation time.  A delegate may thus be thought of as encapsulating two data:  a pointer to the method entry point, and a copy of the receiver to which the method is to be applied.

Not exactly what we wanted

The only real defect of the .NET delegate, as a feature to implement procedure values, is the lack of structural compatibility.  Suppose an application depends on two existing libraries, both of which have API calls with parameters of delegate types, and use different names for (say) the NoArgToInt signature. Now the names of the delegate types are baked into the APIs, and although our method Foo is compatible with each, the delegate values are not assignment compatible.  This is extremely annoying, since it is clear that the same semantic analysis inside the compiler that decides that Foo is compatible with each of the two delegate types can also decide that values of the two types are compatible with each other.  (There was a proposal to move to structural compatibility for delegate types in V2 of the framework, but it didn't make it into the release.)

The .NET version of GPCP essentially produces the same machine code for procedure types as a C# compiler would for the equivalent delegate-based program.  The semantics are not exactly as described in the language report, because of the lack of structural compatibility of values.

Implementing Procedure Types on the (pre-V7) JVM

On the JVM the starting point is even further back, as the JVM has nothing equivalent to a built-in delegate type.  Since the platform does not give us the primitive construct that we need, we must build it ourselves.  Here is the Java program that we need to construct to get semantics equivalent to the C# example above.
Each procedure type corresponds to an abstract base class, defining a single abstract method Invoke, with the signature that we wish to encapsulate.  Every value of this type has a separate derived class that implements the abstract method by calling the target method, in this case Foo.  This is in essence the translation formula that GPCP-JVM uses to implement procedure types and values.

It may seem that if a program has many, many different procedure value instances, the namespace is going to become cluttered with lots of "single use" sub-classes, and that an anonymous class would be nicer.  This is true of course, but the namespace is still cluttered; it is just that the clutter is with names that the compiler chose for you.  And, yes, the names will be names that would be illegal in your Java program, so that they will not clash with anything that the programmer might have named herself.

It would also be possible to define an interface base type, rather than an abstract class, to hold procedure values, with similar results.

Notice that the Java mechanism has the same problem with compatibility of types as the .NET framework. Two variables are assignment compatible if they are derived from the same abstract base type.

Preview of part II.

In part two of this series, I shall look at alternative mechanisms for implementing procedure types.  In particular there are interface base types (rather than an abstract base class), there is the use of runtime reflection, and the new MethodHandle types that appeared in Java SE version 7.  I also will report a performance shoot-out, so that users may make informed choices between the various mechanisms.


Thursday, March 22, 2012

Using Reference Semantic Value Types with GPPG

All the examples that ship with GPPG, and the parsers in the distribution, use .NET value types for the semantic action type.  For very simple situations, the use of the %union construct is convenient.  In particular, if the types of the non-terminal symbols have been declared in the *.y file, then the parser will automatically select the appropriate field from the struct that implements the "union" type.

The name of the semantic action type may be declared using the construct -
%YYSTYPE typename
where typename is the name of the type.  It is most commonly an abstract class type, with the actual values being concrete subtypes of this class.

The use of such a family of class types is almost always the correct approach when the purpose of the parser is to construct an abstract syntax tree (AST).  The following is yet another version of the calculator example, two versions of which ship with the GPPG distribution.  In this version, the parser builds an AST for each of the expressions.  These tree-representations of the expressions are evaluated on demand, much in the fashion of C# lambda expressions.

The first thing that has to be said about using a reference YYSTYPE is that the value stack in the parser will, by default, contain a stack of null values. You will be swiftly reminded of this if you try to access any fields of the top of stack value.  The idea is that the semantic actions of the parser will build the tree during the parse.  The semantic actions will therefore be calls to factory methods that create new tree-nodes and hook the sub-trees together.  Here is the grammar definition. It is derived from the grammars of the calculator programs that ship with GPPG.
The structure of the RealTree.Node type is hinted at in the names of the semantic action methods.  Node is an abstract class, and there are exactly three derived classes - Binary, Unary and Leaf.  Each object of Node type contains a tag, which is a value from the NodeTag enumeration.  The purpose of these tags is to indicate whether (say) a Binary node represents an addition, subtraction, multiplication and so on.

All of the rest of the code is in another file, named "RealTreeHelper.cs".  This file contains the definitions of the Node classes, and the user-written part of the Parser class, including a simple handwritten scanner.  In particular, the parser class contains the static factory methods that invoke the node constructors.

The goal symbol of the grammar is "list".  This is simply a list of statements, and has an error recovery action so that all errors will return to a state that is expecting the start of another statement.

At runtime the parser maintains a list of 26 variables, named 'a' to 'z', which may be referred to by their case-insensitive name. All of these registers may be cleared, or their values printed to the console.  The interesting productions are those that assign expressions to variables, or evaluate an expression to a floating point double value and print it to the console.  The scanner creates a Leaf node value whenever it scans a literal value, or a variable name.  These leaf nodes will have a tag value that is a NodeTag.literal or NodeTag.name respectively.

The expression productions are fairly standard, and create an abstract syntax tree for the recognized expression.  An interesting case is the second production for statement.  In this case the expression is evaluated immediately, so that the semantic value that is produced is a literal value Leaf node, rather than an expression tree value.

The complete program example consists of the grammar (*.y) file, the RealTreeParser.cs file that GPPG produces from this, and the RealTreeHelper.cs file that contains all the handwritten code. All three are zipped together, and available for download from the CodePlex GPPG page http://gppg.codeplex.com/releases/view/80821. The whole needs to be compiled with the ShiftReduceParserCode.cs library.

Here is a snippet from the helper file --
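(The snippet below is a sketch consistent with the description that follows, rather than the shipped RealTreeHelper.cs; only Eval and Unparse are the real names.)

using System;

// The tag says what a node represents.
public enum NodeTag { literal, name, plus, minus, mul, div, negate }

public abstract class Node {
    public NodeTag tag;
    public abstract double Eval(Node[] vars);  // vars: the 26 register trees
    public abstract string Unparse();
}

public class Leaf : Node {
    public double value;   // when tag == NodeTag.literal
    public int index;      // register index 0..25 when tag == NodeTag.name
    bool active;           // circularity guard (the real code marks every node)

    public override double Eval(Node[] vars) {
        if (tag == NodeTag.literal) return value;
        if (active) throw new Exception("circular definition");
        active = true;
        try { return vars[index].Eval(vars); } finally { active = false; }
    }
    public override string Unparse() {
        return tag == NodeTag.literal ?
            value.ToString() : ((char)('a' + index)).ToString();
    }
}

public class Unary : Node {  // negation is the only unary operation here
    public Node child;
    public override double Eval(Node[] vars) { return -child.Eval(vars); }
    public override string Unparse() { return "(-" + child.Unparse() + ")"; }
}

public class Binary : Node {
    public Node lhs, rhs;
    public override double Eval(Node[] vars) {
        double l = lhs.Eval(vars), r = rhs.Eval(vars);
        switch (tag) {
            case NodeTag.plus:  return l + r;
            case NodeTag.minus: return l - r;
            case NodeTag.mul:   return l * r;
            default:            return l / r;
        }
    }
    public override string Unparse() {
        string op = tag == NodeTag.plus ? "+" :
                    tag == NodeTag.minus ? "-" :
                    tag == NodeTag.mul ? "*" : "/";
        return "(" + lhs.Unparse() + op + rhs.Unparse() + ")";
    }
}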

This base class defines two abstract methods: Eval and Unparse.  Eval recursively traverses the given tree, returning a floating point double. Unparse recursively traverses the given tree constructing a string that gives a fully parenthesized representation of the expression.  The difference between the traversals is just this: during Unparse, when the traversal reaches a name leaf it prints the name of the variable. When Eval reaches a name leaf the traversal evaluates the tree rooted at the variable.

Suppose we have entered the statements -
a = 42; b = 24; c = a + b;
then the  print command will write 
'c' = (a+b)
while the eval c command will write 
result = 66.

It is easy to construct an evaluation that fails by trying to evaluate a circular definition, so it is necessary to guard against this.  During evaluation each node is marked when traversal reaches the node, and unmarked when the recursion leaves. This detects circular definitions, and prevents stack overflows during evaluation. Here is a simple example -
a = 42; a = a + 1; eval a
fails because of the circularity of the second statement. On the other hand -
a = 42; a = eval (a+1); eval a
succeeds, and returns 43, as expected. In this case the second command evaluates (a+1) and deposits the new value 43 as a literal in variable a.  In the circular version the second command overwrites the initial value of a with the expression tree (a+1).


In any case, the purpose of the example is to demonstrate the use of a family of classes as the semantic value type for parsers that build abstract syntax trees.  Enjoy!




Sunday, December 25, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-2)

[This is a revised edition of the part-2 post.  An error in the LookaheadHelper code has been corrected, and the example code simplified.  The revised code is also available on CodePlex.]

In the last post a simple method for looking forward in the token stream of a GPPG parser was described.  The key concept was to save the scanner input buffer position, read tokens to resolve the syntactic ambiguity, reposition the input position and then continue. 

There are several limitations to this simple-minded approach.  Firstly, the example relied on being able to restore the scanner state precisely when the input position is backed up.  In cases where the scanner has several different start states this means that extra information needs to be saved and restored.  Secondly, in the case that long lookaheads can occur, all of the work done by the scanner is lost when the input is backed up.  If the scanner creates objects, including instances of the yylloc and yylval types, then this work will be repeated.

In this blog post a different technique is shown.  In this version the scanner defines and produces objects that encapsulate the result of any lookahead-requested call of yylex.  These objects contain three data: the integer-valued token result, the semantic value yylval, and the location value yylloc.  These objects are delivered to the decision code that is doing the lookahead, but are also added to an input buffer queue.  When backup is performed, the queue is simply handed back to the scanner.  Subsequent calls of yylex dequeue the saved values, until the queue is exhausted and normal scanning resumes.

In order to give a more realistic setting, a modification of the grammar fragment from part-1 ensures that we really do need an unbounded lookahead.  Instead of expressions (and the left hand side of assignments) starting with a bare identifier, they begin with a designator.  The designator may have member selection '.' and array indexing.  Hence we must scan forward until we hit either a colon or a "dotdot", implying we have a case selector, or an "assign" symbol, implying that we are still in the same case statement sequence.  Here is the business part of the new grammar.


This grammar fragment, as in part-1, shows the location of the semantic action that is used to trigger the lookahead ad hoc code.  It also shows the location of the dummy symbol that is injected to terminate the current case_element phrase.

Then there is the infrastructure that is used to buffer and unbuffer the lexical data.  Here is the declaration of the class ScanObj.  This can be included inline in the *.lex file, or in a separate LookaheadHelper file.  The following is from a separate file.
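(Something like the following minimal sketch will do; here LexValue and LexSpan stand in for whatever semantic value and location types the grammar actually uses.)

internal class ScanObj {
    internal int token;        // the integer-valued token result of yylex
    internal LexValue yylval;  // the saved semantic value
    internal LexSpan yylloc;   // the saved location value
    internal ScanObj(int token, LexValue yylval, LexSpan yylloc) {
        this.token = token;
        this.yylval = yylval;
        this.yylloc = yylloc;
    }
}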


And here is the scanner prolog that handles the queue within the scanner's Scan method.
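(The following sketch shows the shape of that prolog, together with the helper methods used below. Only the names ScanObj, SaveScannerState and GetAndEnqueue come from the real code; the field names and the RestoreScannerState helper are my reconstruction, and the actual LookaheadHelper file on CodePlex splices the queues in a slightly different place.)

// Scanner fields (add %using System.Collections.Generic; to the *.lex file):
Queue<ScanObj> tokenQueue = null; // tokens to replay before normal scanning
Queue<ScanObj> saveQueue = null;  // tokens read forward by current lookahead

// The prolog itself, at the start of the Scan method:
if (tokenQueue != null && tokenQueue.Count > 0) {
    ScanObj saved = tokenQueue.Dequeue();
    yylval = saved.yylval;  // restore the saved semantic value
    yylloc = saved.yylloc;  // restore the saved location value
    return saved.token;     // hand back the buffered token
}
// ... otherwise fall through to normal pattern matching.

// Called by the parser before an ad hoc lookahead begins.
public void SaveScannerState(int nextTokenToSave) {
    saveQueue = new Queue<ScanObj>();
    if (nextTokenToSave != 0)  // a pending NextToken must be refetched later
        saveQueue.Enqueue(new ScanObj(nextTokenToSave, yylval, yylloc));
}

// Used by the lookahead scan instead of calling yylex directly, so that
// every token read forward is also buffered for replay.
public int GetAndEnqueue() {
    int tok = Scan();  // may itself replay leftovers of an earlier lookahead
    saveQueue.Enqueue(new ScanObj(tok, yylval, yylloc));
    return tok;
}

// Called when the lookahead is over: hand the queue back to the scanner.
public void RestoreScannerState() {
    if (tokenQueue != null)       // leftover tokens of an earlier lookahead
        while (tokenQueue.Count > 0)             // go on the end of the
            saveQueue.Enqueue(tokenQueue.Dequeue());  // newly built queue
    tokenQueue = saveQueue;
    saveQueue = null;
}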


The prolog code is inserted in the *.lex file right after the first "%%" marker, and before the first lexical production.

We can now draw attention to some of the features of the previous code snippet.  Firstly, this code is a very general example.  The enqueuing and dequeuing code allows for the possibility that there may be several ad hoc lookahead locations in the same parser, and that one ad hoc lookahead may consume the enqueued tokens of a previous lookahead.  Note particularly the code that initializes the new token buffer. If the existing code buffer queue is not empty, the leftover tokens in the queue must be added to the end of the newly constructed queue.  There are other structures that achieve the same effect.

Note also that the call to SaveScannerState takes an argument called nextTokenToSave.  In part-1 we made use of the fact that for this particular example the NextToken value would be empty when the ad hoc lookahead was called.  This allowed us to overwrite NextToken to inject the dummy token without any further care.  In general, this is not the case, so the code given above saves the argument if it is non-zero, so that it will be refetched either immediately, or immediately after the dummy token has been consumed.

This infrastructure is shared for all such ad hoc lookaheads, requiring only customization for the various instantiations of the yylloc and yylval types, and the user-chosen scanner namespace and class names.

The bodies of the semantic actions that are called by the parser all conform to the same pattern:
For our running example the method would be DoCaseLookahead, and the body of the lookahead would be the method CaseLookaheadScan.
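(In outline, with placeholder token names; DoCaseLookahead, CaseLookaheadScan, SaveScannerState and GetAndEnqueue are the real names, the rest is reconstruction.)

// In the user part of the Parser class.
private void DoCaseLookahead() {
    Scanner scanner = (Scanner)this.Scanner;   // the concrete scanner
    scanner.SaveScannerState(NextToken);       // start buffering tokens
    bool newElement = CaseLookaheadScan(scanner);
    scanner.RestoreScannerState();             // hand the queue back
    if (newElement)
        NextToken = (int)Token.caseSep;        // inject the dummy symbol
}

private bool CaseLookaheadScan(Scanner scanner) {
    for (;;) {                                 // scan forward until decided
        int tok = scanner.GetAndEnqueue();     // never call yylex directly!
        switch (tok) {
            case (int)Token.colon:
            case (int)Token.dotdot: return true;   // a case selector follows
            case (int)Token.assign: return false;  // still in this element
            case (int)Token.EOF:    return false;
            default: break;                        // keep reading
        }
    }
}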


Note carefully that the lookahead scan must use GetAndEnqueue to fetch tokens, rather than calling yylex directly.  This is to ensure that all the tokens get safely enqueued in the push-back buffer.

Summary

This may look like a lot more code than the example in part-1.  However most of the code is infrastructure, which is shared.  Furthermore this method may be used even with overlapping lookaheads, and cases where the scanner changes start state during a lookahead sequence.  For almost all scenarios this will be more efficient, since the semantic and location value objects created during a lookahead will not be wasted.

Is there anything this second version cannot do?  Well, there is one limitation:  Those grammars where the parser and scanner are tightly linked and the parser changes the scanner start state by calling the scanner's BEGIN method cannot use this method.  This is a very good reason to avoid such program structures.

The only really tricky thing about this method is the responsibility for figuring out how to write a correct "ExampleLookaheadScan" method.  In our present case the occurrence of particular tokens in the lookahead is definitive.

In the final part of this sequence I shall take a really difficult example that arises in the parsing of C#.  A statement sequence may have a local-variable-declaration or a statement-expression, among other possibilities.  Both of these may start with an identifier and it requires an unbounded lookahead to disambiguate the two.  In this case there are no disjoint sets of tokens that separate the two cases and stronger measures are required. 

The complete code of the example in the current blog entry will go up on the gppg page on CodePlex in the next few days.

Tuesday, December 13, 2011

Doing ad hoc Lookahead in GPPG Parsers (part-1)

Gardens Point Parser Generator (GPPG) and other similar parser generators are limited to a one-symbol lookahead.  However, many programming languages have a few places in their grammars that require some extra help.  In some cases it is possible to get around the issue with a little transformation of the grammar, but in other cases it is necessary to resort to stronger measures.  Of course if you own the language that you are parsing, the simplest response is to modify the language to make the problem go away, and no-one gets hurt. If somebody else owns the language then the toolkit gets a bit more exercise.

This blog deals with one of those stronger methods, where it is not possible to state in advance the number of symbols that have to be read to decide between two continuations.

Our running example is a grammar for a Pascal-like language which has a problem with the CASE statement, roughly equivalent to switch in C-like languages. Here is the base grammar with most of the irrelevant bits missing ...
 So, here is the problem - we are parsing the following string and have just read the semicolon, with the identifier "z" as the lookahead symbol ...
           CASE foo OF x,y : a := b; z ...
How do we know whether the 'z' belongs to the current case_element (maybe the start of another assignment statement) or is the start of a new case_label_list?  Well, we can't tell with one symbol of lookahead.  If we read forward and find a comma or a colon it's a case label; if we find an assignment marker ":=" then it's the start of an assignment. For plausible generalizations of the assignment left-hand-side grammar the number of lookahead symbols is unbounded.

It is for this reason that most Pascal-like grammars have vertical bar '|' separator between case elements, and C-family languages use the keyword case for every label.  However the grammar, as given, is not ambiguous.  Read enough lookahead symbols and the problem is solved.

When a slightly filled-out version of this grammar is fed to gppg there is a shift-reduce conflict in one state.  Looking at the diagnostic output from using the /report /verbose options shows that for the state in question, when the lookahead is IDENT, the identifier will be shifted by default, although reduction  "case_label_list ':' statement_list -> case_element" is also valid.  This is the only conflict.

As an aside, note that things could be worse.  Pascal-family languages normally use the semicolon as a separator between statements, rather than as a statement terminator. The base grammar above uses the semicolon as a terminator.  If we simply use it between elements of a statement list then we would have an obvious problem with the string -
         CASE foo OF x,y : a:= b - z ...
Is the "-z" part of the assignment right-hand-side, or the first label of the next case element?

So now we come to the real question: how can we hack on the parser (and scanner) to resolve such problems with the one-symbol lookahead?

The idea is as follows: when the conflict location is reached jump to user-written code that scans forward in the input stream until the matter is decided. Point the parser in the right direction, reset the scanner buffer-position and resume parsing as normal.

The first step is to find a way of inserting the call to the user code in a semantic action.  For our example, whenever a statement list is being parsed in the context of a case statement we should do the ad hoc lookahead.  So we specialize statement_list as stat_list_for_case.


Now our SaveStateAndDoLookahead method is called every time a new statement is added to the statement list of a case statement.

And what does the method do?  Well, because of some issues with the visibility of various scanner and parser variables we split the code between the scanner and parser.  The scanner code saves scanner state, calls the parser code and restores the scanner state on return.
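(In outline; the field names here are placeholders for whatever this particular scanner's state actually is.)

public void SaveStateAndDoLookahead(Parser parser) {
    int savedChr = chr;          // current lookahead character
    int savedLin = lin;          // current line count
    int savedCol = col;          // current column count
    int savedPos = buffer.Pos;   // current input buffer position

    parser.DoAdHocLookahead();   // the parser does the heavy lifting

    buffer.Pos = savedPos;       // back up the input position ...
    chr = savedChr;              // ... and restore the rest of the state
    lin = savedLin;
    col = savedCol;
}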
This code calls the Parser method DoAdHocLookahead. This method does the heavy lifting. [However, see the footnote for a bit more on what "save scanner state" implies.]

DoAdHocLookahead will repeatedly call yylex until a decision may be made.  First of all however note that the conflict only arises if the next symbol fetched is an identifier.  For our simple case we exit the loop as soon as we find a symbol that can belong to an assignment but not to a case label list, or to a case label list but not to an assignment.  Simple.  But how do we "point the parser in the right direction" before parsing resumes?

The simplest way of forcing the parser to do the right thing is to overwrite the parser's NextToken variable. We use a dummy symbol that otherwise does not occur in the grammar, and rewrite the grammar to rely on that symbol to separate the case_element items.
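(Again in outline, with placeholder token names.)

protected void DoAdHocLookahead() {
    for (;;) {                      // the tokens read here are discarded;
        switch (Scanner.yylex()) {  // the scanner is backed up afterwards
            case (int)Token.comma:
            case (int)Token.colon:  // it is a case label list, so inject
                NextToken = (int)Token.caseSep;  // the dummy separator
                return;
            case (int)Token.assign: // it is an assignment; parse as normal
            case (int)Token.EOF:
                return;
            default:
                break;              // keep reading forward
        }
    }
}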
So that is about all there is to it.  Overwriting the NextToken value in the current version of GPPG requires diving into the code of ShiftReduceParser to make the NextToken variable accessible. The next refresh of the tool will give the variable protected accessibility to facilitate this technique.

The complete code for this example is available at http://gppg.codeplex.com

Part two will deal with a much harder example: one of the classic syntactic ambiguities of C#.

Footnote: What "save scanner state" implies, and various disclaimers.
In the example here the saved state in the scanner is the current "lookahead" character, the current line and column count, and the buffer position. This works for this particular scanner, but others may need more.  In particular, if a scanner has a number of start states, the start state might need to be saved and restored also.  Using the YY_START property is the easiest way of doing this, since it automatically resets the start ordinal and the start state. Of course, for a scanner that has a stack of start ordinals the whole stack needs to be saved and restored.

Furthermore, if the scanner detects and reports lexical errors, errors that are backed-up over may be reported twice.  Thus if you wish to be really picky, you might want to save the high tide on the error buffer and restore that as well.

Alternative strategies will be discussed in the second part.



Wednesday, November 16, 2011

Another Trick for Debugging GPCP Programs

Programs compiled using the CLR version of Gardens Point Component Pascal can be debugged using the Visual Studio debugger, as described in an earlier blog.

This short blog deals with a little trick to exploit one of the "features" of Visual Studio (VS).

When a program being debugged stops at a breakpoint, the local variable window of VS lists the variables that are in scope in the selected frame of the call chain.  Other fields of the window show the declared type of the variable, and its concrete (actual) type. There is also a field that shows the value of the variable.  This is handy for numeric scalars, for example.

However, for other types Visual Studio fills in the value field by calling the virtual ToString() method on the object. The "miranda" option, for types that do not have a ToString of their own, is to use the method inherited from Object, which returns the name of the concrete type. This is what happens by default for Component Pascal objects.

So, the question is, can we make gpcp types display useful value information in the VS value field? The answer is, of course, yes.  But there is a trick to it.  First a couple of things that don't quite work.

Suppose that we have a pointer to record type, with no declared super type.

TYPE Foo = POINTER TO RECORD ... END;

The super type will be ANYREC, which is implemented as System.Object in the CLR version of gpcp.  However, the compiler will not import the inherited System.Object methods so attempts to override the ToString method will fail.

PROCEDURE (f : Foo)ToString() : Sys.String;

will get a "this is not at override, you must use NEW" error.  And if you try

PROCEDURE (f : Foo)ToString() : Sys.String,NEW;

the method actually will be "new", and will not override the virtual Object method. The trick is to make the compiler read in the inherited methods.  The following works:

IMPORT Sys := mscorlib_System, ... ;
...
TYPE Foo = POINTER TO RECORD (Sys.Object) ... END;
...
PROCEDURE (f : Foo)ToString() : Sys.String;
BEGIN ... END ToString;

Using RTS.NativeObject instead of Sys.Object and RTS.NativeString instead of Sys.String works also.

There is another place where the implicit use of ToString() occurs in the .NET framework.  The format methods like String.Format and WriteLine take a format string and an array of Object.  Each object in the array is converted with ToString and substituted into the numbered position in the format string.

Thus to access the format methods of the .NET libraries it is necessary to insert the values into an array of system objects.  The rules are as follows:
  • For record and pointer to record types, declare a ToString method, as described.
  • For scalar objects, like integers or reals, it is necessary to resort to the manual method of calling Sys.Convert.ToString on the scalar value.  C#, which has implicit boxing, hides this away, but Component Pascal's BOX extension does not apply to scalars.
Here is an example of a ToString method for a Coord type which has x and y values and an optional name s for each point.  We assume that the name is an open array of char, RTS.CharOpen.

PROCEDURE (pt : Coord)ToString() : Sys.String;
  CONST format = "{0}{{{1},{2}}}";
  VAR tag : Sys.String;
BEGIN
  IF pt.s # NIL THEN tag := MKSTR(pt.s^) ELSE tag := "anon" END;
  RETURN Sys.String.Format(format,
    tag, Sys.Convert.ToString(pt.x), Sys.Convert.ToString(pt.y));
END ToString;

Note that in this example we are using the fact that there is a Format method that takes three single objects following the string. We could have used the (string, object[]) version instead, and would have had to use the array version if there were more than three objects to fit in the format.

The Coord objects will now display helpfully in VS, and can in turn be directly used in other format strings.

Wednesday, April 27, 2011

Why a Regular Expression Engine is not a Scanner Generator

(Updated 17 November 2011.  Regular expressions corrected to use backslash instead of slash.)

I enjoy getting queries from the users of my software. Sure, it takes up time but very often the enquiry highlights a common misconception, sometimes a point that should be clarified in the documentation, and sometimes (hiss, boo) reports a real bug.

A couple of weeks ago I received a query about GPLEX that brought a couple of important issues into focus. The user was creating a scanner, and was having trouble with the generated scanner not accepting certain apparently legal literal string tokens. The grammar for literal strings had single quote characters ' as delimiters, and escaped single quote characters with the "backslash" character. Naturally enough literal backslashes were also backslash escaped. The user had tried the "obvious" regular expression (RE)
 '(\\.|[^'])*'
which, in words, reads "single quote character followed by zero or more repetitions of EITHER a backslash followed by any character except newline OR any character except single quote, with the whole string terminated by another single quote character".
This did not work on a set of simple test strings, so the user had tried a sequence of progressively more complicated RE without success.

So, the real query was:
  • Why does the obvious RE not work?
  • The user had tried the obvious RE against the .NET RE engine and all was OK.
  • Rumour has it that GPLEX uses its own RE-engine; could this be the problem?
Well, GPLEX does have a clean-room RE-engine and hell I have been known to have bugs in my software, so I figured I better have close look. I fired up GPLEX and, sure enough, grammars using the obvious RE did not work correctly.

Step 2 was to fire up my special "tracing" version of GPLEX which creates a scanner that chatters on about what character it has just read, what the next state will be, and when it hits an accept state. Following through the problematic example string
'\\'
immediately showed what the problem was. The scanner was in the accept state when it hit the second single quote, but continued reading characters looking for an even longer valid string. Suddenly the problem is obvious: suppose we name the two alternates of the closure as (A) "any backslash escaped character except newline" and (B) "any character except single quote". Now consider the first backslash character in the example string. It qualifies either as an A or a B. If it is a B then the next two characters form an escaped single quote (thus NOT terminating the string); if it is the start of an A, an escaped backslash, then the second single quote terminates the string! What the scanner does is to mark the position of the second single quote and continue to munch characters. If the input ends before another single quote is found, the scanner backs up to the marked position, reports a valid string, and continues scanning.

Just for the record the correct regular expression is:
'(\\.|[^'\\])*'
although for most uses it would be best to exclude newlines inside of strings at the RE level to give better diagnostics. In any case, this simple RE worked for the enquirer.
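(A small test of my own makes the difference concrete: the .NET engine is happy with either pattern on this input, while a maximal-munch scanner is not.)

using System;
using System.Text.RegularExpressions;

class ObviousVsCorrected {
    static void Main() {
        // The input stream: a literal containing an escaped backslash,
        // followed by more program text.
        string input = @"'\\' z := 'abc';";

        // The "obvious" pattern: a Perl-style engine tries the alternates
        // in order, so \\. consumes the escaped backslash and the match
        // ends at the second quote. This is why the .NET experiment worked.
        Console.WriteLine(Regex.Match(input, @"'(\\.|[^'])*'").Value);   // '\\'

        // A maximal-munch scanner, by contrast, looks for the LONGEST
        // accepting prefix. Reading the first backslash as [^'] and \' as
        // an escaped quote, the same RE also accepts the longer prefix
        // '\\' z := ' and that is the (wrong) token the scanner reports.

        // The corrected pattern forbids a bare backslash inside the
        // literal, so the RE engine and the generated scanner agree:
        Console.WriteLine(Regex.Match(input, @"'(\\.|[^'\\])*'").Value); // '\\'
    }
}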

So what do we learn from this whole business? Well, I guess that it is a reminder that regular expressions are still tricky, especially when there are overlapping alternates.  However, this only answers the question in bullet point one. What about point two and point three?

Well, the clue is in the title of this blog. A scanner is not just a regular expression engine. A scanner tries to pick out tokens from a continuing stream of input. It tries to find the longest possible match and backs up when a match fails. By contrast an RE engine tries to match a single pattern, and generally does lots of complicated other things like find sub-patterns within a matching overall pattern. Neither one really does the work of the other, except in almost trivially simple cases.

In this case, I believe that what happened was that what I have been calling the "obvious RE" was extracted from the scanner specification and tried against the .NET RE engine. It worked.

The funny thing is that had the obvious RE been extracted from the scanner spec and tried as a single pattern in GPLEX it would have worked perfectly when tried out one sample string at a time. Each input containing a literal string like the one that put its finger on the error would have munched extra characters until it met the end of the input, backed up and reported a correct string just as the .NET RE engine did.

I am glad that my enquirer did not do this, as it would have caused even more confusion.

Cheers
John