Wednesday, April 27, 2011

Why a Regular Expression Engine is not a Scanner Generator

(Updated 17 November 2011.  Regular expressions corrected to use backslash instead of slash.)

I enjoy getting queries from the users of my software. Sure, it takes up time but very often the enquiry highlights a common misconception, sometimes a point that should be clarified in the documentation, and sometimes (hiss, boo) reports a real bug.

A couple of weeks ago I received a query about GPLEX that brought a couple of important issues into focus. The user was creating a scanner, and was having trouble with the generated grammar not accepting certain apparently legal literal string tokens. The grammar for literal strings had single quote characters ' as delimters, and escaped single quote characters with the "backslash" character. Naturally enough literal backslashes were also backslash escaped. The user had tried the "obvious" regular expression (RE)
which, in words, reads "single quote character followed by zero or more repetitions of EITHER a backslash followed by any character except newline OR any character except single quote, with the whole string terminated by another single quote character".
This did not work on a set of simple test strings, so the user had tried a sequence of progressively more complicated RE without success.

So, the real query was:
  • Why does not the obvious RE work?
  • User had tried the obvious RE against the .NET RE engine and all was ok.
  • Rumour has it that GPLEX uses its own RE-engine, could this be the problem?
Well, GPLEX does have a clean-room RE-engine and hell I have been known to have bugs in my software, so I figured I better have close look. I fired up GPLEX and, sure enough, grammars using the obvious RE did not work correctly.

Step 2 was to fire up my special "tracing" version of GPLEX which creates a scanner that chatters on about what character it has just read, what the next state will be, and when it hits an accept state. Following through the problematic example string
immediately showed what the problem was. The scanner was in the accept state when it hit the second single quote, but continued reading characters looking for an even longer valid string. Suddenly the problem is obvious: suppose we name the two alternates of the closure as (A) "any backslash escaped character except newline" and (B) "any character except single quote". Now consider the first backslash character in the example string. It qualifies either as an A or a B. If it is an A then the next two characters are an escaped single quote (thus NOT terminating the string) else if it is the start of an escaped backslash then the second single quote terminates the string! What the scanner does is to mark the position of the second single quote and continue to munch characters. If the input ends before another single quote is found, the scanner backs up to the marked position, report a valid string, and continues scanning.

Just for the record the correct regular expression is:
although for most uses it would be best to exclude newlines inside of strings at the RE level to give better diagnostics. In any case, this simple RE worked for the enquirer.

So what do we learn from this whole business? Well, I guess that it is a reminder that regular expressions are still tricky, especially when there are overlapping alternates.  However, this only answers the question in bullet point one. What about point two and point three?

Well, the clue is in the title of this blog. A scanner is not just a regular expression engine. A scanner tries to pick out tokens from a continuing stream of input. It tries to find the longest possible match and backs up when a match fails. By contrast an RE engine tries to match a single pattern, and generally does lots of complicated other things like find sub-patterns within a matching overall pattern. Neither one really does the work of the other, except in almost trivially simple cases.

In this case, I believe that what happened was that what I have been calling the "obvious RE" was extracted from the scanner specification and tried against the .NET RE engine. It worked.

The funny thing is that had the obvious RE been extracted from the scanner spec and tried as a single pattern in GPLEX it would have worked perfectly when tried out one sample string at a time. Each input containing a literal string like the one that put its finger on the error would have munched extra characters until it met the end of the input, backed up and reported a correct string just as the .NET RE engine did.

I am glad that my enquirer did not do this, as it would have caused even more confusion.