Standards and Regexes

ziggy on 2006-01-12T16:42:17

I'm working on a project of a certain vintage, and of a certain age, that uses upwards of five programming languages to get stuff done. Annoying, but nowhere near uncommon. (There's a story about how JScheme is included in the JDK sources, because the code to generate CORBA classes for Java is written in Scheme...)

Luckily for me, I had a feature to add that traces through most of those languages all at once: Haskell -> Tcl -> XSLT -> Tcl. (The Perly bits form the backend system, not the frontend runtime components.) Thankfully, it was a simple fix: add wordbreak barriers around a regex being output from a Haskell program and sent upstream to heaven knows where.

Should be simple, right? Just replace "..." with "\\b(...)\\b", or some variant thereof. Easy peasy.

Except that the \b metacharacter is Perl syntax, and the regex isn't going to be processed by Perl. At one point, I thought that this regex was going to be processed by a component written in C, using the GNU Regex library. Turns out that Perl, GNU Regex and PCRE all agree that \b is a word boundary. (POSIX regexes don't appear to know what a word boundary is...)
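For illustration, here is a minimal sketch of the "wrap it in word barriers" fix, written in Python only because its re module follows the Perl convention for \b. The helper name add_word_barriers and the sample pattern are hypothetical; the real code in question was Haskell emitting a regex for Tcl.

```python
import re

def add_word_barriers(pattern: str) -> str:
    # Hypothetical helper: wrap an existing pattern in Perl-style word boundaries.
    # The non-capturing group keeps alternations such as "cat|dog" intact.
    return r"\b(?:" + pattern + r")\b"

wrapped = add_word_barriers("cat|dog")
print(wrapped)                                        # \b(?:cat|dog)\b
print(re.findall(wrapped, "cat catalog dog hotdog"))  # ['cat', 'dog']
```

This only works when the engine that finally consumes the pattern reads \b the Perl way, which is exactly the assumption that fails later in this story.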

Yet none of the standard regex magic was working. Tracing through the code, I discovered that Tcl's regex engine was the one being used (by way of XSLT; about as convenient as a direct flight from Sydney to New York by way of Mars).

Looking over Tcl's regex docs, it turns out that \b is a backspace character!

Because matching backspace characters is such a common operation within a regex, Tcl preserves the C-style escape for \b, and uses \y for word boundaries.

WHAT ON EARTH WERE THEY THINKING?


From the MRE book

sigzero on 2006-01-12T19:03:26

And what about this "\b" business? This is a regex thing: in Perl regular expressions, "\b" normally matches a word boundary, but within a character class, it matches a backspace. A word boundary would make no sense as part of a class, so Perl is free to let it mean something else. The warnings in the first chapter about how a character class's "sublanguage" is different from the main regex language certainly apply to Perl (and every other regex flavor as well).
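A small sketch of that class-versus-main-language split, using Python's re module as a stand-in since it follows the same convention as Perl here. The sample text is made up for the example.

```python
import re

# In an ordinary (non-raw) Python string literal, "\b" is a backspace (0x08).
text = "a\bb word"

# Outside a character class, \b is a zero-width word-boundary assertion.
boundary_hits = re.findall(r"\bword\b", text)

# Inside a character class, [\b] matches a literal backspace character.
backspace_hits = re.findall(r"[\b]", text)

print(boundary_hits)    # ['word']
print(backspace_hits)   # ['\x08']
```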

So the Tcl people weren't wrong, just different from Perl.

Prior art

ChrisDolan on 2006-01-12T19:55:17

I believe \b as backspace significantly predates \b as word boundary. I know that \b means backspace in PDF. Most ASCII charts that show escapes use \b for backspace (0x08). \b is backspace in ANSI C.

So, Perl is the usurper, not Tcl. That's not to say that Perl is wrong, of course. :-)

Re:Prior art

sigzero on 2006-01-12T20:16:05

Not wrong, just different. The problem is when we Perl folks try to hold up Perl as "the" regex standard and it is not. The MRE book explains that different language regex implementations are a wee bit different from each other.

Re:Prior art

ziggy on 2006-01-12T20:41:05

Different and annoying.

The problem is that Perl's regex syntax is adopted as the gold standard whenever another language/library needs to beef up its regex handling. There's a very large common subset shared between Perl, PCRE, GNU Regex, and probably some Java library. In general, this is a good thing, because it means that regexes generally become normalized, at least for the common cases. There should be one (common) way to find word boundaries, but all bets are off on variable capture and executing code in a replacement. (GNU Regex allows \< to match the start-of-word boundary and \> to match the end-of-word boundary. It's perfectly reasonable that such extensions exist, but changing core features, like having * mean .* within a regex engine, is simply wrong.)

We get \b as a mnemonic because Perl is context sensitive. And context sensitivity is a good thing, even if it is somewhat baffling to remember that "\b" and qr"\b" aren't the same thing. I'd say that Larry made the right choice, because when you're printing something, \b as backspace makes sense (inherited from C), and when you're matching, \b as a word boundary makes sense (it's quite unlikely that you want to match a backspace character).
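Python has a rough analogue of the "\b" versus qr"\b" distinction in its plain versus raw string literals. This is a sketch of the idea in a different language, not Perl's actual mechanism, and the variable names are just for illustration.

```python
import re

as_string = "\b"    # string context: one character, backspace (0x08)
as_regex = r"\b"    # raw string: backslash plus 'b', left for the regex engine

print(len(as_string), hex(ord(as_string)))   # 1 0x8
print(len(as_regex))                         # 2

# The backspace character is a literal in a pattern; "foo bar" contains none.
string_match = re.search(as_string, "foo bar")
# The two-character sequence is a word boundary; it matches at position 0.
regex_match = re.search(as_regex, "foo bar")

print(string_match)         # None
print(regex_match.span())   # (0, 0)
```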

I've dug into Tcl's internals before, and while they are quite well engineered, there are some problems with the overall language design. As a result, it has two significant shortcomings: \b (or any other metacharacter) can have one and only one translation, and there is no such thing as context sensitivity as we know it.

Even if that weren't true, there's also a case for backward compatibility. I'm pretty sure that \b wasn't in Perl1. Therefore, it had to be added at some point, and because a regex isn't a string (until recently, with the qr{} loophole), there was never a conflict of interpretation. However, because regexes in Tcl are strings, and strings treat \b as backspace, introducing the Perl semantic would have broken existing Tcl programs. Thus, the only viable solution was to introduce something else to mean \b.

Since you were going to break some eggs, better to err on the side of backward compatibility with your own community instead of compatibility with the rest of the universe. A justifiable decision, especially when you realize both choices suck.

Therefore, because there is no C-like ASCII metacharacter \y, the string "\y" prints the character y (as you would expect) while "\y" matches a word boundary when used as a regex. All of which means that while Tcl eschews specialized contexts, it still has them.

perhaps necessary

jmm on 2006-01-12T20:15:01

I don't know Tcl except for snippets I've absorbed over the years, but doesn't Tcl use strings rather extensively? And does that not mean that the regex is entered as a string that later gets used as a regex (rather than being parsed as a regex when Tcl first analyses the characters of the script)? That would mean that usurping \b for word boundary within a regex would also usurp \b for backspace in all other strings too. So, at the very least, you'd have to write it as \\b. (Having the string be processed separately as a string first and then as a regex later has other problems: you also want \\b to mean a backslash followed by a b. So either Tcl does arrange to distinguish between string and regex at the initial parse, or you end up having to use \\\\x to match a backslash followed by x and \\x to match the meta meaning of x.)
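The double-processing problem can be sketched in Python, where a non-raw string literal is likewise parsed as a string first and handed to the regex engine second. This illustrates the two layers of escaping only; it says nothing about how Tcl itself parses scripts.

```python
import re

# Layer 1: the string parser turns "\\b" into the two characters \ and b.
# Layer 2: the regex engine reads those two characters as a word boundary.
word_boundary = "\\bcat\\b"

# To match a literal backslash followed by x, each layer eats one level
# of escaping, so the source needs four backslashes.
literal_backslash_x = "\\\\x"

print(len(literal_backslash_x))                      # 3
print(re.findall(word_boundary, "cat concat"))       # ['cat']
print(re.findall(literal_backslash_x, "a\\xb"))      # ['\\x']
```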