I'm working on a project of a certain vintage that uses upwards of five programming languages to get stuff done. Annoying, but nowhere near uncommon. (There's a story about how JScheme is included in the JDK sources, because the code to generate CORBA classes for Java is written in Scheme...)
Luckily for me, I had a feature to add that traces through most of those languages all at once: Haskell -> Tcl -> XSLT -> Tcl. (The Perly bits form the backend system, not the frontend runtime components.) Thankfully, it was a simple fix: add wordbreak barriers around a regex being output from a Haskell program and sent upstream to heaven knows where.
Should be simple, right? Just replace "..." with "\\b(...)\\b", or some variant thereof. Easy peasy.
Except that the \b metacharacter is Perl syntax, and the regex isn't going to be processed by Perl. At one point, I thought that this regex was going to be processed by a component written in C, using the GNU Regex library. Turns out that Perl, GNU Regex and PCRE all agree that \b is a word boundary. (POSIX regexes don't appear to know what a word boundary is...)
Yet none of the standard regex magic was working. Tracing through the code, I discovered that Tcl's regex engine was the one being used (by way of XSLT; about as convenient as a direct flight from Sydney to New York by way of Mars).
Looking over Tcl's regex docs, it turns out that \b is a backspace character!
Because matching backspace characters is such a common operation within a regex, Tcl preserves the C-style escape for \b, and uses \y for word boundaries.
WHAT ON EARTH WERE THEY THINKING?
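To make the difference concrete, here is a minimal sketch in Python of the wrapping step with the boundary escape chosen per dialect. It is an illustration only: the real fix lived in a Haskell program, and the names add_word_barriers, WORD_BOUNDARY and the dialect keys are made up for this example. Python's re module happens to follow the Perl convention for \b, so it can only check the Perl-style result; the Tcl-style string is just printed.

```python
import re

# Word-boundary escape per regex dialect, as described in the post.
# The dictionary and its keys are hypothetical names for this sketch.
WORD_BOUNDARY = {
    "perl": r"\b",
    "pcre": r"\b",
    "gnu": r"\b",
    "tcl": r"\y",  # in Tcl, \b is a backspace; \y is the word boundary
}

def add_word_barriers(pattern, dialect):
    """Wrap pattern in a group with a word boundary on each side."""
    boundary = WORD_BOUNDARY[dialect]
    return f"{boundary}({pattern}){boundary}"

perl_style = add_word_barriers("cat|dog", "perl")
tcl_style = add_word_barriers("cat|dog", "tcl")
print(perl_style)  # \b(cat|dog)\b
print(tcl_style)   # \y(cat|dog)\y

# Only the Perl-style pattern can be exercised here, because Python's
# re module treats \b in a pattern as a word boundary.
print(re.findall(perl_style, "cat catalog dog hotdog"))  # ['cat', 'dog']
```

Note that the wrapping parentheses form a capturing group, so any group numbers in the original pattern shift by one.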
So the Tcl people weren't wrong, just different from Perl.
Re:Prior art
sigzero on 2006-01-12T20:16:05
Not wrong, just different. The problem is when we Perl folks try to hold up Perl as "the" regex standard and it is not. The MRE book explains that different language regex implementations are a wee bit different from each other.
Re:Prior art
ziggy on 2006-01-12T20:41:05
Different and annoying.
The problem is that Perl's regex syntax is adopted as the gold standard whenever another language/library needs to beef up its regex handling. There's a very large common subset shared between Perl, PCRE, GNU Regex, and probably some Java library. In general, this is a good thing, because it means that regexes generally become normalized, at least for the common cases. There should be one (common) way to find word boundaries, but all bets are off on variable capture and executing code in a replacement. (GNU Regex allows \< to match the start-of-word boundary and \> to match the end-of-word boundary. It's perfectly reasonable that such extensions exist, but changes to core features, like * meaning .* within a regex engine, are simply wrong.)
We get \b as a mnemonic because Perl is context sensitive. And context sensitivity is a good thing, even if it's somewhat baffling to remember that "\b" and qr"\b" aren't the same thing. I'd say that Larry made the right choice, because when you're printing something, \b as backspace makes sense (inherited from C), and when you're matching something, \b as a word boundary makes sense (it's quite unlikely that you want to match a backspace character).
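The same split can be seen outside Perl. As a rough analogy only (Python stands in for Perl here, and its raw-string literal is not the same mechanism as qr""), this sketch shows \b meaning backspace in an ordinary string and a word boundary once the regex engine gets to see the backslash:

```python
import re

text = "a cat\b sat"  # ordinary literal: \b here is one backspace character

# In an ordinary string literal, \b is the C-style backspace (0x08).
assert "\b" == "\x08" and len("\b") == 1

# In a raw string the backslash survives, so the regex engine sees \b
# and treats it as a word boundary.
assert len(r"\b") == 2
words = re.findall(r"\bcat\b", text)
print(words)  # ['cat']

# Handing the engine the backspace character itself matches it literally.
backspaces = re.findall("\b", text)
print(backspaces == ["\x08"])  # True
```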
I've dug into Tcl's internals before, and while they are quite well engineered, there are some problems with the overall language design. As a result, it has two significant shortcomings: \b (or any other metacharacter) can have one and only one translation, and there is no such thing as context sensitivity as we know it.
Even if that weren't true, there's also a case for backward compatibility. I'm pretty sure that \b wasn't in Perl1. Therefore, it had to be added at some point, and because a regex isn't a string (until recently, with the qr{} loophole), there was never a conflict of interpretation. However, because regexes in Tcl are strings, and strings treat \b as a backspace, introducing the Perl semantic would have broken existing Tcl programs. Thus, the only viable solution was to introduce something else to mean \b.
Since you were going to break some eggs, better to err on the side of backward compatibility with your own community instead of compatibility with the rest of the universe. A justifiable decision, especially when you realize both choices suck.
Therefore, because there is no C-like ASCII metacharacter \y, the string "\y" prints the character y (as you would expect) while "\y" matches a word boundary when used as a regex. All of which means that while Tcl eschews specialized contexts, it still has them.