Update

robin on 2002-01-22T00:51:04

I haven't written in this journal for quite a while, to put it mildly. Today seems a good day to bring it up to date, in a fragmentary way. Looking back at the last entry I made, it feels like a long time ago; but I haven't done a great deal of Perl since then. In November I gave a talk to dorkbotlondon about the elephants. A triumph ;-)

My main obsessions recently have been formal language theory, logic, and algebra. I'm excited today, because I just found out that I've been offered a funded place on the logic course I applied for, starting in September (the day before my 26th birthday). By coincidence, I'm going to Manchester tomorrow to visit the department; I'll feel a lot more relaxed about it, knowing that I've got a place! I've never been to Manchester before; apparently it rains a lot. (For a while I was planning to apply to MIT, and I even went to far as to take the GRE, but I gradually realised that not only would it be personally difficult to be so far from home for so long, but also that there's very little activity at MIT in the purely theoretical areas which interest me most.)

I also became a Perl monk since I last updated this. The site is organised like a game -- you gain points depending on how many people vote for your contributions. Not only is it a great resource for beginners, but there's a fair bit of advanced Perl going on there too (not to mention the occasional gems)

Connecting Perl and theory, I've become fascinated by regular expressions which allow back referencing (like /^(.*)\1$/). They've been largely neglected by theorists, and yet they're so rich and subtle. The known theoretical facts (as far as I can tell) are:

  • The problem of deciding whether a regex [1] matches a string is NP-complete.
  • If you have a regex $r, there isn't necessarily a regex which matches all and only the strings which $r doesn't match [2].
  • The deeper you allow capturing expressions to be nested, the more languages can be expressed.
  • Not all context-free languages are expressible using regexes, and not all regex-expressible languages are context-free.
(I suspect this last fact partly explains their neglect: they don't fit cleanly into the Chomsky hierarchy of languages.) So there's a host of interesting questions still to be answered. I've made a small amount of progress with one or two of them I think, but no definitive answers yet . Most recently I've been toying with the idea that if you restrict the capturing to be at the top level (i.e. not inside a (?:$foo)* or another capture ($bar)) then the resulting formalism is weakly equivalent to word equations (i.e. equations over a finitely-generated free monoid) with regular constraints. If that's true, it connects the theory to something that mathematicians are interested in! (The puzzle I recently posted to FWP is related to this conjecture.) On the down side, even if it's true it may well be rather hard to prove.

Another wild thought: what if you specify an order-sorted algebra with words (i.e. strings) as an explicit sub-sort of regular expressions? Word equations and regular expression matching could both be described in such an algebra, and in a unified way. Is that any use for anything? I don't know...

footnotes:

[1] I'm just talking about traditional regular expressions with the single addition of back referencing. Look-ahead operators, embedded code, independent subexpressions etc. aren't allowed, for my current purposes.

[2] Note that if you allow the negative look-ahead operator, this obviously stops being true. However much weirder things start to happen then, and I don't want to go there for the moment.