A better quoting-aware word-splitting regex

Aristotle on 2009-01-31T15:47:15

Anyone who has read Jeffrey Friedl’s Mastering Regular Expressions can write a decent basic word-splitting regex off the cuff:

m{ \G [ ]* (
    " [^"\\]* (?: \\. [^"\\]* )* "
    |
    [^ ]+
) }gsx

I have written and used this many times. What has always bugged me about it, however, is that it captures the delimiters along with the content, so afterwards you have to do something like this to the captured value:

s!\A"(.*)"\z!$1!s;

This is… not pretty.
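To make the complaint concrete, here is a minimal sketch of the single-capture pattern in a loop, followed by the clean-up step. The sub name split_words_single and the sample input are made up for illustration. Note the /s on the clean-up substitution, which lets a quoted value that spans lines be stripped as well.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into words with the single-capture
# pattern, then strip the surrounding quotes from each captured value.
sub split_words_single {
    my ($line) = @_;
    my @words;
    while ( $line =~ m{ \G [ ]* (
        " [^"\\]* (?: \\. [^"\\]* )* "
        |
        [^ ]+
    ) }gsx ) {
        my $word = $1;
        $word =~ s!\A"(.*)"\z!$1!s;    # the extra clean-up pass
        push @words, $word;
    }
    return @words;
}

print join( '|', split_words_single(q{foo "bar baz" qux}) ), "\n";
# prints: foo|bar baz|qux
```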

Of course, you could use two captures:

m{ \G [ ]* (?:
    " ( [^"\\]* (?: \\. [^"\\]* )* ) "
    |
    ( [^ ]+ )
) }gsx

But then you need to check which of the two captures has the value – is it in $1 or $2? So this is still inelegant. The pattern has already done all the work of examining the string – why can’t it provide its results in an invariant form?
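For comparison, a sketch of what the check looks like in practice (split_words_two is a hypothetical name). The test has to be defined rather than plain truth, because an empty quoted string leaves $1 defined but false.

```perl
use strict;
use warnings;

# Hypothetical helper: the two-capture pattern, picking whichever
# capture took part in the match.
sub split_words_two {
    my ($line) = @_;
    my @words;
    while ( $line =~ m{ \G [ ]* (?:
        " ( [^"\\]* (?: \\. [^"\\]* )* ) "
        |
        ( [^ ]+ )
    ) }gsx ) {
        # Only one of the two groups participates in any given match.
        push @words, defined $1 ? $1 : $2;
    }
    return @words;
}

print join( '|', split_words_two(q{foo "" "bar baz"}) ), "\n";
# prints: foo||bar baz
```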

The problem is that the trailing quote must be present exactly when the leading quote is. That means you must keep the quotes inside the alternation, and then there is no way to avoid either two distinct captures that exclude the quotes or a single broad capture that includes them.

Except, of course, you don’t have to and there is. True enough: when you rely on the matcher to pick an alternation implicitly, the quotes must be included in the alternation. But by using an extended regular expression feature (one that has been marked experimental for a decade – what’s up with that?), namely conditional matches of the form (?(1)yes|no), you can make the match of the trailing quote conditional on the leading quote, independently of any alternation.

m{ \G [ ]*
     (")?
     ( (?(1)
         [^"\\]* (?: \\. [^"\\]* )*
         |
         [^ ]+
     ) )
     (?(1)")
}gsx

And of course you can (and must, in this case) use a conditional match in the middle to explicitly specify which of the cases to pick, depending on the presence of the leading quote.

This way, you can match surrounding delimiters in a captured alternation for some of its cases but not others, without having to include the delimiters in the capture.

Note that the interesting capture is now $2, not $1 – we need the first capture for the quote, since the condition used here tests whether a numbered capture group took part in the match. However, the matched word is always in $2, regardless of whether quotes were involved. Furthermore, $1 has now turned into a true boolean flag (defined if and only if the word was quoted), whereas previously this information had to be inferred (however easily).
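A small sketch of the conditional pattern in use, with $1 read as the flag and $2 as the word. The sub name split_words_cond and the quoted/text hash keys are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hashref per word, recording whether
# the word was quoted ($1 defined) and its text without quotes ($2).
sub split_words_cond {
    my ($line) = @_;
    my @words;
    while ( $line =~ m{ \G [ ]*
        (")?
        ( (?(1)
            [^"\\]* (?: \\. [^"\\]* )*
            |
            [^ ]+
        ) )
        (?(1)")
    }gsx ) {
        push @words, { quoted => ( defined $1 ? 1 : 0 ), text => $2 };
    }
    return @words;
}

print join( '|',
    map { ( $_->{quoted} ? 'Q:' : 'U:' ) . $_->{text} }
        split_words_cond(q{foo "bar baz" qux}) ), "\n";
# prints: U:foo|Q:bar baz|U:qux
```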

This pleases me.

How about using $+ ?

bart on 2009-02-01T01:33:27

TIMTOWTDI, but you don't have to jump through all those hoops just to get one definite capture variable for every case. Take a look at $+ and you'll see it was designed for exactly these cases, where you have captures in alternatives and you don't know which alternative matched. So you can just use your two-capture regex and simply get the one you want with $+.

But I think this is probably not such a good idea. You ought to strip the backslashes in those quoted strings that are just there to escape special characters (backslashes and quotes), unless you always want to remove these backslashes, whether the string was quoted or not.
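A minimal sketch of the $+ approach with the two-capture pattern from the post (split_words_last_paren is a made-up name). Since only one of the two groups participates in any match, $+ holds whichever one did.

```perl
use strict;
use warnings;

# Hypothetical helper: the two-capture pattern, read back through $+
# instead of checking $1 and $2 by hand.
sub split_words_last_paren {
    my ($line) = @_;
    my @words;
    while ( $line =~ m{ \G [ ]* (?:
        " ( [^"\\]* (?: \\. [^"\\]* )* ) "
        |
        ( [^ ]+ )
    ) }gsx ) {
        push @words, $+;    # whichever of $1 / $2 took part in the match
    }
    return @words;
}

print join( '|', split_words_last_paren(q{foo "bar baz" qux}) ), "\n";
```

What this does not tell you is whether the word was quoted, which matters for the backslash handling mentioned above.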

Re:How about using $+ ?

Aristotle on 2009-02-01T08:28:56

Two upsides of using a conditional match:

It works in multiple-capture scenarios.
It provides a boolean flag capture.
The boolean flag makes it particularly nice to make the backslash stripping conditional on the presence of quotes.
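A sketch of that last point, assuming only backslash-escaped quotes and backslashes should be unescaped, as suggested in the parent comment. The sub name split_and_unescape is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split with the conditional pattern and unescape
# \" and \\ only in words that were quoted.
sub split_and_unescape {
    my ($line) = @_;
    my @words;
    while ( $line =~ m{ \G [ ]*
        (")?
        ( (?(1)
            [^"\\]* (?: \\. [^"\\]* )*
            |
            [^ ]+
        ) )
        (?(1)")
    }gsx ) {
        # Copy the captures first: the substitution below resets $1.
        my ( $quoted, $word ) = ( $1, $2 );
        $word =~ s{\\(["\\])}{$1}g if defined $quoted;
        push @words, $word;
    }
    return @words;
}

print join( '|', split_and_unescape(q{a\b "say \"hi\""}) ), "\n";
# prints: a\b|say "hi"
```

The unquoted a\b keeps its backslash, while the quoted word loses only the backslashes that escaped its quotes.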