$RS as Regex

gnat on 2003-04-11T16:18:06

I thought "how hard can this be, really?" and wrote some code to read records from a filehandle where the record separator is specified by a regular expression. How hard? Let me quote from my braindump into the Cookbook recipe:

The basic logic is simple: keep a buffer of text read from the file. Try to find a match for the record separator. If we can't find a match, read more text and try again. If we do get a match, then whatever came before the record separator was the record. Stop when you can't match and there's no more data to read into the buffer.

The code is complicated, however, by special cases. When your regular expression matches the empty string, you should get back your data one character at a time. If you find a match for the record separator and have consumed all the data currently in the buffer, you can't be sure that there isn't more of the record separator waiting to be read from the filehandle. So if there's a successful match, you need to put the record and separator back into the buffer, read more data, and try again. And keep trying until you run out of data in the filehandle or you get a match that leaves data in the buffer.
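A minimal sketch of the logic described above, not the code from the post. The names make_record_reader and read_all_records, the 4096-byte default chunk, and the choice to skip an empty match at the very start of the buffer (the way split does) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not the code from the post): returns a closure that
# yields one record per call, or undef when the input is exhausted.
sub make_record_reader {
    my ($fh, $sep, $chunk) = @_;
    $chunk ||= 4096;                      # assumed default read size
    my ($buf, $eof) = ('', 0);
    return sub {
        while (1) {
            pos($buf) = 0;                # always search from the buffer start
            while ($buf =~ /$sep/g) {
                my ($start, $end) = ($-[0], $+[0]);
                # An empty match at offset 0 would yield nothing and consume
                # nothing, so skip it; /g then retries further along.
                next if $end == 0;
                # The match runs to the end of the buffer: the separator might
                # continue in unread data, so read more and try again.
                last if $end == length($buf) && !$eof;
                my $record = substr($buf, 0, $start);
                substr($buf, 0, $end) = '';   # drop record and separator
                return $record;
            }
            if ($eof) {
                return undef if $buf eq '';
                my $record = $buf;        # trailing data with no separator
                $buf = '';
                return $record;
            }
            my $n = read($fh, $buf, $chunk, length $buf);
            $eof = 1 unless $n;           # 0 means EOF, undef means error
        }
    };
}

# Convenience wrapper: collect every record from a filehandle.
sub read_all_records {
    my ($fh, $sep, $chunk) = @_;
    my $next = make_record_reader($fh, $sep, $chunk);
    my @records;
    while (defined(my $record = $next->())) { push @records, $record; }
    return @records;
}
```

With an in-memory filehandle holding "alpha, beta;gamma ,  delta", a separator of qr/\s*[,;]\s*/ and a chunk size of 4, this should hand back alpha, beta, gamma and delta. A separator that can match the empty string, such as qr/x*/, splits the text between real separators into single characters.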

Tom's sanity-checking my code now, but when he's done I'd like to find someone willing (foolish?) to turn it into a CPAN module. I don't have time to do the distro framework, write more tests (I used Test::More when writing the code, and T::M truly rocks), or package it usefully. (I have a stab at a tied filehandle interface so that you can use <FH> even when $RS is a regex.)

Any volunteers?

--Nat
(you can tell it's a first draft because I bounce around between we and you, one of my weaknesses)


how is greediness handled?

wickline on 2003-04-11T17:55:30

I'm wondering how this takes regex greediness into account.

What if you have a regex
qr/(?:01)*/
and you have a file which is several MB of random zeros and ones. If you read in
111101010
into your buffer, how can you know that the zero at the end of your buffer isn't about to be followed by another one?

It seems that if you have greedy regex elements, then you may have to slurp in the whole file to be able to tell whether you've matched the longest possible record separator. One could write even more pathological cases like
qr/.*(?=.)/s
or somesuch. That would match whenever your buffer had at least one character in it, but it technically shouldn't match until near the end of the file.

Hmmm... it looks like you also have to be careful to increment the pos() even if the regex match doesn't... and keep the pos() in sync with your buffer manipulation. You wouldn't want something with a lookahead (like the example above) to generate (in that case) an infinite sequence of empty records due to bookkeeping issues.
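To make the bookkeeping concern concrete, here is a small sketch. Both helper names are made up for illustration, and naive_records is deliberately broken: it trims the buffer by the match end alone, so the $limit argument is there only to make the demo terminate.

```perl
use strict;
use warnings;

# Offsets at which a pattern matches under /g. Perl itself refuses to return
# the same zero-length match twice, so this loop always terminates.
sub match_positions {
    my ($text, $re) = @_;
    my @at;
    while ($text =~ /$re/g) { push @at, $-[0]; }
    return @at;
}

# Hypothetical naive splitter: rematches from the start of the buffer each
# time and trims by $+[0]. A zero-length match at offset 0 removes nothing,
# so without $limit it would emit empty records forever.
sub naive_records {
    my ($buf, $sep, $limit) = @_;
    my @records;
    while (length($buf) && @records < $limit) {
        last unless $buf =~ /$sep/;
        my ($start, $end) = ($-[0], $+[0]);
        push @records, substr($buf, 0, $start);
        substr($buf, 0, $end) = '';       # no-op when $end is 0
    }
    return @records;
}
```

match_positions('abc', qr/(?=.)/) steps through offsets 0, 1 and 2 and then stops, because the /g machinery advances past a zero-length match. naive_records('abcde', qr/.*(?=.)/s, 10) gets stuck on the leftover 'e' and returns ten empty records, stopping only because of the cap.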

-matt

Re:how is greediness handled?

gnat on 2003-04-12T01:29:49

Greediness: if we get a match that leaves nothing in the buffer, then read some more into the buffer and try again until we either exhaust the file (for a regexp like /.*/s) or have a match that leaves something in the buffer.

Pathological cases will cause the entire file to be read into memory, but I don't see a way around that. If your record separator is /.*/s then you're saying to Perl "the entire file is my record separator". I don't see a way to handle this except by reading the whole file. That's your own silly fault for having such a bogus record separator.

--Nat

Re:how is greediness handled?

wickline on 2003-04-12T16:50:25

> if we get a match that leaves nothing in the buffer

But the point I was attempting to make was that whether or not something is left in the buffer is not the best indication of whether or not the RS matched enough stuff. Maybe I'd have to see the actual code, but from the description of it, it sounds like it could behave differently depending on how input matched up with buffer size. You could use the same data and RS and get different results depending on your buffer size.
my $record_sep = qr/(?:01)*/;
my $data = '11110101010101000001111001011110';
# RS should be  ^^^^^^^^^^    ^^    ^^^^
# but if buffer goes to ^ then we'll have a problem
# code will use ^^^^^^^^ as the first separator
The regex will match four '01' pairs and leave a '0' in the buffer not realizing that had it only read in a bit more text, it could have matched one more '01' pair as part of that first RS.
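A quick way to check this, sketched under assumptions: first_separator is a hypothetical helper that feeds the data in fixed-size chunks and accepts the first match that leaves something in the buffer. It uses qr/(?:01)+/ rather than the * form so that the empty match at offset 0 doesn't obscure the point.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first separator found when $data arrives in
# $chunk-sized pieces and a match is accepted as soon as it leaves unmatched
# text in the buffer.
sub first_separator {
    my ($data, $sep, $chunk) = @_;
    my ($buf, $offset) = ('', 0);
    while ($offset < length($data)) {
        $buf .= substr($data, $offset, $chunk);
        $offset += $chunk;
        if ($buf =~ /$sep/ && $+[0] < length($buf)) {
            return substr($buf, $-[0], $+[0] - $-[0]);
        }
    }
    # Out of data: accept whatever matches, if anything does.
    return $buf =~ /$sep/ ? substr($buf, $-[0], $+[0] - $-[0]) : undef;
}
```

With the $data above, a chunk size of 13 ends the buffer on the '0' marked in the comment and yields '01010101' (four pairs), while a chunk size of 32 yields '0101010101' (five pairs). The data and regex are the same and only the buffer size differs.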

In this next case, the problem isn't that it would slurp in the whole file, the problem is that it should slurp in the whole file, but from the description, it sounds like it might not actually do enough slurping, because there would still be unmatched text in the buffer. It's another example to show how 'text left in the buffer' may not be the best indicator given that regexes can be greedy.
my $record_sep = qr/.*(?=.)/s;
my $data = 'abcdefghijklmnopqrstuvwxyz';
Suppose your buffer size is five characters. First the buffer reads in
abcde
It applies the regex to that buffer and gets a match. The match indicates that the RS is
abcd
and that there is text still left unmatched in the buffer
e
So, it declares a success. There's an empty string for the first record, and 'abcd' as the first record separator. What happens after that depends on how you manage the buffer, but chances are that the end result would be something other than the expected result, which would be two records: '' and 'z'. My guess would be that you'd end up with extra '' records, and that the number of extras would depend on the buffer and data size. For small buffers and/or larger data sizes, you'd have more extra '' records. ...but all of this is just guessing without looking at code :)
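Without the real code, one way to test that guess is a stand-in. The sketch below is an assumption about the buffer handling, not the actual implementation: records_with_chunk is a hypothetical function that accepts a match unless it reaches the end of the buffer while data remains, and skips an empty match at offset 0 so the loop can't stall.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reader under discussion: split $data into
# records while only ever seeing it $chunk characters at a time.
sub records_with_chunk {
    my ($data, $sep, $chunk) = @_;
    my ($buf, $offset, @records) = ('', 0);
    while (1) {
        my $eof   = $offset >= length($data);
        my $found = 0;
        pos($buf) = 0;
        while ($buf =~ /$sep/g) {
            my ($start, $end) = ($-[0], $+[0]);
            next if $end == 0;                      # skip empty match at start
            last if $end == length($buf) && !$eof;  # nothing left over: read more
            push @records, substr($buf, 0, $start);
            substr($buf, 0, $end) = '';
            $found = 1;
            last;
        }
        next if $found;                             # look for the next record
        if ($eof) {
            push @records, $buf if length $buf;     # trailing record, if any
            return @records;
        }
        $buf .= substr($data, $offset, $chunk);     # "read" another chunk
        $offset += $chunk;
    }
}
```

Under these assumptions, the 26-letter alphabet with qr/.*(?=.)/s gives two records ('' and 'z') when the chunk size is 26, but seven records (six '' and then 'z') when it is 5. That is the buffer-size dependence guessed at above.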

-matt