I thought "how hard can this be, really?" and wrote some code to read records from a filehandle where the record separator is specified by a regular expression. How hard? Let me quote from my braindump into the Cookbook recipe:
The basic logic is simple: keep a buffer of text read from the file. Try to find a match for the record separator. If we can't find a match, read more text and try again. If we do get a match, then whatever came before the record separator was the record. Stop when you can't match and there's no more data to read into the buffer.
The code is complicated, however, by special cases. When your regular expression matches the empty string, you should get back your data one character at a time. If you find a match for the record separator and have consumed all the data currently in the buffer, you can't be sure that there isn't more of the record separator waiting to be read from the filehandle. So if there's a successful match, you need to put the record and separator back into the buffer, read more data, and try again. And keep trying until you run out of data in the filehandle or you get a match that leaves data in the buffer.
Tom's sanity-checking my code now, but when he's done I'd like to find someone willing (foolish?) to turn it into a CPAN module. I don't have time to do the distro framework, write more tests (I used Test::More when writing the code, and T::M truly rocks), or package it usefully. (I have a stab at a tied filehandle interface so that you can get <FH> even when $RS is a regex.)
Any volunteers?
--Nat
(you can tell it's a first draft because I bounce around between we and you, one of my weaknesses)
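The logic quoted above could be sketched roughly as follows. This is not the code from the post (which isn't shown here), just a minimal illustration under assumptions: the name make_record_reader, the closure interface, and the chunk-size parameter are all hypothetical. It accepts a match only when data remains in the buffer after it or the filehandle is exhausted, and it hands back a single character when the separator matches the empty string at the front of the buffer.

```perl
use strict;
use warnings;

# Hypothetical helper, not the code from the post: returns a closure that
# yields one record per call and an empty return once input is exhausted.
sub make_record_reader {
    my ($fh, $rs, $chunk_size) = @_;
    $chunk_size ||= 4096;
    my $buf = '';
    my $eof = 0;

    return sub {
        while (1) {
            if (length($buf) && $buf =~ $rs) {
                my ($start, $end) = ($-[0], $+[0]);

                # Empty match at the front of the buffer: one character.
                return substr($buf, 0, 1, '') if $end == 0;

                # Accept the match only if data remains after it, or if
                # there is nothing more to read from the filehandle.
                if ($end < length($buf) || $eof) {
                    my $record = substr($buf, 0, $start);
                    substr($buf, 0, $end, '');
                    return $record;
                }
                # Otherwise the match ran to the end of the buffer:
                # fall through, read more, and try again.
            }
            elsif ($eof) {
                return unless length($buf);
                my $record = $buf;      # trailing record with no separator
                $buf = '';
                return $record;
            }
            my $got = read($fh, $buf, $chunk_size, length($buf));
            $eof = 1 unless $got;       # 0 at end of file, undef on error
        }
    };
}

# Usage: an in-memory filehandle and tiny chunks to force buffer refills.
my $text = 'one--two---three';
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
my $next = make_record_reader($fh, qr/-+/, 4);
my @records;
while (defined(my $record = $next->())) {
    push @records, $record;
}
print join('|', @records), "\n";
```

With a four-character chunk, the first read ends in the middle of the "--" separator, so the sketch has to refill before it can accept the match. It makes no attempt to handle separators that could still grow after a match that leaves data in the buffer.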
Suppose the record separator is
qr/(?:01)*/
and you have a file which is several MB of random zeros and ones. If you read in
111101010
into your buffer, how can you know that the zero at the end of your buffer isn't about to be followed by another one?
Or suppose it is
qr/.*(?=.)/s
or somesuch. That would match whenever your buffer had at least one character in it, but it technically shouldn't match until near the end of the file.
Re:how is greediness handled?
gnat on 2003-04-12T01:29:49
Greediness: if we get a match that leaves nothing in the buffer, then read some more into the buffer and try again until we either exhaust the file (for a regexp like /.*/s) or have a match that leaves something in the buffer. Pathological cases will cause the entire file to be read into memory, but I don't see a way around that. If your record separator is /.*/s then you're saying to Perl "the entire file is my record separator". I don't see a way to handle this except by reading the whole file. That's your own silly fault for having such a bogus record separator.
--Nat
Re:how is greediness handled?
wickline on 2003-04-12T16:50:25
> if we get a match that leaves nothing in the buffer
But the point I was attempting to make was that whether or not something is left in the buffer is not the best indication of whether or not the RS matched enough stuff. Maybe I'd have to see the actual code, but from the description of it, it sounds like it could behave differently depending on how input matched up with buffer size. You could use the same data and RS and get different results depending on your buffer size.
my $record_sep = qr/(?:01)*/;
my $data = '11110101010101000001111001011110';
# RS should be  ^^^^^^^^^^    ^^    ^^^^
# but if buffer goes to ^ then we'll have a problem
# code will use ^^^^^^^^ as the first separator
The regex will match four '01' pairs and leave a '0' in the buffer, not realizing that had it only read in a bit more text, it could have matched one more '01' pair as part of that first RS.
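The boundary effect in this example can be checked directly by matching the same regex against a truncated buffer and against the whole string. A small sketch, with one deliberate change that is an assumption of this illustration: it uses '+' instead of '*' so the first match is the first non-empty run of '01' pairs, since '*' would match the empty string at offset 0. The helper name first_match is hypothetical.

```perl
use strict;
use warnings;

my $rs   = qr/(?:01)+/;   # assumption: '+' rather than the '*' used above
my $data = '11110101010101000001111001011110';

# Return [offset, matched text] for the first match of $rs in $text.
sub first_match {
    my ($text) = @_;
    return unless $text =~ $rs;
    return [ $-[0], substr($text, $-[0], $+[0] - $-[0]) ];
}

my $partial = first_match(substr($data, 0, 13));  # buffer ends on a lone '0'
my $full    = first_match($data);

printf "13-char buffer: offset %d, separator '%s'\n", @$partial;
printf "whole string:   offset %d, separator '%s'\n", @$full;
```

The truncated buffer yields a four-pair separator with a '0' left over, while the whole string yields five pairs, so a "something is left in the buffer" test would accept the shorter match.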
In this next case, the problem isn't that it would slurp in the whole file; the problem is that it should slurp in the whole file, but from the description, it sounds like it might not actually do enough slurping, because there would still be unmatched text in the buffer. It's another example to show how 'text left in the buffer' may not be the best indicator given that regexes can be greedy.
my $record_sep = qr/.*(?=.)/s;
my $data = 'abcdefghijklmnopqrstuvwxyz';
Suppose your buffer size is five characters. First the buffer reads in
abcde
It applies the regex to that buffer and gets a match. The match indicates that the RS is
abcd
and that there is text still left unmatched in the buffer
e
So, it declares a success. There's an empty string for the first record, and 'abcd' as the first record separator. What happens after that depends on how you manage the buffer, but chances are that the end result would be something other than the expected result, which would be two records: '' and 'z'. My guess would be that you'd end up with extra '' records, and that the number of extras would depend on the buffer and data size. For small buffers and/or larger data sizes, you'd have more extra '' records.
...but all of this is just guessing without looking at code :)
-matt
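The lookahead example in the comment above is also easy to try for several buffer sizes. This is only an illustrative sketch (the helper name split_buffer and the chosen sizes are hypothetical); it shows what the regex matches in a buffer of a given size, not how any particular reader would manage that buffer afterwards.

```perl
use strict;
use warnings;

my $rs   = qr/.*(?=.)/s;
my $data = 'abcdefghijklmnopqrstuvwxyz';

# For a buffer holding the first $size characters of $data, return the text
# the separator matches and the text left over after it.
sub split_buffer {
    my ($size) = @_;
    my $buf = substr($data, 0, $size);
    return unless $buf =~ $rs;
    return (substr($buf, $-[0], $+[0] - $-[0]), substr($buf, $+[0]));
}

for my $size (5, 10, length($data)) {
    my ($sep, $rest) = split_buffer($size);
    print "buffer of $size: separator '$sep', left over '$rest'\n";
}
```

Whatever the buffer size, the match always leaves exactly one character behind, so the amount of leftover text says nothing about whether the separator has finished growing.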