Regexps for reverse engineering?

gav on 2004-10-16T18:33:09

I got roped into trying to help a friend of a friend extract some reports from their billing application data. I got given a 26 meg data file to play with, and some digging with 'strings' helped me find bits of data mixed in with the binary gibberish. I came up with the following code:

my $person  = qr/[\x08\x09]([A-Z]{8})/;
my $lo      = qr/[\x00-\x20]{1,3}/;
my $id      = qr/OH${lo}7${lo}(\d{5})${lo}AA/;
my $date    = qr/\x06(\d\d)(\d\d)(\d\d)/;
my $time    = qr/\x08(\d\d):(\d\d):(\d\d)/;
my $money   = qr/([1-9]\d+\.\d\d)/;
my $chars   = qr/[\x20-\x7E]+?/;
my $desc    = qr/\x17($chars)\x00/;

while ($data =~ /$person.*?$id.*?$date.*?$desc.*?$time.*?$time.*?$money/gs) {
    my ($p_id, $r_id, $d, $text, $t1, $t2, $cost) 
        = ($1, $2, "$3/$4/$5", $6, "$7:$8:$9", "$10:$11:$12", $13);
    print "$p_id: $r_id -- $d $t1 -> $t2 -- [$text] \$$cost\n";
}

I was happy: I'd got back a bunch of records that looked sensible. The problem was that I wasn't getting some of the records that I saw in the file. Changing my $person to qr/[\x08\x09](G[A-Z]{7})/ or qr/[\x08\x09](S[A-Z]{7})/ gives me a bunch of different records. But why would (S[A-Z]{7}) give different results than ([A-Z]{8})? I'm stumped.


three regexen

wickline on 2004-10-17T20:43:49

> Changing my $person to
>    qr/[\x08\x09](G[A-Z]{7})/
> or
>    qr/[\x08\x09](S[A-Z]{7})/
> gives me a bunch of different records. But why would
>    (S[A-Z]{7})
> give different results than
>    ([A-Z]{8})
> ? I'm stumped.

I'm not sure if I'm misreading this, but it looks like you have three different regular expressions there:
    qr/  G [A-Z]{7}  /x
    qr/  S [A-Z]{7}  /x
    qr/    [A-Z]{8}  /x
I would expect the three of them to match different values. The first matches eight-letter uppercase words starting with 'G'. The second matches eight-letter uppercase words starting with 'S'. The third matches eight-letter uppercase words starting with any letter.
    GABCDEFG    matches first and third
    SABCDEFG    matches second and third
    ABCDEFGH    matches only the third
... or am I missing the point?
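
A minimal sketch of the comparison above. The ^ and $ anchors and the helper name `matching` are assumptions added for illustration, so that each test word has to match in full:

```perl
use strict;
use warnings;

# The three patterns, anchored so a word must match from start to end.
my %re = (
    first  => qr/^G[A-Z]{7}$/,
    second => qr/^S[A-Z]{7}$/,
    third  => qr/^[A-Z]{8}$/,
);

# Return the names of the patterns that match a word, in a fixed order.
sub matching {
    my ($word) = @_;
    return grep { $word =~ $re{$_} } qw(first second third);
}

for my $word (qw(GABCDEFG SABCDEFG ABCDEFGH)) {
    printf "%-8s matches: %s\n", $word, join(', ', matching($word));
}
```

Every word that matches the first or second pattern also matches the third, which is the subset relationship the rest of the thread relies on.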

-matt

Re:three regexen

gav on 2004-10-17T21:02:15

> I would expect the three of them to match different values. The first matches eight-letter uppercase words starting with 'G'. The second matches eight-letter uppercase words starting with 'S'. The third matches eight-letter uppercase words starting with any letter.

Sorry for not being clear. I'd expect the matches from /S[A-Z]{7}/ to be a subset of the matches from /[A-Z]{8}/, but instead the latter isn't returning some of the results the former does.

Re:three regexen

wickline on 2004-10-17T21:12:13

That would match my expectation as well. Are you 100% positive that there is no other difference in the two regexen (or the two scripts)?

If so, I'd be stumped too :/

-matt

Re:three regexen

n1vux on 2004-10-18T15:08:27

Glad to see the /x modifier there. /S[A-Z]{7}/ should be a subset of /[A-Z]{8}/, in particular the subset /(?=S)[A-Z]{8}/. If it isn't, it could be a bug in the backtracking logic ... or an issue with binmode?
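
As a rough check of that equivalence, the sketch below runs both forms over a made-up byte string. The sample data and the helper name `all_matches` are assumptions for illustration, not taken from the real file:

```perl
use strict;
use warnings;

my $plain     = qr/S[A-Z]{7}/;
my $lookahead = qr/(?=S)[A-Z]{8}/;

# Collect every non-overlapping match of a pattern in a string.
sub all_matches {
    my ($re, $text) = @_;
    my @found = $text =~ /($re)/g;
    return @found;
}

# Hypothetical sample: three names separated by control bytes and junk.
my $sample = "\x08SMITHSON\x00junk\x09GARDENER\x00\x08SAUNDERS";

print join(' ', all_matches($plain,     $sample)), "\n";   # SMITHSON SAUNDERS
print join(' ', all_matches($lookahead, $sample)), "\n";   # SMITHSON SAUNDERS
```

On their own the two patterns agree, so any difference seen in the full script is more likely to come from the rest of the combined pattern than from this piece.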

If there is any possibility of accented 'national' characters (which there always is in unconstrained data), '\w' is much preferred to [A-Za-z] or [A-Z]/i, though bear in mind that '\w' also matches digits and underscore.

I'd worry that some 'persons' might actually be shorter than 8 chars, or have spaces or lower case in some systems (van Helsing, etc.).

What strings(1) shows you isn't quite what Perl sees. Try xd(1) or od(1) to see details. If on Windows (or VMS?) set binmode(3) on your filehandle. (For portability, set binmode anytime reading binary data.)
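
A minimal sketch of reading binary data with binmode and looking at the actual bytes from inside Perl. The temporary file, its made-up contents, and the helper name `hex_bytes` are all assumptions for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small stand-in for the real data file (made-up bytes,
# including CR, LF and Ctrl-Z, which text mode can mangle on Windows).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\x08SMITHSON\x00\x0D\x0A\x1A\x06041016";
close $out or die "close: $!";

# Read it back as raw bytes, slurping the whole file.
open my $in, '<', $path or die "open $path: $!";
binmode $in;
my $data = do { local $/; <$in> };
close $in;

# Render a byte string as space-separated two-digit hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

print length($data), " bytes: ", hex_bytes($data), "\n";
```

Comparing the byte count and the hex dump with what od(1) reports for the same file is a quick way to confirm that Perl is seeing the data unmodified.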

Good luck; we'll be interested to hear what the results are.

Dangers of .*?

vsergu on 2004-10-18T18:32:17

I suspect part of the problem could be that you're not getting the correspondence between person, ID, and so on that you expect because .*? is matching more than you want it to. You could be matching a person from one record and then skipping over parts of that record to another one because some of the other patterns don't match for that record.
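
A sketch of how that could produce the "missing" records. The patterns are simplified stand-ins for the ones in the original post, and the data, the `#` ID format, and the helper name `pairs` are made up. The bounded gap in the last call is only one possible remedy, and it assumes every record starts with a person marker:

```perl
use strict;
use warnings;

# Simplified, hypothetical stand-ins for the original patterns.
my $any_person = qr/[\x08\x09]([A-Z]{8})/;
my $s_person   = qr/[\x08\x09](S[A-Z]{7})/;
my $id         = qr/#(\d{5})/;
my $marker     = qr/[\x08\x09][A-Z]{8}/;   # person marker, no capture

# Two ways to skip between fields: an unbounded lazy gap, and one that
# refuses to step over the start of another person marker.
my $lazy_gap    = qr/.*?/s;
my $bounded_gap = qr/(?:(?!$marker).)*?/s;

# Made-up data: GARDENER's record has no ID, SMITHSON's does.
my $data = "\x08GARDENER\x00no id here\x00\x08SMITHSON\x00#12345\x00";

# Return "person:id" for every match of person, gap, ID.
sub pairs {
    my ($person, $gap, $text) = @_;
    my @out;
    while ($text =~ /$person$gap$id/g) {
        push @out, "$1:$2";
    }
    return @out;
}

print "any person, lazy gap:    @{[ pairs($any_person, $lazy_gap,    $data) ]}\n";  # GARDENER:12345
print "S person,   lazy gap:    @{[ pairs($s_person,   $lazy_gap,    $data) ]}\n";  # SMITHSON:12345
print "any person, bounded gap: @{[ pairs($any_person, $bounded_gap, $data) ]}\n";  # SMITHSON:12345
```

With the lazy gap, the general pattern pairs GARDENER with SMITHSON's ID and then has nothing left to match, so SMITHSON never appears in its results even though the S-only pattern finds it.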

It may not be what's happening in this case, but a common oversight in constructing regexes is to assume that .*? will match the shortest possible sequence of characters to satisfy the overall match. That's not true. It still matches at the first position it can, and takes the shortest sequence that starts there.
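
A tiny illustration of that rule, using a made-up string. The negated character class in the second pattern is one common way to get the shorter match, shown here as a sketch rather than a recommendation for the billing data:

```perl
use strict;
use warnings;

my $text = 'AxxxAyB';

# Lazy dot: the match starts at the first 'A' and runs to the first 'B'.
my ($lazy) = $text =~ /(A.*?B)/;

# Excluding 'A' from the gap forces the match to start at the last 'A'.
my ($tight) = $text =~ /(A[^A]*B)/;

print "lazy:  $lazy\n";    # AxxxAyB
print "tight: $tight\n";   # AyB
```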