Binary file visualizers?

brian_d_foy on 2004-01-06T15:20:33

My Mac::iTunes module reads the binary "iTunes Music Library" files that iTunes uses as its database (why can't iPhoto do that?). Each iTunes version changes it slightly, even for the minor versions. That is really no big whoop, but I still have to do a little work to discover how it changed.

I think a good way to handle all of these version issues in the module would be to step back a bit. I want to write a binary file description, in human readable text, then have a Perl module turn that into the parsing code. It is not as complicated as Parse::RecDescent sort of things because it is just going to be a list of things that come after each other and how long they are (with a few exceptions). For instance, a GIF image can be described by such a thing, because it is block-oriented too.

The file for iTunes would look something like this, where the numbers are the bytes, and the names somehow relate to Perl variables the program can access later.

MAJOR_VERSION 1
MINOR_VERSION 1     # the second byte
FOO           1.3.7 # some bit field, bits 3 to 7
HTIM_LENGTH   4
HTIM_BLOCK    4 * HTIM_LENGTH

I write a different description file for each version of iTunes, and I do not have to change the code when iTunes adds a couple of bytes to a block.

I have not really thought about this, so it is a very rough idea, but since I like to grok binary files (just for giggles, you know), this is a really interesting side project for me.

Perhaps, if I can figure out that stuff, I can make a color-coded hex dump utility that visualises all the pieces. That would really be something I would like to have since I am at the level of print-outs and multiple colors of highlighting pens. Even if I could just get it to print out with the colors I wanted I would be happy, but I would rather be thrilled.

So, worthy audience, tell me where this has already been done and why it is a stupid idea!

See magic(5)

merlyn on 2004-01-06T16:16:46

The file(1) program has a magic(5) database that talks about byte offsets and bit positions. You could probably start with that format and add in more semantic hooks about what to do when you get there.

Or, a more perlish solution would be to define "unpack" values for byte offsets and bit positions.

Re:See magic(5)

brian_d_foy on 2004-01-06T18:50:58
You mean a more C-ish solution? :)

magic(5) looks for specific things at specific places, and I need a little bit of logic. For instance, in the iTunes file, the first 100 bytes or so may have fixed meaning, but after that there are variable length blocks, so the offsets now have offsets. I will have to think about that.

I might be able to describe things in terms of pseudo-pack type things, but that would require a lot of byte-fiddling I think since some of the values are odd numbers of bytes. That may seem strange at first, but when you save a handful of bytes for every one of the thousands of songs in the database, it adds up.

I think a couple of the values within the variable length blocks also have variable length---actually, I know they do. I cannot really create an unpack string without parsing the block, can I?

backend implemented, looking for a language

Somni on 2004-01-19T18:04:19

I've actually implemented a backend for this, and am now looking to rewrite the frontend. What I'm using now to specify the format is a large data structure. I've been looking around for an existing mini-language to adapt, but have yet to find one.

My code is here. The binary file parser is File::ComplexFormat (after I rewrite with a new mini-language I intend to rename it to Parse::Binary, or something, and release to CPAN); the parser for the data structure is File::ComplexFormat::SpecParser::Array. The only working example of the module is File::Format::Diablo::d2s, for parsing Diablo 2 character files. The file format is described in the call to File::ComplexFormat, as the spec argument.

My problem hasn't been so much in specifying the data types; I could probably use something similar to C, or just the single lines I have in the data structure format I have now. The problems arise when parsing a complex file; some fields are there only if other fields are set to some value, and there are arrays of values, sometimes based on some count, sometimes based on a delimiter. This requires conditionals and loops based on values that have already been read. In addition, I'm creating a data structure from the parsed file (or writing a data structure to the parsed file), so I needed code that allows me to name the keys in the hash and group like values into sub-hashes.

If you have any ideas on the mini-language or the code I have already I'd like to hear them.

Re:backend implemented, looking for a language

brian_d_foy on 2004-01-19T18:20:53
If you are looking to write a mini-language, check out my Polyglot module. I wrote it to create languages to control my household electronics.

I'll check out your stuff when I get a chance. It sounds very exciting (especially since I don't have to do the work!)