Parsing

Matts on 2001-09-04T08:50:40

This appeared in the discussion board attached to this journal:

Looks like your work on xml parser is involving you in some of the details of LL parsing. You might want to check out www.antlr.org Despite it being a Java tool, it is a nice tool for LL(k) [IMHO, Way better than yacc and the associated lalr grammars]

Actually I find this kind of parsing really easy. The good thing about XML is its 99% suitable for "character at a time" parsing. So the tokeniser is just "give me a character". There aren't many multi-character tokens in XML, the few exceptions being the bits in the DTD (like DOCTYPE, ELEMENT, ENTITY etc). So all I do for that is to read in a character at a time and if it's not what I expect, I buffer what I've read into the Reader class and backup. Actually the reader class does all that for me - I have match() and match_str() methods which do all the work.

So, take a section of the grammar:

document ::= prolog element Misc*

In perl code this becomes:

sub document { my ($self, $reader) = @_; $self->prolog($reader); $self->element($reader); while(!$reader->eof) { $self->Misc($reader); } }

Misc is defined as:

Misc ::= Comment | PI | S

So in perl code this becomes:

sub Misc { my ($self, $reader) = @_; if ($self->Comment($reader)) { return 1; } elsif ($self->PI($reader)) { return 1; } elsif ($self->skip_whitespace($reader)) { return 1; } return 0; }

Now this is actually slightly different from the grammar, in that S is one whitespace character, whereas skip_whitespace consumes all whitespace. But analysis of the grammar shows that Misc never comes in isolation (i.e. it's always Misc*, or (something | else | Misc)*, so consuming all WS is safe.

Finally, the actual parsing (all you really see above is expansion of the grammar into perl code - it's missing the actual parsing of tokens bit, which happens at the leaves of the code).

sub Comment { my ($self, $reader) = @_; if ($reader->match_string('') || $self->parser_error(...); $self->lexhandler_method('comment', { Data => $comment_str }); return 1; } return 0; }

I hope that makes sense to people :-) Oh, and $CharMinusDash is a qr() regexp contained in an external file (Productions.pm). It uses the new unicode \x{...} syntax which only works in 5.7.2, sadly. Ah well, we need 5.7.2 to be able to do character set switching (using perlio, Encode, and binmode) so that's how it has to be.

Anyway, the upshot is that I find this a lot easier to work with than grammar "toolkits" like yacc (or antlr), simply because I'm intimately familiar with Perl. I know how to debug it. Perl is my friend :-)

Unfortunately the XML grammar is a bitch. It's not complete. The XML spec authors decided to punt on a lot of grammar issues and put those bits of the spec in the verbiage, rather than making the grammar complete (because this would confuse people reading the spec, apparently - I fail to see how that's really true!). So now I'm left trying to figure out how you might go about parsing PEReferences (those %foo; things that allow you to modularise DTDs) in the internal and external subset, since the grammar doesn't specify that properly. Bah! Maybe I'll skip that part for now :-)