SAX

TorgoX on 2002-01-21T23:33:56

Dear Log,

Odd things are happening as I go around asking XML people how they would imagine SAX to relate to push-parsing (the sort of parsing that HTML::Parser does) and pull-parsing (like HTML::TokeParser does):

First off, many people seem not to understand the question, and it takes a lot of explanation to get the idea across; I think this comes from a lack of experience (on their part) in comparable data formats other than XML. (This is just an impression on my part, but it's an awfully persistent one; I'm not sure what to make of it tho.)

But more importantly and ominously, everyone seems to interpret SAX differently, while blissfully insisting that they are all in agreement and that there is no problem. This is very disconcerting. A similar situation existed with Pod; and I found it very very hard to fix that situation -- to the limited extent that I did manage to fix it with perlpodspec. So anyone who wants to improve SAX specification will probably face similar problems. I'm glad I'm not in their shoes.


Precisions

darobin on 2002-01-22T00:21:08

I'd be curious to have a few precisions here. SAX is push-parsing (with the tiny extra that you can get a little context from the driver if it provides it), so I don't see how people could get that wrong ;-) As for pull-parsing, it is true that SAX does nothing for that. It should be too hard to come up with an API for pull-parsing, the trouble is mostly in agreeing on one. I guess that if someone presents a pull-parser system that is reasonable enough, it'll be adopted.

I would think that people's misunderstandings come mostly from thinking that SAX is an XML-only API. You can read anything into SAX, and write SAX out to anything. It just so happens that it was created for XML, and that the intervening data maps to a tree structure (seen token by token) that is coherent with that of XML. I also think that the problem is that people that have experience in other data formats misunderstand XML because they expect XML to "do something" just like that, as if it were a dedicated syntax. It's just glue for data. You're rarely do something interesting with glue alone (glue sculptures? ;-).

As for the part that interests me most, how many different interpretations of SAX did you get, and how did they differ? There are a few bugs and inconsistencies in parsers here and there, and some parts of the spec are not implemented anywhere (though this is steadily progressing). Apart from that, the SAX model is rather simple. Events have a defined order, and defined content. I don't think that the situation is comparable to that of POD, not by a fair margin. At least, not that I've noticed while using any SAX2 drivers. Differences were filed as bugs, and flattened out.

I'm pretty much in the shoes of the guy that edits the PerlSAX2 spec and I have strong interest in seeing everyone happy with SAX I'd be delighted to have a clearer description of points on which you think the spec is unclear, or allows for differing interpretations. Until there is at least one parser that is complete (which should happen soon now that Matt has produced locator code) the spec is considered to be in minimal flux. By minimal I mean no big change can happen, but minor fixes can. In some cases, this includes things as simple as specifying that data item Foo, when it doesn't exist, must be '' and not undef. Nothing big, but important precisions if we are to put everyone on the same page :-)

Knowing what you think is wrong in there would definitely help.

Re:Precisions

TorgoX on 2002-01-22T02:58:06

As for the part that interests me most, how many different interpretations of SAX did you get, and how did they differ?

A notable point of difference was between people who considered the events to be all that the spec specified, and people who considered the spec to specify events and also parse(), parse_file(), parse_string(), and their behavior. If you read SAX as obliging one to follow the behavior of parse() et al in current parsers, then you can't easily implement something like HTML::TokeParser.

Maybe it would be simpler if someone wise in the spirit of SAX would go and read and understand the interfaces to HTML::Parser and HTML::TokeParser, and then implement something SAXful SAXilicious SAXicilious with the same kinds of interfaces.

One "there, I did it, look how!" is worth a dozen "Well, it's possible... "'s.

Go! Do it!

Re:Precisions

darobin on 2002-01-22T03:37:45

Ah thanks for the precision. That is an area that I would never have considered grey, but then I have my nose right inside it all the time, and a lot of the talk that define(s|d) SAX isn't archived as it happened on IRC.

Here's an attempted short breakdown of the general idea (as clear as I can make it past 4am). The parse calls (parse, parse_file etc.) are part of the SAX spec, and there's a very good reason for this and for why I very much doubt that it'll change. SAX drivers are meant to be totally interchangeable (modulo Features, which are queriable). That's why there's such a thing as XML::SAX::ParserFactory: you ask for a parser, get it, and interact with it without needing to know which parser it is. A bit like DBI after a fashion.

Because of that, the parse calls need to be clearly specified, and consistent accross parsers.

Now, that does not mean that there isn't a (conceptual) separation between those parts of the interface, and the content of what the events receive. The contents are consistent accross all events as you can see, and in fact are meant to be similar in PerlDOM. The only reason why they are presently spec'd in PerlSAX only is because that's the only spec that's mature enough to be spoken of. If someone would come up with a pull-parser spec that would reuse those data-items, then it might make sense to have a common spec for the data bits, and separate specs for the interfaces.

As for grafting a pull-parser API on top of the SAX API, I think that's a bad idea. SAX is push by essence, and it makes it simpler that way. This means that a pull-parser API would live in a different spec, and be implement most of the time by different modules. Of course, it doesn't mean that they shouldn't share the same node representations -- much to the contrary -- but the two should be separated so that no one expects parsers from one side to be also able to behave as parsers from the other side. That would make some of them very hard and others potentially impossible.

People have been saying that a pull-parser equivalent of SAX would be very cool for almost as long as SAX has existed. However I was talking with Kip about that earlier and despite the fact that we all agree it would be nice to have, no one appears to have been itched enough to write one.

And that's why I'm going to disappoint you and leave you with a "Well, it's possible..." I have no need whatsoever for a pull-parser, SAX fills all of my parsing needs. I agree that a pull-parser would be nice, and should someone endeavour to write one I will help with what I can. Hey, TokeParser has so simple an API that doing the same thing for XML, reusing the node representations from SAX, ought to be a no-brainer (provided one has an underlying parser that supports that type of interaction).

I, like many others, simply haven't felt the need for that bad enough to scratch the itch so far. Someone deciding to implement such a parser would, however, get lots of support from us folks that are happy as it is with SAX ;-)

As for HTML::Parser, I know it well and have used it intensively until I found out about XML (in fact I used to use it as an XML parser, more or less). Is there anything from it that you think would be useful to the SAX space? All I can think of off the top of my head is the ability to specify parameter templates but I've always found that that was more of a hassle (even after writing many programs that use it I still need to refer to the docs) than anything else. However, if you really need that feature, it could easily be done as a filter. I won't code that myself though, so here's another "Well, you could..." ;-) Hey, don't blame me. I code SAX every day and have "look, here's what I've done" SAX stuff every day. I can't possibly implement all the SAX-related ideas I hear about ;-) I can, however, promise to support with what means I have anyone that wishes to make an idea SAXy.

Re: SAX

kingubu on 2002-01-22T00:22:38

Well, I'm the first to admit that my background is not in CS; so asking "is SAX a blah-blah-parser?" will likely draw a stare.

That said, however, the reality is that SAX itself makes no rule about which type of parser you hook to it. It is a simple event-based API who's only real expectation is that the various "lexical events" (start_document, end_document, start_element, end_element, etc.) are fired in document order. How you choose to fire those events based on the needs of your application is entirely up to you.

If you want "chunk style" parsing, write a driver that accepts chunks of whatever your heart desires and fires the SAX events in the order you expect. If you want something that processes an entire (whatever) document and then fires all the SAX events when its done processing, then go ahead and write a driver that does that... No one is stopping you.

But if you expect anyone who knows SAX and how easy it truly makes the task of writing drivers for non-XML data to tie their own hands by specifying what a driver *should* be, or *has to be* so that Sean can say "oh, well, its a bibble-parser; i can sleep now" -- don't expect to come back unbruised. :-)

Re: SAX

TorgoX on 2002-01-22T02:25:29

Well, I'm the first to admit that my background is not in CS; so asking "is SAX a blah-blah-parser?" will likely draw a stare. [...] so that Sean can say "oh, well, its a bibble-parser; i can sleep now"

I'm interested to see that your admitted inability to understand what I'm saying hasn't stopped you from reacting to it indignantly.

Re: SAX

darobin on 2002-01-22T02:40:23

Well, I must say that I understand kingubu's reaction. He worked hard on PerlSAX2 as several of us did and reading here that SAX is just as bad as POD if not worse, without even the slightest factual argument to sustain such a claim is a bit, err, hard :-) That's why I'm asking for precisions above, I know it's a journal and anyone can bitch about whatever (which I do) without sustaining one's claims but still, if you think SAX sucks we would be glad to hear about why it sucks instead of just aspersions on the fact that you think it does. There are plenty of feedback channels, two of the most immediately efficient being the perl-xml list and #axkit/#axkit-dahut on rhizomatic.net where one can usually throw rotten fruits at the PerlSAX quadrumvirate live ;-)

Re: SAX

TorgoX on 2002-01-22T03:42:55

...and reading here that SAX is just as bad as POD if not worse

You may have read that, but I didn't write it. I sense that you two are reacting very defensively to things you're irrationally imputing to my message.

I prescribe an (in)decent massage, and/or some medical marijuana.

The doktor has spoken!

Re: SAX

darobin on 2002-01-22T05:00:21

This is very disconcerting. A similar situation existed with Pod; and I found it very very hard to fix that situation

That's what I read. I don't feel that I'm reacting very defensively to your message. Perhaps it is that, despite the fact that I've defended POD against offered replacements on more than one occasion, I really hate it :-). That would suffice to explain what appears as defensiveness ("What? You compared my baby to POD!!!" ;-).

That is not to say that we won't accept your remedies. I'll leave the marijuane to kingubu as I know he'll be happy to have it and gladly accept the one way trip to London to see my masseuse^W^W^W^Wgirlfriend :-)