Of Pipelines, SAX, and Ilkley Moor

pdcawley on 2002-02-26T10:56:41

So, we've been thinking about pipelines and application plumbing. Essentially this arose out of looking at Openframe, James's almost completely lovely general application framework. However, on close inspection, it turned out to be not general enough for my needs at least. But the basic concept of a transaction flowing through a pipeline was.

So, James set about writing Pipeline which, we hope, will provide a common API for almost everything that does pipeline-type processing, and which will allow us to separate out such minor details as how the various segments in the pipeline are glued together and how we decide which segment to send the transaction to next, allowing application writers to program at the appropriate scale.

For example, where I work, pipeline segments will be queues, serviced by one or more daemons, and each queue can be thought of as being included in multiple different pipelines. But that won't be visible to someone writing a segment. And, because we (will) have a common Pipeline API, I can easily set up a debugging environment using a standalone pipeline with MockObjects for requests and all that other good stuff. And then, if a problem arises in the fully concurrent version of the system, the programmer knows that the bug is definitely concurrency-related rather than a 'simple' application logic problem.
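As a rough illustration of the standalone debugging setup (in Python rather than Perl; `Transaction`, `Pipeline`, `dispatch` and the two segments are all made-up names, not the Pipeline module's actual API), a segment is just a callable that takes a transaction and hands one back, and a mock request can be pushed through in-process:

```python
class Transaction:
    """Hypothetical transaction: a bag of data plus a trail of visited segments."""

    def __init__(self, **data):
        self.data = dict(data)
        self.trail = []


class Pipeline:
    """Hypothetical in-process pipeline: runs each segment in order."""

    def __init__(self, *segments):
        self.segments = list(segments)

    def dispatch(self, txn):
        for segment in self.segments:
            txn = segment(txn)
            txn.trail.append(segment.__name__)
        return txn


def load_session(txn):
    # Segment: attach (fake) persistent session data if none is present.
    txn.data.setdefault("session", {"visits": 0})
    return txn


def count_visit(txn):
    # Segment: pure application logic, unaware of how it was invoked.
    txn.data["session"]["visits"] += 1
    return txn


# Standalone debugging run with a mock request instead of a real one.
mock = Transaction(request={"path": "/moor"})
result = Pipeline(load_session, count_visit).dispatch(mock)
print(result.trail, result.data["session"])
```

If the segments behave here but misbehave when each one sits behind its own queue, the application logic is at least off the suspect list.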

I'm also hoping that the Pipeline API is going to separate the routing of a transaction through the plumbing from the transport of said transaction. If we make these things pluggable, then you could, if you so desired, specify one pipeline using XML::SAX::Machine type syntax and semantics and another with Openframe, but transactions would be transported through the system in exactly the same way.
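A minimal sketch of what that separation might look like, again in Python and again with invented names (`LinearRouter`, `DirectTransport`, `QueueTransport` and `run` are assumptions for illustration only): the router only answers "which segment next?", the transport only answers "how does the transaction get there?", and either can be swapped without touching the other.

```python
from collections import deque


class LinearRouter:
    """Routing: decides which segment name comes next (None when finished)."""

    def __init__(self, names):
        self.names = list(names)

    def next_segment(self, visited):
        if len(visited) < len(self.names):
            return self.names[len(visited)]
        return None


class DirectTransport:
    """Transport: hands the transaction over with a plain function call."""

    def deliver(self, segment, txn):
        return segment(txn)


class QueueTransport:
    """Transport: goes via an in-memory queue, standing in for a real message queue."""

    def __init__(self):
        self.queue = deque()

    def deliver(self, segment, txn):
        self.queue.append(txn)
        return segment(self.queue.popleft())


def run(router, transport, segments, txn):
    # The loop knows nothing about how routing or transport are implemented.
    visited = []
    while (name := router.next_segment(visited)) is not None:
        txn = transport.deliver(segments[name], txn)
        visited.append(name)
    return txn


segments = {"upper": str.upper, "exclaim": lambda s: s + "!"}
router = LinearRouter(["upper", "exclaim"])
direct = run(router, DirectTransport(), segments, "thee")
queued = run(router, QueueTransport(), segments, "thee")
print(direct, queued)
```

Both runs use the same routing and the same segments; only the transport differs.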

I've been looking at SAX too. And bloody hell but it's horrible. I'm sure it's very powerful and capable, but the 'hoopage' as acme calls it is truly gruesome. Yes, you can use SAX to do almost everything, but I submit that for an awful lot of tasks it's an overcomplicated monstrosity. However, I hope that once the pipeline work is in place, someone will take the time to implement a SAX layer on top of it. (It's probably possible to build something that implements the Pipeline API on top of current implementations of SAX filters, but that seems to be somewhat arse about face to these eyes).

Oh yes. What about Ilkley Moor?

Well, at the summit/dim sum there was more discussion of various metaphors and analogies for pipelining. I came up with a vaguely disgusting one involving the digestive system of ducks, which led to thoughts about chaining ducks and, well... it was horrible.

However, it also occurred to me that the folk song, Ilkley Moor Bar T'at is, in fact, a description of the flow of a transaction (called 'thee' or 'thou' in the song) through a pipeline. I now offer a commentary for you. Note that I'm stripping off the redundancies and choruses, which are obviously part of some messaging protocol between pipeline segments...

Where 'as thou been since I saw thee?
A transaction arrives at the beginning of the pipeline, and we go off to look for any persistent session information.
Tha's been a courting Mary Jane
Historic data has been fetched from the session. This describes the current state of the session.
Tha's going to catch thy death of cold
Because the transaction has been 'courting Mary Jane' without wearing a hat, the transaction is dispatched (literally) to a death of cold.
Then we shall have to bury thee
The transaction is encapsulated in a container (coffin) and sent on its way.
Then t' worms'll come and eat thee up
The transaction is split up into multiple parts, each part encapsulated in a 'worm' object. These can be thought of as multiple sub transactions, or as a SAX manifold.
Then t' ducks'll come and eat up t' worms
The various subrequests (worms) are collated and encapsulated by 'duck' objects.
Then we shall come and eat up ducks
The final part of the pipeline (or consumer) now accepts all the ducks and encapsulates them.
Then we shall all have eaten thee
A simple summary of the final state of the consumer(s), which now contain (albeit indirectly) the transaction.
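
For the literal-minded, the later verses amount to a fan-out followed by a fan-in. A hedged sketch in Python (the function names and the dictionary shapes are invented for the joke, and the slice sizes are arbitrary assumptions):

```python
def bury(txn):
    # Encapsulate the transaction in a container.
    return {"coffin": txn}


def worms(container, parts=4):
    # Fan-out: split the contents into at most `parts` sub-transactions.
    body = container["coffin"]
    size = max(1, -(-len(body) // parts))  # ceiling division, never zero
    return [{"worm": body[i:i + size]} for i in range(0, len(body), size)]


def ducks(worm_list, per_duck=2):
    # Fan-in, stage one: each duck collates a few worms.
    return [{"duck": worm_list[i:i + per_duck]}
            for i in range(0, len(worm_list), per_duck)]


def eat(duck_list):
    # Fan-in, stage two: the consumer ends up holding the whole transaction.
    return "".join(w["worm"] for d in duck_list for w in d["duck"])


thee = "thee and thou"
print(eat(ducks(worms(bury(thee)))))
```

The consumer does indeed end up containing the transaction, albeit indirectly.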


There is much hoopage

james on 2002-02-26T11:08:48

And I'm not sure that the Ilkley Moor metaphor helps significantly with that.

:-)

--james.

SAX

darobin on 2002-02-26T12:04:46

SAX, complicated? Are we talking about the same SAX? The SAX that requires one to know how to use hashes and implement about five method calls in 95% of cases? The same SAX that people use to post ten-line solutions to many questions on perl-xml and several pm lists?

darobin stares in disbelief

SAX is complicated because XML is

barries on 2002-02-26T13:37:18

SAX is complicated purely because it deals with XML at a low level, and XML is actually quite complicated. As with XML, the trick lies in using just the portions you need.

SAX, because it is so low-level, should be built upon so it does not have to be used directly. Assembling SAX filters to form pipelines and more complicated beasties is quite easy, and there are an ever growing number of filters to make it so you *don't* have to mess with SAX internals directly.
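
To give a feel for what assembling filters looks like, here is a small sketch using Python's standard `xml.sax` module rather than Perl's XML::SAX (so this is an analogy, not the XML::SAX::Machines API). The two filter classes, `DropElement` and `UppercaseText`, are made up for the example; `XMLFilterBase` and `XMLGenerator` are real stdlib classes.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElement(XMLFilterBase):
    """Filter: swallows one named element and everything inside it."""

    def __init__(self, parent, name):
        super().__init__(parent)
        self._name = name
        self._depth = 0  # greater than zero while inside a dropped element

    def startElement(self, name, attrs):
        if self._depth or name == self._name:
            self._depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)


class UppercaseText(XMLFilterBase):
    """Filter: upper-cases character data, passes everything else through."""

    def characters(self, content):
        super().characters(content.upper())


out = io.StringIO()
# Events flow: parser -> DropElement -> UppercaseText -> XMLGenerator.
chain = UppercaseText(DropElement(xml.sax.make_parser(), "secret"))
chain.setContentHandler(XMLGenerator(out, encoding="utf-8"))
chain.parse(io.BytesIO(b"<a><b>hi</b><secret>x</secret></a>"))
print(out.getvalue())
```

Each filter overrides only the events it cares about and lets the base class forward the rest.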

Re:SAX is complicated because XML is

darobin on 2002-02-26T13:48:03

I'm not sure that that suffices to make it qualify as complex. In the vast majority of cases (ie >95%) I only need two things from SAX: elements and character data.

The reason I find SAX simple is because you can simply ignore what you don't care about. It only gets complex when you want to deal with complex XML.

In that respect, I'd call it more complete than complex. Easy things easy and all that ;)
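
A sketch of the "elements and character data" case, using Python's standard `xml.sax` in place of Perl's XML::SAX (the `TitleCollector` class and the sample document are invented for illustration):

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


handler = TitleCollector()
xml.sax.parseString(
    b"<songs><song><title>Ilkley Moor</title><key>C</key></song></songs>",
    handler,
)
print(handler.titles)
```

Everything the handler does not override (processing instructions, namespaces, the `<key>` element) is simply ignored.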

Re:SAX is complicated because XML is

pdcawley on 2002-02-26T14:09:16

How much would I have to change if I wanted to make things so that each SAX filter was actually run from within a standalone daemon servicing a reliable message queue? It's also a requirement that each daemon should be able to serve as a filter in different filter pipelines.

I'm not saying that this should be easy to implement, but the Pipeline API that I've been envisaging should at least make things so that I only have to change the transport model and all the other stuff will remain untouched.

I don't want to think in terms of receiving one event and emitting multiple events for later on down the line. I want to think in terms of a transaction object into which I stuff data and which I can stash in a data store between stages so if the system crashes I can recover the state and carry on when it comes back up.

It may be that I've fundamentally misunderstood what SAX is all about, but I don't think so.

I believe that SAX is a domain specific implementation of the Pipeline pattern. It's probably really good in that domain and can do lots of useful things. But I need pipelines and I'm not working in that domain.
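
To make the "stash it between stages and recover after a crash" requirement concrete, here is a hedged sketch in Python. Everything in it is an assumption made for illustration: the `run_with_checkpoints` helper, the JSON file standing in for the data store, and the simulated crash. A real store would also need atomic writes and locking, which this ignores.

```python
import json
import os
import tempfile

calls = []  # records which stages actually ran, for the demo below


def run_with_checkpoints(stages, txn, path):
    """Run stages in order, saving state after each so a restart can resume."""
    if os.path.exists(path):
        with open(path) as fh:
            saved = json.load(fh)
        txn, done = saved["txn"], saved["done"]
    else:
        done = 0
    for index in range(done, len(stages)):
        txn = stages[index](txn)
        with open(path, "w") as fh:
            json.dump({"txn": txn, "done": index + 1}, fh)
    return txn


def add_session(txn):
    calls.append("add_session")
    return {**txn, "session": "mary-jane"}


def flaky(txn):
    calls.append("flaky")
    if not flaky.healed:
        raise RuntimeError("simulated crash")
    return {**txn, "hat": False}


flaky.healed = False

path = os.path.join(tempfile.mkdtemp(), "txn.json")
try:
    run_with_checkpoints([add_session, flaky], {"id": 1}, path)
except RuntimeError:
    flaky.healed = True  # the system "comes back up"

# The second run resumes from the stored state; add_session is not repeated.
final = run_with_checkpoints([add_session, flaky], {"id": 1}, path)
print(final, calls)
```

The stages only ever see a transaction object; where it is stored between stages is the pipeline's business.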

Re:SAX is complicated because XML is

Matts on 2002-02-26T14:25:49

How much would I have to change if I wanted to make things so that each SAX filter was actually run from within a standalone daemon servicing a reliable message queue? It's also a requirement that each daemon should be able to serve as a filter in different filter pipelines.

This is what proxies are for.

To me, it seems like Pipeline is a combination of allowing an alternate method calling convention between pipeline stages (handlers, in SAX terms), and a data store mechanism (the super class of the handler, in SAX terms).

SAX doesn't have a separate concept of a data store, nor a separate concept of a dispatch mechanism. The data store is an instance of the handler itself, and the dispatch mechanism is method calls on the next handler in the chain. But the beauty of that is it's simple for simple needs, and because this is perl, you can circumvent all that plumbing to do whatever you need to do. You could create your SAX handler with a "store" entry, and only store data there using its methods. And for the method calls you can use a proxy.
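
A sketch of that suggestion, using Python's standard `xml.sax` instead of Perl (the `CountingHandler` and `LoggingProxy` classes are hypothetical, and the proxy here merely logs each call where a real one might forward it over a queue or a socket):

```python
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Keeps all of its state in a single 'store' entry."""

    def __init__(self):
        super().__init__()
        self.store = {"elements": 0}

    def startElement(self, name, attrs):
        self.store["elements"] += 1


class LoggingProxy:
    """Stands in front of a handler and forwards every method call to it."""

    def __init__(self, target, log):
        self._target = target
        self._log = log

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def forward(*args, **kwargs):
            # A real proxy could serialise the call here instead of logging it.
            self._log.append(name)
            return attr(*args, **kwargs)

        return forward


log = []
handler = CountingHandler()
xml.sax.parseString(b"<a><b/><b/></a>", LoggingProxy(handler, log))
print(handler.store, log.count("startElement"))
```

The parser neither knows nor cares that it is talking to a proxy, which is the point being made above.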

I'm not saying you should use SAX, I'm just suggesting that SAX isn't as complex as you think it is (which is probably entirely our fault), and that it's probably a decent platform on which to build, or at least to investigate further.

Re:SAX is complicated because XML is

james on 2002-02-26T14:34:57

At the end of the day, Pipeline isn't about XML, SAX, or anything else. It's about genericity (I hate that word, if it even is a word). I don't really care that much about XML but, and this is the killer, I understand that other people do, and even more to the point:

I don't care what the data flowing through the pipeline looks like

It could be YAML if you want.

Pipeline was written as a direct result of loads of people coming up with loads of different implementations of the same thing. If we have many implementations and many interfaces, that's a bad thing. If we can have many implementations but one interface, that's a better thing. There are certain problems that an interface cannot solve, but certain problems that it can, and if we can capitalize on solving those then we are ahead of the game.

If people want to use SAX, then cool. If people want to wrap/warp SAX to have a Pipeline interface, then even cooler -- then it can be plugged into other things. If people don't, then that's fine too.

Whatever...

Re:SAX is complicated because XML is

darobin on 2002-02-26T14:49:11

You wouldn't need to change anything. I probably don't have all the information I'd need to produce a complete answer (I'm also badly hungover) but from what you describe it would seem to me that all you'd need to create is either a new type of SAX::Machine, or something along the lines of AxKit{B2B,NG}.

However, I think that that would only be useful if your data were XML-like, ie inherently extensible because you need it to be. Otherwise I'd probably fail to see the point of using XML.

Re:SAX is complicated because XML is

ziggy on 2002-02-26T15:30:02

I don't want to think in terms of receiving one event and emitting multiple events for later on down the line. I want to think in terms of a transaction object ...

This is the impedance mismatch in a nutshell.

SAX is simply a pipelineable event processor. The events are quite low level (start/end tags, blobs of text). If you want to have a pipeline that passes generic objects back and forth, then you probably don't want SAX per se -- the overhead of serialization between two SAX handlers is likely overkill (why convert an object to XML events just to hop to the next pipeline stage/SAX handler?)

Of course, if you could model your object as a single start tag event, then things would get much easier. But that gets back to molding the problem to fit the solution instead of the other way around.

You're correct in asserting that SAX isn't designed or intended to be a generic event pipeline, but rather a specific instance of a low level pipeline for a specific domain. I'm not sure why you're trying to coax your solution into a SAX pipeline if the data is fundamentally incompatible with the SAX event model, or why that's a problem with SAX.

Re:SAX is complicated because XML is

pdcawley on 2002-02-26T16:42:53

I'm not sure why you're trying to coax your solution into a SAX pipeline if the data is fundamentally incompatible with the SAX event model, or why that's a problem with SAX.

Mostly 'cos I made the mistake of believing the hype. And overreacted when I realised that the hype wasn't true.

Elucidation

ziggy on 2002-02-26T14:09:26

I think that shoving a few diagrams through graphviz or possibly even pic might help describe the path data is taking through a pipeline. Especially for those who don't want to think about dead people becoming worms, ducks or other people. :-)
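
For instance, a few lines of Python can emit a Graphviz DOT description of a linear pipeline (the `pipeline_to_dot` helper and the segment names are made up for the example; the output can be fed to the `dot` command):

```python
def pipeline_to_dot(name, segments):
    """Return DOT text for a left-to-right chain of pipeline segments."""
    lines = [f'digraph "{name}" {{', "  rankdir=LR;"]
    for src, dst in zip(segments, segments[1:]):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = pipeline_to_dot("ilkley_moor", ["thee", "coffin", "worms", "ducks", "us"])
print(dot_text)
```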

Re:Elucidation

pdcawley on 2002-02-26T14:58:28

But that would be too easy. And it wouldn't be funny. It's not like anyone doesn't really have a handle on what a pipeline looks like...

Re:Elucidation

barries on 2002-02-26T15:12:59

Every SAX machine allows you to do this :)

See XML::Handler::Machine2GraphViz for details.