XPath Optimisations

darobin on 2003-01-29T14:58:38

For all the friendly freaks out there writing XPath implementations on streams, I think you might be quite interested in reading about Xaos (pronouced Chaos, it's the ancient Greek work and is normally spelt χαος -- I thought that should be χαως but I could be completely off), standing for XML Analysis, Optimization, and Stuff. Its most important feature is an algorithm to convert an XPath containing upward axes to one that only contains descending ones, thus making it a lot easier to work on a stream using a rather simple automaton.

PS: the new use.perl feature that does Unicode => &#xxx; conversion is cool :)


Cool, thanks!

barries on 2003-01-29T15:52:35

Not sure of any other XPath-on-streams freaks, let me know of any.

Will take a look at that toolkit when next I dive in to EventPath implementation (see XML::Filter::Dispatcher for that, those of you interested in streaming XPath implementations).

Thanks for the pointer.

- Barrie

Re:Cool, thanks!

darobin on 2003-01-29T16:05:46

Somehow I thought you would read this :) As for XPath-on-streams freaks, well, if it doesn't have to include implementer then you can count me in. You can also count the guy working next door from me that wants to do that kind of thing for our Publisher software. Then you have the STX guys as well. And prolly a number of other people, lo! we're a crowd, lets take over the world!

It sure would be nice having an optimized matcher in XFD, especially one that can handle all of XPath on a stream.

Re: STX

barries on 2003-01-29T16:21:48

STX isn't XPath; it's a matching language and designed for streaming environments. I've seen mentions of constructs that allow you to collect read-only source trees, but the matching is limited to what's handy without any buffering at all.

EventPath is a superset of XPath which buffers events as necessary for in-order delivery in the event of possibly out-of-order matching expressions like /a[b]/c. And it gets most things correct :). This Xaos stuff might be a better way of implementing it, however; X::F::D uses continuations (well, really, more like queing anonymous subs for currying) to match against pending events given that the current event matched.

So the leading "/" in "/a/b" will match against start_document and queue an anonymous sub to match against an start_element for <a>, which, when it matches, will in turn queue a sub to match a start_element for <b> in <a>'s child events, etc. All of this is acheived by compiling "/a/b" in to a large perl anonymous subroutine (I avoid closures due to all the leaks in older perls; may revisit that now) that gets run in the start_document() handler.

This requires some overhead to run all those check subs and I'm looking forward to the day that I can optimize some of that to state machines and, possibly even less overhead, I want to optimize "a" => $action to be a simple "if $elt->{LocalName} eq 'a'" test in a compiled start_element sub. ETIME, but that would be as fast as handwritten SAX filter code.

- Barrie

Re: STX

darobin on 2003-01-29T17:43:24

Oh, I remember STX as a subset of XPath, haven't looked that way in ages. Xaos immediately reminded me of XSD as what it does to match seems quite similar, notably related to what they call Total Matching (I think you do something quite similar, but I haven't thought that through yet). You could probably go indeed faster with a state machine, and I've been wondering if it's possible to see a stream as a b-tree as explained in Murata's (can't find the PPT, it contained graphics that made it a lot clearer). What I certainly thought of as a bonus for XFD was the ability to add add XPath axes to EventPath which wasn't the case last time I looked.

Re:Cool, thanks!

mir on 2003-01-29T16:07:06

Hand raised, waving frantically

It has been in my todo list for ever... maybe between ripping the XPAth grammar from XML::Filter::Dispatcher and this I can write something really cool for XML::Twig.

I'd bet on you getting there first though ;--(

Re:Cool, thanks!

darobin on 2003-01-29T16:18:22

Couldn't you go at it the lazy way and build that Twig functionality layered over XFD?

Re:Cool, thanks!

barries on 2003-01-29T16:41:53

Here's how X::F::D does its pale immitation of TWIG-like things:
   '/*/b' => [ 'hash()'   => sub { use BFD; d xvalue } ],
   '/*/c' => [ 'struct()' => sub { use BFD; d xvalue } ],

Those two will print out hashes (returned by xvalue()) for each <b> and <c> in the root element. Mind you, this isn't meant to look pretty or replace XML::TWIG, it's meant to provide limited TWIG-like functionality in the X::F::D environment so you can conveniently slurp a twig in to a HASH or HASH of HASHes (respectively) when doing other things with X::F::D.

The first, hash() is a HASH reference of scalar strings (for attrs) and ARRAYs (for descendant elements) keyed with xpath-like expressions ('@attr' => $string, 'c/d' => [ @strings ]).

The second, struct(), provides a nested structure of HASHes keyed somewhat like the first, but without flattening the structure in to a single hash:

<root
    xmlns="default-ns"
    xmlns:foo="foo-ns"
    a="A"
    foo:a="FOOA"
>
    <a aa1="AA1" foo:aa1="AA2">
        <b>B1</b>
        <foo:b>B2</foo:b>
    </a>
</root>

becomes (deleted a few strings to duck the overly oppressive use.perl lameness filter that seems to want to prevent people from talking about perl on use.perl. Oddly, adding more plain text triggers the wonky filter agian. Sigh.):

$VAR1 = {
    'a' => [
        {
            '@aa2' => 'AA2',
            'b' => [
                {
                    '' => 'B1'
                },
                {
                    '' => 'B2'
                }
             ],
             '@aa1' => 'AA1'
         }
     ],
     '@a' => 'A'
};

And, Michel, please feel free to copy the grammar. I got it from an old James Clark posting and converted it from yacc to Parse::Yapp and added some EventPath things like start:: and end:: axes, etc.

- Barrie

Re:Cool, thanks!

barries on 2003-01-29T16:47:20

(mind you, I pasted the wrong XML in there. Ignore the namespaces, that's from a different test).

Sorry,

Barrie

it's all greek to me...

geoff on 2003-01-30T13:25:29

according to the Liddle-Scott lexicon it's definitely το χαος - neuter with an α stem. according to Smythe that makes it a noun of the first declension. however, if it were a first declension neuter noun, it would be το χαον not το χαος in first person singular. I think that this means that it's one of those irregular greek nouns that you just have to know. in this particular case, the lexicon may have a hint - homer used εος, the masculine possessive, to refer to χαος, not εον as though it were neuter. so, what you have is a neuter noun with a masculine declension.

in any case, it definitely appears not to be χαως :)

Re:it's all greek to me...

darobin on 2003-01-30T14:20:39

Ah cool, many thanks for looking it up. My greek dates back to when I was fourteen or fifteen, it's hazy at best. Going back home I looked around to check, with a vague intuition of where I would find it. It's in a letter from a girlfriend dating back to when I had greek courses, and whe quotes Hesiod's "en arkhe khaos en". Rereading it, it's hard to tell in her handwriting if it's an ο or an ω, which leads me to believe that maybe she didn't remember as she wrote and scrambled it on purpose (I know I do that sometimes). I'll have to track her down to find out!

Now, how the hell a mention of χαως came back to me yesterday from a letter ten years ago (I'm pretty certain I didn't see that word elsewhere, having done only one year of greek) I have no idea... I'd love to see her face though if I managed to find her and that was the first question I asked ;-)