XSLT sort puzzle

runrig on 2008-01-19T00:15:23

Following up on previous XML sorting efforts, where I mostly wanted to sort by tag name, then by NAME attribute, but for some elements, there were other attributes that I wanted to sort by. I first did the tag_name/NAME_attribute sort with XML::Filter::XSLT then sorted by the other odd elements using XML::Filter::Sort.

Out of curiosity, I wanted to try to do it all with XSLT, so I came up with:







  
    
  



  
    

    
      
    

    
      
      
    

  



  
    
  



Which sorts "D" elements by the FOO attribute, and everything else by tag name then NAME attribute. The difference between this and my previous method is that now "D" tags will come before "A" tags (and any others), whereas before they were still sorted by tag name. I think I can get the former two-pass approach behavior in one XSLT document, but for now, it's more trouble than it's worth. It would be easy if you could define a sort callback function.... :-)


Way too much work

Aristotle on 2008-01-19T07:25:40

You’re defining two identity copies, one on the root element and one for everything else; why? Next, you write a template for //* which is equivalent to matching * (but I haven’t slept and it’s 6AM, so I might be making a mistake). And finally, you say making the D nodes sort together with all the others would be a lot of work, when actually it’s less work than you’re already doing. Overall:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0"
>

<xsl:output
    method="xml"
    encoding="iso-8859-1"
    indent="yes"
    omit-xml-declaration="no"
/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="*">
  <xsl:copy>

    <xsl:apply-templates select="@*"/>

    <xsl:apply-templates select="*">
      <xsl:sort select="name()"/>
      <xsl:sort select="
          self::*[not(self::D)]/@NAME
          | self::D/@FOO
      "/>
    </xsl:apply-templates>

  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

I think. Untested. You may have to write things a bit differently (to groggy to recall all the details precisely), but that’s the approach.

Re:Way too much work

runrig on 2008-01-22T22:39:35

Thanks. Tested and it works. I bow to your superior XSLT-fu (mine needs a lot of work). My only consolation is that I've seen worse code than mine from someone else who was trying to do the same sort of thing (and who wasn't even yet trying to sort by name()). I still find XSLT non-intuitive, e.g., how does it decide which templates to match, does it matter what in what order the templates come, and so why doesn't the first (identity) template above just match and copy everything unchanged?

Re:Way too much work

Aristotle on 2008-01-23T01:32:31

Don’t worry, you’re in good company. Almost no one groks XSLT properly. (This is probably because every single XSLT tutorial is terrible junk.) Most of the transforms I see are seriously suboptimal, and a very large fraction of them are really atrocious. There’s a lot of cargo cult and programming by coincide in the XSLT world. Your attempt was quite decent by those standards. :-)

Actually, the way XSLT processing happens is very simple, but because of the implicit default templates, it seems very magical.

First off: XSLT checks all templates for whether they match the current node. It then looks at the XPath expressions of the templates and calculates a priority from each of them based on a simple set of rules (but you can specify the priority of a particular template directly). Then it picks the template with the highest priority. If there are several matching templates with the same priority, this is an error.

Secondly: the order in which nodes are processed is dependent purely on the order in which they get selected for processing by xsl:apply-templates. There is nothing magical about it; if there is no xsl:apply-templates instruction that picks up a particular node, then it won’t even be looked at.

Now: these issues are clouded by the are built-in templates that are implicitly part of every transform. They look like this:

<xsl:template match="*|/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

<xsl:template match="processing-instruction()|comment()"/>

In other words, by default:

  1. Due to the first one of those templates, the root node and all elements will produce no output, but visiting one will guarantee that that processor will visit all its child nodes as well.

  2. When a text or attribute node is visited, it’s value is output due to the second template.

  3. The last template shown specifies that comments and processing instructions are simply thrown away.

When the processor loads a document, it makes the root node the current node and tries to match a template to it. Unless you’ve explicitly written an template to match it yourself, the first built-in template shown up there is what matches, so xsl:apply-templates is called, which selects the document element of the document and tries to match a template to it in turn. This gets the ball rolling. The processing then proceeds in order of the xsl:apply-templates instructions encountered and the nodes they select, matching templates to those nodes and executing the instructions given in them. Whenever it encounters an xsl:apply-templates, it recurses.

That’s it. That’s all.

It’s really very simple and completely non-magical. I remember how much it mystified me when I first started, because no one, and I mean no one, explains these few simple facts in simple terms.

Once you understand that, it’s all just XPath-fu, which is set based, so making effective use of it (like what I did to consolidate your two different sorts) is somewhat reminiscent of the skill it takes to write effective SQL queries (though not exactly the same).

I suggest using the XSLT spec as a reference; it’s more than sufficiently readable from a language user’s (as opposed to implementer’s) perspective for that purpose. (As is the XPath spec.)

Re:Way too much work

Aristotle on 2008-01-23T01:35:17

Ugh, even a multitude of previewings did not protect me from submitting the comment with a whole bunch of small grammatical and spelling mistakes. “It’s value?” Argh. My face is melting off. I’m not stupid – I swear!

Re:Way too much work

runrig on 2008-02-11T19:44:40

Huh, the order of templates is not supposed to matter, but if I switch the order of the templates above, then it doesn't work anymore. If you change the "*" template to "//*", then it works again. Which probably explains why I had "//*" in my first template. Code:

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;
use XML::LibXSLT;

my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();

my $source = $parser->parse_string(<<'EOT');
<?xml version="1.0" encoding="UTF-8"?>
<FOO>
<A NAME="a" FOO="c"/>
<B NAME="a" FOO="c"/>
<C NAME="a" FOO="c"/>
<D NAME="b" FOO="c"/>
<A NAME="b" FOO="b"/>
<B NAME="b" FOO="b"/>
<C NAME="b" FOO="b"/>
<D NAME="a" FOO="b"/>
<A NAME="c" FOO="a"/>
<B NAME="c" FOO="a"/>
<C NAME="c" FOO="a"/>
<D NAME="c" FOO="a"/>
</FOO>
EOT

my $style_doc = $parser->parse_string(<<'EOT');
<?xml version="1.0" encoding="ISO-8859-1"?>

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0"
>

<xsl:output
    method="xml"
    encoding="iso-8859-1"
    indent="yes"
    omit-xml-declaration="no"
/>

<xsl:template match="//*">
  <xsl:copy>

    <xsl:apply-templates select="@*"/>

    <xsl:apply-templates select="*">
      <xsl:sort select="name()"/>
      <xsl:sort select="
          self::*[not(self::D)]/@NAME
          | self::D/@FOO
      "/>
      <xsl:sort select="self::E/@FOO"/>
    </xsl:apply-templates>

  </xsl:copy>
</xsl:template>

<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

EOT

my $stylesheet = $xslt->parse_stylesheet($style_doc);

my $results = $stylesheet->transform($source);

print $stylesheet->output_string($results);

Re:Way too much work

runrig on 2008-02-11T20:03:36

Aha...from the docs:

It is an error if this leaves more than one match. An XSLT processor may signal the error; if it does not signal the error, it must recover by choosing, from amongst the matches that are left, the one that occurs last in the stylesheet.

Re:Way too much work

runrig on 2008-02-11T20:04:37

So it would probably be better to put an explicit priority on the "*" template.

Re:Way too much work

Aristotle on 2008-02-11T21:47:19

Or use something more restrictive than node() in the identity transform to avoid matching elements, eliminating the conflict in the first place. The types a node can have are element, text, comment and processing instruction. Matching any node except elements therefore translates to XPath as “text()|comment()|processing-instruction()”. Another, possibly better way to write that (certainly a shorter one) is “node()[not(self::*)]”. (The predicate “[]” constrains the node() to match only when * would not() match the current node of that step, self::.)

Messing with priorities directly is often bad maintainability juju. If you have a lot of possible matches, conflicts resolution will be complex and will occasionally produce surprising interactions that can be hair-tearingly hard to fix. Reducing the number of candidate templates is almost always preferable over coercing the matcher into resolving to the right one.