One of the things I look for in an application that stores its data in XML is whether the application fails if the order of the tags in the XML is changed. Some applications that use an XML format don't even seem to look at what is inside the tags; they just throw out everything between < and > and depend on the order in the file to determine the meaning of the data. I sure don't want this to happen to any of my applications!
I've been using XML as a file format to represent
CAD data. I translate from a proprietary format
from a CAD vendor into XML, do something to the
data, then write it back out in the proprietary
format. I want to make sure that if I happen to
reorder the XML tags in this process, I don't
create invalid data when I write it back out,
because the proprietary format is order-dependent.
I can imagine a few ways to handle this, and as
usual there is a speed/memory/program-complexity
tradeoff.
I am using the most excellent
XML::Twig module, working from the
online tutorial and the O'Reilly book
Perl & XML. One thing the docs are a little
thin on is examples of using the XPath capabilities
of XML::Twig. Here is an example that I made:
use strict;
use warnings;
use XML::Twig;

my $fn = 'traces_small.xml';
my ($tag, $att, $value) = ('NET', 'name', '/P5C');

# Patterns to try, from a plain path to an attribute-value test.
my @example;
push @example, 'TRACES/STFIRST';
push @example, $tag;                                         # NET
push @example, sprintf('%s[@%s]', $tag, $att);               # NET[@name]
push @example, sprintf('%s[@%s="%s"]', $tag, $att, $value);  # NET[@name="/P5C"]
# First loop: build only the elements matching each pattern, then print them.
foreach my $pattern (@example)
{
    print "Matching XPath expression $pattern\n";
    print "twig_roots\n";
    my $xml = XML::Twig->new(
        twig_roots    => { $pattern => 1 },
        error_context => 1,
    );
    $xml->set_pretty_print('indented');
    $xml->parsefile($fn);
    $xml->print;
    print "--------------------\n";
}
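# Second loop: print the raw start tag of each matching element as it is parsed.
foreach my $pattern (@example)
{
    print "Matching XPath expression $pattern\n";
    print "start_tag_handlers, original_string\n";
    my $xml = XML::Twig->new(
        start_tag_handlers => {
            $pattern => sub {
                # $_[0] is the twig; original_string gives the tag text as it appears in the file
                print $_[0]->original_string, "\n";
            },
        },
        error_context => 1,
    );
    $xml->parsefile($fn);
    print "--------------------\n";
}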
Now I want to make a module to create my CAD
data in a certain order, independent of the
order of my input data.
Approach 1
Put tags on the data that say what order they
should be in.
I don't like this approach because I want my
XML format to work for different CAD tools that
have different requirements for the order of
their data. One of the main purposes of the XML
format is to have it be CAD tool independent.
So order properties are right out.
Approach 2
One pass per section
In this approach, I would parse the XML file
as many times as needed, each time printing only
the next section of the CAD data. Example:
if it were HTML, I might have one pass for the
header and one pass for the body. This is easy
to code and takes a minimal amount of memory,
but it is CPU intensive, since the whole file
is parsed again for every section.
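Here is a minimal sketch of approach 2, not a
finished implementation. The element names HEADER
and CAD, the inline sample data, and the
@section_order list are all made up for
illustration; only NET and its name attribute come
from my example above. Each pass uses twig_roots so
that only the elements of one section are built.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input with the sections in the "wrong" order.
my $xml = <<'XML';
<CAD>
  <NET name="/P5C"/>
  <HEADER units="mil"/>
  <NET name="/GND"/>
</CAD>
XML

# Order the (hypothetical) target format requires.
my @section_order = qw(HEADER NET);

my $output = '';
for my $section (@section_order) {
    # One complete parse per section; only this section's elements are built.
    my $twig = XML::Twig->new(
        twig_roots => {
            $section => sub {
                my ($t, $elt) = @_;
                $output .= $elt->sprint . "\n";
                $t->purge;    # release what has already been handled
            },
        },
    );
    $twig->parse($xml);
}

print $output;
```

Memory stays small because each handler purges
after itself, but the input is parsed once for
every entry in @section_order.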
Approach 3
One pass, store data in an array
Here I would store the data into elements of
an array. At the end of parsing, the array would
be printed out to the file, and the order of
the elements in the array would take care of
the ordering of the output file. In the
HTML example, $array[0] would hold the header
and $array[1] would hold the body. This approach
is memory intensive, since I have to store all
the CAD data in an array.
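A rough sketch of approach 3, under the same
assumptions as before: HEADER, CAD, the sample data
and the %slot table are hypothetical, and a real
module would take the slot order from whatever the
target CAD tool requires. The file is parsed once,
and each handler appends its element to the array
slot for its section.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input with the sections in the "wrong" order.
my $xml = <<'XML';
<CAD>
  <NET name="/P5C"/>
  <HEADER units="mil"/>
  <NET name="/GND"/>
</CAD>
XML

# Hypothetical mapping from element name to output position.
my %slot = (HEADER => 0, NET => 1);

my @buffer;      # $buffer[0] collects HEADER text, $buffer[1] collects NET text
my %handlers;
for my $tag (keys %slot) {
    my $i = $slot{$tag};
    $handlers{$tag} = sub {
        my ($t, $elt) = @_;
        $buffer[$i] .= $elt->sprint . "\n";
        $t->purge;    # the twig is freed, but the text stays in @buffer
    };
}

# Single pass over the input.
XML::Twig->new(twig_roots => \%handlers)->parse($xml);

# The array order alone decides the output order.
my $output = join '', grep { defined } @buffer;
print $output;
```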
Approach 4
One pass, store the data in an array of files
This approach is like the array, except each
element of the array is an open file handle, and
the data gets printed to the different files.
At the end of the processing the files are
appended to each other. This approach is file-handle
intensive, and is probably not a good idea when
there are, say, 100 or so file handles open at
a time. It tends to run into the per-process
limit on the number of open files (what
ulimit -n reports on Unix-like systems).
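A sketch of the file-handle mechanics of approach 4,
using only core modules. To keep it short, the
@records list stands in for what the parser
handlers would emit; the records, the HEADER
section and @section_order are hypothetical. Each
section gets its own temporary file from
File::Temp, and the files are read back in the
required order at the end.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical (section, text) pairs in input order.
my @records = (
    [ NET    => qq{<NET name="/P5C"/>\n} ],
    [ HEADER => qq{<HEADER units="mil"/>\n} ],
    [ NET    => qq{<NET name="/GND"/>\n} ],
);

# Hypothetical order required by the target format.
my @section_order = qw(HEADER NET);
my %slot;
@slot{@section_order} = 0 .. $#section_order;

# One temporary file per section; in scalar context tempfile() returns
# only the handle.
my @handles = map { scalar tempfile() } @section_order;

# Single pass: route each record to the file for its section.
for my $rec (@records) {
    my ($section, $text) = @$rec;
    print { $handles[ $slot{$section} ] } $text;
}

# Append the files to each other in the required order.
my $output = '';
for my $handle (@handles) {
    seek $handle, 0, 0 or die "seek failed: $!";
    local $/;    # slurp mode: read the whole file at once
    $output .= <$handle>;
    close $handle;
}

print $output;
```

Memory use is bounded by the largest single
section rather than the whole data set, at the cost
of one open handle per section.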
Since the trend these days is to throw RAM at
problems, I'll try approach 3 first, and perhaps
have an option in my code to use approach 4
or possibly 2.