I have been continually frustrated by XML modules. They were all either hard to use, had ridiculous dependencies, took too much memory, or some horrible Lovecraftian combination of those. So, when Adam Kennedy recently muttered on the datetime mailing list about his *::Tiny modules, I decided to do XML right. The result is XML::Tiny, which (according to its documentation) implements a useful subset of XML. Secretly, I think it implements the useful subset.
The core parser is less than 20 lines of code and is sufficient to parse an RSS feed or the responses from Amazon's web services. It should be compatible with perl 5.004_05 and with XML::Parser with the XML::Parser::EasyTree style, has no dependencies outside the core, and consumes as near as damnit no memory.
It should be compatible with perl 5.004_05
Sorry, but it's not because you're using exists
on arrays, which wasn't supported until Perl 5.6.
$
exists operator argument is not a HASH element at lib/XML/Tiny.pm line 94.
$
exists operator argument is not a HASH element at lib/XML/Tiny.pm line 94.
Re:Perl 5.004 compatiblity
drhyde on 2007-01-29T22:32:37
Yeah, someone else spotted that too - fixed in 1.01, which will hit the CPAN shortly.Constructive criticism:
jdavidb on 2007-02-02T19:48:59
The synopsis shows:
my $document = parsefile($xmlfile);But it doesn't tell me what $document looks like. Am I going to call methods on it to get what's in it? Is it a bunch of hashes of arrays of hashes? How do I use it?
I'm sure that's answered later in the docs, but to me the synopsis ought to demonstrate enough of the API to give me the flavor of it so I can decide if I want to spend the time reading the docs to learn the module.
Re:Constructive criticism:
drhyde on 2007-02-03T13:13:48
Thanks, I'll fix that in the next version.
Re:People need to know...
sigzero on 2007-01-30T03:40:30
That this isn't an XML parser. It is majorly broken in various ways. It's more of a tag-soup parser. I still think you should change the name of it.XML::Tag::ParserRe:People need to know...
TorgoX on 2007-01-30T03:55:36
You're right. It should be called "XML::Retarded", and come with a special helmet that its users should have to wear for if/when they walk into a door and go bonk and cry.And what's this guy going on about the other modules having too many dependencies? XML::Parser::Lite has none, last I looked. And wasn't RETARDED.
Re:People need to know...
Alias on 2007-01-30T05:17:33
Unless I'm missing something, XML::Parser::Lite is not a standalone module, it only comes embedded inside SOAP::Lite, which has 60 bugs, 16 cpan testers failures, and has a dependency on XML::Parser.
Unless you are talking about something else.
Additionally,::Tiny modules always are a bit controvertial. There's always arguments about where to draw the line between feature-set and code size.
How EXACTLY is it retarded?
EXACTLY between line 1 and the last, inclusive
TorgoX on 2007-01-30T06:43:14
His unescaping logic is daffy, and his attribute syntax rejects half the XML that I write in the course of a day. But then, he could always say "I meant to do that!".He (or you!) should just grab XML::Parser::Lite and release it as a standalone module. The fact that XML::Parser::Lite is currently available only in that SOAP dist is a great big mistake-- and a mistake which bears fixing, which is more than I can say for this XML::Tiny mess.
Re:EXACTLY between line 1 and the last, inclusive
drhyde on 2007-01-30T08:35:51
"His unescaping logic is daffy" is not constructive criticism. Please try again."His attribute syntax rejects half the XML that I write in the course of a day" is not constructive criticism. Please try again.
That is, please try again if you expect me to pay any attention.
Re:EXACTLY between line 1 and the last, inclusive
Matts on 2007-01-30T09:05:44
The spec is right there. It's easy to read. It's even pretty easy to write a valid parser for it (I wrote XML::SAX::PurePerl in a couple of days, although it is admittedly lacking in a few areas, but it tries). There's even a bunch of fairly good test suites out there for XML. Try James Clark's stuff for a starter. Or fire the stuff in XML::SAX::PurePerl at it. People don't have time to list all the XML parsing bugs out for you when it purposely avoids being an XML parser. But if you're going to call it an XML parser people are going to tell you it's not.Re:EXACTLY between line 1 and the last, inclusive
drhyde on 2007-01-30T13:08:57
I was careful to list the ways in which it is not compliant with the spec. If despite that people expect it to comply with the spec then they really are far too stupid for me to care about.I would note that despite the doom-saying XML fanboys in this thread, I have had more feedback from people saying "thanks for this really useful module" than I have for any other module I have released. Funny, that. Y'know, I think that because I'm such a nice person, I'll continue in my evil ways and keep releasing useful software that does the 99% of the job that people obviously want to do, despite it infuriating the insignificant minority that insists on 100% perfection in everything.
Re:EXACTLY between line 1 and the last, inclusive
grantm on 2007-01-31T03:38:09
I was careful to list the ways in which it is not compliant with the spec.Your module's key point of differentiation seems to be that it does not, in fact, comply with the XML spec. Why then are you surprised that people don't think it belongs in CPAN's 'XML' namespace?
I'm not sure that referring to people as "XML Fanboys" is helpful either.
Re:EXACTLY between line 1 and the last, inclusive
drhyde on 2007-01-31T08:30:32
On the contrary, its key point of differentiation is that it implements a significant chunk of the XML spec - enough of the spec to be very useful - while imposing minimal burdens on the user.There's an awful lot of things in life and on the CPAN which, like this module, are sufficiently complete to be useful without being perfect. Paracetamol is one example. It is sold as a pain reliever, despite the fact that under some circumstances it will rot the liver, which I am sure is not a particularly pleasant experience. But, most of the time it's good enough, and users are expected to read the page of caveats and warnings supplied with it to see if their particular circumstances mean they should use something else. Sound familiar? Once the fanboys - oops, sorry, I mean XMLcops - get out into the real world, they'll be sorely disappointed to find useful expediency almost always trumping baroque perfection. They have my deepest sympathy as I'm sure it will come as a terrible shock to them. That said, I will be terribly impressed if they can, out in the real world, stick to their admirable principles, and only accept perfect medicine. Personally I think it's more likely that I'll fart a grand piano.
Re:EXACTLY between line 1 and the last, inclusive
hex on 2007-01-31T13:09:10
The fact that XML::Parser::Lite is currently available only in that SOAP dist is a great big mistake-- and a mistake which bears fixingYes. I've just packaged it up and uploaded it to CPAN, and I'm currently waiting to hear back from the authors about getting co-maint permissions.
Re:EXACTLY between line 1 and the last, inclusive
hex on 2007-01-31T14:42:26
Okay, I don't know what the permissions on PAUSE are for then, because here it is. Enjoy!Re:EXACTLY between line 1 and the last, inclusive
Alias on 2007-02-03T11:04:45
The permissions on PAUSE are for it to be included in the index, so when you say "install XML::Parser::Lite" it actually installs your package.
Currently it would still install SOAP::LiteRe:EXACTLY between line 1 and the last, inclusive
paulclinger on 2007-02-01T06:59:18
> The fact that XML::Parser::Lite is currently available only in that SOAP dist is a great big mistake-- and a mistake which bears fixingThis is primarily because I believe there was (and still is) a valid reason for it NOT to be released as a standalone module: it doesn't handle char/entity references (among other things). While I appreciate Earle Martin's effort to make the module more easily available to people, I think this is a too serious limitation to ignore. SOAP::Lite is able to deal with it, but any other app will need to come up with its own solution.
I'm sure there are ways to fix this; I even wrote another module that doesn't have that limitation (XML::ReParser), but never released to CPAN. This limitation is primarily caused by inconsistencies in how regexps are supported inside ?{} code; you can find more details and references to XML::ReParser here: http://beta.nntp.perl.org/group/perl.perl5.porters/2001/07/msg41148.html
I probably should just use whatever way works to process entities and release both Parser::Lite and ReParser. Paul.
Re:People need to know...
Ed Avis on 2008-01-07T12:55:03
Heh... I once wrote a module XML::Stupid that had a similar interface to XML::Simple but relied on having one text element per line and no attributes. So it could parse only XML like
<container>
<text1>Hello</text1>
<text2>There</text2>
</container>
I wrote it for speed, grinding line-by-line through some big and nasty credit ratings files from S&P. They also used all uppercase for the element names. Of course I was much too ashamed to consider distributing this module.
I don't think there is anything wrong with parsing a limited subset of XML provided you clearly advertise you're doing that.Re:People need to know...
drhyde on 2007-01-30T08:49:20
"It is majorly broken in various ways" is not constructive criticism. Please try again. If you can find ways in which it is *actually* broken - ie, where it doesn't conform to the documentation, or the tests are inadequate - then I would prefer that you open a ticket using rt.cpan.org, although I will also accept submissions by email.As for "It's
... a tag-soup parser" - yes, well done, you spotted that it is a non-validating XML module. I know this may come as a shock, but such things are actually useful. Re:People need to know...
darobin on 2007-01-30T10:29:22
You're entirely missing the point. This is not about being a subset of the functionality or anything, this is not an XML parser. The problem is not about validation -- no one sane cares about that -- the problem is about well-formedness. There is a lot of perfectly good XML out there that this thing won't parse, and there is a lot of completely wrong XML that it won't flag as such.
If you want to write and release a parser for "vague stuff that has angle brackets in it" and find it useful that's wonderful and by all means go ahead, just don't for a single second delude yourself that you've written an XML parser. Or if you must insist on deluding yourself, please be so kind as to not chance dragging others down in your misunderstanding by releasing it in the XML:: space. It has absolutely nothing to do there. Please move it.
Re:People need to know...
Alias on 2007-01-30T14:31:25
CSS::Tiny does not parse the entire CSS specification.
YAML::Tiny does not support the entire YAML specification.
Config::Tiny does not support some elements of the Windows.ini specification.
Taken literally, CSS::Tiny is not a CSS parser, YAML::Tiny is not a YAML parser and Config::Tiny is not a Windows.ini parser.
In exchange for sacrificing some completeness and correctness,::Tiny modules provide a single small .pm file you can drop onto any Perl you are likely to find in existance anywhere, copied in by hand or embedded inside another module if needed, and it will load and do simple and unchallenging tasks using almost no memory and often almost no CPU.
They also tend to be a love it or hate it solution to the problem.
For most of my Tiny modules, I tend to get a mix of compliments and complaints.
For the complaints that the module does not implement enough functionality, my challenge is to provide a patch which adds that feature in only a few additional lines of code.
If you can't provide the feature in a Tiny way, then the Tiny module is simply not suitable for that problem, and you should upgrade to the "real" module.
As I understand it, people using XML for "real" work, with namespaces and schemas and character sets requirements are not the target of XML::Tiny.
That said, if you can provide patches to XML::Tiny that add the features you consider to be missing, and do so in only a couple of lines of code, I would expect that he'd happily take those patches from you.
But to complain that XML::Tiny is missing features is to miss the entire point of the::Tiny suffix. Re:People need to know...
chromatic on 2007-01-30T18:48:58
But to complain that XML::Tiny is missing features is to miss the entire point of the::Tiny suffix. I think people instead are complaining that it's missing the XML:: part of the distribution name.
Re:People need to know...
drhyde on 2007-01-30T21:32:03
Nah, what they're complaining about is that they think users are too stupid to read the documentation before using the software, and so won't be aware of its limitations.Personally, I have a considerably higher opinion of the users than that.
Re:People need to know...
chromatic on 2007-01-30T22:27:55
Well if that's all it is,
XML::Tinier
ought to be a breeze to write, and I guarantee that it'll have even less code.Re:People need to know...
Alias on 2007-01-31T02:10:25
Yes, but will it fulfill ANY use cases for XML:) Re:People need to know...
chromatic on 2007-01-31T03:39:05
Good point. Maybe I'll call it DBI::Tiny instead.
Re:People need to know...
Alias on 2007-02-01T06:59:11
In that case, will it fulfill any use cases for DBI?
(And of course DBI is a reserved namespace)Re:People need to know...
darobin on 2007-01-31T10:40:39
I don't think that users are too stupid to read the documentation, but I do think that they will tend to believe it when they read it, and unfortunately it is misleading, as explained in http://use.perl.org/comments.pl?sid=34409&cid=52943, and will waste their time. That's not playing nice.
Re:People need to know...
jns on 2007-01-31T14:20:20
I've fed the "not well formed" parts of the w3c test suite throw this and I don't think that it would be too much of a hardship to make it pass *nearly* all of them (or rather reject them but you know what I mean.) The tricky bit of course is having it check for "well-formedness" in things that it is already ignoring such as comments and processing instructions.Re:People need to know...
drhyde on 2007-02-01T11:24:05
At first glance, a great number of those failures seem to be to do with broken entity declarations which, of course, I don't spot. I'll certainly take a look though and see if there's any quick wins to be had. Thanks!Re:People need to know...
darobin on 2007-01-31T10:38:11
As I said, if people find this sort of thing useful, let them have it, I have no issue with that. Just don't call it XML. It's not. NotXML::Tiny or TagParser::Tiny would be fine. Furthermore the documentation is extremely misleading in that it claims to support a subset of XML -- that is not the case. It would support a subset of XML if every document it understood could also be successfully parsed by an XML parser, but didn't support some things that an XML parser would report. As it is, it supports all sorts of things that an XML parser doesn't. The documentation's claim that it is a parser for an XML subset rather than a tag-soup parser is way off the mark.
I don't have a problem with a Duck::Tiny that would sort of go quac instead of quack -- that happens all the time and still produces useful tools. There does exist a useful proper subset of XML -- this is definitely not it. The documentation takes some sort of populist moral higher ground about it implementing the useful stuff and being decried by a bunch of holier than thou pedants. That's all rather subjective chaff, as would be making the opposite claim in a similar fashion, so I decided to take it through a reality test. I gathered about 2000 XML documents that come from real world projects and are definitely not complex, not high-brow, do not require schemata or anything like that, etc. Mostly basic Web stuff like XHTML, SVG, XSLT, some configuration files, some XUL -- nothing fancy. It turns out, XML::Tiny does something useful with 83 of them. As it turns out, entities, PIs, and CDATA sections are not rarely used features. And handling charsets is very useful because doing it yourself is a bitch. Of course, of those 83 documents almost all use namespaces which XML::Tiny doesn't report, so it's not helping all that much.
The fact is, in the real world, people use XML in many different ways. That is different for instance from CSS or
.ini files where there does tend to be a very uniform common subset. A lot of smart and pragmatic people have tried to define a smaller XML, and failed. Claiming success is just as misleading as those people who regularly pop up claiming to have found radically new compression methods that can compress themselves ad infinitum. So if it keeps its name, at the very least the documentation should be a whole lot more explicit about its severe limitations. And furthermore, since David doesn't seem to care in the least about adherence to the spec, I really don't understand his resistance to calling it TagSoup::Tiny. Why use the "Sacred Trigraph 'XML'" as he calls it when it is in every way a tag-soup parser, and when everyone knows and agrees that tag-soup parsers are very useful? Banking on the holy trigraph? Pigheadedness?
Re:People need to know...
drhyde on 2007-01-31T12:16:10
One excellent reason for calling it XML::Tiny rather than SomethingElse::Tiny which I've not mentioned yet is that that's what users will look for. Users who need to process simple XML documents will, obviously, look for XML stuff. If I were to call it SomethingElse::Tiny I wouldn't be able to help them. I suppose you could call that "banking on the holy trigraph" but given that I don't actually care whether anyone else uses it, it must be a very small bank.If you were to bother to tell me *what* it supports that a "real" parser doesn't, then I might be able to fix those issues to your satisfaction - either by rejecting those documents or updating the documentation. As I have said repeatedly, I see constructive criticism as being a Good Thing. Vague unsubstantiated assertions, however, are a Bad Thing. Constructive criticisms are things like "the tests fail on perl 5.6.2" or "it doesn't parse this 'ere document" or "it thinks this 'ere pile of crap is well-formed, and this [blah] is why it shouldn't". Believe it or not, I do actually care about adherence to the spec. If I didn't care, do you really think that I'd have bothered pointing out ways in which it doesn't comply? You can help, if you want to, by pointing out the flaws in plain language so that even someone as stupid as me can understand what they are.
Your choice of test data would seem to be flawed and probably reflects the sort of things that *you* use XML for, and I know that XML is something that plays a big part in your world. They're the sorts of things that *I* would use a big parser for too, if they were the sort of things I cared about, which they're not. XHTML is not its target. Most people use a web browser for parsing that. SVG is not its target. The majority of the few people who use SVG at all use an SVG viewer for that (I use the Adobe one, which is a slow unwieldy pig). Most people don't use XSLT at all but if they did they've explicitly use an XSLT thingy and not a generic XML thingy - and certainly not one that goes to great lengths to tell you that it won't handle everything. XUL - erm, isn't that the horrible Mozilla user interface thing? Given that I'm not even entirely sure what it is, I'm willing to assume that most users aren't either and so won't have to deal with it.
But I am glad that you recognise that people use XML in different ways. Lots of people use it because, well, it's what other people have made available. They want a bare-bones minimal solution which will get the XML out of their face RIGHT NOW without having to download half of the CPAN. They certainly don't choose XML and no doubt in many cases they'd rather use something else. A case in point, the Amazon Web Services, which as far as I'm aware is only available in XML form - in fact it's partially my recent fight with AWS and XML that made me think I should write this. I'd rather they used something less verbose and lighter weight, but somehow I can't see them paying much attention to me.
I even recognise that I *might* be projecting myself onto those others - although, again, bear in mind the feedback I've received. But if that were the case, this would hardly be the first time that someone had an itch to scratch and decided to share their code in the off-chance that someone else might find it useful. That sort of behaviour is normally considered a Good Thing.
And yes, I'm afraid that after all of the unhelpful, seemingly unthinking criticism, yes, I am afraid that I am rather inclined to dig my heels in somewhat. Constructive criticism might have avoided that, and constructive criticism might make me change my mind. But up until now, the only constructive criticism has come from those who have contacted me to say how much they like the module.
Re:People need to know...
vek on 2007-01-31T16:15:53
Although to be fair you did claim that you "decided to do XML right" and that might be why those that have "done XML right" got out of their pram.Re:People need to know...
fishbot on 2007-01-31T17:10:52
I have to agree here. I can see the value of this module... if you have some tiny chunk of very simple XML, and you want something Tiny to just get it into a usable Perl structure, then sure this is a plausible solution.
However, saying that you feel that you've done XML right and implemented (you feel) "the useful subset" of XML is obviously going to rile people who have dedicated time to actually implementing robust, if heavy, solutions.
I tried the module, tossed an everyday XML file from my $work at it, and it did a decent job. This whole argument is stupid - you had an itch and released something that scratched it. That would have been awesome if you hadn't touted it the way you did, then did ridiculous things like suggesting people like Matt Sergeant and Sean M. Burke don't know what real world programming is like.
When you present hubris and laziness, expect impatience.Re:People need to know...
Alias on 2007-02-01T07:00:12
What sort of thing were the 83 it was useful for?Re:People need to know...
Aristotle on 2007-02-02T01:15:03
A ton of newsfeeds on the web use CDATA to escape their HTML. Your module is useful only if I have enough control over the other side of the wire to know that the XML I receive will always comply to the subset you have implemented. If I have no explicit agreements with the publisher, your module is a ticking timebomb – I never know when they are going to send me XML that will blow up in my face.
Sorry, this is not useful for people who live on the web, as opposed to closed systems.
In fact, your choice to decode select entities actively prevents sensible people from doing any post-processing to clean up after you, like by using HTML::Entities to decode entities in the strings you give them. If had you known well enough to do nothing at all about entities, that would in fact work, and your code would have been far more useful than it is now that you insisted on doing a half-arsed job of it.
Re:People need to know...
drhyde on 2007-02-02T21:02:16
One of the areas in which I *am* compliant with the spec is that I decode those entities that I know about. Without supporting those are &, >, <, ", and '. To not decode those would be an error. Sure, getting rid of that would please you, but only at the expense of having to put up with a different set of ingrates whining about it.Re:People need to know...
drhyde on 2007-02-02T21:04:14
Haha, amusingly this site doesn't escape certain characters, those making what I wrote utterly unreadable;-) Re:People need to know...
Aristotle on 2007-02-06T10:01:20
Did you select “HTML Formatted”? ’Cause in that case it’s a feature, not a bug.
Re:People need to know...
Aristotle on 2007-02-02T22:46:04
No. That is the exact opposite of compliance. Nowhere does the spec say that you should decode the entities you know about and let the others pass through verbatim. According to the spec it must be a fatal error to encounter any entities you do not recognise; I understand that this is not useful for your purposes, so the other sensible option is not to decode entities at all, leaving entity handling up to the user. As a bonus, guess what, that makes your code smaller! The way you do it now, because you decode
&
, it’s impossible to know whether the string é was in the file as &eacute; and got decoded already (so it should be left alone) or was a literal é (that was not decoded and so needs to be decoded manually by the API client). The data is irreversibly mangled.But go on, dismiss people who actually know what they’re talking about as whining ingrates (for what am I supposed to grateful anyway? I’m not touching your code with a 10ʺ pole as it stands), then complain that “the XML police” labels you as the crackpot you are. If you’d spent half the time writing code that you spent justifying your bugs, the thing might actually have been useful. XML::Parser::Lite demonstrates that it’d be possible – if the will was there.
FWIW, Ovid wrote an XML emitter module a while ago that is deliberately capable of producing malformed XML. He approached it from a “I know this is not really a good idea but the other side of the wire demands broken XML and I need to get the job done” perspective; there was a long but amicable discussion about the name, and in the end he called it Data::XML::Variant, which noone complained about since, despite it “taking in vain” the “Holy Trigraph.” There is no “XML police.”
I have no problem with the idea of writing an XML parser that does not implement every aspect of a fullblown parser, or even one that does not understand all features of the spec, but you cannot tell me that a module that mangles its input data and fails to reject files it is not written to process correctly, leading to junk data passed to the API client code in both cases, is anything but broken. Your bug is about as severe as a CSV emitter module that does not escape field separates in values, resulting in broken CSV files – it renders the whole thing worthless.
Re:People need to know...
drhyde on 2007-02-03T02:39:04
Because I'm such a nice person, and because you asked so nicely (oh, wait, you didn't) I just added support for CDATA.Re:People need to know...
ank on 2007-01-30T16:30:40
Hahaha! Not tag-soup! Anything but tag-soup!
WHAT? more than one way to do XML? No, that can't be! That's just not right! What the hell, I'm moving to Java - they know how to conform to standards there.
Crybaby. Don't like it? Don't use it.
Re:People need to know...
Matts on 2007-01-30T17:00:49
That's like saying Ruby is "more than one way to do Perl". No, it's ruby, not Perl. There are parsers for Ruby that work just fine, but don't call them Perl parsers.Re:People need to know...
Alias on 2007-01-31T02:09:37
For a suitably small subset of functionality, Python can be identical to JavaScript.
Thus, a Javascript intepreter can server as a Python interpreter:)
There are many XML modules, but for most of them you have to process XML manually. That is not easy, certainly when you have to read Schema's. The XML specification documents are horribly complex and hard to read. This post gives me the opportunity to plug my own new XML module: http://search.cpan.org/user/markov/XML-Compile.
XML::Compile is XML::LibXML based, and uses schema's to create translators between perl and XML. It does validation on values, handles all namespace issues for you, and follows the W3C specs closely. In most cases, you will not need any explicit processing of XML in your script.
The next step is WSDL (this week) and SOAP (hopefully soon). The target is to create a worthy competitor to SOAP::Lite.
XML::Compile needs to age some
more, to find flaws. Please try it.
Re:There actually are more new modules
byrnereese on 2007-02-02T19:46:34
You have my vote. SOAP::Lite really does need a lot of work, and I dare say a clean start. Creating a SOAP toolkit though is more complex then simply creating a new XML perl parser IMHO.
parsefile()
slurps that 20MB XML file anywayRe:On how to handle entities
Aristotle on 2007-02-06T10:06:53
As I wrote in our email conversation, I think that is an entirely reasonable thing to do. And since this leaves no way of declaring any entities, you could also just reject any document containing any entity besides the predeclared five. Then if you add code to decode numeric character references, you’ve got that topic covered.
Re:version 1.04 released
Aristotle on 2007-02-11T16:48:28
Mmmm, more people might notice if you post this on your journal proper, instead of just as a comment on a weeks-old thread.