I managed to get the parse of large.xml (a 70K file) down from 9 seconds to about 7 or 8 seconds. Not a huge improvement - and I didn't feel my time was terribly well spent, at least until I tried bleadperl (5.7.2 current), where previously it had been significantly slower (about 17 seconds) and was now down to about 8 or 9 seconds (it will always be slower because it does many more Unicode checks when running under a Unicode-capable perl). So that's good.
Well, "good" is perhaps an overstatement, since libxml2's "xmllint" program takes 29ms to parse the same file. Ah well, I think it's time to stop worrying about parsing performance, and start thinking about full compliance instead.
Why does perl make for such a crappy parser?
Re:It may be obvious, but...
Matts on 2002-02-04T12:01:25
The problem is that Perl is just slow. Not really much I can do about that. Compare it to C, where you can do really nice things like char = ++*p to get the current character and move to the next byte in a string. With Perl a similar idiom is $char = substr($str, 0, 1, ''), which has a lot more overhead (same for a regexp to do the same). Character-wise coding in Perl has always been a bit of a pain.
Re:It may be obvious, but...
Matts on 2002-02-04T12:02:08
Err, that should have been char = *p++.
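For anyone following along, here is a minimal sketch of why the correction matters. The function names count_char and bump_first are made up for illustration and do not come from any parser mentioned in this thread: *p++ reads the current byte and then advances the pointer, while ++*p increments the byte itself and leaves the pointer where it was.

```c
#include <stddef.h>

/* Hypothetical helper: walk a NUL-terminated string with the *p++ idiom,
   counting how many bytes equal `wanted`. */
size_t count_char(const char *p, char wanted)
{
    size_t n = 0;
    char c;

    /* *p++ yields the byte p currently points at, then moves p forward. */
    while ((c = *p++) != '\0') {
        if (c == wanted) {
            n++;
        }
    }
    return n;
}

/* Hypothetical helper showing the mistyped form: ++*p adds one to the
   byte p points at and returns the new value; p itself does not move. */
char bump_first(char *p)
{
    return ++*p;
}
```

Calling count_char on a buffer with '<' as the second argument counts tag openings in a single pass, which is roughly the kind of per-character loop a C parser gets almost for free.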
My C sucks ;-)
Re:It may be obvious, but...
jdavidb on 2002-02-04T18:05:05
Maybe someone needs to write a character-array manipulation class, a la PDL for huge matrix crunching. The class would gain a lot in efficiency for trading away the many capabilities Perl ordinarily gives. This would be something gross in XS, I'm sure.
Or maybe, if I'm thinking of writing a custom text-manipulation class for Perl, something's dreadfully wrong with the world. In much the same way that we always took XML::Parser's dependence on a C parser as an indication that something was wrong (and we were right).
Remember the C<less> pragma? You could supposedly use less 'memory' or whatever, and the interpreter would switch optimizations around to trade speed or whatever for memory. It'd be cool if you could trade off abilities on a scalar for efficiency. As in, declare that this scalar can never be bound to a regex operator such as m// or s///.
What am I rambling about?
I'm presuming the answer is "Yes," but did you profile the code?
matts: "Yes, jdavidb, I profiled the code and discovered 80% of the processing occurs in statements like $c = substr($buf, 0, 1); Get off my case!"
Re:Profiling
Matts on 2002-02-05T09:48:55
Hehe, yeah I did profile, lots. (out of interest, anyone know why "use File::Temp" causes DProf to segfault?)
I'm going to post something to perlmonks including the profiling output and the heavy subs in question. Maybe someone there can help out.