Week 9 of Web.pm -- encodings, and a deep dive into Genshi

masak on 2009-06-17T11:50:15

Den all IRS d00dz an bad kittehs coem to hear Jebus. An Fariseez sez "LOL Jebus etz wit bad kittehs! Him sux!"
Jebus sez, "WTF? If lolcat had hundrd sheeps an won gits losted, doan him leev naintee nain sheeps an go luk fr losted won? Den him find it an iz liek 'w00t!' An dem him coem hoem, trow partee cuz him finded losted sheep. Srsly! Ceiling Cat moar happi wen a bad kitteh being maed lolcat den bowt naintee nain gud kittehs." — Luke 15:1-7

I'm working mostly on Hitomi, a Perl 6 port of the Python templating engine Genshi. In the past week, I decided to dive into Genshi, looking at how data flows from the template to the finished result. I now have a pretty good understanding of this, so I thought I'd expand a bit on it here.

Genshi's fundamental data structure is called Stream, and it looks very much like a sequence of SAX events to me: open-tag, close-tag, text, processing-instruction, etc. Different transformations are then applied to a stream to yield the final result. A transformation could be something like "remove all <script> elements" or "shorten all posts that are longer than 400 characters". A stream is never modified in place; instead, it is combined with a transformation to produce a new stream. The nice thing is that the actual templating is also expressed as a series of such transformations. And the Genshi user can easily provide her own transformations on top of the standard ones.
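
To illustrate the idea, here's a minimal sketch in Perl 6. The Event class and the transformation below are hypothetical shapes of my own invention, not Hitomi's (or Genshi's) actual API:

    # A hypothetical stream: a flat sequence of events, not a tree.
    class Event {
        has Str $.kind;   # 'start', 'end', 'text', ...
        has Str $.data;   # tag name for start/end, content for text
    }

    my @stream = Event.new(:kind<start>, :data<p>),
                 Event.new(:kind<text>,  :data('Hello, world')),
                 Event.new(:kind<end>,   :data<p>);

    # A transformation never mutates the stream it is given; it maps
    # the old stream, event by event, into a new one.
    sub uppercase-text(@events) {
        @events.map: -> $e {
            $e.kind eq 'text'
                ?? Event.new(:kind<text>, :data($e.data.uc))
                !! $e
        }
    }

    say uppercase-text(@stream).map(*.data).join(' ');  # p HELLO, WORLD p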

I like this model very much. It feels extremely clean and extensible. I decided to port as much of it as I can to Hitomi. My short-term goal is to make things round-trip through the streams, and to that end, I've ported a test file with 89 tests from Genshi to Perl 6.

It's still not totally clear to me how text is converted to a stream and then back. I can easily picture how a stream event knows how to serialize itself back into text, so the mystery on that side isn't very great; it's just that I haven't found the actual Python code for it yet. On the stream generation side, the data flow disappears into a Python Expat binding. Delegating XML parsing to a third-party library also seems like an exceedingly good idea to me.
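
The serialization direction, at least, is easy to imagine. Assuming the hypothetical event shape from the sketch above, rendering a stream back to text is little more than a join over the events:

    class Event { has Str $.kind; has Str $.data }

    sub render(@events) {
        @events.map(-> $e {
            given $e.kind {
                when 'start' { '<'  ~ $e.data ~ '>' }
                when 'end'   { '</' ~ $e.data ~ '>' }
                default      { $e.data }
            }
        }).join;
    }

    my @stream = Event.new(:kind<start>, :data<p>),
                 Event.new(:kind<text>,  :data('Hello')),
                 Event.new(:kind<end>,   :data<p>);

    say render(@stream);   # <p>Hello</p>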

Can we do the same thing — delegate to Expat, or some other suitable library — in Hitomi? I think so, and the Parrot documentation looks promising. I'd very much like to get that working. But in the short run, I'm pondering whether it might not be easier to make a small, throwaway XML parser out of the bits and pieces we developed as prototypes. I could make it a separate class and call it Impostor, to make sure we remember to remove it later.
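
To give a feel for how small such a throwaway parser could be, here's a deliberately naive grammar sketch. It ignores attributes, entities, comments and the matching of end-tag names, and it isn't actual Hitomi code:

    # Impostor: a wilfully incomplete, throwaway XML parser.
    grammar Impostor {
        token TOP     { \s* <element> \s* }
        token element { '<' <name> \s* '/>'
                      | '<' <name> '>' <node>* '</' <name> '>' }
        token node    { <element> | <text> }
        token name    { <[A..Za..z_:]> <[A..Za..z0..9_.:\-]>* }
        token text    { <-[<]>+ }
    }

    say Impostor.parse('<p>Hello, <em>world</em>!</p>')
        ?? 'parses' !! 'fails';    # parses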

Another issue I ran into is one of encoding. viklund++ has been doing heroic work in the past week making November handle UTF-8 correctly. The reason this is heroic is that Rakudo doesn't have a model for string encodings yet. The information has to be forced out of Rakudo against its will, and I've heard viklund mutter darkly about hacks and workarounds lately... It all culminated in a good discussion on #perl6 last night, and pmichaud++ promised to make a preliminary implementation of .encode (for Str) and .decode (for Buf), if we just sat down and wrote some tests to show what we expected these to do.
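
To make the request concrete, this is the division of labour we're hoping for (none of it exists in Rakudo yet, so consider it a wishlist rather than working code):

    # Str knows about characters, Buf knows about bytes, and
    # .encode/.decode convert between the two.
    my $buf = 'Träffpunkt'.encode('UTF-8');   # Str to Buf of bytes
    say $buf.elems;                           # 11, since 'ä' takes two bytes
    say $buf.decode('UTF-8');                 # Buf back to Str: Träffpunkt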

Looking ahead a bit, I think there's a good chance we'll have something usable in Hitomi before my original grant period is over. After that, it might be a good idea to start looking at a port of Ruby's Hpricot (for manipulating and searching HTML documents), and to start digging into the MVC quagmire. I still expect to do some preliminary MVC investigation before then, though.

I wish to thank The Perl Foundation for sponsoring the Web.pm effort.


Tests for .encode and .decode

moritz on 2009-06-17T15:06:58

I already wrote the tests for .encode and .decode (t/spec/S32-str/encode.t), so all that remains to be done is the two actual methods, a Buf type with a constructor, and a multi candidate for infix:<eqv> that compares two Buf objects.

And probably some way to print a Buf without trying to encode anything again.
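
For the curious, the assertions run along these lines; this is a paraphrase of the shape of the tests, not the actual contents of t/spec/S32-str/encode.t:

    use Test;
    plan 5;

    my $bytes = 'ø'.encode('UTF-8');
    is $bytes.elems, 2,             'ø encodes to two UTF-8 bytes';
    is $bytes[0], 0xC3,             'first byte is correct';
    is $bytes[1], 0xB8,             'second byte is correct';
    is $bytes.decode('UTF-8'), 'ø', 'decoding restores the original Str';
    ok Buf.new(0xC3, 0xB8) eqv Buf.new(0xC3, 0xB8),
        'infix:<eqv> compares Bufs by value';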

Why remove?

Aristotle on 2009-06-17T16:27:54

I think a methodical translation of the XML TR to a Perl 6 grammar would be a great asset to Perl 6 as a platform (user perspective) and would equally serve as a benchmark for the usefulness and performance of the grammar language (developer perspective).
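
As a taste of how direct the translation could be, here is the TR's Name production (production [5]) transcribed into a token, with the long Unicode ranges of NameStartChar whittled down to ASCII for brevity:

    # Name ::= NameStartChar (NameChar)*
    grammar XML::Names {
        token Name          { <NameStartChar> <NameChar>* }
        token NameStartChar { <[A..Za..z_:]> }    # full Unicode ranges elided
        token NameChar      { <NameStartChar> | <[0..9.\-]> }
    }

    say XML::Names.parse('xsl:template', :rule<Name>)
        ?? 'a valid Name' !! 'not a Name';    # a valid Name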

Further, once we have a realisation of the original vision that the Perl 6 parser itself be a grammar, overloadable by the program under execution, this XML grammar could be used to implement something very much like Javascript’s E4X, without the need for any support within the Perl 6 specification itself (in contradistinction to E4X, which is an explicit extension of the base Javascript language).

And really, if done well (i.e. using the specification as the guide, to ensure completeness, comprehensiveness and compliance), an XML parser isn’t actually that scary a beast to write. And there’s an extensive pre-existing test suite, too.

Now, this is a project in its own right, and thus, even if you see my point and agree that it is worthwhile, I doubt you’ll be eager to take on another hairy yak…

But I wanted to argue that case.

Re:Why remove?

masak on 2009-06-18T07:48:56

Agree on all points, including the one about that being a separate project. I'm likely to do the bare minimum for Hitomi, and then we'll iteratively try to converge on something clean/fast/maintainable.

I might mention that Karl Rune Nilsen (krunen++) started building an XML parser at the post-NPW hackathon. It's still in its early stages, but perhaps someone intent on writing the XML parser Aristotle envisions would still like to consult krunen for ideas. Synergy++.