RTF in extremîs

TorgoX on 2002-09-26T10:43:10

I spent all day working on Pod::Simple::RTF. I think the basic Pod::Simple framework is now quite mature; I only made two or three minute changes to it in the whole course of writing Pod::Simple::RTF.

Most of today's fiddling with the RTF thing was adding heuristics that almost no-one will notice, but which make things pretty -- things like "codeblocks of under 15 lines shouldn't be split across pages", and "A reasonably short heading followed by a paragraph shouldn't be split across pages". These aren't exactly trivial things with a tokeparser interface, but I pull 80/20 cheats here and there -- so under some very rare cases, a "keep this together with the next paragraph" code won't get generated. But no big deal; it's just hardcopies of docs.
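To make those two heuristics concrete, here is a rough sketch of how such a decision might look. It is not the code in Pod::Simple::RTF: the helper name keep_codes is made up, and the 60-character cutoff for a "reasonably short" heading is a guess. Only the 15-line limit comes from the paragraph above. The sketch relies on RTF's \keep (keep the paragraph intact) and \keepn (keep the paragraph with the next one) control words.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only. Returns the RTF control
# word to put at the start of a paragraph, or '' if none applies.
sub keep_codes {
    my ($kind, $text) = @_;
    if ( $kind eq 'code' ) {
        my @lines = split /\n/, $text;
        # Codeblocks of under 15 lines shouldn't be split across pages.
        return @lines < 15 ? '\\keep' : '';
    }
    if ( $kind eq 'heading' ) {
        # Assumed cutoff: 60 characters counts as "reasonably short".
        return length($text) <= 60 ? '\\keepn' : '';
    }
    return '';
}

print keep_codes( 'heading', 'NAME' ), "\n";
print keep_codes( 'code', "my \$x = 1;\nprint \$x;\n" ), "\n";
```

A real formatter working from a token stream would have to look ahead to know what follows a heading, which is where the 80/20 cheats come in.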

Benefits of a tokeparser framework: it's fast! It parses AND formats perlvar in under a second. And that's even as it has to go construct and destroy a few thousand objects along the way. (A point where I diverge from HTML::TokeParser's approach is that I have tokens be actual objects, with accessors, not just bare arrayrefs.)
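The difference between the two token styles is easier to see in code. This is only a sketch with a made-up My::Token class, not Pod::Simple's actual token classes. The token is a blessed arrayref, so it can be poked at either way.

```perl
use strict;
use warnings;

# Hypothetical token class, for illustration only.
package My::Token;

sub new {
    my ($class, $type, $tagname) = @_;
    return bless [ $type, $tagname ], $class;
}
sub type     { $_[0][0] }
sub tagname  { $_[0][1] }
sub is_start { $_[0][0] eq 'start' }
sub is_end   { $_[0][0] eq 'end' }

package main;

my $token = My::Token->new( 'start', 'head1' );

# Bare-arrayref style: the reader has to remember what each slot means.
print "arrayref: start of head1\n"
    if $token->[0] eq 'start' and $token->[1] eq 'head1';

# Accessor style: same data, but the code says what it's checking.
print "accessor: start of head1\n"
    if $token->is_start and $token->tagname eq 'head1';
```

The cost is one bless per token, which is where the few thousand objects come from.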

I also spent forever figuring out how to express Unicode characters in RTF -- horror itself, and shoddily supported even in MSWord 2000, but it's better than just doing s/[^\x00-\xff]/X/g;. I've got to at least try, since there's not much of an alternative.

I rather wish RTF allowed comments. I.e., things that an RTF processor (or rather, the RTF processor, since using anything but MSWord to interpret RTF can be pretty dicey) would discard, but which could be used for the equivalent of "<!-- and now we start the thing that might not work -->" or whatever. It'd just be handy for debugging.

Anyhow, the RTF thing is almost done (I've only got left the bit that automatically marks things that look like code as being not spellcheckable), at which point I won't have to think about RTF in any real detail for quite a while.

By the way, s/([^\x00-\xFF])/'\\uc1\\u'.( (ord($1) < 32768) ? ord($1):(ord($1)-65536) ).'?'/eg;
That's what turns Unicode characters like "\x{4E4B}\x{9053}" into their RTF representation, «\uc1\u20043?\uc1\u-28589?». Whee!
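Here is the same substitution wrapped in a sub so it can be tried on its own. The name rtf_escape_wide is made up for this sketch and isn't necessarily what Pod::Simple::RTF uses. It also assumes every character is at or below U+FFFF. Anything higher would produce a number that doesn't fit in \u's signed 16-bit range, and this makes no attempt to handle that.

```perl
use strict;
use warnings;

# Hypothetical helper name, for illustration only.
sub rtf_escape_wide {
    my ($text) = @_;
    # \uN takes a signed 16-bit decimal, so code points of 32768 and up
    # wrap around to negative numbers. The "?" is the fallback character
    # for readers that don't understand \u, and \uc1 says the fallback
    # is exactly one character long.
    $text =~ s/([^\x00-\xFF])/
        '\\uc1\\u' . ( ord($1) < 32768 ? ord($1) : ord($1) - 65536 ) . '?'
    /eg;
    return $text;
}

print rtf_escape_wide("\x{4E4B}\x{9053}"), "\n";   # \uc1\u20043?\uc1\u-28589?
```

Characters in the \x00-\xFF range pass through untouched, so they still need whatever ordinary RTF escaping the rest of the formatter does.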

In other news, I'm coming to the slow and grudging realization that my book isn't terribly bad.


Next stop, Pod::Simple::PDF?

jhi on 2002-09-26T13:34:19

Right?

Re:Next stop, Pod::Simple::PDF?

TorgoX on 2002-09-27T02:17:03

Well, once this RTF thing is done, then I think I should finish the Pod::Simple documentation, then see about a Pod::Simple::Man. I don't really know *roff, but I'll see how far I get by just retrofitting the current Pod::Man.

Your book

gnat on 2002-09-26T17:18:09

Is bloody good. It's one of those books that changes the way people work. After editing it, I was a lot more comfortable with using the web as a data source (and sometimes as a data sink). I couldn't have put together OSCON without it.

--Nat

HTML::TokeParser::Simple

Ovid on 2002-09-26T22:23:42

(A point where I diverge from HTML::TokeParser's approach is that I have tokens be actual objects, with accessors, not just bare arrayrefs.)

Never having been a fan of HTML::TokeParser's arrayrefs, I wrote HTML::TokeParser::Simple. It provides the accessor methods I wanted and makes the code much easier to read. Want to know if a token is a starting or ending form tag? With HTML::TokeParser, you do this:

    if( ('S' eq $token->[0] or 'E' eq $token->[0]) and 'form' eq $token->[1] ) {

Now, you can do this:

    if ( $token->is_tag( 'form' ) ) {

Heck, you can leave the 'form' off to get a boolean response on whether or not something is a tag. There are many useful methods, and it makes HTML parsing an almost trivial affair, but I have no idea if anyone is using the darn thing. That's a somewhat deflating aspect of being a CPAN author. Aside from the occasional bug report, who knows if you've helped anyone? I've had some thoughts of other things I'd like to do with that module, but I don't want to spend a lot of time on something that no one will ever use.