Corpus Colossus

TorgoX on 2002-10-25T21:04:26

Dear Log,

I needed a corpus of HTML files for testing the HTML::Formatter classes (which I'm tidying up). So I logged into a friend's .edu account, and with a few little Unix commands (including a one-liner involving the king of them all, Perl), I made a tar file of all the reasonably-sized ~user/public_html/index.html files on the system. All 2,908 of them. Excellent!
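
For the curious, here is a rough sketch of the sort of filtering step described above, not the actual one-liner. The helper name `corpus_candidates`, the home-directory layout, and the size limits are all assumptions made for illustration; the output is a list of paths that could be fed to something like `tar -T -`.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper: return every <home_root>/<user>/public_html/index.html
# whose size in bytes lies between $min_bytes and $max_bytes inclusive.
sub corpus_candidates {
    my ($home_root, $min_bytes, $max_bytes) = @_;
    my @keep;
    for my $file (bsd_glob("$home_root/*/public_html/index.html")) {
        next unless -f $file;            # skip directories, dangling symlinks, etc.
        my $size = (-s $file) || 0;      # -s yields a false value for empty files
        next if $size < $min_bytes || $size > $max_bytes;
        push @keep, $file;
    }
    return @keep;
}

# Usage (assumed limits): perl corpus.pl /home | tar -cf corpus.tar -T -
if (@ARGV) {
    print "$_\n" for corpus_candidates($ARGV[0], 1, 100_000);
}
```

What counts as "reasonably sized" is a judgment call; the limits here are placeholders.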


Content Supermodels

TorgoX on 2002-10-25T21:19:15

Dear Log,
On the mp3-trola: Grieg, "Cradle Song"

So I'm trying my hand at writing an HTML::FormatRTF to go along with HTML::FormatPS and HTML::FormatText in the HTML-Format(ter?) dist. While writing pod2rtf was pretty straightforward, this is proving pretty tricky -- because the block-level content-models for HTML are so much trickier than the block-level content-models for Pod. If I could wave a magic wand and make all the HTML input be XHTML Strict, that'd be really handy. But for the moment, there are problems like "<blockquote>foo<p>bar</blockquote>" parsing as:

  • blockquote
    • "foo"
    • p
      • "bar"
whereas what I'd really want is this:
  • blockquote
    • p (implicit)
      • "foo"
  • p
    • "bar"
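
A minimal sketch of an ad-hoc pass in that direction, assuming HTML::TreeBuilder from CPAN is installed. The helper `wrap_loose_text` is hypothetical (it is not part of HTML::TreeBuilder); it wraps each bare text child of a block in its own p element marked implicit, and leaves existing elements where the parser put them, so the result is only close to the tree wanted above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::Element;

# Hypothetical helper: wrap every bare text child of $block in a new <p>
# flagged as implicit. Element children are left untouched.
sub wrap_loose_text {
    my ($block) = @_;
    my @children = $block->detach_content;   # unlink all children, keeping order
    for my $node (@children) {
        next if ref $node;                    # text nodes are plain strings
        my $p = HTML::Element->new('p');
        $p->implicit(1);                      # mark as not present in the source
        $p->push_content($node);
        $node = $p;                           # $node aliases the slot in @children
    }
    $block->push_content(@children);
    return $block;
}

my $tree = HTML::TreeBuilder->new_from_content('<blockquote>foo<p>bar</blockquote>');
my $bq   = $tree->look_down(_tag => 'blockquote');

print "As parsed:\n";
$bq->dump;
wrap_loose_text($bq);
print "After the ad-hoc pass:\n";
$bq->dump;
$tree->delete;
```

This only touches direct children of the one block it is handed, which is about as far from a Grand Solution as it gets.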
The best, but trickiest, way is to make HTML::TreeBuilder optionally parse things the second way. It's best because it'd be useful to other people. It'd be trickiest because, first off, it means messing with a pretty complex module; and because my idea of what I want the parse-trees to look like for this purpose is not going to be The Answer of how you'd want them to look for all purposes. For example, it'd be handy for purposes of rendering to RTF if "<li>foo<p>bar" would parse like:
  • li
    • "foo"
  • p
    • "bar"
But for rendering to some other formats, you might want it to come out like:
  • li
    • "foo"
    • p
      • "bar"
So, in conclusion: feh. The hardest thing in good programming is accepting that some things are best implemented as ad-hoc solutions, not Grand Solutions To Everything.

Late news: For HTML::FormatRTF, I'm giving up on the approach that was meshing so badly with the HTML content-models, and doing something a bit stranger, sort of the way HTML::FormatPS does it.

Re:Content Supermodels

ziggy on 2002-10-25T21:43:22

> If I could wave a magic wand and make all the HTML input be XHTML Strict, that'd be really handy.
Have you tried:
tidy -m -asxhtml foo.html