Stupid crippled HTML, stupid vim, stupid regexen

ethan on 2002-11-13T11:34:37

Moving on with my blogger, things have gotten much more complicated than I had ever wished.

1st nuisance: Entries can be made in several formats. When retrieving entries made by other people I realized that some write them in Plain Old Text while others prefer HTML. My initial thought was to run the raw text through html2text. This was quite horrible in several respects. It uses some odd backspace escapes to highlight text. Naturally, vim does not know any of them. Fortunately, those can be turned off. More tricky is the newline thing. Newlines don't mean a lot in HTML and html2text expects HTML. So I had to replace newlines with <br>. Fair enough. The next issue was this proprietary <ecode> thingy. It seems to be some form of preformatting tag to preserve indenting. So now I first convert these tags to <code> and afterwards replace whitespace inside them with the non-breaking space entity. The whole sequence now looks as follows (there is more to come, undoubtedly):



$entry->{body} =~ s"<(/?)ecode>"<$1code>"g;
$entry->{body} =~ s"\n"<br>"g;
$entry->{body} =~
        s"<code>(.*?)</code>"
          '<code>' . do { (my $s = $1) =~ s/\s/&nbsp;/g;$s } . '</code>'"gsex;



That should leave indenting intact and is also understood by html2text. Of course, other deficiencies remain: URLs are currently lost, and there is this sword of Damocles hanging over me as long as I tackle these HTML issues with regexen.
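
To see what the three substitutions do to a concrete entry, here is a small self-contained sketch. The helper name prepare_body and the sample text are made up for illustration and are not part of the blogger; the braces are only a different delimiter choice for the same regexes.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the three substitutions from the post.
sub prepare_body {
    my ($body) = @_;

    # 1. Map the journal's <ecode> tag onto plain <code>.
    $body =~ s{<(/?)ecode>}{<${1}code>}g;

    # 2. Newlines mean nothing in HTML, so turn them into explicit breaks.
    $body =~ s{\n}{<br>}g;

    # 3. Inside <code>...</code>, replace every whitespace character
    #    with &nbsp; so that indentation survives rendering.
    $body =~ s{<code>(.*?)</code>}{
        my $s = $1;
        $s =~ s/\s/&nbsp;/g;
        "<code>$s</code>";
    }gse;

    return $body;
}

# Made-up sample entry mixing plain text and an <ecode> block.
my $sample = "Look:\n<ecode>if (1) {\n  foo();\n}</ecode>";
print prepare_body($sample), "\n";
# prints: Look:<br><code>if&nbsp;(1)&nbsp;{<br>&nbsp;&nbsp;foo();<br>}</code>
```

Note that the order matters: because newlines become <br> before the third step runs, line breaks inside the code block end up as <br> rather than &nbsp;.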

2nd nuisance: There doesn't seem to be a vim script, macro, plugin or anything else available that does basic HTML rendering in a buffer. Disappointing.

3rd nuisance: Regexen are stupid. Perl's regex engine did not like my fancy first attempt at doing it with look-ahead and look-behind. This one might be arguable: Perhaps I was the stupid one here.

Nonetheless, the most recent version of the blogger is as always available through here. The user interface is quite consistent now, with the active window always maximized and easy toggling between compose-window and index-window. Entries from other people can be retrieved by their nickname and are at least rendered in a readable fashion.


HTML in vim

petdance on 2002-11-13T16:07:47

I don't know why you're not having any luck with HTML syntax highlighting in vim. Try
:set ft=html

Re:HTML in vim

ethan on 2002-11-13T17:33:42

Oh, that always worked. This is what I do for the compose-window. But when reading an entry I'd rather not have the highlighted HTML source but a nicely laid-out text document. I would have expected that something like that already exists for vim. I couldn't find anything suitable though.

Re:HTML in vim

petdance on 2002-11-13T17:56:27

Oh, oh, I see. Could you pipe the file to lynx? Something like:
:%!lynx -

Re:HTML in vim

ethan on 2002-11-13T19:04:42

Could you pipe the file to lynx?

Hmmh, looks as though lynx can't read HTML stream-wise. Same limitation applies to links. w3m however can do it but doesn't appear to understand this sort of HTML-bastardism. The result is basically identical to what got piped in. ":%!html2text -nobs" works but then there are no linebreaks for Plain Old Text format.

Currently, I use IPC::Open2 to do it inside my module:


use IPC::Open2;

sub get_entry {
        my $id = shift or return;
        my $ret = $S->get_entry($id);
        return if _had_error($ret);
        my $entry = $ret->result or return;

        # Prepare the body for html2text: <ecode> becomes <code>, newlines
        # become <br>, and whitespace inside <code> becomes &nbsp;.
        $entry->{body} =~ s"<(/?)ecode>"<$1code>"g;
        $entry->{body} =~ s"\n"<br>"g;
        $entry->{body} =~
                s"<code>(.*?)</code>"
                  '<code>' . do { (my $s = $1) =~ s/\s/&nbsp;/g;$s } . '</code>'"gsex;

        # open2 dies on failure, so trap that and fall back to the raw body.
        my $pid = eval { open2(\*OUT, \*IN, "html2text", "-style", "pretty", "-nobs") };
        return $entry->{body} if $@ =~ /^open2:/;

        print IN $entry->{body};
        close IN;
        my @output = <OUT>;
        close OUT;
        waitpid $pid, 0;
        return join "\015", @output;
}


I'd really like to use something like HTML::FormatText. But it expects an HTML::TreeBuilder object to render. I have my doubts that it is that easy to turn the journal entry sources into such a tree since they aren't complete HTML documents.
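
For what it's worth, HTML::TreeBuilder is documented to supply missing <html> and <body> elements on its own, so a bare fragment might be enough. An untested sketch, assuming both CPAN modules are installed; render_fragment is a made-up name, and whether the <ecode> blocks and Plain Old Text newlines come out sensibly is a separate question that this does not answer:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN, HTML-Tree distribution
use HTML::FormatText;    # CPAN, HTML-Format distribution

# Hypothetical helper: render an HTML fragment as plain text.
sub render_fragment {
    my ($html) = @_;

    # Build a tree from the fragment; the missing outer tags are
    # expected to be filled in implicitly by the parser.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    # Trees hold circular references, so free them explicitly.
    $tree->delete;
    return $text;
}

print render_fragment("<p>Some <b>journal</b> text</p><p>Second paragraph</p>");
```

This would replace the external html2text process with an in-process call, at the cost of two extra module dependencies.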

Anyway, the whole business keeps me busy and is intriguing so I am not even complaining that much. ;-)