HTML::ToText::Simple

miyagawa on 2006-08-21T14:14:52

Lazyweb,

I've been looking for a simple HTML-to-text converter.

HTML::FormatText does most of what I want, but it also does more than I want. Rendering HR tags as a horizontal "-----" is one example. I don't like that.

HTML::Element has an as_text() method, which is very close to what I want. But apparently it doesn't do the right thing with the img@alt attribute (<img alt="Bar"> is dumped empty, not "Bar"), and "Foo<br>Bar" is dumped as "FooBar", not "Foo Bar".

I chatted with Yuval (nothingmuch) in #catalyst and would like to write a simple Visitor module that does this with an HTML::TreeBuilder-generated tree.
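A rough sketch of what such a visitor could look like, assuming HTML::TreeBuilder is installed. The names html_to_text and _visit are hypothetical, and the list of tags treated as word boundaries is an arbitrary choice for illustration, not a settled design:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Tags that should separate words in the output (assumed, incomplete list).
my %boundary = map { $_ => 1 } qw(br p div li tr td th h1 h2 h3 h4 h5 h6);

# Hypothetical entry point: HTML string in, single-spaced plain text out.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @chunks;
    _visit($tree, \@chunks);
    $tree->delete;    # free the parse tree explicitly

    my $text = join '', @chunks;
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ | $//g;     # trim both ends
    return $text;
}

# Walk the tree depth-first, collecting text into @$out.
sub _visit {
    my ($node, $out) = @_;

    # Text nodes are plain strings in an HTML::Element content list.
    unless (ref $node) {
        push @$out, $node;
        return;
    }

    my $tag = $node->tag;
    return if $tag eq 'script' || $tag eq 'style';

    if ($tag eq 'img') {
        # Use the alt text instead of dropping the image.
        my $alt = $node->attr('alt');
        push @$out, " $alt " if defined $alt && length $alt;
        return;
    }

    push @$out, ' ' if $boundary{$tag};
    _visit($_, $out) for $node->content_list;
    push @$out, ' ' if $boundary{$tag};
}

print html_to_text('Foo<br>Bar <img src="x.png" alt="Baz">'), "\n";
```

The intent is that the alt text survives and that "Foo" and "Bar" stay separate words. Line wrapping and other layout are deliberately left out.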

If that sounds like a duplicate of someone else's work, let me know.


Tried lynx?

kane on 2006-08-21T17:50:56

That's always been the cheap solution, to just pipe it through lynx. Not much choice in the formatting, but a quick fix if you just want an html2text.

Just a thought :)

Re:Tried lynx?

Aristotle on 2006-08-21T18:40:51

I like vilistextum. Decent rendering and very fast.

Re:Tried lynx?

miyagawa on 2006-08-21T19:43:18

I forgot to mention that I prefer pure Perl. But thanks anyway.

I do this, sort of

jrockway on 2006-08-21T23:01:13

I'm doing something like this also:

http://www.jrock.us/trac/blog_software/browser/trunk/Angerwhale/lib/Blog/Format/HTML.pm#L52

I agree that I should probably replace imgs with their alt (instead of dropping them). If you feel like working on this, I'll probably replace my code with your module.

However, I think I'll add that feature, and process the output with Text::Autoformat, as well. Let me know what you think.
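A minimal sketch of that post-processing step, assuming Text::Autoformat is installed. The sample string and the 60-column margin are made up for illustration:

```perl
use strict;
use warnings;
use Text::Autoformat qw(autoformat);

# Stand-in for the unwrapped output of an HTML-to-text step.
my $text = 'Foo Bar ' x 20;

# Reflow every paragraph to a 60-column right margin (assumed width).
my $wrapped = autoformat($text, { all => 1, right => 60 });

print $wrapped;
```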

Re:I do this, sort of

miyagawa on 2006-08-22T03:29:06

Agreed. Formatting the plain text into a readable layout should be the job of another module.