Annoying problems get annoying solutions

Ovid on 2005-10-25T20:33:56

The problem: take the HTML, strip everything which is not text from between the body tags and truncate the body at 1000 characters, regardless of whether or not I'm chopping a word in half.

sub compress_body {
    my $html   = shift;
    my $parser = HTML::TokeParser::Simple->new(\$html);

    my ($header, $body, $footer) = ('', '', '');

    my $curr_output = \$header;

    while ( my $token = $parser->get_token ) {
        $curr_output = \$footer if $token->is_end_tag('body');

        # skip non-text in body
        if ( $curr_output eq \$body ) {
            next unless $token->is_text;
        }
        $$curr_output .= $token->as_is;

        $curr_output = \$body if $token->is_start_tag('body');
    }

    return join "\n" => $header, substr($body, 0, 1000), $footer;
}

Can you think of anything simpler? I think the above code is just flat out ugly.

Update: cleaned it up just a tad even though the logic is the same. I also hate how things are dependent on the order.


return when the body = 1000?

merlyn on 2005-10-25T21:20:59

No need to gather a larger body, only to truncate it...

abstraction is always nice

markjugg on 2005-10-25T21:35:43

How about: my $compressed_body = compress_body($html);

My attempt at humor is meant to indicate the upside here:

At least your annoying code is in a routine, so you have made the functionality less annoying for everything that uses it.

use tt2

kennyg on 2005-10-25T21:54:11

[% FILTER truncate(1000) %] really long text from example... [% END %]

Re:

Aristotle on 2005-10-26T03:25:44

I’m not sure I understand what your code is doing, so I’ll code to your specification instead…

sub compress_body {
    my ( $html ) = @_;
    my $parser = HTML::TokeParser::Simple->new( \$html );
    my $text = '';
    my $in_body;

    1 while $_ = $parser->get_token and not $_->is_start_tag( 'body' );

    while ( my $token = $parser->get_token ) {
        if( $token->is_text ) {
            $text .= $token->as_is;
            last if length( $text ) > 1000;
        }
        elsif( $token->is_end_tag( 'body' ) ) {
            last;
        }
    }

    return substr( $text, 0, 1000 );
}

Re:

Aristotle on 2005-10-26T03:29:33

That is, minus the $in_body I accidentally left in there.

Oh, and my own preferred solution would be to load the thing with libxml’s HTML parsing mode, then do $dom->findvalue( 'substring( /html/body, 0, 1000 )' ). XPath rocks.

Re:

Ovid on 2005-10-26T04:02:38

The problem is, it gets passed an entire HTML document and has to preserve everything up to the first body tag (inclusive) and after the last body tag (also inclusive).

Re:

Aristotle on 2005-10-26T05:30:53

Ah, now suddenly all your contortions make sense.

sub compress_body {
    my ( $html ) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my $out = '';
    my $body = '';

    while ( my $token = $parser->get_token ) {
        if( $token->is_start_tag('body') .. $token->is_end_tag('body') ) {
            $out .= $token->as_is if $token->is_tag( 'body' );
            $body .= $token->as_is if $token->is_text;
        }
        else {
            if( length( $body ) ) {
                $out .= substr $body, 0, 1000;
                $body = '';
            }
            $out .= $token->as_is;
        }
    }

    return $out . substr $body, 0, 1000;
}

A bit more repetition than I’d like. Maybe

sub compress_body {
    my ( $html ) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    my $out = '';
    my $in_body;

    while ( my $token = $parser->get_token ) {
        if( $in_body ) {
            my $body = '';
            {
                last if $token->is_end_tag( 'body' );
                $body .= $token->as_is if $token->is_text;
                redo if $token = $parser->get_token;
            }
            $out .= substr $body, 0, 1000;
            undef $in_body;
        }

        $in_body = $token->is_start_tag( 'body' );
        $out .= $token->as_is;
    }

    return $out;
}

Better… in a way… I think.

Re:

Aristotle on 2005-10-26T05:34:02

Hmm, that has a subtle bug. There has to be a last if not $token; before undef $in_body;.