Unicode help needed?

Ovid on 2005-05-13T18:34:50

Ordinarily, I would just ask theory how to handle this, but he's going to be gone for a bit (for reasons that I note he doesn't appear to have blogged, so I'll remain mum for a bit and hold off on the congratulations). I am importing some data from a MySQL database and am getting output that looks like this:

Scott Sterling^@
Mimi ValdÃ~CFFÃ~C,Ã~B©s
Kevin R. Scott
Delphine A Fawundu-
Mariel ConcepciÃ~CFFÃ~C,Ã~B³n

Does anyone know what that stuff is and how I might be able to convert that to properly escaped HTML entities?

Update: solved. Much grief led me to create the following subroutine (there were HTML tags embedded, too):

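use HTML::TokeParser::Simple;
use HTML::Entities qw(encode_entities);

sub scrub_text {
    my $html = shift;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $text = '';
    # Keep only the non-tag tokens, which drops the embedded HTML tags.
    while (my $token = $parser->get_token) {
        $text .= $token->as_is unless $token->is_tag;
    }
    # Escape the high-bit bytes (octal 200-377) as HTML entities.
    $text = encode_entities($text, "\200-\377");
    $text =~ s/[\r\n]/ /g;        # newlines become spaces
    $text =~ s/[^[:print:]]//g;   # drop control characters such as NUL
    $text =~ s/^\s+|\s+$//g;      # trim leading and trailing whitespace
    return $text;
}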

Fortunately, this is a one-time import, so I don't have to worry too much about performance.


UTF-8 conversion

iburrell on 2005-05-13T22:58:44

The junk that starts with A-tilde is almost certainly UTF-8 being displayed as Latin-1. First, convert the UTF-8 octets to a Unicode string with Encode::decode. Then HTML::Entities should give you the proper Unicode numeric entities.
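
A minimal sketch of that two-step approach, assuming the input really is a single layer of valid UTF-8 octets. The helper name utf8_to_entities is made up for illustration, and the "unsafe" character set passed to HTML::Entities is one possible choice (everything outside printable ASCII and newline), not something taken from the original post:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical helper: takes raw UTF-8 octets, returns ASCII-only text
# with every non-ASCII character written as a numeric entity.
sub utf8_to_entities {
    my ($octets) = @_;

    # Step 1: turn the UTF-8 byte sequence into a Perl character string.
    my $chars = decode('UTF-8', $octets);

    # Step 2: escape everything except newline and printable ASCII.
    # Note that this set leaves &, < and > untouched.
    return encode_entities_numeric($chars, '^\n\x20-\x7e');
}

# "\xc3\xa9" is the UTF-8 encoding of e-acute (U+00E9).
print utf8_to_entities("Mimi Vald\xc3\xa9s"), "\n";   # prints: Mimi Vald&#xE9;s
```

Swapping in encode_entities would emit named entities such as &eacute; wherever one exists. If the stored data has been encoded more than once, a single decode will not be enough to recover the original characters.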