Ordinarily, I would just ask theory how to handle this, but he's going to be gone for a bit (for reasons that I note he doesn't appear to have blogged, so I'll remain mum for a bit and hold off on the congratulations). I am importing some data from a MySQL database and am getting output that looks like this:
Scott Sterling^@
Does anyone know what that stuff is and how I might be able to convert it to properly escaped HTML entities?
Update: solved. Much grief led me to create the following subroutine (there were HTML tags embedded, too):
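use HTML::TokeParser::Simple;
use HTML::Entities qw(encode_entities);

sub scrub_text {
    my $html = shift;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $text = '';
    # append every token that is not a start or end tag
    while (my $token = $parser->get_token) {
        $text .= $token->as_is unless $token->is_tag;
    }
    # encode high-bit characters before the non-printables are stripped
    $text = encode_entities($text, "\200-\377");
    $text =~ s/[\r\n]/ /g;         # line breaks become spaces
    $text =~ s/[^[:print:]]//g;    # drop control characters such as the ^@ above
    $text =~ s/^\s+|\s+$//g;       # inline stand-in for a trim() helper
    return $text;
}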
Fortunately, this is a one-time import, so I don't have to worry too much about performance.
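As for what that stuff is: ^@ is caret notation for a NUL byte (0x00), which is how many terminals and editors display it. The sketch below replays the last three cleanup steps of the subroutine on a made-up sample, using only core Perl. The sample string is an assumption based on the output above, not the actual database contents, and the trim is written as a plain regex.

```perl
use strict;
use warnings;

# Hypothetical sample: a name followed by a NUL byte and a CRLF,
# which is roughly what the "^@" in the output suggests.
my $raw = "Scott Sterling\0\r\n";

(my $clean = $raw) =~ s/[\r\n]/ /g;   # line breaks become spaces
$clean =~ s/[^[:print:]]//g;          # removes the NUL and any other control characters
$clean =~ s/^\s+|\s+$//g;             # trim leading and trailing whitespace

printf "%d characters before, %d after: '%s'\n",
    length $raw, length $clean, $clean;
```

The tag stripping and entity encoding are left out here because they need the HTML::TokeParser::Simple and HTML::Entities modules from CPAN. In the full subroutine they have to run first, since high-bit characters that are not yet encoded would be removed by the [:print:] filter.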