Ordinarily, I would just ask theory how to handle this, but he's going to be gone for a while (for reasons that I note he doesn't appear to have blogged, so I'll remain mum for now and hold off on the congratulations). I am importing some data from a MySQL database and am getting output that looks like this:
Scott Sterling^@
Does anyone know what that stuff is and how I might be able to convert it to properly escaped HTML entities?
Update: solved. Much grief led me to create the following subroutine (there were HTML tags embedded, too):
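    use HTML::Entities qw(encode_entities);
    use HTML::TokeParser::Simple;

    sub scrub_text {
        my $html   = shift;
        my $parser = HTML::TokeParser::Simple->new(string => $html);
        my $text   = '';
        # Keep everything except the tags themselves.
        while (my $token = $parser->get_token) {
            $text .= $token->as_is unless $token->is_tag;
        }
        $text = encode_entities($text, "\200-\377"); # escape high-bit characters
        $text =~ s/[\r\n]/ /g;                        # newlines become spaces
        $text =~ s/[^[:print:]]//g;                   # drop control characters such as ^@ (NUL)
        $text = trim($text);                          # trim() is defined elsewhere (not shown)
        return $text;
    }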
Fortunately, this is a one-time import, so I don't have to worry too much about performance.
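For anyone who hits the same thing: ^@ is caret notation for a NUL byte (0x00), which is why stripping non-printable characters makes it go away. Below is a rough, core-Perl-only sketch of the cleanup steps, without the tag stripping. The name scrub_plain and the sample string are made up for illustration, and it emits numeric entities (&#233;) rather than the named ones HTML::Entities would produce, so treat it as an approximation and not the routine used for the import.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the import script: approximates the
# post-tag-stripping cleanup using only core Perl.
sub scrub_plain {
    my ($text) = @_;
    # Encode bytes 0x80-0xFF as numeric character references.
    $text =~ s/([\x80-\xFF])/sprintf('&#%d;', ord $1)/ge;
    $text =~ s/[\r\n]/ /g;        # newlines become spaces
    $text =~ s/[^[:print:]]//g;   # drop NUL (^@) and other control characters
    $text =~ s/^\s+|\s+$//g;      # trim leading and trailing whitespace
    return $text;
}

# Made-up sample: a NUL byte, a Latin-1 e-acute, and a trailing CRLF.
my $raw = "Scott Sterling\0 caf\xE9\r\n";
print scrub_plain($raw), "\n";    # prints: Scott Sterling caf&#233;
```

The order matters a little: the high-bit bytes are encoded first, because under ASCII rules they would otherwise be thrown away by the [^[:print:]] substitution.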