Finding typos in our catalog

LTjake on 2006-03-21T16:04:50

We're getting close to launching our new catalog browser. The old DB schema was a bloody mess, so in the process of making this new app we've refined the schema and tried to do as much data quality checking as we can.

Just last night I stumbled upon a nice list of common typos found in the OhioLINK catalog. I thought it might be worthwhile to check the list against our data.

I wrote a quick n' dirty script to spit out the list of words from that page and put it in the DATA section of the following script.

For each word, it simply runs a search against the catalog and grabs any item IDs it finds in the response.

use constant CATALOG_URL => 'http://localhost:3000/search/?q=%s';
use constant ITEM_REGEX  => '\/item\/(\d+)';

use strict;
use warnings;

use LWP::UserAgent;
use List::MoreUtils qw( uniq );

$|++;

my $agent = LWP::UserAgent->new;
my $regex = ITEM_REGEX;
while ( <DATA> ) {
    chomp;

    # remove strings in brackets and clean up whitespace
    s/[\[\(].+?[\]\)]//g;
    s/\s+$//;
    s/\s+/ /g;

    # query the catalog
    my $response = $agent->get( sprintf( CATALOG_URL, $_ ) );
    next unless $response->is_success;
    my @matches = ( $response->content =~ /$regex/gs );
    next unless @matches;

    # print the results
    print "$_: " . join( ', ', uniq @matches ) . "\n";
}


# Words taken from http://faculty.quinnipiac.edu/libraries/tballard/typoscomplete.html
# Regex: /
# (.+?)<\/font>/gs
__DATA__
Accomodat*
Accordia*
Activite*
Administat*
Administraton*
Adminstrat*
Amd
Archael*
Artic
Assocat*
Asss* [and not ass's]
Berkeley [and] Mass
Cby*
Cincinatti*
...

It hasn't gone through a full run yet, but so far out of about 3350 words, we've only matched about 105. Not bad.
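
If you want to see what the cleanup and matching steps do to a single entry without hitting the catalog, here's a minimal offline sketch. The helper names clean_entry and item_ids and the sample response body are made up for illustration; the substitutions and the item pattern are the ones from the script above. It uses a plain hash instead of List::MoreUtils' uniq so it runs on core Perl only.

```perl
use strict;
use warnings;

# Same pattern as ITEM_REGEX in the script above.
my $item_regex = qr{/item/(\d+)};

# Hypothetical helper: the same three substitutions the main loop applies.
sub clean_entry {
    my ($entry) = @_;
    $entry =~ s/[\[\(].+?[\]\)]//g;    # drop bracketed notes
    $entry =~ s/\s+$//;                # trim trailing whitespace
    $entry =~ s/\s+/ /g;               # collapse runs of whitespace
    return $entry;
}

# Hypothetical helper: pull item ids out of a response body and
# de-duplicate them, keeping first-seen order.
sub item_ids {
    my ($html) = @_;
    my %seen;
    return grep { !$seen{$_}++ } ( $html =~ /$item_regex/g );
}

# Made-up response body, only for illustration.
my $sample = '<a href="/item/12">A</a> <a href="/item/7">B</a> '
           . '<a href="/item/12">A again</a>';

print clean_entry("Asss* [and not ass's]"), "\n";
print clean_entry("Berkeley [and] Mass"),   "\n";
print join( ', ', item_ids($sample) ),      "\n";
```

With that sample, the two entries come out as "Asss*" and "Berkeley Mass", and the id list is "12, 7". Whether your search actually treats the trailing * as a wildcard depends on the catalog's query parser, so that part is worth checking separately.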


Anti-word-list

n1vux on 2006-03-21T17:15:54

Nice find. I like it. Thanks!