We're getting close to launching our new catalog browser. The old DB schema was a bloody mess, so in the process of making this new app we've refined the schema and tried to do as much data quality checking as we can.
Just last night I stumbled upon a nice list of common typos found in the OhioLINK catalog. I thought it might be worthwhile to check the list against our data.
I wrote a quick n' dirty script to spit out the list of words from that page and pasted them into the __DATA__ section of the following script.
For each word, it simply tries a search on the catalog and grabs any item ids it finds in the response.
use constant CATALOG_URL => 'http://localhost:3000/search/?q=%s';
use constant ITEM_REGEX => '\/item\/(\d+)';
use strict;
use warnings;
use LWP::UserAgent;
use List::MoreUtils qw( uniq );
$|++; # unbuffer STDOUT so matches show up as they are found
my $agent = LWP::UserAgent->new;
my $regex = ITEM_REGEX;
while( <DATA> ) {
chomp;
# remove strings in brackets and clean up whitespace
s/[\[\(].+?[\]\)]//g;
s/\s+$//;
s/\s+/ /g;
# query the catalog
my $response = $agent->get( sprintf( CATALOG_URL, $_ ) );
next unless $response->content;
my @matches = ( $response->content =~ /$regex/gs );
next unless @matches;
# print the results
print "$_: " . join( ', ', uniq @matches ) . "\n";
}
# Words taken from http://faculty.quinnipiac.edu/libraries/tballard/typoscomplete.html
# Regex: /
# (.+?)<\/font>/gs
__DATA__
Accomodat*
Accordia*
Activite*
Administat*
Administraton*
Adminstrat*
Amd
Archael*
Artic
Assocat*
Asss* [and not ass's]
Berkeley [and] Mass
Cby*
Cincinatti*
...
It hasn't gone through a full run yet, but so far out of about 3350 words, we've only matched about 105. Not bad.
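To see what the cleanup step in the loop actually sends to the catalog, here is a minimal sketch that applies the same three substitutions to a few entries from the list. The `clean_term` helper is a hypothetical name of mine, not part of the script above; it just wraps the substitutions so they can be tried in isolation.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): runs the same three
# substitutions the main loop applies to each __DATA__ line.
sub clean_term {
    my ($term) = @_;
    $term =~ s/[\[\(].+?[\]\)]//g;    # drop notes in square or round brackets
    $term =~ s/\s+$//;                # trim trailing whitespace
    $term =~ s/\s+/ /g;               # collapse runs of whitespace to one space
    return $term;
}

# Sample entries copied from the word list
for my $raw ( "Asss* [and not ass's]", 'Berkeley [and] Mass', 'Artic' ) {
    printf "%-24s => '%s'\n", $raw, clean_term($raw);
}
```

The bracketed note on "Asss*" is dropped entirely, and "Berkeley [and] Mass" collapses to "Berkeley Mass", so the catalog is queried for both words rather than for the literal annotation.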