We're getting close to launching our new catalog browser. The old DB schema was a bloody mess, so in the process of making this new app we've refined the schema and tried to do as much data quality checking as we can.
Just last night I stumbled upon a nice list of common typos found in the OhioLINK catalog. I thought it might be worthwhile to check the list against our data.
I wrote a quick 'n' dirty script to spit out the list of words from that page, then pasted the list into the __DATA__ section of the following script.
For each word, it simply runs a search against the catalog and grabs any item IDs it finds in the response.
    use strict;
    use warnings;

    use constant CATALOG_URL => 'http://localhost:3000/search/?q=%s';
    use constant ITEM_REGEX  => '\/item\/(\d+)';

    use LWP::UserAgent;
    use List::MoreUtils qw( uniq );

    # unbuffered output, so results show up as they are found
    $|++;

    my $agent = LWP::UserAgent->new;
    my $regex = ITEM_REGEX;

    while ( <DATA> ) {
        chomp;

        # remove strings in brackets and clean up whitespace
        s/[\[\(].+?[\]\)]//g;
        s/\s+$//;
        s/\s+/ /g;

        # query the catalog
        my $response = $agent->get( sprintf( CATALOG_URL, $_ ) );
        next unless $response->content;

        # collect every item id linked from the results page
        my @matches = ( $response->content =~ /$regex/gs );
        next unless @matches;

        # print the results
        print "$_: " . join( ', ', uniq @matches ) . "\n";
    }

    # Words taken from http://faculty.quinnipiac.edu/libraries/tballard/typoscomplete.html
    # Regex: / (.+?)<\/font>/gs

    __DATA__
    Accomodat*
    Accordia*
    Activite*
    Administat*
    Administraton*
    Adminstrat*
    Amd
    Archael*
    Artic
    Assocat*
    Asss* [and not ass's]
    Berkeley [and] Mass
    Cby*
    Cincinatti*
    ...
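The cleanup step inside the loop is easy to check on its own, since a few entries in the list carry bracketed notes that should not end up in the search query. Here is a minimal sketch that applies the same three substitutions to sample entries from the list. The `clean_entry` helper is a hypothetical name introduced just for this illustration; the script above does the work inline on `$_`.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): applies the same three
# substitutions the main loop uses to a single raw list entry.
sub clean_entry {
    my ($entry) = @_;
    $entry =~ s/[\[\(].+?[\]\)]//g;    # drop bracketed or parenthesised notes
    $entry =~ s/\s+$//;                # trim trailing whitespace
    $entry =~ s/\s+/ /g;               # collapse runs of whitespace to one space
    return $entry;
}

# Sample entries copied from the __DATA__ section
for my $raw ( "Asss* [and not ass's]", 'Berkeley [and] Mass', 'Amd' ) {
    print "'$raw' => '" . clean_entry($raw) . "'\n";
}
```

With these inputs, the note after `Asss*` is dropped entirely, and `Berkeley [and] Mass` collapses to `Berkeley Mass`, so both words go to the catalog as one query.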
It hasn't gone through a full run yet, but so far out of about 3350 words, we've only matched about 105. Not bad.