Record Linkage / Name and Address Standardization

malbh on 2004-03-24T15:51:01

At my last gig, I used several expensive commercial packages to solve what was called the merge/purge, or deduplication, problem.
I was always hoping for an open source solution, because the results were sometimes mysterious and access to people who understood the theory behind name and address matching was very limited. (I think the idea was that these were trade secrets.)
At the time, I didn't know that the computer science term for the problem was record linkage, or that the bioinformatics community has been working on it from two directions: finding overlapping data in large strings (for the human genome), and bringing together people's names for medical records and histories. The latter is particularly promising, since it's a close match to name and address issues. From there, I googled a few record linkage projects, but the ones discussed, AJAX and Potter's Wheel, seem to be more interactive tools, and I'm more interested in large datasets. The only AJAX source I could find is a Java tool, and the Potter's Wheel project seems to have been abandoned in 2000.
There's a group at the Australian National University working on it now, as part of their data mining group, with a project called Febrl. That's in Python, though, and seems to be created by non-programmers. This looks like an interesting gap...
I like how they've avoided the deterministic approach to creating matching rules and are instead using hidden Markov models to determine standardization and linkage rules.
My experience tells me that this should be the way to go, since matching rules are really driven by the data anyway.
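To make that concrete, here's a toy sketch of the decoding step: hidden states are address fields, observations are coarse token classes, and Viterbi picks the most likely field sequence. All the probabilities below are invented placeholders, not Febrl's actual numbers; a trained model would supply the real ones.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy HMM: hidden states are address fields, observations are token
    # classes. All probabilities are invented placeholders for illustration;
    # real values would come from training.
    my @states = qw(HouseNum Street StreetType);
    my %start  = ( HouseNum => 0.8, Street => 0.15, StreetType => 0.05 );
    my %trans  = (
        HouseNum   => { HouseNum => 0.05, Street => 0.9,  StreetType => 0.05 },
        Street     => { HouseNum => 0.05, Street => 0.45, StreetType => 0.5  },
        StreetType => { HouseNum => 0.3,  Street => 0.4,  StreetType => 0.3  },
    );
    my %emit = (
        HouseNum   => { NUMBER => 0.9,  WORD => 0.05, TYPEWORD => 0.05 },
        Street     => { NUMBER => 0.05, WORD => 0.85, TYPEWORD => 0.1  },
        StreetType => { NUMBER => 0.01, WORD => 0.09, TYPEWORD => 0.9  },
    );

    # Map raw tokens to coarse observation classes.
    sub classify {
        my $tok = shift;
        return 'NUMBER'   if $tok =~ /^\d+$/;
        return 'TYPEWORD' if $tok =~ /^(st|street|ave|avenue|rd|road)$/i;
        return 'WORD';
    }

    # Standard Viterbi: track the best log-probability path to each state.
    sub viterbi {
        my @obs = map { classify($_) } @_;
        my (@v, @back);
        for my $s (@states) {
            $v[0]{$s} = log($start{$s}) + log($emit{$s}{$obs[0]} || 1e-10);
        }
        for my $t (1 .. $#obs) {
            for my $s (@states) {
                my ($best, $argmax);
                for my $prev (@states) {
                    my $score = $v[$t-1]{$prev} + log($trans{$prev}{$s} || 1e-10);
                    ($best, $argmax) = ($score, $prev)
                        if !defined $best || $score > $best;
                }
                $v[$t]{$s}    = $best + log($emit{$s}{$obs[$t]} || 1e-10);
                $back[$t]{$s} = $argmax;
            }
        }
        # Pick the best final state, then follow the backpointers.
        my ($final) = sort { $v[$#obs]{$b} <=> $v[$#obs]{$a} } @states;
        my @path = ($final);
        unshift @path, $back[$_]{ $path[0] } for reverse 1 .. $#obs;
        return @path;
    }

    my @tokens = qw(42 main st);
    my @tags   = viterbi(@tokens);
    print "$tokens[$_] => $tags[$_]\n" for 0 .. $#tokens;
    # Prints: 42 => HouseNum, main => Street, st => StreetType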
I'm going to try to follow the same approach, but using Perl tools, and maybe look into the bio-perl community for performance ideas. I'll post the results here as I find them.
Some initial tasks:

  • Find a good, efficient, Perl-centric hidden Markov model implementation that doesn't eat too much memory. I've looked at a couple; most of them drop down into C, and while I'm a competent C programmer, I think that will make the result less friendly overall.
  • Find enough good sample data that is in the public domain. If I'm going to train a model, I'm going to need names and addresses, so I've got to find a good US sample that isn't owned by anyone.
  • Train the model, run some tests, and post the results. (A sketch of the counting involved in training follows this list.)
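
For the training task, the supervised case is mostly just counting: tally transitions and emissions over hand-tagged records and normalize. A rough sketch below; the tagged examples are invented, and a real model would need smoothing to handle unseen tokens and transitions.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hand-tagged token sequences: [ token, field ] pairs. These examples
    # are invented; real training data would come from a public-domain
    # name-and-address sample.
    my @tagged = (
        [ [ '12', 'HouseNum' ], [ 'oak', 'Street' ], [ 'ave', 'StreetType' ] ],
        [ [ '7',  'HouseNum' ], [ 'elm', 'Street' ], [ 'st',  'StreetType' ] ],
    );

    my (%trans_count, %emit_count, %start_count);
    for my $seq (@tagged) {
        $start_count{ $seq->[0][1] }++;
        for my $i (0 .. $#$seq) {
            my ($tok, $tag) = @{ $seq->[$i] };
            $emit_count{$tag}{ lc $tok }++;
            $trans_count{ $seq->[$i-1][1] }{$tag}++ if $i > 0;
        }
    }

    # Normalize transition counts into maximum-likelihood probabilities.
    # %start_count and %emit_count would be normalized the same way for
    # the start and emission tables.
    for my $from (sort keys %trans_count) {
        my $total = 0;
        $total += $_ for values %{ $trans_count{$from} };
        printf "P(%s -> %s) = %.2f\n", $from, $_, $trans_count{$from}{$_} / $total
            for sort keys %{ $trans_count{$from} };
    }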

I'll see how this goes in a few days.


Lingua-EN-MatchNames

brianiac on 2004-03-24T16:55:53

This doesn't help with addresses, but I created the Lingua-EN-MatchNames module a few years ago to eliminate duplicate user records between security and groupware databases, and it worked quite well.
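
Typical use looks something like this (going from the synopsis; check the docs for the exact score semantics):

    use Lingua::EN::MatchNames;

    # name_match() takes (last, first, last, first) per the module's
    # synopsis and returns a numeric confidence score, or undef when
    # nothing matches.
    my $score = name_match( 'Harding', 'Mike', 'Harding', 'Michael' );
    print defined $score ? "match, score $score\n" : "no match\n";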

Re:Lingua-EN-MatchNames

malbh on 2004-03-24T18:46:08

Thanks Brian,

Yes, I looked at it, as well as Kim Ryan's excellent Lingua::EN::NameParse and Lingua::EN::AddressParse modules. I plan to use them once I get to the blocking-window level of matches.
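
For what it's worth, here's roughly how I expect those parsers to slot in before matching (constructor options per each module's docs; I dump the components rather than guess at key names):

    use strict;
    use warnings;
    use Data::Dumper;
    use Lingua::EN::NameParse;
    use Lingua::EN::AddressParse;

    # parse() returns a true error flag on failure; components() returns
    # a hash of labeled parts. Key names vary, so just dump what comes back.
    my $name_parser = Lingua::EN::NameParse->new( auto_clean => 1 );
    unless ( $name_parser->parse('MR M HARDING') ) {
        my %parts = $name_parser->components;
        print Dumper( \%parts );
    }

    # AddressParse needs a country to pick its grammar.
    my $addr_parser = Lingua::EN::AddressParse->new( country => 'US' );
    unless ( $addr_parser->parse('12 Main St Anytown NY 10036') ) {
        my %parts = $addr_parser->components;
        print Dumper( \%parts );
    }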

The problem, of course, is that when you have many millions of records, the turnaround time for a really close look at each record just gets too large. So what I think I need to do is determine how to split the records of large datasets into groups that can be compared in an economical amount of time (the blocking window) and make use of grid- or LAM-style computing to do the computationally intensive work needed for the merging.
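
To make the blocking idea concrete, here's a sketch using a cheap key (Soundex of surname plus a ZIP prefix, both arbitrary choices on my part) so the expensive comparison only runs within a bucket:

    use strict;
    use warnings;
    use Text::Soundex;   # phonetic key; any cheap, stable key would do

    # Example records: [ id, surname, zip ]. Data invented for illustration.
    my @records = (
        [ 1, 'Harding', '94105' ],
        [ 2, 'Hardng',  '94105' ],
        [ 3, 'Smith',   '10001' ],
    );

    # Blocking: bucket records by Soundex(surname) + first 3 ZIP digits.
    # Only pairs sharing a bucket get the expensive comparison.
    my %blocks;
    for my $rec (@records) {
        my $key = ( soundex( $rec->[1] ) // '' ) . '|' . substr( $rec->[2], 0, 3 );
        push @{ $blocks{$key} }, $rec;
    }

    for my $key ( sort keys %blocks ) {
        my @bucket = @{ $blocks{$key} };
        next if @bucket < 2;
        for my $i ( 0 .. $#bucket - 1 ) {
            for my $j ( $i + 1 .. $#bucket ) {
                # this is where the expensive per-pair comparison would run
                print "candidate pair: $bucket[$i][0] and $bucket[$j][0]\n";
            }
        }
    }

Each bucket is independent of the others, which is what should make the grid-style distribution of the heavy comparison work straightforward.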

Your module does cover almost everything needed once we have a list of candidates to work on, so when I'm at that stage, I'll see if I can submit any changes to you.

Thanks, Mike Harding