While doing some research a few days ago I found myself reading a paragraph that seemed very familiar. In digging around, I found the other news story I was looking for. Several sentences were duplicates and several were subtly altered, but it was the same paragraph. The stories, I might add, were over a year apart and were by different authors.
While it could very well be that this particular news source has an internal practice of allowing reporters to borrow copy from one another without attribution, I'm not aware that this is a common practice (of course, I am not a journalist, either). Further, with all of the recent high-profile plagiarism cases, it seems less likely than ever that news organizations would tolerate this practice. In trying to research whether or not the reporter in question had plagiarized any other work, I quickly found that, while it's easy to compare two paragraphs, it's not easy to compare one story to hundreds of others. Automation is the way to go.
Many of the tools I found on the CPAN seemed too low-level for this type of work, so I started writing Text::Plagiarized. It's not on the CPAN, nor is it available for download. However, after a bit of research, I found it was surprisingly easy to do a basic analysis (well, the code is easy to use; I threw away three implementations before I stumbled on the "easy" one).
my $text = Text::Plagiarized->new;
$text->original($original_text);
foreach my $comparison (@comparison_texts) {
    $text->comparison($comparison);
    $text->analyze;
    print $text->percent, $/;    # percent of matching sentences
    if ($text->percent > $some_threshold) {
        # arrayref of array refs with [$sentence, $possible_match]
        print Dumper($text->matches);
    }
}
You can tweak how "sensitive" you want the matching to be, but so far, it handles fuzzy matching like the following two texts:
my ($text1, $text2) = (<<"END_FIRST", <<"END_SECOND");
This is some text that might be plagiarized.
Whether or not it has been can be difficult for a simple program to detect.
The writer may simply change a few words here and there.
He or she might add some extra punctuation or just throw in an extra sentence or two.
However they do it, there is usually some subtle difference between the original and the copy.
END_FIRST
This text might be plagiarized.
Whether or not it has been can be difficult for a simple program to detect.
The writer can simply change a few words here and there or they might add some extra punctuation.
However they do it, there are usually subtle differences between the original and the copy.
END_SECOND
At the default threshold (80% match), only the first sentence in those paragraphs fails to match. Merely lowering the threshold to 74% will pick up that first sentence.
For some reason I feel a bit uncomfortable about releasing this. I'm not sure why. In any event, it's not done, so I have time to think about this. I don't account for misspellings or stemming, the interface might change, and it seems fairly fragile in odd corner cases.
Perhaps you're looking at only one aspect of how a module like this may be used. Yes, it can be used for detecting plagiarism, should the user choose to do so. But it can also be used as a similarity detection metric, which has uses far beyond seeing if journalists borrowed copy or if students cribbed essays.
Related articles? Contextual matching? I can think of a few more uses for this type of module. I'd actually like to see how you do it, out of academic interest.
Re:a change of name ?
da on 2005-06-06T14:12:13
Good idea. Text::Related would be one possibility.
This would be perfect for an open-source google news.
I'd love to use the code, if you ever decide to release it.
Re:a change of name ?
Ovid on 2005-06-06T15:45:15
Because of the way the code is designed, I seriously doubt that it could be used for related articles or contextual matching. It's slow, but that's because of the algorithm I chose (which, surprisingly, turned out to be faster than some of the other options I was looking at). It does a sentence-by-sentence comparison to determine "how far apart" two sentences are in terms of insertions, deletions and replacements. If they're close enough (under the user-defined threshold), then a match is reported. It's the "tortoise" of matching versus the "hare." It will see things the hare won't, but it will take longer to do so.
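For readers who want something concrete, here is a rough sketch of that kind of sentence-by-sentence comparison. It is not the Text::Plagiarized code, which hasn't been released. Everything in it is an assumption made for illustration: the function names (edit_distance, words, sentences, similarity, matches), the choice to count edits on whole words rather than characters, the naive sentence splitter, and the way the distance is turned into a percentage. Because of that, it should not be expected to reproduce the 80% and 74% figures quoted in the post.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Edit (Levenshtein) distance between two array refs of words: the minimum
# number of insertions, deletions and replacements turning one into the other.
# ASSUMPTION: the unit of comparison is a word; the real module may differ.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. scalar @$t);
    for my $i (1 .. scalar @$s) {
        my @curr = ($i);
        for my $j (1 .. scalar @$t) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # replacement (or exact match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Lowercase a sentence and split it into words, dropping punctuation.
sub words {
    my ($sentence) = @_;
    return [ grep { length } split /[^\w']+/, lc $sentence ];
}

# Naive sentence splitter: break after ., ! or ? followed by whitespace.
sub sentences {
    my ($text) = @_;
    return grep { /\S/ } split /(?<=[.!?])\s+/, $text;
}

# Percentage similarity: 100 means identical word sequences.
# ASSUMPTION: distance is normalised by the longer sentence's word count.
sub similarity {
    my ($x, $y) = @_;
    my ($wx, $wy) = (words($x), words($y));
    my $longest = max(scalar @$wx, scalar @$wy);
    return 100 unless $longest;
    return 100 * (1 - edit_distance($wx, $wy) / $longest);
}

# Every [$sentence, $possible_match] pair at or above the threshold.
# Each sentence is compared with each other sentence, hence the slowness.
sub matches {
    my ($original, $copy, $threshold) = @_;
    my @found;
    for my $o (sentences($original)) {
        for my $c (sentences($copy)) {
            push @found, [$o, $c] if similarity($o, $c) >= $threshold;
        }
    }
    return \@found;
}
```

Calling matches($text1, $text2, 80) on two texts returns the sentence pairs that clear an 80% threshold under this particular scoring. The nested loops make the cost grow with the product of the two sentence counts, which is one way to see why this style of matching is the tortoise.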
I should add that I also wondered if I had chosen a bad name, but I can't think of what else this module might be used for. I actually thought about calling it Text::Compare, but as it turns out, a module by that name was released two weeks ago. I actually tried to use that module, but it turned out to not be suitable for my needs.
Re:a change of name ?
chaoticset on 2005-06-07T03:56:47
Perhaps Text::Rephrase?
Re:You can bet...
iburrell on 2005-06-07T20:27:40
Actually, the term papers people already use a fairly algorithm. I used to work at a company which did something similar. The basic algorithm was similar to rsync, using hashes for blocks of tokens.
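As a rough illustration of "hashes for blocks of tokens", here is a minimal sketch. It is a guess at the general idea, not the code that company used. The names block_hashes and shared_fraction, the use of overlapping blocks, the block size, and MD5 as the hash are all assumptions made for the example. rsync itself works on fixed-size blocks with a rolling checksum, which this sketch does not attempt to reproduce.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hash every run of $size consecutive tokens in $text.
# ASSUMPTION: blocks overlap (a sliding window) and tokens are lowercased words.
sub block_hashes {
    my ($text, $size) = @_;
    my @tokens = grep { length } split /[^\w']+/, lc $text;
    my %hashes;
    for my $i (0 .. @tokens - $size) {
        my $block = join ' ', @tokens[$i .. $i + $size - 1];
        $hashes{ md5_hex(encode_utf8($block)) } = 1;
    }
    return \%hashes;
}

# Fraction of the copy's distinct blocks that also occur in the original.
# Returns 0 when the copy is too short to contain a single block.
sub shared_fraction {
    my ($original, $copy, $size) = @_;
    my $seen  = block_hashes($original, $size);
    my $probe = block_hashes($copy, $size);
    my $total = keys %$probe;
    return 0 unless $total;
    my $hits = grep { $seen->{$_} } keys %$probe;
    return $hits / $total;
}
```

The appeal of this approach is speed: the hashes for each document are computed once, and comparing a new document against many stored ones becomes a series of hash lookups instead of pairwise sentence comparisons. The trade-off is that a single changed word spoils every block that contains it.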