I'm analyzing the content of some documents in order to find potential correlations between them. Breaking each document into individual words, stemming those words, and throwing out the stopwords gave me some 18,000 unique words from a 600-document corpus, with over 40% of words appearing only once in the corpus and almost 80% of the words appearing fewer than ten times.
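For anyone wanting to reproduce that kind of vocabulary profile, here is a minimal sketch. It assumes plain-text documents, uses a deliberately tiny stand-in stop list, and skips stemming entirely; the names `STOPWORDS`, `tokenize`, and `vocabulary_profile` are made up for illustration and are not from the application described above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be far longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def vocabulary_profile(documents, rare_cutoff=10):
    """Return (unique words, share seen exactly once, share seen fewer than rare_cutoff times)."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    total = len(counts)
    if total == 0:
        return 0, 0.0, 0.0
    once = sum(1 for c in counts.values() if c == 1)
    rare = sum(1 for c in counts.values() if c < rare_cutoff)
    return total, once / total, rare / total
```

Calling `vocabulary_profile(list_of_texts)` returns the vocabulary size plus the two fractions quoted above, which makes it easy to see how much a longer stop list or a frequency cutoff would shrink the word list.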
I knew my existing list of stop words was insufficient, but I really don't want to pick out the top 1000 or 2000 useful words from a list of 18,000, especially because this is a test corpus of perhaps 7% of the actual corpus.
Now I start to wonder if some of the lexical analysis modules would be useful in picking out only the nouns (unstemmed) and verbs (stemmed) from a document, rather than taking all of the words of a document as significant. The correlation algorithm appears sound, but if I can throw out lots of irrelevant data, I can improve the performance and utility of the application.
Any thoughts?
What sort of correlations are you looking for? I was focusing on detecting plagiarism at one point and found that breaking things down by sentence was more useful. As I don't know what you're trying to do, I've no idea if that link will prove useful.
Re:Breaking it down by word?
chromatic on 2005-09-02T23:39:12
Detecting plagiarism is much more specific than this problem. I want to be able to analyze a document and suggest a handful of other documents that, from their intertextual context at least, appear to discuss similar things. For example, a tutorial about creating homemade pizza dough is probably not very similar to a journal entry about linguistic analysis, but probably is similar to an article discussing different types of pizza ovens.
I'm trying to answer the question "Do the relevant topics of these documents overlap?", not necessarily "Do these documents share a common ancestry?"
Re:Breaking it down by word?
Ovid on 2005-09-03T00:04:22
I see. That makes sense. Perhaps a heuristic approach is best as there are few algorithms likely to realize that "July heat wave" and "dog days of summer" might be related, though when the text is long enough idiomatic expressions are likely to come out in the wash.
My initial thought would be to try to score words in documents. Take the words that appear most frequently and weight their frequency in the document by their infrequency in the language. Thus, the least common words that appear the most often would generate a higher score. You could do a first-pass straight comparison or try to go further with synonyms (or even antonyms). You could have even more fun by trying to account for misspellings.
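One common way to express that kind of scoring is TF-IDF weighting. The sketch below is an assumption about how the idea might be coded, not anyone's actual implementation: it takes documents that are already tokenized, and it approximates "infrequency in the language" with infrequency across the corpus itself, since no language-wide frequency table is assumed to be available. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # Frequency within the document, scaled by rarity across the corpus.
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

A word that occurs in every document scores zero, so this weighting also acts as a crude automatic stop list.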
:)
Re:Breaking it down by word?
tinman on 2005-09-03T16:27:16
WordNet::Similarity includes an algorithm by Lesk (from a 1986 paper that Citeseer and Scholar don't seem to find a reference for), which uses something they call gloss overlap.
What Lesk basically does is what Ovid suggests, except that the calculation is performed on WordNet glossary definitions rather than on entire documents.
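A much-simplified sketch of the gloss overlap idea: count the content words that two definitions share. This is not the WordNet::Similarity implementation, and the glosses below are made up for illustration rather than quoted from WordNet; `gloss_overlap`, `TOY_GLOSSES`, and `GLOSS_STOPWORDS` are hypothetical names.

```python
# Small stop list so that function words do not count as overlap.
GLOSS_STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "or", "the"})

def gloss_overlap(gloss_a, gloss_b):
    """Count the distinct non-stopword words that two glosses have in common."""
    words_a = set(gloss_a.lower().split()) - GLOSS_STOPWORDS
    words_b = set(gloss_b.lower().split()) - GLOSS_STOPWORDS
    return len(words_a & words_b)

# Made-up glosses for illustration only; they are not quoted from WordNet.
TOY_GLOSSES = {
    "oven": "enclosed chamber for baking or heating food",
    "dough": "flour mixture kneaded before baking into bread",
    "verb": "word that expresses an action or state",
}
```

With these toy glosses, "oven" and "dough" share the word "baking", while "oven" and "verb" share nothing once the stop words are removed.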
Re:Breaking it down by word?
saorge on 2005-09-13T16:20:58
Your initial thought is called Zipf's law in information retrieval.
Re:Breaking it down by word?
ziggy on 2005-09-04T02:23:43
So, something similar to Amazon's "statistically improbable phrases"?
Re:LSI
sky on 2005-09-03T08:40:31
Search::ContextGraph
You could use Ted Pedersen's WordNet::Similarity modules. They attach a numerical value to any pair of words and can help you identify which words are related, and how closely. I prefer jcn (the Jiang-Conrath method) myself, but there are 10 different techniques on offer.
Also, would it not make sense to use a POS (part-of-speech) tagger before you remove stop words and so on? I can't recommend a Perl-based POS tagger offhand, since most of my work in this area is done in Java, but I'm pretty sure they must exist. Do a POS tag (to find the noun, verb, and adjective contexts for individual sentences), then do what you're doing now. This way, you would get both the word and its part-of-speech tag. For example, "like" has at least seven different contexts in which it may be used, ranging from verb to adjective; "fling" could be either a noun or a verb, and so on. That might give you a bit more granularity to work with.
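To make the tag-then-filter step concrete, here is a rough sketch. It assumes some tagger has already produced (word, tag) pairs using Penn Treebank tags, where noun tags start with "NN" and verb tags start with "VB"; no particular tagger is implied, and `keep_nouns_and_verbs` is a hypothetical helper.

```python
def keep_nouns_and_verbs(tagged_tokens):
    """Keep only nouns and verbs from (word, Penn Treebank tag) pairs.

    Returns (lowercased word, "noun" or "verb") pairs, so the same spelling
    used as two different parts of speech stays distinguishable.
    """
    kept = []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):      # NN, NNS, NNP, NNPS
            kept.append((word.lower(), "noun"))
        elif tag.startswith("VB"):    # VB, VBD, VBG, VBN, VBP, VBZ
            kept.append((word.lower(), "verb"))
    return kept

# Hand-tagged sample sentences; a real tagger would supply these pairs.
SAMPLE_ONE = [("I", "PRP"), ("like", "VBP"), ("pizza", "NN")]
SAMPLE_TWO = [("Ovens", "NNS"), ("like", "IN"), ("this", "DT")]
```

In the first sample "like" is kept as a verb; in the second it is tagged as a preposition and dropped, which is the extra granularity a plain word list cannot give you.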
Re:Many solutions
saorge on 2005-09-13T16:38:40
Sorry for the poor layout of my comment; I forgot to select Plain Old Text or to preview my comment.