Llama's frequency of "perl"

brian_d_foy on 2005-10-31T05:18:58

According to my word count program, "perl" is only the 17th most frequent word in the Llama, 4th Edition.

            the 6494
              a 2862
             to 2710
             of 2235
           that 1689
             in 1658
            you 1571
             is 1518
            and 1394
             it  956
            for  926
             if  917
           this  832
             as  791
             be  706
             or  671
           perl  660
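
The word count program itself isn't shown in the post. As a rough illustration only, here is a minimal Python sketch of that kind of count. The tokenizing rule (lowercase runs of letters and apostrophes), the function names, and the sample sentence are all assumptions, and the column widths are merely inferred from the listing above.

```python
import re
from collections import Counter


def word_counts(text):
    """Count case-folded words; a 'word' here is a run of letters or apostrophes."""
    # Assumption: this tokenizing rule is a guess, not the rule the original script used.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def format_report(counts, limit=None):
    """Format counts most frequent first, one right-aligned word and count per line."""
    lines = []
    for word, count in counts.most_common(limit):
        # Widths (15 for the word, 4 for the count) are inferred from the listing.
        lines.append(f"{word:>15} {count:4d}")
    return "\n".join(lines)


# Hypothetical sample text; the real input would be the book's manuscript.
sample = "The llama sees the perl. Perl sees the llama."
print(format_report(word_counts(sample)))
```

Folding case before counting means "Perl" and "perl" land in the same bucket, which is presumably what the listing does too, since it shows only the lowercase form.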


We're not doing much better in the re-write of the Alpaca, where "perl" has slipped to 21st. We still have time to change that, but from the looks of it I'll have to include a couple of paragraphs of just "perl perl perl ...".

            the 4977
             to 1952
              a 1917
             of 1340
            you 1331
             in 1083
            and 1081
             is 1048
           that  952
            for  741
             as  636
             it  608
           this  605
            can  543
             if  532
           with  444
             be  442
             an  383
             or  351
           your  351
           perl  321


I could have written something to go through all of our magazine columns, but then I'd have to use a module or something.


Yes but...

saorge on 2005-10-31T09:01:45

Perl is the first relevant word, because the others are "stop words". There are some modules on the CPAN for working with these stop words. There are also a lot of modules for indexing text (even if your first intention isn't really to index text in the search-engine sense). The number of occurrences of a term is often stored in the database, because this value is used to compute the ranking of documents after a search (the best known of these methods is TF.IDF). So it could be simpler to query the database. perlindex is a script available on the CPAN that indexes the Perl documentation available on your hard disk. One option of this script (-d, a threshold for the occurrences) asks for the total number of occurrences. On my box, "head" is the most frequently used word.
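
To make the stop-word point concrete, here is a small Python sketch that walks the two listings from the post and returns the first word that is not on a stop list. The stop list is a hypothetical, hand-picked set that covers only the words in these two listings; a real stop-word list would be much longer. The function and variable names are likewise made up for the example.

```python
# (word, count) pairs copied from the two listings in the post, most frequent first.
LLAMA = [
    ("the", 6494), ("a", 2862), ("to", 2710), ("of", 2235), ("that", 1689),
    ("in", 1658), ("you", 1571), ("is", 1518), ("and", 1394), ("it", 956),
    ("for", 926), ("if", 917), ("this", 832), ("as", 791), ("be", 706),
    ("or", 671), ("perl", 660),
]
ALPACA = [
    ("the", 4977), ("to", 1952), ("a", 1917), ("of", 1340), ("you", 1331),
    ("in", 1083), ("and", 1081), ("is", 1048), ("that", 952), ("for", 741),
    ("as", 636), ("it", 608), ("this", 605), ("can", 543), ("if", 532),
    ("with", 444), ("be", 442), ("an", 383), ("or", 351), ("your", 351),
    ("perl", 321),
]

# Assumption: a hand-picked stop list, just large enough for these two listings.
STOP_WORDS = {
    "the", "a", "an", "to", "of", "that", "in", "you", "your", "is", "and",
    "it", "for", "if", "this", "as", "be", "or", "can", "with",
}


def first_content_word(ranked, stop_words=STOP_WORDS):
    """Return the first (word, count) pair whose word is not a stop word, or None."""
    for word, count in ranked:
        if word not in stop_words:
            return word, count
    return None


print(first_content_word(LLAMA))
print(first_content_word(ALPACA))
```

With this stop list, both calls land on the "perl" entry, which is the sense in which it is the first relevant word in each book.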

Re:Yes but...

brian_d_foy on 2005-10-31T16:39:40

The joke doesn't work as well when "perl" is at the top of the list for both books.

I don't think it's simpler to make a database. My script was only 10 lines, including blank ones :)

Re:Yes but...

n1vux on 2005-10-31T21:44:15

Dear brian,

Sorry I went pedantic on your joke along with saorge. The "perl perl perl ..." graf you threatened to add to the next book is funny. I just got caught up in search-boffin blather about stopwords ... a hot-button thing.

- Bill

Stopwords and relevance

n1vux on 2005-10-31T12:49:12

Right on. Perhaps even stronger: Perl is the first substantive word, relevant or otherwise. But "stopwords" is the correct searchwonk jargon for this. I suspect philologists have their own term; Larry probably would have a word that means "not a helper verb, particle, article, or pronoun". Perl is the first substantive noun, proper or common, on both lists.

Be happy!

-- Bill
former search boffin