Siesta

chaoticset on 2003-07-22T14:07:15

Wrote a letter frequency analyzer, because I wanted to. (Actually, it's the first step in a not-altogether ambitious project to write something that attempts to solve simple shift ciphers automatically.) Did a baseline frequency on a dictionary file and used Storable to keep those stats. (I know, I need something better, but it was the biggest wad of English text I had available that I was reasonably sure wasn't riddled with errors or "stylistic" typos. Any suggestions for a better text are appreciated.)
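For readers following along, here is a minimal sketch of the idea described above: build a letter-frequency table from a baseline text, then try all 26 shifts on a ciphertext and keep the one whose decryption looks most like the baseline. It is written in Python rather than the Perl used for the original analyzer, and the function names (`letter_frequencies`, `shift`, `guess_shift`) and the squared-difference score are assumptions made for illustration, not the poster's actual code.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in text."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in ALPHABET}
    return {ch: counts[ch] / total for ch in ALPHABET}


def shift(text, k):
    """Shift every ASCII letter k places forward; leave everything else alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + k) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + k) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def guess_shift(ciphertext, baseline):
    """Return the key (0-25) whose decryption is closest to the baseline table."""
    def distance(k):
        # Undo a shift of k, then compare letter frequencies with the baseline.
        observed = letter_frequencies(shift(ciphertext, -k))
        return sum((observed[ch] - baseline[ch]) ** 2 for ch in ALPHABET)

    return min(range(26), key=distance)
```

Given a baseline table built from a large English text, `shift(ciphertext, -guess_shift(ciphertext, baseline))` would return the most likely plaintext. Very short ciphertexts carry too few letters for the frequencies to be reliable, so the guess can be wrong for them.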

Started learning how to footle with CGI::Application and HTML::Template, and I'm pleasantly surprised. One little hangup -- not being able to make clickable images -- is not such a big deal, and I suspect there's a way to do it that I haven't discovered yet.

No other news...

Project Gutenberg

gizmo_mathboy on 2003-07-22T14:58:58

If you need any texts of any size check out Project Gutenberg.

Re:Project Gutenberg

chaoticset on 2003-07-22T15:31:51
Any specific ones you'd recommend there? I wouldn't be against using a group of texts -- the process takes about two minutes with a 1.3 meg text file, and I want this to be really good, so I'd be willing to sink a couple of hours into it. I'd like the stats to be much more precise than normal, so I'd be all for processing ten or twelve texts. It's just that the ones I happen to have on my drive at home (forgive me) aren't normal. Perl docs, while reasonably grammatical and correctly spelled, aren't normal text. Neither is Trek fanfic (yes, I have a few, yes, I've read a few, and no, I'm not ashamed). Neither is _The Crying of Lot 49_, unfortunately.
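
One way to combine ten or twelve texts into a single baseline is to add up the raw letter counts from every text and normalise only once at the end. The sketch below is in Python and takes the texts as a list of strings; in practice each string would be the contents of one file. The names `count_letters` and `pooled_frequencies` are hypothetical.

```python
import string
from collections import Counter


def count_letters(text):
    """Count occurrences of each letter a-z in text, ignoring case."""
    return Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)


def pooled_frequencies(texts):
    """Sum raw letter counts over all texts, then normalise once."""
    total = Counter()
    for text in texts:
        total.update(count_letters(text))
    n = sum(total.values())
    if n == 0:
        return {}
    return {ch: total[ch] / n for ch in string.ascii_lowercase}
```

Summing counts weights each text by its length, so one very long book can dominate the table. Averaging the per-text frequency tables instead would give every text equal weight; which is preferable depends on how much the individual texts are trusted.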

Re:Project Gutenberg

gizmo_mathboy on 2003-07-22T21:34:37
Funny you should ask. About a month ago on the Perl Quiz of the Week mailing list, one of the quizzes concerned repeated substrings. One of the folks used the following text (extracted from an email):
-----------------------
'The Life and Opinions of Tristram Shandy, Gentleman' by Laurence Sterne, which when downloaded weighs in at around 1 Mo (as compared with 27 Ko for Dan Schmidt's US constitution).
-----------------------

The location of these (from another email):
http://www.dfan.org/constitution.txt
http://www.ibiblio.org/gutenberg/etext97/shndy10.txt

Enjoy. However, for frequency analysis, just about anything on Project Gutenberg might work. Well, maybe not the chromosomes.

Source corpus (corpi?)

dws on 2003-07-22T16:31:37

If you're intending to solve short newspaper ciphers, consider that they tend to be quotations. Doing a frequency analysis on a set of quotes files might not be a bad idea. Having a separate distribution for "First letter of first word in a sentence, by wordlength" might help gain a toehold.
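
A rough sketch of that separate distribution, in Python: for the first word of each sentence, tally its first letter under the word's length. The sentence splitting here is deliberately naive (break after `.`, `!` or `?` followed by whitespace), contractions are cut at the apostrophe, and the function name `first_letter_by_length` is an assumption for illustration.

```python
import re
from collections import Counter, defaultdict


def first_letter_by_length(text):
    """Map word length -> Counter of first letters, using only the
    first word of each sentence."""
    table = defaultdict(Counter)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # First run of ASCII letters in the sentence is taken as its first word.
        match = re.search(r"[A-Za-z]+", sentence)
        if match:
            word = match.group().lower()
            table[len(word)][word[0]] += 1
    return table
```

With a table built from a quotes file, `table[3].most_common(3)` would list the most frequent first letters of three-letter sentence openers, which can then be matched against a cipher's opening word of the same length.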