parsers and stringy stuff

tinman on 2004-03-24T17:09:41

Playing around with JavaCC (for some weird reason, that URL is HTTPS). The generated code is slightly mangled, but it really does a great job.

This whole Token business is beginning to depress me. I'm trying to wriggle in a few of my custom filters before tokenization even begins in the Lucene sample, and the classes are a bit, err... complicated. Oh, well. If it were easy, it wouldn't be this much fun trying to figure it all out.

One other note: a coworker (well, someone else at the university) got a Tomcat cluster working. mod_jk2 in front serving requests and session replication within the cluster. Cool stuff (and it's all FREE. I know the expensive app servers can replicate sessions ;)


ANTLR

rafael on 2004-03-24T19:06:02

If you're busy with Java parser generators, you might want to look at the excellent ANTLR as well.

Re:ANTLR

brianiac on 2004-03-25T01:16:01

I looked at a number of parser generators (flex, GOLD, ANTLR, ...) to mark tokens up as XML. I was struck by two things in my search:

  1. Apparently, I am the first person in the history of the universe to want to do this.
  2. Given the amount of time PGs have been around, plus the number and size of their communities, there do not seem to be any comprehensive grammar repositories (try looking for Perl, VBScript, TSQL, JavaScript, ...).
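
For anyone wondering what "marking tokens up as XML" might look like, here is a minimal sketch using only java.util.regex rather than any of the parser generators named above. The token classes (number, ident, op) and the class name TokenMarkup are made up for illustration; a real language would need a proper lexer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy lexer that wraps each token in an XML element named after its type. */
class TokenMarkup {
    // Hypothetical token classes for a tiny expression language.
    // The named groups double as the XML element names.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<ws>\\s+)|(?<number>\\d+)|(?<ident>[A-Za-z_]\\w*)|(?<op>[-+*/=()<>])");
    private static final String[] TYPES = {"number", "ident", "op"};

    static String markUp(String input) {
        StringBuilder out = new StringBuilder("<tokens>");
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            // Anchor the next match at the current offset.
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            pos = m.end();
            // Whitespace matches the "ws" group, which is not in TYPES, so it is dropped.
            for (String type : TYPES) {
                String text = m.group(type);
                if (text != null) {
                    out.append('<').append(type).append('>')
                       .append(escape(text))
                       .append("</").append(type).append('>');
                }
            }
        }
        return out.append("</tokens>").toString();
    }

    // Escape the characters that are significant in XML element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(markUp("x = 42 + y"));
    }
}
```

With a generated lexer the loop would presumably look much the same: ask for the next token, then emit an element named after its kind.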

Re:ANTLR

rafael on 2004-03-25T07:48:05

About grammar repositories: the style and shape of a grammar are influenced by the constraints of the parser generator (mostly LL vs LR), which explains why there is no generic grammar repository. It would be impractical, since each grammar would need to be adapted to each parser generator.
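
To make the LL vs LR point concrete, here is a small sketch. The left-recursive rule in the comment is the shape an LR tool such as yacc typically accepts as-is; classic LL tools and hand-written recursive descent need it reshaped into a loop. The class LlExpr and its toy grammar (numbers and subtraction only) are made up for illustration.

```java
/**
 * Grammar as an LR (yacc-style) tool would typically take it:
 *     expr : expr '-' NUMBER | NUMBER ;
 * The same language reshaped for LL / recursive descent:
 *     expr : NUMBER ( '-' NUMBER )* ;
 */
class LlExpr {
    private final String src;
    private int pos = 0;

    LlExpr(String src) {
        // Strip spaces up front so the parser only sees digits and '-'.
        this.src = src.replace(" ", "");
    }

    // expr : NUMBER ( '-' NUMBER )* ;
    // The loop folds operands left to right, so subtraction stays left-associative.
    int expr() {
        int value = number();
        while (pos < src.length() && src.charAt(pos) == '-') {
            pos++;
            value -= number();
        }
        return value;
    }

    // NUMBER : one or more decimal digits.
    int number() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("number expected at offset " + pos);
        }
        return Integer.parseInt(src.substring(start, pos));
    }

    static int eval(String s) {
        LlExpr parser = new LlExpr(s);
        int value = parser.expr();
        if (parser.pos != parser.src.length()) {
            throw new IllegalArgumentException("trailing input at offset " + parser.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("10 - 3 - 2"));
    }
}
```

Coding the left-recursive rule directly as a method would have expr() call itself before consuming any input and never return, which is why the two styles of grammar do not port to each other without rework.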

And about a grammar for Perl 5: there is no such thing as a context-free grammar for Perl 5. You can always look at perly.y in the perl source distribution, but the tokenizer is the scary part. "Nothing but perl can parse Perl."

Re:ANTLR

tinman on 2004-03-25T23:51:31

Oh, yes. Thanks. I knew about ANTLR and friends from earlier attempts to write a parser for a not-so-simple configuration file. But it seems the Lucene project uses JavaCC, which I had never used before. I was wondering if it would be as good, and it seems so. Lucene is under the Apache Software Foundation, so perhaps they were turned off by ANTLR's vague-sounding license terms? Maybe the guy who wrote the code initially didn't know about ANTLR? I'm just speculating :)