Rant against wiki parser writers...

Matts on 2003-02-19T23:07:14

This rant is inspired by me just looking at the source code for Text::Tiki, but I certainly don't intend to single that parser out...

Why is it that wikitext [1] parser writers foist upon us broken, crappy regexp-based parsers that break at the slightest deviation from the spec, don't treat the document as a structure, and truly believe that they are just parsing "text" and only need to produce "text"? What if I don't want to produce HTML but actually want to *do* something with that data?

Repeat after me: "s///g is not a parser!"

Some day I'm going to find time to finish off Text::WikiFormat::SAX and show people what it's all about.

Rant over.

[1] And this includes WikiText, TikiText, UseModText and all the various flavours that I've had the misfortune to look at the source code of.


Someone write Wiki::TextParser kplsthnx

Kake on 2003-02-20T01:31:45

Must admit to being an idiot in that it took me quite a while to figure out what thingy to click to reply to this.

We talked about this on IRC and I agreed with you. I went and looked at HTML::Parser, which I think we figured out was a good model for what we want to be able to do with WikiText. It's all in XS, so I ran away. Bad move? Me no spik C.

Kake

Better Alternatives???

jcavanaugh on 2003-02-20T01:42:34

Have you looked at the source for TWiki?? It's regexp-based as well.

I realize that regexp-based parsing/expanding is a primitive mechanism and painfully slow at times. However, it's also pretty darn powerful.

I'm interested in your thoughts on how to make a better parser/renderer for something like TWiki a reality.

--John Cavanaugh

Re:Better Alternatives???

sbwoodside on 2003-02-20T07:13:01

Powerful? Sure, in that it can do a lot. No, in that it doesn't help the software understand the structure of the data at all. Regex is a very limited language.

Re:Better Alternatives???

Matts on 2003-02-20T08:35:25

The problem with TWiki's parser (and all the other ones) is that they all look something like this:
  $text =~ s/someformatting1/<somehtml1>$1<\/somehtml1>/;
  $text =~ s/someformatting2/<somehtml2>$1<\/somehtml2>/;
  $text =~ s/someformatting3/<somehtml3>$1<\/somehtml3>/;
  $text =~ s/someformatting4/<somehtml4>$1<\/somehtml4>/;
  $text =~ s/someformatting5/<somehtml5>$1<\/somehtml5>/;
Which is great if you want HTML, but what if you want to parse the TWiki text to put the data into a semantic search engine (where titles or bold text might have more relevance)? Then you have to parse once to HTML and then parse the HTML, and that I think is a broken model.

The parsers should be written as frontend+backend - where the frontend basically tokenises the Wiki text and the (default) backend turns those tokens into HTML. But another backend might do something completely different.
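To make the frontend+backend split concrete, here is a minimal sketch in Perl. It is not how Text::WikiFormat::SAX or any real wiki engine works: the markup is a made-up subset ("= Heading =" lines and '''bold''' spans), and every name in it (tokenize, to_html, to_search_weights, the weights themselves) is hypothetical. The frontend still uses regexes, but only to recognise tokens, and it never emits HTML itself.

```perl
use strict;
use warnings;

# Frontend: turn wiki text into a flat list of [type, text] tokens.
# Assumed toy markup: "= Heading =" lines and '''bold''' spans only.
sub tokenize {
    my ($wiki) = @_;
    my @tokens;
    for my $line (split /\n/, $wiki) {
        if ($line =~ /^=\s*(.+?)\s*=\s*$/) {
            push @tokens, [ heading => $1 ];
            next;
        }
        # The capturing group makes split return the bold spans too.
        for my $chunk (split /('''.+?''')/, $line) {
            next if $chunk eq '';
            if ($chunk =~ /^'''(.+)'''$/) {
                push @tokens, [ bold => $1 ];
            }
            else {
                push @tokens, [ text => $chunk ];
            }
        }
        push @tokens, [ break => '' ];
    }
    return @tokens;
}

sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Backend 1 (the "default" one): tokens to HTML.
my %html_for = (
    heading => sub { '<h1>' . escape_html($_[0]) . "</h1>\n" },
    bold    => sub { '<b>' . escape_html($_[0]) . '</b>' },
    text    => sub { escape_html($_[0]) },
    break   => sub { "<br>\n" },
);

sub to_html {
    return join '', map { $html_for{ $_->[0] }->( $_->[1] ) } @_;
}

# Backend 2: word weights for a search index, where headings and
# bold text count for more. The weights are arbitrary.
sub to_search_weights {
    my %weight_of = (heading => 5, bold => 2, text => 1);
    my %score;
    for my $token (@_) {
        my ($type, $value) = @$token;
        my $weight = $weight_of{$type} or next;   # skip structural tokens
        for my $word ($value =~ /(\w+)/g) {
            $score{ lc $word } += $weight;
        }
    }
    return \%score;
}

# Both backends consume the same token list; the text is parsed once.
my @tokens = tokenize("= Parsers =\nUse a '''real''' parser, not s///g.");
print to_html(@tokens);
my $weights = to_search_weights(@tokens);
print "$_: $weights->{$_}\n" for sort keys %$weights;
```

The point of the sketch is only that the search backend never sees HTML: adding a third backend means writing another consumer of the token list, not another pass over rendered output.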

The model I'm talking about is the one used by Text::WikiFormat::SAX, but people are very afraid of proper parsers (witness how long it took people to switch to HTML::Parser from regexp-based parsers), because they mean you have to think of your source as data rather than text. Plus Text::WikiFormat::SAX is broken, mostly because nobody uses it so I have no incentive to fix it ;-)

Three Reasons

chromatic on 2003-02-20T02:05:47

  • Writing a proper parser is hard
  • We're not as smart as you are
  • A terrible regex implementation that gets the job done pretty well today is a heckofalot better than a beautiful, perfect event-based parser that isn't here yet

I'm only about halfway kidding.

Re:Three Reasons

Matts on 2003-02-20T08:28:29

I agree with all your points but the second one ;-)

Since Text::WikiFormat::SAX doesn't work properly I'm obviously not as smart as you think I am ;-)

But in all seriousness this is something I hope to put right, in a similar way to trying to put right the whole XML parser nonsense. That way I can put my code where my rant is, or something like that.

and they use different regexes!

kellan on 2003-02-20T03:50:35

I might almost forgive them if they implemented the same regex, but each person implementing a different regex-based "mini-language" is the worst.

Maybe it's not too late to get gnat to include recipes on using Parse::Yapp and Parse::RecDescent in the next cookbook.

Re:and they use different regexes!

gnat on 2003-02-23T03:24:39

P::RD and P::Y will be there ...

--Nat