This rant is inspired by me just looking at the source code for Text::Tiki, but I certainly don't intend to single that parser out...
Why is it that wikitext [1] parser writers foist upon us broken, crappy regexp-based parsers that break at the slightest deviation from the spec, don't treat the document as a structure, and truly believe that they are just parsing "text" and only need to produce "text"? What if I don't want to produce HTML, but actually want to *do* something with that data?
Repeat after me: "s///g is not a parser!"
Some day I'm going to find time to finish off Text::WikiFormat::SAX and show people what it's all about.
Rant over.
[1] And this includes WikiText, TikiText, UseModText and all the various flavours that I've had the misfortune to look at the source code of.
Must admit to being an idiot in that it took me quite a while to figure out what thingy to click to reply to this.
We talked about this on IRC and I agreed with you. I went and looked at HTML::Parser, which I think we figured out was a good model for what we want to be able to do with WikiText. It's all in XS, so I ran away. Bad move? Me no spik C.
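For the record, the Perl-level interface to HTML::Parser needs no C at all; a minimal event-driven use looks something like this (the handlers here are just illustrative):

    use strict;
    use warnings;
    use HTML::Parser;

    # Register callbacks for structural events instead of regexp-munging the text.
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { print "start: $_[0]\n" }, 'tagname' ],
        text_h  => [ sub { print "text:  $_[0]\n" }, 'dtext'   ],
        end_h   => [ sub { print "end:   $_[0]\n" }, 'tagname' ],
    );
    $p->parse('<h1>Hello <b>world</b></h1>');
    $p->eof;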
Kake
Re:Better Alternatives???
sbwoodside on 2003-02-20T07:13:01
Powerful? Sure, in that it can do a lot. No, in that it doesn't help the software understand the structure of the data at all. Regex is a very limited language.
Re:Better Alternatives???
Matts on 2003-02-20T08:35:25
The problem with Twiki's parser (and all the other ones) is that they all look something like this:

    $text =~ s/(someformatting1)/<somehtml1>$1<\/somehtml1>/;
    $text =~ s/(someformatting2)/<somehtml2>$1<\/somehtml2>/;
    $text =~ s/(someformatting3)/<somehtml3>$1<\/somehtml3>/;
    $text =~ s/(someformatting4)/<somehtml4>$1<\/somehtml4>/;
    $text =~ s/(someformatting5)/<somehtml5>$1<\/somehtml5>/;

Which is great if you want HTML, but what if you want to parse the twiki text to put the data into a semantic search engine (where titles or bold text might have more relevance)? You have to parse once to HTML and then parse the HTML, and that I think is a broken model.
The parsers should be written as frontend+backend - where the frontend basically tokenises the Wiki text and the (default) backend turns those tokens into HTML. But another backend might do something completely different.
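A minimal sketch of that split (the token format and sub names here are mine, invented for illustration):

    use strict;
    use warnings;

    # Frontend: turn wiki text into a flat list of [ type => content ] tokens.
    # Deliberately tiny (one token per line, headings only) to show the shape.
    sub tokenise {
        my ($text) = @_;
        my @tokens;
        for my $line (split /\n/, $text) {
            if ($line =~ /^!+\s*(.*)/) {
                push @tokens, [ heading => $1 ];
            }
            else {
                push @tokens, [ para => $line ];
            }
        }
        return \@tokens;
    }

    # Default backend: tokens to HTML.
    sub to_html {
        my ($tokens) = @_;
        my %tag = ( heading => 'h1', para => 'p' );
        return join "\n", map {
            my ($type, $content) = @$_;
            "<$tag{$type}>$content</$tag{$type}>";
        } @$tokens;
    }

    # Another backend: the same tokens, weighted for a search indexer.
    sub to_index {
        my ($tokens) = @_;
        return map { +{ text => $_->[1], weight => $_->[0] eq 'heading' ? 10 : 1 } } @$tokens;
    }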
The model I'm talking about is the one used by Text::WikiFormat::SAX, but people are very afraid of proper parsers (witness how long it took people to switch from regexp-based parsers to HTML::Parser), because they mean you have to think of your source as data rather than text. Plus Text::WikiFormat::SAX is broken, mostly because nobody uses it, so I have no incentive to fix it ;-)
I'm only about halfway kidding.
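To make the SAX model concrete, a consumer just implements handler methods for the standard PerlSAX events; here's a hypothetical indexing handler (the class and the "heading" element name are my inventions):

    package WeightedIndexer;
    use strict;
    use warnings;

    # A PerlSAX handler that never sees HTML: it receives structural events
    # (elements, character data) straight from the wiki parser.
    sub new { bless { weight => 1, index => [] }, shift }

    sub start_element {
        my ($self, $el) = @_;
        # Boost relevance while inside a heading.
        $self->{weight} = 10 if $el->{Name} eq 'heading';
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->{weight} = 1 if $el->{Name} eq 'heading';
    }

    sub characters {
        my ($self, $chars) = @_;
        # This is where text plus weight would go off to the search engine.
        push @{ $self->{index} }, [ $chars->{Data}, $self->{weight} ];
    }

    1;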
Re:Three Reasons
Matts on 2003-02-20T08:28:29
I agree with all your points but the second one ;-)
Since Text::WikiFormat::SAX doesn't work properly, I'm obviously not as smart as you think I am ;-)
But in all seriousness, this is something I hope to put right, in a similar way to trying to put right the whole XML parser nonsense. That way I can put my code where my rant is, or something like that.
Re:and they use different regexes!
gnat on 2003-02-23T03:24:39
P::RD and P::Y will be there... --Nat
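(That's Parse::RecDescent and Parse::Yapp. For anyone who hasn't met them, a toy Parse::RecDescent grammar, with rules invented purely for illustration, shows the grammar-driven shape:)

    use strict;
    use warnings;
    use Parse::RecDescent;

    # Illustrative grammar only; no real wiki dialect is this simple.
    my $grammar = q{
        wiki  : chunk(s)        { $item[1] }
        chunk : bold | word
        bold  : '*' /[^*]+/ '*' { [ bold => $item[2] ] }
        word  : /[^*\s]+/       { [ text => $item[1] ] }
    };

    my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";
    my $tokens = $parser->wiki('plain *bold* more plain');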