November 16, 2008 -- the right man for the job

masak on 2008-11-16T22:51:21

624 years ago today, after two years of negotiations between her mother and the ruling lords, Jadwiga, a 10-year-old girl, was crowned King of Poland.

Not that there's anything wrong with that. She appears to have been a just and respected monarch. Wikipedia:

As a monarch, young Jadwiga probably had little actual power. Nevertheless, she was actively engaged in her kingdom's political, diplomatic and cultural life and acted as the guarantor of Władysław's promises to reclaim Poland's lost territories. In 1387, Jadwiga led two successful military expeditions to reclaim the province of Halych in Red Ruthenia, which had been retained by Hungary in a dynastic dispute at her accession.

She died at the age of 25 from birth complications. Nowadays, she is venerated by the Roman Catholic Church as Saint Hedwig, and by others as the patron saint of queens, and of United Europe.

Been hacking on the MediaWiki parser today. Specifically, the code that finds == headings == and makes <h2>headings</h2> out of them. I've now implemented the easy test case, where the heading stands on its own line and is not intermixed with ordinary paragraph text. Three tests in which it is intermixed remain to be satisfied.
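
To give a feel for that easy case, here is a minimal sketch in Perl 6 grammar syntax. It is not November's actual grammar; the rule names and the driver code are mine, purely for illustration, and it ignores the other heading levels.

    grammar Heading {
        token TOP     { <heading> }
        token heading { '==' \h* <text> \h* '==' \h* }
        token text    { <-[ = \n ]>+ }    # heading text: anything but '=' or newline
    }

    my $line = '== Saint Hedwig ==';
    if Heading.parse($line) -> $m {
        say '<h2>' ~ $m<heading><text>.trim ~ '</h2>';    # <h2>Saint Hedwig</h2>
    }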

Also spoke to Shlomi Fish (rindolf) today, who apparently got a grant for doing a MediaWiki parser, but got stuck. I asked him why he found the task hard, and he gave as an example the text a''b'''c''d'''e (or something equivalent), i.e. improperly nested style tokens.

I know about that problem. I have tests for it already.
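
For the record, here is why that particular example is nasty. In MediaWiki markup, '' toggles italics and ''' toggles bold. A quick tokenization (the regex below is mine, just for illustration):

    my $input = "a''b'''c''d'''e";
    say $input.comb(/ "'''" | "''" | <-[ \' ]>+ /).join(' ');
    # a '' b ''' c '' d ''' e

Pairing the toggles naively gives a<i>b<b>c</i>d</b>e: the italic and bold spans overlap instead of nesting, which is exactly the kind of output a strict XHTML generator must never emit.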

In fact, a few years ago I implemented an extremely reliable parser for a large subset of the MediaWiki syntax, though that time in Java. It had a very peculiar design goal: I never wanted it to fail, whether with an error message or with any other absence of output. Additionally, it sent the resulting HTML on to a set of XML transformers, so the output had to be impeccable XHTML.

Think about it. The user can type any old broken, mis-nested, intentionally sadistic markup into the text box, and it still always comes out as freshly pressed valid XHTML. That's DWIM on steroids, some sort of "the user is right even when she's wrong" mentality. That module is still being used by dozens of people every day at my former employer. Of all the software I've written in my life, that one is perhaps the one I'm still the most proud of.
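
One way to get that guarantee, sketched here in Perl 6 rather than the original Java, and with every name made up: keep a stack of open elements. A close token either matches something on the stack, in which case everything above it is closed first, or it is silently dropped; whatever is still open at the end gets closed.

    sub balance(@tokens) {
        my @stack;
        my $out = '';
        for @tokens -> %t {
            given %t<type> {
                when 'text'  { $out ~= %t<text> }
                when 'open'  { @stack.push(%t<tag>); $out ~= '<' ~ %t<tag> ~ '>' }
                when 'close' {
                    if %t<tag> (elem) @stack {
                        # close everything above the matching tag, then the tag itself
                        repeat {
                            my $top = @stack.pop;
                            $out ~= "</$top>";
                            last if $top eq %t<tag>;
                        } while @stack;
                    }
                    # a close token with no matching open token is dropped
                }
            }
        }
        $out ~= "</$_>" for @stack.reverse;    # close whatever is still open
        return $out;
    }

    # the mis-nested bold/italic example, pre-tokenized:
    my @tokens =
        { :type<text>, :text<a> }, { :type<open>,  :tag<i> },
        { :type<text>, :text<b> }, { :type<open>,  :tag<b> },
        { :type<text>, :text<c> }, { :type<close>, :tag<i> },
        { :type<text>, :text<d> }, { :type<close>, :tag<b> },
        { :type<text>, :text<e> };
    say balance(@tokens);    # a<i>b<b>c</b></i>de

The output is always well-formed, though sometimes at the cost of dropped formatting: here the d loses its bold.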

I'm not trying to brag, just showing that I have some sense of what I'm up against. The objective for this module is somewhat different: right now, I aim for bug-for-bug compatibility. If MediaWiki parses something in an incredibly stupid way, I want to do it too. I know it would be much easier, and probably more sane, to 'tidy up' the grammar while implementing it. But I don't want that; then it wouldn't be MediaWiki markup. One should be able to copy text from a MediaWiki instance and paste it into a November instance.

Come to think of it, I might have to make some small concessions if MediaWiki generates invalid XHTML in some cases. When it does, valid XHTML takes priority. But hopefully I'll still be able to emulate the way the page looks.

I look forward to the thorny bits of the markup parser. I think PGE and I will have a great time vanquishing those windmills. ☺

I already have quite a few tests, but some still remain to be written. A few more will surely be added as I find further corner cases. But all in all, I'm making good progress. Too bad I'm not getting a grant. ☺

First up is satisfying those mixed-heading-and-paragraph tests. That code will have to be sufficiently general, or at least generalized later, because lists, definition lists and possibly other things behave the same way, i.e. they are line-oriented. Then comes the issue of correctly handling mis-nested bold/italic. (And mis-nested bold/italic/links.) That will most likely require its very own blog post.
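
One way such a generalization could look, with made-up rule names and nowhere near full coverage: classify each line by its leading markup, and let headings, list items and definition items all hang off the same per-line dispatch.

    grammar Line {
        token TOP       { <heading> || <list-item> || <def-item> || <paragraph> }
        token heading   { '==' \h* <-[ = \n ]>+ \h* '==' \h* }
        token list-item { '*' \h* \N+ }
        token def-item  { ';' \h* \N+ }
        token paragraph { \N+ }
    }

    for '== A heading ==', '* a list item', '; term : definition', 'plain text' -> $line {
        say "$line  =>  ", Line.parse($line).hash.keys;
    }
    # == A heading ==  =>  (heading)
    # * a list item  =>  (list-item)
    # ; term : definition  =>  (def-item)
    # plain text  =>  (paragraph)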

P.S. I'm not usually this cocky in my blog posts, but I wrote this immediately after watching a video podcast with Randal Schwartz. In it, he said that people don't know what you're good at until you tell them. I think he's right.


Not a peculiar design goal

Aristotle on 2008-11-19T04:54:09

Never failing with an error message is what the currently most heavily used parsers do: the parsers that feed the HTML rendering engines in our browsers. Their goal is exactly that: always render something, never bail out with a parsing error.

Re:Not a peculiar design goal

masak on 2008-11-19T22:32:14

Huh - don't know why I didn't think of the browsers when I wrote that. The browser authors must have problems a hundred times thornier than I did.

There's no question that it would be simpler from a programmer's perspective just to refuse to render if some set of grammar rules is not strictly followed. In most systems this is actually a requirement from a safety perspective, but in visual rendering like HTML or wiki markup, the temptation to forgive and forget is strong... especially when several browsers compete for the same market share, and none of them wants to "break the web".

For my part, whenever I had the choice between pretending that the user had written something sensible, and throwing a big red warning in their face, I chose to do some extra work in the code and make things work sensibly even though they were ungrammatical. The code absorbed the errors instead of giving up.

It's strange, because I still don't believe that's the best way to do things, neither for browsers nor for markup engines. But I also don't have a better idea.

Re:Not a peculiar design goal

Aristotle on 2008-11-20T03:48:00

No one ever thinks of the browsers. :-) The most successful computing platform ever, by a yawning margin, and paradoxically enough the most casually overlooked one by just as yawning a margin.

As for guessing vs catching fire, the problem in case of the web is that the user who gets to see the error is the one least capable of fixing it. So I don’t see how browsers could avoid lax parsing, even in an ideal world where almost all markup was valid (as opposed to the real one, where something like 99.99% of markup is invalid).

But for something like wiki markup where the author is close at hand, I would favour a mixed approach: use a forgiving parser during authoring that tries to make a sensible guess when it encounters an error, but then ask the user whether the guess is correct, providing a corrected version of the input that they can easily rubber-stamp. That way, you can ensure that data is always clean before you commit it to storage. The benefit is that you run no risk of diverging interpretations of that data when you pass it to different compliant parsers.

Of course, this is even more complex than “just” writing a forgiving parser and not caring about clean data.
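
For what it's worth, the flow described above might be sketched like this; every name is invented purely for illustration, and the hard part, the forgiving parser itself, is waved away as a callback:

    sub submit-edit($raw-markup, :&parse-forgivingly!, :&ask-author!, :&store!) {
        my $repaired = parse-forgivingly($raw-markup);    # always yields clean markup
        if $repaired eq $raw-markup {
            store($repaired);                             # input was already clean
        }
        elsif ask-author('Did you mean this?', $repaired) {
            store($repaired);                             # author rubber-stamps the guess
        }
        else {
            # back to the edit box; unclean markup never reaches storage
        }
    }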