Valid, Strict, HTML

pudge on 2005-04-13T20:14:18

I've installed on use.perl.org the code that turns your comments and journals into valid HTML 4.01 strict. The surrounding HTML is not, but the comments and journals themselves are.

Or should be.

If you notice any big problems, let me know. I had some issues with character references, but a sampling of TorgoX's journal entries with Unicode in them showed that I appear to have gotten those fixed.

Most of the recent problems were with URL handling, mostly because I was concentrating on the tags themselves rather than on the URLs, and some errors crept in. Those should be ironed out too.

Probably next week I'll begin converting old comments, sigs, user bios, etc. to valid HTML. Maybe even stories. Journals don't have to be converted, as they are rendered on display from the originally saved HTML.


xhtml/css status for the rest of slashcode?

prahl on 2005-04-13T23:21:40

Fixing comments is a nice step forward. Will the use Perl; templates also be getting an xhtml/css revision?

[off topic] Do you know why the number of comments might not appear on http://omlc.ogi.edu/ --- the first story has a comment but nothing shows on the front page to indicate its presence.

Re:xhtml/css status for the rest of slashcode?

pudge on 2005-04-14T01:06:13

Fixing comments is a nice step forward. Will the use Perl; templates also be getting an xhtml/css revision?

That is in progress. It's a separate phase that others are working on. (Note that we decided on HTML 4.01 strict, but the code can handle XHTML too if we wish.)

[off topic] Do you know why the number of comments might not appear on http://omlc.ogi.edu/ --- the first story has a comment but nothing shows on the front page to indicate its presence.

Offhand, no, not sure.

Re:

Aristotle on 2005-04-14T18:36:44

The handling of encodings needs some cleaning up. Fortunately it doesn’t appear to be broken so badly as to be hard to fix.

My “Expressiveness matters” post gets its curly quotes encoded as &#147; and &#148;, respectively, which are undefined in the ISO-8859-1 charset the pages claim to be encoded in; they are only defined in Windows Codepage 1252. It still works, because browsers have generally given up and just treat the two charsets as equal (which is doable because Win1252 is a true superset of Latin1), but correct it ain’t.

But the same numeric entities are used in the RSS feed. There you are not forgiven for claiming to be Latin1 when you are really Win1252, and worse, numeric entities in XML always refer to Unicode codepoints. So the curly quotes must be encoded as &#8220; and &#8221;, respectively. As a result, all XML consumers show my post with “no such character” boxes around the title.
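For illustration, here is how the same byte comes out under the two charsets. This is just a demonstration using Perl’s core Encode module, not anything from the Slash codebase:

    use Encode qw(decode);

    my $byte = "\x93";  # decimal 147: a Win1252 left curly quote
    printf "as cp1252: U+%04X\n", ord decode('cp1252',     $byte);  # U+201C, i.e. &#8220;
    printf "as latin1: U+%04X\n", ord decode('iso-8859-1', $byte);  # U+0093, an undefined control character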

The easiest thing to do, in terms of ensuring correctness (if not necessarily of implementation on a site with a huge amount of legacy content, though that might require nothing more than a dump/transcode/restore cycle of the database), would be to switch to UTF-8 wholesale. Then you can forget about numeric entities entirely and just encode the five requisite characters (amp, lt, gt, apos, quot) with their named entities. (Or use &#39; instead of &apos;, since that named entity is defined only in XML, not in HTML.)
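Once everything is UTF-8, the escaping boils down to something like this sketch (escape_text is just an illustrative name, not an existing Slash function):

    sub escape_text {
        my $text = shift;
        my %ent  = (
            '&' => '&amp;',  '<' => '&lt;', '>' => '&gt;',
            '"' => '&quot;', "'" => '&#39;',  # &#39; rather than &apos;, which HTML lacks
        );
        $text =~ s/([&<>"'])/$ent{$1}/g;
        return $text;
    }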

Re:

pudge on 2005-04-14T19:17:26

Not only are you not forgiven for claiming to be Latin1 when you are Win1252 there

And you are not forgiven for *using* Win1252 in the first place. I am not sure it is correct for me to try to fix your mistake and guess at what character you intended. How can I know you meant those to be curly quotes, and not something else? Sure, those are undefined in Latin-1, but how do I know what charset you are using, if you're not using Latin-1?

Re:

Aristotle on 2005-04-14T19:29:13

Ugh! You are correct. The problem is precisely the aforementioned fact that browsers treat Latin1 as Win1252: the form is Latin1, so when I paste curly quotes, my browser throws up its arms and sends Win1252 instead of telling me. Gahhhh.

Can we please have UTF-8 as soon as manageably possible? :-(

Re:

pudge on 2005-04-14T19:35:57

In Slash right now, we have special casing for high-bit chars, for sites that want plain ASCII. What I can probably do is add to that, for sites like useperl that are more open, special-casing those few chars from 128-159. It should catch most cases, like this one. It sucks, but ... so does the web. :-)
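Roughly, that special-casing would amount to something like the following sketch, using the core Encode module (the sub name is made up, and this is not the actual Slash patch):

    use Encode qw(decode);

    # Remap the Win1252-only byte range 128-159 to the Unicode codepoints
    # those bytes actually represent, emitted as numeric entities.
    # (The five bytes undefined even in cp1252 come back as U+FFFD.)
    sub demoronize {
        my $text = shift;
        $text =~ s/([\x80-\x9F])/sprintf '&#%d;', ord decode('cp1252', $1)/ge;
        return $text;
    }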

As to UTF-8, we tried it once and it messed us up in various ways, largely due to browser support, so I am not eager to try again any time soon. I think this is the best way for now: converting everything to an entity. That assumes the browser sends us good data, which is an unfortunate assumption, but maybe for this one set of cases I can handle it separately.
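The rest of the high-bit range is easier, since Latin1 bytes 160-255 share their codepoints with Unicode, so converting those to entities is just (again, only a sketch):

    # Latin1's 160-255 range maps straight onto the same Unicode codepoints,
    # so those bytes can be entity-encoded directly.
    $text =~ s/([\xA0-\xFF])/sprintf '&#%d;', ord $1/ge;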

Re:

pudge on 2005-04-14T22:01:41

I implemented the special-casing for those few non-Latin-1 chars that browsers like to send. Your journal entry title now has the proper encoding.

Re:

Aristotle on 2005-04-14T22:42:58

It had it before the fix as well; after our exchange, I went and corrected the entities manually. If you want, though, I can change the entities back and see what happens.

Re:

pudge on 2005-04-14T22:50:56

No need. You can see it is correct “here.”