Entify Your HTML!

brianiac on 2005-07-07T05:24:13

Updated. Reposted from my other journal, http://brianary.blogspot.com/2005/07/entify-your-html.html:

To the embarrassingly uninformed third-party vendors of web-based applications, I present a quick look at HTML entities. This is Chapter One stuff in even the most basic HTML book, but I still get puzzled, dismissive, and even indignant replies when I request fixes for simple HTML bugs.

Three important characters: < > &

These characters are special to HTML for processing. In the text or attribute values of a page, you must use the entities that stand for them: &lt; &gt; &amp; (respectively). In attributes, " should also be replaced with &quot; (you can also use &quot; in text, but it isn't a requirement).

The Web Is A Big Place

If you forget to entify your special characters, some browsers will sometimes let you get away with it. If you intend to produce code for the widest possible audience (which is the whole point of the Internet, after all), it is best not to assume your indiscretions will always go unnoticed; better to do it right to start with, and you won't have to double check every support call ($$$) to see if unentified HTML is part of the problem.

Unentified HTML Is Insecure HTML

Most Cross-Site Scripting (XSS) attacks exploit unentified HTML, and those can be prevented using entities. The liability of such an attack, though potentially considerable, is nothing compared to the loss of client trust.
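
As a rough illustration, here is a sketch in Python (not a language this post otherwise uses; the comment text and the page fragment are made up for the example). It shows how an unescaped user comment becomes live markup, and how entifying it with the standard library's html.escape() turns it back into inert text:

```python
from html import escape

# Hypothetical user-supplied input containing a script injection attempt.
comment = '<script>alert("gotcha")</script>'

# Unsafe: the input is pasted into the page as-is,
# so the browser sees a real <script> element.
unsafe_page = "<p>" + comment + "</p>"

# Safer: entify the input first, so the browser shows it as literal text.
safe_page = "<p>" + escape(comment) + "</p>"

print(safe_page)
# <p>&lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;</p>
```

This only covers values placed in HTML text or quoted attribute values, which is the case this post is about.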

It's Easy

Every web development language has a single function you can call to entify the contents of string or text variables (numeric and date/time variables do not typically require escaping), e.g. Server.HTMLEncode() in Active Server Pages or htmlentities() in PHP. In cases where the language does not provide such a function, writing one is trivial: four search-and-replace calls (do the ampersand first).
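
For a language without a built-in, the four search-and-replace calls might look like the sketch below. It is written in Python purely for illustration, and the function name entify is my own invention here, not a standard library call:

```python
def entify(text):
    """Replace the four HTML-special characters with entities."""
    # The ampersand must go first; otherwise the "&" in the entities
    # produced by the later replacements would itself be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace('"', "&quot;")
    return text


print(entify('<a href="x?a=1&b=2">AT&T</a>'))
# &lt;a href=&quot;x?a=1&amp;b=2&quot;&gt;AT&amp;T&lt;/a&gt;
```

Note that this sketch leaves ' alone, so it is only safe for attribute values delimited with ".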

It just kills me how often I see unencoded HTML (of the severity that actually breaks things), and how defensive companies get when it's pointed out. As if it were a lengthy or difficult fix.


Wait ...

pudge on 2005-07-07T07:09:41

Are you saying ' and " must not be single octets anywhere in the text of HTML, but rather, that &apos; and &quot; must be used? What part of the HTML spec did you get that from? In attribute values, yes, but in the text, there's no such requirement.

Re:Wait ...

brianiac on 2005-07-07T16:33:13

These are subtleties that I see no need to try to explain to people that resist even encoding . This is not a normative reference work, but a rule of thumb intended for an audience with a poor track record of understanding and accepting standards.

Re:Wait ...

brianiac on 2005-07-07T16:38:27

"...even encoding < and > ."

Apparently this is some strange new usage of "Plain Old Text" that I was not previously aware of.

Re:Wait ...

pudge on 2005-07-07T16:45:25

Plain Old Text still allows HTML, but it takes care of newlines, and it's worked that way for years in Slash without too many complaints.

We'd get a lot more if it worked as you wanted it to. You wanted Extrans. Which has a lame name. But oh well.

Re:Wait ...

pudge on 2005-07-07T16:43:56

It's not a subtlety: there is simply no need to encode ' and " anywhere but in attributes.

Also, &apos; is illegal in HTML. It's only a named entity in XHTML.

Re:Wait ...

brianiac on 2005-07-07T17:31:30

But not even all attributes require &quot;, only attributes that use " as a delimiter. My original intent was a single clear rule, without dithering or qualifying that could be used as an excuse to ignore this process based on complexity.

I work with several third-party vendors in the financial sector, who have a great deal to learn about web development. This is meant to be simple enough to remember for those writing home banking, bill payment, loan application, and other web apps, but also apps from any other sector (niche software needing the biggest shove).

The lack of &apos; in the HTML spec is obviously an oversight by the W3C. Why would ' -delimited attributes not be given the same escape mnemonic capacity (&#39; or &#x27; just aren't as easy to remember) that " -delimited attributes are?

However, the point is taken. See if the update seems better to you.

Re:Wait ...

pudge on 2005-07-07T17:47:31

I'd say a better rule is to not delimit attributes with ', and use only ", as God intended. :-)

As to why &apos; is missing, no idea, but yeah, it seems lame. But as everyone seems to accept it (except for proper strict validators that know about entities), it's not something I'd personally care too much about. Just offering it as a footnote.

Re:Wait ...

dws on 2005-07-07T18:29:13

So people are resisting what? A rule of thumb?

Re:Wait ...

brianiac on 2005-07-07T23:42:18

Yes. Perhaps more properly, a simplified version of the spec. Too much truly lousy HTML is being generated.