Anne van Kesteren

Ampersands matter

10 June 2004

Dave Shea writes a little piece, or actually quotes something, in The Standards Police. In short: don't report validation errors. I think it is strange that a designer gets upset about that.

If such a thing happens, they should go to the developer of the site they are working on and ask them to fix the tools in use, instead of getting annoyed by validation e-mails. If you still don't get that & should really be &amp; in your code, you should consider learning it again.

XML has 5 important entities that need to be encoded, as in: &lt;, &amp;, &gt;, &apos; and &quot;. The rest of the special characters can be handled either by using UTF-8 (the best solution) or by using decimal or hexadecimal character references (make sure you convert Windows characters). The most common error is probably the one with &, together with the fact that people don't get that it is important to get it right. Everything that is XML based, like all syndication formats and all XHTML based sites, will break when you haven't encoded it in the right manner. (OK, when you send XHTML with the incorrect content-type it won't, but in that case you shouldn't whine about validation feedback either.)
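As a rough illustration of the encoding described above, here is a minimal Python sketch using the standard library's xml.sax.saxutils. The helper name escape_xml and the sample strings are made up for this example; escape() only covers &, < and > by default, so the two quote characters are passed in explicitly.

```python
from xml.sax.saxutils import escape, quoteattr

# escape() handles &, < and > on its own; the quote characters are
# supplied here so that all five predefined XML entities are covered.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_xml(text: str) -> str:
    """Encode the five predefined XML entities in a piece of text."""
    return escape(text, QUOTES)


# Hypothetical sample data, not taken from any real site.
title = 'Tom & Jerry <3 "cartoons"'
link = "index.php?page=2&sort=date"

print(escape_xml(title))
# quoteattr() escapes the value and wraps it in suitable quotes.
print("<a href=%s>next</a>" % quoteattr(link))
```

Note that the ampersand in the URL has to be encoded as well, which is exactly the case people tend to forget.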

Be sure to read comment 9 and comment 17 of the discussion. Comment 21 might be interesting as well.

Comments

I kinda have to agree w/ not reporting validation errors. HTML or xHTML validation is well beyond plausible. Very few sites validate and most have more errors than lines of HTML. This is because successive versions of HTML have been incompatible w/ each other. Blame it on the W3C.
Posted by Randy Charles Morin at 8:58PM
> This is because successive versions of HTML have been incompatible w/ each other.
Please provide evidence for this assertion.
Posted by Mark at 9:06PM
I guess an example will suffice. But, I don't know why I'd answer your question, when you don't answer mine. Oh well, in the spirit of detente. IMG/@ALT in HTML 2 was not required and is now.
Posted by Randy Charles Morin at 11:32PM
Hmm, you could still use the older DOCTYPE, I assume, and still validate. Or are those people using newly introduced elements as well?
However, although not having the ALT specified is harmful and will give issues with screen readers and Google et cetera (so this was actually a good change) I think that the more recent "non important errors" like & are worse.
Posted by Anne at 11:39PM
I wasn't suggesting that ALT is a serious issue, just using it as an example to confirm my assertion. Thanks.
Posted by Randy Charles Morin at 11:44PM
You know, I don't want to open a can of worms, but I thought it'd be good to point out, since I was the origin of the "ampersand issue" and it seems to be popping up all over (??) -- it was an extreme, almost Devil's advocate type example.
Man, you guys just took that one and ran with it didn't ya? ;)
(shakes head)
I highly doubt un-encoded entities are going to take down the Web any time soon. But, hey, I might be wrong.
Oh and by the way -- this very issue, getting comments to be valid, has created a pretty major usability issue on your site Anne.
It wasn't 100% clear I had to actually code my entry.
How about explaining how to comment in plain language?
Be glad you have a savvy audience.
Posted by Keith at 2:26AM
oop. Sorry I guess I should read the guidelines. I missed the link.
Posted by Keith at 2:27AM
If someone points a commercial site owner to a W3C validation error report, there is a prominent insinuation that the developer has not got his job done properly.
Besides smelling like phase one of upcoming consulting service offers intruding the inbox, that is not necessarily the case for several scenarios:
- the page does not include a doctype declaration at all—no judge, no jury, no crime; it is not like HTML would viciously transform into a Kafka-esk SGML application overnight, and even if it did, the patient might be fully tagged
- the prolog contains a doctype declaration that the document instance following it does not live up to; that is not bogus anymore, because all major web browser vendors encourage using the doctype declaration as a processing instruction; blame them, if you have to blame anyone
- the validator might not be able to resolve sensible defaults, like <!doctype html system>, like it is the case for the catalog file of the W3C validator; bother the maintainer of the validator, not the author—anyone with real need for a validating system would not resort to a remotely hosted tool in the first place
Keeping in mind that SGML parsing and UA behaviour in terms of text/html have precisely nothing in common. No surprise, the hilarious compatibility guidelines of XHTML 1.0 are based on the gist of that very fact (this, uhm, informative compatibility contradicts earlier specs all right, and conveniently spares us the details).
So much for real life; beyond text/html, even a schema valid xhtml document instance without a doctype declaration would fail in the W3C validator. Now, who would use a tool for a job the tool cannot accomplish? (impossible to answer without name-calling; I will live up to mother's advice and say nothing :-)
Posted by Eric at 4:22AM
I think you need to separate two issues here: Incompatibilities between SGML and XML, and incompatibilities between different DTDs. When SGML got written, it was constructed not to be a general language (they never thought anyone would be that stupid as to build an SGML parser that was not an SGML application engine) but instead to be a framework for the syntaxes of other languages. Therefore it's not entirely strange that tagsoup wasn't just considered the best way, it was considered the only way, for HTML. With XML, on the other hand, you are expected to use a single general parser for all XML applications, and thus you need both fatal syntactical (well-formedness) errors and, for validating parsers as in SGML, grammatical (validity) errors.
Because the XML parsers are general, we can't allow slackness for any single XML application. We need to treat the markup the same, unambiguously, since it's no longer the parser's work to recognise the application; that's something that takes place on infoset level. This means that non-escaped ampersands are infinitely more important issues than validity errors such as using embeds or anything like that.
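To make the difference between the two kinds of error concrete, here is a small Python sketch. It assumes the standard library's xml.etree.ElementTree (a non-validating, expat-backed parser) as a stand-in for "a general XML parser"; the helper name is_well_formed and the markup samples are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a general XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # A well-formedness error is fatal: the parser gives up.
        return False


# Unescaped ampersand in an attribute value: not well-formed.
bad = '<p><a href="?a=1&b=2">link</a></p>'
# Same link with the ampersand encoded: well-formed.
good = '<p><a href="?a=1&amp;b=2">link</a></p>'
# An element no XHTML DTD declares: a validity problem at most, which a
# non-validating parser never even looks at.
embed = '<p><embed src="movie.swf"/></p>'

for sample in (bad, good, embed):
    print(is_well_formed(sample), sample)
```

The embed case gets through while the bare ampersand stops the parser cold, which is the sense in which the ampersand is the more important issue.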
As for the reason for most pages on the web being invalid, I wouldn't attribute it to any kind of incompatibilities between DTDs. The user agents all recognise one single set of elements as HTML (same set is used for XHTML, in the namespace recognising browsers) and not one set per DTD. They recognise the widest possible set of HTML elements and attributes they can, so that they don't have to fork code. I would rather attribute it to a number of things, all on the editor side:
- If the editor is human, we definitely have the possibility of human error. This was so infinitely common in the early days of HTML that browsers couldn't be strict, and thus invented the real HTML tagsoup parsers, that don't require well formed SGML.
- The editing tools are pretty much all, to this day, non-validating. Sure, some have on-demand validation facilities, but they aren't enforcing validation. This means that the developers of the other parts of these editors, with code libraries, WYSIWYG objects etc., don't have any pressure on them to generate valid code.
- The general dynamic content generation tools (CGI+PERL, SS-JavaScript, ASP, ASP.NET, PHP, JSP) have never been validating, and it is in fact hard to force them to generate valid code. This leads to most non-static sites being pretty much invalid.
That's the situation today, and today is pretty much better than ten years ago, when Netscape started to get warm in their boots with respect to HTML tagsoup parsing. Now of course, browsers can't stop with tagsoup parsing, because nobody would use them if they only could display the few blogs that validate and even fewer corporate sites.
Posted by liorean at 8:28AM
I was working on a long-term contract for a client. We had built a registration system that interfaced with a local college's Student Information System to register students in Continuing Education courses. We had developed a fairly intricate system by which we reported errors if something "wasn't right" when a course was selected -- perhaps the course wasn't found in the database (we were pulling from two separate databases... not fun).
In any case, the error log and table filled with records that just didn't make sense. This went on for a few days. Eventually, two other developers spent half a day each on it. They called me over for another set of eyes. The culprit? An unencoded ampersand in the QueryString. Instead of &amp;sect=010 it was written as &sect=010. The browser was interpreting &sect as an entity, and throwing the errors.
Yes, the browser was wrong in interpreting it as an entity -- there was no trailing semi-colon. However, validation would have been very useful there, and while the unencoded ampersands didn't tear down the web that day, it sure brought the developers to tears when they saw how much time they had wasted for something that validation would have flagged in an instant.
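The failure mode is easy to reproduce outside a browser. A minimal Python sketch, assuming the standard library's html module as a stand-in for lenient HTML entity handling; the URL is hypothetical, and html.unescape() applies the HTML5 rules for text, so this illustrates the mechanism but does not reproduce any particular browser (current HTML parsers use a narrower rule inside attribute values).

```python
import html

# Hypothetical query string with a bare ampersand before "sect".
url = "register.asp?course=42&sect=010"

# Lenient decoding treats "&sect" as the section sign even without
# the trailing semicolon, so the parameter name is destroyed.
as_seen = html.unescape(url)

# Encoding the ampersand first makes the round trip lossless.
safe = html.escape(url)
round_trip = html.unescape(safe)

print(as_seen)
print(safe)
print(round_trip == url)
```

Once the ampersand is written as &amp;, there is nothing left for the entity decoder to misread.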
Posted by Derek at 9:37AM
Anne, you write that the greater-than entity must be encoded. I remember Mark writing that it needn't be. Why do you think it must?
Next, although neither I, nor probably you, care about this, Gecko-based browsers crash the pages if the keyword entities are used on well-formed and valid pages using a custom doctype, where the entities are referenced in the custom DTD as they are in the regular DTD. Any custom doctype barfs Mozillians, even if it differs by just one character. The fix is to provide local paths to the entity definitions, but I'll be buggered if I do that. I use numerical entities, and only because of my infinite mercy, which I do possess, despite what some claim :]
So much room for malice. One keyword instead of a numeral, and poof, the one and only proper browser sits down.
So, not all is rosy, eh?
M.
p.s. I'm glad that, as you report, the Augias's stable is now being cleaned. Way to go, boys :]
Posted by Moose at 1:27PM
Moose, because it is part of the opening and closing tag and therefore it has "special meaning". Maybe that was about SGML, which has some strange, heavily complicated parsing rules?
And Mozilla is correct there (I would like to say: "obviously"). Since it doesn't do such a thing as DTD parsing and therefore it must report not well-formed according to the XML specification.
Posted by Anne at 1:32PM
There is a bug report somewhere in the belly of the Gargantua about that. Unresolved. I rely on White Lynx's report about this bug, because I don't venture in 'them there lands'.
Aristotle whispers to me that if what you said were right (which it obviously isn't), then Mozilla should still crash after the fix inside the DTD. Since it doesn't, its behavior is wrong either way, and your claim is disproved.
To my thinking, if you have all tags closed, you can have a greater-than character input directly into text, and still have a well-formed document. Please provide a testcase to the contrary, and I will eat my words :]
M.
Posted by Moose at 1:59PM
I think you need to separate two issues here: Incompatibilities between SGML and XML, and incompatibilities between different DTDs. When SGML got written, it was constructed not to be a general language

—liorean
Pardon? SGML looks general(ised) enough to me. I would rather separate two other issues: the parser and the application.
Moose, because it is part of the opening and closing tag and therefore it has "special meaning". Maybe that was about SGML, which has some strange, heavily complicated parsing rules?

—Anne
Neither strange nor heavily complicated; the ability to do some abstraction helps, though. This can be entered literally:
b>a
And this too:
1<2
In short, it is handy to have as few characters constituting tokens with a lexical meaning as possible. '<' does not have a special meaning, STAGO (start tag open) has; STAGO is immediately followed by a name start character (assuming the reference concrete syntax, that is [A-Za-z]) or the designated delimiter just remains data.
Yes, the browser was wrong in interpreting it as an entity -- there was no trailing semi-colon.

—Derek
Entity references do not need a trailing reference end delimiter if they are not stalked by a name character.
Posted by Eric at 3:55PM
Eric - you wrote:
Entity references do not need a trailing reference end delimiter if they are not stalked by a name character.

I see that, and thanks for pointing to it. Even more important that the ampersand be encoded then...
Posted by Derek at 5:57PM
Eric: No, I meant exactly that. SGML was meant to be a framework for writing markup language definitions à la the programming language definitions of YACC or LEX. There was never any intention of having general SGML parsers; parsers would always be SGML application specific. Or that was the thought...
The only characters in XML that require escaping in element content are less-than and ampersand. However, in attribute value you need to escape the quoting character you are using for the attribute value, whether that is the apostrophe or the double quote. The greater-than character doesn't need escaping in either context, but it's provided for consistency.
Eric: As for STAGO, you might be correct in the SGML case (I couldn't find an authoritative source describing it one way or the other). However, XML does not allow it, as can be seen by the CharData construct
CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
which describes what contents an element may have that are not CDATA sections, entity references, comments, PIs or elements.
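The CharData production above can be checked mechanically. A rough sketch in Python, assuming the standard library's expat-backed xml.etree.ElementTree as the XML parser; the helper names is_chardata and parses are made up for illustration, and the regular expression is a direct transcription of the production, not anything taken from a parser's source.

```python
import re
import xml.etree.ElementTree as ET


def is_chardata(text: str) -> bool:
    """CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)"""
    # Any run of characters without < or &, minus runs containing ]]>.
    return re.fullmatch(r"[^<&]*", text) is not None and "]]>" not in text


def parses(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A literal greater-than is fine in element content ...
print(is_chardata("b>a"), parses("<p>b>a</p>"))
# ... but a literal less-than is not, whatever SGML may allow.
print(is_chardata("1<2"), parses("<p>1<2</p>"))
# The encoded form is accepted by the parser.
print(parses("<p>1&lt;2</p>"))
```

So "b>a" can be entered literally in XML, while "1<2" cannot.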
Posted by liorean at 6:27PM
O my god and I was thinking SGML was dead :-)
Moose, you are referring to the so-called HTML entities are you? Not to the 5 pre-declared XML entities I mentioned in the post?
I'm almost sure Mozilla gets this right, since the XML support is better than in Opera from what I have seen and know. (Although Opera has some nice support for XML content-types I believe.)
Posted by Anne at 7:12PM
Eric wrote:
This can be entered literally:

b>a

And this too:

1<2

In short, it is handy to have as few characters constituting tokens with a lexical meaning as possible. '<' does not have a special meaning, STAGO (start tag open) has; STAGO is immediately followed by a name start character (assuming the reference concrete syntax, that is [A-Za-z]) or the designated delimiter just remains data.

One can see the obvious problem with this. What if the user enters "b<a" - the browser will consider you are starting a hyperlink. So it would be wise to always encode chevrons, no matter what.
Posted by Chris Hester at 7:46PM
IMG/@ALT in HTML 2 was not required and is now.

And?
Use img alt="" in an HTML 4.01 document, then take the same img element and put it in an HTML 2 document. It will still work. alt is permitted in both cases. What HTML 2 isn't is forward-compatible: Permissible img without alt in 2 is impermissible in 4.01.
Posted by Joe Clark at 11:42PM
Have a read.
Posted by MaThIbUs at 2:46AM
Hi! i find great solution on your page... thanxs
Posted by dan at 5:07AM