Anne van Kesteren

Ampersands matter

Dave Shea writes a little piece, or actually quotes something, in The Standards Police. In short: don't report validation errors. I find it strange that a designer gets upset about that.

If such a thing happens, they should go to the developer of the site they are working on and ask for the tools in use to be fixed, instead of getting annoyed by validation e-mails. If you still don't get that & should really be &amp; in your code, you should consider learning it again.

XML has 5 important entities that need to be encoded as in: &lt;, &amp;, &gt;, &apos; and &quot;. The rest of the special characters can be handled either by using UTF-8 (the best solution) or by using decimal or hexadecimal character references (make sure you convert Windows characters). The most common error is probably the one with &, along with the fact that people don't get that it is important to get it right. Everything that is XML based, like all syndication formats and all XHTML based sites, will break when you haven't encoded it in the right manner. (OK, when you send XHTML with the incorrect content-type it won't, but in that case you shouldn't whine about validation feedback either.)
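As a rough illustration of the point above, here is a minimal Python sketch using only the standard library. The helper names (`to_xml_text`, `to_xml_attr`, `is_well_formed`) and the sample title are made up for this example; they are not taken from any tool mentioned in the post.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_xml_text(raw: str) -> str:
    """Escape &, < and > so the text is safe as element content."""
    return escape(raw)


def to_xml_attr(raw: str) -> str:
    """Escape all five predefined characters, for use inside a quoted attribute."""
    return escape(raw, {'"': "&quot;", "'": "&apos;"})


def is_well_formed(document: str) -> bool:
    """Return True if a conforming XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


raw_title = "Fish & Chips"  # hypothetical feed title
broken = f"<title>{raw_title}</title>"
fixed = f"<title>{to_xml_text(raw_title)}</title>"

print(is_well_formed(broken))  # the bare & is a fatal error
print(is_well_formed(fixed))   # &amp; parses fine
```

An XML parser has no option to guess what the bare ampersand was supposed to mean, which is why the first document is rejected outright rather than displayed with a warning.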

Be sure to read comment 9 and comment 17 of the discussion. Comment 21 might be interesting as well.

Comments

  1. I kinda have to agree w/ not reporting validation errors. HTML or xHTML validation is well beyond plausible. Very few sites validate and most have more errors than lines of HTML. This is because successive versions of HTML have been incompatible w/ each other. Blame it on the W3C.

    Posted by Randy Charles Morin at

  2. > This is because successive versions of HTML have been incompatible w/ each other.

    Please provide evidence for this assertion.

    Posted by Mark at

  3. I guess an example will suffice. But, I don't know why I'd answer your question, when you don't answer mine. Oh well, in the spirit of detente. IMG/@ALT in HTML 2 was not required and is now.

    Posted by Randy Charles Morin at

  4. Hmm, you could still use the older DOCTYPE, I assume, and still validate. Or are those people using newly introduced elements as well?

    However, although not having the ALT specified is harmful and will cause issues with screen readers, Google, et cetera (so this was actually a good change), I think that the more recent "non important errors" like & are worse.

    Posted by Anne at

  5. I wasn't suggesting that ALT is a serious issue, just using it as an example to confirm my assertion. Thanks.

    Posted by Randy Charles Morin at

  6. You know, I don't want to open a can of worms, but I thought it'd be good to point out, since I was the origin of the "ampersand issue" and it seems to be popping up all over (??) -- it was an extreme, almost Devil's advocate type example.

    Man, you guys just took that one and ran with it didn't ya? ;)

    (shakes head)

    I highly doubt un-encoded entities are going to take down the Web any time soon. But, hey, I might be wrong.

    Oh and by the way -- this very issue, getting comments to be valid, has created a pretty major usability issue on your site Anne.

    It wasn't 100% clear I had to actually code my entry.

    How about explaining how to comment in plain language?

    Be glad you have a savvy audience.

    Posted by Keith at

  7. oop. Sorry I guess I should read the guidelines. I missed the link.

    Posted by Keith at

  8. If someone points a commercial site owner to a W3C validation error report, there is a prominent insinuation that the developer has not got his job done properly.

    Besides smelling like phase one of upcoming consulting service offers intruding the inbox, that is not necessarily the case for several scenarios:

    Keeping in mind that SGML parsing and UA behaviour in terms of text/html have precisely nothing in common. No surprise, the hilarious compatibility guidelines of XHTML 1.0 are based on the gist of that very fact (this, uhm, informative compatibility contradicts earlier specs all right, and conveniently spares us the details).

    So much for real life; beyond tags/html, even a schema valid xhtml document instance without a doctype declaration would fail in the W3C validator. Now, who would use a tool for a job the tool cannot accomplish? (Impossible to answer without name-calling; I will live up to mother's advice and say nothing :-)

    Posted by Eric at

  9. I think you need to separate two issues here: Incompatibilities between SGML and XML, and incompatibilities between different DTDs. When SGML got written, it was constructed not to be a general language (they never thought anyone would be so stupid as to build an SGML parser that was not an SGML application engine) but instead to be a framework for the syntaxes of other languages. Therefore it's not entirely strange that tagsoup wasn't just considered the best way, it was considered the only way, for HTML. With XML on the other hand you are expected to use a single general parser for all XML applications, and thus you need both fatal syntactical (well-formedness) errors and, for validating parsers as with SGML, grammatical (validity) errors.

    Because the XML parsers are general, we can't allow slackness for any single XML application. We need to treat the markup the same, unambiguously, since it's no longer the parser's work to recognise the application; that's something that takes place on infoset level. This means that non-escaped ampersands are infinitely more important issues than validity errors such as using embeds or anything like that.

    As for the reason for most pages on the web being invalid, I wouldn't attribute it to any kind of incompatibilities between DTDs. The user agents all recognise one single set of elements as HTML (same set is used for XHTML, in the namespace recognising browsers) and not one set per DTD. They recognise the widest possible set of HTML elements and attributes they can, so that they don't have to fork code. I would rather attribute it to a number of things, all on the editor side:

    That's the situation today, and today is pretty much better than ten years ago, when Netscape started to get warm in their boots with respect to HTML tagsoup parsing. Now of course, browsers can't stop with tagsoup parsing, because nobody would use them if they only could display the few blogs that validate and even fewer corporate sites.

    Posted by liorean at

  10. I was working on a long-term contract for a client. We had built a registration system that interfaced with a local college's Student Information System to register students in Continuing Education courses. We had developed a fairly intricate system by which we reported errors if something "wasn't right" when a course was selected -- perhaps the course wasn't found in the database (we were pulling from two separate databases... not fun).

    In any case, the error log and table filled with records that just didn't make sense. This went on for a few days. Eventually, two other developers spent half a day each on it. They called me over for another set of eyes. The culprit? An unencoded ampersand in the QueryString. Instead of &amp;sect=010 it was written as &sect=010. The browser was interpreting &sect as an entity, and throwing the errors.

    Yes, the browser was wrong in interpreting it as an entity -- there was no trailing semi-colon. However, validation would have been very useful there, and while the unencoded ampersands didn't tear down the web that day, it sure brought the developers to tears when they saw how much time they had wasted for something that validation would have flagged in an instant.

    Posted by Derek at
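    The failure described in comment 10 can be reproduced in a small Python sketch. Note the assumptions: `html.unescape` is only a stand-in for a lenient HTML parser (it resolves some legacy entity names without a trailing semicolon, but it is not a model of any particular browser), and the URL with its `course` parameter is invented for illustration.

```python
from html import escape, unescape


def decode_like_lenient_parser(attribute_source: str) -> str:
    """Resolve entity references the way a forgiving HTML parser might."""
    return unescape(attribute_source)


# Hypothetical link target; only the "sect=010" part comes from the comment.
raw_url = "register.asp?course=42&sect=010"

# Written into the page as-is, "&sect" is read as the section sign entity.
unencoded = decode_like_lenient_parser(raw_url)

# Written with the ampersand encoded, the URL survives the round trip.
encoded = decode_like_lenient_parser(escape(raw_url))

print(unencoded)
print(encoded)
```

    In the first case the `sect` parameter never reaches the server under that name, which matches the kind of nonsensical records the comment describes.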

  11. Anne, you write that the greater-than entity must be encoded. I remember Mark writing that it needn't have. Why do you think it must?

    Next, although neither I nor probably you care about this, Gecko-based browsers crash the pages if the keyword entities are used on well-formed and valid pages using a custom doctype, where the entities are referenced in the custom DTD as they are in the regular DTD. Any custom doctype barfs Mozillians, even if it differs by just one character. The fix is to provide local paths to the entity definitions, but I'll be buggered if I do that. I use numerical entities, and only because of my infinite mercy, which I do possess, despite what some claim :]

    So much room for malice. One keyword instead of a numeral, and poof, the one and only proper browser sits down.

    So, not all is rosy, eh?

    M.

    p.s. I'm glad that, as you report, the Augias's stable is now being cleaned. Way to go, boys :]

    Posted by Moose at

  12. Moose, because it is part of the opening and closing tag and therefore it has "special meaning". Maybe that was about SGML, which has some strange, heavily complicated parsing rules?

    And Mozilla is correct there (I would like to say: "obviously"). Since it doesn't do such a thing as DTD parsing and therefore it must report not well-formed according to the XML specification.

    Posted by Anne at

  13. There is a bug report somewhere in the belly of the Gargantua about that. Unresolved. I rely on White Lynx's report about this bug, because I don't venture in 'them there lands'.

    Aristotle whispers to me that if what you said were right (which it obviously isn't), then Mozilla should still crash after the fix inside the DTD. Since it doesn't, its behavior is wrong either way, and your claim is disproved.

    To my thinking, if you have all tags closed, you can have a greater-than character input directly into text, and still have a well-formed document. Please provide a testcase to the contrary, and I will eat my words :]

    M.

    Posted by Moose at

  14. I think you need to separate two issues here: Incompatibilities between SGML and XML, and incompatibilities between different DTDs. When SGML got written, it was constructed not to be a general language

    —liorean

    Pardon? SGML looks general(ised) enough to me. I would rather separate two other issues: the parser and the application.

    Moose, because it is part of the opening and closing tag and therefore it has "special meaning". Maybe that was about SGML, which has some strange, heavily complicated parsing rules?

    —Anne

    Neither strange nor heavily complicated; the ability to do some abstraction helps, though. This can be entered literally:

    b>a

    And this too:

    1<2

    In short, it is handy to have as few characters constituting tokens with a lexical meaning as possible. '<' does not have a special meaning, STAGO (start tag open) has; STAGO is immediately followed by a name start character (assuming the reference concrete syntax, that is [A-Za-z]), or the designated delimiter just remains data.

    Yes, the browser was wrong in interpreting it as an entity -- there was no trailing semi-colon.

    —Derek

    Entity references do not need a trailing reference end delimiter if they are not stalked by a name character.

    Posted by Eric at

  15. Eric - you wrote:

    Entity references do not need a trailing reference end delimiter if they are not stalked by a name character.

    I see that, and thanks for pointing to it. Even more important that the ampersand be encoded then...

    Posted by Derek at

  16. Eric: No, I meant exactly that. SGML was meant to be a framework for writing markup language definitions à la the programming language definitions of YACC or LEX. There was never any intention of having general SGML parsers - parsers would always be SGML application specific. Or that was the thought...

    The only characters in XML that require escaping in element content are less-than and ampersand. However, in attribute value you need to escape the quoting character you are using for the attribute value, whether that is the apostrophe or the double quote. The greater-than character doesn't need escaping in either context, but it's provided for consistency.

    Eric: As for STAGO, you might be correct in the SGML case (I couldn't find an authoritative source describing it one way or the other). However, XML does not allow it, as can be seen by the CharData construct

    CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

    which describes what contents an element may have that are not CDATA sections, entity references, comments, PIs or elements.

    Posted by liorean at
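    The CharData rule quoted in comment 16 can be checked against a real XML parser. This is a small sketch using Python's standard library parser; the `parses` helper and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the XML parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


samples = [
    "<p>b>a</p>",                   # literal > in element content
    "<p>1<2</p>",                   # literal < in element content
    "<p>a & b</p>",                 # literal & in element content
    "<p>1&lt;2 &amp; b>a</p>",      # < and & escaped, > left alone
]

for sample in samples:
    print(sample, "->", "well-formed" if parses(sample) else "not well-formed")
```

    Only the documents with a literal < or & are rejected, which agrees with the grammar: the greater-than sign may appear unescaped in element content, while less-than and ampersand may not.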

  17. O my god and I was thinking SGML was dead :-)

    Moose, you are referring to the so-called HTML entities are you? Not to the 5 pre-declared XML entities I mentioned in the post?

    I'm almost sure Mozilla gets this right, since the XML support is better than in Opera from what I have seen and know. (Although Opera has some nice support for XML content-types I believe.)

    Posted by Anne at

  18. Eric wrote:
    This can be entered literally:

    b>a

    And this too:

    1<2

    In short, it is handy to have as few characters constituting tokens with a lexical meaning as possible. '<' does not have a special meaning, STAGO (start tag open) has; STAGO is immediately followed by a name start character (assuming the reference concrete syntax, that is [A-Za-z]), or the designated delimiter just remains data.

    One can see the obvious problem with this. What if the user enters "b<a" - the browser will consider you are starting a hyperlink. So it would be wise to always encode chevrons, no matter what.

    Posted by Chris Hester at

  19. IMG/@ALT in HTML 2 was not required and is now.

    And?

    Use img alt="" in an HTML 4.01 document, then take the same img element and put it in an HTML 2 document. It will still work. alt is permitted in both cases. What HTML 2 isn't is forward-compatible: Permissible img without alt in 2 is impermissible in 4.01.

    Posted by Joe Clark at

  20. Have a read.

    Posted by MaThIbUs at

  21. Hi! i find great solution on your page... thanxs

    Posted by dan at