Anne van Kesteren

Redefining Tag Soup, by Faruk Ates

(This article was not written by Anne van Kesteren and does not necessarily represent his opinions. The markup has been modified, but the meaning and intention of the original author were not changed.)

A lot of people use the term "tag soup" these days, but often in ways it shouldn't be used, which makes them look silly. MIME types matter, true enough, but a wrong declaration does not make something tag soup.

If you've read articles like Anne's MIME types matter; DOCTYPEs don't, you'll undoubtedly have come across this term: tag soup. But what is tag soup, and how should we apply it? Opinions don't seem to differ much on that, until you start asking a lot of people about it. A growing trend is to use the MIME-type excuse to label documents as Tag Soup. But let's not get ahead of ourselves.

Tag Soup as explained by the page I just linked:

[...] the term was coined by Dan Connolly of the W3C when he was talking about HTML parsers that accept anything anywhere. The example he cited is the TITLE element. It really only makes sense in the HEAD of a document, but apparently one or more browsers would let you set the title of a page in the body of the page! It's not like this makes the earth crumble or the sky fall, everything can proceed normally, but it's wrong to do it there and the world would be a (slightly) better place if browsers didn't allow it.

So basically, tag soup is when a document is poorly written markup (HTML, XHTML, XML, and so forth) and the browser tries to make sense of it using its built-in intelligence (or "crapcode-fixing skills", as I like to call them).

So how are people misusing the term, you wonder? Well, let's see. Ever since people started talking about tag soup when they were really talking about XHTML and MIME types, people have picked up "the message": (validating) XHTML documents with a text/html MIME type are tag soup. WRONG! If you have a validating XHTML document, it doesn't matter whether you send a text/html header, a text/xml header or even a text/css header (to be silly): it is still a validating XHTML document and not Tag Soup!

But then, why are people calling it tag soup? Because they've made a generalization that misses a vital detail, which makes it a wrong generalization. What happens when you send a (valid or not) XHTML document with a text/html header is that a browser receiving it as such will treat it as tag soup. So my perfectly validating XHTML document sent as text/html will be treated as tag soup, but it is not tag soup itself. If the document validates, you can even safely state that it cannot be tag soup, because real tag soup comes from invalid markup that somehow ends up displaying properly thanks to a browser's crapcode-fixing skills.

Even though it's technically not well-formed when sent with a wrong MIME type, it's wrong to label the document as being Tag Soup, because it simply cannot be Tag Soup as judged by the definition of Tag Soup.

So the important detail is simply the manner in which you put it. Those who started this trend put it right: This basically means that the example is rendered as HTML tag soup in any browser. (Anne van Kesteren)

Want to do it right? Add the words treated as whenever you say that a valid document with a wrong MIME type "is tag soup". Such documents are not tag soup: they passed a validator, and if you saved them to your hard drive (as .xhtml in the case of XHTML documents) they would get a proper MIME type when loaded up again, while their contents would still be exactly the same.
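To make that last point concrete, here is a small sketch (Python is used purely as an illustration; the file names are invented) of how a MIME type is guessed from a file name while the bytes inside stay untouched:

    import mimetypes

    # The standard mimetypes table stands in for what a server or file
    # manager would do: the guess follows the name, never the content.
    print(mimetypes.guess_type("page.xhtml"))  # ('application/xhtml+xml', None)
    print(mimetypes.guess_type("page.html"))   # ('text/html', None)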

- Faruk Ates

Comments

  1. Thanks for hosting my little article, Anne. :)

    Posted by Faruk Ates at

  2. MAYBE I’M JUST obsessive, but I sometimes do seriously get off on minutia. Here’s a great article by Faruk Ates “Redefining Tag Soup” courtesy Anne van Kesteren’s weblog that, well, redefines the term Tag Soup!

    Posted by molly.com » and speaking of minutia at

  3. "tag soup" also implies that HTML is ugly, messy and ambiguous, while XHTML is correct and structured.

    But valid HTML is exactly as structured and unambiguous as valid XHTML. The difference is, of course, that a parser will reject non-well-formed XHTML, but this is not such a big deal, since most HTML errors are semantic or validity errors, not well-formedness errors.

    For example, an XHTML document with a title element inside the body element is still well-formed, so it would be parsed successfully by an XHTML UA, which would then have to handle the misplaced title somehow. Exactly the same error handling could be used in HTML.
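    (To see that in miniature, here is a tiny sketch using Python's standard XML parser purely as a stand-in for a strict XML processor; the markup is made up. The misplaced title is accepted, because it is a validity error, not a well-formedness error.)

        import xml.etree.ElementTree as ET

        doc = """<html xmlns="http://www.w3.org/1999/xhtml">
          <head><title>real title</title></head>
          <body><title>misplaced, but still well-formed</title></body>
        </html>"""
        ET.fromstring(doc)  # parses without complaint; only a validator would object
        print("well-formed")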

    Posted by Olav Junker Kjær at

    The answer I wanted to give, but didn't. Thanks Faruk!

    The MIME Content-type is only, and I quote the specification, to specify the media type and subtype of data in the body of a message.

    So it's like putting a label 'Intel inside' on a PC with an AMD processor. The label doesn't change the content!

    It's the same with a file extension. If I rename an Excel file from test.xls to test.doc, it will still be an Excel file. If I open it in Excel, it will still open. If I open it in Word, then Word will complain that it isn't a Word document.

    Even if the MIME Content-type is correct, there's no guarantee that the data can be read correctly by the receiving application. XHTML 1.x and XHTML 2 have the same MIME Content-type (application/xhtml+xml), but that doesn't mean Mozilla 1.7 understands both.

    If an Excel file was created with Excel 2002, that is no guarantee that it can be opened in Excel 97...

    Posted by Rémy at

  5. If you have a valid XHTML document, it doesn't matter whether you send a text/html header, a text/xml or even text/css (to be silly) header: it is still a valid XHTML document!

    Sorry, but the MIME-type does determine whether an XHTML document is well-formed. Consider a document whose first line is

    <?xml version="1.0" encoding="..."?>

    sent as text/xml. RFC 3023 says that the above encoding declaration is to be ignored. The encoding is to be

    1. The charset specified in the HTTP headers (if present) or
    2. US-ASCII.

    So... unless the server set the correct charset in the HTTP headers, such a document is ill-formed (unless, by chance, it contains only US-ASCII).

    Change the Content-Type to application/xml or application/xhtml+xml and the rules change. Now the encoding is

    1. The charset specified in the HTTP headers (if present) or
    2. the encoding specified in the encoding attribute of the XML declaration within the document or
    3. UTF-8.

    Hmmm... Now the document is well-formed.

    Finally, if sent as text/html, the <?xml ... ?> declaration is ignored by the HTML parser. In that case, the encoding is

    1. The charset specified in the HTTP headers (if present) or
    2. the encoding specified in a <meta ...> element or
    3. ISO-8859-1.

    Yet again, the validity status of the document changes when you change the MIME-type.
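    (As a toy sketch of those three precedence ladders, with invented parameter names and nowhere near a full RFC 3023 implementation: http_charset is the charset parameter from the HTTP Content-Type header, if any; the other two arguments are whatever appears inside the document itself.)

        def effective_encoding(mime_type, http_charset=None,
                               xml_decl_encoding=None, meta_charset=None):
            if http_charset:                       # the HTTP header always wins
                return http_charset
            if mime_type == "text/xml":            # in-document declaration ignored
                return "us-ascii"
            if mime_type in ("application/xml", "application/xhtml+xml"):
                return xml_decl_encoding or "utf-8"
            if mime_type == "text/html":           # the XML declaration means nothing here
                return meta_charset or "iso-8859-1"
            return None

        # The very same document flips status as the label changes:
        print(effective_encoding("text/xml", xml_decl_encoding="utf-8"))         # us-ascii
        print(effective_encoding("application/xml", xml_decl_encoding="utf-8"))  # utf-8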

    More broadly, there are a host of other incompatibilities between the way documents sent as text/html and application/xhtml+xml are to be handled. You just can't avoid the transport-layer in talking about the "validity" of an XHTML document. Let's just say that, when sent as text/html, the document is, in all cases, parsed as HTML (using the "html parser", AKA the "tag-soup parser").

    The distinction between "being handled as tag-soup" and "being tag-soup" (where, "tag-soup" is here used in the browser-writers' parlance, as synonymous with "HTML") is useful only to the extent that you care about the platonic integrity of your document and not about how it is handled.

    Posted by Jacques Distler at

  6. [...] Any document sent as text/html is treated as tag soup, even by an XHTML user-agent. [...]

    Posted by WebSprockets » Valid XHTML & WordPress at

    Jacques, not being 100% well-formed does not immediately constitute tag soup. A document can be perfectly well-formed (in the meaning of the word), sadly be sent with a wrong MIME type (and thus not be "well-formed" in the most technical, Standards-oriented meaning), and it would still not be Tag Soup.

    But, in an attempt to satisfy you, the article was edited slightly to make that distinction a little clearer.

    Posted by Faruk Ates at

  8. Huh. I didn't know well-formedness could be measured in percentages. I thought you're either well-formed or... not.

    I'm not sure why folks are so fond of certain components of well-formedness (quote your attributes, nest your elements properly, blah blah) and so quick to pooh-pooh everything else. Just because WASP hasn't been harping on the transport layer for the last four years doesn't mean this stuff isn't important. Yes, poorly nested elements are bad. So are user agents that mangle documents due to their ignorance of transport-layer standards. Why is one more important than the other?

    Posted by Evan at

  9. As Evan says, well-formedness is a binary proposition. An XML document either is well-formed, or it isn't. If the character-encoding — as determined by the precedence rules of RFC 3023 (or, for local files, as determined by yet a different precedence rule; your "clarification" just added another layer of confusion) — is wrong, then the document (barring some miracle) is ill-formed.

    Period.

    But the meta-point that I was making still seems to elude you. You clearly feel that the distinction between "being treated as tag soup" and "being tag soup" is an important one: that sending a "valid" XHTML document with the wrong MIME type is a trivial, cosmetic change. It's not.

    Posted by Jacques Distler at

  10. “You just can’t avoid the transport-layer in talking about the “validity” of an XHTML document.”

    So the XML documents I have sitting on my local hard drive that conform to XHTML syntax — if they’re well-formed but just sitting there, not being accessed, are they considered tag soup?

    Similarly, if a tree falls in a forest, and no one is around to hear it…

    Posted by Dave S. at

  11. Faruk, although I understand the distinction you're trying to make, I don't fully understand what your point is.

    How could it possibly matter that your document contains valid and well-formed XHTML markup, when user agents still have to parse it as if it were tag soup? All the advantages of using X(HT)ML are lost.

    Posted by Tommy Olsson at

  12. So the XML documents I have sitting on my local hard drive that conform to XHTML syntax ...

    Sigh.

    The algorithm for determining the encoding of a local XML file is beastly clever. But the XML Specification clearly indicates that "external" (file-system or transport-layer) mechanisms for conveying the encoding information take precedence. If these conflict with what's actually in the file, it is "a fatal error."
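    (A rough sketch of the first step of that cleverness, with Python standing in as the illustration language; it ignores the EBCDIC and BOM-less detection cases the XML specification also covers.)

        import codecs

        def sniff_bom(first_bytes):
            # UTF-32 BOMs must be tested before UTF-16, since they share a prefix.
            for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                              (codecs.BOM_UTF32_BE, "utf-32-be"),
                              (codecs.BOM_UTF8, "utf-8"),
                              (codecs.BOM_UTF16_LE, "utf-16-le"),
                              (codecs.BOM_UTF16_BE, "utf-16-be")):
                if first_bytes.startswith(bom):
                    return name
            return None  # no BOM: fall back to the <?xml ... encoding="..."?> declaration

        print(sniff_bom(b"\xef\xbb\xbf<?xml version='1.0'?>"))  # utf-8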

    If this character-encoding stuff makes your head hurt, I promise you that a discussion of external entities (and non-validating versus validating parsers) will cause you to start bleeding from the ears.

    Heck, Henri Sivonen and I went a few rounds on that subject on your own blog. Did you even notice?

    Posted by Jacques Distler at

  13. As Evan says, well-formedness is a binary proposition.

    Yes, but not being well-formed is not necessarily being Tag Soup. There's a distinction, and you (Jacques) are scaring (new) people away from learning XHTML by telling them they're creating tag soup even when they're not.

    If I have a validating XHTML document, but it gets sent as text/html (which I tend to do for whatever doesn't accept application/xhtml+xml), it may not be well-formed, but it isn't Tag Soup either. Tag Soup is not the opposite of well-formed when you're talking about correct MIME types. Tag Soup is the opposite of well-formed when you're talking about a validating XHTML/HTML structure versus a structure that displays properly (in at least one browser, anyway) but a) is incredibly messy, with elements and tags in all the wrong places, and b) only displays properly because that particular browser is making sense of your mess on its own.

    Tag Soup documents cannot pass a strict validation test. That's what the article is about. Since I didn't write it for those with a ton of technical knowledge of the very details of these protocols and standards, but for those who were using the term wrongly simply because they didn't know about the distinction yet, I'm not surprised that all the technical people are hiccuping over the details (which are written to be easy to understand for the targeted audience).
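    A short sketch of that contrast (Python, with made-up markup snippets): a strict XML parser rejects the mess outright, while a tolerant HTML parser quietly accepts both.

        import xml.etree.ElementTree as ET
        from html.parser import HTMLParser

        tag_soup = "<p>unclosed <b>wrong <i>nesting</b></i>"
        well_formed = "<p>proper <b>nesting <i>here</i></b></p>"

        for label, doc in (("tag soup", tag_soup), ("well-formed", well_formed)):
            try:
                ET.fromstring(doc)   # strict: only well-formed markup gets through
                print(label, "parses as XML")
            except ET.ParseError:
                print(label, "is rejected by the XML parser")
            HTMLParser().feed(doc)   # tolerant: swallows either without a peep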

    Posted by Faruk Ates at

    Jacques — as a content author, I see no reason why I should be required to deal with, or even know anything about, XML auto-detection. Other than that the tools aren’t doing it for me.

    As a content author, I set my DOCTYPE, specify my encoding, validate my markup, and drop the file on the server. The client and server should determine the transport.

    No, it doesn’t work anything like that. But do you ever get the feeling you’re discussing this with the wrong people? Shouldn’t this be a conversation between the Apache/Mozilla/Opera/etc. crews instead of us lowly end users?

    (incidentally, ever noticed that Tim Bray’s Ongoing is XHTML 1.1, served as text/html? Just saying…)

    Posted by Dave S. at

  15. If these conflict with what’s actually in the file, it is “a fatal error.”

    …well, no reason to know other than that.

    Posted by Dave S. at

  16. What Tommy said.

    But do you ever get the feeling you’re discussing this with the wrong people? Shouldn’t this be a conversation between the Apache/Mozilla/Opera/etc. crews instead of us lowly end users?

    Dave, I sympathize with this statement, and I think it does have some validity. However, we lowly web monkeys need to worry about this stuff too. Just a recent example: a couple weeks ago I got an X-Philes submission from someone who made a rather snarky remark about how he couldn't understand why there weren't more X-Philes, why people don't read specs, etc. Unfortunately for him, his pages were being served as text/html. It turns out he had tried to set his MIME type to application/xhtml+xml in a <meta> element at the top of the page.
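    (Only the HTTP header counts, and it is easy to check what a server actually sends; a quick sketch in Python, with example.org standing in for the real site:)

        from urllib.request import urlopen

        with urlopen("http://example.org/") as resp:
            print(resp.headers.get("Content-Type"))  # whatever the server really sent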

    (A general aside: for those of you who choose to submit your sites to the X-Philes, please refrain from making snarky remarks about the failings of others in the web standards arena. Even though the X-Philes is a very weak test of web standards compliance, the success rate for first-time submissions is somewhere between 30 and 40 percent -- and I assure you, there is no positive correlation between those who succeed and those who enjoy belittling others in their submissions.)

    Sorry about that, back on point. Dave, I think WASP has done a superb job spreading awareness of web standards. However, there are other pieces to the XHTML puzzle -- pieces that cannot be swept under the rug, pieces that are important to app developers and content developers alike. I would like to see WASP turning its formidable communications efforts towards closing this gap. Why isn't WASP hammering away at this?

    Posted by Evan at

    The Web Standards Project has done something regarding MIME types, but they talked to the W3C, who are all big fans of XHTML (and especially RDF) and, as Ian Hickson mentioned, don't really care about desktop browsers.

    It would be more practical to advise HTML instead, since user agents are tolerant of that syntax. With XHTML this is not the case, and MIME types and several other layers of complexity have to be considered important.

    Posted by Anne at

  18. "However, we lowly web monkeys need to worry about this stuff too."

    Awareness and understanding are two different things. I agree with awareness, but I don't necessarily feel understanding is something I need on this matter.

    Unfortunately for him, his pages were being served as text/html.

    If it weren't for the snarky comment he made, I'd have blamed this entirely on the server. I'm not blind to your point though.

    ...pieces are important to app developers and content developers alike

    Pieces like...? Awareness is half the battle. If you have a list of items WaSP needs to address, now's the time to pull it out.

    We've done the MIME thing, going straight to the source in this "WaSP asks W3C". The result is about what you'd expect; I don't think I'll analyze it here.

    If you care about getting better answers than that, WaSP could use your help.

    Posted by Dave S. at

  19. Faruk wrote:

    Yes, but not being well-formed is not necessarily being Tag Soup.

    Ill-formedness is a fatal error for an XML document. You can't get any soupier than that.

    There's a distinction, and you (Jacques) are scaring (new) people away...

    So we should hide the truth from people, so as not to scare them away? As Mark Pilgrim is fond of saying, "XML is hard." Wishing it were easy doesn't make it so. And (pretending to) move the goalposts to make it easier doesn't make it so either.

    Dave wrote:

    Shouldn’t this be a conversation between the Apache/Mozilla/Opera/etc. crews instead of us lowly end users?

    Abstractions leak. It's a sad, but inevitable fact of life.

    The aforementioned discussion with Henri Sivonen is an even better example of this phenomenon. There we had a situation where a perfectly valid and well-formed XHTML document would cause a perfectly compliant XHTML user-agent to throw a fatal parsing error.

    I howled at the unfairness of it all, and shed bitter tears. Henri told me to grow up.

    Posted by Jacques Distler at

  20. Sorry, Dave, I missed that article. My bad.

    As for whether one can codify [the set of XML issues that web developers should know about, but don't] into a user-friendly list... yes, that would be a constructive thing to do, wouldn't it? Hmmmm. Personally, I doubt I could enumerate even a fraction of these issues on my own. So point taken, again. I'll have to think about this.

    XML is hard.

    Posted by Evan at

  21. Sorry, but the MIME-type does determine whether an XHTML document is well-formed.

    So... unless the server set the correct charset in the HTTP headers, such a document is ill-formed (unless, by chance, it contains only US-ASCII).

    But that's a case of the wrong character encoding being used. That issue is distinct from the issue of whether text/html intrinsically makes XHTML malformed or not.

    The distinction between "being handled as tag-soup" and "being tag-soup" (where, "tag-soup" is here used in the browser-writers' parlance, as synonymous with "HTML") is useful only to the extent that you care about the platonic integrity of your document and not about how it is handled.

    That statement only holds true if you assume everything everywhere will always treat it as tag-soup. If you are publishing XHTML content as text/html, there's nothing stopping you from using tools that expect well-formed XHTML on that content. So what if browsers treat it as tag-soup?

    Posted by Jim Dabell at

  22. If you are publishing XHTML content as text/html, there's nothing stopping you from using tools that expect well-formed XHTML on that content. So what if browsers treat it as tag-soup?

    Server-side, you can do anything you bloody well please with the content. If it's well-formed XML, you can use XML-processing tools on it. If it's not well-formed, those tools will barf. But then, since it's your content, you can fix it.

    Client-side is another matter. Clients should not sniff the contents of documents sent as text/plain or text/html to see if they are "really" XML. That way lies madness.
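    (In the spirit of that first paragraph, a small Python sketch with a hypothetical file name: server-side, well-formed XHTML is just XML, whatever label it will later be served under.)

        import xml.etree.ElementTree as ET

        XHTML = "{http://www.w3.org/1999/xhtml}"
        tree = ET.parse("page.xhtml")  # hypothetical well-formed source document
        # A real XML query instead of regex guessing: collect every link target.
        print([a.get("href") for a in tree.iter(XHTML + "a")])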

    Posted by Jacques Distler at

  23. <My bad.>
    English Parsing Error: mismatched noun. Expected: <mistake>
    Location: http://annevankesteren.nl/archives/2004/08/tag-soup, Comment 20, Column 41
    <My bad.>
       --^

    Phil

    Posted by phil smears at