Anne van Kesteren

Microsoft to break the web

Neh, nothing about the new Internet Explorer 7. That might very well break the web too, depending on how they do it, but this is about only parsing well-formed feeds. I wonder if this means they will follow every single bit of RFC 3023 and XML. If they do, that means they will throw a parse error on every single character that is incorrect and treat text/xml as if it had a declared charset of us-ascii unless that is overridden using the charset parameter of the MIME type.

Why does that cause them to break the web? Well, following the rules above leads to breaking 44% of the feeds out there. Best wishes to the marketing department.
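
To make that concrete, here is a rough sketch of what such a strict client would do (Python, written purely for illustration; it simplifies RFC 3023 and is not based on anything Microsoft has published):

    # Sketch of a strict feed fetcher: charset and well-formedness errors
    # are fatal instead of being repaired.
    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_feed_strict(url):
        with urllib.request.urlopen(url) as response:
            body = response.read()
            content_type = response.headers.get_content_type()   # e.g. "text/xml"
            charset = response.headers.get_content_charset()     # charset parameter, if any

        if content_type == "text/xml" and charset is None:
            # RFC 3023: text/xml without a charset parameter defaults to us-ascii,
            # regardless of any encoding declaration inside the document.
            charset = "us-ascii"

        if charset is not None:
            # Any byte that is not valid in the effective charset is a fatal error.
            # (A full implementation would also let this charset override the
            # document's own encoding declaration; that part is omitted here.)
            body.decode(charset)

        # Any well-formedness error is equally fatal: a ParseError, no recovery.
        return ET.fromstring(body)

A UTF-8 feed served as text/xml without a charset parameter fails the us-ascii decode at the first non-ASCII byte, and any ill-formed feed fails outright in the parser; that is the behavior that would hit the 44%.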

Comments

  1. Neh, nothing about the new Internet Explorer 7. That might very well break the web too, depending on how they do it

    That goes for everything, of course.

    I wonder if this means they will follow every single bit of RFC 3023 and XML. If they do, that means they will throw a parse error on every single character that is incorrect and treat text/xml as if it had a declared charset of us-ascii unless that is overridden using the charset parameter of the MIME type.

    What's not good about that? If it fails in IE, people will fix their feeds.

    Posted by David Håsäther at

  2. I think this is a very bad move. Most RSS parsers (I say this meaning Opera's) can parse non-well-formed RSS pretty well. Microsoft is heading downhill in my book (they were going up for a bit).

    Posted by Ethan Poole at

  3. What's not good about that? If it fails in IE, people will fix their feeds.

    I feel the same. If IE goes balls-to-the-wall strict about this sort of thing, people aren't going to sit back and let their feeds continue to break; they'll fix them.

    If feeds are crap, then they need to be broken; they deserve to be broken. Maybe this will prove to be a powerful step towards heightening people's awareness of the standards.

    Posted by Ryan Bergeman at

  4. This is excellent news! I'd never have expected MS to follow a specification this strictly.

    This and their not-supporting-application/xhtml+xml-until-it-can-be-done-right makes them look a heckuvalot better in my eyes. They're still the evil empire, but at least they're trying.

    Posted by Brendan Taylor at

  5. This sounds like fixing the Web to me, Anne. I agree that breaking the usability of a huge chunk of the content out there is poor form, but it's the content that's broken (and to a large degree the specs). If the might of IE and Windows can go a ways in fixing this, I'm all for it.

    That said, following the law of HTTP in regard to the default character encoding for the 'text' type serves no one; it's an entirely impotent restriction in practice and is pretty harmful in the real world.

    In short, complete strictness with regard to XML should be applauded; complete strictness with regard to MIME is a matter I think is open to much debate.

    Posted by J. King at

  6. I think we should just give up on making everything valid, since browser makers seem to want to go out of their way to compensate for bad code. Therefore, I say web pages should be one giant string and the browser should be left to try and figure out what the heck the author meant. Why not? After all, browsers seem to be able to compensate for some pretty nasty code. Since they do, I ask you: what's the point?

    Posted by Ara Pehlivanian at

  7. J. King, MIME type handling is part of the XML specification. The specification references the RFC mentioned in the post.

    Anyway, it might indeed be nice, but I wonder if they will truly do it. See also the notice by Sam Ruby and the comments there.

    Posted by Anne at

  8. Ara, all websites already consist of strings, but I assume you mean something else? Most browsers also have pretty compliant XML parsers, except when it comes to failing on incorrect characters and the text/xml issue. For feed parsing, however, it became clear that publishers, led by Dave Winer, went down the HTML road again, and you had to parse feeds with a new kind of tag soup parser.

    Posted by Anne at

  9. Looks like fixing the RSS space.

    I still consider RFC 3023 impractical.

    Posted by Henri Sivonen at

  10. Oh my gosh! They're damned if they do and they're damned if they don't! At last MS decides that following the standards is important, and you go and complain because it will be incompatible with a bunch of ill-formed feeds served with the wrong MIME type...

    Tough luck! Content providers can fix their feeds so that UAs don't have to deal with their mistakes by using non-conformant error handling, especially when error handling is already well defined in XML. Feed readers that don't implement conformant XML parsers are broken. Software that serves ill-formed XML feeds is broken. Parsers that reject such feeds are not, and the web will benefit greatly from it, even though there very well could be a few teething problems.

    If MS didn't take this standards-compliant approach, XML would just continue along the same path and end up where we are now with HTML, and I'm sure no one wants that to happen.

    Posted by Lachlan Hunt at

  11. First of all, this is great news. The second IE7's feed reader hits the market, people with bad feeds will _have_ _to_ update them.

    Second of all, the Microsoft team says they want to only parse well-formed feeds. That doesn't mean they'll follow the specs to the letter; they'll just give an error on anything that isn't well-formed.

    Posted by Jochem at

  12. Anne: Yeah, I meant more like doing away with tags altogether and just using plain text without any markup. They're so good, let them figure it out! I was also being a little sarcastic because I have such a hard time with—as you call them—tag soup parsers. It just doesn't make sense to me how that sort of garbage product could have been allowed to be written. Stupid tag soup parsers. Compilers don't let you get away with a colon instead of a semicolon, but browsers will let you close parent tags before closing child tags, use quotes for attributes only if you want to, etc. Stupid, stupid. It just promotes junk markup and that's why the majority of sites on the 'net don't validate.

    You know, if browser makers spent just a little less energy writing compensation algorithms, and a little more time actually implementing standards such as full CSS 2 support, life would be so much better! I think we need a new movement along the lines of ALA's "To Hell with Bad Browsers."

    There, now I've vented some more ;-)

    Breathe in, breathe out, breathe in, breathe...

    Posted by Ara Pehlivanian at

  13. Yes! It's so simple to make a well-formed feed (especially since most are generated and not hand-written). This makes me happy. The web, or MS, is finally willing to move forward. 44% of the web will have to fix their feeds REALLY QUICK, so it shouldn't become a problem in the long run... especially since everyone is becoming aware of this NOW.

    Posted by Devon at

  14. I say this is a very good move by Microsoft. They got it right, for once. This is a hopeful sign, because now people will have to have well-formed feeds. So the tools to make those feeds will have to be improved. That can only be good.

    Let us hope they will make the same move regarding XHTML, so that people will have to use well-formed markup, resulting in better tools producing that markup.

    Posted by Ben de Groot at

  15. If MS didn't take this standards-compliant approach, XML would just continue along the same path and end up where we are now with HTML...

    This is a logical fallacy known formally as a "slippery slope" argument. Not only is it fallacious in the general case, it is also fallacious in this particular case. To wit: despite the wide prevalence of tag soup feed parsers, overall well-formedness is getting better, not worse. We can debate the causes of this (consolidation of publishing tools, wider knowledge of the feed validator, etc.) but it's sure as hell not because of client-side draconian parsers.

    Interoperability is a worthy goal, and client-side draconian error handling is such a seductive solution that many people just swallow it without thinking it through to its logical conclusion.

    I meant more like doing away with tags altogether and just using plain text without any markup.

    There ain't no such thing as plain text.

    ...following the law of HTTP in regard to the default character encoding for the 'text' type serves no one; it's an entirely impotent restriction in practice and is pretty harmful in the real world.

    So you finally bothered to read the spec and now disagree with it on practical grounds? There's a name for people like you. (Yeah yeah, ad hominems are logical fallacies too. So sue me.)

    If feeds are crap, then they need to be broken; they deserve to be broken.

    Wow. Words fail me.

    It's so simple to make a well-formed feed...

    You must be new here.

    Posted by Mark at

  16. To Mark's point: if the heterogeneity of content -- often generated from multiple sources -- makes draconian error handling unfeasible, what set of social conventions and technological infrastructure needs to be in place to minimize the need for such relaxed parsers? I have yet to see a feasible proposal that balances the problems of "tag-soup parsers" against relaxed handling of invalid markup. Perhaps someone will point me in the right direction. Maybe I'm just new here.

    Another question, from a security standpoint: what are the long-term security risks of loose error handling on the client side, from a software design perspective? Do they outweigh the need to keep things from breaking in this case? In others?

    Posted by David at