Anne van Kesteren

XML entities

19 December 2007

There are several solutions to the XML entities problem. We just have to pick one:

(a) Extend the set of default entities of XML. This seems the most pragmatic choice to me, but I’m sure there will be quite a few people who don’t like it.
(b) Continue the hack of known public identifiers that will enable a set of entities to work. This is what we’re doing now, though, as you can see, not all browsers recognize all “important” public identifiers yet.
(c) Deploy UTF-8+names. This solution seems suboptimal to me, but it’s worthy of consideration.
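
For readers who have not run into the problem, here is a minimal sketch of it, assuming Python's standard xml.etree.ElementTree parser as a stand-in for an XML parser that reads no DTD. The helper name parses is made up for this illustration. Only the five entities predefined by XML resolve on their own; a name borrowed from HTML, such as &nbsp;, is a well-formedness error unless something declares it.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is well-formed for a parser that is given no DTD."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five entities predefined by XML need no declaration.
print(parses("<p>&lt; &gt; &amp; &apos; &quot;</p>"))

# A numeric character reference for U+00A0 (no-break space) needs none either.
print(parses("<p>a&#160;b</p>"))

# The same character written as a literal.
print(parses("<p>a\u00a0b</p>"))

# An HTML entity name with no declaration in sight: the parser rejects it.
print(parses("<p>a&nbsp;b</p>"))
```

The first three documents parse and the last one does not, which is the gap all three options above try to close in different ways.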

Comments

Or (d) do nothing.
If you want exotic entities in an XML document, either include them as literals (preferably as UTF-8, but that's a matter of preference and known support footprint) or use numeric entities (the latter being slightly more useful for things like non-breaking spaces where spotting them might be tricky). I don't really see why it's so immensely important to support &nbsp; et al, other than “because HTML does”.
Any change to the level of support will take a very long time before it's mainstream enough to be useful in any case.
Posted by Mo at 9:42PM
More math entity pain coming up. DTDs don’t work on the Web.
Posted by Henri Sivonen at 10:48PM
Mo, doing nothing equals going with “magic” public identifiers which is the situation we are in now. Not really optimal in my opinion. I suppose another option is to remove that feature, but stuff will break, I’m sure.
Posted by Anne van Kesteren at 12:06AM
Anne: Okay, “nothing” wasn't too well-phrased. I'm of the firm opinion that the DOCTYPE magic shouldn't have ever been present in the first place, and shouldn't be there now. Browsers that do properly support XHTML should deprecate the “feature” with immediate effect and remove it in a few months' time. Nobody actually churning out XHTML is supposed to be using it, and XHTML support is arguably new enough that there aren't enough documents in the wild for a reversion to the specified behaviour to cause widespread panic, riots in the streets, and so on.
Mind you, people will complain, and an awful lot of people seem remarkably scared of that.
In more detail: I'm opposed to (a) because it strikes me as unnecessary and won't work reliably for a long time (plenty of browsers installed on users' machines will do DOCTYPE sniffing for a while to come irrespective of what people decide); I'm opposed to (b) because it's dirty and messy and, quite frankly, sucks; and I'm opposed to (c) because it's a solution looking for a problem. Most operating systems work perfectly well with UTF-8 (and other Unicode encodings) when told to: they have to parse the damn things anyway.
Posted by Mo at 12:39AM
If it weren't for the math people, we could ignore the problem and just live with it. Irritating but not fatal. But the math people really don't have a good alternative to name all their glyphic weirdness, that I know of.
Posted by Tim at 1:56AM
“If it weren't for the math people, we could ignore the problem and just live with it. Irritating but not fatal. But the math people really don't have a good alternative to name all their glyphic weirdness, that I know of.”

MathML isn’t written by hand in practice. Instead, it is generated from something else. The generator should output unescaped UTF-8 and not leak mnemonic names to the browser.
Posted by Henri Sivonen at 4:47AM
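
The generator approach from the last comment could look roughly like the sketch below. It assumes Python and its standard html.entities.name2codepoint table, which covers only the HTML 4 names and not the full MathML set, so a real generator would need a larger table. The function name expand_named_entities and the sample markup are made up for this illustration.

```python
import re
from html.entities import name2codepoint

# Names that XML itself predefines; these must stay escaped in the output.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Matches a named entity reference such as &alpha; (not numeric ones like &#160;).
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_named_entities(markup: str) -> str:
    """Replace known mnemonic entity names with the literal characters they stand for."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names untouched.
            return match.group(0)
        return chr(name2codepoint[name])

    return _ENTITY.sub(replace, markup)


# Hypothetical generator output that still contains mnemonic names.
source = "<mi>&alpha;</mi><mo>&le;</mo><mn>1</mn>"
clean = expand_named_entities(source)

# Serialize as UTF-8 so the browser only ever sees literal characters.
payload = clean.encode("utf-8")
print(clean)
```

This is a plain text substitution, so it would also rewrite names inside CDATA sections or comments; a generator that builds its output from a tree rather than from strings avoids that.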