Anne van Kesteren

Standard compliant tools

19 June 2004

Keith wrote something about his efforts to get MT to output valid markup, more specifically where special characters are concerned. Before he wrote this, there was some (serious) discussion on Dave Shea's site between him and Jacques Distler about validation and ampersands.

That was a few days ago, and I happily jumped on the bandwagon and wrote a little entry saying that everyone should use the five important entities (and if not, they should try to learn them), which was probably a bit evil of me, since there are some reasons people can't really validate. Reasons include:

Software
Software
Software
Server configuration

One could probably find an argument stating that server configuration is part of software as well. Here is my thing: I think that software should be improved. It is probably quite easy to say that, and it seems to be pretty damn hard to do, since even WordPress can mess it up if you don't pay attention (or if you use application/xhtml+xml for the admin area and use an ampersand in a category name; that is fixed for 1.3, by the way). Fortunately, comments are ensured to be valid and posts are almost always valid, since I have some basic grasp of the XHTML syntax, if I may say so. Maybe I should add that weblogging software is probably the most valid software that currently exists. It is also the software that generates the most user-friendly URIs, whereas Vignette tries hard to make things look ugly. Encoding seems to be some "left alone technology" as well in most software (if you ever plan to make software, use utf-8), though not in WordPress, obviously.

Here are some ideas:

Whine
Whine
Report bugs
Whine

Comments

This is what ultimately did the job for me (it'll leave through &entity; and &#1234;, all other occurrences of unencoded &'s will be encoded). It's not really my code, but I modified it to leave the &#1234; style through.
```
$text = preg_replace('/&(?!(#\w{2,6}|\w{2,6});)/', '&amp;', $text);
```
I fail to see why you say even WordPress, however... Besides, you say your posts are almost always valid (well, duh, why wouldn't they be); the question you should ask here is whether the average user's post would be valid. And why don't you simply apply the same validation script to the admin area? That would ensure the validity of your posts.
Posted by Frenzie at 4:24PM
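For readers who want to try the approach from the comment above, here is a minimal sketch of the same negative-lookahead idea. The function name `escape_bare_ampersands` is hypothetical, and the pattern is a slightly stricter assumption than the one Frenzie posted: it tells named, decimal and hexadecimal references apart instead of accepting any word characters after the `#`.

```php
<?php
// Hypothetical helper; not part of WordPress, MT or any other package.
function escape_bare_ampersands(string $text): string
{
    // Negative lookahead: leave "&" alone when it already starts a reference
    // such as &name;, &#1234; (decimal) or &#x4D2; (hexadecimal).
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]{1,7}|#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6});)/';

    // Every other "&" is a bare ampersand and gets escaped.
    return preg_replace($pattern, '&amp;', $text);
}

echo escape_bare_ampersands('Tom & Jerry &amp; co &#1234; &bogus'), "\n";
```

Note that this only checks the shape of a reference, not whether a named entity is actually defined, so something like `&nosuchthing;` would still pass through untouched.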
Frenzie, you could just use utf-8, which is forward compatible and can handle every character instead of just a few. It is also one of the two (utf-16 is one as well) required character sets for XML parsers to implement. So if you use it, you make sure that every XML parser can read your site, which uses the application/xhtml+xml MIME type.
Posted by Anne at 4:42PM
I will try things out with UTF-8, but you will still need something like this:
$text = preg_replace('/&(?!\w{2,6};)/', '&amp;', $text);
If you don't like that you can of course rewrite it so that it only leaves through six entities and displays the others as &amp;entity;. But no matter what you do, you will need this to ensure valid posts. Btw, how will that work with the database? It's stored as it was submitted in iso-8859-1 after all...
Posted by Frenzie at 6:20PM
Disregard that, I wasn't thinking, that's what htmlspecialchars() does. But anyway, I wanted it to support &#1234; as well.
Posted by Frenzie at 6:24PM
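As a rough illustration of the difference discussed above: `htmlspecialchars()` escapes every `&` by default, while its fourth argument (`double_encode`, available in later PHP versions than the ones current when this thread was written) can be set to `false` to keep existing references. The sample string is made up for the example.

```php
<?php
// Made-up sample input containing a bare "&", markup, and two existing references.
$input = 'Fish & chips <b>now</b> &#1234; &eacute;';

// Default behaviour: every "&" is escaped, including the ones that
// already start a character or entity reference.
echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8'), "\n";

// double_encode = false: existing references are left as they are,
// only the bare "&" (and the angle brackets) are escaped.
echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8', false), "\n";
```

Unlike the regular expression, this also escapes `<` and `>`, so it suits plain-text comments better than posts that are meant to contain markup.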
That is supported without an entity in utf-8, wake up! ;-) Ӓ. Just like all other "special" characters.
Posted by Anne at 6:37PM
But when I use ä, ö, ü in WordPress with charset utf-8 they are displayed as ?. Shouldn't they show up just as expected?
Posted by Christoph Wagner at 7:16PM
You should also change your setting in the admin panel I guess. And note that having the charset being declared as utf-8 isn't really the same as having the characters being utf-8 characters.
How do you think it works here ;-).
Posted by Anne at 7:38PM
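The distinction above, between declaring utf-8 and actually sending utf-8 bytes, can be sketched as follows. This assumes the mbstring extension is available and that any input which is not valid UTF-8 is ISO-8859-1; the helper name `to_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: returns $text as UTF-8 bytes.
// Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE4\xF6\xFC";   // "äöü" as ISO-8859-1 bytes (one byte each)
$utf8   = to_utf8($latin1); // the same characters as UTF-8 (two bytes each)

// Declaring charset=utf-8 in the Content-Type header only helps
// if the bytes sent look like $utf8, not like $latin1.
echo bin2hex($latin1), ' -> ', bin2hex($utf8), "\n";
```

Sending the one-byte form under a utf-8 declaration is one plausible way to end up with the question marks described above. The check is a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8 and would then pass through unconverted.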
"That is supported without an entity in utf-8, wake up! ;-) Ӓ. Just like all other "special" characters."
Is that a reason to disallow them? The only stupid thing is that I fear that Opera will submit this reply with an entity... :(
Posted by Frenzie at 8:13PM
It did not, good, good. Just testing the “”...
Anne, if I were to switch to UTF-8 completely, I suppose it would also mean changing the database entries? What kind of PHP fix script in combination with SQL queries would be required to change the encoding then?
Posted by Frenzie at 8:16PM
Frenzie, I can't follow you at all. If this is about my comment system again, just contact me, thanks.
The converting process is simple: export your db, save as utf-8 in your editor, import again.
Posted by Anne at 8:23PM
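If you would rather script the "save as utf-8" step than do it in an editor, a minimal sketch could look like this. It assumes the exported dump is a plain text file in ISO-8859-1 and that the mbstring extension is available; the function name and file names are placeholders, and any charset declarations inside the dump itself would still have to be checked by hand.

```php
<?php
// Hypothetical helper: copies an ISO-8859-1 dump to a new UTF-8 file.
function convert_dump_to_utf8(string $inPath, string $outPath): void
{
    $in  = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open the dump files');
    }

    // ISO-8859-1 is a single-byte encoding, so converting line by line
    // can never split a character in half.
    while (($line = fgets($in)) !== false) {
        fwrite($out, mb_convert_encoding($line, 'UTF-8', 'ISO-8859-1'));
    }

    fclose($in);
    fclose($out);
}

// Usage (placeholder file names):
// convert_dump_to_utf8('dump-latin1.sql', 'dump-utf8.sql');
```

Keep the original export around until the imported data has been checked, since running the conversion twice on the same text would garble it.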
It has nothing to do with your comment form. It's just that I've written it to change any & into &amp; while excluding &entity; and &#1234;, and you assume that is because I want to post certain characters, which is not the reason I want them to be left through. ISO doesn't need that either; it's just that I had written that thing to convert every & that isn't part of an &entity; into &amp;, and I wanted it to exclude &#1234; as well. It has nothing to do with me wanting to use those things, but I can't get the message through to you and you keep saying that I don't need them with UTF-8. I simply don't need them with ISO-8859-1 either, but that's no reason for me to turn every user-submitted &#1234; into something ugly-looking.
Also, how does IE react? It loves to convert everything into windows-1252 encoding, even if the site says it should be otherwise.
Posted by Frenzie at 12:00AM