Anne van Kesteren

Retrieving information from other sites is difficult

23 June 2004

Yesterday I received a pingback comment. Apart from the fact that the author of that post should read something about name linking, one thing was particularly interesting: after he made the pingback, my weblog was no longer well-formed. This wasn't his fault. The current pingback specification says nothing about retrieving information or excerpts, so it is a bug in WordPress. I'm not even sure the bug is solvable. His weblog is encoded in iso-8859-1, which means that the character 'ä' takes only one byte. My weblog, however, is encoded in utf-8, where that same character takes two bytes if I'm not mistaken, and that was the compatibility issue I faced yesterday. WordPress simply copies the bytes from the external site and pastes them here, without taking character sets into account.
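
To make the byte mismatch concrete, here is a minimal sketch in Python (an assumption of this example; WordPress itself is written in PHP). It encodes 'ä' both ways and checks whether bytes copied unchanged from an iso-8859-1 page are valid utf-8. The helper name is_valid_utf8 and the sample word are hypothetical.

```python
# The same character, encoded two ways.
latin1_bytes = "ä".encode("iso-8859-1")  # one byte: 0xE4
utf8_bytes = "ä".encode("utf-8")         # two bytes: 0xC3 0xA4


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as utf-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An excerpt copied byte-for-byte from an iso-8859-1 page (sample text).
excerpt = "Einträge".encode("iso-8859-1")

print(len(latin1_bytes), len(utf8_bytes))
print(is_valid_utf8(excerpt))
```

The lone 0xE4 byte announces a multi-byte utf-8 sequence that never follows, which is why pasting it into a utf-8 page breaks well-formedness.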

Some consider this problem minor, but a conforming XML parser, which Mozilla and probably Opera are not, would report a well-formedness error. This would mean, for example, that Mark Pilgrim's Universal Feed Parser would fall back into bozo mode if I had an Atom feed for my recent comments or weblog entry + comments, which isn't desirable. Furthermore, by losing the ability to display the character you lose data. At first I thought there was a simple solution, like stripping all characters outside the 128 characters of US-ASCII or using utf8_encode, but both have problems. The first could lose important data and probably won't work with more exotic encodings; the second only works with iso-8859-1, which is not the only character set on the internet. There are three solutions (from best to ugly):

Convert the web to utf-8
Comment moderation for pingbacks always on (the problem is that there is no such option)
A function that takes the "input", "input charset" and "output charset" arguments and does something nice with it for all possible character encodings currently alive
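
The third option can be sketched as follows, again in Python as an assumption of this example rather than anything WordPress provides. The function name transcode and its parameter names are hypothetical. It only covers encodings Python ships codecs for, and it does nothing about the harder part: finding out what the input charset actually is.

```python
def transcode(data: bytes, input_charset: str, output_charset: str = "utf-8") -> bytes:
    """Re-encode data from input_charset to output_charset (hypothetical helper).

    Raises LookupError for an unknown charset name, and a UnicodeError
    subclass when the bytes are not valid in input_charset or a character
    cannot be represented in output_charset.
    """
    text = data.decode(input_charset)    # bytes -> Unicode text
    return text.encode(output_charset)   # Unicode text -> bytes


# Bytes as they would arrive from an iso-8859-1 weblog (sample text).
excerpt = b"Eintr\xe4ge"
converted = transcode(excerpt, "iso-8859-1", "utf-8")
```

Letting the conversion raise on failure, instead of silently dropping characters, keeps the choice between rejecting the excerpt and holding it for moderation with the caller.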

Comments

[...] one of Anne's entries. I could already see my hopes slipping away… The entry by Anne that generated the pingback is about two things: My [...]
Posted by DenkZEIT :: Nahtoderfahrung at 7:52AM
[...] n 24, 2004 11:28 I read yesterday about Anne's trouble with Pingback and character encodings [...]
Posted by J. King log at 7:52AM
The problem with using utf-8 is that it messes up the rss-feed, i.e. the German umlauts are displayed incorrectly.
I would like to convert to utf-8, but as long as that breaks the rss-feed I'll stick with iso-8859-1. I'd change to us-ascii if this would help...
Posted by Steffen Glückselig at 7:49PM
If your German umlauts don't get displayed correctly when you use utf-8, that is because they are not encoded using utf-8. So you need to change the encoding of the text.
Posted by Peter Winnberg at 8:03PM
Wouldn't that be Wordpress' job? If I set encoding to utf-8 Wordpress should send 'my' umlauts as utf-8, shouldn't it? I mean, I should not have to enter unicode-entities to stay valid...
Or should I?
Posted by Steffen Glückselig at 8:10PM
You should. When you enter new data, it will be correct, but you need to convert your old data first.
Posted by Anne at 8:38PM
Is there an 'easy' way to convert my old posts? I wouldn't like to think about doing it by hand...
Posted by Steffen Glückselig at 9:20PM
Your three solutions could also be ranked as "least pragmatic to most pragmatic."
Supposedly there's a way to determine whether you are looking at UTF-8. I'm not sure how to tell what non-UTF-8 encodings might be, but this is a start.
Posted by Adam Rice at 9:41PM
Not a problem that WordPress will be able to solve, I'm afraid. There are three ways to translate encodings in PHP (besides utf8_encode, which only handles 8859-1 to UTF-8): the mbstring extension, the iconv extension, and the recode extension. All of them need to be compiled into PHP, and most hosts don't compile them in. And even then, with iconv, the best of the lot, you are dependent on the underlying operating system's iconv, with the supported encodings and even their names varying from one server to another. Work around that, and "all" you have to do is correctly determine the encoding of the page pinging you (see the Atom permathread).
Posted by Phil Ringnalda at 11:00PM
Sam Ruby ran into something similar a few weeks back.
Posted by Tim at 12:52AM
Is there an 'easy' way to convert my old posts? I wouldn't like to think about doing it by hand...

The only easy way I could think of is exporting SQL data, search and replace, importing SQL data.
Posted by Christoph Wagner at 7:52AM
What stands out as too obvious to even mention is that the pingback specification sucks. What should Atom's role be in all of this? Should Atom bring its own well-defined trackback or pingback specification, or should we try to evolve the existing APIs in the right direction so problems like this get fixed?
Posted by Asbjørn Ulsberg at 2:54PM
Pingback is actually fine. That WordPress likes to show some excerpt is optional, not part of the specification. Trackback does suck big time, however.
Atom doesn't have a role in this. The only thing we should make sure is that all Atom feeds MUST be encoded in utf-8.
Posted by Anne at 3:06PM
Anne, have you tried adding the <pingback /> string that is included in every pingback sent in the list of words that automatically throw a comment into moderation? Maybe that might work.
Posted by OF Jay at 10:04AM
Yeah, I added that, I just haven't had a new pingback yet. But of course, even that is suboptimal, since if my admin were real XHTML it would crash on invalid characters as well.
Posted by Anne at 12:05PM