Besides going to school and the supermarket and studying for a bit, my main project of the day was coding something in PHP that turns character references into UTF-8 encoded characters. Considering that XML parsers must treat character references correctly anyway, you might call it a useless project. Nevertheless, I tackled regular expressions for a bit, learned a lot from Henri Sivonen and used a function of his that he had rewritten from the Mozilla C++ implementation. I also fixed an ugly bug regarding the twenty-seven differences between iso-8859-1 and windows-1252. Yes, although I’m using UTF-8 for this weblog, it was still possible to enter some of these invalid characters as character references. Now they are converted to their safe equivalents and after that converted to UTF-8 characters. I think this might be a bug in the XML PHP parser regarding those characters. It should positively reject them right away. But then again, PHP is a messy language with ugly bugs. A correction is in order here: those characters are supposed to be control characters of no use, so PHP is treating them correctly as being valid.
Why am I still using PHP? Shoot me; I’m not a programmer and PHP was the first thing I learned after writing valid HTML and CSS. PHP is a language that lets me set up things quickly without asking too many questions, unless of course things get tricky, like today. A future project should probably be based on Python or the “hey, I set it up in less than 2 hours” Ruby on Rails. In combination with a PostgreSQL database to please Faruk and to not use MySQL like everyone else. Because you know; being the same isn’t always fun.
(That I’m probably contradicting myself in several ways in that last sentence is fortunately not the point; nor the topic of today’s conversation.)
To get to the point: I implemented Jacques Distler’s HTML+MathML entities replacement trick in PHP. On top of that, as that was the easy part, I use the function string2utf8, which basically makes two preg_replace_callback calls to get the job done. I made up the first and final argument and I really appreciate KtK’s input on the second. Especially using $s was something that hadn’t crossed my mind and isn’t really documented in an obvious way to me. Here is that function:
function string2utf8($string){
  $string = preg_replace_callback('/&#([0-9]+);/',create_function('$s','return dcr2utf8($s[1]);'),$string);
  return preg_replace_callback('/&#x([a-f0-9]+);/i',create_function('$s','return dcr2utf8(hexdec($s[1]));'),$string);
}
As you have undoubtedly noted, that function calls another function which accepts the decimal value of a character reference. (That is the exact reason for using hexdec in the second preg_replace_callback call.) The function that is called by preg_replace_callback was written by Henri Sivonen, as said before, who based it on Mozilla’s code base. (See: UTF-8 to Code Point Array Converter in PHP.) I slightly modified it and renamed it:
function dcr2utf8($src){
  $dest = '';
  if($src < 0){
    return false;
  }elseif($src <= 0x007f){
    $dest .= chr($src);
  }elseif($src <= 0x07ff){
    $dest .= chr(0xc0 | ($src >> 6));
    $dest .= chr(0x80 | ($src & 0x003f));
  }elseif($src == 0xFEFF){
    // nop -- zap the BOM
  }elseif ($src >= 0xD800 && $src <= 0xDFFF){
    // found a surrogate
    return false;
  }elseif($src <= 0xffff){
    $dest .= chr(0xe0 | ($src >> 12));
    $dest .= chr(0x80 | (($src >> 6) & 0x003f));
    $dest .= chr(0x80 | ($src & 0x003f));
  }elseif($src <= 0x10ffff){
    $dest .= chr(0xf0 | ($src >> 18));
    $dest .= chr(0x80 | (($src >> 12) & 0x3f));
    $dest .= chr(0x80 | (($src >> 6) & 0x3f));
    $dest .= chr(0x80 | ($src & 0x3f));
  }else{
    // out of range
    return false;
  }
  return $dest;
}
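For readers on a newer PHP: create_function was removed in PHP 8.0, so the two calls above can no longer run there as written. What follows is only a sketch of how the same idea might look with anonymous functions. The names string2utf8_closures and cp_to_utf8 are made up for this example, and cp_to_utf8 is a shortened stand-in for dcr2utf8 that returns an empty string where the original returns false (in the original that false also ends up as an empty replacement).

```php
<?php
// Hypothetical stand-in for dcr2utf8(), condensed so this sketch runs on its own.
function cp_to_utf8(int $cp): string
{
    // Negative, out of range, surrogate, or BOM: emit nothing.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp === 0xFEFF) {
        return '';
    }
    if ($cp <= 0x7F) {
        return chr($cp);
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical closure-based variant of string2utf8().
function string2utf8_closures(string $string): string
{
    // Decimal references such as &#8364;
    $string = (string) preg_replace_callback('/&#([0-9]+);/', function (array $m): string {
        return cp_to_utf8((int) $m[1]);
    }, $string);

    // Hexadecimal references such as &#x20AC;
    return (string) preg_replace_callback('/&#x([a-f0-9]+);/i', function (array $m): string {
        $value = hexdec($m[1]); // becomes a float for very long digit runs
        return $value > 0x10FFFF ? '' : cp_to_utf8((int) $value);
    }, $string);
}

// The euro sign, U+20AC, should come out as the three bytes E2 82 AC.
echo bin2hex(string2utf8_closures('&#x20AC;')), "\n";
```

The euro sign is a handy check on the bit arithmetic: 0x20AC >> 12 is 0x2, giving the lead byte 0xE2, and the two remaining six-bit groups 0x02 and 0x2C give 0x82 and 0xAC.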
I’m not going to post the function safe_cr here, which by the way stands for ‘safe character references’ and converts the twenty-seven differences to their safe equivalents and also makes it possible for you, the holy end user, to enter HTML and MathML entities (over twenty-one-hundred) into the comment system. I convert them and then I give it to the XML parser. I published safe_cr along with the other functions here: Character references to UTF-8 functions plus HTML and MathML entities converter. Have fun.
For those who want to know, dcr2utf8 stands for ‘decimal character reference to UTF-8’. Something just occurred to me: by implementing this, my comment section is partly conforming (I guess as conforming as possible) to the UTF-8+names proposal from Tim Bray. There is also some start at an official RFC. Tim, should you ever read this, why didn’t you push this further?
/&#x([a-zA-Z0-9]+);/ seems like a weird regular expression for hex values. Shouldn't it be /&#x([a-fA-F0-9]+);/?
Yeah, that might improve it just a little bit. Note though that invalid characters are rejected anyway later in the run, but you’re right. Paying attention to details didn’t cross my mind when writing this. Actually, it never crosses my mind when writing PHP.
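A quick way to see the difference between the two character classes, as a small sketch (the variable names are only for illustration):

```php
<?php
// The pattern that was questioned, and the tightened one with the /i flag.
$loose  = '/&#x([a-zA-Z0-9]+);/';
$strict = '/&#x([a-f0-9]+);/i';

// Two valid hex references and one that is not hex at all.
foreach (['&#x20AC;', '&#x20ac;', '&#xZZ;'] as $ref) {
    printf(
        "%-9s loose=%d strict=%d\n",
        $ref,
        preg_match($loose, $ref),
        preg_match($strict, $ref)
    );
}
```

Both patterns accept the two valid references, but only the loose one matches &#xZZ; and would pass "ZZ" on to hexdec.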
Bwuh, he uses my deprecated nickname :)
I thought about what Tommy said when I was lying in my bed, but that's kinda mustard-like.
My simple version, which doesn't take those 27 differences into account, can be found here btw.
Also, an example of using $s inside create_function() together with preg_replace_callback() is waiting for you on php.net. There's way too much (documented in an obvious way) on that site already, if you ask me :)
[...] being the same isn’t always fun.
;)
But, it's almost no difference between MySQL and PostgreSQL, isn't it? I can hardly remember the times I worked with it, flanked by a nice J2EE framework running on WebLogic...
But, it's almost no difference between MySQL and PostgreSQL, isn't it?
Check out this comparison.
Thank you, Krijn.
WARNING WARNING BULLSHIT ALERT!
Sorry, but this comparison that Krijn linked to is horrendously worthless. They have no idea how to do proper benchmark tests, as they have no idea how PostgreSQL really works (or how it is optimized). As a result, their results indicate that PostgreSQL would be much slower. However, using proper transactions (auto-commits are not the same!) will speed up PostgreSQL way beyond what their tests show, so the whole article on that page is futile.
Furthermore, the only things they really tested are some basic INSERTs and SELECTs. Yes, if you're gonna do nothing more than what MySQL was built on (a fast INSERT-and-SELECT SQL system), then you'll favor MySQL. When doing some actual RDBMS tests, one quickly encounters a problem: MySQL isn't a fully qualified RDBMS, as it doesn't natively support transactions (only on InnoDB tables). Okay, so that aside, let's just test features. Oh wait, no dice. Half of all the useful features that you'd want to test don't exist in MySQL. Planned for 5.0 or 5.1. It'll be 2 more years before those releases will be anywhere near as stable as PostgreSQL, and only God knows when they'll be as fast as PostgreSQL.
To sum up: ignore that silly comparison article. There's a world of difference between MySQL and PostgreSQL, and for anything more complicated than a simple card-catalog system (i.e. basic INSERT, UPDATE and SELECT stuff), PostgreSQL is a much better choice. Don't need anything whatsoever beyond simple queries? Go with MySQL. Want to have some database intelligence? Security and reliability? Useful features (functions, triggers, transactions, views, subqueries, procedural languages, constraints, etc.)? Go with PostgreSQL.
Also, /&#x([a-f0-9]+);/i would make even more sense. :)
I’m not going to post the function
safe_cr here
, which by the way stands for ‘safe character references’ and converts the twenty-seven differences to their safe equivalents and also makes it possible for you, the holy end user, to enter HTML and MathML entities (over twenty-one-hundred) into the comment system […]
Why not? It should be interesting to see… or has it already been posted elsewhere?
By the way, I think the <code> tag extended too far there (safe_cr here instead of safe_cr).
The main reason is that it is over twenty-one-hundred lines and about 67kB large. (Mainly due to the entities.) I guess I could post it in a separate file though. Check back later today.
The XML PHP parser operating correctly
Those 27 characters are not invalid Unicode characters. They are simply rarely used control characters. Given that you are unlikely to be wanting to display such control characters on your website, and given how frequently you will be encountering the win-1252 versions of this data, it makes sense to convert them to their true Unicode equivalents.
Sam, thanks. I added an erratum to that statement.
In related news, I also published the functions here.
When I type &Theta; into this comment form and click on Preview, it is converted to utf-8 and displayed as Θ. In the <textarea>, in which I can re-edit my comment, it is also converted to Θ. If that was a typo on my part, and what I really meant was &theta; (θ), I need to delete and retype, rather than changing a "T" to a "t" (or maybe I really meant ϑ).
Personally, I'd return the user what he typed, and convert to utf-8 on display (and, when the comment is finally POSTed, on storage to the database). One of the reasons for having named entities is that they are easier to enter/edit/remember than the corresponding Unicode codepoints.
Good point. I wanted to make it that way but I forgot to actually do it. I will modify it once I have the time.
@Faruk: I'm with you 100%. Too bad none of the hosting companies I'm working with right now allow Postgres. Stuck in the MySQL world again.
Haha, thank you, too, Faruk ;)
In related news, I also published the functions here.
Thank you. :-)
Your code should work in most cases. However, in SGML a character reference does not always have to end with a semi-colon (;). In some cases it is allowed, but recommended against, to remove the semi-colon:
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
This is supported in Firefox and to an even greater extent in Internet Explorer, where Iñtërnâtiônàlizætiøn written with named references that lack the semi-colon will still be decoded as Iñtërnâtiônàlizætiøn. If you want to decode named character references for security reasons, for example for filtering 'javascript:' URIs, you need to be just as liberal as IE, because otherwise you run the risk of not detecting every single 'javascript:' URI.
Recently I've created some similar functions that are just as liberal as IE. The library file can be downloaded from my website: http://www.rakaz.nl/projects/entity.phps
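To illustrate the point about the optional semi-colon, here is a very small sketch. It is not taken from the library linked above. The function name decode_decimal_refs_liberal is made up, it only handles decimal references, and it only decodes values in the ASCII range, which is enough to show how a strict pattern can miss an obfuscated 'javascript:' URI.

```php
<?php
// Hypothetical helper: decodes decimal character references whether or not
// they end in ';'. Only ASCII values are decoded to keep the sketch short;
// everything else is left exactly as it was typed.
function decode_decimal_refs_liberal(string $input): string
{
    return (string) preg_replace_callback('/&#([0-9]{1,7});?/', function (array $m): string {
        $cp = (int) $m[1];
        return $cp < 0x80 ? chr($cp) : $m[0];
    }, $input);
}

// "&#106" is "j" with the semi-colon left off.
$uri = '&#106avascript:alert(1)';

// The strict pattern from the post requires the ';' and finds nothing here.
var_dump(preg_match('/&#([0-9]+);/', $uri));

// The liberal version reveals what a lenient browser would see.
var_dump(decode_decimal_refs_liberal($uri));
```

Whether a filter needs to go this far depends on whether such input is rejected outright before it ever reaches a browser.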
Niels, thanks for your comment. Much appreciated. However, I’m not sure if I need to be as liberal as Internet Explorer though, as such character reference usage is just rejected and not allowed.
Jacques Distler, consider it fixed. If you find anything that doesn’t work please let me know.
One question regarding UTF-8 and printing the actual language on screen! How do you print ر السلام الاستثماري العراقي (which is some UTF-8 gibberish) in its native form (i.e. Arabic)? Sorry if this is the wrong place to ask.