When I turned some WordPress features off today, my weblog became partially ill-formed. Using some very basic PHP XML functions, I was able to find all the relevant posts quite quickly. I used the following two files:
<?php
// db.php is expected to open the database connection.
require_once("path/to/file/db.php");
header("content-type:text/plain;charset=utf-8");

$r = mysql_query("SELECT ID,post_content FROM wp_posts");
if (mysql_num_rows($r) > 0) {
    while ($arr = mysql_fetch_array($r)) {
        $id = $arr['ID'];
        $content = $arr['post_content'];
        // Wrap the post in a dummy root element; true marks the data as final.
        if (!xml_parse(xml_parser_create(), "<foo>".$content."</foo>", true)) {
            print "ouch: ".$id."\n";
        }
    }
}
?>
… and:
<?php
// db.php is expected to open the database connection.
require_once("path/to/file/db.php");
header("content-type:text/plain;charset=utf-8");

$r = mysql_query("SELECT comment_ID,comment_content FROM wp_comments");
if (mysql_num_rows($r) > 0) {
    while ($arr = mysql_fetch_array($r)) {
        $id = $arr['comment_ID'];
        $content = $arr['comment_content'];
        // Wrap the comment in a dummy root element; true marks the data as final.
        if (!xml_parse(xml_parser_create(), "<foo>".$content."</foo>", true)) {
            print "ouch: ".$id."\n";
        }
    }
}
?>
Using this I found all posts containing ill-formed XML, as well as posts and comments using HTML entities (which are not recognized by an XML parser) like &copy;. I fixed all of them to make everything future-proof.
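The check both scripts rely on can be pulled out into a small function, which makes it easy to try on individual strings without a database. This is only a sketch: the name `is_well_formed_fragment` is made up for illustration and does not appear in the scripts above. Going by what the scripts reported, content with named HTML entities such as `&copy;` should fail this check in the same way.

```php
<?php
// Hypothetical helper, not part of the original scripts: returns true when
// $content parses as well-formed XML once wrapped in a dummy root element.
function is_well_formed_fragment($content)
{
    $parser = xml_parser_create();
    // The third argument (true) marks this as the final chunk of data.
    $ok = xml_parse($parser, "<foo>" . $content . "</foo>", true);
    return (bool) $ok;
}

var_dump(is_well_formed_fragment("<p>fine</p>"));     // bool(true)
var_dump(is_well_formed_fragment("<p>never closed")); // bool(false)
```

The dummy `<foo>` root is what lets a post with several top-level elements, or with plain text only, be parsed as a single document.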
Eventually I’m planning to make it part of the sidebar, which would then state what percentage of my weblog is well-formed. (Also when I switch to HTML4 for the user interface, as this is mainly for the backend.)
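A rough sketch of how such a percentage could be computed follows. The function name `well_formed_percentage` and the sample data are invented for illustration; a real version would read the rows from `wp_posts` and `wp_comments` as the scripts above do.

```php
<?php
// Hypothetical helper: percentage of entries in $contents that are
// well-formed when wrapped in a dummy root element.
function well_formed_percentage(array $contents)
{
    if (count($contents) === 0) {
        return 100.0; // assumption: an empty weblog counts as fully well-formed
    }
    $good = 0;
    foreach ($contents as $content) {
        if (xml_parse(xml_parser_create(), "<foo>" . $content . "</foo>", true)) {
            $good++;
        }
    }
    return 100.0 * $good / count($contents);
}

// Made-up sample data standing in for rows from wp_posts.
$sample = array(
    1 => "<p>first post</p>",
    2 => "<p>second post",       // missing </p>
    3 => "plain text",
    4 => "<em>fourth</em> post",
);
printf("%.1f%% well-formed\n", well_formed_percentage($sample)); // 75.0% well-formed
```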
It was pointed out in private e-mail by Mark Jaquith that non-hacked versions of WordPress might need to have $content = apply_filters('the_content', $content); for post content and $content = apply_filters('comment_text', $content); for comments to make the XML check work properly. In my opinion, such additional steps should not be needed, as your database should be clean, but I think most users won’t care.
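One way to fit that suggestion into the check is sketched below. It assumes WordPress has been loaded so that `apply_filters` exists. The wrapper `filtered_for_check` is a made-up name, and it falls back to the raw content when WordPress is not available.

```php
<?php
// Hypothetical helper: run a WordPress filter over $content when WordPress is
// loaded, otherwise hand the raw database content back untouched.
function filtered_for_check($content, $filter)
{
    if (function_exists('apply_filters')) {
        return apply_filters($filter, $content);
    }
    return $content;
}

// Use 'the_content' for posts and 'comment_text' for comments.
$content = filtered_for_check("<p>raw post body</p>", 'the_content');
$ok = (bool) xml_parse(xml_parser_create(), "<foo>" . $content . "</foo>", true);
```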
There you mention something. I still need to delete some WP features myself...
xml_parse is PHP5, right?
~Grauw
Wrong: xml_parse.