Anne van Kesteren

Checking posts and comments for well-formedness

When I turned some WordPress features off today, my weblog became partially ill-formed. Using some very basic PHP XML functions I was able to find all the affected posts and comments quite quickly. I used the following two scripts:

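<?php
// Shared database connection setup.
require_once("path/to/file/db.php");
header("content-type:text/plain;charset=utf-8");

// Fetch every post; each body is checked on its own.
$r = mysql_query("SELECT ID,post_content FROM wp_posts");
if($r && mysql_num_rows($r)>0){
 while($arr = mysql_fetch_assoc($r)){
  $id = $arr['ID'];
  $content = $arr['post_content'];
  // Wrap the fragment in a root element so it parses as one document.
  $parser = xml_parser_create();
  // Passing true marks this as the final chunk, so errors at the end of input are reported too.
  if(!xml_parse($parser,"<foo>".$content."</foo>",true)){
   print "ouch: ".$id."\n";
  }
  xml_parser_free($parser);
 }
}
?>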

… and:

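<?php
// Shared database connection setup.
require_once("path/to/file/db.php");
header("content-type:text/plain;charset=utf-8");

// Fetch every comment; each body is checked on its own.
$r = mysql_query("SELECT comment_ID,comment_content FROM wp_comments");
if($r && mysql_num_rows($r)>0){
 while($arr = mysql_fetch_assoc($r)){
  $id = $arr['comment_ID'];
  $content = $arr['comment_content'];
  // Wrap the fragment in a root element so it parses as one document.
  $parser = xml_parser_create();
  // Passing true marks this as the final chunk, so errors at the end of input are reported too.
  if(!xml_parse($parser,"<foo>".$content."</foo>",true)){
   print "ouch: ".$id."\n";
  }
  xml_parser_free($parser);
 }
}
?>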

This listed every post and comment containing ill-formed markup, including those using HTML named entities such as &copy;. An XML parser does not recognize these without a DTD that declares them; only &amp;, &lt;, &gt;, &quot; and &apos; are predefined. I fixed all of them to make everything future-proof.
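To see the entity problem in isolation, here is a minimal sketch that runs the same wrap-and-parse check on a few hard-coded strings instead of database rows. The helper name is_well_formed_fragment and the sample strings are made up for illustration; they are not part of the scripts above.

```php
<?php
// Hypothetical helper (not from the original scripts): returns true when
// $fragment is well-formed once wrapped in a single root element.
function is_well_formed_fragment($fragment) {
    $parser = xml_parser_create();
    // true = this is the final chunk, so the parser reports all errors.
    return xml_parse($parser, "<foo>" . $fragment . "</foo>", true) === 1;
}

var_dump(is_well_formed_fragment("&copy; 2004")); // HTML named entity, not declared in XML
var_dump(is_well_formed_fragment("&#169; 2004")); // numeric character reference
var_dump(is_well_formed_fragment("<p>unclosed")); // <p> is never closed
```

With the XML parsers PHP normally ships with, the first and third calls should come back false and the second true, which is why replacing a named entity with its numeric character reference (or the literal character) is enough to fix such a post.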

Eventually I’m planning to make this part of the sidebar, which would then state what percentage of my weblog is well-formed. (That holds even if I switch to HTML4 for the user interface, as this check is mainly for the backend.)
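A rough sketch of how such a percentage could be computed, assuming the post and comment bodies have already been fetched into an array. Both function names and the sample data are hypothetical, and counting an empty weblog as 100% is just one possible choice.

```php
<?php
// Hypothetical helper: true when $fragment is well-formed inside one root element.
function fragment_is_well_formed($fragment) {
    $parser = xml_parser_create();
    return xml_parse($parser, "<foo>" . $fragment . "</foo>", true) === 1;
}

// Hypothetical helper: percentage of fragments that pass the check,
// rounded to one decimal place.
function well_formed_percentage($fragments) {
    if (count($fragments) === 0) {
        return 100.0; // assumption: nothing to check counts as fully well-formed
    }
    $good = count(array_filter($fragments, 'fragment_is_well_formed'));
    return round(100 * $good / count($fragments), 1);
}

// Sample data standing in for rows from wp_posts and wp_comments.
$sample = array("<p>Fine</p>", "<p>Broken", "&copy; 2004", "<em>Also fine</em>");
printf("%.1f%% well-formed\n", well_formed_percentage($sample));
```

For a sidebar, running this on every page view would reparse the whole database, so caching the number and recomputing it when a post or comment is saved is probably the more sensible design.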

Mark Jaquith pointed out in private e-mail that non-hacked versions of WordPress might need $content = apply_filters('the_content', $content); for post content and $content = apply_filters('comment_text', $content); for comments to make the XML check work properly. In my opinion such extra steps should not be needed, as your database ought to be clean, but I think most users won’t care.

Comments

  1. There you mention something. I still need to delete some WP features myself...

    Posted by Frenzie at

  2. xml_parse is PHP5, right?

    ~Grauw

    Posted by Laurens Holst at

  3. Wrong: xml_parse.

    Posted by Anne at