Parsing XML using PHP: SimpleXML
May 29th, 2007 at 2:07 pm (PHP Articles, Community News)
Introduction
“Simplicity of character is no hindrance to the subtlety of intellect.”
- John Morley
When people ask me “What is SimpleXML?” I often quip, “XML is the solution to all your problems; SimpleXML ensures it isn’t the root of your problems!”
Those of you who have parsed XML with PHP4, or are currently dealing with XML parsing in PHP4, know that it can indeed be very painful to handle documents with any degree of complexity. You either need to use the SAX approach and hand-write a parser for every document, or you need to use the DOM extension, which (in addition to its tendency to crash, leak and generally misbehave under heavy usage) involves the pain of processing documents using an API designed for a heavily object-oriented language and targeted at supporting every single one of XML’s idiosyncrasies.
Consider the following small XML snippet, which describes a small collection of books. The document has a root node of library, with a direct child of shelf, which classifies the books as fiction. The shelf displayed has two children labelled book: “Of Mice and Men” by John Steinbeck and “Harry Potter and the Philosopher’s Stone” by J.K. Rowling.
The document itself is simple enough: you can see the structure very clearly, and you can understand the path you need to follow to access that information.
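A document matching this description might look like the sketch below. The id value "fiction" and the exact layout are assumptions inferred from the description and from the parsing code in this article; the variable name $libraryXml is purely illustrative.

```php
<?php
// Hypothetical reconstruction of library.xml. The id value "fiction" and the
// layout are assumptions based on the description in the text.
$libraryXml = <<<XML
<library>
    <shelf id="fiction">
        <book>
            <title>Of Mice and Men</title>
            <author>John Steinbeck</author>
        </book>
        <book>
            <title>Harry Potter and the Philosopher's Stone</title>
            <author>J.K. Rowling</author>
        </book>
    </shelf>
</library>
XML;

// Parse the string to confirm it is well-formed; false signals a parse error.
$library = simplexml_load_string($libraryXml);
if ($library === false) {
    exit("The library.xml sketch is not well-formed\n");
}

printf("%d shelf, %d books\n", count($library->shelf), count($library->shelf->book));
```

Saving the string with file_put_contents('library.xml', $libraryXml) would give the examples in this article a file to load.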
Now, before we get into why SimpleXML will change your life, let’s first look at how one would parse this document using DOM:
<?php
$doc = new DOMDocument();
$doc->load('library.xml');

$library = $doc->documentElement;
$shelves = $library->childNodes;

foreach ($shelves as $shelf) {
    if ($shelf instanceof DOMElement) {
        process_shelf($shelf);
    }
}

function process_shelf($shelf)
{
    printf("Shelf %s\n", $shelf->getAttribute('id'));

    $books = $shelf->childNodes;
    foreach ($books as $book) {
        if ($book instanceof DOMElement) {
            process_book($book);
        }
    }
}

function process_book($book)
{
    foreach ($book->childNodes as $child) {
        if (! ($child instanceof DOMElement)) {
            continue;
        }

        foreach ($child->childNodes as $element) {
            $content = trim($element->nodeValue);

            switch ($child->tagName) {
                case 'title':
                    printf("Title: %s\n", $content);
                    break;
                case 'author':
                    printf("Author: %s\n", $content);
                    break;
            }
        }
    }
}
?>
As you can see, it takes 47 lines of well-crafted PHP code, with no error checking, to walk the XML file and print out a list of its books. With error checking, comments and the other things you would add in the real world, it could easily take 70-80 lines of code to parse this straightforward, simple XML document.
Contrast the example above with the following piece of code that uses the SimpleXML extension to access the same document, and print out the exact same information.
<?php
$library = simplexml_load_file('library.xml');
foreach ($library->shelf as $shelf) {
    printf("Shelf %s\n", $shelf['id']);
    foreach ($shelf->book as $book) {
        printf("Title: %s\n", $book->title);
        printf("Author: %s\n", $book->author);
    }
}
?>
With SimpleXML, element names are automatically mapped to properties on an object, and this happens recursively. Attributes are mapped to array accesses. All of this happens “on-demand,” using Zend Engine 2’s new object overloading features. SimpleXML’s “low-fat” approach to XML parsing reduced the code size of this example from 47 lines of code to a mere 10. Furthermore, the code is considerably more readable: instead of using statements like foreach($child->childNodes as $element) to reach the contents of a child element, you simply reference it by name.
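A few details of this mapping are easy to trip over, so here is a minimal sketch. It assumes a hypothetical inline document standing in for library.xml. Properties and attributes come back as SimpleXMLElement objects rather than plain strings, so they need a cast before a strict comparison, and a repeated element can be indexed like an array.

```php
<?php
// Hypothetical stand-in for library.xml, inlined so the sketch is self-contained.
$xml = <<<XML
<library>
    <shelf id="fiction">
        <book><title>Of Mice and Men</title><author>John Steinbeck</author></book>
        <book><title>Harry Potter and the Philosopher's Stone</title><author>J.K. Rowling</author></book>
    </shelf>
</library>
XML;

$library = simplexml_load_string($xml);
$shelf = $library->shelf;

// Attributes use array syntax; cast to string before comparing.
$shelfId = (string) $shelf['id'];

// A repeated element can be indexed numerically, starting at 0.
$secondTitle = (string) $shelf->book[1]->title;

// isset() reports whether a child element with that name exists.
$hasIsbn = isset($shelf->book[0]->isbn);

printf("%s / %s / %s\n", $shelfId, $secondTitle, $hasIsbn ? 'has isbn' : 'no isbn');
```

printf() with %s performs the string conversion implicitly, which is why the listing above works without casts.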
Advanced Simplicity
In a perfect world all XML documents, and the information you needed to extract from them, would be as basic as the example given above. In fact this is true in many cases: configuration files, basic data export, and basic serialization all require parsing capabilities no greater than the above example. There are, however, some cases where the basic functionality listed above simply isn’t suitable.
Namespaces
One issue that SimpleXML encountered was XML namespaces. XML documents allow you to hide tags away into a labelled section called a namespace. SimpleXML originally solved namespaces by simply adding another level of indirection:
To print out the names of all the different blog entries you could write the following code:
<?php
$entries = simplexml_load_file('syndic.xml');
foreach ($entries->blog->entry as $entry) {
    printf("%s\n", $entry->name);
}
?>
This approach, however, proved to be too naive: while it was fine for parsing a particular document, it was no good at all for any type of generalized processing. One thing to note about XML namespaces is that the prefix (here, blog) is just a simple alias with no particular relevance. The significant portion of a namespace is the URI (http://www.edwardbear.org/serendipity/), which is what people who parse XML documents should rely upon.
Therefore, the approach SimpleXML takes to supporting multiple namespaces is not to add any changes to the way you access properties, but rather to give you two methods: attributes() and children(). The children() method returns all the children of an XML node in a given namespace, and attributes() does the same for the node’s attributes. If no namespace is passed to children(), the elements in the global namespace are returned.
The example given above is properly parsed with the following bit of code:
<?php
$entries = simplexml_load_file('syndic.xml');
foreach ($entries->children('http://www.edwardbear.org/serendipity/') as $entry) {
    printf("%s\n", $entry->name);
}
?>
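Note: You may also pass the namespace prefix to the children() or attributes() method (in current PHP versions this means passing true as the second argument), but this is not recommended, because the prefix is only an alias and may differ from one document to the next.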
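To make the two methods concrete, here is a minimal sketch against a hypothetical feed. The document layout, the blog:id attribute and the entry names are assumptions made up for illustration, not the contents of syndic.xml. The sketch asks for the namespace explicitly at each level, which keeps every lookup unambiguous.

```php
<?php
// Hypothetical feed; only the namespace URI is taken from the article.
$xml = <<<XML
<feed xmlns:blog="http://www.edwardbear.org/serendipity/">
    <blog:entry blog:id="1"><blog:name>First post</blog:name></blog:entry>
    <blog:entry blog:id="2"><blog:name>Second post</blog:name></blog:entry>
</feed>
XML;

$ns = 'http://www.edwardbear.org/serendipity/';
$feed = simplexml_load_string($xml);

$names = [];
$ids = [];
foreach ($feed->children($ns) as $entry) {
    // Child elements that live in the namespace.
    $names[] = (string) $entry->children($ns)->name;
    // Attributes that live in the namespace.
    $ids[] = (string) $entry->attributes($ns)->id;
}

// Prefix-based lookup: the second argument marks 'blog' as a prefix, not a URI.
$byPrefix = count($feed->children('blog', true));

printf("%s (%d entries by prefix)\n", implode(', ', $names), $byPrefix);
```

The URI-based calls keep working if the feed author renames the prefix; the prefix-based call does not.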
Searching, Splitting, Recursing
SimpleXML also fell short of the needs of people developing XML applications in another way: while it provided a nice way to algorithmically process a document, it didn’t provide any features for performing common searches and accesses. For example, how does one access all descendants of a given node? How can you search a document, and find a tag and a value that both match a given condition? There are many common operations on XML documents that are a pain to write by hand, and desperately need simplification.
As a solution to this problem, SimpleXML doesn’t re-invent the wheel, but instead provides the xpath() method, which allows you to perform W3C standard XPath queries on an XML document. A problem like getting all descendants of a given node turns into a highly optimized XPath query such as descendant::*, run from that node. While the full scope of XPath is well outside the scope of this document, anyone serious about processing XML should learn the XPath language, which is as important to XML as regular expressions are to plain text.
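As a rough illustration of both questions raised above, the sketch below runs two queries against a hypothetical inline copy of the library document (the layout is an assumption). xpath() returns an array of SimpleXMLElement objects, and a query run on a child node is evaluated relative to that node.

```php
<?php
// Hypothetical stand-in for library.xml, inlined so the sketch is self-contained.
$xml = <<<XML
<library>
    <shelf id="fiction">
        <book><title>Of Mice and Men</title><author>John Steinbeck</author></book>
        <book><title>Harry Potter and the Philosopher's Stone</title><author>J.K. Rowling</author></book>
    </shelf>
</library>
XML;

$library = simplexml_load_string($xml);

// Every element below the first shelf, at any depth.
$descendants = $library->shelf[0]->xpath('descendant::*');

// A tag and a value that both match a condition: titles of Steinbeck's books.
$titles = $library->xpath('//book[author="John Steinbeck"]/title');

printf("%d descendants\n", count($descendants));
foreach ($titles as $title) {
    printf("Title: %s\n", $title);
}
```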
Edge Conditions
While SimpleXML is a great tool for processing XML, its simplicity does come with a few drawbacks. Most notable among these is that processing mixed XML and text content with SimpleXML is very hard. For example, consider the following XML:
This
SimpleXML
Accessing $document->blurb with print_r() or var_dump() would return an element iterator that contained the contents of italic, bold, and underline. It would not, however, return the text surrounding those elements. This is because when given the choice between mixed elements and contents, SimpleXML will always choose to return the elements, and ignore the contents, of a particular tag.
SimpleXML has two solutions to this problem built into the library. Firstly, a method called asXML() is provided, which will take the given node and serialize its contents, as well as the contents of all its children, to either a file or a string. With the example above, you would call $document->blurb->asXML() and it would return the full contents of the blurb node in a format suitable for printing or further processing.
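A minimal sketch of the asXML() route follows. The mixed-content document here is hypothetical (the element names match those mentioned above, the wording is made up), and the strip_tags() step is just one possible way to get at the plain text afterwards.

```php
<?php
// Hypothetical mixed-content document: text interleaved with child elements.
$document = simplexml_load_string(
    '<document><blurb>This is <italic>mixed</italic> content in '
    . '<bold>SimpleXML</bold>.</blurb></document>'
);

// Serialize the blurb node, including its text and its child elements.
$markup = $document->blurb->asXML();

// One way to recover the plain text: drop the tags from the serialized markup.
$text = trim(strip_tags($markup));

printf("%s\n%s\n", $markup, $text);
```

Passing a filename, as in asXML('blurb.xml'), writes the markup to that file instead of returning it.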
The second solution is to bypass SimpleXML for certain portions of your document. One of the explicit design goals of PHP5’s XML support was to allow all extensions to interoperate at a minimal cost. Since libxml2 is the lingua franca of all XML extensions, DOM and SimpleXML objects can be exchanged with zero copies. It’s just a different way of viewing the same underlying object! By this method, the DOM extension can “import” SimpleXML objects and use them as DOM objects, and vice versa. When you need to use a DOM feature you can, and when you need SimpleXML’s ease of processing, you have that too.
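The exchange is done with two functions, dom_import_simplexml() and simplexml_import_dom(). The sketch below uses a hypothetical mixed-content document and an illustrative lang attribute. It reads the full text through DOM, then shows that a change made on the DOM side is visible through SimpleXML, since both objects are views of the same node.

```php
<?php
// Hypothetical mixed-content document, made up for illustration.
$document = simplexml_load_string(
    '<document><blurb>This is <italic>mixed</italic> content.</blurb></document>'
);

// View the blurb node through DOM; no copy of the node is made.
$blurbNode = dom_import_simplexml($document->blurb);

// DOM's textContent includes the text surrounding the child elements.
$fullText = $blurbNode->textContent;

// A change made through DOM shows up on the SimpleXML side.
$blurbNode->setAttribute('lang', 'en');
$lang = (string) $document->blurb['lang'];

// And back again: wrap the DOM node as a SimpleXML element.
$again = simplexml_import_dom($blurbNode);
$name = $again->getName();

printf("%s [%s] <%s>\n", $fullText, $lang, $name);
```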
Summary
PHP5’s new XML support was designed as a coherent set of APIs to process and manipulate XML. This includes the DOM extension, which provides all you’ll ever need for handling XML, the SAX API for streaming XML parsing, XSLT for XML transformations and SimpleXML when you need to do anything else.
Be fruitful and multiply!
About the Author
Sterling Hughes is a PHP core developer and the chief instigator of the SimpleXML extension for PHP 5. His earlier contributions include the ADT, cURL, XSLT, and Mono extensions. He works as a freelance Web developer, creating dynamic Web applications in PHP, C and Perl, and is also the co-author of PHP Developer’s Cookbook.
Sterling can be contacted at sterling@apache.org.









