Python and XML: An Introduction
<h1>http://www.boddie.org.uk/python/XML_intro.html</h1>
<h2><a name="Abstract" id="Abstract">Abstract</a></h2>
<p>The <a href="http://www.python.org/">Python</a> programming language
provides an increasing amount of support for XML technologies. This document
attempts to introduce some basic XML processing concepts to readers who have
not yet started to use Python with XML, and it takes the form of a tutorial.
It is assumed that the reader knows the basic terminology of XML and is
comfortable with "XML as text".</p>
<h2><a name="Prerequisites" id="Prerequisites">Prerequisites</a></h2>
<h3>Python 2.0</h3>
<p>I believe that Python release 2.0 or greater is most appropriate for XML
processing. This is partly due to the introduction of Unicode support in that
release: for a number of readers, some of the XML documents encountered will
use character sets which are not easily supported by traditional Python
character strings. I cannot really imagine how Python 1.5.2, for example, is
able to handle "non-Western-European" characters. Given the length of time
since the introduction of Python 2.0, and the familiarity the Python
community has with the features added in the releases since then, Python 2.0
is a safe and reasonable requirement.</p>
<h3>PyXML 0.6.6</h3>
<p>Some releases of Python come with a fair amount of built-in XML support,
and the extent of this support can be tested by starting Python interactively
and testing for the presence of the <code>minidom</code> module:</p>
<pre>import xml.dom.minidom</pre>
<p>Should this module be imported without complaints from the interpreter, it
would appear that your Python version is fairly recent and probably good
enough for the purposes of this tutorial. Otherwise, you should download the
<a href="http://pyxml.sf.net/">PyXML</a> package, choosing release 0.6.6 or
greater.</p>
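<p>The same check can be written as a small script rather than typed at the
prompt. This is only a sketch, and it assumes a current Python 3 interpreter,
where <code>minidom</code> is part of the standard library; the variable name
<code>have_minidom</code> is made up for the illustration.</p>

```python
# Try the import and record whether it succeeded, instead of letting the
# interpreter stop with a traceback.
try:
    import xml.dom.minidom
    have_minidom = True
except ImportError:
    have_minidom = False

print("minidom available:", have_minidom)
```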
<h2><a name="Activities" id="Activities">Activities</a></h2>
<p>All the activities in this tutorial require the import of the
<code>minidom</code> module. Therefore, for all of the program fragments
featured, this module must have been made available through an import
statement similar to the following:</p>
<pre>import xml.dom.minidom</pre>
<p>We could import just the classes we need from the module, but we can leave
our options open at this point. Entering the following statement gives us an
idea of the contents of the module:</p>
<pre>dir(xml.dom.minidom)</pre>
<h3>Namespaces</h3>
<p>Before we start to experiment, here is a note about namespaces. It is very
tempting not to use namespaces when starting out with XML documents and XML
processing, but namespaces provide an interesting way of associating XML
elements with certain meanings, applications and "domains". Rather than cause
confusion later by introducing namespaces after the basic operations have
been presented, I introduce them from the start, since I believe that they
are easy enough to use.</p>
<h3><span class="Submodule">Creating</span> an XML Document</h3>
<p>First, import the <code>minidom</code> module:</p>
<pre class="Python">import xml.dom.minidom</pre>
<p>To create a new XML document, just instantiate a new <code>Document</code>
object:</p>
<pre class="Python">def get_a_document():<br> doc = xml.dom.minidom.Document()</pre>
<p>This document is not really interesting without some contents, however, so
we should add something to it.</p>
<h4>Elements</h4>
<p>XML documents have a single "root element" inside which all other elements
and pieces of text are placed. We could create an XML document which
describes departments in a company with a "root element" called "business"
within the namespace "http://www.boddie.org.uk/paul/business" - we want to
communicate the fact that the "business" element means something special to
us, and that the use of our namespace indicates that it is our special
"business" element rather than any old home-made "business" element; the
following statement creates such an element:</p>
<pre class="Python"> business_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "business")</pre>
<p>At this point, the element exists but has not been placed in the document;
we need to add it to the document at the "root level" as follows:</p>
<pre class="Python"> doc.appendChild(business_element)</pre>
<p>Let us return the created objects at the end of this function:</p>
<pre class="Python"> return doc, business_element</pre>
<p>Now, the "root element" has been added - we can investigate this by
querying the document about the elements within it. At the Python
prompt...</p>
<pre class="Prompt">>>> import xhtmlhook # to read this document<br>>>> from XML_intro.Creating import *<br>>>> doc, business_element = get_a_document()</pre>
<pre class="PromptRequest">>>> doc.childNodes</pre>
<pre class="PromptResponse">[<DOM Element: business at 136860108>]</pre>
<p>This shows a list of elements with only one element present within it. We
can, of course, examine the list and the element more closely:</p>
<pre class="PromptRequest">>>> doc.childNodes[0].namespaceURI</pre>
<pre class="PromptResponse">'http://www.boddie.org.uk/paul/business'</pre>
<p>This was the namespace that we gave our element.</p>
<pre class="PromptRequest">>>> doc.childNodes[0].localName</pre>
<pre class="PromptResponse">'business'</pre>
<p>This was the element name that we used.</p>
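<p>For readers who prefer to see the steps so far in one piece, here is a
self-contained sketch. It assumes a current Python 3 interpreter; the names
<code>BUSINESS_NS</code> and <code>make_business_document</code> are invented
for this illustration and are not part of the tutorial's own modules. It also
uses <code>documentElement</code>, the standard DOM shortcut for the root
element, alongside <code>childNodes[0]</code>.</p>

```python
import xml.dom.minidom

# The namespace URI used throughout this tutorial.
BUSINESS_NS = "http://www.boddie.org.uk/paul/business"


def make_business_document():
    """Create a document whose root element is a namespaced "business"."""
    doc = xml.dom.minidom.Document()
    root = doc.createElementNS(BUSINESS_NS, "business")
    doc.appendChild(root)
    return doc, root


doc, business_element = make_business_document()

# The root element can be reached by index or through documentElement.
root = doc.documentElement
print(root is doc.childNodes[0])
print(root.namespaceURI)
print(root.localName)
```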
<p>We can add elements within this one. For example, a "location" element
might be interesting to describe a particular location of part of a company,
and such an element could be created in a similar way to that used above:</p>
<pre class="Python">def add_a_location(doc, business_element):<br> location_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "location")</pre>
<p>We add the element as a child node of the old element in the same fashion
as before:</p>
<pre class="Python"> business_element.appendChild(location_element)</pre>
<p>Let us return the created object at the end of this function:</p>
<pre class="Python"> return location_element</pre>
<p>At this point, it is possible to navigate down from the "root" of the
document to the newly added element:</p>
<pre class="Prompt">>>> location_element = add_a_location(doc, business_element)</pre>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes</pre>
<pre class="PromptResponse">[<DOM Element: location at 136781996>]</pre>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes[0]</pre>
<pre class="PromptResponse"><DOM Element: location at 136781996></pre>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes[0].namespaceURI</pre>
<pre class="PromptResponse">'http://www.boddie.org.uk/paul/business'</pre>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes[0].localName</pre>
<pre class="PromptResponse">'location'</pre>
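<p>Indexing into <code>childNodes</code> is not the only way to move around.
The following sketch (assuming a current Python 3 interpreter; the variable
names are chosen for the illustration) builds the same two elements and
compares index-based navigation with the DOM shortcuts
<code>firstChild</code> and <code>parentNode</code>.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
business = doc.createElementNS(BUSINESS_NS, "business")
doc.appendChild(business)
location = doc.createElementNS(BUSINESS_NS, "location")
business.appendChild(location)

# Walking down by index, as in the session above...
by_index = doc.childNodes[0].childNodes[0]
# ...and walking down with the firstChild shortcut.
by_shortcut = doc.documentElement.firstChild

print(by_index is location and by_shortcut is location)

# parentNode leads back up the tree.
print(location.parentNode is business)
```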
<h4>Text</h4>
<p>Text is a central feature of XML documents - inside elements, blocks of
text may be stored and retrieved. We might have, in our example, an element
which is called "surroundings", and this could be found within the "location"
element as a means of describing the surroundings of a particular company
location. Inside the "surroundings" element, there could be a block of text
which forms such a description.</p>
<p>Here, we create the new "surroundings" element and add it within the
"location" element:</p>
<pre class="Python">def add_surroundings(doc, location_element):<br> surroundings_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "surroundings")<br> location_element.appendChild(surroundings_element)</pre>
<p>And we specify the descriptive text within this new element by creating a
new text node:</p>
<pre class="Python"> description = doc.createTextNode("A quiet, scenic park with lots of wildlife.")</pre>
<p>Of course, we need to add this to the document, and since the text is to
be included within the "surroundings" element, it makes sense to add it to
that element as a child node:</p>
<pre class="Python"> surroundings_element.appendChild(description)</pre>
<p>Let us return the created object from this function:</p>
<pre class="Python"> return surroundings_element</pre>
<p>We may now find our way from the document "root", if we want to:</p>
<pre class="Prompt">>>> surroundings_element = add_surroundings(doc, location_element)</pre>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0]</pre>
<pre class="PromptResponse"><DOM Text node "A quiet, s..."></pre>
<p>It is possible to see the entire contents of the text node using the
<code>nodeValue</code> attribute of the node:</p>
<pre class="PromptRequest">>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0].nodeValue</pre>
<pre class="PromptResponse">'A quiet, scenic park with lots of wildlife.'</pre>
<p>We can, as with elements (as we shall see in a moment), add many text
nodes within an element:</p>
<pre class="Python">def add_more_surroundings(doc, surroundings_element):<br> description = doc.createTextNode(" It's usually sunny here, too.")<br> surroundings_element.appendChild(description)</pre>
<p>Here is the "proof" that this worked:</p>
<pre class="Prompt">>>> add_more_surroundings(doc, surroundings_element)</pre>
<pre class="PromptRequest">>>> surroundings_element.childNodes</pre>
<pre class="PromptResponse">[<DOM Text node "A quiet, s...">, <DOM Text node " It's usual...">]</pre>
<p>We might want all our text within one node in future, however.
Fortunately, a method exists to collect text nodes together (note the "-ize"
spelling):</p>
<pre class="Python">def fix_element(element):<br> element.normalize()</pre>
<p>The results can be investigated, too:</p>
<pre class="Prompt">>>> fix_element(surroundings_element)</pre>
<pre class="PromptRequest">>>> surroundings_element.childNodes[0].nodeValue</pre>
<pre class="PromptResponse">"A quiet, scenic park with lots of wildlife. It's usually sunny here, too."</pre>
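<p>Here is the text-node behaviour as one runnable sketch, assuming a current
Python 3 interpreter. The helper <code>text_of</code> is a made-up name: it
shows one way of reading all the text of an element without modifying the
document, as an alternative to calling <code>normalize</code> first.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"


def text_of(element):
    """Join the values of an element's direct text node children."""
    return "".join(
        node.nodeValue
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


doc = xml.dom.minidom.Document()
surroundings = doc.createElementNS(BUSINESS_NS, "surroundings")
doc.appendChild(surroundings)
surroundings.appendChild(
    doc.createTextNode("A quiet, scenic park with lots of wildlife."))
surroundings.appendChild(doc.createTextNode(" It's usually sunny here, too."))

count_before = len(surroundings.childNodes)
text_before = text_of(surroundings)

surroundings.normalize()  # merge adjacent text nodes into one
count_after = len(surroundings.childNodes)

print(count_before, count_after)
print(surroundings.firstChild.nodeValue == text_before)
```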
<h4>Attributes</h4>
<p>Elements in XML documents may have attributes attached to them. For
example, the "location" element could have another element within it
(alongside the "surroundings" element) entitled "building", and this new
element could have an attribute called "name":</p>
<pre class="Python">def add_building(doc, location_element):<br> building_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "building")</pre>
<p>After adding this element...</p>
<pre class="Python"> location_element.appendChild(building_element)<br> return building_element</pre>
<p>...note that the new element appears after the "surroundings" element as
a child element (or node) of "location". This can be seen at the Python
prompt as follows:</p>
<pre class="Prompt">>>> building_element = add_building(doc, location_element)</pre>
<pre class="PromptRequest">>>> location_element.childNodes</pre>
<pre class="PromptResponse">[<DOM Element: surroundings at 136727844>, <DOM Element: building at 136286548>]</pre>
<p>Now, we may add an attribute directly to the new element like this:</p>
<pre class="Python">def name_building(building_element):<br> building_element.setAttributeNS("http://www.boddie.org.uk/paul/business", "business:name", "Ivory Tower")</pre>
<p>After the namespace and the qualified attribute name, the value is
specified. This attribute does not need to be explicitly added to the
element, although we could have used other means of creating and adding it.
This can be tested as follows:</p>
<pre class="Prompt">>>> name_building(building_element)</pre>
<pre class="PromptRequest">>>> building_element.getAttributeNS("http://www.boddie.org.uk/paul/business", "name")</pre>
<pre class="PromptResponse">'Ivory Tower'</pre>
<p>One important thing to note is the use of the "qualified name" when
setting the attribute and the "local name" when getting the attribute
value.</p>
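<p>The following sketch puts setting and getting side by side, assuming a
current Python 3 interpreter. The "height" attribute is hypothetical and is
never set; it is only there to show what a lookup of a missing attribute
returns in <code>minidom</code>.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
building = doc.createElementNS(BUSINESS_NS, "building")
doc.appendChild(building)

# Set with the qualified name (prefix, colon, local name)...
building.setAttributeNS(BUSINESS_NS, "business:name", "Ivory Tower")

# ...and look up with the namespace plus the local name.
name = building.getAttributeNS(BUSINESS_NS, "name")
print(name)
print(building.hasAttributeNS(BUSINESS_NS, "name"))

# A missing attribute gives an empty string rather than raising an error.
missing = building.getAttributeNS(BUSINESS_NS, "height")
print(repr(missing))
```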
<h4>Attribute Namespaces and Prefixes</h4>
<p>One might expect that by setting an attribute with a particular namespace,
there would not be any need to explicitly state a prefix to appear before the
local name in the final, written XML document - after all, it should be
possible to detect that a namespace has been employed and that a prefix is
required in the full attribute name so that the attribute is recognised as
being associated with that namespace. Unfortunately, we do not seem to have
that luxury and must explicitly specify a prefix as part of the qualified
name when setting such an attribute; in the above example, the qualified name
consists of the prefix "business" and local name "name".</p>
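<p>The effect of the prefix can be inspected on the attribute node itself.
This sketch assumes a current Python 3 interpreter and the standard library
<code>minidom</code>; the second attribute, "floors", is a hypothetical one
added only to show what happens when the prefix is left out of the qualified
name.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
building = doc.createElementNS(BUSINESS_NS, "building")
doc.appendChild(building)

# Qualified name with a prefix, as in the tutorial.
building.setAttributeNS(BUSINESS_NS, "business:name", "Ivory Tower")
with_prefix = building.getAttributeNodeNS(BUSINESS_NS, "name")
print(with_prefix.name, with_prefix.prefix, with_prefix.localName)

# Hypothetical attribute set without a prefix: minidom does not invent one,
# so the written form carries no visible link to the namespace.
building.setAttributeNS(BUSINESS_NS, "floors", "12")
without_prefix = building.getAttributeNodeNS(BUSINESS_NS, "floors")
print(without_prefix.name, without_prefix.prefix)

serialised = building.toxml()
print(serialised)
```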
<h3><span class="Submodule">Writing</span> an XML Document</h3>
<p>All the above effort should not be wasted, and so we will attempt to write
the document out. One of the easiest ways of doing this, whilst respecting
the namespaces that we have carefully included, is to use another module
(along with the <code>minidom</code> module):</p>
<pre class="Python">import xml.dom.ext<br>import xml.dom.minidom</pre>
<p>Inside this module are a number of useful functions and classes. However,
we shall use the <code>PrettyPrint</code> function to write the document to
"standard output", i.e. the screen:</p>
<pre class="Python">def write_to_screen(doc):<br> xml.dom.ext.PrettyPrint(doc)</pre>
<p>We can then try this out:</p>
<pre class="Prompt">>>> from XML_intro.Writing import *<br>>>> write_to_screen(doc)</pre>
<pre class="PromptResponse"><?xml version='1.0' encoding='UTF-8'?><br><business xmlns='http://www.boddie.org.uk/paul/business' xmlns:business='http://www.boddie.org.uk/paul/business'><br>  <location><br>    <surroundings>A quiet, scenic park with lots of wildlife.</surroundings><br>    <building business:name='Ivory Tower'/><br>  </location><br></business></pre>
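<p>The <code>xml.dom.ext</code> module comes from the PyXML package and is
not part of the standard library. Where it is not installed, something
similar can be sketched with <code>minidom</code> alone, assuming a current
Python 3 interpreter. The important difference is that
<code>toprettyxml</code> only writes what is in the tree and does not add
namespace declarations for us, so in this sketch they are set by hand as
ordinary attributes on the root element. This is a substitute technique, not
what <code>PrettyPrint</code> does internally.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

doc = xml.dom.minidom.Document()
business = doc.createElementNS(BUSINESS_NS, "business")
doc.appendChild(business)
building = doc.createElementNS(BUSINESS_NS, "building")
building.setAttributeNS(BUSINESS_NS, "business:name", "Ivory Tower")
business.appendChild(building)

# Declare the default namespace and the "business" prefix explicitly,
# because minidom will not generate these declarations itself.
business.setAttribute("xmlns", BUSINESS_NS)
business.setAttribute("xmlns:business", BUSINESS_NS)

text = doc.toprettyxml(indent="  ")
print(text)

# Reading the text back shows that the namespace survived the round trip.
reparsed = xml.dom.minidom.parseString(text)
print(reparsed.documentElement.namespaceURI)
```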
<p>We could have used another, simpler printing function or class, but we are
usually so accustomed to (or spoilt by) nicely formatted textual XML
documents that anything less than <code>PrettyPrint</code> probably would not
do! Especially since <code>PrettyPrint</code> allows us to write the document
to a file:</p>
<pre class="Python">def write_to_file(doc, name="/tmp/doc.xml"):<br> file_object = open(name, "w")<br> xml.dom.ext.PrettyPrint(doc, file_object)<br> file_object.close()</pre>
<p>Or even easier:</p>
<pre class="Python">def write_to_file_easier(doc, name="/tmp/doc.xml"):<br> xml.dom.ext.PrettyPrint(doc, open(name, "w"))</pre>
<p><strong>NOTE:</strong> Should we not use "wb" for portability reasons?</p>
<pre class="Prompt">>>> write_to_file(doc)</pre>
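<p>A variant which closes the file reliably, and which does not depend on
PyXML, might look like the following sketch. It assumes a current Python 3
interpreter, where opening the file in text mode with an explicit encoding is
one way of dealing with the portability question raised in the note above.
The function name <code>write_document</code> and the small two-element
document are invented for the illustration, and a temporary directory is used
instead of a fixed path.</p>

```python
import os
import tempfile
import xml.dom.minidom


def write_document(doc, path):
    """Write doc to path as indented UTF-8 XML, always closing the file."""
    with open(path, "w", encoding="utf-8") as file_object:
        doc.writexml(file_object, addindent="  ", newl="\n", encoding="utf-8")


doc = xml.dom.minidom.Document()
root = doc.createElement("business")
doc.appendChild(root)
root.appendChild(doc.createElement("location"))

path = os.path.join(tempfile.mkdtemp(), "doc.xml")
write_document(doc, path)

with open(path, encoding="utf-8") as file_object:
    written = file_object.read()
print(written)
```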
<h3><span class="Submodule">Reading</span> an XML Document</h3>
<p>XML would not be very useful if we could not read it back later for
subsequent processing. Fortunately, <code>minidom</code> makes this fairly
easy to achieve:</p>
<pre class="Python">import xml.dom.minidom<br>def get_a_document(name="/tmp/doc.xml"):<br> return xml.dom.minidom.parse(name)</pre>
<p>Or, if a file is already open for the purposes of reading...</p>
<pre class="Python">def get_a_document_from_file(file_object):<br> return xml.dom.minidom.parse(file_object)</pre>
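<p>When the XML is already held in a string, <code>parseString</code> can be
used instead of <code>parse</code>. The sketch below assumes a current Python
3 interpreter; the short document in <code>SOURCE</code> is made up for the
example. It also shows that namespace information is available on the parsed
elements, including those which inherit the default namespace.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

# A compact document without any padding between the elements.
SOURCE = (
    '<business xmlns="http://www.boddie.org.uk/paul/business">'
    "<location/>"
    "</business>"
)

doc = xml.dom.minidom.parseString(SOURCE)
root = doc.documentElement
location = root.firstChild

print(root.localName, root.namespaceURI)
print(location.localName, location.namespaceURI)
```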
<h4>Textual "Interference"</h4>
<p>Unfortunately, if a file is written out in a prettyprinted fashion, it
gets read in with all the extra padding, and this padding appears as text
nodes between elements where we never inserted any kind of text nodes before.
For example:</p>
<pre class="Prompt">>>> from XML_intro.Reading import *<br>>>> doc2 = get_a_document()</pre>
<pre class="PromptRequest">>>> doc2.childNodes[0].childNodes[0]</pre>
<pre class="PromptResponse"><DOM Text node "\012"></pre>
<p>What is this new text node? (We expected to see the "location" element's
DOM object here instead.) Well, it appears to be a newline character which
was inserted into our file to make the contents of that file look good when
viewed in a text editor, for example. There are more of these things, too:</p>
<pre class="PromptRequest">>>> doc2.childNodes[0].childNodes[1]</pre>
<pre class="PromptResponse"><DOM Text node " "></pre>
<p>This text node is a piece of indentation - something which made the
textual form of the "location" element appear slightly further right than the
"left margin" used by the "business" element. If we keep looking, though, we
can find the "location" element:</p>
<pre class="PromptRequest">>>> doc2.childNodes[0].childNodes[2]</pre>
<pre class="PromptResponse"><DOM Element: location at 137096548></pre>
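<p>The padding can be made visible by listing the type of every child node.
This sketch assumes a current Python 3 interpreter and uses a made-up,
indented document held in a string. Note that the exact way in which the
padding is split into text nodes may differ from the session shown above,
depending on the parser in use.</p>

```python
import xml.dom.minidom

PRETTY = """<business xmlns="http://www.boddie.org.uk/paul/business">
  <location>
    <surroundings>A quiet, scenic park.</surroundings>
  </location>
</business>"""

doc = xml.dom.minidom.parseString(PRETTY)
children = doc.documentElement.childNodes

for node in children:
    if node.nodeType == node.ELEMENT_NODE:
        print("element", node.localName)
    elif node.nodeType == node.TEXT_NODE:
        # repr() makes newlines and spaces visible.
        print("text", repr(node.nodeValue))
```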
<p>So how do we avoid referencing the wrong nodes in the document?</p>
<ol><li>Instead of assuming that any element is at a particular child node
position, loop over the child nodes until the appropriate node is
found.</li><li>Tell the parser (which reads the document from the file) not to include
all the padding.</li><li>Use convenience functions to get down to the part of the document which
interests us.</li><li>Use XPath querying to select the most interesting nodes.</li></ol>
<p>Here is a brief description of each of the above options:</p>
<h5>Node Iterators</h5>
<p>What we could do in the above case is to loop over the child nodes and
examine each node to see what type it is. If it is an element then we
investigate it, starting with "business" just to be safe:</p>
<pre class="Python">def find_business_element(doc):<br>    business_element = None<br>    for e in doc.childNodes:<br>        if e.nodeType == e.ELEMENT_NODE and e.localName == "business":<br>            business_element = e<br>            break<br>    return business_element</pre>
<p>We know that if <code>business_element</code> is not <code>None</code>,
then we found an element called "business". Naturally, we can change the name
of the element to be found according to the situation. Note that we compare
the value of the <code>nodeType</code> attribute to a special attribute
called <code>ELEMENT_NODE</code>. It is not obvious from this example, but
special attributes such as <code>ELEMENT_NODE</code> and
<code>TEXT_NODE</code> actually belong to the
<code>xml.dom.minidom.Node</code> class, and are therefore available in (or
shared by) all node objects.</p>
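<p>The same loop can be generalised so that it works below any node and for
any element name. The following is a sketch under the assumption of a current
Python 3 interpreter; <code>child_elements</code> is a hypothetical helper,
not a <code>minidom</code> function. It refers to the node type constant
through the <code>Node</code> class rather than through a node object.</p>

```python
import xml.dom.minidom
from xml.dom.minidom import Node

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"


def child_elements(parent, namespace=None, local_name=None):
    """Yield the element children of parent, optionally filtered by name."""
    for node in parent.childNodes:
        if node.nodeType != Node.ELEMENT_NODE:
            continue  # skip text nodes (padding), comments and so on
        if namespace is not None and node.namespaceURI != namespace:
            continue
        if local_name is not None and node.localName != local_name:
            continue
        yield node


doc = xml.dom.minidom.parseString(
    '<business xmlns="http://www.boddie.org.uk/paul/business">\n'
    "  <location/>\n"
    "  <location/>\n"
    "</business>"
)

business = next(child_elements(doc, BUSINESS_NS, "business"))
locations = list(child_elements(business, BUSINESS_NS, "location"))
print(len(locations))
```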
<h5>Parser Option Changes</h5>
<p><strong>NOTE:</strong> This should be documented somewhere. I haven't
investigated it yet.</p>
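<p>The <code>minidom</code> parsing functions in the standard library do not
appear to offer a switch for discarding padding. As a substitute for a parser
option, the padding can be removed after parsing. The sketch below assumes a
current Python 3 interpreter; <code>strip_padding</code> is a hypothetical
helper, and it should only be used on documents where whitespace-only text
between elements carries no meaning.</p>

```python
import xml.dom.minidom


def strip_padding(node):
    """Remove whitespace-only text nodes below node, modifying it in place."""
    # Iterate over a copy, since children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.nodeValue.strip():
            node.removeChild(child)
        else:
            strip_padding(child)


doc = xml.dom.minidom.parseString(
    '<business xmlns="http://www.boddie.org.uk/paul/business">\n'
    "  <location>\n"
    "    <surroundings>A quiet, scenic park.</surroundings>\n"
    "  </location>\n"
    "</business>"
)
strip_padding(doc)

# Index-based navigation now reaches the elements directly.
location = doc.childNodes[0].childNodes[0]
surroundings = location.childNodes[0]
print(location.localName, surroundings.localName)
print(surroundings.firstChild.nodeValue)
```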
<h5>Convenience Functions</h5>
<p>It should be possible to show the "surroundings" element's textual contents
just by searching for the "surroundings" element, finding its child nodes,
and then getting their values. Here, we use the
<code>getElementsByTagNameNS</code> method on the document object to find all
occurrences of the "surroundings" element in the document:</p>
<pre class="Python">def get_surroundings_elements(doc):<br> return doc.getElementsByTagNameNS("http://www.boddie.org.uk/paul/business", "surroundings")</pre>
<p>The result can then be investigated:</p>
<pre class="Prompt">>>> elements = get_surroundings_elements(doc2)</pre>
<pre class="PromptRequest">>>> elements</pre>
<pre class="PromptResponse">[<DOM Element: surroundings at 137101100>]</pre>
<p>So, the contents of the element can indeed be discovered, too:</p>
<pre class="PromptRequest">>>> elements[0]</pre>
<pre class="PromptResponse"><DOM Element: surroundings at 137101100></pre>
<pre class="PromptRequest">>>> elements[0].childNodes</pre>
<pre class="PromptResponse">[<DOM Text node "A quiet, s...">]</pre>
<pre class="PromptRequest">>>> elements[0].childNodes[0].nodeValue</pre>
<pre class="PromptResponse">u"A quiet, scenic park with lots of wildlife. It's usually sunny here, too."</pre>
<p>Note that the description that appears as the value of the text node is
actually a Unicode string, but this should not concern you too much - Unicode
strings are widely accepted by <code>minidom</code> and behave a lot like
"traditional" Python strings.</p>
<p>Now, there are a number of problems with the above approach:</p>
<ul><li>What if there are many "location" elements? We only specified one, but
a company might have many locations.</li><li>What if more than one "surroundings" element is allowed at a
"location"? It would be a strange concept, but it would confuse the above
approach, too.</li></ul>
<p>The principal problem with using such convenience functions is that any
structure or context that an element may have, which is taken into
consideration by descending into the document yourself, is lost when the
results of the function are returned: all "surroundings" elements are bundled
together into one package and handed back. However, we can learn about the
context of the elements by exploring various useful attributes of those
elements.</p>
<p>Every "surroundings" element should have a "location" element as its
parent. We can check this at the Python prompt:</p>
<pre class="PromptRequest">>>> elements[0].parentNode</pre>
<pre class="PromptResponse"><DOM Element: location at 137096548></pre>
<p>This at least allows us to compare "location" nodes against each other
using something like this:</p>
<pre class="Python">def examine_descriptions(elements):<br>    if elements[0].parentNode is elements[1].parentNode:<br>        print("They both describe the same location.")</pre>
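<p>The parent comparison can be taken a step further to restore the lost
context for any number of elements. The following sketch assumes a current
Python 3 interpreter; the helper <code>group_by_parent</code> and the sample
document with two locations are invented for the illustration.</p>

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"

SOURCE = (
    '<business xmlns="http://www.boddie.org.uk/paul/business">'
    "<location>"
    "<surroundings>Park.</surroundings>"
    "<surroundings>Lake.</surroundings>"
    "</location>"
    "<location>"
    "<surroundings>Motorway.</surroundings>"
    "</location>"
    "</business>"
)


def group_by_parent(elements):
    """Return a list of (parent, [elements]) pairs, comparing by identity."""
    groups = []
    for element in elements:
        for parent, members in groups:
            if parent is element.parentNode:
                members.append(element)
                break
        else:
            # No existing group has this parent, so start a new one.
            groups.append((element.parentNode, [element]))
    return groups


doc = xml.dom.minidom.parseString(SOURCE)
elements = doc.getElementsByTagNameNS(BUSINESS_NS, "surroundings")
groups = group_by_parent(elements)

for parent, members in groups:
    print(parent.localName, [m.firstChild.nodeValue for m in members])
```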