Elements and Element Trees

Fredrik Lundh | Last updated July 2007

This note introduces the Element and ElementTree types and the SubElement helper function, all available in the effbot.org elementtree library.

For an overview, with links to articles and more documentation, see the ElementTree Overview page.

For an API reference, see The elementtree.ElementTree Module.

You can download the library from the effbot.org downloads page.

In this article:

The Element Type
Attributes
Text Content
Searching for Subelements
Reading and Writing XML Files
XML Namespaces

The Element Type

The Element type is a flexible container object, designed to store hierarchical data structures in memory. The type can be described as a cross between a list and a dictionary.

Each element has a number of properties associated with it:

a tag. This is a string identifying what kind of data this element represents (the element type, in other words).
a number of attributes, stored in a Python dictionary.
a text string to hold text content, and a tail string to hold trailing text
a number of child elements, stored in a Python sequence

All elements must have a tag, but all other properties are optional. All strings can either be Unicode strings, or 8-bit strings containing US-ASCII only.

To create an element, call the Element constructor, and pass the tag string as the first argument:

from elementtree.ElementTree import Element

root = Element("root")

You can access the tag string via the tag attribute:

print root.tag

To build a tree, create more elements, and append them to the parent element:

root = Element("root")

root.append(Element("one"))
root.append(Element("two"))
root.append(Element("three"))

Since this is a very common operation, the library provides a helper function called SubElement that creates a new element and adds it to its parent, in one step:

from elementtree.ElementTree import Element, SubElement

root = Element("root")

SubElement(root, "one")
SubElement(root, "two")
SubElement(root, "three")

To access the subelements, you can use ordinary list (sequence) operations. This includes len(element) to get the number of subelements, element[i] to fetch the i’th subelement, and using the for-in statement to loop over the subelements:

for node in root:
    print node
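The other sequence operations mentioned above can be sketched in the same way. This is a minimal illustration, assuming the standard-library copy of the module (xml.etree.ElementTree) and Python 3 syntax rather than the elementtree package and Python 2 print statements used elsewhere in this note:

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import Element, SubElement

root = Element("root")
for tag in ("one", "two", "three"):
    SubElement(root, tag)

count = len(root)                     # number of direct subelements
first = root[0]                       # index like a list
last = root[-1]                       # negative indexes work too
tags = [node.tag for node in root]    # iteration follows document order

print(count, first.tag, last.tag, tags)
```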

The element type also supports slicing (including slice assignment), and the standard append, insert and remove methods:

nodes = node[1:5]
node.append(subnode)
node.insert(0, subnode)
node.remove(subnode)

Note that remove takes an element, not a tag. To find the element to remove, you can either loop over the parent, or use one of the find methods described below.
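Both ways of locating the element before removing it might look like the sketch below. It assumes the standard-library copy of the module (xml.etree.ElementTree) and Python 3 syntax; the tag names are only illustrative.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import Element, SubElement

root = Element("root")
SubElement(root, "one")
SubElement(root, "two")
SubElement(root, "three")

# remove() wants the element object itself, so look it up first
target = root.find("two")
if target is not None:
    root.remove(target)

# alternative: loop over a copy of the child list and compare tags
for child in list(root):
    if child.tag == "three":
        root.remove(child)

remaining = [child.tag for child in root]
print(remaining)
```

Looping over list(root) rather than root itself avoids modifying the sequence while iterating over it.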

Truth Testing

In ElementTree 1.2 and earlier, the sequence behaviour means that an element without any subelements tests as false (since it’s an empty sequence), even if it contains text or attributes. To check the return value from a function or method that may return None instead of a node, you must use an explicit test.

def fetchnode():
    # stand-in for any function or method that returns a node,
    # or None if no node was found
    return None

node = fetchnode()

if not node: # careful!
    print "node not found, or node has no subnodes"

if node is None:
    print "node not found"

Note: This behaviour is likely to change somewhat in ElementTree 1.3. To write code that is compatible in both directions, use “element is None” to test for a missing element, and “len(element)” to test for non-empty elements.
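A minimal sketch of the two explicit tests, assuming the standard-library copy of the module (xml.etree.ElementTree) and Python 3 syntax. The describe helper is hypothetical and exists only for this illustration:

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import Element, SubElement

def describe(node):
    """Classify a lookup result without relying on the element's truth value."""
    if node is None:
        return "missing"
    if len(node) == 0:
        return "found, no subelements"
    return "found, %d subelement(s)" % len(node)

parent = Element("parent")
leaf = SubElement(parent, "leaf")
leaf.text = "has text but no children"

results = [
    describe(parent.find("leaf")),   # exists, but has no children
    describe(parent.find("nope")),   # find() returns None here
    describe(parent),                # has one child
]
print(results)
```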

Accessing Parents

The element structure has no parent pointers. If you need to keep track of child/parent relations, you can structure your program to work on the parents rather than the children:

for parent in tree.getiterator():
    for child in parent:
        pass # ... work on the parent/child pair here

The getiterator function is explained in further detail below.

If you do this a lot, you can wrap the iterator code in a generator function:

def iterparent(tree):
    for parent in tree.getiterator():
        for child in parent:
            yield parent, child

for parent, child in iterparent(tree):
    pass # ... work on the parent/child pair here

Another approach is to use a separate data structure to map from child elements to their parents. In Python 2.4 and later, the following one-liner creates a child/parent map for an entire tree:

parent_map = dict((c, p) for p in tree.getiterator() for c in p)
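Once such a map exists, walking from any element up to the root is a simple loop. The sketch below assumes the standard-library copy of the module (xml.etree.ElementTree) and Python 3; it uses iter() in place of getiterator(), since getiterator was removed from the standard-library version in Python 3.9. The sample document is made up for the illustration.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import XML

root = XML("<root><a><b/></a><c/></root>")

# same one-liner as above, with iter() standing in for getiterator()
parent_map = dict((child, parent) for parent in root.iter() for child in parent)

# walk upwards from <b> and collect the tags along the way
path = []
node = root.find("a/b")
while node is not None:
    path.append(node.tag)
    node = parent_map.get(node)   # the root has no entry, so the walk stops there
path.reverse()

print(path)
```

Note that the map is a snapshot: it has to be rebuilt if elements are moved, added or removed afterwards.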

Attributes

In addition to the tag and the list of subelements, each element can have zero or more attributes. Each element attribute consists of a string key, and a corresponding value. As for ordinary Python dictionaries, all keys must be unique.

Element attributes are in fact stored in a standard Python dictionary, which can be accessed via the attrib attribute. To set attributes, you can simply assign to attrib members:

from elementtree.ElementTree import Element

elem = Element("tag")
elem.attrib["first"] = "1"
elem.attrib["second"] = "2"

When creating a new element, you can pass in element attributes using keyword arguments. The previous example is better written as:

from elementtree.ElementTree import Element

elem = Element("tag", first="1", second="2")

The Element type provides shortcuts for attrib.get, attrib.keys, and attrib.items. There’s also a set method, to set the value of an element attribute:

from elementtree.ElementTree import Element

elem = Element("tag", first="1", second="2")

# print 'first' attribute
print elem.attrib.get("first")

# same, using shortcut
print elem.get("first")

# print list of keys (using shortcuts)
print elem.keys()
print elem.items()

# the 'third' attribute doesn't exist
print elem.get("third")
print elem.get("third", "default")

# add the attribute and try again
elem.set("third", "3")
print elem.get("third", "default")


1
1
['first', 'second']
[('first', '1'), ('second', '2')]
None
default
3

Note that while the attrib value is required to be a real mutable Python dictionary, an ElementTree implementation may choose to use another internal representation, and create the dictionary only if someone asks for it. To take advantage of such implementations, stick to the shortcut methods whenever possible.

Text Content

The element type also provides a text attribute, which can be used to hold additional data associated with the element. As the name implies, this attribute is usually used to hold a text string, but it can be used for other, application-specific purposes.

from elementtree.ElementTree import Element

elem = Element("tag")
elem.text = "this element also contains text"

If there is no additional data, this attribute is set to an empty string, or None.

The element type actually provides two attributes that can be used in this way; in addition to text, there’s a similar attribute called tail. It too can contain a text string, an application-specific object, or None. The tail attribute is used to store trailing text nodes when reading mixed-content XML files; text that follows directly after an element is stored in the tail attribute for that element:

    <tag><elem>this goes into elem's
    text attribute</elem>this goes into
    elem's tail attribute</tag>

See the Mixed Content section for more information.

Note that some implementations may only support string objects as text or tail values.
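The split between text and tail is easiest to see by parsing a small mixed-content fragment. This is a sketch, assuming the standard-library copy of the module (xml.etree.ElementTree) and Python 3 syntax; the fragment itself is made up:

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import XML

tag = XML("<tag>before <elem>inside</elem> after</tag>")
elem = tag[0]

print(repr(tag.text))    # text between <tag> and the first child
print(repr(elem.text))   # text inside <elem>
print(repr(elem.tail))   # text between </elem> and the next tag
print(repr(tag.tail))    # nothing follows the root element
```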

Example

File: elementtree-example-1.py

# elementtree-example-1.py

from elementtree.ElementTree import Element, SubElement, dump

window = Element("window")

title = SubElement(window, "title", font="large")
title.text = "A sample text window"

text = SubElement(window, "text", wrap="word")

box = SubElement(window, "buttonbox")
SubElement(box, "button").text = "OK"
SubElement(box, "button").text = "Cancel"

dump(window)


$ python elementtree-example-1.py
<window><title font="large">A sample text window</title><text wrap=
"word" /><buttonbox><button>OK</button><button>Cancel</button></but
tonbox></window>

Searching for Subelements

The Element type provides a number of methods that can be used to search for subelements:

find(pattern) returns the first subelement that matches the given pattern, or None if there is no matching element.

findtext(pattern) returns the value of the text attribute for the first subelement that matches the given pattern. If there is no matching element, this method returns None.

findall(pattern) returns a list (or another iterable object) of all subelements that match the given pattern.

In ElementTree 1.2 and later, the pattern argument can either be a tag name, or a path expression. If a tag name is given, only direct subelements are checked. Path expressions can be used to search the entire subtree.

ElementTree 1.1 and earlier only supports plain tag names.
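The three find methods, and the difference between a plain tag name and a path expression, can be sketched as follows. The example assumes the standard-library copy of the module (xml.etree.ElementTree) and Python 3 syntax; the library document is invented for the illustration.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import XML

doc = XML(
    "<library>"
    "<book><title>First</title></book>"
    "<book><title>Second</title></book>"
    "<note>closed on Mondays</note>"
    "</library>"
)

first_book = doc.find("book")          # first direct child named book
note_text = doc.findtext("note")       # text of the first matching child
missing = doc.find("title")            # None: title is not a direct child
books = doc.findall("book")            # all direct children named book

# a path expression searches the whole subtree
titles = [t.text for t in doc.findall(".//title")]

print(first_book.findtext("title"), note_text, missing, len(books), titles)
```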

In addition, the getiterator method can be used to loop over the tree in depth-first order:

getiterator(tag) returns a list (or another iterable object) which contains all subelements that have the given tag, on all levels in the subtree. The elements are returned in document order (that is, in the same order as they would appear if you saved the tree as an XML file).

getiterator() (without argument) returns a list (or another iterable object) of all subelements in the subtree.

getchildren() returns a list (or another iterable object) of all direct child elements. This method is deprecated; new code should use indexing or slicing to access the children, or list(elem) to get a list.
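A short sketch of subtree iteration, assuming the standard-library copy of the module (xml.etree.ElementTree) on Python 3. There, getiterator and getchildren were removed in Python 3.9, so the sketch uses their replacements: iter() and list(elem).

```python
# Assumption: standard-library copy of the module, Python 3 syntax;
# iter() stands in for getiterator(), list(elem) for getchildren().
from xml.etree.ElementTree import XML

doc = XML("<a><b><c/><d/></b><c/></a>")

# document order; in this implementation the walk starts with doc itself
all_tags = [e.tag for e in doc.iter()]

# only elements with a given tag, on all levels
c_count = len(list(doc.iter("c")))

# direct children only
child_tags = [e.tag for e in list(doc)]

print(all_tags, c_count, child_tags)
```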

Reading and Writing XML Files

The Element type can be used to represent XML files in memory. The ElementTree wrapper class is used to read and write XML files.

To load an XML file into an Element structure, use the parse function:

from elementtree.ElementTree import parse

tree = parse(filename)
elem = tree.getroot()

You can also pass in a file handle (or any object with a read method):

from elementtree.ElementTree import parse

file = open(filename, "r")
tree = parse(file)
elem = tree.getroot()

The parse function returns an ElementTree object. To get the topmost element object, use the getroot method.

In recent versions of the ElementTree module, you can also use the file keyword argument to create a tree, and fill it with contents from a file in one operation:

from elementtree.ElementTree import ElementTree

tree = ElementTree(file=filename)
elem = tree.getroot()

To save an element tree back to disk, use the write method on the ElementTree class. Like the parse function, it takes either a filename or a file object (or any object with a write method):

from elementtree.ElementTree import ElementTree

tree = ElementTree(file=infile)
tree.write(outfile)

If you want to write an Element object hierarchy to disk, wrap it in an ElementTree instance:

from elementtree.ElementTree import Element, SubElement, ElementTree

html = Element("html")
body = SubElement(html, "body")
ElementTree(html).write(outfile)
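Since write and parse accept file-like objects, a round trip can be tried without touching the disk. The sketch below assumes the standard-library copy of the module (xml.etree.ElementTree) and Python 3, and uses an in-memory io.BytesIO buffer as the stand-in for a real file:

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
import io
from xml.etree.ElementTree import Element, SubElement, ElementTree, parse

html = Element("html")
body = SubElement(html, "body")
body.text = "hello"

# write to any object with a write method; here an in-memory byte buffer
buffer = io.BytesIO()
ElementTree(html).write(buffer)

# rewind and read the same bytes back into a new tree
buffer.seek(0)
tree = parse(buffer)
root = tree.getroot()

print(root.tag, root.findtext("body"))
```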

Note that the standard element writer creates a compact output. There is no built-in support for pretty printing or user-defined namespace prefixes in the current version, so the output may not always be suitable for human consumption (to the extent XML is suitable for human consumption, that is).

One way to produce nicer output is to add whitespace to the tree before saving it; see the indent function on the Element Library Functions page for an example.
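The idea of adding whitespace to the tree can be sketched as below. This is an illustration written for this note, not necessarily the function found on that page; the indent helper is a local, hypothetical name, and the example assumes the standard-library copy of the module (xml.etree.ElementTree) on Python 3.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import Element, SubElement, tostring

def indent(elem, level=0):
    """Insert whitespace-only text/tail values so the output is indented."""
    pad = "\n" + "  " * level
    if len(elem):
        # push the first child onto a new, deeper line
        if not (elem.text and elem.text.strip()):
            elem.text = pad + "  "
        for child in elem:
            indent(child, level + 1)
        # the last child's tail positions the parent's closing tag
        last = elem[-1]
        if not (last.tail and last.tail.strip()):
            last.tail = pad
    # every non-root element starts its following sibling on a new line
    if level and not (elem.tail and elem.tail.strip()):
        elem.tail = pad

html = Element("html")
head = SubElement(html, "head")
SubElement(head, "title").text = "Demo"
body = SubElement(html, "body")
SubElement(body, "p").text = "hi"

indent(html)
pretty = tostring(html, encoding="unicode")
print(pretty)
```

Existing non-whitespace text and tail values are left alone, so mixed content is not reflowed. The added whitespace does become part of the tree, which matters if the tree is processed further after saving.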

To convert between XML and strings, you can use the XML, fromstring, and tostring helpers:

from elementtree.ElementTree import XML, fromstring, tostring

elem = XML(text)

elem = fromstring(text) # same as XML(text)

text = tostring(elem)
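A complete round trip through these helpers might look like the following sketch. It assumes the standard-library copy of the module (xml.etree.ElementTree) on Python 3, where tostring returns bytes unless encoding="unicode" is passed; the greeting element is made up.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import XML, fromstring, tostring

text = "<greeting lang='en'>hello</greeting>"

elem = XML(text)
same = fromstring(text)   # builds an independent, equivalent element

# modify one copy and serialize it again
elem.set("lang", "sv")
elem.text = "hej"
out = tostring(elem, encoding="unicode")

print(out)
print(same.get("lang"), same.text)
```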

XML Namespaces

The elementtree module supports qualified names (QNames) for element tags and attribute names. A qualified name consists of a (uri, local name) pair.

Qualified names were introduced with the XML Namespace specification.

The element type represents a qualified name pair, also called universal name, as a string of the form “{uri}local”. This syntax can be used both for tag names and for attribute keys.

The following example creates an element where the tag is the qualified name pair (spam.effbot.org, egg).

from elementtree.ElementTree import Element

elem = Element("{spam.effbot.org}egg")

If you save this to an XML file, the writer will automatically generate proper XML namespace declarations, and pick a suitable prefix. When you load an XML file, the parser converts qualified tag and attribute names to the element syntax.
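Both directions can be sketched with a small document. The example assumes the standard-library copy of the module (xml.etree.ElementTree) on Python 3; the namespace URI and the element names are hypothetical.

```python
# Assumption: standard-library copy of the module, Python 3 syntax.
from xml.etree.ElementTree import Element, XML, tostring

NS = "http://example.com/ns"   # hypothetical namespace URI

# parsing: prefixes are resolved, and tags come back in {uri}local form
doc = XML('<x:menu xmlns:x="http://example.com/ns"><x:egg/><spam/></x:menu>')
tags = [doc.tag] + [child.tag for child in doc]

# searching: the pattern must use the same {uri}local form
egg = doc.find("{%s}egg" % NS)
plain = doc.find("egg")        # None: the unqualified name does not match

# writing: the serializer declares the namespace and picks a prefix itself
elem = Element("{%s}egg" % NS)
xml = tostring(elem, encoding="unicode")

print(tags)
print(xml)
```

The prefix in the output is chosen by the writer and need not match the x prefix used in the parsed input.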

Note that the standard parser discards namespace prefixes and declarations, so if you need to access the prefixes later on (e.g. to handle qualified names in attribute values or character data), you must use an alternate parser. For more information on this topic, see the articles The ElementTree iterparse Function and Using the ElementTree Module to Generate SOAP Messages, Part 3: Dealing with Qualified Names.


Comment:

I love this library, but the use of namespaces should be mentioned more clearly in your short examples. For instance, the following works (the namespace ends with a slash):

namespace="xspf.org/ns/0/"
tree.findall('.//{%s}track' % namespace)

While the following does not work:

namespace="xspf.org/ns/0"          # NO TRAILING SLASH
tree.findall('.//{%s}/track' % namespace) # ERROR

Keep up the good work!

Posted by Berco (2006-11-18)

In ET, a qualified name always consists of namespace URI (written inside the braces), and a local part. A slash inside braces belongs to the URI, a slash outside it is something different. /F

Comment:

Isn't the following a bug?

>>> s = fromstring('<b xmlns="h">hey</b>')
>>> tostring(s)
'<ns0:b xmlns:ns0="h">hey</ns0:b>'

Why does ET add an extra prefix? The document is valid without.

Posted by Sylvain (2006-12-07)

The b element belongs to the h namespace, so both serializations are equivalent. /F

Comment:

I agree. I didn't suggest ElementTree was invalid. I wonder however why ElementTree does not respect the fact that I do not set a prefix. Whether or not both serializations are equal is moot really.

Posted by Sylvain (2006-12-09)

Namespaces are identified using URI:s, not prefixes. The prefix is just a local placeholder, and there's no difference between a URI provided via an explicit prefix or via a default. If this is not clear, I suggest reading the XML namespace specification again, and perhaps also James Clark's namespace article. And if you need non-standard serialization, feel free to roll your own serializer. /F

Comment:

Why am I getting the error "ImportError: No module named elementtree.ElementTree"? I found that in Python there is only the "xml.etree.ElementTree" library to import, which gives "NameError: name 'Element' is not defined".

Posted by klemkas (2006-12-14)

The version shipped with Python 2.5 is installed under xml.etree, the original release is installed under elementtree. See the overview page for more on this. /F

Comment:

Hi, can I somehow avoid the need for carrying XMLNS URI everywhere? I wish to look up tags without prepending them with "{uri}" as it's pretty annoying ;-) Unfortunately the XML that comes to my app has xmlns attribute. Indeed I could rip that off with regexp before feeding it to fromstring(), but that's rather a brutal solution. Can I instead say something like tree.setDefaultNS("{some.org/foo}") and then do tree.findall("Data") instead of tree.findall("{some.org/foo}Data")? Michal

Posted by mludvig (2007-01-08)

Comment:

fetchnode(), I can not find this method anywhere so I guess it is fictive. Please change the documentation so newcomers do not waste their time looking for it. I am one of those :)

Posted by velle (2007-02-09)

It's an example of "a function or method that may return None instead of a node", as discussed in the paragraph before the code sample /F

Comment:

Indeed, the way XML namespaces are managed by this library is a big drawback, IMHO. Even the designers of the W3C Recommendation managed the tradeoff better: the tradeoff between precision, provided by the uniqueness of the URI, and handiness. Except in the xmlns declaration, short identifiers are used. The "{uri}local" form of the tag is a pain for programmers and readers, and error prone. Oh, I just read James Clark's XML Namespaces article (www.jclark.com/xml/xmlns.htm); I see where the "{uri}local" notation comes from!

Another, more serious drawback: the W3C recommendation shows no concern for the choice of the QName prefix, as long as it gets declared (hence associated with a URI) and isn't reserved. This does not mean nobody cares! I am working on XMP, an Adobe-promoted format for embedding metadata in files (PDF, JPG) as XML. The specification documents define "preferred field namespace prefix"(es)! All software agents I have met so far comply with these preferred prefixes. Am I to use ns0, ns1, etc., instead of these? No. You comply with de facto norms, logical or not, compliant with other norms or not. The ElementTree documentation comments on the "SOAP case" (effbot.org/zone/elementsoap-3.htm): that's another example of people who "care for prefixes" (although it's a little bit different there).

Using a different parser that keeps track of prefixes is not a good solution in my case: if I build an element from scratch, not from an existing record, I still want the correct URI-prefix associations. This calls for a quite natural solution: let the user (of the ElementTree module) provide a dict that defines this association, either at the module level (good enough for me) or tied to the tree.

I am sorry I seem to criticize a module I just discovered, and which I find really nice in other respects. But while the internal tag syntax is just a matter of taste, the tag prefixes I need to write are a requirement. I'll check now whether I can use the module with slight modifications, possibly deriving from the ElementTree classes.

Posted by pierre Imbaud pierre.imbaud@laposte.net (2007-02-15)

Have you noticed the _namespace_map variable in the ElementTree namespace? /F

Comment:

Good job. Many thanks for such a beautiful library. I have also worked with DOM and SAX under Java, but ElementTree is more friendly.

Posted by Mintaka (2007-04-17)

Comment:

I use code like this:


  ET._namespace_map["schemas.xmlsoap.org/soap/envelope/"] = 'soap'
  ET._namespace_map["www.w3.org/2001/XMLSchema-instance"] = 'xsi'
  ET._namespace_map["www.w3.org/2001/XMLSchema"] = 'xs'
  ET._namespace_map["urn:hl7-org:v3"] = 'hl7'

which does generate proper prefixes on output with very little effort. Maybe this should be mentioned in the article (unless future versions of ElementTree will not support it). It took me a while to figure it out, and it does solve the prefix problem which is a drawback indeed.

Posted by Marc de Graauw (2007-04-17)

Comment:

I'd like to second the request for a way to set a default namespace... I'm processing GPX files, and having to continually remember that I need {www.topografix.com/GPX/1/0} in front of every element name is cumbersome. Absent a default namespace, an inverse of the _namespace_map dictionary would be tremendously helpful, so that instead of '{www.topografix.com/GPX/1/0}trkpt' I could say '{gpx}trkpt'.

Posted by Lars (2007-05-25)

Comment:

I'd second the need to handle namespaces more opaquely. I'm using Amazon's web service APIs, which namespace everything and blanch at having to specify "awis.amazonaws.com/doc/2005-07-11" in calls to find. What if this value changes? While not likely, it could and would massively break things. To prevent this, I need to check the next-nearest parent where that namespace is declared and use it. That's insanely painful, particularly for a Python library. Perhaps there's already a solution to this. Anyone?

Posted by Garrett (2007-05-28)

But that namespace is a part of the element name, and is part of the protocol. If you get an element that's not in that namespace, it's not part of the specification you've programmed against. The "right thing" to do in your case is to dig the namespace URL out of the WSDL description for the service. If you want to ignore the namespace, you have to do that explicitly, and at your own risk. /F
