Python XML processing with lxml

Abstract

Describes the lxml package for reading and writing XML files with the Python programming language.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. Introduction: Python and XML

2. How ElementTree represents XML

3. Reading an XML document

4. Creating a new XML document

5. Modifying an existing XML document

6. Features of the etree module

6.1. The Comment() constructor
6.2. The Element() constructor
6.3. The ElementTree() constructor
6.4. The fromstring() function: Create an element from a string
6.5. The parse() function: build an ElementTree from a file
6.6. The ProcessingInstruction() constructor
6.7. The QName() constructor
6.8. The SubElement() constructor
6.9. The tostring() function: Serialize as XML
6.10. The XMLID() function: Convert text to XML with a dictionary of id values

7. class ElementTree: A complete XML document

7.1. ElementTree.find()
7.2. ElementTree.findall(): Find matching elements
7.3. ElementTree.findtext(): Retrieve the text content from an element
7.4. ElementTree.getiterator(): Make an iterator
7.5. ElementTree.getroot(): Find the root element
7.6. ElementTree.xpath(): Evaluate an XPath expression
7.7. ElementTree.write(): Translate back to XML

8. class Element: One element in the tree

8.1. Attributes of an Element instance
8.2. Accessing the list of child elements
8.3. Element.append(): Add a new element child
8.4. Element.clear(): Make an element empty
8.5. Element.find(): Find a matching sub-element
8.6. Element.findall(): Find all matching sub-elements
8.7. Element.findtext(): Extract text content
8.8. Element.get(): Retrieve an attribute value with defaulting
8.9. Element.getchildren(): Get element children
8.10. Element.getiterator(): Make an iterator to walk a subtree
8.11. Element.getroottree(): Find the ElementTree containing this element
8.12. Element.insert(): Insert a new child element
8.13. Element.items(): Produce attribute names and values
8.14. Element.iterancestors(): Find an element's ancestors
8.15. Element.iterchildren(): Find all children
8.16. Element.iterdescendants(): Find all descendants
8.17. Element.itersiblings(): Find other children of the same parent
8.18. Element.keys(): Find all attribute names
8.19. Element.remove(): Remove a child element
8.20. Element.set(): Set an attribute value
8.21. Element.xpath(): Evaluate an XPath expression

9. XPath processing

9.1. An XPath example

10. The art of Web-scraping: Parsing HTML with Beautiful Soup

11. Automated validation of input files

11.1. Validation with a Relax NG schema
11.2. Validation with an XSchema (XSD) schema

12. etbuilder.py: A simplified XML builder module

12.1. Using the etbuilder module
12.2. CLASS(): Adding class attributes
12.3. FOR(): Adding for attributes
12.4. subElement(): Adding a child element
12.5. addText(): Adding text content to an element

13. Implementation of etbuilder

13.1. Features differing from Lundh's original
13.2. Prologue
13.3. CLASS(): Helper function for adding CSS class attributes
13.4. FOR(): Helper function for adding XHTML for attributes
13.5. subElement(): Add a child element
13.6. addText(): Add text content to an element
13.7. class ElementMaker: The factory class
13.8. ElementMaker.__init__(): Constructor
13.9. ElementMaker.__call__(): Handle calls to the factory instance
13.10. ElementMaker.__handleArg(): Process one positional argument
13.11. ElementMaker.__getattr__(): Handle arbitrary method calls
13.12. Epilogue
13.13. testetbuilder: A test driver for etbuilder

14. rnc_validate: A module to validate XML against a Relax NG schema

14.1. Design of the rnc_validate module
14.2. Interface to the rnc_validate module
14.3. rnc_validate.py: Prologue
14.4. RelaxException
14.5. class RelaxValidator
14.6. RelaxValidator.validate()
14.7. RelaxValidator.__init__(): Constructor
14.8. RelaxValidator.__makeRNG(): Find or create an .rng file
14.9. RelaxValidator.__getModTime(): When was this file last changed?
14.10. RelaxValidator.__trang(): Translate .rnc to .rng format

15. rnck: A standalone script to validate XML against a Relax NG schema

15.1. rnck: Prologue
15.2. rnck: main()
15.3. rnck: checkArgs()
15.4. rnck: usage()
15.5. rnck: fatal()
15.6. rnck: message()
15.7. rnck: validateFile()
15.8. rnck: Epilogue

1. Introduction: Python and XML

With the continued growth of both Python and XML, there are many packages that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml package has two big advantages:

Performance. Reading and writing even fairly large XML files takes an almost imperceptible amount of time.
Ease of programming. The lxml package is based on ElementTree, which Fredrik Lundh invented to simplify and streamline XML processing.

lxml is similar in many ways to two other, earlier packages:

Fredrik Lundh continues to maintain his original version of ElementTree.
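xml.etree.ElementTree is now an official part of the Python library. There is a C-language version called cElementTree which may be even faster than lxml for some applications; since Python 3.3 the standard module uses this C implementation automatically when it is available, and the separate cElementTree name was removed in Python 3.9.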

However, the author prefers lxml because it provides a number of additional features that make life easier. In particular, its XPath support makes it considerably easier to navigate more complex XML structures.
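
As a rough illustration of the points above, the following sketch parses a small document using calls that lxml shares with the standard library's ElementTree, then uses the lxml-only Element.xpath() method when it is available. The sample catalog document and the variable names are invented for this example and do not appear elsewhere in this publication; the fallback branch is only one way the same selection might be written without XPath.

```python
# Prefer lxml; fall back to the standard library's ElementTree,
# which offers the same basic API for parsing and searching.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample document, invented for this illustration.
SAMPLE = """<catalog>
  <book id="b1"><title>Dune</title><price>9.99</price></book>
  <book id="b2"><title>Emma</title><price>4.50</price></book>
</catalog>"""

root = etree.fromstring(SAMPLE)

# fromstring(), findall() and findtext() exist in both packages.
titles = [book.findtext("title") for book in root.findall("book")]

if hasattr(root, "xpath"):
    # Element.xpath() is an lxml addition: one expression selects
    # the titles of all books priced under 5.
    cheap = list(root.xpath("book[price < 5]/title/text()"))
else:
    # Without full XPath, the same filter is written by hand.
    cheap = [
        book.findtext("title")
        for book in root.findall("book")
        if float(book.findtext("price")) < 5
    ]

print(titles)
print(cheap)
```

The guarded import lets the same script run with either package; only the numeric comparison inside the XPath predicate depends on lxml.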