
Python XML processing with lxml


Abstract

Describes the lxml package for reading and writing XML files with the Python programming language.

This publication is available in Web form and also as a PDF document. Please forward any comments to tcc-doc@nmt.edu.

Table of Contents

1. Introduction: Python and XML
2. How ElementTree represents XML
3. Reading an XML document
4. Creating a new XML document
5. Modifying an existing XML document
6. Features of the etree module
6.1. The Comment() constructor
6.2. The Element() constructor
6.3. The ElementTree() constructor
6.4. The fromstring() function: Create an element from a string
6.5. The parse() function: build an ElementTree from a file
6.6. The ProcessingInstruction() constructor
6.7. The QName() constructor
6.8. The SubElement() constructor
6.9. The tostring() function: Serialize as XML
6.10. The XMLID() function: Convert text to XML with a dictionary of id values
7. class ElementTree: A complete XML document
7.1. ElementTree.find()
7.2. ElementTree.findall(): Find matching elements
7.3. ElementTree.findtext(): Retrieve the text content from an element
7.4. ElementTree.getiterator(): Make an iterator
7.5. ElementTree.getroot(): Find the root element
7.6. ElementTree.xpath(): Evaluate an XPath expression
7.7. ElementTree.write(): Translate back to XML
8. class Element: One element in the tree
8.1. Attributes of an Element instance
8.2. Accessing the list of child elements
8.3. Element.append(): Add a new element child
8.4. Element.clear(): Make an element empty
8.5. Element.find(): Find a matching sub-element
8.6. Element.findall(): Find all matching sub-elements
8.7. Element.findtext(): Extract text content
8.8. Element.get(): Retrieve an attribute value with defaulting
8.9. Element.getchildren(): Get element children
8.10. Element.getiterator(): Make an iterator to walk a subtree
8.11. Element.getroottree(): Find the ElementTree containing this element
8.12. Element.insert(): Insert a new child element
8.13. Element.items(): Produce attribute names and values
8.14. Element.iterancestors(): Find an element's ancestors
8.15. Element.iterchildren(): Find all children
8.16. Element.iterdescendants(): Find all descendants
8.17. Element.itersiblings(): Find other children of the same parent
8.18. Element.keys(): Find all attribute names
8.19. Element.remove(): Remove a child element
8.20. Element.set(): Set an attribute value
8.21. Element.xpath(): Evaluate an XPath expression
9. XPath processing
9.1. An XPath example
10. The art of Web-scraping: Parsing HTML with Beautiful Soup
11. Automated validation of input files
11.1. Validation with a Relax NG schema
11.2. Validation with an XSchema (XSD) schema
12. etbuilder.py: A simplified XML builder module
12.1. Using the etbuilder module
12.2. CLASS(): Adding class attributes
12.3. FOR(): Adding for attributes
12.4. subElement(): Adding a child element
12.5. addText(): Adding text content to an element
13. Implementation of etbuilder
13.1. Features differing from Lundh's original
13.2. Prologue
13.3. CLASS(): Helper function for adding CSS class attributes
13.4. FOR(): Helper function for adding XHTML for attributes
13.5. subElement(): Add a child element
13.6. addText(): Add text content to an element
13.7. class ElementMaker: The factory class
13.8. ElementMaker.__init__(): Constructor
13.9. ElementMaker.__call__(): Handle calls to the factory instance
13.10. ElementMaker.__handleArg(): Process one positional argument
13.11. ElementMaker.__getattr__(): Handle arbitrary method calls
13.12. Epilogue
13.13. testetbuilder: A test driver for etbuilder
14. rnc_validate: A module to validate XML against a Relax NG schema
14.1. Design of the rnc_validate module
14.2. Interface to the rnc_validate module
14.3. rnc_validate.py: Prologue
14.4. RelaxException
14.5. class RelaxValidator
14.6. RelaxValidator.validate()
14.7. RelaxValidator.__init__(): Constructor
14.8. RelaxValidator.__makeRNG(): Find or create an .rng file
14.9. RelaxValidator.__getModTime(): When was this file last changed?
14.10. RelaxValidator.__trang(): Translate .rnc to .rng format
15. rnck: A standalone script to validate XML against a Relax NG schema
15.1. rnck: Prologue
15.2. rnck: main()
15.3. rnck: checkArgs()
15.4. rnck: usage()
15.5. rnck: fatal()
15.6. rnck: message()
15.7. rnck: validateFile()
15.8. rnck: Epilogue

1. Introduction: Python and XML

With the continued growth of both Python and XML, many packages are now available for reading, generating, and modifying XML files from Python scripts. Compared to most of them, the lxml package has two big advantages:

  • Performance. Reading and writing even fairly large XML files takes an almost imperceptible amount of time.

  • Ease of programming. The lxml package is based on ElementTree, which Fredrik Lundh invented to simplify and streamline XML processing.

lxml is similar in many ways to two other, earlier packages:

  • Fredrik Lundh continues to maintain his original version of ElementTree.

  • xml.etree.ElementTree is now part of the Python standard library. Its C-language version, cElementTree, may be even faster than lxml for some applications.

However, the author prefers lxml because it provides a number of additional features that make life easier. In particular, its support for XPath makes it considerably easier to navigate and query more complex XML structures.
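As a rough illustration of both points, here is a minimal sketch showing the ElementTree-style calls that lxml shares with xml.etree.ElementTree, followed by an XPath query that needs lxml. The catalog document, the HAVE_LXML flag, and the cheap_english_titles() helper are invented for this example and are not part of either package.

```python
try:
    from lxml import etree                    # preferred: full XPath 1.0 support
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree     # standard-library fallback
    HAVE_LXML = False

# Hypothetical sample document, made up for this sketch.
CATALOG = """<catalog>
  <book lang="en"><title>Dune</title><price>9.99</price></book>
  <book lang="fr"><title>Candide</title><price>4.50</price></book>
  <book lang="en"><title>Emma</title><price>6.25</price></book>
</catalog>"""

root = etree.fromstring(CATALOG)

# These ElementTree-style calls work the same way in both packages.
all_titles = [book.findtext("title") for book in root.findall("book")]


def cheap_english_titles(root, limit):
    """Return titles of English-language books priced below limit."""
    if HAVE_LXML:
        # lxml only: one XPath expression combines an attribute test
        # with a numeric comparison against the $limit variable.
        return root.xpath(
            "book[@lang='en'][price < $limit]/title/text()", limit=limit)
    # Without lxml: use the limited path syntax of ElementTree and
    # do the numeric comparison in Python.
    return [book.findtext("title")
            for book in root.findall("book[@lang='en']")
            if float(book.findtext("price")) < limit]


print(all_titles)
print(cheap_english_titles(root, 7.0))
```

With lxml installed, the whole selection is a single call to xpath(); the fallback branch shows roughly how much of that work moves into Python code when only the standard-library module is available.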



John W. Shipman
Comments welcome: tcc-doc@nmt.edu
Last updated: 2011-11-11 13:07
URL: www.nmt.edu/tcc/help/pubs/pylxml/web/index.html