
Beautiful Soup Documentation

by Leonard Richardson (leonardr@segfault.org)

This document is also available in Chinese translation.

This document is also available in Russian translation. [External link]

This document is also available in Japanese translation. [External link]


Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.

This document illustrates all major features of Beautiful Soup version 3.0, with examples. It shows you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation.

Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
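
For comparison, here's what basic usage looks like under Beautiful Soup 4 (a minimal sketch; the second argument names the underlying parser, here the standard library's html.parser):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Some <b>bold</b> text", "html.parser")
print soup.p.b.string
# bold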

Quick Start

Get Beautiful Soup here. The changelog describes differences between 3.0 and earlier versions.

Include Beautiful Soup in your application with a line like one of the following:

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

If you get the message "No module named BeautifulSoup", but you know Beautiful Soup is installed, you're probably using the Beautiful Soup 4 beta. Use this code instead:

from bs4 import BeautifulSoup # To get everything

This document only covers Beautiful Soup 3. Beautiful Soup 4 has some slight differences; see the README.txt file for details.

Here's some code demonstrating the basic features of Beautiful Soup. You can copy and paste this code into a Python session to run it yourself.

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

Here are some ways to navigate the soup:

soup.contents[0].name
# u'html'

soup.contents[0].contents[0].name
# u'head'

head = soup.contents[0].contents[0]
head.parent.name
# u'html'

head.next
# <title>Page title</title>

head.nextSibling.name
# u'body'

head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

Here are a couple of ways to search the soup for certain tags, or tags with certain properties:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>

titleTag.string
# u'Page title'

len(soup('p'))
# 2

soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

soup('p', align="center")[0]['id']
# u'firstpara'

soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'

soup.find('p').b.string
# u'one'

soup('p')[1].b.string
# u'two'

It's easy to modify the soup:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>

soup.p.extract()
print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

soup.p.replaceWith(soup.b)
print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>

soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " &lt;p&gt; tags!")
soup.body
# <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>

Here's a real-world example. It fetches the ICC Commercial Crime Services weekly piracy report, parses it with Beautiful Soup, and pulls out the piracy incidents:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

Parsing a Document

A Beautiful Soup constructor takes an XML or HTML document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory.
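
Both input forms look like this (a sketch; the file name is hypothetical):

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<html><b>Hello</b></html>")   # from a string
soup = BeautifulSoup(open("page.html"))             # from an open file-like object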

If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if there's something wrong with the document, Beautiful Soup uses heuristics to figure out a reasonable structure for it.

Parsing HTML

Use the BeautifulSoup class to parse an HTML document. Here are some of the things that BeautifulSoup knows:

* Some tags can be nested (<BLOCKQUOTE>) and others can't (<P>).
* Table and list tags have a natural nesting order. For instance, <TD> tags go inside <TR> tags, not the other way around.
* The contents of a <SCRIPT> tag should not be parsed as HTML.
* <META> tags may specify an encoding for the document.

Here it is in action:

from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()
# <html>
#  <p>
#   Para 1
#  </p>
#  <p>
#   Para 2
#   <blockquote>
#    Quote 1
#    <blockquote>
#     Quote 2
#    </blockquote>
#   </blockquote>
#  </p>
# </html>

Note that BeautifulSoup figured out sensible places to put the closing tags, even though the original document lacked them.

That document isn't valid HTML, but it's not too bad either. Here's a really horrible document. Among other problems, it's got a <FORM> tag that starts outside of a <TABLE> tag and ends inside the <TABLE> tag. (HTML like this was found on a website run by a major web company.)

from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
 <table>
 <td><input name="input1">Row 1 cell 1
 <tr><td>Row 2 cell 1
 </form> 
 <td>Row 2 cell 2<br>This</br> sure is a long cell
</body> 
</html>"""

Beautiful Soup handles this document as well:

print BeautifulSoup(html).prettify()
# <html>
#  <form>
#   <table>
#    <td>
#     <input name="input1" />
#     Row 1 cell 1
#    </td>
#    <tr>
#     <td>
#      Row 2 cell 1
#     </td>
#    </tr>
#   </table>
#  </form>
#  <td>
#   Row 2 cell 2
#   <br />
#   This 
#   sure is a long cell
#  </td>
# </html>

The last cell of the table is outside the <TABLE> tag; Beautiful Soup decided to close the <TABLE> tag when it closed the <FORM> tag. The author of the original document probably intended the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a bizarre case like this, Beautiful Soup parses the invalid document and gives you access to all the data.

Parsing XML

The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.

Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting. Here it is in action:

from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
# <doc>
#  <tag1>
#   Contents 1
#   <tag2>
#    Contents 2
#   </tag2>
#  </tag1>
#  <tag1>
#   Contents 3
#  </tag1>
# </doc>

The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:

from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
#  Text 1
#  <selfclosing>
#   Text 2
#  </selfclosing>
# </tag>

print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
#  Text 1
#  <selfclosing />
#  Text 2
# </tag>

If That Doesn't Work

There are several other parser classes with different heuristics from these two. You can also subclass and customize a parser and give it your own heuristics.
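
For instance, ICantBelieveItsBeautifulSoup (one of the classes shipped with Beautiful Soup 3) trusts valid-but-unusual markup that BeautifulSoup second-guesses. A quick sketch; exact output may vary with your version:

from BeautifulSoup import BeautifulSoup, ICantBelieveItsBeautifulSoup

# BeautifulSoup assumes a <b> inside a <b> means the first one was
# supposed to be closed:
print BeautifulSoup("<b>Foo<b>Bar</b></b>")
# <b>Foo</b><b>Bar</b>

# ICantBelieveItsBeautifulSoup believes the markup as written:
print ICantBelieveItsBeautifulSoup("<b>Foo<b>Bar</b></b>")
# <b>Foo<b>Bar</b></b>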

Beautiful Soup Gives You Unicode, Dammit

By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'

Here's an example with a Japanese document encoded in UTF-8:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf")
soup.contents[0]
# u'\u3053\u308c\u306f'
soup.originalEncoding
# 'utf-8'

str(soup)
# '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

# Note: this bit uses EUC-JP, so it only works if you have cjkcodecs
# installed, or are running Python 2.4.
soup.__str__('euc-jp')
# '\xa4\xb3\xa4\xec\xa4\xcf'

Beautiful Soup uses a class called UnicodeDammit to detect the encodings of documents you give it and convert them to Unicode, no matter what. If you need to do this for other documents (without using Beautiful Soup to parse them), you can use UnicodeDammit by itself. It's heavily based on code from the Universal Feed Parser.
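
Here's a minimal sketch of standalone use (the exact encoding guess depends on which detection libraries you have installed):

from BeautifulSoup import UnicodeDammit

dammit = UnicodeDammit("Sacr\xe9 bleu!")
dammit.unicode
# u'Sacr\xe9 bleu!'
dammit.originalEncoding
# 'windows-1252' (or similar, depending on available detectors)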

If you're running an older version of Python than 2.4, be sure to download and install cjkcodecs and iconvcodec, which make Python capable of supporting more codecs, especially CJK codecs. Also install the chardet library, for better autodetection.

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

1. An encoding you pass in as the fromEncoding argument to the soup constructor.
2. An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try.
3. An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
4. An encoding sniffed by the chardet library, if you have it installed.
5. UTF-8
6. Windows-1252

Beautiful Soup will almost always guess right if it can make a guess at all. But for documents with no declarations and in strange encodings, it will often not be able to guess. It will fall back to Windows-1252, which will probably be wrong. Here's an EUC-JP example where Beautiful Soup guesses the encoding wrong. (Again, because it uses EUC-JP, this example will only work if you are running Python 2.4 or have cjkcodecs installed):

from BeautifulSoup import BeautifulSoup
euc_jp = '\xa4\xb3\xa4\xec\xa4\xcf'

soup = BeautifulSoup(euc_jp)
soup.originalEncoding
# 'windows-1252'

str(soup)
# '\xc2\xa4\xc2\xb3\xc2\xa4\xc3\xac\xc2\xa4\xc3\x8f'     # Wrong!

But if you specify the encoding with fromEncoding, it parses the document correctly, and can convert it to UTF-8 or back to EUC-JP.

soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
soup.originalEncoding
# 'euc-jp'

str(soup)
# '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'                 # Right!

soup.__str__('euc-jp') == euc_jp
# True

If you give Beautiful Soup a document in the Windows-1252 encoding (or a similar encoding like ISO-8859-1 or ISO-8859-2), Beautiful Soup finds and destroys the document's smart quotes and other Windows-specific characters. Rather than transforming those characters into their Unicode equivalents, Beautiful Soup transforms them into HTML entities (BeautifulSoup) or XML entities (BeautifulStoneSoup).

To prevent this, you can pass smartQuotesTo=None into the soup constructor: then smart quotes will be converted to Unicode like any other native-encoding characters. You can also pass in "xml" or "html" for smartQuotesTo, to change the default behavior of BeautifulSoup and BeautifulStoneSoup.

from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
text = "Deploy the \x91SMART QUOTES\x92!"

str(BeautifulSoup(text))
# 'Deploy the &lsquo;SMART QUOTES&rsquo;!'

str(BeautifulStoneSoup(text))
# 'Deploy the &#x2018;SMART QUOTES&#x2019;!'

str(BeautifulSoup(text, smartQuotesTo="xml"))
# 'Deploy the &#x2018;SMART QUOTES&#x2019;!'

BeautifulSoup(text, smartQuotesTo=None).contents[0]
# u'Deploy the \u2018SMART QUOTES\u2019!'

Printing a Document

You can turn a Beautiful Soup document (or any subset of it) into a string with the str function, or the prettify or renderContents methods. You can also use the unicode function to get the whole document as a Unicode string.

The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace, which might change the meaning of an XML document. The str and unicode functions don't strip out text nodes that contain only whitespace, and they don't add any whitespace between nodes either.
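
To see just the whitespace behavior (a sketch):

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup("<a>  <b>text</b>  </a>")

str(soup)                 # whitespace-only text nodes survive
# '<a>  <b>text</b>  </a>'

print soup.prettify()     # whitespace-only text nodes are stripped
# <a>
#  <b>
#   text
#  </b>
# </a>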

Here's an example.

from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)

str(soup)
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.renderContents()
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.__str__()
# '<html><h1>Heading</h1><p>Text</p></html>'
unicode(soup)
# u'<html><h1>Heading</h1><p>Text</p></html>'

soup.prettify()
# '<html>\n <h1>\n  Heading\n </h1>\n <p>\n  Text\n </p>\n</html>'

print soup.prettify()
# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>

Note that str and renderContents give different results when used on a tag within the document. str prints a tag and its contents, and renderContents only prints the contents.

heading = soup.h1
str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
# 'Heading'

When you call __str__, prettify, or renderContents, you can specify an output encoding. The default encoding (the one used by str) is UTF-8. Here's an example that parses an ISO-8859-1 string and then outputs the same string in different encodings:

from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
# 'Sacr\xc3\xa9 bleu!'                          # UTF-8
soup.__str__("ISO-8859-1")
# 'Sacr\xe9 bleu!'
soup.__str__("UTF-16")
# '\xff\xfeS\x00a\x00c\x00r\x00\xe9\x00 \x00b\x00l\x00e\x00u\x00!\x00'
soup.__str__("EUC-JP")
# 'Sacr\x8f\xab\xb1 bleu!'

If the original document contained an encoding declaration, then Beautiful Soup rewrites the declaration to mention the new encoding when it converts the document back to a string. This means that if you load an HTML document into BeautifulSoup and print it back out, not only should the HTML be cleaned up, but it should be transparently converted to UTF-8.

Here's an HTML example:

from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""

print BeautifulSoup(doc).prettify()
# <html>
#  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
#  Sacré bleu!
# </html>

Here's an XML example:

from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1">Sacr\xe9 bleu!"""

print BeautifulStoneSoup(doc).prettify()
# <?xml version='1.0' encoding='utf-8'>
# Sacré bleu!

The Parse Tree

So far we've focused on loading documents and writing them back out. Most of the time, though, you're interested in the parse tree: the data structure Beautiful Soup builds as it parses the document.

A parser object (an instance of BeautifulSoup or BeautifulStoneSoup) is a deeply-nested, well-connected data structure that corresponds to the structure of an XML or HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the <TITLE> tag and the <B> tags; and NavigableString objects, which correspond to strings like "Page title" and "This is paragraph".
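
You can check the classes directly (a quick sketch):

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b>bold</b>")

soup.__class__.__name__
# 'BeautifulSoup'
soup.b.__class__.__name__
# 'Tag'
soup.b.string.__class__.__name__
# 'NavigableString'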

There are also some subclasses of NavigableString (CData, Comment, Declaration, and ProcessingInstruction), which correspond to special XML constructs. They act like NavigableStrings, except that when it's time to print them out they have some extra data attached to them. Here's a document that includes a comment:

from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want.-->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))

comment.__class__
# <class 'BeautifulSoup.Comment'>
comment
# u"I've got to be nice to get what I want."
comment.previousSibling
# u'Hello! '

str(comment)
# "<!--I've got to be nice to get what I want.-->"
print commentSoup
# Hello! <!--I've got to be nice to get what I want.-->

Now, let's take a closer look at the document used at the beginning of the documentation:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

The attributes of Tags

Tag and NavigableString objects have lots of useful members, most of which are covered in Navigating the Parse Tree and Searching the Parse Tree. However, there's one aspect of Tag objects we'll cover here: the attributes.

SGML tags have attributes: for instance, each of the <P> tags in the example HTML above has an "id" attribute and an "align" attribute. You can access a tag's attributes by treating the Tag object as though it were a dictionary:

firstPTag, secondPTag = soup.findAll('p')

firstPTag['id']
# u'firstpara'

secondPTag['id']
# u'secondpara'

NavigableString objects don't have attributes; only Tag objects have them.
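
If you treat a NavigableString as a dictionary anyway, you just get the underlying Python string type's error (a sketch, using the document above):

soup.p.contents[0]['align']
# TypeError: string indices must be integers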

Navigating the Parse Tree

All Tag objects have all of the members listed below (though the actual value of the member may be None). NavigableString objects have all of them except for contents and string.

parent

In the example above, the parent of the <HEAD> Tag is the <HTML> Tag. The parent of the <HTML> Tag is the BeautifulSoup parser object itself. The parent of the parser object is None. By following parent, you can move up the parse tree:

soup.head.parent.name
# u'html'
soup.head.parent.parent.__class__.__name__
# 'BeautifulSoup'
soup.parent == None
# True

contents

With parent you move up the parse tree. With contents you move down the tree. contents is an ordered list of the Tag and NavigableString objects contained within a page element. Only the top-level parser object and Tag objects have contents. NavigableString objects are just strings and can't contain sub-elements, so they don't have contents.

In the example above, the contents of the first <P> Tag is a list containing a NavigableString ("This is paragraph "), a <B> Tag, and another NavigableString ("."). The contents of the <B> Tag is a list containing a NavigableString ("one").

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
pTag.contents[1].contents
# [u'one']
pTag.contents[0].contents
# AttributeError: 'NavigableString' object has no attribute 'contents'

string

For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]. In the example above, soup.b.string is a NavigableString representing the Unicode string "one". That's the string contained in the first <B> Tag in the parse tree.

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

But soup.p.string is None, because the first <P> Tag in the parse tree has more than one child. soup.head.string is also None, even though the <HEAD> Tag has only one child, because that child is a Tag (the <TITLE> Tag), not a NavigableString.

soup.p.string == None
# True
soup.head.string == None
# True

nextSibling and previousSibling

These members let you skip to the next or previous thing on the same level of the parse tree. In the document above, the nextSibling of the <HEAD> Tag is the <BODY> Tag, because the <BODY> Tag is the next thing directly beneath the <html> Tag. The nextSibling of the <BODY> tag is None, because there's nothing else directly beneath the <HTML> Tag.

soup.head.nextSibling.name
# u'body'
soup.body.nextSibling == None
# True

Conversely, the previousSibling of the <BODY> Tag is the <HEAD> tag, and the previousSibling of the <HEAD> Tag is None:

soup.body.previousSibling.name
# u'head'
soup.head.previousSibling == None
# True

Some more examples: the nextSibling of the first <P> Tag is the second <P> Tag. The previousSibling of the <B> Tag inside the second <P> Tag is the NavigableString "This is paragraph". The previousSibling of that NavigableString is None, not anything inside the first <P> Tag.

soup.p.nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

secondBTag = soup.findAll('b')[1]
secondBTag.previousSibling
# u'This is paragraph'
secondBTag.previousSibling.previousSibling == None
# True

next and previous

These members let you move through the document elements in the order they were processed by the parser, rather than in the order they appear in the tree. For instance, the next of the <HEAD> Tag is the <TITLE> Tag, not the <BODY> Tag. This is because, in the original document, the <TITLE> tag comes immediately after the <HEAD> tag.

soup.head.next.name
# u'title'
soup.head.nextSibling.name
# u'body'
soup.head.previous.name
# u'html'

Where next and previous are concerned, a Tag's contents come before its nextSibling. You usually won't have to use these members, but sometimes it's the easiest way to get to something buried inside the parse tree.
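
For instance, you can visit every tag in the document in parse order by following next (a sketch, using the document above):

element = soup.html
while element is not None:
    # NavigableStrings have no 'name', so getattr skips them
    if getattr(element, 'name', None):
        print element.name
    element = element.next
# html
# head
# title
# body
# p
# b
# p
# b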

Iterating over a Tag

You can iterate over the contents of a Tag by treating it as a list. This is a useful shortcut. Similarly, to see how many child nodes a Tag has, you can call len(tag) instead of len(tag.contents). In terms of the document above:

for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

len(soup.body)
# 2
len(soup.body.contents)
# 2

Using tag names as members

It's easy to navigate the parse tree by acting as though the name of the tag you want is a member of a parser or Tag object. We've been doing it throughout these examples. In terms of the document above, soup.head gives us the first (and, as it happens, only) <HEAD> Tag in the document:

soup.head
# <head><title>Page title</title></head>

In general, calling mytag.foo returns the first child of mytag that happens to be a <FOO> Tag. If there aren't any <FOO> Tags beneath mytag, then mytag.foo returns None. You can use this to traverse the parse tree very quickly:

soup.head.title
# <title>Page title</title>

soup.body.p.b.string
# u'one'

You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're not worried about <TITLE> tags in weird places outside of the <HEAD> tag, you can just use soup.title to get an HTML document's title. You don't have to use soup.head.title:

soup.title.string
# u'Page title'

soup.p jumps to the first <P> tag inside a document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
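
For instance (a sketch with a made-up table):

from BeautifulSoup import BeautifulSoup
tableSoup = BeautifulSoup('<table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>')
tableSoup.table.tr.td
# <td>Cell 1</td>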

These members actually alias to the first method, covered below. I mention it here because the alias makes it very easy to zoom in on an interesting part of a well-known parse tree.

An alternate form of this idiom lets you access the first <FOO> tag as .fooTag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.tableTag.trTag.tdTag, or even soup.tableTag.tr.tdTag. This is useful if you like to be more explicit about what you're doing, or if you're parsing XML whose tag names conflict with the names of Beautiful Soup methods and members.

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

xmlSoup.person.parent                      # A Beautiful Soup member
# <person name="Bob"><parent rel="mother" name="Alice"></parent></person>
xmlSoup.person.parentTag                   # A tag name
# <parent rel="mother" name="Alice"></parent>

If you're looking for tag names that aren't valid Python identifiers (like hyphenated-name), you need to use find.
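
For example (a sketch):

from BeautifulSoup import BeautifulStoneSoup
xmlSoup = BeautifulStoneSoup('<doc><hyphenated-name>Value</hyphenated-name></doc>')
xmlSoup.find('hyphenated-name').string
# u'Value'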

Searching the Parse Tree

Beautiful Soup provides many methods that traverse the parse tree, gathering Tags and NavigableStrings that match criteria you specify.

There are several ways to define criteria for matching Beautiful Soup objects. Let's demonstrate by examining in depth the most basic of all Beautiful Soup search methods, findAll. As before, we'll demonstrate on the following document:

from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

Incidentally, the two methods described in this section (findAll and find) are available only to Tag objects and the top-level parser objects, not to NavigableString objects. The methods defined in Searching Within the Parse Tree are also available to NavigableString objects.
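
A quick sketch of the distinction, using the document above:

text = soup.b.string       # the NavigableString u'one'
text.findNext('p')         # methods from that section work on strings
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
text.findAll('b')
# AttributeError: 'NavigableString' object has no attribute 'findAll'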

The basic find method: findAll(name, attrs, recursive, text, limit, **kwargs)

The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature for the findAll method is this:

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.
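
For instance, both of these work against the document above (a quick preview of the arguments described below):

soup.findAll('b')                 # match on tag name
# [<b>one</b>, <b>two</b>]

soup.findAll('p', align="blah")   # tag name plus a keyword argument
# [<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]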