This document is also available in Chinese translation.
This document is also available in Russian translation (external link).
This document is also available in Japanese translation (external link).
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.
This document illustrates all major features of Beautiful Soup version 3.0, with examples. It shows you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.
Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation.
Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
Get Beautiful Soup here. The changelog describes differences between 3.0 and earlier versions.
Include Beautiful Soup in your application with a line like one of the following:
from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
If you get the message "No module named BeautifulSoup", but you know Beautiful Soup is installed, you're probably using the Beautiful Soup 4 beta. Use this code instead:
from bs4 import BeautifulSoup # To get everything
This document only covers Beautiful Soup 3. Beautiful Soup 4 has some slight differences; see the README.txt file for details.
Here's some code demonstrating the basic features of Beautiful Soup. You can copy and paste this code into a Python session to run it yourself.
from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
Here are some ways to navigate the soup:
soup.contents[0].name
# u'html'
soup.contents[0].contents[0].name
# u'head'
head = soup.contents[0].contents[0]
head.parent.name
# u'html'
head.next
# <title>Page title</title>
head.nextSibling.name
# u'body'
head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
Here are a couple of ways to search the soup for certain tags, or tags with certain properties:
titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'
It's easy to modify the soup:
titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>
soup.p.extract()
soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
soup.p.replaceWith(soup.b)
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>
soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")
soup.body
# <body>This page used to have <b>two</b> <p> tags!</body>
Here's a real-world example. It fetches the ICC Commercial Crime Services weekly piracy report, parses it with Beautiful Soup, and pulls out the piracy incidents:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
A Beautiful Soup constructor takes an XML or HTML document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory.
If you give Beautiful Soup a perfectly-formed document, the parsed data structure looks just like the original document. But if there's something wrong with the document, Beautiful Soup uses heuristics to figure out a reasonable structure for the data structure.
Use the BeautifulSoup class to parse an HTML document. Among other things, BeautifulSoup knows which HTML tags can be nested (like <BLOCKQUOTE>) and which can't (like <P>), and it uses that knowledge when building the tree. Here it is in action:
from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()
# <html>
#  <p>
#   Para 1
#  </p>
#  <p>
#   Para 2
#   <blockquote>
#    Quote 1
#    <blockquote>
#     Quote 2
#    </blockquote>
#   </blockquote>
#  </p>
# </html>
Note that BeautifulSoup figured out sensible places to put the closing tags, even though the original document lacked them.
That document isn't valid HTML, but it's not too bad either. Here's a really horrible document. Among other problems, it's got a <FORM> tag that starts outside of a <TABLE> tag and ends inside the <TABLE> tag. (HTML like this was found on a website run by a major web company.)
from BeautifulSoup import BeautifulSoup html = """ <html> <form> <table> <td><input name="input1">Row 1 cell 1 <tr><td>Row 2 cell 1 </form> <td>Row 2 cell 2<br>This</br> sure is a long cell </body> </html>"""
Beautiful Soup handles this document as well:
print BeautifulSoup(html).prettify()
# <html>
#  <form>
#   <table>
#    <td>
#     <input name="input1" />
#     Row 1 cell 1
#    </td>
#    <tr>
#     <td>
#      Row 2 cell 1
#     </td>
#    </tr>
#   </table>
#  </form>
#  <td>
#   Row 2 cell 2
#   <br />
#   This
#   sure is a long cell
#  </td>
# </html>
The last cell of the table is outside the <TABLE> tag; Beautiful Soup decided to close the <TABLE> tag when it closed the <FORM> tag. The author of the original document probably intended the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a bizarre case like this, Beautiful Soup parses the invalid document and gives you access to all the data.
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting. Here it is in action:
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
# <doc>
#  <tag1>
#   Contents 1
#   <tag2>
#    Contents 2
#   </tag2>
#  </tag1>
#  <tag1>
#   Contents 3
#  </tag1>
# </doc>
The most common shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on what the DTD says. You can tell BeautifulStoneSoup that certain tags are self-closing by passing in their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
#  Text 1
#  <selfclosing>
#   Text 2
#  </selfclosing>
# </tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
#  Text 1
#  <selfclosing />
#  Text 2
# </tag>
There are several other parser classes with different heuristics from these two. You can also subclass and customize a parser and give it your own heuristics.
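To give a concrete sense of what "a parser with its own heuristics" means, here is a toy illustration, not Beautiful Soup code, built on Python's standard html.parser module (Python 3 syntax; the class name and its single heuristic are invented for this sketch). It encodes one rule BeautifulSoup itself applies: a new <P> implicitly closes an already-open <P>.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Toy parser with one heuristic: a <p> implicitly closes an open <p>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []      # record of start/end events, in order
        self.open_p = False   # is a <p> currently open?

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and self.open_p:
            self.events.append('end p')   # inferred closing tag
        self.events.append('start ' + tag)
        if tag == 'p':
            self.open_p = True

    def handle_endtag(self, tag):
        self.events.append('end ' + tag)
        if tag == 'p':
            self.open_p = False

parser = OutlineParser()
parser.feed("<p>Para 1<p>Para 2")
print(parser.events)
# ['start p', 'end p', 'start p']
```

A subclass with different rules (say, allowing nested <p>) would produce a different event stream from the same markup, which is exactly the kind of variation the alternate Beautiful Soup parser classes provide.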
By the time your document is parsed, it has been transformed into Unicode. Beautiful Soup stores only Unicode strings in its data structures.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'
Here's an example with a Japanese document encoded in UTF-8:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf")
soup.contents[0]
# u'\u3053\u308c\u306f'
soup.originalEncoding
# 'utf-8'
str(soup)
# '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

# Note: this bit uses EUC-JP, so it only works if you have cjkcodecs
# installed, or are running Python 2.4.
soup.__str__('euc-jp')
# '\xa4\xb3\xa4\xec\xa4\xcf'
Beautiful Soup uses a class called UnicodeDammit to detect the encodings of documents you give it and convert them to Unicode, no matter what. If you need to do this for other documents (without using Beautiful Soup to parse them), you can use UnicodeDammit by itself. It's heavily based on code from the Universal Feed Parser.

If you're running a version of Python older than 2.4, be sure to download and install cjkcodecs and iconvcodec, which make Python capable of supporting more codecs, especially CJK codecs. Also install the chardet library, for better autodetection.
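If you only need the "try candidate encodings until one works" part of this behavior, without Beautiful Soup, the core idea fits in a few lines. The function name and encoding list below are illustrative, not part of any library; the windows-1252 fallback mirrors what Beautiful Soup does when it runs out of guesses.

```python
def decode_with_fallback(data, encodings=('utf-8', 'euc-jp')):
    # Try each candidate encoding in priority order.
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: windows-1252 with replacement characters.
    return data.decode('windows-1252', 'replace'), 'windows-1252'

text, encoding = decode_with_fallback(b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf')
print(encoding)
# utf-8
```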
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

1. An encoding you pass in as the fromEncoding argument to the soup constructor.
2. An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
3. An encoding sniffed by the chardet library, if you have it installed.
4. UTF-8.
5. Windows-1252.
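The http-equiv sniff in step 2 amounts to scanning the top of the document for a charset declaration before committing to an encoding. A minimal, hypothetical version of that scan follows (the function name is invented; the real detection in UnicodeDammit handles many more declaration forms):

```python
import re

def sniff_meta_charset(data):
    # Look for something like:
    #   <meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1">
    # Only the first 1024 bytes are scanned, since the declaration
    # should appear near the top of the document.
    match = re.search(br'charset=["\']?([\w.-]+)', data[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii')
    return None

doc = b'<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1">'
print(sniff_meta_charset(doc))
# ISO-8859-1
```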
Beautiful Soup will almost always guess right if it can make a guess at all. But for documents with no declarations and in strange encodings, it will often not be able to guess. It will fall back to Windows-1252, which will probably be wrong. Here's an EUC-JP example where Beautiful Soup guesses the encoding wrong. (Again, because it uses EUC-JP, this example will only work if you are running Python 2.4 or have cjkcodecs installed):
from BeautifulSoup import BeautifulSoup
euc_jp = '\xa4\xb3\xa4\xec\xa4\xcf'
soup = BeautifulSoup(euc_jp)
soup.originalEncoding
# 'windows-1252'
str(soup)
# '\xc2\xa4\xc2\xb3\xc2\xa4\xc3\xac\xc2\xa4\xc3\x8f'   # Wrong!
But if you specify the encoding with fromEncoding, it parses the document correctly, and can convert it to UTF-8 or back to EUC-JP.
soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
soup.originalEncoding
# 'euc-jp'
str(soup)
# '\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'   # Right!
soup.__str__('euc-jp') == euc_jp
# True
If you give Beautiful Soup a document in the Windows-1252 encoding (or a similar encoding like ISO-8859-1 or ISO-8859-2), Beautiful Soup finds and destroys the document's smart quotes and other Windows-specific characters. Rather than transforming those characters into their Unicode equivalents, Beautiful Soup transforms them into HTML entities (BeautifulSoup) or XML entities (BeautifulStoneSoup).
To prevent this, you can pass smartQuotesTo=None into the soup constructor: then smart quotes will be converted to Unicode like any other native-encoding characters. You can also pass in "xml" or "html" for smartQuotesTo, to change the default behavior of BeautifulSoup and BeautifulStoneSoup.
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
text = "Deploy the \x91SMART QUOTES\x92!"
str(BeautifulSoup(text))
# 'Deploy the &lsquo;SMART QUOTES&rsquo;!'
str(BeautifulStoneSoup(text))
# 'Deploy the &#x2018;SMART QUOTES&#x2019;!'
str(BeautifulSoup(text, smartQuotesTo="xml"))
# 'Deploy the &#x2018;SMART QUOTES&#x2019;!'
BeautifulSoup(text, smartQuotesTo=None).contents[0]
# u'Deploy the \u2018SMART QUOTES\u2019!'
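The substitution itself is mechanical: each Windows-1252 smart-quote byte maps to a fixed entity. Here is a stripped-down, hypothetical illustration covering just the four quote characters (Beautiful Soup's own table covers every Windows-specific character, and can emit either HTML or XML entities):

```python
# Hypothetical mini-table: Windows-1252 byte -> HTML entity
SMART_QUOTES_TO_HTML = {
    0x91: '&lsquo;',  # left single quote
    0x92: '&rsquo;',  # right single quote
    0x93: '&ldquo;',  # left double quote
    0x94: '&rdquo;',  # right double quote
}

def convert_smart_quotes(data):
    # bytearray iteration yields integers in both Python 2 and 3
    return ''.join(SMART_QUOTES_TO_HTML.get(byte, chr(byte))
                   for byte in bytearray(data))

print(convert_smart_quotes(b'Deploy the \x91SMART QUOTES\x92!'))
# Deploy the &lsquo;SMART QUOTES&rsquo;!
```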
You can turn a Beautiful Soup document (or any subset of it) into a string with the str function, or the prettify or renderContents methods. You can also use the unicode function to get the whole document as a Unicode string.
The prettify method adds strategic newlines and spacing to make the structure of the document obvious. It also strips out text nodes that contain only whitespace, which might change the meaning of an XML document. The str and unicode functions don't strip out text nodes that contain only whitespace, and they don't add any whitespace between nodes either.
Here's an example.
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
str(soup)
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.renderContents()
# '<html><h1>Heading</h1><p>Text</p></html>'
soup.__str__()
# '<html><h1>Heading</h1><p>Text</p></html>'
unicode(soup)
# u'<html><h1>Heading</h1><p>Text</p></html>'
soup.prettify()
# '<html>\n <h1>\n  Heading\n </h1>\n <p>\n  Text\n </p>\n</html>'
print soup.prettify()
# <html>
#  <h1>
#   Heading
#  </h1>
#  <p>
#   Text
#  </p>
# </html>
Note that str and renderContents give different results when used on a tag within the document. str prints a tag and its contents, and renderContents only prints the contents.
heading = soup.h1
str(heading)
# '<h1>Heading</h1>'
heading.renderContents()
# 'Heading'
When you call __str__, prettify, or renderContents, you can specify an output encoding. The default encoding (the one used by str) is UTF-8. Here's an example that parses an ISO-8859-1 string and then outputs the same string in different encodings:
from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
# 'Sacr\xc3\xa9 bleu!'   # UTF-8
soup.__str__("ISO-8859-1")
# 'Sacr\xe9 bleu!'
soup.__str__("UTF-16")
# '\xff\xfeS\x00a\x00c\x00r\x00\xe9\x00 \x00b\x00l\x00e\x00u\x00!\x00'
soup.__str__("EUC-JP")
# 'Sacr\x8f\xab\xb1 bleu!'
If the original document contained an encoding declaration, then Beautiful Soup rewrites the declaration to mention the new encoding when it converts the document back to a string. This means that if you load an HTML document into BeautifulSoup and print it back out, not only should the HTML be cleaned up, but it should be transparently converted to UTF-8.
Here's an HTML example:
from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""

print BeautifulSoup(doc).prettify()
# <html>
#  <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
#  Sacré bleu!
# </html>
Here's an XML example:
from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1">Sacr\xe9 bleu!"""
print BeautifulStoneSoup(doc).prettify()
# <?xml version='1.0' encoding='utf-8'>
# Sacré bleu!
So far we've focused on loading documents and writing them back out. Most of the time, though, you're interested in the parse tree: the data structure Beautiful Soup builds as it parses the document.
A parser object (an instance of BeautifulSoup or BeautifulStoneSoup) is a deeply-nested, well-connected data structure that corresponds to the structure of an XML or HTML document. The parser object contains two other types of objects: Tag objects, which correspond to tags like the <TITLE> tag and the <B> tags; and NavigableString objects, which correspond to strings like "Page title" and "This is paragraph".
There are also some subclasses of NavigableString (CData, Comment, Declaration, and ProcessingInstruction), which correspond to special XML constructs. They act like NavigableStrings, except that when it's time to print them out they have some extra data attached to them. Here's a document that includes a comment:
from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want.-->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))
comment.__class__
# <class 'BeautifulSoup.Comment'>
comment
# u"I've got to be nice to get what I want."
comment.previousSibling
# u'Hello! '
str(comment)
# "<!--I've got to be nice to get what I want.-->"
print commentSoup
# Hello! <!--I've got to be nice to get what I want.-->
Now, let's take a closer look at the document used at the beginning of the documentation:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
Tag and NavigableString objects have lots of useful members, most of which are covered in Navigating the Parse Tree and Searching the Parse Tree.
However, there's one aspect of Tag objects we'll cover here: the attributes. SGML tags have attributes: for instance, each of the <P> tags in the example HTML above has an "id" attribute and an "align" attribute. You can access a tag's attributes by treating the Tag object as though it were a dictionary:
firstPTag, secondPTag = soup.findAll('p')
firstPTag['id']
# u'firstpara'
secondPTag['id']
# u'secondpara'
NavigableString objects don't have attributes; only Tag objects have them.
All Tag objects have all of the members listed below (though the actual value of the member may be None). NavigableString objects have all of them except for contents and string.
parent

In the example above, the parent of the <HEAD> Tag is the <HTML> Tag. The parent of the <HTML> Tag is the BeautifulSoup parser object itself. The parent of the parser object is None. By following parent, you can move up the parse tree:
soup.head.parent.name
# u'html'
soup.head.parent.parent.__class__.__name__
# 'BeautifulSoup'
soup.parent == None
# True
contents

With parent you move up the parse tree. With contents you move down the tree. contents is an ordered list of the Tag and NavigableString objects contained within a page element. Only the top-level parser object and Tag objects have contents. NavigableString objects are just strings and can't contain sub-elements, so they don't have contents.
In the example above, the contents of the first <P> Tag is a list containing a NavigableString ("This is paragraph "), a <B> Tag, and another NavigableString ("."). The contents of the <B> Tag is a list containing a NavigableString ("one").
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
pTag.contents[1].contents
# [u'one']
pTag.contents[0].contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
string

For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0].
In the example above, soup.b.string is a NavigableString representing the Unicode string "one". That's the string contained in the first <B> Tag in the parse tree.
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
But soup.p.string is None, because the first <P> Tag in the parse tree has more than one child. soup.head.string is also None, even though the <HEAD> Tag has only one child, because that child is a Tag (the <TITLE> Tag), not a NavigableString.
soup.p.string == None
# True
soup.head.string == None
# True
nextSibling and previousSibling

These members let you skip to the next or previous thing on the same level of the parse tree. In the document above, the nextSibling of the <HEAD> Tag is the <BODY> Tag, because the <BODY> Tag is the next thing directly beneath the <HTML> Tag. The nextSibling of the <BODY> tag is None, because there's nothing else directly beneath the <HTML> Tag.
soup.head.nextSibling.name
# u'body'
soup.html.nextSibling == None
# True
Conversely, the previousSibling of the <BODY> Tag is the <HEAD> Tag, and the previousSibling of the <HEAD> Tag is None:
soup.body.previousSibling.name
# u'head'
soup.head.previousSibling == None
# True
Some more examples: the nextSibling of the first <P> Tag is the second <P> Tag. The previousSibling of the <B> Tag inside the second <P> Tag is the NavigableString "This is paragraph ". The previousSibling of that NavigableString is None, not anything inside the first <P> Tag.
soup.p.nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
secondBTag = soup.findAll('b')[1]
secondBTag.previousSibling
# u'This is paragraph '
secondBTag.previousSibling.previousSibling == None
# True
next and previous

These members let you move through the document elements in the order they were processed by the parser, rather than in the order they appear in the tree. For instance, the next of the <HEAD> Tag is the <TITLE> Tag, not the <BODY> Tag. This is because, in the original document, the <TITLE> tag comes immediately after the <HEAD> tag.
soup.head.next
# <title>Page title</title>
soup.head.nextSibling.name
# u'body'
soup.head.previous.name
# u'html'
Where next and previous are concerned, a Tag's contents come before its nextSibling. You usually won't have to use these members, but sometimes it's the easiest way to get to something buried inside the parse tree.
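The document order that next and previous follow is just a preorder (depth-first) walk of the tree. A small sketch with a toy tree, using plain tuples rather than Beautiful Soup objects, shows why the <TITLE> tag falls between <HEAD> and <BODY> in that order:

```python
def preorder(node):
    # node is a (name, children) tuple; yield tag names in document order
    name, children = node
    yield name
    for child in children:
        for descendant in preorder(child):
            yield descendant

tree = ('html', [('head', [('title', [])]),
                 ('body', [])])
print(list(preorder(tree)))
# ['html', 'head', 'title', 'body']
```

The sibling order, by contrast, would visit only 'head' and 'body', which is exactly the difference between next and nextSibling.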
Iterating over a Tag

You can iterate over the contents of a Tag by treating it as a list. This is a useful shortcut. Similarly, to see how many child nodes a Tag has, you can call len(tag) instead of len(tag.contents). In terms of the document above:
for i in soup.body:
    print i
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

len(soup.body)
# 2
len(soup.body.contents)
# 2
It's easy to navigate the parse tree by acting as though the name of the tag you want is a member of a parser or Tag object. We've been doing it throughout these examples. In terms of the document above, soup.head gives us the first (and, as it happens, only) <HEAD> Tag in the document:
soup.head
# <head><title>Page title</title></head>
In general, calling mytag.foo returns the first child of mytag that happens to be a <FOO> Tag. If there aren't any <FOO> Tags beneath mytag, then mytag.foo returns None.
You can use this to traverse the parse tree very quickly:
soup.head.title
# <title>Page title</title>
soup.body.p.b.string
# u'one'
You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're not worried about <TITLE> tags in weird places outside of the <HEAD> tag, you can just use soup.title to get an HTML document's title. You don't have to use soup.head.title:
soup.title.string
# u'Page title'
soup.p jumps to the first <P> tag inside a document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
These members actually alias to the first method, covered below. I mention it here because the alias makes it very easy to zoom in on an interesting part of a well-known parse tree.
An alternate form of this idiom lets you access the first <FOO> tag as .fooTag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.tableTag.trTag.tdTag, or even soup.tableTag.tr.tdTag. This is useful if you like to be more explicit about what you're doing, or if you're parsing XML whose tag names conflict with the names of Beautiful Soup methods and members.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.person.parent        # A Beautiful Soup member
# <person name="Bob"><parent rel="mother" name="Alice"></parent></person>
xmlSoup.person.parentTag     # A tag name
# <parent rel="mother" name="Alice"></parent>
If you're looking for tag names that aren't valid Python identifiers (like hyphenated-name), you need to use find.
Beautiful Soup provides many methods that traverse the parse tree, gathering Tags and NavigableStrings that match criteria you specify.
There are several ways to define criteria for matching Beautiful Soup objects. Let's demonstrate by examining in depth the most basic of all Beautiful Soup search methods, findAll. As before, we'll demonstrate on the following document:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>
Incidentally, the two methods described in this section (findAll and find) are available only to Tag objects and the top-level parser objects, not to NavigableString objects. The methods defined in Searching Within the Parse Tree are also available to NavigableString objects.
findAll(name, attrs, recursive, text, limit, **kwargs)
The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature for the findAll method is this:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.
The name argument restricts the set of tags by name. There are several ways to restrict the name, and these too show up over and over again throughout the Beautiful Soup API.
The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:
soup.findAll('b')
# [<b>one</b>, <b>two</b>]
You can also pass in a regular expression. This code finds all the tags whose names start with B:
import re
tagsStartingWithB = soup.findAll(re.compile('^b'))
[tag.name for tag in tagsStartingWithB]
# [u'body', u'b', u'b']
You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They work the same way, but the second call runs faster:
soup.findAll(['title', 'p'])
# [<title>Page title</title>,
#  <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
soup.findAll({'title' : True, 'p' : True})
# [<title>Page title</title>,
#  <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
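The speed difference between the two forms most likely comes down to how Python tests membership: checking a candidate tag name against a list is a linear scan, while checking it against a dictionary is a single hash lookup. The two operations side by side:

```python
tag_names_list = ['title', 'p']
tag_names_dict = {'title': True, 'p': True}

# Both tests give the same answer...
print('p' in tag_names_list)   # ...but this scans the list element by element
print('p' in tag_names_dict)   # ...while this is one hash lookup
```

With only two names the difference is negligible; it matters when you match against a large set of tag names on a large document.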