How to Use BeautifulSoup, a Python Web-Scraping Library

This article explains how to use BeautifulSoup, a Python library for web scraping, through a series of short examples.


1. Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.

Parsers commonly used with BeautifulSoup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser in this table that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses the way a browser does; produces valid HTML5	Slow; requires an external Python package
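
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and the preference order are assumptions made for illustration, not part of BeautifulSoup itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not a BeautifulSoup API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello<p>World")
print(parser_used)               # whichever parser was available first
print(len(soup.find_all("p")))   # both <p> tags are found by every parser
```

Different parsers can build different trees from the same malformed markup, so it is usually safer to name the parser explicitly once you know which one is installed.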

2. Quick Start

Given an HTML document, create a BeautifulSoup object

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the whole document, formatted

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the parsed document

print(soup.title)  # the <title> tag and its contents
print(soup.title.name)  # the tag's name: "title"
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # name of <title>'s parent tag (head)
print(soup.p)  # the first <p> tag
print(soup.p['class'])  # the class attribute of the first <p>
print(soup.a)  # the first <a> tag
print(soup.find_all('a'))  # every <a> tag
print(soup.find(id="link3"))  # the first tag with id="link3"

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extract the links from all <a> tags

for link in soup.find_all('a'):
  print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
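
Not every <a> tag carries an href. The sketch below, built on a made-up two-link snippet rather than the document above, shows how get() and square-bracket access differ when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Illustrative snippet: the second anchor deliberately has no href.
snippet = '<a href="http://example.com/elsie">Elsie</a><a id="no-href">Anchor only</a>'
soup = BeautifulSoup(snippet, "html.parser")

# get() returns None for a missing attribute instead of raising.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)  # ['http://example.com/elsie', None]

# href=True keeps only the tags that actually have the attribute.
with_href = [a["href"] for a in soup.find_all("a", href=True)]
print(with_href)  # ['http://example.com/elsie']

# Square brackets raise KeyError when the attribute is absent.
try:
    soup.find(id="no-href")["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```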

Get all the text content

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
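
The raw get_text() output keeps the whitespace from the source markup. As a small sketch on an illustrative snippet (not the document above), the separator and strip arguments, or the stripped_strings generator, give tidier results.

```python
from bs4 import BeautifulSoup

# Illustrative markup with extra whitespace around the text.
snippet = "<div><p> Once upon a time </p>\n<p> Elsie </p></div>"
soup = BeautifulSoup(snippet, "html.parser")

# strip=True trims each string and drops whitespace-only ones;
# separator is placed between the remaining pieces.
compact = soup.get_text(separator=" | ", strip=True)
print(compact)  # Once upon a time | Elsie

# stripped_strings yields the same pieces one at a time.
parts = list(soup.stripped_strings)
print(parts)  # ['Once upon a time', 'Elsie']
```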

Completing missing tags and formatting the output

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.prettify())  # pretty-print; missing closing tags are filled in
print(soup.title.string)  # the text inside the <title> tag
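
To see the tag completion in isolation, here is a minimal sketch with a deliberately truncated fragment. It uses the built-in html.parser so that it runs without extra installs; the lxml parser would additionally wrap the result in <html> and <body> tags.

```python
from bs4 import BeautifulSoup

# Neither <b> nor <p> is closed in the input.
broken = "<p>unclosed <b>bold"
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)
print(repaired)  # <p>unclosed <b>bold</b></p>
```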

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.title)  # select the <title> tag
print(type(soup.title))  # its type: bs4.element.Tag
print(soup.head)  # select the <head> tag

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.p.attrs['name'])  # value of the name attribute of the first <p>
print(soup.p['name'])  # shorter, equivalent form

Getting tag content

print(soup.p.string)
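
The .string attribute only returns a value when a tag has exactly one child to descend into. The sketch below, on an illustrative snippet, shows the three cases that tend to surprise: a single string, mixed children, and a comment like the one inside the first <a> tag of the sample document.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

snippet = '<p><b>bold</b> and plain</p><a id="link1"><!-- Elsie --></a>'
soup = BeautifulSoup(snippet, "html.parser")

only_child = soup.b.string   # a tag with a single string child
mixed = soup.p.string        # a tag with several children gives None
comment = soup.a.string      # a lone comment is returned as a Comment object

print(only_child)                    # bold
print(mixed)                         # None
print(isinstance(comment, Comment))  # True

# get_text() joins all strings and leaves comments out.
print(soup.p.get_text())  # bold and plain
print(repr(soup.a.get_text()))  # ''
```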

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.head.title.string)

Children and descendants

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
      and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.p.contents)  # the direct children of the first <p>, as a list

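Another approach, children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.p.children)  # an iterator over the direct children of the first <p>
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

The output is the same as above, with an index added. Note that children returns an iterator rather than a list, so printing it directly shows only the iterator object; iterate over it (with a for loop, or by passing it to list()) to see the child nodes.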
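
Both contents and children include the whitespace strings that sit between tags. A minimal sketch on an illustrative snippet, showing one way to keep only the element children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<p>\n<a>Elsie</a>\n<a>Lacie</a>\n</p>"
soup = BeautifulSoup(snippet, "html.parser")

# contents is a list: three newline strings plus the two <a> tags.
all_children = soup.p.contents
print(len(all_children))  # 5

# children is an iterator; filter on Tag to skip the text nodes.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
names = [tag.get_text() for tag in tag_children]
print(names)  # ['Elsie', 'Lacie']
```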

Getting descendants:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.p.descendants)  # an iterator over every descendant of the first <p>
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)
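
The difference between the two iterators is depth: children stops at the first level, while descendants walks the whole subtree in document order. A small sketch on an illustrative snippet without whitespace makes the counts easy to check.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<p><a><span>Elsie</span></a><a>Lacie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")

direct = list(soup.p.children)       # the two <a> tags
nested = list(soup.p.descendants)    # a, span, 'Elsie', a, 'Lacie'
print(len(direct), len(nested))      # 2 5

# Tag names in the order descendants visits them.
tag_names = [node.name for node in nested if isinstance(node, Tag)]
print(tag_names)  # ['a', 'span', 'a']
```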

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(soup.a.parent)  # the direct parent of the first <a> tag

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(list(enumerate(soup.a.parents)))  # every ancestor of the first <a> tag, nearest first
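
Printing whole ancestor tags produces a lot of output. As a sketch on a tiny illustrative document, printing only the names shows the chain more clearly, including the BeautifulSoup object itself at the top.

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# parents walks upward from the tag to the document root.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```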

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # use the lxml parser
print(list(enumerate(soup.a.next_siblings)))  # siblings that follow the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that precede the first <a> tag
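
Siblings include text nodes as well as tags, and previous_siblings runs backwards from the starting node. A minimal sketch on an illustrative one-line snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# After the first <a>: ', ', <a link2>, ' and ', <a link3>
following = list(soup.a.next_siblings)
print(len(following))  # 4

# Keep only the tags among the following siblings.
following_ids = [node["id"] for node in following if isinstance(node, Tag)]
print(following_ids)  # ['link2', 'link3']

# previous_siblings yields the nearest sibling first.
last = soup.find(id="link3")
preceding_ids = [node["id"] for node in last.previous_siblings if isinstance(node, Tag)]
print(preceding_ids)  # ['link2', 'link1']
```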

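Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.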

name

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
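print(soup.find_all('ul'))  # every <ul> tag, returned as a list
print(type(soup.find_all('ul')[0]))  # each item is a bs4.element.Tag

Because each result is a Tag, find_all can be called on it again. The next example finds the <li> tags inside every <ul> tag: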

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))
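
The name argument accepts more than a single string. The sketch below, on an illustrative snippet, tries a list of names, a compiled regular expression, and the limit and recursive arguments.

```python
import re

from bs4 import BeautifulSoup

snippet = '<div><h5>Hello</h5><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(snippet, "html.parser")

# A list matches any of the given tag names.
by_list = [tag.name for tag in soup.find_all(["h5", "ul"])]
print(by_list)  # ['h5', 'ul']

# A compiled regular expression is matched against each tag name.
by_regex = [tag.name for tag in soup.find_all(re.compile("^(ul|li)$"))]
print(by_regex)  # ['ul', 'li', 'li', 'li']

# limit stops the search after that many matches.
first_two = [tag.get_text() for tag in soup.find_all("li", limit=2)]
print(first_two)  # ['Foo', 'Bar']

# recursive=False looks at direct children only; the <li> tags are grandchildren.
direct_li = soup.div.find_all("li", recursive=False)
print(direct_li)  # []
```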

attrs (attributes)

Find elements by their attributes

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls find the same element, because the two attributes belong to the same tag.

Some attributes can be passed directly as keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
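
Two details are worth a quick check: class_ matches any one of a tag's classes, and attribute names that are not valid Python identifiers have to go through the attrs dict. The snippet and the data-role attribute below are illustrative additions, not part of the sample document.

```python
from bs4 import BeautifulSoup

snippet = (
    '<ul class="list" id="list-1" data-role="menu"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# class_ matches a tag if any of its classes equals the value.
with_list = [ul["id"] for ul in soup.find_all(class_="list")]
print(with_list)  # ['list-1', 'list-2']

small_only = [ul["id"] for ul in soup.find_all(class_="list-small")]
print(small_only)  # ['list-2']

# data-role contains a hyphen, so it cannot be a keyword argument.
menus = [ul["id"] for ul in soup.find_all(attrs={"data-role": "menu"})]
print(menus)  # ['list-1']
```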

text

Select by text content:

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # finds strings equal to "Foo"; returns the strings, not the tags

So text is handy for checking whether some content is present, but less convenient for locating the elements that contain it.
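
When the enclosing element is what you need, there are two common ways to get from the matched text back to its tag. This sketch assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string; the snippet is illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(snippet, "html.parser")

# Option 1: match the strings, then step up to each string's parent tag.
strings = soup.find_all(string="Foo")
parents = [s.parent for s in strings]
print(parents[0].name)  # li

# Option 2: combine a tag name with string= to get the tags directly.
tags = soup.find_all("li", string="Foo")
print(tags)  # [<li class="element">Foo</li>]
```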

Methods

find

find() takes the same arguments as find_all(), but returns only the first matching element instead of a list.
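
The two methods also differ when nothing matches, which matters for error handling. A minimal sketch on an illustrative snippet:

```python
from bs4 import BeautifulSoup

snippet = "<ul><li>Foo</li><li>Bar</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

first = soup.find("li")
print(first.get_text())  # Foo

# No match: find() gives None, find_all() gives an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one, missing_all)  # None []

# Guard before using the result of find().
label = missing_one.get_text() if missing_one is not None else "(no table)"
print(label)  # (no table)
```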

find_parents(), find_parent()

find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor.

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all matching siblings that follow the node; find_next_sibling() returns the first one.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all matching siblings that precede the node; find_previous_sibling() returns the first one.

find_all_next(), find_next()

find_all_next() returns all matching nodes that come after the node in the document; find_next() returns the first one.

find_all_previous(), find_previous()

find_all_previous() returns all matching nodes that come before the node in the document; find_previous() returns the first one.
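
The methods above can be tried side by side from a single starting node. The sketch below starts from the second <li> of an illustrative snippet; the variable names are only for this example.

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="panel-body">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li></ul>'
    '</div>'
)
soup = BeautifulSoup(snippet, "html.parser")
bar = soup.find_all("li")[1]  # the <li> containing "Bar"

parent_id = bar.find_parent("ul")["id"]
ancestor_class = bar.find_parent("div")["class"]
next_item = bar.find_next_sibling("li").get_text()
previous_item = bar.find_previous_sibling("li").get_text()
next_list = bar.find_next("ul")["id"]
later_items = [li.get_text() for li in bar.find_all_next("li")]
earlier_items = [li.get_text() for li in bar.find_all_previous("li")]

print(parent_id)       # list-1
print(ancestor_class)  # ['panel-body']
print(next_item)       # Jay
print(previous_item)   # Foo
print(next_list)       # list-2
print(later_items)     # ['Jay', 'Foo']
print(earlier_items)   # ['Foo']
```

Note that find_next() and find_all_next() are not limited to siblings: they continue through the rest of the document, which is why the <li> in the second list shows up in later_items.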

CSS selectors

Pass a CSS selector string directly to select() to select the matching elements.

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # "." selects by class; a space separates ancestor from descendant
print(soup.select('ul li'))  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))  # "#" selects by id: elements with class "element" inside the tag with id "list-2"
print(type(soup.select('ul')[0]))  # each result is a bs4.element.Tag
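
A few more selector forms, sketched on a compact version of the same markup. This assumes a Beautiful Soup release recent enough to bundle the Soup Sieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

snippet = (
    '<div class="panel-body">'
    '<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li>Foo</li><li>Bar</li></ul>'
    '</div>'
)
soup = BeautifulSoup(snippet, "html.parser")

# select_one() returns the first match, or None when nothing matches.
first_item = soup.select_one("#list-1 > li").get_text()
no_table = soup.select_one("table")

# Tag plus class, with the child combinator ">".
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

# Attribute selector and a positional pseudo-class.
second_list = soup.select('ul[id="list-2"]')
third_items = [li.get_text() for li in soup.select("li:nth-of-type(3)")]

print(first_item)        # Foo
print(no_table)          # None
print(small_items)       # ['Foo', 'Bar']
print(len(second_list))  # 1
print(third_items)       # ['Jay']
```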

Selections can also be nested:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
  print(ul['id'])  # square brackets return the attribute value
  print(ul.attrs['id'])  # equivalent, via the attrs dict
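
Most attributes come back as strings, but class is treated as multi-valued and comes back as a list. A short sketch on an illustrative tag:

```python
from bs4 import BeautifulSoup

snippet = '<ul class="list list-small" id="list-2"></ul>'
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.select_one("ul")

print(ul["id"])        # list-2
print(ul["class"])     # ['list', 'list-small']
print(ul.get("name"))  # None, because the tag has no name attribute

# attrs holds every attribute of the tag as a dict.
all_attrs = ul.attrs
print(all_attrs)
```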

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
  print(li.get_text())

The get_text() method returns the text content of each element.
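
Putting select(), attribute access, and get_text() together, the following sketch collects the items of each list into a dictionary keyed by the list's id. The name items_by_list is an assumption for this example, and the markup is a compact copy of the panel used above.

```python
from bs4 import BeautifulSoup

snippet = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>'
    '</ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Map each <ul> id to the text of its <li> children.
items_by_list = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}
print(items_by_list)
# {'list-1': ['Foo', 'Bar', 'Jay'], 'list-2': ['Foo', 'Bar']}
```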
