Friday, August 24, 2018

Web Scraping with Python Study Notes (1): Building Scrapers

  I rate 《網站擷取-使用Python》 (Web Scraping With Python: Collecting More Data from the Modern Web) very highly. This book by Ryan Mitchell explains web scraping in a way that is both comprehensive and easy to follow, from crawling techniques for handling all kinds of website environments to the legal issues you may run into. The author offers simple, clear prototype approaches and case studies that give you a solid overview of web scraping.

  A second edition was published in 2018, updating the first edition's outdated code and adding new chapters. These study notes, however, cover only the first edition; for details, see the author's GitHub.

  The information field moves at breakneck speed, and web scraping sits right at the cutting edge. Data protection, bot detection, conserving server resources, and every clever trick for extracting data are locked in competition: at every moment someone is fighting the attack-and-defense battle of collecting data and having it collected. These techniques therefore go stale very quickly. Even so, the author provides many sound programming and problem-solving ideas that we can keep relying on as the technology changes.

  The book is best read with some basic Python experience. I read it after finishing 《Python自動化的樂趣-搞定重複瑣碎&單調無聊的工作》 (Automate the Boring Stuff with Python: Practical Programming for Total Beginners) by Al Sweigart, and found that the two books connect smoothly: most of the packages and modules introduced in Automate the Boring Stuff with Python continue to be used in Web Scraping with Python.

Chapter 1: Your First Scraper

Supplementary material
- Enter pip install virtualenv at the Command Prompt to download and install Virtualenv.
- Keep your libraries tidy with a virtual environment at the Command Prompt:
# Create a scrapingEnv folder in the current working directory
# (going forward, I recommend standardizing on Anaconda virtual environments)
virtualenv scrapingEnv
cd scrapingEnv
scripts\activate
deactivate

Requests and BeautifulSoup
- Enter pip install requests at the Command Prompt to download and install Requests.
- Enter pip install beautifulsoup4 at the Command Prompt to download and install BeautifulSoup.
- import requests, bs4
aRes = requests.get("http://pythonscraping.com/pages/page1.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
print(aBeaSou.select("h1"))
print(aBeaSou.h1)

Chapter 2: Advanced HTML Parsing

The .find() and .findAll() methods of BeautifulSoup objects
- .find() is equivalent to .findAll() with limit set to 1 (see the sketch after the example below):
.find(tag, attributes, recursive, text, keywords)
.findAll(tag, attributes, recursive, text, limit, keywords)
tag: a tag name string or a list of them, e.g. "span" or ["h1", "h2"]
attributes: a dict of attributes and the values to match, e.g. {"id": "text"}
recursive: True (the default) traverses the tag's descendants; False looks only at the top-level tags
text: matches the tags' text content, e.g. text = "the prince"
keywords: keyword arguments matching attributes and values, e.g. id = "text"
- import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/warandpeace.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
for i in aBeaSou.findAll("span", {"class": "green"}):
    print(i.get_text())
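- A short sketch of the text, keyword, and limit parameters against the same page (my own example, not the book's; the printed counts depend on the page's current content):
import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/warandpeace.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
# text matches the tags' text content and returns NavigableString objects
print(len(aBeaSou.findAll(text = "the prince")))
# a keyword argument matches an attribute, equivalent to {"id": "text"}
print(aBeaSou.find(id = "text").get_text()[:60])
# limit stops .findAll() after the first two matches
print(len(aBeaSou.findAll("span", {"class": "green"}, limit = 2)))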

Traversing the tree
- Handling children (.children) and descendants (.descendants):
# If you only want the children, use .children; if you also want the deeper descendants, use .descendants
import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/page3.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
for i in aBeaSou.find("table", {"id": "giftList"}).children:
    print(i)
- Handling siblings (.next_siblings, .previous_siblings, .next_sibling, .previous_sibling):
import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/page3.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
for i in aBeaSou.find("table", {"id": "giftList"}).tr.next_siblings:
    print(i)
- Handling parents (.parents and .parent):
import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/page3.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
print(aBeaSou.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())

Combining BeautifulSoup with regular expressions
- Regular expressions:
import re
aBox = "Hell"
aRegex = re.compile(r"(?!"+aBox+")o World")
aMatch = aRegex.search("Hello World")
print(aMatch.group())
→ o World
- import requests, bs4, re
aRes = requests.get("http://www.pythonscraping.com/pages/page3.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
for i in aBeaSou.findAll("img", {"src": re.compile("\.\./img/gifts/img.*\.jpg")}):
    print(i["src"])

Combining BeautifulSoup with lambda expressions
- Lambda expressions:
square = lambda a: a**2
print(square(2))
→ 4
- When combining them with BeautifulSoup, the lambda is restricted to taking a single tag object argument and returning a Boolean:
import requests, bs4
aRes = requests.get("http://www.pythonscraping.com/pages/page3.html")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
for i in aBeaSou.findAll(lambda tag: len(tag.attrs) == 2):
    print(i)
    print("\n")

The focus of this book is BeautifulSoup
- Besides the BeautifulSoup object and Tag objects, bs4 also provides NavigableString objects, which represent the text inside tags, and Comment objects, used to find the text inside HTML comment tags (see the sketch after this list).
- Other parsing libraries:
lxml
HTML Parser
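- A minimal sketch (my own, not from the book) showing all four bs4 object types on a made-up HTML snippet:
import bs4
html = "<p class='x'><!-- a comment -->hello</p>"
aBeaSou = bs4.BeautifulSoup(html, "html.parser")  # BeautifulSoup object
aTag = aBeaSou.p                                  # Tag object
aComment = aTag.contents[0]                       # Comment object (a subclass of NavigableString)
aText = aTag.contents[1]                          # NavigableString object
print(type(aComment), aComment)
print(type(aText), aText)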

Chapter 3: Starting to Crawl

Traversing internal domains
- import requests, bs4, re, random, datetime

# Pseudo-random numbers and the random seed
random.seed(datetime.datetime.now())

def getLinks(pageUrl):
    aRes = requests.get("http://en.wikipedia.org" + pageUrl)
    aRes.raise_for_status()
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    # Wikipedia article links
    return aBeaSou.find("div", {"id": "bodyContent"}).findAll("a", href = re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Keanu_Reeves")
while len(links) > 0:
    aRandomLink = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(aRandomLink)
    links = getLinks(aRandomLink)

Crawling an entire site
- import requests, bs4, re

links = set()

def getLinks(pageUrl):
    global links
    aRes = requests.get("http://en.wikipedia.org" + pageUrl)
    aRes.raise_for_status()
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    # Wikipedia article title
    print(aBeaSou.h1.get_text())
    # First paragraph of the Wikipedia article
    print(aBeaSou.find(id = "mw-content-text").p.get_text())
    for link in aBeaSou.findAll("a", href = re.compile("^(/wiki/)")):
        # Avoid crawling the same page twice; note that Python's default recursion depth limit (how many times a program may call itself recursively) is 1000 (see the iterative sketch after this example)
        if link.attrs["href"] not in links:
            aNewLink = link.attrs["href"]
            print("\n" + aNewLink)
            links.add(aNewLink)
            getLinks(aNewLink)

getLinks("")

Traversing external domains
- import requests, bs4, re, random, datetime

random.seed(datetime.datetime.now())

def getInternalLinks(aBeaSou, addressPart):
    internalLinks = []
    for link in aBeaSou.findAll("a", href = re.compile("^(/|.*"+addressPart+")")):
        if not link.attrs["href"] in internalLinks:
            internalLinks.append(link.attrs["href"])
    return internalLinks

def getExternalLinks(aBeaSou, addressPart):
    externalLinks = []
    for link in aBeaSou.findAll("a", href = re.compile("^(http|www)((?!"+addressPart+").)*$")):
        if not link.attrs["href"] in externalLinks:
            externalLinks.append(link.attrs["href"])
    return externalLinks

# Use .replace() to strip http:// or https:// by replacing it with an empty string
def splitAddress(pageUrl):
    aMatch = re.compile(r"http://|https://").search(pageUrl)
    if aMatch == None:
        addressParts = pageUrl.split("/")
    else:
        addressParts = pageUrl.replace(aMatch.group(), "").split("/")
    return addressParts[0]

def getRandomExternalLink(pageUrl):
    aRes = requests.get(pageUrl)
    aRes.raise_for_status()
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    externalLinks = getExternalLinks(aBeaSou, splitAddress(pageUrl))
    if len(externalLinks) == 0:
        internalLinks = getInternalLinks(aBeaSou, splitAddress(pageUrl))
        return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks) - 1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

def followExternalOnly(pageUrl):
    externalLink = getRandomExternalLink(pageUrl)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)

followExternalOnly("http://oreilly.com")

Anaconda
- Before downloading and installing Scrapy on Windows, you must first download and install Anaconda:
Anaconda
Get Scrapy
Scrapy 1.5 documentation
- After installing Anaconda, enter the following in the Anaconda Prompt or the Command Prompt (add the path under "Edit the system environment variables", or tick the add-to-PATH option during installation) to manage conda:
# conda version
conda --version
# Update conda
conda update conda
conda update anaconda
- After installing Anaconda, enter the following in the Anaconda Prompt or Command Prompt (with PATH set up as above) to manage virtual environments:
# List all environments
conda info -e
conda info --envs
conda env list
# Create an environment
conda create -n <env name> python=<Python version> <package names>
conda create --name <env name> python=<Python version> <package names>
# Activate an environment
activate <env name>
# Deactivate the environment
deactivate
# Clone an environment
conda create -n <new env name> --clone <old env name>
# Remove an environment
conda remove -n <env name> --all
conda env remove -n <env name>
- After installing Anaconda, enter the following in the Anaconda Prompt or Command Prompt (with PATH set up as above) to manage Python:
# Python version
python --version
# Check which Python versions can be installed
conda search python
# Update Python
conda update python
- After installing Anaconda, enter the following in the Anaconda Prompt or Command Prompt (with PATH set up as above) to manage packages:
# List all packages
conda list -n <env name>
# Update all packages
conda update --all -n <env name>
# Check which packages can be installed
conda search <package name>
# Install a package
conda install -n <env name> <package name>
# Update a package
conda update -n <env name> <package name>
# Uninstall a package
conda remove -n <env name> <package name>
- All virtual environments are stored in the envs folder inside the Anaconda installation folder.

Scrapy
- Step 1: In the Anaconda Prompt or Command Prompt, create the virtual environment and install the Scrapy module:
conda create -n pythonScraping
activate pythonScraping
conda install scrapy
- Step 2: In the Anaconda Prompt or Command Prompt, create the project folder in the current working directory:
scrapy startproject wikiSpider
- Step 3.1: Modify code in the project folder:
# Modify C:\Users\Timmy\wikiSpider\wikiSpider\items.py
from scrapy import Item, Field
# Each Scrapy Item object represents a single page on the site
class Article(Item):
    # Collect only the title of each page
    title = Field()
- Step 3.2: Create code in the project folder (example 1):
# Create C:\Users\Timmy\wikiSpider\wikiSpider\spiders\articleSpiderOne.py
from scrapy import Spider
from wikiSpider.items import Article
class ArticleSpider(Spider):
    name = "articleOne"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page", "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    def parse(self, response):
        item = Article()
        title = response.xpath("//h1/text()")[0].extract()
        print("Title is: " + title)
        item["title"] = title
        return item
- Step 3.2: Create code in the project folder (example 2):
# Create C:\Users\Timmy\wikiSpider\wikiSpider\spiders\articleSpiderTwo.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from wikiSpider.items import Article
class ArticleSpider(CrawlSpider):
    name = "articleTwo"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [Rule(LinkExtractor(allow = ("(/wiki/)((?!:).)*$"),), callback = "parse_item", follow = True)]
    def parse_item(self, response):
        item = Article()
        title = response.xpath("//h1/text()")[0].extract()
        print("Title is: " + title)
        item["title"] = title
        return item
- Step 4: In the Anaconda Prompt or Command Prompt, run the project:
cd wikiSpider
# Call the scraper by the name defined by name = "articleOne" in articleSpiderOne.py
scrapy crawl articleOne
# Output the results into the project folder
scrapy crawl articleOne -o articles.csv -t csv
scrapy crawl articleOne -o articles.json -t json
scrapy crawl articleOne -o articles.xml -t xml
- Optional setting: displayed log records. There are 5 levels, from highest to lowest: CRITICAL, ERROR, WARNING, INFO, DEBUG:
# Modify C:\Users\Timmy\wikiSpider\wikiSpider\settings.py, adding one line
LOG_LEVEL = "ERROR"
# In the Anaconda Prompt or Command Prompt, run the project; only log records at the ERROR level or above are displayed
scrapy crawl articleOne
- Optional setting: log output, with the same 5 levels from highest to lowest (CRITICAL, ERROR, WARNING, INFO, DEBUG):
# In the Anaconda Prompt or Command Prompt, write the log records to a file in the project folder
scrapy crawl articleOne -s LOG_FILE=wiki.log

Chapter 4: Using APIs

APIs
- Most modern APIs (Application Programming Interfaces) require some form of authentication before they can be used. Once authenticated, the most common response formats are XML (eXtensible Markup Language) and JSON (JavaScript Object Notation); in recent years JSON has become far more prevalent than XML (a minimal sketch of authentication and JSON parsing follows the list below).
- The vast majority of APIs can be fully driven with these four actions:
import requests
# GET: request an object
aRes = requests.get("https://api.github.com/events")
# POST: send an object
bRes = requests.post("http://httpbin.org/post", data = {"key": "value"})
# PUT: update an object
cRes = requests.put("http://httpbin.org/put", data = {"key": "value"})
# DELETE: delete an object
dRes = requests.delete("http://httpbin.org/delete")
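- A minimal sketch of passing credentials and parsing a JSON response; httpbin.org merely echoes the request back, and the header and parameter names here are placeholders rather than any real API's requirements:
import requests
# Many APIs expect a token in a header or in a query-string parameter
aRes = requests.get("http://httpbin.org/get", headers = {"Authorization": "token myFakeToken"}, params = {"apikey": "myFakeKey"})
aRes.raise_for_status()
# .json() parses the JSON response body into Python dicts and lists
pythonValue = aRes.json()
print(pythonValue["args"])
print(pythonValue["headers"]["Authorization"])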

Combining scrapers with APIs
- This example can be paired with Visualization: GeoChart for data visualization:
import requests, bs4, re, random, datetime, json

random.seed(datetime.datetime.now())

def getLinks(pageUrl):
    aRes = requests.get("http://en.wikipedia.org" + pageUrl)
    aRes.raise_for_status()
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    return aBeaSou.find("div", {"id": "bodyContent"}).findAll("a", href = re.compile("^(/wiki/)((?!:).)*$"))

def getHistoryIPs(pageUrl):
    pageUrl = pageUrl.replace("/wiki/", "")
    historyUrl = "http://en.wikipedia.org/w/index.php?title=" + pageUrl + "&action=history"
    print("History url is: " + historyUrl)
    aRes = requests.get(historyUrl)
    aRes.raise_for_status()
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    # Only find links whose class is mw-anonuserlink; these contain IP addresses rather than usernames
    ipAddresses = aBeaSou.findAll("a", {"class": "mw-anonuserlink"})
    addressList = set()
    for ipAddress in ipAddresses:
        addressList.add(ipAddress.get_text())
    return addressList

def getCountry(historyIP):
    # This API endpoint is deprecated and will stop working on July 1st, 2018. For more information please visit: https://github.com/apilayer/freegeoip#readme
    aRes = requests.get("http://freegeoip.net/json/" + historyIP)
    aRes.raise_for_status()
    stringOfJsonData = aRes.text
    pythonValue = json.loads(stringOfJsonData)
    return pythonValue["country_code"]

links = getLinks("/wiki/Python_(programming_language)")
while len(links) > 0:
    for link in links:
        historyIPs = getHistoryIPs(link.attrs["href"])
        for historyIP in historyIPs:
            country = getCountry(historyIP)
            print(historyIP + " is from " + country)
    aNewLink = links[random.randint(0, len(links) - 1)].attrs["href"]
    links = getLinks(aNewLink)

Chapter 5: Storing Data

Downloading files
- import requests, bs4, os
from urllib.request import urlretrieve

dirName = "pythonScraping"
os.makedirs(dirName, exist_ok = True)

aRes = requests.get("http://www.pythonscraping.com")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")

# Download a single file
imageLocation = aBeaSou.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imageLocation, os.path.join(dirName, "logo.jpg"))

# Download all files
fileLocationList = aBeaSou.findAll(src = True)
for i in fileLocationList:
    aFileLocation = i["src"]
    try:
        print("Try to retrieve: " + aFileLocation)
        urlretrieve(aFileLocation, os.path.join(dirName, os.path.basename(aFileLocation)))
    except:
        print("Cannot retrieve: " + aFileLocation)

Combining scrapers with CSV
- import requests, bs4, csv
aRes = requests.get("http://en.wikipedia.org/wiki/Comparison_of_text_editors")
aRes.raise_for_status()
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
rows = aBeaSou.find("table", {"class": "wikitable"}).findAll("tr")
aFile = open("filesEditors.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for row in rows:
    csvRow = []
    for cell in row.findAll(["td", "th"]):
        csvRow.append(cell.get_text())
    aWriter.writerow(csvRow)
aFile.close()

MySQL, an open-source relational database management system
- Resources for learning MySQL:
MySQL Cookbook
MySQL超新手入門 (recommended)
- Downloading and installing MySQL and related tools:
When installing MySQL on Windows, I recommend choosing Server Only for the Setup Type to avoid installing a lot of extra Microsoft software and libraries; if you choose Server Only, you must pick Server Computer during the Configure step.
MySQL
MySQL Workbench
Tools such as phpMyAdmin make it easy to quickly view, sort, and insert data:
MySQL Workbench
phpMyAdmin
- Basic MySQL command-line operations: creating a database and a table:
-- Create a database
CREATE DATABASE testDB;
-- Use the database
USE testDB;
-- Create a table: a MySQL table cannot have zero columns; each column has a name, a data type, and other optional attributes, and the table's key is defined at the end
CREATE TABLE testTable (id BIGINT(7) NOT NULL AUTO_INCREMENT, title VARCHAR(200), content VARCHAR(10000), created TIMESTAMP DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (id));
-- Show the table structure
DESCRIBE testTable;
- Basic MySQL command-line operations: querying, inserting, updating, and deleting table records:
-- Using SELECT as an example, the order in which MySQL clauses are used:
-- SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY → LIMIT
-- SELECT: query table records
SELECT * FROM testTable WHERE title = "testTitle" AND content = "testContent";
SELECT * FROM testTable WHERE id BETWEEN 1 AND 5;
SELECT * FROM testTable WHERE id NOT IN (1, 3, 5);
SELECT * FROM testTable ORDER BY title DESC LIMIT 5;
SELECT id, title AS newTitle FROM testTable WHERE content LIKE "%test%";
SELECT DISTINCT title FROM testTable;
-- INSERT: add table records
INSERT INTO testTable (title, content) VALUES ("testTitle", "testContent");
-- UPDATE: modify table records
UPDATE testTable SET title = "newTitle", content = "newContent" WHERE id = 1;
-- DELETE: delete table records
DELETE FROM testTable WHERE id = 1;
-- Create and drop an index
CREATE INDEX testIndex ON testTable (title, content(16));
ALTER TABLE testTable DROP INDEX testIndex;
- Basic MySQL command-line operations: aggregate functions:
SELECT Min(id) FROM testTable;
SELECT Max(id) FROM testTable;
SELECT Avg(id) FROM testTable;
SELECT Sum(id) FROM testTable;
SELECT Count(id) FROM testTable;

The PyMySQL module
- Enter pip install PyMySQL at the Command Prompt to download and install PyMySQL.
- The PyMySQL module and related material:
PyMySQL 0.7.2 documentation
Python's Database API Specification v2.0
- One connection object (Connection) can have many cursor objects (Cursor); a cursor keeps track of certain state information:
import pymysql

aConn = pymysql.connect(host = "127.0.0.1", user = "root", passwd = "myPassword", port = 3306)
aCurs = aConn.cursor()
aCurs.execute("USE testDB")

# Fetch data
aCurs.execute("SELECT * FROM testTable WHERE id = 1")
print(aCurs.fetchone())

# Get the number of rows returned
aCurs.execute("SELECT * FROM testTable")
print(aCurs.rowcount)

# Get the id of the last inserted row
aCurs.execute("INSERT INTO testTable (title, content) VALUES (\"testTitle\", \"testContent\")")
aConn.commit()
print(aCurs.lastrowid)

aCurs.close()
aConn.close()

Combining scrapers with MySQL
- Storing Wikipedia data in MySQL: the MySQL settings. MySQL does not handle Unicode by default, so enable it by changing the database's and the table's default character set to utf8mb4 and the collation to utf8mb4_unicode_ci (a matching variant of the connect call appears after the code below):
ALTER DATABASE testDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE testTable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
- Storing Wikipedia data in MySQL: the Python code:
import requests, bs4, random, datetime, pymysql, re

random.seed(datetime.datetime.now())
aConn = pymysql.connect(host = "127.0.0.1", user = "root", passwd = "myPassword", port = 3306, charset = "utf8")
aCurs = aConn.cursor()
aCurs.execute("USE testDB")

def store(title, content):
    aCurs.execute("INSERT INTO testTable (title, content) VALUES (\"%s\", \"%s\")", (title, content))
    aCurs.connection.commit()

def getLinks(pageUrl):
    aRes = requests.get("http://en.wikipedia.org" + pageUrl)
    aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
    title = aBeaSou.find("h1").get_text()
    content = aBeaSou.find("div", {"id": "mw-content-text"}).find("p").get_text()
    store(title, content)
    return aBeaSou.find("div", {"id": "bodyContent"}).findAll("a", href = re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Keanu_Reeves")
try:
    while len(links) > 0:
        aRandomLink = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(aRandomLink)
        links = getLinks(aRandomLink)
finally:
    # No matter what happens, the cursor and the connection are closed before the program ends
    aCurs.close()
    aConn.close()
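- If the database has been switched to utf8mb4 as above, the connection can be opened with a matching charset; this is my own variant of the connect call, not the book's code:
import pymysql
aConn = pymysql.connect(host = "127.0.0.1", user = "root", passwd = "myPassword", port = 3306, db = "testDB", charset = "utf8mb4")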

Chapter 6: Reading Documents

Reading plain text
- Python tries to interpret a document as ASCII, while browsers try to interpret it as ISO-8859-1; the sample document below (French and Cyrillic text) should instead be read as UTF-8:
# Method 1: the book author's way of reading the page, using urllib
from urllib.request import urlopen
aPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
print(str(aPage.read(), "utf-8"))
# Method 2: my preferred way of reading the page, using requests
import requests
aRes = requests.get("http://www.pythonscraping.com/pages/warandpeace/chapter1-ru.txt")
aRes.raise_for_status()
aRes.encoding = "utf-8"
print(aRes.text)
- When reading a web page, honor the meta tag in the HTML page content and use the encoding it specifies:
# Method 1: the book author's way of reading the page, using urllib
from urllib.request import urlopen
import bs4
aPage = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
aBeaSou = bs4.BeautifulSoup(aPage, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
content = bytes(content, "UTF-8")
content = content.decode("UTF-8")
print(content)
# Method 2: my preferred way of reading the page, using requests
import requests, bs4
aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
print(content)

Reading CSV
- Loading the data as a string and wrapping it in an object that lets Python treat it as a file, without ever actually saving the data to a file, is the better approach:
import requests, io, csv

aRes = requests.get("http://pythonscraping.com/files/MontyPythonAlbums.csv")
aRes.raise_for_status()

# csv.reader returns the contents as list objects
aFile = io.StringIO(aRes.text)
aReader = csv.reader(aFile)
for i in aReader:
    print(i)

# csv.DictReader returns the contents as dict objects; the field names are stored in .fieldnames
aFile = io.StringIO(aRes.text)
aDictReader = csv.DictReader(aFile)
print(aDictReader.fieldnames)
for i in aDictReader:
    print(i)

Reading PDF
- Enter pip install PDFMiner3K at the Command Prompt to download and install PDFMiner3K.
- Enter pip install PyPDF2 at the Command Prompt to download and install PyPDF2.
- # Method 1: the book author's way of reading the file, using pdfminer
from urllib.request import urlopen
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams = laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
content = readPDF(pdfFile)
print(content)
# Method 2: my preferred way of reading the file, using PyPDF2
import requests, io, PyPDF2
aRes = requests.get("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
aRes.raise_for_status()
aFile = io.BytesIO(aRes.content)
aReader = PyPDF2.PdfFileReader(aFile)
for i in range(aReader.numPages):
    aPage = aReader.getPage(i)
    print(aPage.extractText())

Reading DOCX
- All docx files are Zip-compressed to save space; what you get after decompression is XML, and all of the document's text sits inside <w:t> tags:
import requests, bs4, io, zipfile
aRes = requests.get("http://pythonscraping.com/pages/AWordDocument.docx")
aRes.raise_for_status()
aFile = io.BytesIO(aRes.content)
aXml = zipfile.ZipFile(aFile)
xmlContent = aXml.read("word/document.xml")
aBeaSou = bs4.BeautifulSoup(xmlContent.decode("utf-8"), "lxml-xml")
for i in aBeaSou.findAll("w:t"):
    print(i.get_text())
