Friday, August 24, 2018

Web Scraping with Python Study Notes (2): Advanced Scraping

  I give 《網站擷取-使用Python》 (Web Scraping with Python: Collecting More Data from the Modern Web) a very high rating. In it, Ryan Mitchell explains web scraping in a way that is both comprehensive and easy to follow, from crawler techniques for coping with all kinds of site environments to the legal problems you may run into. The author offers simple, clear prototype approaches and case studies that give you a solid overview of web scraping.

  A second edition was published in 2018; it updates the first edition's outdated code and adds new chapters. These notes, however, cover only the first edition. For details, see the author's GitHub.

  Information technology develops at a breakneck pace, and web scraping sits at the leading edge of it. Data protection, bot detection, conserving server resources, and all manner of clever data-grabbing tricks compete with one another; at every moment someone is trying some new tactic in the offense-and-defense game of collecting data and preventing it from being collected, so these techniques go stale very quickly. Even so, the author supplies many sound programming and problem-solving concepts and ideas that give us something to fall back on as the technology keeps changing.

  The book is recommended for readers with basic Python skills. I read it after finishing 《Python自動化的樂趣-搞定重複瑣碎&單調無聊的工作》 (Automate the Boring Stuff with Python: Practical Programming for Total Beginners) by Al Sweigart, and found that the two books connect smoothly: most of the packages and modules introduced in Automate the Boring Stuff continue to be used in Web Scraping with Python.

Chapter 7: Cleaning Your Dirty Data

Without cleaning the dirty data programmatically
- import requests, bs4, csv

def getNgrams(input, n):
    input = input.split(" ")
    output = []
    for i in range(len(input) - n + 1):
        output.append(input[i : i + n])
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)

aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for row in ngrams:
    aWriter.writerow(row)
aFile.close()

Cleaning the dirty data programmatically
- import requests, bs4, re, string, csv
from collections import OrderedDict

def cleanInput(input):
    input = re.sub(r"\n+", " ", input)
    input = re.sub(r"\[[0-9]*\]", "", input)
    input = re.sub(r" +", " ", input)
    input = input.split(" ")
    cleanInput = []
    for i in input:
        item = i.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == "a" or item.lower() == "i"):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input) - n + 1):
        newNgram = " ".join(input[i : i + n])
        if newNgram in output:
            output[newNgram] += 1
        else:
            output[newNgram] = 1
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
# dictionary sorted by value
# import operator
# ngrams = OrderedDict(sorted(ngrams.items(), key = operator.itemgetter(1), reverse = True))
ngrams = OrderedDict(sorted(ngrams.items(), key = lambda t: t[1], reverse = True))

aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for k, v in ngrams.items():
    row = []
    row.append(k)
    row.append(v)
    aWriter.writerow(row)
aFile.close()

OpenRefine
- OpenRefine can clean data quickly and easily, and it can also reshape data into formats that are easy to read and process. Before using it you must first save your data as CSV. Download and installation link:
OpenRefine
Full documentation for GREL, the expression language used when working in OpenRefine:
OpenRefine Documentation For Users
- OpenRefine is operated through a web browser; however, browsers such as Google Chrome and Microsoft Edge currently no longer support this.

Chapter 8: Reading and Writing Natural Languages

Markov Models
- This example uses the single-line list comprehension technique:
# Old style
filterEmpty = []
for word in words:
    if word != "":
        filterEmpty.append(word)
words = filterEmpty
# New style
words = [word for word in words if word != ""]
- import requests, random

def wordListSum(wordList):
    sum = 0
    for k, v in wordList.items():
        sum += v
    return sum

def retrieveRandomWord(wordList):
    randomIndex = random.randint(1, wordListSum(wordList))
    for k, v in wordList.items():
        randomIndex -= v
        if randomIndex <= 0:
            return k

def buildWordDict(text):
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    punctuation = [",", ".", ";", ":"]
    for i in punctuation:
        text = text.replace(i, " " + i + " ")
   
    words = text.split(" ")
    words = [word for word in words if word != ""]
   
    wordDict = {}
    for j in range(1, len(words)):
        if words[j - 1] not in wordDict:
            wordDict[words[j - 1]] = {}
        if words[j] not in wordDict[words[j - 1]]:
            wordDict[words[j - 1]][words[j]] = 0
        wordDict[words[j - 1]][words[j]] += 1
    return wordDict

aRes = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
aRes.raise_for_status()
wordDict = buildWordDict(aRes.text)

length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)

Natural Language Toolkit (NLTK)
- At the command prompt, run pip install nltk to download and install NLTK.
- Open the NLTK downloader; installing all of the available packages is recommended:
import nltk
nltk.download()
- List objects of words and sentences, and the Text object:
import nltk
text = "The dust was thick so he had to dust. Later, he had to groom for the groom."
a = nltk.word_tokenize(text)
print(a)
b = nltk.sent_tokenize(text)
print(b)
c = nltk.Text(a)
print(c)
- Load the built-in book texts as examples and use the FreqDist frequency-distribution object:
from nltk.book import *
from nltk import FreqDist
a = FreqDist(text6)
print(a)
print(a.most_common(10))
- Load the built-in book texts as examples and use bigrams to build 2-gram combinations:
from nltk.book import *
from nltk import bigrams, FreqDist
a = bigrams(text6)
b = FreqDist(a)
print(b)
print(b[("Sir", "Robin")])
- Load the built-in book texts as examples and use ngrams to build 4-gram combinations:
from nltk.book import *
from nltk import ngrams, FreqDist
a = ngrams(text6, 4)
b = FreqDist(a)
print(b)
print(b[("father", "smelt", "of", "elderberries")])
- Load the built-in book texts as examples and find the 4-grams that begin with a given word:
from nltk.book import *
from nltk import ngrams
a = ngrams(text6, 4)
for i in a:
    if i[0] == "coconut":
        print(i)
- NLTK dictionary analysis with Penn Treebank part-of-speech tags:
import nltk
a = nltk.word_tokenize("The dust was thick so he had to dust. Later, he had to groom for the groom.")
print(nltk.pos_tag(a))

Chapter 9: Crawling Through Forms and Logins

The Python Requests library
- Submitting a basic form:
# http://pythonscraping.com/pages/files/form.html
import requests
data = {"firstname": "Ryan", "lastname": "Mitchell"}
aRes = requests.post("http://pythonscraping.com/pages/files/processing.php", data = data)
print(aRes.text)
- Submitting files and images:
# http://pythonscraping.com/files/form2.html
import requests
files = {"uploadFile": open("D:\\test.png", "rb")}
aRes = requests.post("http://pythonscraping.com/pages/files/processing2.php", files = files)
print(aRes.text)
- Handling logins and cookies:
# http://pythonscraping.com/pages/cookies/login.html
import requests
data = {"username": "Ryan", "password": "\"password\""}
aRes = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aRes.cookies.get_dict())
bRes = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies = aRes.cookies)
print(bRes.text)
- Handling logins and cookies with a Session:
# More complex sites often change their cookies without warning; the Session feature of the Requests package can handle these situations
# http://pythonscraping.com/pages/cookies/login.html
import requests
session = requests.Session()
data = {"username": "Ryan", "password": "\"password\""}
aSes = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aSes.cookies.get_dict())
bSes = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(bSes.text)
- HTTP basic access authentication:
# The auth module of the Requests package is designed to handle HTTP authentication
# http://pythonscraping.com/pages/auth/login.php
import requests
auth = requests.auth.HTTPBasicAuth("Ryan", "password")
aRes = requests.post("http://pythonscraping.com/pages/auth/login.php", auth = auth)
print(aRes.text)

Chapter 10: Scraping JavaScript

JavaScript
- In most cases, only two client-side languages are encountered on the web: ActionScript (used by Flash applications) and JavaScript. Common JavaScript libraries include jQuery, Google Analytics, and Google Maps.
- Ajax (Asynchronous JavaScript and XML) is a technique for sending data to and receiving data from a web server without requesting a new page.
- Dynamic HTML (DHTML) refers to HTML and CSS content that changes as client-side scripts modify the page.
- An ordinary scraper cannot execute the JavaScript behind Ajax and dynamic HTML. There are only two solutions: extract the content directly from the JavaScript, or use a Python package that can execute JavaScript.
- The Selenium package executes JavaScript from within Python:
At the command prompt, run pip install selenium to download and install Selenium:
selenium · PyPI
Selenium - Web Browser Automation
PhantomJS is a headless browser that lets programs run quietly in the background.
(Selenium dropped support for PhantomJS in 2017, and PhantomJS development was suspended in 2018, so headless Chrome or headless Firefox is recommended instead; see the sketch after this list.)
PhantomJS - Scriptable Headless Browser
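
Since the note above recommends headless Chrome as a PhantomJS replacement, here is a minimal sketch of running Selenium with headless Chrome against the book's Ajax demo page (the chromedriver path is an assumption):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Run Chrome without opening a visible browser window
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(r"D:\chromedriver.exe", options=options)  # driver path is an assumption
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # crude wait for the Ajax content to load
print(driver.find_element_by_id("content").text)
driver.close()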

Loading data with Ajax
- Loading with ChromeDriver and a Selenium selector:
from selenium import webdriver
import time
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
- Loading with PhantomJS and BeautifulSoup:
from selenium import webdriver
import time, bs4
driver = webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
aBeaSou = bs4.BeautifulSoup(driver.page_source, "html.parser")
print(aBeaSou.find(id = "content").get_text())
driver.close()

Repeatedly checking whether a page has fully loaded its data
- Waiting for Ajax-loaded data:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

try:
    # See the locator selection strategies described below
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    # print(driver.find_element(By.ID, "content").text)
    print(driver.find_element_by_id("content").text)

driver.close()
- Handling client-side redirects:
# Server-side redirects can be handled with the ordinary urllib library; see the sketch after this code block
from selenium import webdriver
import time

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")

element = driver.find_element_by_tag_name("html")
count = 0
while True:
    count += 1
    if count > 20:
        print("Timing Out After 10 Seconds and Returning")
        break
    time.sleep(0.5)
    if element != driver.find_element_by_tag_name("html"):
        break

print(driver.page_source)
driver.close()
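
As the comment above notes, server-side redirects are handled by the ordinary urllib library, which follows them automatically; a minimal sketch (the example URL is only an assumption of a page that issues an HTTP redirect):
from urllib.request import urlopen

# urlopen follows server-side (HTTP 301/302) redirects automatically;
# geturl() shows the URL that was ultimately fetched
response = urlopen("http://github.com")  # assumed to redirect to https://github.com/
print(response.geturl())
print(response.status)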

Locator selection strategies
- Locator selection strategies are used with the By object:
(By.ID, "test")
(By.CLASS_NAME, "test")
(By.CSS_SELECTOR, "#idName")
(By.CSS_SELECTOR, ".className")(By.CSS_SELECTOR, "tagName")
(By.LINK_TEXT, "test")
(By.PARTIAL_LINK_TEXT, "test")
(By.NAME, "test")
(By.TAG_NAME, "p")
(By.XPATH, "//div")
- XPath (XML Path) syntax:
XPath syntax is used to navigate and select the contents of XML documents, and it rests on four main concepts:
1. Root nodes and non-root nodes:
/div selects a div tag only if it sits at the root of the document
//div selects all div tags anywhere in the document
2. Attribute selection:
//@href selects all nodes that have an href attribute
//a[@href = "http://google.com"] selects every link that points to the Google homepage
3. Selecting nodes by position:
//a[3] selects the third link in the document
//table[last()] selects the last table in the document
//a[position() < 3] selects the first two links in the document
4. The asterisk (*) matches any character or node:
//table/tr/* selects all children of tr tags in every table
//div[@*] selects all div tags that have at least one attribute
More on XPath syntax:
XPath Syntax
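
A minimal sketch of combining these locator strategies with XPath expressions in Selenium, reusing the book's Ajax demo page from above (the chromedriver path is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r"D:\chromedriver.exe")  # driver path is an assumption
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

# Wait for the button to appear, located by an XPath attribute predicate
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[@id='loadedButton']")))

# Select the content div by XPath instead of By.ID
print(driver.find_element(By.XPATH, "//div[@id='content']").text)
driver.close()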

Chapter 11: Image Processing and Text Recognition

Libraries commonly used to recognize CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart)
- Pillow, an image-processing library:
At the command prompt, run pip install pillow to download and install Pillow.
More on Pillow:
Pillow (PIL Fork) 5.2.0 documentation
- Tesseract, an optical character recognition library:
Tesseract is a command-line program rather than a library accessed through import. On Windows it can be installed with the installer, choosing which language packs to download when prompted:
Downloads · tesseract-ocr
Tesseract has no graphical user interface. A system variable must be added so the system knows where the data files are stored; enter the installation path as the value: Windows Settings → Edit the system environment variables → System Properties → Environment Variables → System variables:
Add a system variable: name TESSDATA_PREFIX, value C:\Tesseract-OCR
Edit an environment variable: name Path, adding the value C:\Tesseract-OCR
More on Tesseract (a pytesseract usage sketch follows this list):
pytesseract · PyPI
- NumPy, a mathematics library:
At the command prompt, run pip install numpy to download and install NumPy.
It is used to train Tesseract to recognize additional characters or new fonts.
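
A minimal sketch of the pytesseract wrapper linked above, assuming Tesseract is installed and reachable through the Path configured earlier (the image filenames are placeholders):
from PIL import Image
import pytesseract

# pytesseract shells out to the tesseract executable on the Path and
# returns the recognized text as a string
image = Image.open("textOriginal.png")  # placeholder filename
print(pytesseract.image_to_string(image))

# A language pack can be chosen with lang=, mirroring the -l flag used
# in the command-line examples below, e.g. Traditional Chinese:
# print(pytesseract.image_to_string(image, lang="chi_tra"))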

Image processing
- Gaussian blur:
from PIL import Image, ImageFilter
kitten = Image.open("zophie.png")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("zophie_blurred.png")
blurryKitten.show()
- A thresholding filter that removes the gray background and makes the text stand out:
from PIL import Image
badImage = Image.open("textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("textCleaned.png")

Optical character recognition
- Working from the command prompt on neatly formatted text:
# Read a text image in the current working directory and write the output to a .txt file
tesseract textOriginal.png textOutput
# Read a text image from a specified directory and write the output .txt there
tesseract D:\textOriginal.png D:\textOutput
# Print the recognized text directly instead of saving it to a .txt file
tesseract textOriginal.png stdout
# English is recognized by default, but the language of the text image can also be given explicitly
tesseract textOriginal.png textOutput -l eng
# Requires the language pack; recognize Traditional Chinese in the text image
tesseract textOriginal.png textOutput -l chi_tra
- Combining image processing with recognition of a text image file:
from PIL import Image
import subprocess

badImage = Image.open("D:\\textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("D:\\textCleaned.png")

subprocess.call(["tesseract", "D:\\textCleaned.png", "D:\\textOutput"])

textFile = open("D:\\textOutput.txt", "r")
print(textFile.read())
textFile.close()
- Training Tesseract:
A commonly used CAPTCHA that can serve as training material:
Drupal CAPTCHA
A tool that makes it easy to create the box files (.box) needed for the training samples:
Tesseract OCR Chopper
The book's author wrote a Python tool for this and suggests around 100 files to ensure there is enough training data:
tesseract-trainer
More on training Tesseract:
Training Tesseract

Chapter 12: Avoiding Scraping Traps

Looking like a human instead of a bot
- Adjust your headers; compare the headers a typical Python scraper sends with the headers an ordinary browser sends:
import requests, bs4
session = requests.Session()
url = "https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
aSes = session.get(url, headers = headers)
aBeaSou = bs4.BeautifulSoup(aSes.text, "html.parser")
print(aBeaSou.find("table", {"class": "table-striped"}).get_text)
- Handle cookies; the author recommends studying the cookies a site produces and thinking about which of them your scrapers need to deal with:
from selenium import webdriver
aDriver = webdriver.Chrome(r"D:\chromedriver.exe")
bDriver = webdriver.Chrome(r"D:\chromedriver.exe")

aDriver.get("http://pythonscraping.com")
aDriver.implicitly_wait(1)
savedCookies = aDriver.get_cookies()
print(savedCookies)

bDriver.get("http://pythonscraping.com")
bDriver.implicitly_wait(1)
bDriver.delete_all_cookies()
for cookie in savedCookies:
    bDriver.add_cookie(cookie)
print(bDriver.get_cookies())

aDriver.close()
bDriver.close()
- A related tool for working with cookies:
EditThisCookie

Common form protection mechanisms
- Hidden input fields: a typical Python scraper tends to omit the contents of hidden fields when submitting a form, whereas a normal browser always submits them automatically (see the sketch at the end of this section).
- Avoiding honeypots: hidden fields that a typical Python scraper is prone to fill in automatically are never touched during normal browsing:
# Selenium actually renders the pages it visits, so the is_displayed() function can be used to determine which elements are truly visible on the page
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/itsatrap.html")

links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("The Link " + link.get_attribute("href") + " Is a Trap")

fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Do Not Change Value of " + field.get_attribute("name"))

Chapter 13: Testing Your Website with Scrapers

Python內建的單元測試模組unittest進行測試
l   setUp()函式會在每個獨立測試之前執行:
import unittest
class TestAddition(unittest.TestCase):
    def setUp(self):
        print("Setting Up the Test")
    def tearDown(self):
        print("Tearing Down the Test")
    def test_twoPlusTwo(self):
        total = 2 + 2
        self.assertEqual(4, total)
if __name__ == "__main__":
    unittest.main()
- The setUpClass() function runs once before any of the tests in the class:
import requests, bs4, unittest
class TestWikipedia(unittest.TestCase):
    aBeaSou = None
    def setUpClass():
        global aBeaSou
        url = "http://en.wikipedia.org/wiki/Monty_Python"
        aBeaSou = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    def test_titleText(self):
        global aBeaSou
        pageTitle = aBeaSou.find("h1").get_text()
        self.assertEqual("Monty Python", pageTitle)
    def test_contentExists(self):
        global aBeaSou
        content = aBeaSou.find("div", {"id": "mw-content-text"})
        self.assertIsNotNone(content)
if __name__ == "__main__":
    unittest.main()

Selenium的斷言進行測試
l   from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
assert "Monty Python" in driver.title
driver.close()

Other commonly used Selenium features
- Interacting with the site:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/files/form.html")
firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")

"""
Method 1
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()
"""
# Method 2
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan").click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()

print(driver.find_element_by_tag_name("body").text)
driver.close()
- Drag and drop:
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)

driver.close()
- Taking a screenshot:
from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://www.pythonscraping.com/")
driver.get_screenshot_as_file(r"D:\test.png")
driver.close()

Chapter 14: Scraping Remotely

Why use remote hosts
- Reason 1: you need a different IP address:
 The goal of using Tor is to change your IP address:
 Tor (The Onion Router)
 PySocks can redirect traffic to a proxy server and works with Tor, which listens on port 9150 by default (see the sketch after this list):
 At the command prompt, run pip install PySocks==1.5.0 to download and install PySocks.
 PySocks · PyPI
 http://icanhazip.com displays the IP address of the client that connects to it.
- Reason 2: you need more processing power and flexibility:
 Option 1: run your scraper from a web-hosting account.
 Option 2: run your scraper from the cloud:
  Resources for using Python and JavaScript on Google's cloud computing platform:
  Google Compute Engine
  Resources for using Amazon Web Services:
  Python and AWS Cookbook
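
A minimal sketch of routing Requests traffic through Tor's SOCKS proxy on port 9150, assuming the Tor Browser (or a Tor service) is running locally and that Requests was installed with SOCKS support (pip install requests[socks]); icanhazip.com is used, as above, to confirm the exit IP:
import requests

# Route HTTP and HTTPS traffic through the local Tor SOCKS proxy
# (the Tor Browser listens on 127.0.0.1:9150 by default)
proxies = {
    "http": "socks5h://127.0.0.1:9150",
    "https": "socks5h://127.0.0.1:9150",
}

# Without the proxy: your real IP; with the proxy: the Tor exit node's IP
print(requests.get("http://icanhazip.com").text)
print(requests.get("http://icanhazip.com", proxies = proxies).text)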

Appendix C: Legal and Ethical Considerations When Scraping the Web

The Robots Exclusion Standard
- Append /robots.txt to a site's URL, for example https://www.mlb.com/robots.txt:
User-agent identifies which users the rules apply to.
Allow and Disallow state whether the specified parts of the site may be accessed. (A programmatic check is sketched below.)

Google web cache
- If the site you want to reach is down, try entering http://webcache.googleusercontent.com/search?q=cache: followed by the site's URL to look for a Google-cached version, for example:
http://webcache.googleusercontent.com/search?q=cache:http://pythonscraping.com/
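
A minimal sketch of fetching such a cached copy with Requests (purely illustrative; Google may block or rate-limit automated requests to the cache):
import requests

# Build the cache-lookup URL for a page that is assumed to be unreachable
targetUrl = "http://pythonscraping.com/"
cacheUrl = "http://webcache.googleusercontent.com/search?q=cache:" + targetUrl

aRes = requests.get(cacheUrl)
print(aRes.status_code)
print(aRes.text[:500])  # beginning of the cached page, if one exists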
