Friday, August 24, 2018

Web Scraping with Python Study Notes (2): Advanced Scraping

  I give 《網站擷取-使用Python》 (Web Scraping with Python: Collecting More Data from the Modern Web) a very high rating. In it, Ryan Mitchell explains web scraping in a way that is both comprehensive and easy to follow, from crawler techniques for coping with all kinds of site environments to the legal problems you may run into. The author offers simple, clear prototype approaches and case studies that give you a solid overview of web scraping.

  A second edition was published in 2018; it updates the first edition's outdated code and adds new chapters. These notes, however, cover only the first edition. For details, see the author's GitHub.

  Information technology develops at a breakneck pace, and web scraping sits at the leading edge of it. Data protection, bot detection, conserving server resources, and all manner of clever data-grabbing tricks compete with one another; at every moment someone is trying some new tactic in the offense-and-defense game of collecting data and preventing it from being collected, so these techniques go stale very quickly. Even so, the author supplies many sound programming and problem-solving concepts and ideas that give us something to fall back on as the technology keeps changing.

  The book is recommended for readers with basic Python skills. I read it after finishing 《Python自動化的樂趣-搞定重複瑣碎&單調無聊的工作》 (Automate the Boring Stuff with Python: Practical Programming for Total Beginners) by Al Sweigart, and found that the two books connect smoothly: most of the packages and modules introduced in Automate the Boring Stuff continue to be used in Web Scraping with Python.

Chapter 7: Cleaning Your Dirty Data

Without cleaning the dirty data programmatically
- import requests, bs4, csv

def getNgrams(input, n):
    input = input.split(" ")
    output = []
    for i in range(len(input) - n + 1):
        output.append(input[i : i + n])
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)

aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for row in ngrams:
    aWriter.writerow(row)
aFile.close()

Cleaning the dirty data programmatically
- import requests, bs4, re, string, csv
from collections import OrderedDict

def cleanInput(input):
    input = re.sub(r"\n+", " ", input)
    input = re.sub(r"\[[0-9]*\]", "", input)
    input = re.sub(r" +", " ", input)
    input = input.split(" ")
    cleanInput = []
    for i in input:
        item = i.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == "a" or item.lower() == "i"):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input) - n + 1):
        newNgram = " ".join(input[i : i + n])
        if newNgram in output:
            output[newNgram] += 1
        else:
            output[newNgram] = 1
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
# dictionary sorted by value
# import operator
# ngrams = OrderedDict(sorted(ngrams.items(), key = operator.itemgetter(1), reverse = True))
ngrams = OrderedDict(sorted(ngrams.items(), key = lambda t: t[1], reverse = True))

aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for k, v in ngrams.items():
    row = []
    row.append(k)
    row.append(v)
    aWriter.writerow(row)
aFile.close()

OpenRefine
- OpenRefine can clean data quickly and easily, and it can also reshape data into formats that are easy to read and process. Before using it you must first save your data as CSV. Download and installation link:
OpenRefine
Full documentation for GREL, the expression language used when working in OpenRefine:
OpenRefine Documentation For Users
- OpenRefine is operated through a web browser; however, browsers such as Google Chrome and Microsoft Edge currently no longer support this.

Chapter 8: Reading and Writing Natural Languages

Markov Models
- This example uses the single-line list comprehension technique:
# Old style
filterEmpty = []
for word in words:
    if word != "":
        filterEmpty.append(word)
words = filterEmpty
# New style
words = [word for word in words if word != ""]
- import requests, random

def wordListSum(wordList):
    sum = 0
    for k, v in wordList.items():
        sum += v
    return sum

def retrieveRandomWord(wordList):
    randomIndex = random.randint(1, wordListSum(wordList))
    for k, v in wordList.items():
        randomIndex -= v
        if randomIndex <= 0:
            return k

def buildWordDict(text):
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    punctuation = [",", ".", ";", ":"]
    for i in punctuation:
        text = text.replace(i, " " + i + " ")
   
    words = text.split(" ")
    words = [word for word in words if word != ""]
   
    wordDict = {}
    for j in range(1, len(words)):
        if words[j - 1] not in wordDict:
            wordDict[words[j - 1]] = {}
        if words[j] not in wordDict[words[j - 1]]:
            wordDict[words[j - 1]][words[j]] = 0
        wordDict[words[j - 1]][words[j]] += 1
    return wordDict

aRes = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
aRes.raise_for_status()
wordDict = buildWordDict(aRes.text)

length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)

Natural Language Toolkit (NLTK)
- At the command prompt, run pip install nltk to download and install NLTK.
- Open the NLTK downloader; installing all of the available packages is recommended:
import nltk
nltk.download()
- List objects of words and sentences, and the Text object:
import nltk
text = "The dust was thick so he had to dust. Later, he had to groom for the groom."
a = nltk.word_tokenize(text)
print(a)
b = nltk.sent_tokenize(text)
print(b)
c = nltk.Text(a)
print(c)
- Load the built-in book texts as examples and use the FreqDist frequency-distribution object:
from nltk.book import *
from nltk import FreqDist
a = FreqDist(text6)
print(a)
print(a.most_common(10))
- Load the built-in book texts as examples and use bigrams to build 2-gram combinations:
from nltk.book import *
from nltk import bigrams, FreqDist
a = bigrams(text6)
b = FreqDist(a)
print(b)
print(b[("Sir", "Robin")])
- Load the built-in book texts as examples and use ngrams to build 4-gram combinations:
from nltk.book import *
from nltk import ngrams, FreqDist
a = ngrams(text6, 4)
b = FreqDist(a)
print(b)
print(b[("father", "smelt", "of", "elderberries")])
- Load the built-in book texts as examples and find the 4-grams that begin with a given word:
from nltk.book import *
from nltk import ngrams
a = ngrams(text6, 4)
for i in a:
    if i[0] == "coconut":
        print(i)
- NLTK dictionary analysis with Penn Treebank part-of-speech tags:
import nltk
a = nltk.word_tokenize("The dust was thick so he had to dust. Later, he had to groom for the groom.")
print(nltk.pos_tag(a))

Chapter 9: Crawling Through Forms and Logins

The Python Requests library
- Submitting a basic form:
# http://pythonscraping.com/pages/files/form.html
import requests
data = {"firstname": "Ryan", "lastname": "Mitchell"}
aRes = requests.post("http://pythonscraping.com/pages/files/processing.php", data = data)
print(aRes.text)
- Submitting files and images:
# http://pythonscraping.com/files/form2.html
import requests
files = {"uploadFile": open("D:\\test.png", "rb")}
aRes = requests.post("http://pythonscraping.com/pages/files/processing2.php", files = files)
print(aRes.text)
- Handling logins and cookies:
# http://pythonscraping.com/pages/cookies/login.html
import requests
data = {"username": "Ryan", "password": "\"password\""}
aRes = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aRes.cookies.get_dict())
bRes = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies = aRes.cookies)
print(bRes.text)
- Handling logins and cookies with a Session:
# More complex sites often change their cookies without warning; the Session feature of the Requests package can handle these situations
# http://pythonscraping.com/pages/cookies/login.html
import requests
session = requests.Session()
data = {"username": "Ryan", "password": "\"password\""}
aSes = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aSes.cookies.get_dict())
bSes = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(bSes.text)
- HTTP basic access authentication:
# The auth module of the Requests package is designed to handle HTTP authentication
# http://pythonscraping.com/pages/auth/login.php
import requests
auth = requests.auth.HTTPBasicAuth("Ryan", "password")
aRes = requests.post("http://pythonscraping.com/pages/auth/login.php", auth = auth)
print(aRes.text)

Chapter 10: Scraping JavaScript

JavaScript
- In most cases, only two client-side languages are encountered on the web: ActionScript (used by Flash applications) and JavaScript. Common JavaScript libraries include jQuery, Google Analytics, and Google Maps.
- Ajax (Asynchronous JavaScript and XML) is a technique for sending data to and receiving data from a web server without requesting a new page.
- Dynamic HTML (DHTML) refers to HTML and CSS content that changes as client-side scripts modify the page.
- An ordinary scraper cannot execute the JavaScript behind Ajax and dynamic HTML. There are only two solutions: extract the content directly from the JavaScript, or use a Python package that can execute JavaScript.
- The Selenium package executes JavaScript from within Python:
At the command prompt, run pip install selenium to download and install Selenium:
selenium · PyPI
Selenium - Web Browser Automation
PhantomJS is a headless browser that lets programs run quietly in the background.
(Selenium dropped support for PhantomJS in 2017, and PhantomJS development was suspended in 2018, so headless Chrome or headless Firefox is recommended instead; see the sketch after this list.)
PhantomJS - Scriptable Headless Browser
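
Since the note above recommends headless Chrome as a PhantomJS replacement, here is a minimal sketch of running Selenium with headless Chrome against the book's Ajax demo page (the chromedriver path is an assumption):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Run Chrome without opening a visible browser window
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(r"D:\chromedriver.exe", options=options)  # driver path is an assumption
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # crude wait for the Ajax content to load
print(driver.find_element_by_id("content").text)
driver.close()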

Loading data with Ajax
- Loading with ChromeDriver and a Selenium selector:
from selenium import webdriver
import time
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
- Loading with PhantomJS and BeautifulSoup:
from selenium import webdriver
import time, bs4
driver = webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
aBeaSou = bs4.BeautifulSoup(driver.page_source, "html.parser")
print(aBeaSou.find(id = "content").get_text())
driver.close()

Repeatedly checking whether a page has fully loaded its data
- Waiting for Ajax-loaded data:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

try:
    # See the locator selection strategies described below
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    # print(driver.find_element(By.ID, "content").text)
    print(driver.find_element_by_id("content").text)

driver.close()
- Handling client-side redirects:
# Server-side redirects can be handled with the ordinary urllib library; see the sketch after this code block
from selenium import webdriver
import time

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")

element = driver.find_element_by_tag_name("html")
count = 0
while True:
    count += 1
    if count > 20:
        print("Timing Out After 10 Seconds and Returning")
        break
    time.sleep(0.5)
    if element != driver.find_element_by_tag_name("html"):
        break

print(driver.page_source)
driver.close()
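
As the comment above notes, server-side redirects are handled by the ordinary urllib library, which follows them automatically; a minimal sketch (the example URL is only an assumption of a page that issues an HTTP redirect):
from urllib.request import urlopen

# urlopen follows server-side (HTTP 301/302) redirects automatically;
# geturl() shows the URL that was ultimately fetched
response = urlopen("http://github.com")  # assumed to redirect to https://github.com/
print(response.geturl())
print(response.status)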

Locator selection strategies
- Locator selection strategies are used with the By object:
(By.ID, "test")
(By.CLASS_NAME, "test")
(By.CSS_SELECTOR, "#idName")
(By.CSS_SELECTOR, ".className")(By.CSS_SELECTOR, "tagName")
(By.LINK_TEXT, "test")
(By.PARTIAL_LINK_TEXT, "test")
(By.NAME, "test")
(By.TAG_NAME, "p")
(By.XPATH, "//div")
- XPath (XML Path) syntax:
XPath syntax is used to navigate and select the contents of XML documents, and it rests on four main concepts:
1. Root nodes and non-root nodes:
/div selects a div tag only if it sits at the root of the document
//div selects all div tags anywhere in the document
2. Attribute selection:
//@href selects all nodes that have an href attribute
//a[@href = "http://google.com"] selects every link that points to the Google homepage
3. Selecting nodes by position:
//a[3] selects the third link in the document
//table[last()] selects the last table in the document
//a[position() < 3] selects the first two links in the document
4. The asterisk (*) matches any character or node:
//table/tr/* selects all children of tr tags in every table
//div[@*] selects all div tags that have at least one attribute
More on XPath syntax:
XPath Syntax
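
A minimal sketch of combining these locator strategies with XPath expressions in Selenium, reusing the book's Ajax demo page from above (the chromedriver path is an assumption):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r"D:\chromedriver.exe")  # driver path is an assumption
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

# Wait for the button to appear, located by an XPath attribute predicate
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//*[@id='loadedButton']")))

# Select the content div by XPath instead of By.ID
print(driver.find_element(By.XPATH, "//div[@id='content']").text)
driver.close()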

Chapter 11: Image Processing and Text Recognition

Libraries commonly used to recognize CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart)
- Pillow, an image-processing library:
At the command prompt, run pip install pillow to download and install Pillow.
More on Pillow:
Pillow (PIL Fork) 5.2.0 documentation
- Tesseract, an optical character recognition library:
Tesseract is a command-line program rather than a library accessed through import. On Windows it can be installed with the installer, choosing which language packs to download when prompted:
Downloads · tesseract-ocr
Tesseract has no graphical user interface. A system variable must be added so the system knows where the data files are stored; enter the installation path as the value: Windows Settings → Edit the system environment variables → System Properties → Environment Variables → System variables:
Add a system variable: name TESSDATA_PREFIX, value C:\Tesseract-OCR
Edit an environment variable: name Path, adding the value C:\Tesseract-OCR
More on Tesseract (a pytesseract usage sketch follows this list):
pytesseract · PyPI
- NumPy, a mathematics library:
At the command prompt, run pip install numpy to download and install NumPy.
It is used to train Tesseract to recognize additional characters or new fonts.
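
A minimal sketch of the pytesseract wrapper linked above, assuming Tesseract is installed and reachable through the Path configured earlier (the image filenames are placeholders):
from PIL import Image
import pytesseract

# pytesseract shells out to the tesseract executable on the Path and
# returns the recognized text as a string
image = Image.open("textOriginal.png")  # placeholder filename
print(pytesseract.image_to_string(image))

# A language pack can be chosen with lang=, mirroring the -l flag used
# in the command-line examples below, e.g. Traditional Chinese:
# print(pytesseract.image_to_string(image, lang="chi_tra"))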

Image processing
- Gaussian blur:
from PIL import Image, ImageFilter
kitten = Image.open("zophie.png")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("zophie_blurred.png")
blurryKitten.show()
- A thresholding filter that removes the gray background and makes the text stand out:
from PIL import Image
badImage = Image.open("textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("textCleaned.png")

Optical character recognition
- Working from the command prompt on neatly formatted text:
# Read a text image in the current working directory and write the output to a .txt file
tesseract textOriginal.png textOutput
# Read a text image from a specified directory and write the output .txt there
tesseract D:\textOriginal.png D:\textOutput
# Print the recognized text directly instead of saving it to a .txt file
tesseract textOriginal.png stdout
# English is recognized by default, but the language of the text image can also be given explicitly
tesseract textOriginal.png textOutput -l eng
# Requires the language pack; recognize Traditional Chinese in the text image
tesseract textOriginal.png textOutput -l chi_tra
- Combining image processing with recognition of a text image file:
from PIL import Image
import subprocess

badImage = Image.open("D:\\textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("D:\\textCleaned.png")

subprocess.call(["tesseract", "D:\\textCleaned.png", "D:\\textOutput"])

textFile = open("D:\\textOutput.txt", "r")
print(textFile.read())
textFile.close()
- Training Tesseract:
A commonly used CAPTCHA that can serve as training material:
Drupal CAPTCHA
A tool that makes it easy to create the box files (.box) needed for the training samples:
Tesseract OCR Chopper
The book's author wrote a Python tool for this and suggests around 100 files to ensure there is enough training data:
tesseract-trainer
More on training Tesseract:
Training Tesseract

Chapter 12: Avoiding Scraping Traps

Looking like a human instead of a bot
- Adjust your headers; compare the headers a typical Python scraper sends with the headers an ordinary browser sends:
import requests, bs4
session = requests.Session()
url = "https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
aSes = session.get(url, headers = headers)
aBeaSou = bs4.BeautifulSoup(aSes.text, "html.parser")
print(aBeaSou.find("table", {"class": "table-striped"}).get_text)
- Handle cookies; the author recommends studying the cookies a site produces and thinking about which of them your scrapers need to deal with:
from selenium import webdriver
aDriver = webdriver.Chrome(r"D:\chromedriver.exe")
bDriver = webdriver.Chrome(r"D:\chromedriver.exe")

aDriver.get("http://pythonscraping.com")
aDriver.implicitly_wait(1)
savedCookies = aDriver.get_cookies()
print(savedCookies)

bDriver.get("http://pythonscraping.com")
bDriver.implicitly_wait(1)
bDriver.delete_all_cookies()
for cookie in savedCookies:
    bDriver.add_cookie(cookie)
print(bDriver.get_cookies())

aDriver.close()
bDriver.close()
- A related tool for working with cookies:
EditThisCookie

Common form protection mechanisms
- Hidden input fields: a typical Python scraper tends to omit the contents of hidden fields when submitting a form, whereas a normal browser always submits them automatically (see the sketch at the end of this section).
- Avoiding honeypots: hidden fields that a typical Python scraper is prone to fill in automatically are never touched during normal browsing:
# Selenium actually renders the pages it visits, so the is_displayed() function can be used to determine which elements are truly visible on the page
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/itsatrap.html")

links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("The Link " + link.get_attribute("href") + " Is a Trap")

fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Do Not Change Value of " + field.get_attribute("name"))

Chapter 13: Testing Your Website with Scrapers

Python內建的單元測試模組unittest進行測試
l   setUp()函式會在每個獨立測試之前執行:
import unittest
class TestAddition(unittest.TestCase):
    def setUp(self):
        print("Setting Up the Test")
    def tearDown(self):
        print("Tearing Down the Test")
    def test_twoPlusTwo(self):
        total = 2 + 2
        self.assertEqual(4, total)
if __name__ == "__main__":
    unittest.main()
- The setUpClass() function runs once before any of the tests in the class:
import requests, bs4, unittest
class TestWikipedia(unittest.TestCase):
    aBeaSou = None
    def setUpClass():
        global aBeaSou
        url = "http://en.wikipedia.org/wiki/Monty_Python"
        aBeaSou = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    def test_titleText(self):
        global aBeaSou
        pageTitle = aBeaSou.find("h1").get_text()
        self.assertEqual("Monty Python", pageTitle)
    def test_contentExists(self):
        global aBeaSou
        content = aBeaSou.find("div", {"id": "mw-content-text"})
        self.assertIsNotNone(content)
if __name__ == "__main__":
    unittest.main()

Selenium的斷言進行測試
l   from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
assert "Monty Python" in driver.title
driver.close()

Other commonly used Selenium features
- Interacting with the site:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/files/form.html")
firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")

"""
Method 1
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()
"""
# Method 2
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan").click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()

print(driver.find_element_by_tag_name("body").text)
driver.close()
- Drag and drop:
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')

print(driver.find_element_by_id("message").text)

element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()

print(driver.find_element_by_id("message").text)

driver.close()
- Taking a screenshot:
from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://www.pythonscraping.com/")
driver.get_screenshot_as_file(r"D:\test.png")
driver.close()

Chapter 14: Scraping Remotely

Why use remote hosts
- Reason 1: you need a different IP address:
 The goal of using Tor is to change your IP address:
 Tor (The Onion Router)
 PySocks can redirect traffic to a proxy server and works with Tor, which listens on port 9150 by default (see the sketch after this list):
 At the command prompt, run pip install PySocks==1.5.0 to download and install PySocks.
 PySocks · PyPI
 http://icanhazip.com displays the IP address of the client that connects to it.
- Reason 2: you need more processing power and flexibility:
 Option 1: run your scraper from a web-hosting account.
 Option 2: run your scraper from the cloud:
  Resources for using Python and JavaScript on Google's cloud computing platform:
  Google Compute Engine
  Resources for using Amazon Web Services:
  Python and AWS Cookbook
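
A minimal sketch of routing Requests traffic through Tor's SOCKS proxy on port 9150, assuming the Tor Browser (or a Tor service) is running locally and that Requests was installed with SOCKS support (pip install requests[socks]); icanhazip.com is used, as above, to confirm the exit IP:
import requests

# Route HTTP and HTTPS traffic through the local Tor SOCKS proxy
# (the Tor Browser listens on 127.0.0.1:9150 by default)
proxies = {
    "http": "socks5h://127.0.0.1:9150",
    "https": "socks5h://127.0.0.1:9150",
}

# Without the proxy: your real IP; with the proxy: the Tor exit node's IP
print(requests.get("http://icanhazip.com").text)
print(requests.get("http://icanhazip.com", proxies = proxies).text)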

Appendix C: Legal and Ethical Considerations When Scraping the Web

The Robots Exclusion Standard
- Append /robots.txt to a site's URL, for example https://www.mlb.com/robots.txt:
User-agent identifies which users the rules apply to.
Allow and Disallow state whether the specified parts of the site may be accessed. (A programmatic check is sketched below.)

Google web cache
- If the site you want to reach is down, try entering http://webcache.googleusercontent.com/search?q=cache: followed by the site's URL to look for a Google-cached version, for example:
http://webcache.googleusercontent.com/search?q=cache:http://pythonscraping.com/
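
A minimal sketch of fetching such a cached copy with Requests (purely illustrative; Google may block or rate-limit automated requests to the cache):
import requests

# Build the cache-lookup URL for a page that is assumed to be unreachable
targetUrl = "http://pythonscraping.com/"
cacheUrl = "http://webcache.googleusercontent.com/search?q=cache:" + targetUrl

aRes = requests.get(cacheUrl)
print(aRes.status_code)
print(aRes.text[:500])  # beginning of the cached page, if one exists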
