Information technology moves at a breakneck pace, and web scraping sits right at the front line. Data protection, bot detection, conserving server resources, and the many clever tricks for prying data loose are all locked in competition with one another; at every moment someone is probing new ways to collect data while someone else works to stop them, so these techniques go stale very quickly. Even so, the author offers plenty of sound programming and problem-solving ideas that give us something to hold on to as the technology keeps changing.
The book is best read with some basic Python already under your belt. Having just finished Automate the Boring Stuff with Python: Practical Programming for Total Beginners by Al Sweigart, I found that the two books dovetail nicely: most of the packages and modules introduced in Automate the Boring Stuff are used again throughout Web Scraping with Python.
Chapter 7: Cleaning Your Dirty Data
Without cleaning the dirty data in code
import requests, bs4, csv

# split the text on spaces and collect every run of n consecutive words
def getNgrams(input, n):
    input = input.split(" ")
    output = []
    for i in range(len(input) - n + 1):
        output.append(input[i : i + n])
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
# write the raw, uncleaned 2-grams to a CSV file
aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for row in ngrams:
    aWriter.writerow(row)
aFile.close()
Cleaning the dirty data in code
import requests, bs4, re, string, csv
from collections import OrderedDict

# normalize whitespace, drop citation markers such as [1], strip punctuation,
# and keep only real words (plus the single-letter words "a" and "i")
def cleanInput(input):
    input = re.sub(r"\n+", " ", input)
    input = re.sub(r"\[[0-9]*\]", "", input)
    input = re.sub(r" +", " ", input)
    input = input.split(" ")
    cleanInput = []
    for i in input:
        item = i.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == "a" or item.lower() == "i"):
            cleanInput.append(item)
    return cleanInput

# count how many times each n-gram appears instead of just listing them
def getNgrams(input, n):
    input = cleanInput(input)
    output = dict()
    for i in range(len(input) - n + 1):
        newNgram = " ".join(input[i : i + n])
        if newNgram in output:
            output[newNgram] += 1
        else:
            output[newNgram] = 1
    return output

aRes = requests.get("http://en.wikipedia.org/wiki/Python_(programming_language)")
aRes.raise_for_status()
aRes.encoding = "utf-8"
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
content = aBeaSou.find("div", {"id": "mw-content-text"}).get_text()
ngrams = getNgrams(content, 2)
# dictionary sorted by value
# import operator
# ngrams = OrderedDict(sorted(ngrams.items(), key = operator.itemgetter(1), reverse = True))
ngrams = OrderedDict(sorted(ngrams.items(), key = lambda t: t[1], reverse = True))
aFile = open("ngrams.csv", "w", newline = "", encoding = "utf-8")
aWriter = csv.writer(aFile)
for k, v in ngrams.items():
    row = []
    row.append(k)
    row.append(v)
    aWriter.writerow(row)
aFile.close()
OpenRefine
• OpenRefine cleans data quickly and easily and reshapes it into formats that are easy to read and process. Before using it you must save your data as CSV. Download and installation link:
OpenRefine
The complete reference for GREL, the expression language you will need while working in OpenRefine:
OpenRefine Documentation For Users
• OpenRefine is operated through a web browser; note, however, that at the time of writing browsers such as Google Chrome and Microsoft Edge no longer supported it.
Chapter 8: Reading and Writing Natural Languages
Markov models
• This example uses a one-line list comprehension:
# old approach
filterEmpty = []
for word in words:
    if word != "":
        filterEmpty.append(word)
words = filterEmpty

# new approach
words = [word for word in words if word != ""]
import requests, random

# total number of occurrences recorded for a word's followers
def wordListSum(wordList):
    sum = 0
    for k, v in wordList.items():
        sum += v
    return sum

# pick the next word at random, weighted by how often it followed the current word
def retrieveRandomWord(wordList):
    randomIndex = random.randint(1, wordListSum(wordList))
    for k, v in wordList.items():
        randomIndex -= v
        if randomIndex <= 0:
            return k

# build a two-level dictionary: wordDict[previous word][next word] = count
def buildWordDict(text):
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    punctuation = [",", ".", ";", ":"]
    for i in punctuation:
        text = text.replace(i, " " + i + " ")
    words = text.split(" ")
    words = [word for word in words if word != ""]
    wordDict = {}
    for j in range(1, len(words)):
        if words[j - 1] not in wordDict:
            wordDict[words[j - 1]] = {}
        if words[j] not in wordDict[words[j - 1]]:
            wordDict[words[j - 1]][words[j]] = 0
        wordDict[words[j - 1]][words[j]] += 1
    return wordDict

aRes = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
aRes.raise_for_status()
wordDict = buildWordDict(aRes.text)
# generate a 100-word Markov chain starting from "I"
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)
Natural Language Toolkit (NLTK)
• At the command prompt, run pip install nltk to download and install NLTK.
• Open the NLTK downloader; installing all of its packages and corpora is recommended:
import nltk
nltk.download()
• List objects of words and sentences, and the Text object:
import nltk
text = "The dust was thick so he had to dust. Later, he had to groom for the groom."
a = nltk.word_tokenize(text)
print(a)
b = nltk.sent_tokenize(text)
print(b)
c = nltk.Text(a)
print(c)
• Load one of the built-in book texts as an example and use the FreqDist frequency-distribution object:
from nltk.book import *
from nltk import FreqDist
a = FreqDist(text6)
print(a)
print(a.most_common(10))
• Load one of the built-in book texts and use bigrams to build 2-gram combinations:
from nltk.book import *
from nltk import bigrams
a = bigrams(text6)
b = FreqDist(a)
print(b)
print(b[("Sir", "Robin")])
• Load one of the built-in book texts and use ngrams to build 4-gram combinations:
from nltk.book import *
from nltk import ngrams
a = ngrams(text6, 4)
b = FreqDist(a)
print(b)
print(b[("father", "smelt", "of", "elderberries")])
• Load one of the built-in book texts and find the 4-grams that begin with a given word:
from nltk.book import *
from nltk import ngrams
a = ngrams(text6, 4)
for i in a:
    if i[0] == "coconut":
        print(i)
• NLTK's lexical analysis with Penn Treebank part-of-speech tags:
import nltk
a = nltk.word_tokenize("The dust was thick so he had to dust. Later, he had to groom for the groom.")
print(nltk.pos_tag(a))
• More on the NLTK module and natural language processing:
NLTK 3.3 documentation
Natural Language Processing with Python
Natural Language Annotations for Machine Learning
Chapter 9: Crawling Through Forms and Logins
The Python Requests library
• Submitting a basic form:
# http://pythonscraping.com/pages/files/form.html
import requests
data = {"firstname": "Ryan", "lastname": "Mitchell"}
aRes = requests.post("http://pythonscraping.com/pages/files/processing.php", data = data)
print(aRes.text)
• Submitting files and images:
# http://pythonscraping.com/files/form2.html
import requests
files = {"uploadFile": open("D:\\test.png", "rb")}
aRes = requests.post("http://pythonscraping.com/pages/files/processing2.php", files = files)
print(aRes.text)
• Handling logins and cookies:
# http://pythonscraping.com/pages/cookies/login.html
import requests
data = {"username": "Ryan", "password": "\"password\""}
aRes = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aRes.cookies.get_dict())
bRes = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies = aRes.cookies)
print(bRes.text)
• Handling logins and cookies with a session:
# more complex sites often change their cookies without warning; the Session feature of the Requests package handles these situations
# http://pythonscraping.com/pages/cookies/login.html
import requests
session = requests.Session()
data = {"username": "Ryan", "password": "\"password\""}
aSes = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data = data)
print(aSes.cookies.get_dict())
bSes = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(bSes.text)
• HTTP basic access authentication:
# the auth module of the Requests package is dedicated to handling HTTP authentication
# http://pythonscraping.com/pages/auth/login.php
import requests
auth = requests.auth.HTTPBasicAuth("Ryan", "password")
aRes = requests.post("http://pythonscraping.com/pages/auth/login.php", auth = auth)
print(aRes.text)
Chapter 10: Scraping JavaScript
JavaScript
• In most cases you will encounter only two client-side languages on the web: ActionScript, used by Flash applications, and JavaScript. Commonly seen JavaScript libraries include jQuery, Google Analytics, and Google Maps.
• Ajax (Asynchronous JavaScript and XML) is a technique for sending data to, or receiving data from, a web server without requesting a whole new page.
• Dynamic HTML (DHTML) refers to HTML and CSS content that changes as client-side scripts modify the page.
• An ordinary scraper cannot run the JavaScript behind Ajax and dynamic HTML. There are only two ways around this: scrape the content directly out of the JavaScript, or use a Python package that can execute JavaScript itself.
• Running JavaScript from Python with the Selenium package:
At the command prompt, run pip install selenium to download and install Selenium.
selenium PyPI
Selenium - Web Browser Automation
Using PhantomJS, a headless browser, so the program can run quietly in the background (Selenium dropped PhantomJS support in 2017 and PhantomJS suspended development in 2018, so Headless Chrome or Headless Firefox is now recommended instead; see the sketch after the link below):
PhantomJS - Scriptable Headless Browser
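Since PhantomJS is no longer maintained, here is a minimal sketch of the same idea with Headless Chrome, assuming a Selenium 3-era API and the same D:\chromedriver.exe path used in the examples below:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# run Chrome without opening a visible browser window
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(r"D:\chromedriver.exe", options = options)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()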
Loading data with Ajax
• Loading it with ChromeDriver and a Selenium selector:
from selenium import webdriver
import time
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()
• Loading it with PhantomJS and BeautifulSoup:
from selenium import webdriver
import time, bs4
driver = webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
aBeaSou = bs4.BeautifulSoup(driver.page_source, "html.parser")
print(aBeaSou.find(id = "content").get_text())
driver.close()
Repeatedly checking whether the page has finished loading
• Waiting for Ajax-loaded data:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # see the locator selection strategies described below
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    # print(driver.find_element(By.ID, "content").text)
    print(driver.find_element_by_id("content").text)
    driver.close()
• Handling client-side redirects:
# server-side redirects can be handled with the ordinary urllib library
from selenium import webdriver
import time

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
element = driver.find_element_by_tag_name("html")
count = 0
while True:
    count += 1
    if count > 20:
        print("Timing Out After 10 Seconds and Returning")
        break
    time.sleep(0.5)
    # the <html> element is replaced once the redirect happens
    if element != driver.find_element_by_tag_name("html"):
        break
print(driver.page_source)
driver.close()
Locator selection strategies
• The locator strategies below can be used with the By object:
(By.ID, "test")
(By.CLASS_NAME, "test")
(By.CSS_SELECTOR, "#idName"), (By.CSS_SELECTOR, ".className"), (By.CSS_SELECTOR, "tagName")
(By.LINK_TEXT, "test")
(By.PARTIAL_LINK_TEXT, "test")
(By.NAME, "test")
(By.TAG_NAME, "p")
(By.XPATH, "//div")
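The same (By, value) pairs also work directly with find_element; a minimal sketch reusing the ajaxDemo page from above:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
# equivalent to driver.find_element_by_id("content")
print(driver.find_element(By.ID, "content").text)
driver.close()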
• XPath (XML Path) syntax:
XPath is a syntax for navigating and selecting parts of an XML document, and it rests on four main ideas:
1. Root nodes versus non-root nodes:
/div selects a div tag only if it sits at the root of the document
//div selects every div tag anywhere in the document
2. Selecting by attribute:
//@href selects every node that carries an href attribute
//a[@href = "http://google.com"] selects every link pointing at the Google home page
3. Selecting nodes by position:
//a[3] selects the third link in the document
//table[last()] selects the last table in the document
//a[position() < 3] selects the first two links in the document
4. The asterisk (*) matches any element or node:
//table/tr/* selects all children of the tr tags in every table
//div[@*] selects every div tag that has at least one attribute
More on XPath syntax:
XPath Syntax
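Selenium accepts these XPath expressions through By.XPATH as well; a small sketch of my own (not from the book) against the Monty Python article used later in the Chapter 13 tests:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
# every link whose href attribute starts with "/wiki/"
links = driver.find_elements(By.XPATH, "//a[starts-with(@href, '/wiki/')]")
print(len(links))
# the first h1 anywhere in the document
print(driver.find_element(By.XPATH, "//h1").text)
driver.close()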
Chapter 11: Image Processing and Text Recognition
Libraries commonly used for recognizing CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart)
• The Pillow image-processing library:
At the command prompt, run pip install pillow to download and install Pillow.
Pillow resources:
Pillow (PIL Fork) 5.2.0 documentation
• The Tesseract optical character recognition library:
Unlike a library you pull in with an import statement, Tesseract is a command-line program. On Windows, look for the installer and select the language packs you want during installation:
Downloads – tesseract-ocr
Tesseract has no graphical interface, and it needs environment variables so the system knows where its data files live; use the install path as the value: Windows Settings → Edit the system environment variables → System Properties → Environment Variables → System variables:
Add a system variable: name TESSDATA_PREFIX, value C:\Tesseract-OCR
Edit the Path variable: add the value C:\Tesseract-OCR
Tesseract resources:
pytesseract PyPI
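If you would rather drive Tesseract from Python than from the command prompt, the pytesseract wrapper listed above can do it. A minimal sketch, assuming pip install pytesseract, the install path above, and the same image file used in the examples below:
from PIL import Image
import pytesseract

# point the wrapper at the Tesseract executable installed above
pytesseract.pytesseract.tesseract_cmd = r"C:\Tesseract-OCR\tesseract.exe"
print(pytesseract.image_to_string(Image.open(r"D:\textOriginal.png")))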
• The NumPy math library:
At the command prompt, run pip install numpy to download and install NumPy. It is used when training Tesseract to recognize additional characters or new fonts.
Image processing
• Gaussian blur:
from PIL import Image, ImageFilter
kitten = Image.open("zophie.png")
blurryKitten = kitten.filter(ImageFilter.GaussianBlur)
blurryKitten.save("zophie_blurred.png")
blurryKitten.show()
• A threshold filter that strips out the gray background and makes the text stand out:
from PIL import Image
badImage = Image.open("textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("textCleaned.png")
Optical character recognition
• Running Tesseract from the command prompt on neatly formatted text:
# read a text image in the current working directory and write the result to a .txt file
tesseract textOriginal.png textOutput
# read a text image in a specific directory and write the result there
tesseract D:\textOriginal.png D:\textOutput
# print the result directly instead of saving it to a .txt file
tesseract textOriginal.png stdout
# English is recognized by default, but it can also be requested explicitly
tesseract textOriginal.png textOutput -l eng
# with the language pack installed, recognize Traditional Chinese instead
tesseract textOriginal.png textOutput -l chi_tra
• Combining image processing with recognition of a text image file:
from PIL import Image
import subprocess
badImage = Image.open("D:\\textBad.png")
cleanedImage = badImage.point(lambda x: 0 if x < 143 else 255)
cleanedImage.save("D:\\textCleaned.png")
subprocess.call(["tesseract", "D:\\textCleaned.png", "D:\\textOutput"])
textFile = open("D:\\textOutput.txt", "r")
print(textFile.read())
textFile.close()
• Training Tesseract:
A commonly used CAPTCHA that makes a good source of training samples:
Drupal CAPTCHA
A tool that makes it easy to produce the bounding-box (.box) files the training samples need:
Tesseract OCR Chopper
The book's author wrote a Python tool for this and suggests roughly 100 sample files to ensure there is enough training data:
tesseract-trainer
More on training Tesseract:
Training Tesseract
Chapter 12: Avoiding Scraping Traps
Looking like a human instead of a bot
• Adjust your headers; it is worth comparing the headers a plain Python scraper sends with the ones an ordinary browser sends:
import requests, bs4

session = requests.Session()
url = "https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
aSes = session.get(url, headers = headers)
aBeaSou = bs4.BeautifulSoup(aSes.text, "html.parser")
print(aBeaSou.find("table", {"class": "table-striped"}).get_text())
• Handle cookies; the author suggests studying the cookies a site sets and thinking about which of them your scrapers need to deal with:
from selenium import webdriver
aDriver = webdriver.Chrome(r"D:\chromedriver.exe")
bDriver = webdriver.Chrome(r"D:\chromedriver.exe")
aDriver.get("http://pythonscraping.com")
aDriver.implicitly_wait(1)
savedCookies = aDriver.get_cookies()
print(savedCookies)
bDriver.get("http://pythonscraping.com")
bDriver.implicitly_wait(1)
bDriver.delete_all_cookies()
for cookie in savedCookies:
    bDriver.add_cookie(cookie)
print(bDriver.get_cookies())
aDriver.close()
bDriver.close()
Common form protection mechanisms
• Hidden input fields: an ordinary browser submits the contents of hidden fields automatically, while a plain Python scraper easily forgets to send them along (a sketch of handling them follows).
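A minimal sketch of dealing with hidden fields using Requests and BeautifulSoup: read every input in the form, keep the pre-filled hidden values, and send them back together with your own data (the URL and field names here are only illustrative):
import requests, bs4

session = requests.Session()
formUrl = "http://example.com/login"  # hypothetical form page
aRes = session.get(formUrl)
aBeaSou = bs4.BeautifulSoup(aRes.text, "html.parser")
# collect every named input, hidden ones included, with its default value
data = {}
for field in aBeaSou.find("form").find_all("input"):
    if field.get("name"):
        data[field["name"]] = field.get("value", "")
# overwrite only the visible fields you actually mean to fill in
data["username"] = "Ryan"
data["password"] = "password"
bRes = session.post(formUrl, data = data)
print(bRes.status_code)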
• Avoiding honeypots: hidden fields and links that a plain Python scraper will happily fill in or follow are never touched during ordinary browsing:
# Selenium actually renders the pages it visits, so is_displayed() can tell
# whether an element is really visible on the page
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/itsatrap.html")
links = driver.find_elements_by_tag_name("a")
for link in links:
    if not link.is_displayed():
        print("The Link " + link.get_attribute("href") + " Is a Trap")
fields = driver.find_elements_by_tag_name("input")
for field in fields:
    if not field.is_displayed():
        print("Do Not Change Value of " + field.get_attribute("name"))
Chapter 13: Testing Your Website with Scrapers
Testing with unittest, Python's built-in unit-testing module
• The setUp() function runs before each individual test, and tearDown() runs after each one:
import unittest

class TestAddition(unittest.TestCase):
    def setUp(self):
        print("Setting Up the Test")
    def tearDown(self):
        print("Tearing Down the Test")
    def test_twoPlusTwo(self):
        total = 2 + 2
        self.assertEqual(4, total)

if __name__ == "__main__":
    unittest.main()
• The setUpClass() function runs only once, before any of the tests in the class:
import requests, bs4, unittest

class TestWikipedia(unittest.TestCase):
    aBeaSou = None
    def setUpClass():
        global aBeaSou
        url = "http://en.wikipedia.org/wiki/Monty_Python"
        aBeaSou = bs4.BeautifulSoup(requests.get(url).text, "html.parser")
    def test_titleText(self):
        global aBeaSou
        pageTitle = aBeaSou.find("h1").get_text()
        self.assertEqual("Monty Python", pageTitle)
    def test_contentExists(self):
        global aBeaSou
        content = aBeaSou.find("div", {"id": "mw-content-text"})
        self.assertIsNotNone(content)

if __name__ == "__main__":
    unittest.main()
Testing with Selenium assertions
from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://en.wikipedia.org/wiki/Monty_Python")
assert "Monty Python" in driver.title
driver.close()
Other commonly used Selenium features
• Interacting with the site:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://pythonscraping.com/pages/files/form.html")
firstnameField = driver.find_element_by_name("firstname")
lastnameField = driver.find_element_by_name("lastname")
submitButton = driver.find_element_by_id("submit")
""" 方法一
firstnameField.send_keys("Ryan")
lastnameField.send_keys("Mitchell")
submitButton.click()
"""
# Approach 2: chain the actions together with ActionChains
actions = ActionChains(driver).click(firstnameField).send_keys("Ryan").click(lastnameField).send_keys("Mitchell").send_keys(Keys.RETURN)
actions.perform()
print(driver.find_element_by_tag_name("body").text)
driver.close()
• Drag and drop:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get('http://pythonscraping.com/pages/javascript/draggableDemo.html')
print(driver.find_element_by_id("message").text)
element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()
print(driver.find_element_by_id("message").text)
driver.close()
• Taking a screenshot:
from selenium import webdriver
driver = webdriver.Chrome(r"D:\chromedriver.exe")
driver.get("http://www.pythonscraping.com/")
driver.get_screenshot_as_file(r"D:\test.png")
driver.close()
Chapter 14: Scraping Remotely
Why use remote hosts
• Reason 1: you need a different IP address:
The whole point of using Tor here is to change your IP address:
Tor (The Onion Router)
PySocks can redirect traffic through a proxy server and works with Tor, which listens on port 9150 by default:
At the command prompt, run pip install PySocks==1.5.0 to download and install PySocks.
PySocks PyPI
http://icanhazip.com displays the IP address of whichever client connects to it.
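A minimal sketch of putting these pieces together, assuming the Tor Browser is running locally so its SOCKS proxy is listening on port 9150:
import socks, socket
from urllib.request import urlopen

# route every new socket through the Tor proxy on localhost:9150
socks.set_default_proxy(socks.SOCKS5, "localhost", 9150)
socket.socket = socks.socksocket
# icanhazip.com echoes back the IP address it sees, which should now be a Tor exit node
print(urlopen("http://icanhazip.com").read())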
• Reason 2: you need more processing power and flexibility:
Option 1: run the scraper from a web-hosting account.
Option 2: run it from the cloud:
Resources on using Python and JavaScript on Google's cloud computing platform:
Google Compute Engine
Resources on using Amazon Web Services:
Python and AWS Cookbook
Appendix C: The Legalities and Ethics of Web Scraping
Robots Exclusion Standard
• Append /robots.txt to a site's address, for example https://www.mlb.com/robots.txt:
User-agent identifies which user agents the rules apply to.
Allow and Disallow state whether access to the specified parts of the site is permitted (a short robotparser sketch follows).
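Python's standard library can read these rules for you; a short sketch using urllib.robotparser (my own addition, not something the book covers):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.mlb.com/robots.txt")
rp.read()
# can_fetch() reports whether the given user agent may crawl the given URL
print(rp.can_fetch("*", "https://www.mlb.com/"))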
Google's web cache
• If the site you want to visit is unreachable, try http://webcache.googleusercontent.com/search?q=cache: followed by the site's address to see whether a Google-cached copy exists, for example (a tiny sketch of building the URL follows):
http://webcache.googleusercontent.com/search?q=cache:http://pythonscraping.com/
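A tiny sketch of building that cache URL programmatically (whether a cached copy actually exists is up to Google, so this may simply return an error page):
import requests

target = "http://pythonscraping.com/"
cacheUrl = "http://webcache.googleusercontent.com/search?q=cache:" + target
aRes = requests.get(cacheUrl)
print(aRes.status_code)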