Thursday, January 17, 2019

Study Notes on the Python Data Science Handbook (1): NumPy & Pandas

  After learning so many Python language basics and the fundamentals of web scraping, I still had not really entered the field of data science before picking up this book; I was only wandering around the world of basic programming. Even so, the automation that basic programming brings has genuinely freed us from the shackles of tedious work.

  At work or in daily life we inevitably need to clean, filter, concatenate, merge, and aggregate multi-column tabular data, and then plot charts to grasp the overall shape of the numbers. I ran into exactly this kind of problem at work: traditional program loops can get the job done, but in the era of big data they are extremely inefficient. I once wrote a simple triple-nested loop for an ad hoc data request, and a mere eight thousand or so records took about five minutes to produce a result. Had anything gone wrong along the way, another five minutes would have been wasted, to say nothing of processing tens of thousands, hundreds of thousands, or truly big-data-sized records, which would consume enormous amounts of time and computing resources.

  Data science is a discipline that keeps gaining prominence as big data and machine learning rise. This book by Jake VanderPlas, translated by 何敏煌, Python Data Science Handbook: Essential Tools for Working with Data, is a good entry point into the field. The author explains in detail how to use the four major data-science-related Python packages, NumPy, Pandas, Matplotlib, and Scikit-Learn, and demonstrates their flexibility and computational efficiency through a variety of examples.

  The book suits readers who already have basic Python skills, and you will probably need to organize its key points yourself: because every feature of every package is explained in detail, the layout can feel somewhat cluttered, and without the habit of sorting out the program logic on your own the book may seem chaotic and its many powerful tools hard to master.

Chapter 1: IPython: Beyond Normal Python

What the Book Covers
●   Packages used in the book:
IPython: for interactive execution and sharing of code.
NumPy: for homogeneous, array-based data.
Pandas: for heterogeneous and labeled data.
Matplotlib: for publication-quality visualization.
Scikit-Learn: for machine learning.
SciPy: for general scientific computing.

At the Anaconda Prompt, type ipython or jupyter notebook to start a session.
●   IPython documentation:
# Use ? to display an object's documentation
help(len)
len?
# Use ?? to display the source code
def square(a):
    return a ** 2
square??
●   IPython tab completion:
# Use the Tab key to explore the contents of a module
from bs4 import Beautiful<tab>
# Use the * wildcard together with ? to explore matching names
example = "Hello World"
*ample?
●   IPython input and output history:
In [6]: import math
math.sin(2)
Out[6]: 0.9092974268256817
In [7]: math.cos(2)
Out[7]: -0.4161468365471424
In [8]: # The In object is a list, the Out object is a dictionary
Out[6] ** 2 + Out[7] ** 2
Out[8]: 1.0
In [9]: # Out[6] is the same as _6
_6 ** 2 + _7 ** 2
Out[9]: 1.0

In [10]: # Use n underscores as a shortcut for the n-th previous output
print(_)
print(__)
print(___)
●   Suppressing IPython output:
# Ending a statement with a semicolon runs it quietly, without displaying the result
In [26]: 2 + 3;
In [27]: 26 in Out
Out[27]: False

Shell Commands in IPython
●   Anything placed after a ! exclamation mark is executed on the operating system command line:
# Print the contents of the current working directory
!dir
# Print the path of the current working directory
directory = !cd
print(directory)
# echo works much like Python's print() function
message = "Hello World"
!echo {message}

IPython Magic Commands
●   Commands starting with % are line magics, which operate on a single line of input; commands starting with %% are cell magics, which operate on multiple lines of input (a %%timeit example appears after the timing bullet below).
●   Magic command documentation:
# Print the documentation of the magic system
%magic
# List all available magic commands
%lsmagic
●   Running external code:
%run myScript.py
●   Timing code:
# %time measures the execution time of a single run
# %timeit measures the execution time over repeated runs
%time a = [i ** 2 for i in range(1000)]
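As a small illustrative sketch (assuming an IPython or Jupyter session), %timeit repeats a statement many times and reports a representative timing, and the %%timeit cell magic mentioned earlier does the same for a whole cell (it must be the first line of that cell):
%timeit sum(range(1000))

%%timeit
total = 0
for i in range(1000):
    total += i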
l   Shell相關的Magic命令:
# %automagic
函式啟用之下,如同在作業系統的命令提示字元一樣
%automagic on
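A minimal sketch of what this enables (assuming a typical IPython session where no variables shadow these names): line magics such as %pwd and %cd can then be typed without the prefix.
# With automagic on, these line magics work without the % prefix
pwd
cd ..
pwd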

Chapter 2: Introduction to NumPy

Arrays
●   Python's fixed-type arrays:
import array
print(array.array("i", list(range(10))))
●   NumPy's fixed-type arrays:
import numpy as np
print(np.array([1, 2, 3.14, 4, 5]))
●   Creating NumPy arrays from scratch:
import numpy as np
# A 1*5 integer array filled with 0
print(np.zeros(5, dtype = int))
# A 3*5 floating-point array filled with 1
print(np.ones((3, 5), dtype = float))
# A 3*5 array filled with 3.14
print(np.full((3, 5), 3.14))
# Values from 0 up to (but not including) 20, in steps of 2
print(np.arange(0, 20, 2))
# 5 values evenly spaced between 0 and 1
print(np.linspace(0, 1, 5))
# A 3*5 array of random values between 0 and 1
print(np.random.random((3, 5)))
# A 3*5 array of random values between 0 and 1
print(np.random.rand(3, 5))
# A 3*5 array drawn from a normal distribution with mean 0 and standard deviation 1
print(np.random.normal(0, 1, (3, 5)))
# A 3*5 array drawn from the standard normal distribution
print(np.random.randn(3, 5))
# A 3*5 array of random integers between 0 and 10
print(np.random.randint(0, 10, (3, 5)))
# A 3*3 identity matrix
print(np.eye(3))
# An uninitialized array of length 3
print(np.empty(3))

NumPy Array Basics
●   NumPy array attributes:
import numpy as np
a3 = np.random.randint(10, size = (3, 4, 5))
print(a3.ndim)
print(a3.shape)
print(a3.size)
print(a3.dtype)
●   NumPy array indexing:
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
print(a2)
print(a2[2, 1])
●   NumPy array fancy indexing:
# Fancy indexing returns the broadcast shape of the index arrays, not the shape of the array being indexed
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
print(a2)
print(a2[2, [1]])
print(a2[[2, 1]])
print(a2[[2, 0], [1, 2]])
ind = np.array([[2, 0], [1, 2]])
print(a2[ind])

# The following examples rely on array reshaping and broadcasting
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
print(a2[row, col])
print(a2[row[:, np.newaxis], col])
●   NumPy array slicing:
import numpy as np
a1 = np.arange(10)
print(a1)
# a1[start:stop:step]
print(a1[::2])
print(a1[1::2])
print(a1[::-1])
print(a1[5::-2])
●   NumPy array copying (slices are views; use .copy() for an independent copy):
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
a2Same = a2[:2, :2]
a2Same[0, 0] = 11
print(a2)
print(a2Same)
a2Copy = a2[:2, :2].copy()
a2Copy[0, 0] = 21
print(a2)
print(a2Copy)
●   NumPy array reshaping:
import numpy as np
a1 = np.array([1, 2, 3])
print(a1.reshape((1, 3)))
print(a1.reshape((3, 1)))
print(a1[np.newaxis, :])
print(a1[:, np.newaxis])
a2 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.ravel(a2))
●   NumPy array concatenation:
import numpy as np
a2 = np.array([[1, 2, 3], [4, 5, 6]])
# Concatenate along the first axis (vertically)
print(np.concatenate([a2, a2]))
# Concatenate along the second axis (horizontally)
print(np.concatenate([a2, a2], axis = 1))
# Concatenate along the first axis (vertically)
print(np.vstack([a2, a2]))
# Concatenate along the second axis (horizontally)
print(np.hstack([a2, a2]))
# numpy.dstack() concatenates along the third axis (see the sketch below)
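A minimal sketch of numpy.dstack(), reusing the 2*3 array from above:
import numpy as np
a2 = np.array([[1, 2, 3], [4, 5, 6]])
# Stacking two 2*3 arrays along the third axis gives shape (2, 3, 2)
print(np.dstack([a2, a2]).shape)
print(np.dstack([a2, a2]))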
●   NumPy array splitting:
import numpy as np
# Use numpy.split() to split a one-dimensional array
a1 = np.arange(10)
a11, a12, a13 = np.split(a1, [3, 5])
print(a11, a12, a13)
# Use numpy.vsplit() and numpy.hsplit() to split a two-dimensional array
a2 = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(a2, [2])
print(upper, lower)
left, right = np.hsplit(a2, [2])
print(left, right)
# numpy.dsplit() splits arrays with three or more dimensions along the third axis (see the sketch below)
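A minimal sketch of numpy.dsplit() on a three-dimensional array:
import numpy as np
a3 = np.arange(16).reshape((2, 2, 4))
# Split along the third axis at index 2: both halves have shape (2, 2, 2)
front, back = np.dsplit(a3, [2])
print(front.shape, back.shape)
print(front)
print(back)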

Computation on NumPy Arrays
●   Universal Functions (UFuncs):
Vectorization with UFuncs is designed to push the loop into NumPy's compiled layer, which makes execution much faster; more specialized and obscure UFuncs are available in the scipy.special submodule.
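A small sketch contrasting an explicit Python loop with the equivalent UFunc expression, plus one scipy.special function (timings are best compared with %timeit in IPython):
import numpy as np
from scipy import special

values = np.random.random(100000)

# Explicit Python loop: one interpreted iteration per element
def loop_reciprocal(arr):
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = 1.0 / arr[i]
    return out

# UFunc version: the same loop runs in NumPy's compiled layer
vectorized = 1.0 / values
print(np.allclose(loop_reciprocal(values), vectorized))

# scipy.special provides more specialized UFuncs, e.g. the gamma function
print(special.gamma(np.array([1, 2, 3, 4])))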
●   Logical operators:
and and or evaluate the truth of an entire object, whereas & and | operate on the individual bits (elements) of an object, so in NumPy you almost always use & and |.
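A short sketch of the difference:
import numpy as np
a = np.arange(5)
# & and | combine boolean arrays element by element
print((a > 1) & (a < 4))
print((a < 1) | (a > 3))
# and/or ask for a single truth value of the whole array and raise an error instead
# print((a > 1) and (a < 4))   # ValueError: truth value of an array is ambiguous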
●   NumPy array arithmetic and aggregation:
import numpy as np
a1 = np.random.random(10)
# Plain Python built-in: slower, not recommended
print(sum(a1))
# NumPy function: faster, preferred
print(np.sum(a1))
# NumPy method on the array object: faster, preferred
print(a1.sum())
●   NumPy array broadcasting:
# Rule 1 - the array with fewer dimensions is padded with ones on its leading (left) side
# Rule 2 - any dimension of size 1 is stretched to match the other array
# Rule 3 - if the shapes still disagree after the first two rules, an error is raised
import numpy as np
a1 = np.array([0, 1, 2])
a2 = np.ones((3, 3))
print(a1 + 5)
print(a1 + a2)
print(a1 + a1[:, np.newaxis])
●   NumPy array comparisons:
import numpy as np
a = np.arange(1, 10)
print(a < 5)
print(a[a < 5])
# numpy.count_nonzero() counts how many values are True
print(np.count_nonzero(a < 5))
# False counts as 0 and True counts as 1
print(np.sum(a < 5))
# numpy.any() returns whether any value satisfies the condition
print(np.any(a > 9))
# numpy.all() returns whether all values satisfy the condition
print(np.all(a > 0))
# numpy.where() returns values chosen according to the condition
print(np.where(a > 5, "Bigger than Five", "Not Bigger than Five"))
●   NumPy array assignment with fancy indexing:
import numpy as np
# Assigning with fancy indexing: a repeated index position is assigned 4 first and then 6
a = np.zeros(10)
a[[0, 0]] = [4, 6]
print(a)
# Assigning with fancy indexing and +=: repeated index positions are only incremented once, because the expression is evaluated once
b = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]
b[i] += 1
print(b)
# With the ufunc .at() method, repeated index positions are modified repeatedly
c = np.ones(10)
j = [3, 4, 4, 5, 5, 5]
np.add.at(c, j, 1)
print(c)
●   NumPy array sorting:
import numpy as np
# Sort the elements with numpy.sort()
np.random.seed(0)
a = np.random.randint(10, size = 5)
print(a)
print(np.sort(a))
# numpy.argsort() returns the indices of the sorted elements
bRand = np.random.RandomState(0)
b = bRand.randint(10, size = 5)
print(b)
print(np.argsort(b))
print(b[np.argsort(b)])
●   NumPy array partitioning (partial sorting):
import numpy as np
a = np.array([7, 2, 3, 1, 6, 5, 4])
# numpy.partition() puts the 3 smallest values on the left
print(np.partition(a, 3))
# numpy.argpartition() returns the indices of the partitioned elements
print(np.argpartition(a, 3))
print(a[np.argpartition(a, 3)])

NumPy Array Computation Techniques
●   The out argument specifies where to write the output:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.zeros(10)
np.multiply(a, 3, out = b[::2])
print(b)
●   The axis argument specifies which axis to operate along:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(np.min(a, axis = 0))
print(a.max(axis = 1))
●   The .reduce() and .accumulate() ufunc methods compute aggregates:
import numpy as np
a = np.arange(1, 10)
# .reduce() keeps only the final result
print(np.add.reduce(a))
# .accumulate() keeps the intermediate results
print(np.add.accumulate(a))
●   The .outer() ufunc method computes outer products:
# A 9*9 multiplication table
import numpy as np
a = np.arange(1, 10)
print(np.multiply.outer(a, a))
●   Binning data:
import numpy as np
# Prepare the raw data
rawData = np.random.randn(100)
# Prepare the bin edges
bins = np.linspace(-5, 5, 20)
# Prepare the array of bin counts
counts = np.zeros_like(bins)
# Find the bin index of each raw data point
ind = np.searchsorted(bins, rawData)
# Accumulate the bin counts
np.add.at(counts, ind, 1)
print(counts)
# numpy.histogram() gives the same result
counts2, bins2 = np.histogram(rawData, bins)
print(counts2)

NumPy Structured Arrays
●   NumPy data types:
Character    Description                 Example
b            Byte                        np.dtype("b")
i            Signed integer              np.dtype("i4") == np.int32
u            Unsigned integer            np.dtype("u1") == np.uint8
f            Floating point              np.dtype("f8") == np.float64
c            Complex floating point      np.dtype("c16") == np.complex128
S, a         String                      np.dtype("S5")
U            Unicode string              np.dtype("U") == np.str_
V            Raw data (void)             np.dtype("V") == np.void
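A quick sketch of these type codes in use:
import numpy as np
# The short codes and the NumPy type objects are interchangeable
print(np.dtype("i4") == np.int32)
print(np.zeros(3, dtype = "f8"))
print(np.zeros(3, dtype = "u1"))
print(np.array(["abc", "de"], dtype = "U5").dtype)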
●   Structured arrays and record arrays:
import numpy as np
name = ["Alice", "Bob", "Cathy", "Doug"]
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Structured array
data = np.zeros(4, dtype = {"names": ("name", "age", "weight"), "formats": ("U10", "i4", "f8")})
data["name"] = name
data["age"] = age
data["weight"] = weight

print(data.dtype)
print(data)
print(data["name"])
print(data[0])
print(data[-1]["name"])

# Record array
dataRec = data.view(np.recarray)
print(dataRec.name)

Chapter 3: Data Manipulation with Pandas

Introducing Pandas Objects - The Series Object
●   Creating Series objects from scratch:
import pandas as pd
# Data is a list, no index specified
a = pd.Series([0.8, 0.2])
print(a)
# Data is a list, with an explicit index
b = pd.Series([0.8, 0.2], index = ["Ind_A", "Ind_B"])
print(b)
# Data is a scalar, repeated to fill the specified index
c = pd.Series(5, index = ["Ind_A", "Ind_B"])
print(c)
# Data is a dictionary
d = pd.Series({"Ind_A": 0.8, "Ind_B": 0.2})
print(d)
# Data is a dictionary, with an index that selects a subset of keys
e = pd.Series({2: "Two", 1: "One", 3: "Three"}, index = [3, 2])
print(e)
●   Series object attributes and methods:
import pandas as pd
data = pd.Series(["a", "b", "c"])
print(data.values)
print(data.index)
print(data.keys())
print(list(data.items()))
●   Series object indexing and slicing:
import pandas as pd
data = pd.Series(["a", "b", "c"])
print(data[1])
print(data[1:3])
●   Series object indexing and slicing with Indexers:
import pandas as pd
data = pd.Series(["a", "b", "c"], index = [1, 3, 5])
# The loc attribute always indexes and slices by the explicit index
print(data.loc[1])
print(data.loc[1:3])
# The iloc attribute always indexes and slices by the implicit Python-style (positional) index
print(data.iloc[1])
print(data.iloc[1:3])
●   Selecting data from a Series object:
import pandas as pd
data = pd.Series({"Ind_A": 0.2, "Ind_B": 0.8, "Ind_C": 0.4, "Ind_D": 0.6})
# Selecting with slices
print(data["Ind_B":"Ind_C"])
print(data[1:3])
# Selecting with a mask
print(data[(data > 0.3) & (data < 0.7)])
# Selecting with fancy indexing
print(data[["Ind_C", "Ind_D"]])
●   Adding, modifying, and deleting data in a Series object:
import pandas as pd
data = pd.Series({"Ind_A": 0.8, "Ind_B": 0.2})
# Add data
data["Ind_C"] = 0.4
# Modify data
data["Ind_B"] = 0.1
# Delete data
data.pop("Ind_A")
print(data)

Introducing Pandas Objects - The DataFrame Object
●   Creating DataFrame objects from scratch:
import numpy as np
import pandas as pd

# The source Series objects
population = pd.Series({"California": 38332521, "Texas": 26448193, "New York": 19651127, "Florida": 19552860, "Illinois": 12882135})
area = pd.Series({"California": 423967, "Texas": 695662, "New York": 141297, "Florida": 170312, "Illinois": 149995})

# Constructing a DataFrame from a dictionary of Series objects
a = pd.DataFrame({"Population": population, "Area": area})
print(a)
# Constructing a DataFrame from a single Series object
b = pd.DataFrame(population, columns = ["Population"])
print(b)

# Constructing a DataFrame from a list of dictionaries
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
c = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(c)
# Constructing a DataFrame from a two-dimensional NumPy array
npTwoData = np.random.rand(3, 2)
d = pd.DataFrame(npTwoData, columns = ["Col_A", "Col_B"], index = ["Ind_A", "Ind_B", "Ind_C"])
print(d)
# Constructing a DataFrame from a NumPy structured array
npStrucData = np.zeros(3, dtype = [("Col_A", "i8"), ("Col_B", "f8")])
e = pd.DataFrame(npStrucData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(e)
●   DataFrame object attributes:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(data.values)
print(data.index)
print(data.columns)
print(data.T)
●   Selecting data from a DataFrame object:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
# Selecting elements from the underlying values array
print(data.values[1])
print(data.values[1, 1])
# Selecting columns
print(data["Col_A"])
print(data.Col_B)
# Slicing with Indexers
print(data.loc["Ind_B":, "Col_B":])
print(data.iloc[:2, :1])
# Selecting with a mask
print(data.loc[data["Col_B"] >= 2])
# Selecting with fancy indexing
print(data.loc[["Ind_B", "Ind_C"]])
●   Adding, modifying, and deleting data in a DataFrame object:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
# Add a column
data["Col_C"] = (data["Col_B"] - data["Col_A"]) ** 2
# Modify data
data.iloc[0, 0] = 9
# Delete a column
data.pop("Col_A")
print(data)

Introducing Pandas Objects - The Index Object
●   An Index object behaves almost like a NumPy array (see the sketch below), except that it is immutable:
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
# This raises an error
# aInd[1] = 4
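A small sketch of the NumPy-like side of Index objects:
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
# Index objects support NumPy-style attributes and slicing
print(aInd[1])
print(aInd[::2])
print(aInd.size, aInd.shape, aInd.ndim, aInd.dtype)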
●   Index objects as ordered sets:
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
bInd = pd.Index([2, 3, 5, 7, 11])
# Intersection
print(aInd & bInd)
# Union
print(aInd | bInd)
# Symmetric difference
print(aInd ^ bInd)

Introducing Pandas Objects - The MultiIndex Object
●   Creating MultiIndex objects from scratch:
import numpy as np
import pandas as pd

# Several ways to construct a MultiIndex object
aInd = pd.MultiIndex.from_arrays([["A", "A", "B", "B"], [1, 2, 1, 2]])
print(aInd)
bInd = pd.MultiIndex.from_tuples([("A", 1), ("A", 2), ("B", 1), ("B", 2)])
print(bInd)
cInd = pd.MultiIndex.from_product([["A", "B"], [1, 2]])
print(cInd)
dInd = pd.MultiIndex(levels = [["A", "B"], [1, 2]], labels = [[0, 0, 1, 1], [0, 1, 0, 1]])
print(dInd)

# Applying a MultiIndex to a Series object
aData = pd.Series(np.random.rand(4), index = aInd)
print(aData)
# Applying a MultiIndex to a DataFrame object
bData = pd.DataFrame(np.random.rand(4, 2), columns = ["Col_A", "Col_B"], index = bInd)
print(bData)

# Building a hierarchical index directly on a Series object
cData = pd.Series({("A", 1): 0.25, ("A", 2): 0.5, ("B", 1): 0.75, ("B", 2): 1.0})
print(cData)
# Building a hierarchical index directly on a DataFrame object
dData = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
print(dData)
●   Selecting and transforming data in a Series with a MultiIndex:
import pandas as pd

ind = [("California", 2000), ("California", 2010), ("New York", 2000), ("New York", 2010), ("Texas", 2000), ("Texas", 2010)]
index = pd.MultiIndex.from_tuples(ind, names = ["State", "Year"])
pop = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# To change the index, use popData.reindex(newIndex)
# To rename the index levels, use popData.index.names = ["newState", "newYear"]
popData = pd.Series(pop, index = index)

# Selecting with slices
print(popData[:, 2010])
print(popData["California"])
print(popData["California":"New York"])
# Selecting with a mask
print(popData[popData > 22000000])
# Selecting with fancy indexing
print(popData[["California", "Texas"]])

# Converting between a hierarchically indexed Series and a DataFrame with unstack()/stack() (reversible)
popDf = popData.unstack()
print(popDf)
popDfZero = popData.unstack(level = 0)
print(popDfZero)
popSe = popDf.stack()
print(popSe)

# Converting a hierarchically indexed Series into a DataFrame with reset_index(), and back with set_index()
popDf2 = popData.reset_index(name = "Population")
print(popDf2)
popDf3 = popDf2.set_index(["State", "Year"])
print(popDf3)
●   Selecting and aggregating data in a DataFrame with a MultiIndex:
import numpy as np
import pandas as pd

data = np.round(np.random.rand(4, 6), 1)
data[:, ::2] *= 10
data += 37

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names = ["Year", "Visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido", "Sue"], ["HR", "Temp"]], names = ["Subject", "Type"])

healthData = pd.DataFrame(data, index = index, columns = columns)

# Selecting columns
print(healthData["Guido", "HR"])
# Slicing with Indexers
print(healthData.loc[:, ("Bob", "HR")])
print(healthData.iloc[:2, :2])
# Build the desired slice with IndexSlice, then use it for selection
idx = pd.IndexSlice
print(healthData.loc[idx[:, 1], idx[:, "HR"]])

# Aggregating over index and column levels
yearMean = healthData.mean(level = "Year")
print(yearMean)
print(yearMean.mean(axis = 1, level = "Type"))
●   Sorting a hierarchical index:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([["Ind_A", "Ind_C", "Ind_B"], [1, 2]])
data = pd.Series(np.random.rand(6), index = index)
print(data)
data = data.sort_index()
print(data)

Missing Values (NaN, Not a Number)
●   Creating a DataFrame that contains missing values:
import pandas as pd
nanData = [{"Col_A": 1, "Col_B": 2}, {"Col_B": 3, "Col_C": 4}, {"Col_A": 5, "Col_C": 6}]
data = pd.DataFrame(nanData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(data)
●   How missing values are stored:
import pandas as pd
# Numeric data with missing values is stored as floating point (float64)
a = pd.Series([2, 4, 6], index = [0, 1, 2])
b = pd.Series([1, 3, 5], index = [1, 2, 3])
print(a + b)
# String data with missing values is stored as object
c = pd.Series(["Two", "Four", "Six"], index = [0, 1, 2])
d = pd.Series(["One", "Three", "Five"], index = [1, 2, 3])
print(c + d)
●   Ignoring missing values or treating them as 0:
# NumPy: NaN-ignoring aggregation functions
import numpy as np
a = np.array([1, np.nan, 3, 4])
print(np.sum(a))
print(np.nansum(a))
print(np.max(a))
print(np.nanmax(a))
print(np.min(a))
print(np.nanmin(a))

# Pandas: filling in missing values during an operation
import pandas as pd
b = pd.Series([2, 4, 6], index = [0, 1, 2])
c = pd.Series([1, 3, 5], index = [1, 2, 3])
print(b.add(c, fill_value = 0))
●   .isnull() and .notnull() detect missing values:
import numpy as np
import pandas as pd
a = pd.Series([1, np.nan, "Hello", None])
print(a[a.isnull()])
print(a[a.notnull()])
●   .dropna() drops missing values:
import numpy as np
import pandas as pd

# Series: drop missing values
a = pd.Series([1, np.nan, "Hello", None])
print(a.dropna())

# DataFrame: drop missing values; only entire rows or columns can be dropped
b = pd.DataFrame([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])
print(b.dropna())
print(b.dropna(axis = "columns"))

# The how and thresh arguments control how many missing values trigger a drop
c = pd.DataFrame([[1, np.nan, np.nan], [4, 5, np.nan], [7, 8, np.nan]])
# Drop only when all values are missing
print(c.dropna(axis = "columns", how = "all"))
# Keep only rows with at least this many non-missing values
print(c.dropna(axis = "rows", thresh = 2))
●   .fillna() fills in missing values:
import numpy as np
import pandas as pd
a = pd.Series([1, np.nan, 2, None, 3])
# Fill with 0
print(a.fillna(0))
# Forward-fill: propagate the previous value forward
print(a.fillna(method = "ffill"))
# Back-fill: propagate the next value backward
print(a.fillna(method = "bfill"))

Concatenating Datasets
●   Concatenating data with pandas.concat():
import pandas as pd

# Concatenating Series objects
aSer = pd.Series(["A", "B", "C"], index = [1, 2, 3])
bSer = pd.Series(["D", "E", "F"], index = [4, 5, 6])
print(pd.concat([aSer, bSer]))

# Concatenating DataFrame objects
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [3, 4])
print(pd.concat([aDf, bDf]))
cDf = makeDf("AB", [1, 2])
dDf = makeDf("CD", [1, 2])
print(pd.concat([cDf, dDf], axis = 1))
●   Concatenating data with pandas.concat() - duplicate indices:
import pandas as pd
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [2, 3])

# Duplicate indices are allowed by default
print(pd.concat([aDf, bDf]))
# Treat duplicate indices as an error
# print(pd.concat([aDf, bDf], verify_integrity = True))
# Ignore the existing indices and create a new one
print(pd.concat([aDf, bDf], ignore_index = True))
# Add a hierarchical index using keys
print(pd.concat([aDf, bDf], keys = ["X", "Y"]))
●   Concatenating data with pandas.concat() - mismatched column names:
import pandas as pd
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("BC", [3, 4])

# With mismatched column names, the default output uses the union of columns (join = "outer")
print(pd.concat([aDf, bDf]))
# Use the intersection of columns instead (join = "inner")
print(pd.concat([aDf, bDf], join = "inner"))
# Restrict the output columns to those of aDf
print(pd.concat([aDf, bDf], join_axes = [aDf.columns]))
●   Concatenating data with .append(); unlike the append() method of Python lists, DataFrame.append() does not modify the original object but instead creates a new object containing the combined data (much like pandas.concat()):
import pandas as pd
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [3, 4])
print(aDf.append(bDf))

Merging Datasets
●   Merging data with pandas.merge():
import pandas as pd

aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})
cDf = pd.DataFrame({"Group": ["Accounting", "Engineering", "HR"], "Supervisor": ["Carly", "Guido", "Steve"]})
dDf = pd.DataFrame({"Group": ["Accounting", "Accounting", "Engineering", "Engineering", "HR", "HR"], "Skills": ["Math", "Spreadsheets", "Coding", "Linux", "Spreadsheets", "Organization"]})

# One-to-one merge
print(pd.merge(aDf, bDf))
# Many-to-one merge
print(pd.merge(aDf, cDf))
# Many-to-many merge
print(pd.merge(aDf, dDf))
●   Merging data with pandas.merge() - keyword arguments:
import pandas as pd

aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})
cDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Salary": [70000, 80000, 120000, 90000]})

aDfInd = aDf.set_index("Employee")
bDfInd = bDf.set_index("Employee")

# The on argument names the column to merge on
print(pd.merge(aDf, bDf, on = "Employee"))
# The left_on and right_on arguments name two differently named columns to merge on
# The redundant column can then be dropped with .drop()
newDf = pd.merge(aDf, cDf, left_on = "Employee", right_on = "Name")
print(newDf)
print(newDf.drop("Name", axis = 1))
# The left_index and right_index arguments merge on the indices of the two DataFrames
print(pd.merge(aDfInd, bDfInd, left_index = True, right_index = True))
# Mixing the two: merge on an index on one side and a column on the other
print(pd.merge(aDfInd, cDf, left_index = True, right_on = "Name"))
●   Merging data with pandas.merge() - intersection and union of keys:
import pandas as pd

aDf = pd.DataFrame({"Name": ["Peter", "Paul", "Mary"], "Food": ["Fish", "Beans", "Bread"]})
bDf = pd.DataFrame({"Name": ["Mary", "Joseph"], "Drink": ["Wine", "Beer"]})

# By default the result uses the intersection of keys (how = "inner")
print(pd.merge(aDf, bDf))
# Use the union of keys
print(pd.merge(aDf, bDf, how = "outer"))
# Keep all keys of the left DataFrame
print(pd.merge(aDf, bDf, how = "left"))
# Keep all keys of the right DataFrame
print(pd.merge(aDf, bDf, how = "right"))
●   Merging data with pandas.merge() - conflicting column names:
import pandas as pd

aDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Rank": [1, 2, 3, 4]})
bDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Rank": [3, 1, 4, 2]})

# By default, conflicting column names are disambiguated automatically with _x and _y
print(pd.merge(aDf, bDf, on = "Name"))
# Specify custom suffixes for the conflicting column names
print(pd.merge(aDf, bDf, on = "Name", suffixes = ["_L", "_R"]))
●   Merging data with .join():
import pandas as pd

aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})

aDfInd = aDf.set_index("Employee")
bDfInd = bDf.set_index("Employee")

print(aDfInd.join(bDfInd))

Aggregation in Pandas
●   Built-in Pandas aggregation methods:
describe()           Descriptive statistics
count()              Total number of items
first(), last()      First and last item
min(), max()         Minimum and maximum
mean(), median()     Mean and median
std(), var()         Standard deviation and variance
sum(), prod()        Sum and product of all items
mad()                Mean absolute deviation
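As a quick sketch, a few of the aggregates from the table applied to a random Series:
import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(5))
print(ser.count(), ser.sum(), ser.prod())
print(ser.min(), ser.max())
print(ser.mean(), ser.median())
print(ser.std(), ser.var())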
●   Simple aggregation in Pandas:
import numpy as np
import pandas as pd

aDf = pd.DataFrame({"Col_A": np.random.rand(5), "Col_B": np.random.rand(5)})

print(aDf.describe())
print(aDf.sum())
print(aDf.sum(axis = 1))
●   Simple aggregation combined with the .groupby() method:
import numpy as np
import pandas as pd

aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})

# Aggregating over a GroupBy object
print(aDf.groupby("Key").sum())
# Selecting a column of the GroupBy object before aggregating
print(aDf.groupby("Key")["Col_B"].sum())
●   GroupBy object methods:
import numpy as np
import pandas as pd

aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})

# .aggregate(): compute one or several aggregates at once
print(aDf.groupby("Key").aggregate(["min", np.median, max]))
print(aDf.groupby("Key").aggregate({"Col_A": min, "Col_B": "max"}))

# .filter(): drop groups that fail a condition
def filterFunc(x):
    return x["Col_B"].sum() < 10
print(aDf.groupby("Key").filter(filterFunc))

# .transform(): return a transformed version of the full data
print(aDf.groupby("Key").transform(lambda x: x - x.mean()))

# .apply(): very flexible; apply an arbitrary function that returns a Pandas object or a scalar
def normByColB(x):
    x["Col_A"] /= x["Col_B"].sum()
    return x
print(aDf.groupby("Key").apply(normByColB))
●   GroupBy object methods - specifying the split key explicitly:
import numpy as np
import pandas as pd

aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})
bDf = aDf.set_index("Key")

# A list as the split key
print(aDf.groupby([0, 1, 0, 1, 2, 0]).sum())

# A dictionary as the split key
print(bDf.groupby({"Ind_A": "Vowel", "Ind_B": "Consonant", "Ind_C": "Consonant"}).sum())

# A Python function applied to the index as the split key
print(bDf.groupby(str.lower).sum())

# Mixing several kinds of keys
print(bDf.groupby([str.lower, {"Ind_A": "Vowel", "Ind_B": "Consonant", "Ind_C": "Consonant"}]).sum())

Pivot Tables
●   Basic pivot table operations:
import seaborn as sns
import pandas as pd

# Use the titanic dataset bundled with seaborn as the example
titanic = sns.load_dataset("titanic")
age = pd.cut(titanic["age"], [0, 18, 80])
fare = pd.qcut(titanic["fare"], 2)

# A basic pivot table
# Equivalent to titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
print(titanic.pivot_table("survived", index = "sex", columns = "class"))

# Multi-level pivot tables
print(titanic.pivot_table("survived", index = ["sex", age], columns = "class"))
print(titanic.pivot_table("survived", ["sex", age], [fare, "class"]))
●   Pivot table keyword options:
import seaborn as sns

# Use the titanic dataset bundled with seaborn as the example
titanic = sns.load_dataset("titanic")

# The aggfunc argument controls the aggregation used; the default is the mean
print(titanic.pivot_table(index = "sex", columns = "class", aggfunc = {"survived": sum, "fare": "mean"}))

# The margins argument adds totals along each row and column of the output
print(titanic.pivot_table("survived", index = "sex", columns = "class", margins = True))

Vectorized String Operations in Pandas
●   Basic vectorized string operations:
import pandas as pd

names = pd.Series(["peter", "Paul", None, "MARY", "gUIDO"])

# Getting a character by position
print(names.str[-1])
print(names.str.get(-1))

# Slicing strings
print(names.str[0:3])
print(names.str.slice(0, 3))

# Vectorized versions of Python string methods
print(names.str.capitalize())
print(names.str.lower())
●   Pandas string methods that use regular expressions:
import pandas as pd

monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam", "Eric Idle", "Terry Jones", "Michael Palin"])

# .str.match() calls re.match() on each element and returns a boolean
print(monte.str.match(r"^[^AEIOU].*[^aeiou]$"))
# .str.extract() calls re.match() on each element and returns the matched groups
print(monte.str.extract(r"(^[^AEIOU].*[^aeiou]$)"))
# .str.contains() calls re.search() on each element and returns a boolean
print(monte.str.contains(r"^[^AEIOU].*[^aeiou]$"))
# .str.findall() calls re.findall() on each element and returns all matches
print(monte.str.findall(r"^[^AEIOU].*[^aeiou]$"))

# .str.replace() replaces matches of the pattern with the given string
print(monte.str.replace(r"[A-Za-z]+ ", "Name "))
# .str.count() counts matches of the pattern
print(monte.str.count(r"[A-Za-z]+"))
# .str.split() splits each string on matches of the pattern
print(monte.str.split(r"[A-Z]"))
●   Pandas string methods - miscellaneous methods:
import pandas as pd

# A = Born in America
# B = Born in the United Kingdom
# C = Likes Cheese
# D = Likes Spam
monteData = pd.DataFrame({"Info": ["B|C|D", "B|D", "A|C", "B|D", "B|C", "B|C|D"]}, index = ["Graham Chapman", "John Cleese", "Terry Gilliam", "Eric Idle", "Terry Jones", "Michael Palin"])

# .str.get_dummies() expands indicator variables into a DataFrame
print(monteData["Info"].str.get_dummies("|"))

Working with Time Series in Pandas
●   NumPy datetime arrays:
import numpy as np
date = np.array("2018-06-21", dtype = np.datetime64)
print(date)
print(date + np.arange(11))
print(date == np.datetime64("2018-06-21"))
●   Pandas datetime objects - basics:
import pandas as pd

# Passing a single date to pandas.to_datetime() produces a Timestamp object
date = pd.to_datetime("21st of June, 2018")
print(date)
# Passing multiple dates to pandas.to_datetime() produces a DatetimeIndex object
dates = pd.to_datetime(["21st of June, 2018", "2018-Jun-22", "06-23-2018", "20180624"])
print(dates)

# Convert to a Period object
print(date.to_period("D"))
# Convert to a PeriodIndex object
print(dates.to_period("D"))

# Passing a single duration to pandas.to_timedelta() produces a Timedelta object
timeDuration = pd.to_timedelta(10, "D")
print(timeDuration)
# Passing multiple durations to pandas.to_timedelta() produces a TimedeltaIndex object
timeDurationS = pd.to_timedelta(["10D", "20D"])
print(timeDurationS)

# Date arithmetic
print(date + timeDuration)
print(dates + timeDuration)
print(dates - dates[0])
●   Pandas datetime objects - regular sequences:
import pandas as pd

# A start point and an end point
print(pd.date_range("2018-06-21", "2018-07-01"))

# The periods argument sets the number of points (including the start point)
print(pd.date_range("2018-06-21", periods = 11))

# The freq argument sets the time unit; the default is "D"
print(pd.date_range("2018-06-21", periods = 13, freq = "H"))
print(pd.period_range("2018-06", periods = 7, freq = "M"))
print(pd.timedelta_range(0, periods = 7, freq = "1H30T"))

# The freq argument with business days
from pandas.tseries.offsets import BDay
print(pd.date_range("2018-06-21", periods = 8, freq = BDay()))
●   Pandas datetime objects as a Series index:
import pandas as pd
index = pd.to_datetime(["2018-06-21", "2017-08-04", "2017-08-14", "2016-11-09"])
data = pd.Series(["Me", "Sister", "Mom", "Dad"], index = index)
print(data)
print(data["2017-08-01":"2017-08-31"])
print(data["2016"])
●   Resampling by time:
import pandas as pd

# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
print(data.head())

# For the resampling frequency codes, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
# .resample() is based on data aggregation; this example returns the mean over each business year
resampleData = data.resample("BA").mean()
print(resampleData)

# .asfreq() is based on data selection; this example returns the value at each business year end
asfreqData = data.asfreq("BA")
print(asfreqData)
●   Time shifts:
import pandas as pd

# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
resampleData = data.resample("D").sum()

# The data before shifting either the values or the index
print(resampleData.head())

# .shift() shifts the data values
print(resampleData.shift(2).head())

# .tshift() shifts the index
print(resampleData.tshift(2).head())
●   Rolling windows:
import pandas as pd

# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
resampleData = data.resample("D").mean()

# The data before applying a rolling window
print(resampleData.head())

# A 2-day rolling mean
print(resampleData.rolling(2).mean().head())

Reading and Writing Datasets
●   Loading a dataset bundled with seaborn:
import seaborn as sns
titanic = sns.load_dataset("titanic")
print(titanic.head())
●   Reading JSON data:
# http://openrecipes.s3.amazonaws.com/openrecipes.txt
import pandas as pd
# Each line of the sample file is valid JSON, but the file as a whole is not, so it needs extra handling
with open("openrecipes.txt") as file:
    data = (line.strip() for line in file)
    jsonDataList = "[{0}]".format(", ".join(data))
recipesData = pd.read_json(jsonDataList)
print(recipesData.head())
●   Reading a .csv file:
import pandas as pd
# http://github.com/jakevdp/data-USstates/
popData = pd.read_csv("state-population.csv")
print(popData.head())
●   Reading an .xlsx file:
import pandas as pd
# http://github.com/jakevdp/data-USstates/
popData = pd.read_excel("state-population.xlsx")
print(popData.head())
●   Writing a .csv file:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
data.to_csv("testFile.csv")
●   Writing an .xlsx file:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
data.to_excel("testFile.xlsx", "Sheet2")

High-Performance Pandas: the eval() and query() Methods
●   The performance gain comes from the Numexpr package, which can evaluate compound expressions element by element without building full intermediate arrays (see the sketch below).
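A minimal sketch of Numexpr itself (assuming the numexpr package is installed), evaluating a compound expression in a single element-wise pass:
import numexpr
import numpy as np

x = np.random.rand(1000000)
y = np.random.rand(1000000)

# NumPy evaluates this in several passes, allocating temporary arrays for intermediate results
numpy_result = 2 * x + 3 * y
# Numexpr compiles the whole expression and evaluates it element by element
numexpr_result = numexpr.evaluate("2 * x + 3 * y")

print(np.allclose(numpy_result, numexpr_result))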
●   Using pandas.eval():
import numpy as np
import pandas as pd

aDf, bDf, cDf, dDf = (pd.DataFrame(np.random.rand(1000, 10)) for i in range(4))

# Arithmetic operations
aMethod = aDf + bDf + cDf + dDf
bMethod = pd.eval("aDf + bDf + cDf + dDf")
# np.allclose() returns True if the two results agree element-wise
print(np.allclose(aMethod, bMethod))

# Comparison and bitwise operations
cMethod = (aDf < 0.5) & (bDf < 0.5) | (cDf < dDf)
dMethod = pd.eval("(aDf < 0.5) & (bDf < 0.5) | (cDf < dDf)")
print(np.allclose(cMethod, dMethod))

# Object attributes and indexing
eMethod = bDf.T[0] + cDf.iloc[1]
fMethod = pd.eval("bDf.T[0] + cDf.iloc[1]")
print(np.allclose(eMethod, fMethod))
●   Using the DataFrame .eval() method:
import numpy as np
import pandas as pd

aDf = pd.DataFrame(np.random.rand(1000, 3), columns = ["Col_A", "Col_B", "Col_C"])
indexMean = aDf.mean(axis = 1)

# Arithmetic operations
aMethod = (aDf.Col_A + aDf.Col_B) / (aDf.Col_C - 1)
bMethod = pd.eval("(aDf.Col_A + aDf.Col_B) / (aDf.Col_C - 1)")
cMethod = aDf.eval("(Col_A + Col_B) / (Col_C - 1)")
print(np.allclose(aMethod, bMethod))
print(np.allclose(aMethod, cMethod))

# Assignment to a new column
aDf.eval("Col_D = (Col_A + Col_B) / Col_C", inplace = True)
print(aDf.head())

# Referring to a local variable: @ marks a variable name rather than a column name
dMethod = aDf.Col_A + indexMean
eMethod = pd.eval("aDf.Col_A + indexMean")
fMethod = aDf.eval("Col_A + @indexMean")
print(np.allclose(dMethod, eMethod))
print(np.allclose(dMethod, fMethod))
●   Using the DataFrame .query() method:
import numpy as np
import pandas as pd

aDf = pd.DataFrame(np.random.rand(1000, 3), columns = ["Col_A", "Col_B", "Col_C"])
columnCMean = aDf["Col_C"].mean()

# Comparison and bitwise operations
aMethod = aDf[(aDf.Col_A < 0.5) & (aDf.Col_B < 0.5)]
bMethod = pd.eval("aDf[(aDf.Col_A < 0.5) & (aDf.Col_B < 0.5)]")
cMethod = aDf.query("Col_A < 0.5 and Col_B < 0.5")
print(np.allclose(aMethod, bMethod))
print(np.allclose(aMethod, cMethod))

# Referring to a local variable: @ marks a variable name rather than a column name
dMethod = aDf[(aDf.Col_A < columnCMean) & (aDf.Col_B < columnCMean)]
eMethod = pd.eval("aDf[(aDf.Col_A < columnCMean) & (aDf.Col_B < columnCMean)]")
fMethod = aDf.query("Col_A < @columnCMean and Col_B < @columnCMean")
print(np.allclose(dMethod, eMethod))
print(np.allclose(dMethod, fMethod))
