At work or in everyday life we inevitably have to clean, filter, concatenate, merge, and aggregate multi-column tabular data, and then chart it to get a grip on the overall shape of the numbers. I ran into exactly this problem at work: traditional program loops can get the job done, but in the era of big data they are painfully inefficient. I once wrote a simple triple-nested loop for an ad-hoc data request, and a mere eight thousand or so rows took about five minutes to produce a result; if anything had gone wrong along the way, another five minutes would have been lost, to say nothing of the time and computing resources needed for tens of thousands or hundreds of thousands of rows, let alone genuinely big data.
Data science is the discipline that keeps gaining prominence as big data and machine learning rise. This book, written by Jake VanderPlas and translated by 何敏煌, the Python Data Science Handbook (Python Data Science Handbook: Essential Tools for Working with Data), is a fine introduction to data science: the author explains in detail how to use the four core Python data-science packages NumPy, Pandas, Matplotlib, and Scikit-Learn, and uses a wide range of examples to show how flexible and computationally efficient these tools are.
The book suits readers who already have basic Python skills, and you will probably want to organize its key points yourself: because it documents each package's features exhaustively, the layout can feel somewhat cluttered, and without the habit of working through the program logic on your own it is easy to get lost and fail to master the many powerful, handy tools it introduces.
Chapter 1: IPython: A Better Python
What the Book Covers
● Packages used in the book:
IPython is used for interactive execution and sharing of code.
NumPy handles data stored in homogeneous arrays.
Pandas handles heterogeneous and labeled data.
Matplotlib produces publication-quality visualizations.
Scikit-Learn is for machine learning.
SciPy is for general scientific computing.
At the Anaconda Prompt, type ipython or jupyter notebook.
● IPython documentation:
# Use ? to get the documentation
help(len)
len?
# Use ?? to get the source code
def square(a):
    return a ** 2
square??
● IPython tab completion:
# Use the Tab key to explore the contents of a module
from bs4 import Beautiful<tab>
# Use the * wildcard to match and explore object names
example = "Hello World"
*ample?
● IPython input and output:
In [6]: import math
math.sin(2)
Out[6]: 0.9092974268256817
In [7]: math.cos(2)
Out[7]: -0.4161468365471424
In [8]: # The In object is a list and the Out object is a dictionary
Out[6] ** 2 + Out[7] ** 2
Out[8]: 1.0
In [9]: # Out[6] is the same as _6
_6 ** 2 + _7 ** 2
Out[9]: 1.0
In [10]: # A shortcut of n underscores retrieves the n-th most recent output
print(_)
print(__)
print(___)
● Suppressing output in IPython:
# Ending a statement with a semicolon executes it quietly, without displaying the result
In [26]: 2 + 3;
In [27]: 26 in Out
Out[27]: False
Shell Commands in IPython
● Anything placed after an exclamation mark ! is executed on the operating system's command line:
# Print the contents of the current working directory
!dir
# Print the path of the current working directory
directory = !cd
print(directory)
# echo works like Python's print() function
message = "Hello World"
!echo {message}
IPython Magic Commands
● Commands starting with % are line magics that operate on a single line of input; commands starting with %% are cell magics that operate on multiple lines of input, as in the sketch below.
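A small sketch of the difference (the two snippets are assumed to be run in separate IPython cells; the timed statements are only illustrative):
# Line magic: prefix a single statement with %
%timeit sum(range(1000))
# Cell magic: %% must be the first line of its cell and times the whole cell
%%timeit
total = 0
for n in range(1000):
    total += n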
● Magic command documentation:
# Print the documentation for the magic commands
%magic
# List all of the available magic commands
%lsmagic
● Running external code:
%run myScript.py
● Timing code:
# %time measures the execution time of a statement once
# %timeit measures it repeatedly
%time a = [i ** 2 for i in range(1000)]
● Shell-related magic commands:
# With %automagic enabled, shell-like magic commands can be used without the % prefix, just as at the operating system prompt
%automagic on
Chapter 2: Introduction to NumPy
Arrays
● Python's fixed-type arrays:
import array
print(array.array("i", list(range(10))))
● NumPy's fixed-type arrays:
import numpy as np
print(np.array([1, 2, 3.14, 4, 5]))
● Creating NumPy arrays from scratch:
import numpy as np
# A length-5 integer array of all zeros
print(np.zeros(5, dtype = int))
# A 3x5 floating-point array of all ones
print(np.ones((3, 5), dtype = float))
# A 3x5 array filled with 3.14
print(np.full((3, 5), 3.14))
# Values from 0 up to 20, stepping by 2
print(np.arange(0, 20, 2))
# Five values evenly spaced between 0 and 1
print(np.linspace(0, 1, 5))
# A 3x5 array of uniform random values between 0 and 1
print(np.random.random((3, 5)))
# A 3x5 array of uniform random values between 0 and 1
print(np.random.rand(3, 5))
# A 3x5 array of values drawn from a normal distribution with mean 0 and standard deviation 1
print(np.random.normal(0, 1, (3, 5)))
# A 3x5 array of values drawn from the standard normal distribution
print(np.random.randn(3, 5))
# A 3x5 array of random integers between 0 and 10
print(np.random.randint(0, 10, (3, 5)))
# A 3x3 identity matrix
print(np.eye(3))
# An uninitialized array of 3 values
print(np.empty(3))
NumPy Array Basics
● NumPy array attributes:
import numpy as np
a3 = np.random.randint(10, size = (3, 4, 5))
print(a3.ndim)
print(a3.shape)
print(a3.size)
print(a3.dtype)
● NumPy array indexing:
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
print(a2)
print(a2[2, 1])
● NumPy fancy indexing:
# Fancy indexing returns the broadcast shape of the index arrays, not the shape of the array being indexed
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
print(a2)
print(a2[2, [1]])
print(a2[[2, 1]])
print(a2[[2, 0], [1, 2]])
ind = np.array([[2, 0], [1, 2]])
print(a2[ind])
# The following examples rely on array reshaping and broadcasting
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
print(a2[row, col])
print(a2[row[:, np.newaxis], col])
● NumPy array slicing:
import numpy as np
a1 = np.arange(10)
print(a1)
# a1[start:stop:step]
print(a1[::2])
print(a1[1::2])
print(a1[::-1])
print(a1[5::-2])
● NumPy array copies (views versus copies):
import numpy as np
a2 = np.random.randint(10, size = (3, 4))
a2Same = a2[:2, :2]
a2Same[0, 0] = 11
print(a2)
print(a2Same)
a2Copy = a2[:2, :2].copy()
a2Copy[0, 0] = 21
print(a2)
print(a2Copy)
● NumPy array reshaping:
import numpy as np
a1 = np.array([1, 2, 3])
print(a1.reshape((1, 3)))
print(a1.reshape((3, 1)))
print(a1[np.newaxis, :])
print(a1[:, np.newaxis])
a2 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.ravel(a2))
● NumPy array concatenation:
import numpy as np
a2 = np.array([[1, 2, 3], [4, 5, 6]])
# Concatenate along the first axis (vertically)
print(np.concatenate([a2, a2]))
# Concatenate along the second axis (horizontally)
print(np.concatenate([a2, a2], axis = 1))
# Concatenate along the first axis (vertically)
print(np.vstack([a2, a2]))
# Concatenate along the second axis (horizontally)
print(np.hstack([a2, a2]))
# numpy.dstack() concatenates along the third axis
● NumPy array splitting:
import numpy as np
# Use numpy.split() to split a one-dimensional array
a1 = np.arange(10)
a11, a12, a13 = np.split(a1, [3, 5])
print(a11, a12, a13)
# Use numpy.vsplit() and numpy.hsplit() to split two-dimensional arrays
a2 = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(a2, [2])
print(upper, lower)
left, right = np.hsplit(a2, [2])
print(left, right)
# numpy.dsplit() splits arrays of three or more dimensions along the third axis
Computation on NumPy Arrays
● Universal functions (UFuncs):
UFuncs (universal functions) are the vectorized operations designed to push the loop into NumPy's compiled layer, which makes execution much faster; more specialized and obscure ufuncs are also available through the scipy.special submodule, as sketched below.
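A minimal sketch of the idea: the explicit Python loop and the ufunc compute the same values, but the ufunc runs its loop in compiled code, and scipy.special adds further vectorized functions.
import numpy as np
from scipy import special

values = np.random.randint(1, 10, size = 1000000)
# Explicit Python loop: every element is handled by interpreted code
reciprocalsLoop = [1.0 / v for v in values]
# UFunc: the same loop runs inside NumPy's compiled layer
reciprocalsUfunc = 1.0 / values
# scipy.special exposes additional, more specialized ufuncs
x = np.array([1, 5, 10])
print(special.gamma(x))
print(special.erf(x))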
● Logical operators:
and and or evaluate the truth of an entire object, while & and | operate element by element (bit by bit) on the object, so with NumPy arrays you almost always use & and |; see the sketch below.
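A short sketch of the difference on Boolean arrays:
import numpy as np
a = np.arange(1, 10)
# Bitwise operators combine the Boolean arrays element by element
print((a > 3) & (a < 8))
print((a > 3) | (a < 2))
# and/or try to reduce each array to a single truth value and raise
# "ValueError: The truth value of an array with more than one element is ambiguous"
# print((a > 3) and (a < 8))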
● NumPy array arithmetic and aggregation:
import numpy as np
a1 = np.random.random(10)
# The built-in Python function: slower, not preferred
print(sum(a1))
# The NumPy function: faster, preferred
print(np.sum(a1))
# The NumPy method on the array object: faster, preferred
print(a1.sum())
● NumPy array broadcasting:
# Rule 1: the shape of the lower-dimensional array is padded on its leading side
# Rule 2: any dimension of size 1 is stretched to match the other array
# Rule 3: if the shapes still disagree after the first two rules, an error is raised
import numpy as np
a1 = np.array([0, 1, 2])
a2 = np.ones((3,3))
print(a1 + 5)
print(a1 + a2)
print(a1 + a1[:, np.newaxis])
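A small sketch of rule 3, the case where broadcasting fails and an error is raised:
import numpy as np
m = np.ones((3, 2))
v = np.arange(3)
# v is padded to shape (1, 3) and stretched to (3, 3); the trailing dimensions 2 and 3
# do not match, so m + v raises
# "ValueError: operands could not be broadcast together with shapes (3,2) (3,)"
# print(m + v)
# Reshaping v to (3, 1) makes it compatible with (3, 2)
print(m + v[:, np.newaxis])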
● NumPy array comparisons:
import numpy as np
a = np.arange(1, 10)
print(a < 5)
print(a[a < 5])
# numpy.count_nonzero() counts how many values are True
print(np.count_nonzero(a < 5))
# False counts as 0 and True counts as 1
print(np.sum(a < 5))
# numpy.any() returns whether any value satisfies the condition
print(np.any(a > 9))
# numpy.all() returns whether all values satisfy the condition
print(np.all(a > 0))
# numpy.where() returns values chosen according to the condition
print(np.where(a > 5, "Bigger than Five", "Not Bigger than Five"))
● Assigning values to NumPy arrays with fancy indexing:
import numpy as np
# Fancy-index assignment with a repeated index: the position is set to 4 and then overwritten with 6
a = np.zeros(10)
a[[0, 0]] = [4, 6]
print(a)
# Augmented assignment with repeated indices: the computation happens once, so the increment is not repeated
b = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]
b[i] += 1
print(b)
# np.add.at() does apply the operation repeatedly at repeated indices
c = np.ones(10)
j = [3, 4, 4, 5, 5, 5]
np.add.at(c, j, 1)
print(c)
● NumPy array sorting:
import numpy as np
# Sort the elements with numpy.sort()
np.random.seed(0)
a = np.random.randint(10, size = 5)
print(a)
print(np.sort(a))
# numpy.argsort() returns the indices of the elements in sorted order
bRand = np.random.RandomState(0)
b = bRand.randint(10, size = 5)
print(b)
print(np.argsort(b))
print(b[np.argsort(b)])
● NumPy array partitioning (partial sorting):
import numpy as np
a = np.array([7, 2, 3, 1, 6, 5, 4])
# numpy.partition() puts the smallest 3 values on the left
print(np.partition(a, 3))
# numpy.argpartition() returns the indices of the partitioned elements
print(np.argpartition(a, 3))
print(a[np.argpartition(a, 3)])
NumPy Array Computation Techniques
● The out argument specifies where the output is written:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
b = np.zeros(10)
np.multiply(a, 3, out = b[::2])
print(b)
● The axis argument selects the axis to operate along:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(np.min(a, axis = 0))
print(a.max(axis = 1))
● .reduce() and .accumulate() compute aggregates:
import numpy as np
a = np.arange(1, 10)
# .reduce() keeps only the final result
print(np.add.reduce(a))
# .accumulate() keeps every intermediate result
print(np.add.accumulate(a))
● .outer() computes the outer product:
# A 9-by-9 multiplication table
import numpy as np
a = np.arange(1, 10)
print(np.multiply.outer(a, a))
● Binning data:
import numpy as np
# Prepare the raw data
rawData = np.random.randn(100)
# Prepare the bin edges
bins = np.linspace(-5, 5, 20)
# Prepare a counter for each bin
counts = np.zeros_like(bins)
# Find the bin index for each raw value
ind = np.searchsorted(bins, rawData)
# Accumulate the count for each bin
np.add.at(counts, ind, 1)
print(counts)
# numpy.histogram() performs essentially the same computation
counts2, bins2 = np.histogram(rawData, bins)
print(counts2)
NumPy Structured Arrays
● NumPy data types:
Character    Description    Example
b            Byte           np.dtype("b")
i            Signed integer    np.dtype("i4") == np.int32
u            Unsigned integer  np.dtype("u1") == np.uint8
f            Floating point    np.dtype("f8") == np.float64
c            Complex floating point    np.dtype("c16") == np.complex128
S or a       String            np.dtype("S5")
U            Unicode string    np.dtype("U") == np.str_
V            Raw data (void)   np.dtype("V") == np.void
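The type characters in the table can be combined with an item size wherever NumPy expects a dtype; a small sketch:
import numpy as np
# "i4" = 4-byte signed integer, "f8" = 8-byte float, "U10" = Unicode string of up to 10 characters
print(np.zeros(3, dtype = "i4"))
print(np.zeros(3, dtype = "f8"))
print(np.array(["data", "science"], dtype = "U10"))
# A byte-order character may also be prepended, e.g. "<i4" for little-endian
print(np.dtype("<i4"))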
● Structured and record arrays:
import numpy as np
name = ["Alice", "Bob", "Cathy", "Doug"]
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
# Structured array
data = np.zeros(4, dtype = {"names": ("name", "age", "weight"), "formats": ("U10", "i4", "f8")})
data["name"] = name
data["age"] = age
data["weight"] = weight
print(data.dtype)
print(data)
print(data["name"])
print(data[0])
print(data[-1]["name"])
# Record array
dataRec = data.view(np.recarray)
print(dataRec.name)
Chapter 3: Data Manipulation with Pandas
Pandas Objects: The Series Object
● Creating Series objects from scratch:
import pandas as pd
# The data is a list and no index is specified
a = pd.Series([0.8, 0.2])
print(a)
# The data is a list with an explicit index
b = pd.Series([0.8, 0.2], index = ["Ind_A", "Ind_B"])
print(b)
# The data is a scalar, which is repeated to fill the specified index
c = pd.Series(5, index = ["Ind_A", "Ind_B"])
print(c)
# The data is a dictionary
d = pd.Series({"Ind_A": 0.8, "Ind_B": 0.2})
print(d)
# The data is a dictionary and the index selects which keys to keep
e = pd.Series({2: "Two", 1: "One", 3: "Three"}, index = [3, 2])
print(e)
● Series attributes and methods:
import pandas as pd
data = pd.Series(["a", "b", "c"])
print(data.values)
print(data.index)
print(data.keys())
print(list(data.items()))
● Series indexing and slicing:
import pandas as pd
data = pd.Series(["a", "b", "c"])
print(data[1])
print(data[1:3])
● Series indexers (loc and iloc):
import pandas as pd
data = pd.Series(["a", "b", "c"], index = [1, 3, 5])
# The loc attribute always indexes and slices by the explicit index
print(data.loc[1])
print(data.loc[1:3])
# The iloc attribute always indexes and slices by the implicit Python-style integer position
print(data.iloc[1])
print(data.iloc[1:3])
● Selecting data from a Series:
import pandas as pd
data = pd.Series({"Ind_A": 0.2, "Ind_B": 0.8, "Ind_C": 0.4, "Ind_D": 0.6})
# Select by slicing
print(data["Ind_B":"Ind_C"])
print(data[1:3])
# Select by masking
print(data[(data > 0.3) & (data < 0.7)])
# Select by fancy indexing
print(data[["Ind_C", "Ind_D"]])
● Adding, modifying, and deleting data in a Series:
import pandas as pd
data = pd.Series({"Ind_A": 0.8, "Ind_B": 0.2})
# Add data
data["Ind_C"] = 0.4
# Modify data
data["Ind_B"] = 0.1
# Delete data
data.pop("Ind_A")
print(data)
Pandas Objects: The DataFrame Object
● Creating DataFrame objects from scratch:
import numpy as np
import pandas as pd
# Series objects
population = pd.Series({"California": 38332521, "Texas": 26448193, "New York": 19651127, "Florida": 19552860, "Illinois": 12882135})
area = pd.Series({"California": 423967, "Texas": 695662, "New York": 141297, "Florida": 170312, "Illinois": 149995})
# Construct a DataFrame from a dictionary of Series objects
a = pd.DataFrame({"Population": population, "Area": area})
print(a)
# Construct a DataFrame from a single Series object
b = pd.DataFrame(population, columns = ["Population"])
print(b)
# Construct a DataFrame from a list of dictionaries
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
c = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(c)
# Construct a DataFrame from a two-dimensional NumPy array
npTwoData = np.random.rand(3, 2)
d = pd.DataFrame(npTwoData, columns = ["Col_A", "Col_B"], index = ["Ind_A", "Ind_B", "Ind_C"])
print(d)
# Construct a DataFrame from a NumPy structured array
npStrucData = np.zeros(3, dtype = [("Col_A", "i8"), ("Col_B", "f8")])
e = pd.DataFrame(npStrucData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(e)
● DataFrame attributes:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(data.values)
print(data.index)
print(data.columns)
print(data.T)
● Selecting data from a DataFrame:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
# Select individual elements through the underlying values array
print(data.values[1])
print(data.values[1, 1])
# Select by column
print(data["Col_A"])
print(data.Col_B)
# Select by indexer slicing
print(data.loc["Ind_B":, "Col_B":])
print(data.iloc[:2, :1])
# Select by masking
print(data.loc[data["Col_B"] >= 2])
# Select by fancy indexing
print(data.loc[["Ind_B", "Ind_C"]])
● Adding, modifying, and deleting data in a DataFrame:
import pandas as pd
listData = [{"Col_A": i, "Col_B": 2 * i} for i in range(3)]
data = pd.DataFrame(listData, index = ["Ind_A", "Ind_B", "Ind_C"])
# Add a column
data["Col_C"] = (data["Col_B"] - data["Col_A"]) ** 2
# Modify a value
data.iloc[0, 0] = 9
# Delete a column
data.pop("Col_A")
print(data)
Pandas Objects: The Index Object
● An Index object behaves almost exactly like a NumPy array, except that it is immutable:
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
# This raises an error
# aInd[1] = 4
● Index objects as ordered sets:
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
bInd = pd.Index([2, 3, 5, 7, 11])
# Intersection
print(aInd & bInd)
# Union
print(aInd | bInd)
# Symmetric difference
print(aInd ^ bInd)
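Note that newer pandas releases discourage using &, |, and ^ on Index objects for set arithmetic; the explicit set methods below are the safer spelling (a hedged sketch on the same data as above):
import pandas as pd
aInd = pd.Index([1, 3, 5, 7, 9])
bInd = pd.Index([2, 3, 5, 7, 11])
print(aInd.intersection(bInd))
print(aInd.union(bInd))
print(aInd.symmetric_difference(bInd))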
Pandas Objects: The MultiIndex Object
● Creating MultiIndex objects from scratch:
import numpy as np
import pandas as pd
# MultiIndex objects
aInd = pd.MultiIndex.from_arrays([["A", "A", "B", "B"], [1, 2, 1, 2]])
print(aInd)
bInd = pd.MultiIndex.from_tuples([("A", 1), ("A", 2), ("B", 1), ("B", 2)])
print(bInd)
cInd = pd.MultiIndex.from_product([["A", "B"], [1, 2]])
print(cInd)
# (newer pandas versions call the labels argument codes)
dInd = pd.MultiIndex(levels = [["A", "B"], [1, 2]], labels = [[0, 0, 1, 1], [0, 1, 0, 1]])
print(dInd)
# Apply a MultiIndex to a Series
aData = pd.Series(np.random.rand(4), index = aInd)
print(aData)
# Apply a MultiIndex to a DataFrame
bData = pd.DataFrame(np.random.rand(4, 2), columns = ["Col_A", "Col_B"], index = bInd)
print(bData)
# Build hierarchical indexing directly on a Series
cData = pd.Series({("A", 1): 0.25, ("A", 2): 0.5, ("B", 1): 0.75, ("B", 2): 1.0})
print(cData)
# Build hierarchical indexing directly on a DataFrame
dData = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
print(dData)
● Selecting and reshaping data in a Series with a MultiIndex:
import pandas as pd
ind = [("California", 2000), ("California", 2010), ("New York", 2000), ("New York", 2010), ("Texas", 2000), ("Texas", 2010)]
index = pd.MultiIndex.from_tuples(ind, names = ["State", "Year"])
pop = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
# To change the index, use popData.reindex(newIndex)
# To rename the index levels, use popData.index.names = ["newState", "newYear"]
popData = pd.Series(pop, index = index)
# Select by slicing
print(popData[:, 2010])
print(popData["California"])
print(popData["California":"New York"])
# Select by masking
print(popData[popData > 22000000])
# Select by fancy indexing
print(popData[["California", "Texas"]])
# Convert between a hierarchically indexed Series and a DataFrame (stack and unstack are inverses)
popDf = popData.unstack()
print(popDf)
popDfZero = popData.unstack(level = 0)
print(popDfZero)
popSe = popDf.stack()
print(popSe)
# Turn the hierarchical index into ordinary columns with reset_index(), and back again with set_index()
popDf2 = popData.reset_index(name = "Population")
print(popDf2)
popDf3 = popDf2.set_index(["State", "Year"])
print(popDf3)
● Selecting and aggregating data in a DataFrame with a MultiIndex:
import numpy as np
import pandas as pd
data = np.round(np.random.rand(4, 6), 1)
data[:, ::2] *= 10
data += 37
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names = ["Year", "Visit"])
columns = pd.MultiIndex.from_product([["Bob", "Guido", "Sue"], ["HR", "Temp"]], names = ["Subject", "Type"])
healthData = pd.DataFrame(data, index = index, columns = columns)
# Select by column
print(healthData["Guido", "HR"])
# Select with indexer slicing
print(healthData.loc[:, ("Bob", "HR")])
print(healthData.iloc[:2, :2])
# Build the desired slice with IndexSlice and select with it
idx = pd.IndexSlice
print(healthData.loc[idx[:, 1], idx[:, "HR"]])
# Aggregate over a level of the index or the columns
yearMean = healthData.mean(level = "Year")
print(yearMean)
print(yearMean.mean(axis = 1, level = "Type"))
● Sorting a hierarchical index:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_product([["Ind_A", "Ind_C", "Ind_B"], [1, 2]])
data = pd.Series(np.random.rand(6), index = index)
print(data)
data = data.sort_index()
print(data)
Missing Values (NaN, Not a Number)
● Creating a DataFrame that contains missing values:
import pandas as pd
nanData = [{"Col_A": 1, "Col_B": 2}, {"Col_B": 3, "Col_C": 4}, {"Col_A": 5, "Col_C": 6}]
data = pd.DataFrame(nanData, index = ["Ind_A", "Ind_B", "Ind_C"])
print(data)
● How missing values are stored:
import pandas as pd
# Numbers and their missing entries are stored as floating point (float64)
a = pd.Series([2, 4, 6], index = [0, 1, 2])
b = pd.Series([1, 3, 5], index = [1, 2, 3])
print(a + b)
# Strings and their missing entries are stored as objects (object)
c = pd.Series(["Two", "Four", "Six"], index = [0, 1, 2])
d = pd.Series(["One", "Three", "Five"], index = [1, 2, 3])
print(c + d)
● Ignoring missing values or treating them as 0:
# Ignoring missing values in NumPy
import numpy as np
a = np.array([1, np.nan, 3, 4])
print(np.sum(a))
print(np.nansum(a))
print(np.max(a))
print(np.nanmax(a))
print(np.min(a))
print(np.nanmin(a))
# Filling missing values in Pandas arithmetic
import pandas as pd
b = pd.Series([2, 4, 6], index = [0, 1, 2])
c = pd.Series([1, 3, 5], index = [1, 2, 3])
print(b.add(c, fill_value = 0))
● .isnull() and .notnull() detect missing values:
import numpy as np
import pandas as pd
a = pd.Series([1, np.nan, "Hello", None])
print(a[a.isnull()])
print(a[a.notnull()])
● .dropna() drops missing values:
import numpy as np
import pandas as pd
# Dropping missing values from a Series
a = pd.Series([1, np.nan, "Hello", None])
print(a.dropna())
# Dropping missing values from a DataFrame can only drop whole rows or columns
b = pd.DataFrame([[1, np.nan, 3], [4, 5, 6], [np.nan, 8, 9]])
print(b.dropna())
print(b.dropna(axis = "columns"))
# The how and thresh arguments control how many missing values trigger a drop
c = pd.DataFrame([[1, np.nan, np.nan], [4, 5, np.nan], [7, 8, np.nan]])
# Drop only the columns whose values are all missing
print(c.dropna(axis = "columns", how = "all"))
# Keep only the rows with at least this many non-missing values
print(c.dropna(axis = "rows", thresh = 2))
● .fillna() fills in missing values:
import numpy as np
import pandas as pd
a = pd.Series([1, np.nan, 2, None, 3])
# Fill with 0
print(a.fillna(0))
# Forward-fill: propagate the previous value forward
print(a.fillna(method = "ffill"))
# Back-fill: propagate the next value backward
print(a.fillna(method = "bfill"))
Concatenating Datasets
● Concatenating data with pandas.concat():
import pandas as pd
# Concatenate Series objects
aSer = pd.Series(["A", "B", "C"], index = [1, 2, 3])
bSer = pd.Series(["D", "E", "F"], index = [4, 5, 6])
print(pd.concat([aSer, bSer]))
# Concatenate DataFrame objects
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [3, 4])
print(pd.concat([aDf, bDf]))
cDf = makeDf("AB", [1, 2])
dDf = makeDf("CD", [1, 2])
print(pd.concat([cDf, dDf], axis = 1))
● pandas.concat() and duplicate indices:
import pandas as pd
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [2, 3])
# Duplicate indices are allowed by default
print(pd.concat([aDf, bDf]))
# Treat duplicate indices as an error
# print(pd.concat([aDf, bDf], verify_integrity = True))
# Ignore the existing indices and build a new one
print(pd.concat([aDf, bDf], ignore_index = True))
# Add a hierarchical index instead
print(pd.concat([aDf, bDf], keys = ["X", "Y"]))
● pandas.concat() and mismatched column names:
import pandas as pd
def makeDf(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("BC", [3, 4])
# When the column names differ, the output defaults to their union, join = "outer"
print(pd.concat([aDf, bDf]))
# Use the intersection of the column names instead, join = "inner"
print(pd.concat([aDf, bDf], join = "inner"))
# Force the output columns to match aDf (note: join_axes was removed in newer pandas versions)
print(pd.concat([aDf, bDf], join_axes = [aDf.columns]))
● Concatenating data with .append(): unlike the append() method of a Python list, the DataFrame .append() method does not modify the original object; it returns a new object containing the combined data (see the note after the example below):
import pandas as pd
def makeDf(cols, ind):
data = {c: [str(c) + str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
aDf = makeDf("AB", [1, 2])
bDf = makeDf("AB", [3, 4])
print(aDf.append(bDf))
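A note on newer pandas versions: DataFrame.append() was deprecated and later removed (in pandas 2.0), and pd.concat() covers the same use case; a sketch under that assumption, equivalent to the aDf.append(bDf) call above:
import pandas as pd
aDf = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index = [1, 2])
bDf = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index = [3, 4])
print(pd.concat([aDf, bDf]))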
Merging Datasets
● Merging data with pandas.merge():
import pandas as pd
aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})
cDf = pd.DataFrame({"Group": ["Accounting", "Engineering", "HR"], "Supervisor": ["Carly", "Guido", "Steve"]})
dDf = pd.DataFrame({"Group": ["Accounting", "Accounting", "Engineering", "Engineering", "HR", "HR"], "Skills": ["Math", "Spreadsheets", "Coding", "Linux", "Spreadsheets", "Organization"]})
# One-to-one merge
print(pd.merge(aDf, bDf))
# Many-to-one merge
print(pd.merge(aDf, cDf))
# Many-to-many merge
print(pd.merge(aDf, dDf))
● pandas.merge() keyword arguments:
import pandas as pd
aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})
cDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Salary": [70000, 80000, 120000, 90000]})
aDfInd = aDf.set_index("Employee")
bDfInd = bDf.set_index("Employee")
# The on argument names the column to merge on
print(pd.merge(aDf, bDf, on = "Employee"))
# left_on and right_on name two differently named columns to merge on
# .drop() can discard the redundant column afterwards
newDf = pd.merge(aDf, cDf, left_on = "Employee", right_on = "Name")
print(newDf)
print(newDf.drop("Name", axis = 1))
# left_index and right_index merge on the two indices
print(pd.merge(aDfInd, bDfInd, left_index = True, right_index = True))
# The two styles can be mixed, merging an index against a column
print(pd.merge(aDfInd, cDf, left_index = True, right_on = "Name"))
● pandas.merge() and set arithmetic between datasets (joins):
import pandas as pd
aDf = pd.DataFrame({"Name": ["Peter", "Paul", "Mary"], "Food": ["Fish", "Beans", "Bread"]})
bDf = pd.DataFrame({"Name": ["Mary", "Joseph"], "Drink": ["Wine", "Beer"]})
# The output defaults to the intersection of the keys, how = "inner"
print(pd.merge(aDf, bDf))
# Use the union of the keys
print(pd.merge(aDf, bDf, how = "outer"))
# Keep the keys of the left DataFrame
print(pd.merge(aDf, bDf, how = "left"))
# Keep the keys of the right DataFrame
print(pd.merge(aDf, bDf, how = "right"))
● pandas.merge() and conflicting column names:
import pandas as pd
aDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Rank": [1, 2, 3, 4]})
bDf = pd.DataFrame({"Name": ["Bob", "Jake", "Lisa", "Sue"], "Rank": [3, 1, 4, 2]})
# Conflicting column names are disambiguated automatically (_x, _y)
print(pd.merge(aDf, bDf, on = "Name"))
# Choose the suffixes used to disambiguate them
print(pd.merge(aDf, bDf, on = "Name", suffixes = ["_L", "_R"]))
● Merging data with .join():
import pandas as pd
aDf = pd.DataFrame({"Employee": ["Bob", "Jake", "Lisa", "Sue"], "Group": ["Accounting", "Engineering", "Engineering", "HR"]})
bDf = pd.DataFrame({"Employee": ["Lisa", "Bob", "Jake", "Sue"], "Hire_Date": [2004, 2008, 2012, 2014]})
aDfInd = aDf.set_index("Employee")
bDfInd = bDf.set_index("Employee")
print(aDfInd.join(bDfInd))
Aggregation in Pandas
● Pandas aggregation methods:
describe()          Summary statistics
count()             Total number of items
first(), last()     First and last item
min(), max()        Minimum and maximum
mean(), median()    Mean and median
std(), var()        Standard deviation and variance
sum(), prod()       Sum and product of all items
mad()               Mean absolute deviation
● Simple aggregation in Pandas:
import numpy as np
import pandas as pd
aDf = pd.DataFrame({"Col_A": np.random.rand(5), "Col_B": np.random.rand(5)})
print(aDf.describe())
print(aDf.sum())
print(aDf.sum(axis = 1))
● Simple aggregation combined with .groupby():
import numpy as np
import pandas as pd
aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})
# Aggregate over the GroupBy object
print(aDf.groupby("Key").sum())
# Select a column of the GroupBy object and then aggregate
print(aDf.groupby("Key")["Col_B"].sum())
● GroupBy object methods:
import numpy as np
import pandas as pd
aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})
# .aggregate() computes aggregates
print(aDf.groupby("Key").aggregate(["min", np.median, max]))
print(aDf.groupby("Key").aggregate({"Col_A": min, "Col_B": "max"}))
# .filter() drops groups based on a group-level condition
def filterFunc(x):
    return x["Col_B"].sum() < 10
print(aDf.groupby("Key").filter(filterFunc))
# .transform() transforms the data group by group
print(aDf.groupby("Key").transform(lambda x: x - x.mean()))
# .apply() is very flexible: it applies an arbitrary function that returns a Pandas object or a scalar
def normByColB(x):
    x["Col_A"] /= x["Col_B"].sum()
    return x
print(aDf.groupby("Key").apply(normByColB))
● GroupBy with explicitly specified split keys:
import numpy as np
import pandas as pd
aDf = pd.DataFrame({"Key": ["Ind_A", "Ind_B", "Ind_C", "Ind_A", "Ind_B", "Ind_C"], "Col_A": range(6), "Col_B": np.random.randint(0, 10, 6)})
bDf = aDf.set_index("Key")
# A list as the split key
print(aDf.groupby([0, 1, 0, 1, 2, 0]).sum())
# A dictionary mapping index values to groups as the split key
print(bDf.groupby({"Ind_A": "Vowel", "Ind_B": "Consonant", "Ind_C": "Consonant"}).sum())
# Any Python function applied to the index values as the split key
print(bDf.groupby(str.lower).sum())
# A mix of the above
print(bDf.groupby([str.lower, {"Ind_A": "Vowel", "Ind_B": "Consonant", "Ind_C": "Consonant"}]).sum())
Pivot Tables
● Basic pivot table operations:
import seaborn as sns
import pandas as pd
# Use the titanic dataset bundled with seaborn as the example
titanic = sns.load_dataset("titanic")
age = pd.cut(titanic["age"], [0, 18, 80])
fare = pd.qcut(titanic["fare"], 2)
# A basic pivot table
# Equivalent to titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()
print(titanic.pivot_table("survived", index = "sex", columns = "class"))
# Hierarchical pivot tables
print(titanic.pivot_table("survived", index = ["sex", age], columns = "class"))
print(titanic.pivot_table("survived", ["sex", age], [fare, "class"]))
● Pivot table keyword options:
import seaborn as sns
# Use the titanic dataset bundled with seaborn as the example
titanic = sns.load_dataset("titanic")
# The aggfunc argument controls the type of aggregation; the default is the mean
print(titanic.pivot_table(index = "sex", columns = "class", aggfunc = {"survived": sum, "fare": "mean"}))
# The margins argument adds totals along the last row and column of the output
print(titanic.pivot_table("survived", index = "sex", columns = "class", margins = True))
Vectorized String Operations in Pandas
● Basic vectorized string operations:
import pandas as pd
names = pd.Series(["peter", "Paul", None, "MARY", "gUIDO"])
# Getting individual characters
print(names.str[-1])
print(names.str.get(-1))
# Slicing strings
print(names.str[0:3])
print(names.str.slice(0, 3))
# String methods
print(names.str.capitalize())
print(names.str.lower())
● Pandas string methods with regular expressions:
import pandas as pd
monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam", "Eric Idle", "Terry Jones", "Michael Palin"])
# .str.match() calls re.match() and returns a Boolean
print(monte.str.match(r"^[^AEIOU].*[^aeiou]$"))
# .str.extract() calls re.match() and returns the matched groups
print(monte.str.extract(r"(^[^AEIOU].*[^aeiou]$)"))
# .str.contains() calls re.search() and returns a Boolean
print(monte.str.contains(r"^[^AEIOU].*[^aeiou]$"))
# .str.findall() calls re.findall() and returns all of the matches
print(monte.str.findall(r"^[^AEIOU].*[^aeiou]$"))
# .str.replace() replaces each match of the pattern with the given string
print(monte.str.replace(r"[A-Za-z]+ ", "Name "))
# .str.count() counts the matches of the pattern
print(monte.str.count(r"[A-Za-z]+"))
# .str.split() splits the strings on the matching pattern
print(monte.str.split(r"[A-Z]"))
● Miscellaneous Pandas string methods:
import pandas as pd
# A = Born in America
# B = Born in the United Kingdom
# C = Likes Cheese
# D = Likes Spam
monteData = pd.DataFrame({"Info": ["B|C|D", "B|D", "A|C", "B|D", "B|C", "B|C|D"]}, index = ["Graham Chapman", "John Cleese", "Terry Gilliam", "Eric Idle", "Terry Jones", "Michael Palin"])
# .str.get_dummies() expands the indicator variables into a DataFrame
print(monteData["Info"].str.get_dummies("|"))
Working with Time Series in Pandas
● NumPy datetime arrays:
import numpy as np
date = np.array("2018-06-21", dtype = np.datetime64)
print(date)
print(date + np.arange(11))
print(date == np.datetime64("2018-06-21"))
● Pandas date and time objects, basic operations:
import pandas as pd
# Passing a single date to pandas.to_datetime() produces a Timestamp object
date = pd.to_datetime("21st of June, 2018")
print(date)
# Passing several dates to pandas.to_datetime() produces a DatetimeIndex object
dates = pd.to_datetime(["21st of June, 2018", "2018-Jun-22", "06-23-2018", "20180624"])
print(dates)
# Convert to a Period object
print(date.to_period("D"))
# Convert to a PeriodIndex object
print(dates.to_period("D"))
# Passing one duration to pandas.to_timedelta() produces a Timedelta object
timeDuration = pd.to_timedelta(10, "D")
print(timeDuration)
# Passing several durations to pandas.to_timedelta() produces a TimedeltaIndex object
timeDurationS = pd.to_timedelta(["10D", "20D"])
print(timeDurationS)
# Date arithmetic
print(date + timeDuration)
print(dates + timeDuration)
print(dates - dates[0])
● Regular date sequences in Pandas:
import pandas as pd
# A start point and an end point
print(pd.date_range("2018-06-21", "2018-07-01"))
# The periods argument sets the length of the sequence (including the start point)
print(pd.date_range("2018-06-21", periods = 11))
# The freq argument sets the time unit; the default is "D"
print(pd.date_range("2018-06-21", periods = 13, freq = "H"))
print(pd.period_range("2018-06", periods = 7, freq = "M"))
print(pd.timedelta_range(0, periods = 7, freq = "1H30T"))
# The freq argument can also use business-day offsets
from pandas.tseries.offsets import BDay
print(pd.date_range("2018-06-21", periods = 8, freq = BDay()))
● Using datetime indices with Series objects:
import pandas as pd
index = pd.to_datetime(["2018-06-21", "2017-08-04", "2017-08-14", "2016-11-09"])
data = pd.Series(["Me", "Sister", "Mom", "Dad"], index = index)
print(data)
print(data["2017-08-01":"2017-08-31"])
print(data["2016"])
● Resampling by time:
import pandas as pd
# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
print(data.head())
# See http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases for the frequency aliases
# .resample() is based on data aggregation; this example returns the mean of each (business) year
resampleData = data.resample("BA").mean()
print(resampleData)
# .asfreq() is based on data selection; this example returns the value at the end of each (business) year
asfreqData = data.asfreq("BA")
print(asfreqData)
● Time shifts:
import pandas as pd
# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
resampleData = data.resample("D").sum()
# Before shifting the data or the index
print(resampleData.head())
# .shift() shifts the data
print(resampleData.shift(2).head())
# .tshift() shifts the index
print(resampleData.tshift(2).head())
● Rolling windows:
import pandas as pd
# https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k
data = pd.read_csv("Fremont_Bridge.csv", index_col = "Date", parse_dates = True)
resampleData = data.resample("D").mean()
# Before applying the rolling window
print(resampleData.head())
# A 2-day rolling mean
print(resampleData.rolling(2).mean().head())
Reading and Writing Datasets
● Loading a dataset bundled with seaborn:
import seaborn as sns
titanic = sns.load_dataset("titanic")
print(titanic.head())
● Reading JSON data:
# http://openrecipes.s3.amazonaws.com/openrecipes.txt
import pandas as pd
# Each line of the sample file is valid JSON, but the file as a whole is not, so a little preprocessing is needed
with open("openrecipes.txt") as file:
    data = (line.strip() for line in file)
    jsonDataList = "[{0}]".format(", ".join(data))
recipesData = pd.read_json(jsonDataList)
print(recipesData.head())
● Reading a .csv file:
import pandas as pd
# http://github.com/jakevdp/data-USstates/
popData = pd.read_csv("state-population.csv")
print(popData.head())
● Reading an .xlsx file:
import pandas as pd
# http://github.com/jakevdp/data-USstates/
popData = pd.read_excel("state-population.xlsx")
print(popData.head())
● Writing a .csv file:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
data.to_csv("testFile.csv")
● Writing an .xlsx file:
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(4, 2), index = [["A", "A", "B", "B"], [1, 2, 1, 2]], columns = ["Col_A", "Col_B"])
data.to_excel("testFile.xlsx", "Sheet2")
High-Performance Pandas: eval() and query()
● The speedup comes from the Numexpr package, which evaluates compound expressions element by element without building full-size temporary arrays for the intermediate results; see the sketch below.
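A minimal sketch of the underlying idea, assuming the numexpr package is installed (array sizes are only illustrative):
import numexpr
import numpy as np
a = np.random.rand(1000000)
b = np.random.rand(1000000)
# Plain NumPy allocates temporaries for (a > 0.5) and (b < 0.5) before combining them
maskNumpy = (a > 0.5) & (b < 0.5)
# Numexpr compiles the compound expression and evaluates it in one pass over the data
maskNumexpr = numexpr.evaluate("(a > 0.5) & (b < 0.5)")
print(np.allclose(maskNumpy, maskNumexpr))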
● Using pandas.eval():
import numpy as np
import pandas as pd
aDf, bDf, cDf, dDf = (pd.DataFrame(np.random.rand(1000, 10)) for i in range(4))
# Arithmetic operations
aMethod = aDf + bDf + cDf + dDf
bMethod = pd.eval("aDf + bDf + cDf + dDf")
# np.allclose() returns True when the two objects agree element by element
print(np.allclose(aMethod, bMethod))
# Comparison and bitwise operations
cMethod = (aDf < 0.5) & (bDf < 0.5) | (cDf < dDf)
dMethod = pd.eval("(aDf < 0.5) & (bDf < 0.5) | (cDf < dDf)")
print(np.allclose(cMethod, dMethod))
# Object attributes and indexing
eMethod = bDf.T[0] + cDf.iloc[1]
fMethod = pd.eval("bDf.T[0] + cDf.iloc[1]")
print(np.allclose(eMethod, fMethod))
● Using the DataFrame .eval() method:
import numpy as np
import pandas as pd
aDf = pd.DataFrame(np.random.rand(1000, 3), columns = ["Col_A", "Col_B", "Col_C"])
indexMean = aDf.mean(axis = 1)
# Arithmetic operations
aMethod = (aDf.Col_A + aDf.Col_B) / (aDf.Col_C - 1)
bMethod = pd.eval("(aDf.Col_A + aDf.Col_B) / (aDf.Col_C - 1)")
cMethod = aDf.eval("(Col_A + Col_B) / (Col_C - 1)")
print(np.allclose(aMethod, bMethod))
print(np.allclose(aMethod, cMethod))
# Assignment
aDf.eval("Col_D = (Col_A + Col_B) / Col_C", inplace = True)
print(aDf.head())
# Referencing local variables: @ marks a variable name rather than a column name
dMethod = aDf.Col_A + indexMean
eMethod = pd.eval("aDf.Col_A + indexMean")
fMethod = aDf.eval("Col_A + @indexMean")
print(np.allclose(dMethod, eMethod))
print(np.allclose(dMethod, fMethod))
● Using the DataFrame .query() method:
import numpy as np
import pandas as pd
aDf = pd.DataFrame(np.random.rand(1000, 3), columns = ["Col_A", "Col_B", "Col_C"])
columnCMean = aDf["Col_C"].mean()
# Comparison and bitwise operations
aMethod = aDf[(aDf.Col_A < 0.5) & (aDf.Col_B < 0.5)]
bMethod = pd.eval("aDf[(aDf.Col_A < 0.5) & (aDf.Col_B < 0.5)]")
cMethod = aDf.query("Col_A < 0.5 and Col_B < 0.5")
print(np.allclose(aMethod, bMethod))
print(np.allclose(aMethod, cMethod))
# Referencing local variables: @ marks a variable name rather than a column name
dMethod = aDf[(aDf.Col_A < columnCMean) & (aDf.Col_B < columnCMean)]
eMethod = pd.eval("aDf[(aDf.Col_A < columnCMean) & (aDf.Col_B < columnCMean)]")
fMethod = aDf.query("Col_A < @columnCMean and Col_B < @columnCMean")
print(np.allclose(dMethod, eMethod))
print(np.allclose(dMethod, fMethod))