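1. Setting Up the R Development Environment
Download and installation
- Top left: Source
Open it with File → New File → R Script. This is the pane for writing and running code. Highlight the code you want to run before executing it; otherwise R runs one line at a time from the cursor position.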
- Bottom left: Console
The pane for writing and running single lines of code.
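- Top right: Environment and History
- Bottom right: Files, Plots, Packages, Help, and Viewer
2. Understanding Variable Types
Variable types
- The class() function returns the type of a variable:
> class(2) # numeric
[1] "numeric"
> class(2L) # integer
[1] "integer"
> class(TRUE) # logical; only TRUE, FALSE, T, or F are allowed
[1] "logical"
> class("Hello World") # character
[1] "character"
> class(Sys.Date()) # date
[1] "Date"
> class(Sys.time()) # date-time
[1] "POSIXct" "POSIXt"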
Looking up documentation
- Either form works:
?class
help(class)
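Assignment
- Both operators work. Most other languages use =, while <- is characteristic of R. The book's author recommends <-; for example, inside a function call <- is not mistaken for a function argument: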
myNum <- 2
myNum = 2
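The difference between the two operators shows up inside a function call. The following is a minimal sketch; the helper compareAssignment() is a made-up name used only for illustration. Inside the parentheses, = names an argument, while <- assigns a variable in the calling environment and then passes its value.

```r
# Hypothetical helper: reports whether a variable named x exists
# in the function's own environment after each style of call.
compareAssignment <- function() {
  here <- environment()
  mean(x = c(1, 2, 3))    # = names the argument x of mean(); nothing is assigned
  afterEquals <- exists("x", envir = here, inherits = FALSE)
  mean(x <- c(1, 2, 3))   # <- assigns x here, then passes its value to mean()
  afterArrow <- exists("x", envir = here, inherits = FALSE)
  c(afterEquals = afterEquals, afterArrow = afterArrow)
}

assignmentCheck <- compareAssignment()
print(assignmentCheck)
```

Only the second call leaves a variable x behind.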
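Conditions
- Greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), not equal to (!=), AND (&), and OR (|) behave much like the comparison and logical operators of other languages such as Python. Membership is tested with %in%:
> # Test whether 7 is contained in c(8, 7); c(8, 7) is a vector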
> 7 %in% c(8, 7)
[1] TRUE
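The operators listed above are vectorized, so they compare element by element. A short sketch, using a made-up vector named scores:

```r
# Made-up example data
scores <- c(55, 72, 90)

passed <- scores >= 60                  # FALSE TRUE TRUE
inBand <- scores > 60 & scores < 80     # AND, element by element
extreme <- scores < 60 | scores > 85    # OR, element by element
notSeventyTwo <- scores != 72

print(passed)
print(inBand)
print(extreme)
print(notSeventyTwo)
```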
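Arithmetic
- Addition (+), subtraction (-), multiplication (*), and division (/) work as in other languages such as Python. Exponentiation is written ^ (R also accepts ** as in Python), and %% returns the remainder:
> 11 %% 10 # remainder
[1] 1
> 7 + 8L + TRUE # TRUE counts as 1 (or 1L)
[1] 16
> class(7 + 8L + TRUE)
[1] "numeric"
> 7 + 8L + FALSE # FALSE counts as 0 (or 0L)
[1] 15
> class(7 + 8L + FALSE)
[1] "numeric"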
Dates and times
- Converting a date to a string or an integer:
> sysDate <- Sys.Date()
> # Date to string
> as.character(sysDate)
[1] "2018-09-01"
> # Date to integer (days since 1970-01-01)
> as.integer(sysDate)
[1] 17775
- Converting a string to a date, and date arithmetic:
> # String to date: the start of the Unix epoch
> dateOfOrigin <- as.Date("1970-01-01")
> # Date arithmetic (adds one day)
> dateOfOrigin + 1
[1] "1970-01-02"
- Converting a time to a string or an integer:
> sysTime <- Sys.time()
> # Time to string
> as.character(sysTime)
[1] "2018-09-01 10:32:06"
> # Time to integer (seconds since the Unix epoch)
> as.integer(sysTime)
[1] 1535769126
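- Converting a string to a time, and time arithmetic:
> # String to time: the start of the Unix epoch in Greenwich Mean Time
> timeOfOrigin <- as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
> # Time arithmetic (adds one second)
> timeOfOrigin + 1
[1] "1970-01-01 00:00:01 GMT"
> # String to time: the Unix epoch expressed in Chungyuan Standard Time (CST, UTC+8); this relies on the session time zone being CST
> timeOfOriginCst = as.POSIXct("1970-01-01 08:00:00")
> as.integer(timeOfOriginCst)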
[1] 0
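The last result depends on the time zone of the R session. As a sketch, assuming the time zone database provides the name "Asia/Taipei" (UTC+8), passing tz explicitly makes the conversion independent of the machine's settings:

```r
# tz is given explicitly, so the result does not depend on the session locale
epochTaipei <- as.POSIXct("1970-01-01 08:00:00", tz = "Asia/Taipei")
epochSeconds <- as.integer(epochTaipei)

# The same instant formatted in UTC
utcText <- format(epochTaipei, "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(epochSeconds)
print(utcText)
```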
3. Checking and Converting Variable Types
Checking variable types
- The is.typename() functions and inherits() return a logical value:
> is.numeric(7.7)
[1] TRUE
> is.integer(7L)
[1] TRUE
> is.logical(FALSE)
[1] TRUE
> is.character("Hello World")
[1] TRUE
> inherits(Sys.Date(), what = "Date")
[1] TRUE
> inherits(Sys.time(), what = "POSIXct")
[1] TRUE
Converting variable types
- The as.typename() functions convert a variable to another type:
> as.numeric(7L)
[1] 7
> as.integer(7)
[1] 7
> as.logical("TRUE")
[1] TRUE
> as.character(Sys.Date())
[1] "2018-09-01"
> as.Date("1970/01/01")
[1] "1970-01-01"
> as.Date("01/01/70", format = "%m/%d/%y")
[1] "1970-01-01"
> as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
[1] "1970-01-01 GMT"
> as.POSIXct("1970-01-01 00:00:00")
[1] "1970-01-01 CST"
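- Date format codes:
Code  Meaning                   Example
%d    Day of the month          01
%a    Abbreviated weekday name  Mon
%A    Full weekday name         Monday
%m    Month number              01
%b    Abbreviated month name    Jan
%B    Full month name           January
%y    Two-digit year            70
%Y    Four-digit year           1970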
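The same codes work in both directions: format() turns a date into text, and as.Date() with a format argument parses text back. A brief sketch using only the numeric codes, since the names produced by %a, %A, %b, and %B depend on the locale:

```r
epochDate <- as.Date("1970-01-01")

# Date to text with a custom layout
slashDate <- format(epochDate, "%Y/%m/%d")
shortYear <- format(epochDate, "%d.%m.%y")

# Text back to a date using the matching layout
parsedBack <- as.Date(shortYear, format = "%d.%m.%y")

print(slashDate)
print(shortYear)
print(parsedBack)
```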
4. Collecting Variables Together
One-dimensional data structure: vector
- Selecting vector elements:
> fourSeasons <- c("spring", "summer", "autumn", "winter")
> fourSeasons
[1] "spring" "summer" "autumn" "winter"
> fourSeasons[2]
[1] "summer"
> fourSeasons[c(-1, -3)]
[1] "summer" "winter"
- Filtering vector elements by a condition:
> fourSeasons <- c("spring", "summer", "autumn", "winter")
> myFavoriteSeason <- fourSeasons == "summer"
> myFavoriteSeason
[1] FALSE TRUE FALSE FALSE
> fourSeasons[myFavoriteSeason]
[1] "summer"
- rep() builds a vector by repeating a value:
> rep(7, times = 3)
[1] 7 7 7
- seq() generates an arithmetic sequence:
> seq(from = 7, to = 25, by = 7)
[1] 7 14 21
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
One-dimensional data structure: factor
- A factor stores strings as categorical values together with level information.
- factor() converts a vector into a factor:
> fourSeasonsFactor <- factor(c("spring", "summer", "autumn", "winter"))
> fourSeasonsFactor
[1] spring summer autumn winter
Levels: autumn spring summer winter
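- factor() accepts ordered = TRUE to mark the levels as ordered. If no levels = c() argument is given, the levels default to alphabetical order, so supply levels to set the order yourself: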
> fourSeasonsFactor <- factor(c("spring", "summer", "autumn", "winter"), ordered = TRUE, levels = c("winter", "spring", "autumn", "summer"))
> fourSeasonsFactor
[1] spring summer autumn winter
Levels: winter < spring < autumn < summer
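Once the levels are ordered, comparisons follow the level order rather than the alphabet. A small sketch built on the same four seasons (the variable names are illustrative):

```r
seasons <- factor(c("spring", "summer", "autumn", "winter"),
                  ordered = TRUE,
                  levels = c("winter", "spring", "autumn", "summer"))

springBeforeSummer <- seasons[1] < seasons[2]  # compares level positions
levelCodes <- as.integer(seasons)              # position of each value among the levels
latest <- as.character(max(seasons))           # the highest level present

print(springBeforeSummer)
print(levelCodes)
print(latest)
```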
Two-dimensional data structure: matrix
- Data running horizontally form a row; data running vertically form a column.
- A matrix filled row by row:
> myMat <- matrix(1:6, nrow = 2, byrow = TRUE)
> myMat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
- A matrix filled column by column (the default), and selecting matrix elements:
> myMat <- matrix(1:6, nrow = 2)
> myMat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> myMat[2, 3]
[1] 6
> myMat[2, ]
[1] 2 4 6
> myMat[, 3]
[1] 5 6
- Filtering matrix elements by a condition:
> myMat <- matrix(1:6, nrow = 2)
> filter <- myMat < 6 & myMat > 1
> filter
      [,1] [,2]  [,3]
[1,] FALSE TRUE  TRUE
[2,]  TRUE TRUE FALSE
> myMat[filter]
[1] 2 3 4 5
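Two-dimensional data structure: data frame
- Data running horizontally are observations; data running vertically are variables.
- The data used in the data frame (omitted from the later examples):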
> teamName <- c("Boston Red Sox", "Texas Rangers")
> wins <- c(97, 96)
> losses <- c(65, 66)
> isChampion <- c(TRUE, FALSE)
> season <- c(2013, 2011)
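- Displaying a data frame:
> # Build the data frame; View(greatMlbTeams) shows it in the Source pane and can be called again after the tab is closed
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> # Display the data frame in the Console
> greatMlbTeams
        teamName wins losses isChampion season
1 Boston Red Sox   97     65       TRUE   2013
2  Texas Rangers   96     66      FALSE   2011
> # Display the structure of the data frame in the Console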
> str(greatMlbTeams)
'data.frame': 2 obs. of 5 variables:
$ teamName : Factor w/ 2 levels "Boston Red Sox",..: 1 2
$ wins : num 97 96
$ losses : num 65 66
$ isChampion: logi TRUE FALSE
$ season : num 2013 2011
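- Selecting data frame elements:
> # Strings are stored as factors by default (in R versions before 4.0.0; from 4.0.0 the default is stringsAsFactors = FALSE)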
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> greatMlbTeams[1, 1]
[1] Boston Red Sox
Levels: Boston Red Sox Texas Rangers
> greatMlbTeams[1, ]
teamName wins losses isChampion season
1 Boston Red Sox 97 65 TRUE 2013
> greatMlbTeams[, 1]
[1] Boston Red Sox Texas Rangers
Levels: Boston Red Sox Texas Rangers
> # Keeping strings as character vectors, method 1
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
> greatMlbTeams[1, 1]
[1] "Boston Red Sox"
> greatMlbTeams[, 1]
[1] "Boston Red Sox" "Texas Rangers"
> # Keeping strings as character vectors, method 2
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> as.character(greatMlbTeams[1, 1])
[1] "Boston Red Sox"
> as.character(greatMlbTeams[, 1])
[1] "Boston Red Sox" "Texas Rangers"
> # Selecting data frame elements by variable name
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
> greatMlbTeams$teamName
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[, "teamName"]
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[, c("teamName", "isChampion")]
teamName isChampion
1 Boston Red Sox TRUE
2 Texas Rangers FALSE
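Because the default for stringsAsFactors differs between R versions, stating it explicitly gives the same result everywhere. A minimal sketch with illustrative variable names:

```r
teams <- c("Boston Red Sox", "Texas Rangers")

# Explicit choice in both directions, independent of the R version in use
asFactorDf <- data.frame(teamName = teams, stringsAsFactors = TRUE)
asCharDf <- data.frame(teamName = teams, stringsAsFactors = FALSE)

factorClass <- class(asFactorDf$teamName)
charClass <- class(asCharDf$teamName)

print(factorClass)
print(charClass)
```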
- Filtering data frame rows by a condition:
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> filter <- greatMlbTeams$isChampion == TRUE
> filter
[1] TRUE FALSE
> greatMlbTeams[filter, ]
teamName wins losses isChampion season
1 Boston Red Sox 97 65 TRUE 2013
Multi-dimensional data structure: array
- Selecting array elements:
> myArr <- array(1:18, dim = c(3, 3, 2))
> myArr
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
> myArr[3, 2, 1]
[1] 6
> myArr[3, , ]
[,1] [,2]
[1,] 3 12
[2,] 6 15
[3,] 9 18
> myArr[, 2, ]
[,1] [,2]
[1,] 4 13
[2,] 5 14
[3,] 6 15
> myArr[, , 1]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Multi-dimensional data structure: list
- The data used in the list (omitted from the later examples):
> title <- "Great MLB Teams"
> teams <- c("Boston Red Sox", "Texas Rangers")
> wins <- c(97, 96)
> losses <- c(65, 66)
> winningPercentage <- wins / (wins + losses)
> season <- c(2013, 2011)
> winsLosses <- matrix(c(wins, losses), nrow = 2)
> df <- data.frame(Teams = teams, Winning_Percentage = winningPercentage, Season = season)
- Displaying a list:
> greatMlbTeams <- list(title, teams, winsLosses, df)
> # Display the list in the Source pane
> View(greatMlbTeams)
> # Display the list in the Console
> greatMlbTeams
[[1]]
[1] "Great MLB Teams"
[[2]]
[1] "Boston Red Sox" "Texas Rangers"
[[3]]
[,1] [,2]
[1,] 97 65
[2,] 96 66
[[4]]
Teams Winning_Percentage Season
1 Boston Red Sox 0.5987654 2013
2 Texas Rangers 0.5925926 2011
- Selecting list elements:
> greatMlbTeams <- list(title, teams, winsLosses, df)
> greatMlbTeams[[2]]
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[[3]][1, ]
[1] 97 65
> greatMlbTeams[[4]]$Winning_Percentage
[1] 0.5987654 0.5925926
> # Selecting list elements by name
> greatMlbTeams <- list(Title = title, Teams = teams, Wins_Losses = winsLosses, DF = df)
> greatMlbTeams$Teams
[1] "Boston Red Sox" "Texas Rangers"
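The examples above use double brackets and $. Single brackets also work on a list but return a shorter list rather than the element itself. A sketch of the difference, using a small made-up list named mlbList:

```r
mlbList <- list(Title = "Great MLB Teams",
                Teams = c("Boston Red Sox", "Texas Rangers"))

subList <- mlbList["Teams"]     # single brackets: a list of length 1
teamVec <- mlbList[["Teams"]]   # double brackets: the character vector itself

print(class(subList))
print(class(teamVec))
```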
- Many R functions return a list:
> x <- 1:10
> y <- 2 * x + 5
> lmFit <- lm(formula = y ~ x)
> lmFit$coefficients
(Intercept) x
5 2
> lmFit$coefficients[1]
(Intercept)
5
> lmFit$coefficients[2]
x
2
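To see that the fitted model really is a list, it can be inspected with is.list() and names(). A short sketch that repeats the same fit; coef() is the standard accessor for the coefficients component:

```r
x <- 1:10
y <- 2 * x + 5
lmFit <- lm(y ~ x)

fitIsList <- is.list(lmFit)         # the lm object is stored as a list
fitClass <- class(lmFit)            # with a class attribute on top
components <- names(lmFit)          # names of the list components
slope <- unname(coef(lmFit)["x"])   # same value as lmFit$coefficients[2]

print(fitIsList)
print(fitClass)
print(components)
```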
5. Functions
Numeric functions
- abs() returns the absolute value:
> abs(-9)
[1] 9
- sqrt() returns the square root:
> sqrt(9)
[1] 3
- ceiling() rounds up to the nearest integer:
> ceiling(pi)
[1] 4
- floor() rounds down to the nearest integer:
> floor(pi)
[1] 3
- round() rounds to the nearest value, optionally to a given number of decimal places (digits):
> round(pi)
[1] 3
> round(pi, digits = 2)
[1] 3.14
> round(pi, digits = 4)
[1] 3.1416
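One detail worth knowing: for values exactly halfway between two integers, round() goes to the even neighbor rather than always rounding up. The sketch below contrasts this with a simple round-half-up helper; roundHalfUp() is a made-up name, not a base R function, and it is written for non-negative inputs only.

```r
halfCases <- c(0.5, 1.5, 2.5, 3.5)

# Base R: ties go to the even integer
bankers <- round(halfCases)

# Hypothetical helper: ties always go up (non-negative inputs assumed)
roundHalfUp <- function(v) floor(v + 0.5)
halfUp <- roundHalfUp(halfCases)

print(bankers)
print(halfUp)
```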
- exp() computes e^x, where e ≈ 2.7182818 is Euler's number:
> exp(2)
[1] 7.389056
- log() returns the natural logarithm (base e):
> log(exp(2))
[1] 2
- log10() returns the base-10 logarithm:
> log10(10^3)
[1] 3
Text functions
- toupper() converts text to upper case:
> toupper("Hello World")
[1] "HELLO WORLD"
- tolower() converts text to lower case:
> tolower("Hello World")
[1] "hello world"
- substr() extracts part of a string:
> substr("Hello World", start = 1, stop = 4)
[1] "Hell"
- grep() searches text: it returns the indices of the matching elements, or integer(0) when nothing matches:
> grep(pattern = "Poor", c("Hello", "Poor", "World"))
[1] 2
> grep(pattern = "Hell", c("Hello", "Poor", "World"))
[1] 1
> grep(pattern = "poor", c("Hello", "Poor", "World"))
integer(0)
> grep(pattern = "poor", c("Hello", "Poor", "World"), ignore.case = TRUE)
[1] 2
- sub() replaces the first match in each string:
> sub(pattern = "Hello", replacement = "Hell", c("Hello", "Poor", "World"))
[1] "Hell" "Poor" "World"
> sub(pattern = "hello", replacement = "Hell", c("Hello", "Poor", "World"), ignore.case = TRUE)
[1] "Hell" "Poor" "World"
- strsplit() splits text:
> strsplit("Hello Poor World", split = " ")
[[1]]
[1] "Hello" "Poor" "World"
- paste() concatenates text:
> paste("Hello", "Poor", "World")
[1] "Hello Poor World"
> paste("Hello", "Poor", "World", sep = "|")
[1] "Hello|Poor|World"
Descriptive statistics functions
- mean() returns the mean:
> mean(1:5)
[1] 3
> mean(c(1:5, NA)) # with a missing value
[1] NA
> mean(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 3
- sd() returns the standard deviation:
> sd(1:5)
[1] 1.581139
> sd(c(1:5, NA)) # with a missing value
[1] NA
> sd(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1.581139
- median() returns the median:
> median(1:5)
[1] 3
> median(c(1:5, NA)) # with a missing value
[1] NA
> median(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 3
- range() returns the minimum and the maximum:
> range(1:5)
[1] 1 5
> range(c(1:5, NA)) # with a missing value
[1] NA NA
> range(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1 5
- sum() returns the sum:
> sum(1:5)
[1] 15
> sum(c(1:5, NA)) # with a missing value
[1] NA
> sum(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 15
- max() returns the maximum:
> max(1:5)
[1] 5
> max(c(1:5, NA)) # with a missing value
[1] NA
> max(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 5
- min() returns the minimum:
> min(1:5)
[1] 1
> min(c(1:5, NA)) # with a missing value
[1] NA
> min(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1
6. Loops and Flow Control
Loops
- for loop:
> for (month in month.name) {
+ print(month)
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "June"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"
- while loop:
> i <- 1
> while (i < 13) {
+ print(month.name[i])
+ i <- i + 1
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "June"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"
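Flow control
- if - else if - else:
> # sample() draws a random sample from a vector; size sets how many items to draw
> weather <- sample(c("Sunny", "Cloudy", "Rainy"), size = 1)
> # Unlike Java, else must directly follow the closing }; starting a new line with else at the console raises an error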
> if (weather == "Sunny") {
+ print("Cycling")
+ } else if (weather == "Cloudy") {
+ print("Running")
+ } else {
+ print("Working Out in the Gym")
+ }
[1] "Cycling"
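The if statement above tests a single value. For a whole vector of conditions, base R offers the vectorized ifelse(). A sketch with a made-up vector weatherWeek, mapping each day to the same three activities:

```r
weatherWeek <- c("Sunny", "Rainy", "Cloudy")

# ifelse(test, yes, no) is evaluated element by element; nesting adds a third branch
activity <- ifelse(weatherWeek == "Sunny", "Cycling",
                   ifelse(weatherWeek == "Cloudy", "Running",
                          "Working Out in the Gym"))

print(activity)
```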
Combining loops and flow control
- break, like Python's break, exits the loop:
> for (month in month.name) {
+ if (month == "June") {
+ break
+ } else {
+ print(month)
+ }
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
- next, like Python's continue, skips straight to the next iteration:
> for (month in month.name) {
+ if (month == "June") {
+ next
+ } else {
+ print(month)
+ }
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"
7. User-Defined Functions
User-defined functions
- A simple example:
> # Define the function
> myFunc <- function(x, mode = TRUE) {
+ y <- x ^ 2
+ z <- x ^ 3
+ if (mode == TRUE) {
+ return(y)
+ } else {
+ return(z)
+ }
+ }
> # Call the function
> myFunc(1:3, FALSE)
[1] 1 8 27
- A more complex example: cleaning messy data:
> # Messy data
> messyData <- data.frame(c(1, 2, 3, 4, NA), c(1, 2, 3, NA, 5), c(1, 2, NA, 4, 5))
> names(messyData) <- c("a", "b", "c")
> messyData
a b c
1 1 1 1
2 2 2 2
3 3 3 NA
4 4 NA 4
5 NA 5 5
> # Define the function
> cleanData <- function(df, imputeValue) {
+ nRows <- nrow(df)
+ naSum <- rep(NA, times = nRows)
+ for (i in 1:nRows) {
+ naSum[i] <- sum(is.na(df[i, ]))
+ df[i, ][is.na(df[i, ])] <- imputeValue
+ }
+ dfList <- list(completeCases = df[as.logical(!naSum), ], imputedData = df)
+ return(dfList)
+ }
> # Call the function
> cleanedData <- cleanData(messyData, imputeValue = 999)
> cleanedData$completeCases
a b c
1 1 1 1
2 2 2 2
> cleanedData$imputedData
a b c
1 1 1 1
2 2 2 2
3 3 3 999
4 4 999 4
5 999 5 5
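The same two results can also be obtained without an explicit loop. This is an alternative sketch, not the book's method; it uses the base R functions complete.cases() and is.na() on the whole data frame.

```r
messyData <- data.frame(a = c(1, 2, 3, 4, NA),
                        b = c(1, 2, 3, NA, 5),
                        c = c(1, 2, NA, 4, 5))

# Rows that contain no missing value
completeRows <- messyData[complete.cases(messyData), ]

# Copy of the data with every NA replaced by the imputation value
imputed <- messyData
imputed[is.na(imputed)] <- 999

print(completeRows)
print(imputed)
```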
8. Data Input and Output
Built-in data sets
- Listing the built-in data sets that are available:
data()
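Reading .txt and .csv files
- In R, the backslash (\) in a Windows path must be written as a forward slash (/) or escaped as a double backslash (\\).
- Reading tabular data from a .txt file on disk: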
favoriteBands <- read.table("D:/favorite_bands.txt", header = TRUE, stringsAsFactors = FALSE)
View(favoriteBands)
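The path rule can be checked without touching the disk. A sketch with illustrative variable names: the forward-slash form, the escaped-backslash form, and a path assembled with file.path() all describe the same location.

```r
forwardPath <- "D:/favorite_bands.txt"
escapedPath <- "D:\\favorite_bands.txt"   # each \\ in the source is one backslash in the string
builtPath <- file.path("D:", "favorite_bands.txt")

# Converting the backslash to a slash gives the forward-slash form
convertedPath <- chartr("\\", "/", escapedPath)

print(builtPath)
print(convertedPath)
```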
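- Reading tabular data from a .csv file on disk:
# By default sep matches one or more whitespace characters, so a .csv file needs sep = ","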
favoriteBands <- read.table("D:/favorite_bands.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
View(favoriteBands)
- Reading tabular data from a .csv file on the web:
url = "https://storage.googleapis.com/learn-r-the-easy-way.appspot.com/data_ch11/favorite_bands.csv"
favoriteBands <- read.table(url, header = TRUE, stringsAsFactors = FALSE, sep = ",")
View(favoriteBands)
- Reading text data from a .txt file on disk:
# n limits the number of lines read
lyricsScript <- readLines("D:/lyrics.txt", n = 5)
lyricsScript
Reading Excel and JSON data
- Method 1: install and load a package with code:
install.packages("packageName")
library(packageName)
- Method 2: install and load a package through the user interface:
Packages → Install → enter the package name in the Packages field → Install → tick the package's checkbox to load it
- Reading Excel data:
# Install and load the readxl package
install.packages("readxl")
library(readxl)
favoriteBands <- read_excel("D:/favorite_bands.xlsx")
View(favoriteBands)
- Reading JSON data:
# Install and load the jsonlite package
install.packages("jsonlite")
library(jsonlite)
favoriteBands <- fromJSON("D:/favorite_bands.json")
View(favoriteBands)
Writing .txt and .csv files
- Writing tabular data to a .txt file:
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
# row.names controls whether the row index of each observation is written
write.table(favoriteBandsDf, file = "D:/favorite_bands.txt", row.names = FALSE)
- Writing tabular data to a .csv file:
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
write.table(favoriteBandsDf, file = "D:/favorite_bands.csv", row.names = FALSE, sep = ",")
- Writing the built-in cars data set to a .csv file:
write.csv(cars, file = "D:/cars.csv", row.names = FALSE)
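To confirm that an exported file reads back correctly, a round trip can be run against a temporary file. This sketch uses tempfile() instead of the D: drive so that it runs on any machine; read.csv() is the comma-separated counterpart of read.table().

```r
# Temporary location instead of a fixed drive letter
csvPath <- tempfile(fileext = ".csv")

write.csv(cars, file = csvPath, row.names = FALSE)
carsBack <- read.csv(csvPath)

sameShape <- identical(dim(carsBack), dim(cars))
sameNames <- identical(names(carsBack), names(cars))
sameValues <- all(carsBack$speed == cars$speed) && all(carsBack$dist == cars$dist)

unlink(csvPath)  # remove the temporary file
print(c(sameShape, sameNames, sameValues))
```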
- Writing text data to a .txt file:
lyricsScript <- c("Side Effects", "It's 4AM, I don't know where to go", "Everywhere is closed, I should just go home, yeah", "My feet are takin' me to your front door", "I know I shouldn't though, heaven only knows")
writeLines(lyricsScript, con = "D:/lyrics.txt")
Writing JSON data
- Writing JSON data:
library(jsonlite)
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
writeLines(toJSON(favoriteBandsDf), con = "D:/favorite_bands.json")
9. Exploratory Data Analysis (EDA)
Built-in functions (using the built-in iris data set)
- Number of observations:
nrow(iris)
- Number of variables:
ncol(iris)
- Number of observations and variables:
dim(iris)
- Variable names and the first six observations:
head(iris)
- Variable names and the last six observations:
tail(iris)
- Variable names:
names(iris)
- Descriptive statistics and related summary information:
summary(iris)
- Data structure information:
str(iris)
Base Plotting System: R's built-in plotting system
- Histogram:
# rnorm() draws the given number of random values from the standard normal distribution
hist(rnorm(1000))
- Box plot:
boxplot(Sepal.Length ~ Species, data = iris)
- Line chart:
x <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-31"), by = 1)
# Random seed
set.seed(123)
# replace = TRUE allows the same number to be drawn more than once
y <- sample(1:100, size = 31, replace = TRUE)
plot(x, y, type = "l")
# Using the built-in AirPassengers data: an object of class ts (time series) can be passed straight to plot() without type = "l"
class(AirPassengers)
plot(AirPassengers)
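- Scatter plot:
# Single scatter plot of the built-in cars data
plot(cars)
# Single scatter plot with the x-axis and y-axis variables given explicitly
plot(cars$dist, cars$speed)
# Scatter plot matrix of the built-in iris data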
plot(iris)
- Bar chart:
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
# table() counts the occurrences of each value
barplot(table(iceCreamFlavor))
- Curve:
curve(sin, from = -pi, to = pi)
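Base Plotting System: common customizations
- Custom title, x-axis label, y-axis label, and grid lines:
# Scatter plot of the built-in cars data
# main sets the title
# xlab sets the x-axis label
# ylab sets the y-axis label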
plot(cars, main = "Car Speed VS. Braking Distance", xlab = "Car Speed (mph)", ylab = "Braking Distance (ft)")
# grid() adds grid lines
grid()
- Orientation and text scaling:
# Using a bar chart as the example
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
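# horiz = TRUE draws the bars horizontally
# las = 1 sets the orientation of the axis labels (1 = always horizontal)
# cex.names = 0.8 and cex.axis = 1.2 scale the bar labels and the axis tick labels
barplot(table(iceCreamFlavor), horiz = TRUE, las = 1, cex.names = 0.8, cex.axis = 1.2)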
- Multiple plots:
# Box plots of the built-in iris data
par(mfrow = c(2, 2))
boxplot(iris$Sepal.Length ~ iris$Species, main = "Sepal Length by Species")
boxplot(iris$Sepal.Width ~ iris$Species, main = "Sepal Width by Species")
boxplot(iris$Petal.Length ~ iris$Species, main = "Petal Length by Species")
boxplot(iris$Petal.Width ~ iris$Species, main = "Petal Width by Species")
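A layout set with par(mfrow = ...) stays in effect for later plots on the same device. The sketch below saves the previous settings and restores them afterwards. pdf(NULL) opens a null device so the example runs without a window, and the loop variable v and the generated titles are illustrative.

```r
pdf(NULL)                        # null graphics device: nothing is written to disk
oldPar <- par(mfrow = c(2, 2))   # par() returns the previous settings

for (v in names(iris)[1:4]) {
  boxplot(iris[[v]] ~ iris$Species, main = paste(v, "by Species"))
}

par(oldPar)                      # back to the previous layout
layoutAfterReset <- par("mfrow")
dev.off()
print(layoutAfterReset)
```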
- Density curve over a histogram:
normDist <- rnorm(1000)
hist(normDist, freq = FALSE)
lines(density(normDist))
- Shape and color of scatter plot points:
# One shape and color for all points
plot(cars, pch = 2, col = "red")
# Shape and color by category
plot(iris$Sepal.Length, iris$Sepal.Width, pch = as.numeric(iris$Species), col = iris$Species)
The ggplot2 package
- Installing and loading ggplot2:
install.packages("ggplot2")
library(ggplot2)
- Histogram:
library(ggplot2)
normNums <- rnorm(1000)
histDf <- data.frame(norm_nums = normNums)
ggplot(histDf, aes(x = norm_nums)) + geom_histogram()
# Narrower bins, so more of them
ggplot(histDf, aes(x = norm_nums)) + geom_histogram(binwidth = 0.1)
# Wider bins, so fewer of them
ggplot(histDf, aes(x = norm_nums)) + geom_histogram(binwidth = 0.5)
• Box plot:
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
• Line chart:
library(ggplot2)
x <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-31"), by = 1)
set.seed(123)
y <- sample(1:100, size = 31, replace = TRUE)
lineDf <- data.frame(x = x, y = y)
ggplot(lineDf, aes(x = x, y = y)) + geom_line()
# Change the date label format (the default here is %b %d)
ggplot(lineDf, aes(x = x, y = y)) + geom_line() + scale_x_date(date_labels = "%m.%d")
• Scatter plot:
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point()
• Bar chart:
library(ggplot2)
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
  iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
# When the input is raw data that has not been counted yet
iceCreamDf <- data.frame(ice_cream_flavor = iceCreamFlavor)
ggplot(iceCreamDf, aes(x = ice_cream_flavor)) + geom_bar()
# When the input has already been counted
flavor <- names(table(iceCreamFlavor))
votes <- as.vector(unname(table(iceCreamFlavor)))
iceCreamDf <- data.frame(flavor = flavor, votes = votes)
ggplot(iceCreamDf, aes(x = flavor, y = votes)) + geom_bar(stat = "identity")
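The pre-counted data frame can also be built more directly in base R. A minimal sketch, assuming the same four flavours; the seed value and the column name votes are arbitrary choices made for illustration:

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible
flavors <- c("vanilla", "chocolate", "matcha", "other")
# sample() can draw all 100 values in one call, so no loop is needed
iceCreamFlavor <- sample(flavors, size = 100, replace = TRUE)
# as.data.frame() on a table yields one row per category plus a Freq column
iceCreamDf <- as.data.frame(table(flavor = iceCreamFlavor), stringsAsFactors = FALSE)
names(iceCreamDf)[2] <- "votes"
iceCreamDf
```

The resulting data frame has the flavor and votes columns that geom_bar(stat = "identity") expects.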
• Function curve:
library(ggplot2)
sinDf <- data.frame(x = c(-pi, pi))
ggplot(sinDf, aes(x = x)) + stat_function(fun = sin, geom = "line")
ggplot2 package: common customizations
• Custom title, x-axis label, y-axis label, and hidden grid lines:
# Scatter plot of the built-in cars data
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
  ggtitle("Car Speed VS. Braking Distance") +
  xlab("Car Speed (mph)") +
  ylab("Braking Distance (ft)") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Major grid lines: panel.grid.major
# Minor grid lines: panel.grid.minor
# Major grid lines on the x-axis: panel.grid.major.x
# Major grid lines on the y-axis: panel.grid.major.y
# Minor grid lines on the x-axis: panel.grid.minor.x
# Minor grid lines on the y-axis: panel.grid.minor.y
• Flip the orientation:
# Using the bar chart as an example
library(ggplot2)
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
iceCreamDf <- data.frame(ice_cream_flavor = iceCreamFlavor)
ggplot(iceCreamDf, aes(x = ice_cream_flavor)) + geom_bar() +
coord_flip()
• Multiple plots in one figure:
# Box plots of the built-in iris data
library(ggplot2)
# Install and load the gridExtra package
install.packages("gridExtra")
library(gridExtra)
g1 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
g2 <- ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_boxplot()
g3 <- ggplot(iris, aes(x = Species, y = Petal.Length)) + geom_boxplot()
g4 <- ggplot(iris, aes(x = Species, y = Petal.Width)) + geom_boxplot()
grid.arrange(g1, g2, g3, g4, nrow = 2, ncol = 2)
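• Density curve over a histogram:
library(ggplot2)
normNums <- rnorm(1000)
histDf <- data.frame(norm_nums = normNums)
ggplot(histDf, aes(x = norm_nums)) +
  # The alpha argument sets transparency; after_stat(density) replaces the
  # older ..density.. notation (ggplot2 3.3.0 or later)
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  geom_density()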
• Shape and colour of scatter-plot points:
library(ggplot2)
# One shape and colour for every point
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(shape = 2, colour = "red")
# A different shape and colour for each category
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(shape = Species, colour = Species))
Exporting plots
• From the RStudio user interface:
Plots → Export
10. Data Processing Techniques
Querying a data frame
• Selecting rows with a logical vector is the approach used most often in practice:
iris[iris$Petal.Length >= 6, c("Sepal.Length", "Petal.Length", "Species")]
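The same query can be written with base R's subset(), which some readers find easier to scan. A minimal sketch; the object names byBracket and bySubset are illustrative only:

```r
# Bracket indexing with a logical condition on the rows
byBracket <- iris[iris$Petal.Length >= 6, c("Sepal.Length", "Petal.Length", "Species")]
# subset() evaluates the condition inside the data frame, so iris$ is not repeated
bySubset <- subset(iris, Petal.Length >= 6,
                   select = c(Sepal.Length, Petal.Length, Species))
# Both approaches should select the same rows and columns
all.equal(byBracket, bySubset)
```

Bracket indexing is generally the safer choice inside functions, while subset() is convenient for interactive work.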
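Observations and variables in a data frame
• Data for the data frames below (these definitions are omitted from the later examples):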
teamName <- c("Boston Red Sox", "Texas Rangers")
wins <- c(97, 96)
losses <- c(65, 66)
isChampion <- c(TRUE, FALSE)
season <- c(2013, 2011)
# list() keeps each value's type; c() would coerce every value to character
coloradoRockies2007 <- list("Colorado Rockies", 90, 73, FALSE, 2007)
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook", "Jinbe")
age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 46)
• Add and remove observations:
# stringsAsFactors = FALSE avoids factor-level errors when new text values are added
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
# Add an observation
greatMlbTeams <- rbind(greatMlbTeams, coloradoRockies2007)
greatMlbTeams
# Remove an observation
greatMlbTeams <- greatMlbTeams[-3, ]
greatMlbTeams
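When a new row mixes text and numbers, how it is built affects the column types after rbind(). A minimal sketch with a reduced two-column data frame; the names teams, asVector, and asList are made up for illustration:

```r
teams <- data.frame(teamName = c("Boston Red Sox", "Texas Rangers"),
                    wins = c(97, 96),
                    stringsAsFactors = FALSE)
# c() coerces mixed values to a single type, here character
asVector <- rbind(teams, c("Colorado Rockies", 90))
class(asVector$wins)
# list() keeps each element's own type, so wins stays numeric
asList <- rbind(teams, list("Colorado Rockies", 90))
class(asList$wins)
```

Keeping numeric columns numeric matters as soon as they are used in arithmetic or sorting.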
• Add and remove variables:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion)
# Add a variable
greatMlbTeams$season <- season
greatMlbTeams
# Remove a variable
greatMlbTeams$season <- NULL
greatMlbTeams
• Rename a variable:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
names(greatMlbTeams)[4] <- "isWorldSeriesChampion"
greatMlbTeams
• Reorder variables:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
greatMlbTeams <- greatMlbTeams[, c("teamName", "season", "isChampion", "wins", "losses")]
greatMlbTeams
• Recode a categorical variable:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
# ifelse() maps TRUE to "Y" and FALSE to "N" in a single step
greatMlbTeams$isChampion <- ifelse(greatMlbTeams$isChampion, "Y", "N")
greatMlbTeams
• Recode a numeric variable as a categorical variable:
strawHatDf <- data.frame(name, age)
strawHatDf$ageCategory <- cut(strawHatDf$age, breaks = c(0, 20, 30, 40, Inf), labels = c("0 < Age <= 20", "20 < Age <= 30", "30 < Age <= 40", "Age > 40"))
strawHatDf
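How cut() treats values that sit exactly on a break is easy to misread. A minimal sketch using a short made-up vector of ages with the same breaks, leaving the labels at their defaults so the interval notation is visible:

```r
ages <- c(19, 20, 21, 90)
# With the default right = TRUE each interval is (lower, upper],
# so a value equal to a break belongs to the interval below it
rightClosed <- cut(ages, breaks = c(0, 20, 30, 40, Inf))
rightClosed
# right = FALSE switches to [lower, upper), moving such values up one interval
leftClosed <- cut(ages, breaks = c(0, 20, 30, 40, Inf), right = FALSE)
leftClosed
```

The custom labels in the example above ("0 < Age <= 20" and so on) match the default right-closed behaviour.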
Combining data frames
• Stack data frames vertically:
carsUpper <- cars[1:25, ]
carsBottom <- cars[26:50, ]
carsCombined <- rbind(carsUpper, carsBottom)
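• Join data frames side by side:
# drop = FALSE keeps a single column as a data frame instead of a plain vector,
# so that cbind() returns a data frame rather than a matrix
carsLeft <- cars[, 1, drop = FALSE]
carsRight <- cars[, 2, drop = FALSE]
carsCombined <- cbind(carsLeft, carsRight)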
• Merge data frames whose key variable has the same name:
# Left data frame
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Tony Tony Chopper")
age <- c(19, 21, 20, 17)
leftDf <- data.frame(name, age)
# Right data frame
name <- c("Monkey·D·Luffy", "Tony Tony Chopper", "Nico Robin", "Brook")
devilFruit <- c("Gum-Gum Fruit", "Human-Human Fruit", "Flower-Flower Fruit", "Revive-Revive Fruit")
rightDf <- data.frame(name, devilFruit)
# By default only rows found in both data frames are kept (inner join)
mergedDf <- merge(leftDf, rightDf)
mergedDf
# Keep every row of the left data frame (left join)
mergedDfX <- merge(leftDf, rightDf, all.x = TRUE)
mergedDfX
# Keep every row of the right data frame (right join)
mergedDfY <- merge(leftDf, rightDf, all.y = TRUE)
mergedDfY
# Keep every row of both data frames (full join)
mergedDfXY <- merge(leftDf, rightDf, all.x = TRUE, all.y = TRUE)
mergedDfXY
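Counting the rows each kind of merge returns is a quick way to check which join was performed. A minimal sketch with shortened, illustrative names standing in for the crew members above; joinCounts is a made-up variable name:

```r
leftDf <- data.frame(name = c("Luffy", "Zoro", "Nami", "Chopper"),
                     age = c(19, 21, 20, 17))
rightDf <- data.frame(name = c("Luffy", "Chopper", "Robin", "Brook"),
                      devilFruit = c("Gum-Gum", "Human-Human", "Flower-Flower", "Revive-Revive"))
# Two names appear in both data frames, two only on the left, two only on the right
joinCounts <- c(
  inner = nrow(merge(leftDf, rightDf)),
  left  = nrow(merge(leftDf, rightDf, all.x = TRUE)),
  right = nrow(merge(leftDf, rightDf, all.y = TRUE)),
  full  = nrow(merge(leftDf, rightDf, all = TRUE))  # all = TRUE sets both all.x and all.y
)
joinCounts
```

Rows without a partner on the other side receive NA in the columns that come from that side.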
• Merge data frames whose key variables have different names:
# Left data frame
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Tony Tony Chopper")
age <- c(19, 21, 20, 17)
leftDf <- data.frame(name, age)
# Right data frame
character <- c("Monkey·D·Luffy", "Tony Tony Chopper", "Nico Robin", "Brook")
devilFruit <- c("Gum-Gum Fruit", "Human-Human Fruit", "Flower-Flower Fruit", "Revive-Revive Fruit")
rightDf <- data.frame(character, devilFruit)
# By default only rows found in both data frames are kept (inner join)
mergedDf <- merge(leftDf, rightDf, by.x = "name", by.y = "character")
mergedDf
The tidyverse package
• Install and load the tidyverse package:
install.packages("tidyverse")
library(tidyverse)
• tidyverse bundles several packages; among them, magrittr provides the %>% operator:
# Call the function the traditional way
summary(cars)
# Use the %>% operator
library(tidyverse)
cars %>% summary()
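• tidyverse bundles several packages; among them, tidyr converts between long and wide tables.
• tidyverse bundles several packages; among them, dplyr provides many functions modelled on Structured Query Language (SQL).
The %>% operator from the magrittr package
• The %>% operator is useful when several function calls are chained:
# Approach 1: create intermediate objects and call the functions the traditional way
sysDate <- Sys.Date()
sysDateYr <- format(sysDate, format = "%Y")
sysDateNum <- as.numeric(sysDateYr)
sysDateNum
# Approach 2: avoid intermediate objects by nesting the traditional calls
sysDateNum <- as.numeric(format(Sys.Date(), format = "%Y"))
sysDateNum
# Approach 3: avoid intermediate objects by using the %>% operator
library(tidyverse)
sysDateNum <- Sys.Date() %>%
  format(format = "%Y") %>%
  as.numeric()
sysDateNum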
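• Using arithmetic operators in a pipeline:
library(tidyverse)
beyondStart <- 1983
beyondYr <- Sys.Date() %>%
  format(format = "%Y") %>%
  as.numeric() %>%
  # The ` character is called a backtick; quoting - with backticks lets it be called like a function
  `-` (beyondStart)
beyondYr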
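• By default %>% passes its input as the first argument of the function; when needed, use . to say where the input should go:
# Call lm() the traditional way
carsLm <- lm(formula = dist ~ speed, data = cars)
# Use %>% and mark with . where cars should be passed
library(tidyverse)
carsLm <- cars %>%
  lm(formula = dist ~ speed, data = .)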
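For comparison, base R has its own pipe operator, |>, which needs no package. A minimal sketch, assuming R 4.1 or later (where both |> and the \(x) shorthand for anonymous functions are available); the variable names are illustrative:

```r
# Native pipe: the left-hand side becomes the first argument of the call on the right
currentYear <- Sys.Date() |>
  format(format = "%Y") |>
  as.numeric()
currentYear
# To pass the input somewhere other than the first argument,
# one option is to wrap the call in an anonymous function
carsLm <- cars |> (\(d) lm(dist ~ speed, data = d))()
coef(carsLm)
```

The magrittr . placeholder shown above is not understood by |>, which is why the second call uses an anonymous function instead.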
Converting between long and wide tables with tidyr
• Data for the data frames below (these definitions are omitted from the later examples):
teamName <- c("Boston Red Sox", "Texas Rangers")
wins <- c(97, 96)
losses <- c(65, 66)
• Convert between long and wide tables:
library(tidyverse)
greatMlbTeams <- data.frame(teamName, wins, losses)
# Wide to long: key names the new categorical variable, value names the new numeric variable
longFormat <- gather(greatMlbTeams, key = variable_names, value = values, wins, losses)
longFormat
# Long to wide
wideFormat <- spread(longFormat, key = variable_names, value = values)
wideFormat
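Recent tidyr releases (1.0.0 and later) describe gather() and spread() as superseded by pivot_longer() and pivot_wider(). A minimal sketch of the same conversion with the newer functions, assuming such a tidyr version is installed:

```r
library(tidyr)
greatMlbTeams <- data.frame(teamName = c("Boston Red Sox", "Texas Rangers"),
                            wins = c(97, 96),
                            losses = c(65, 66))
# Wide to long: names_to and values_to play the roles of key and value in gather()
longFormat <- pivot_longer(greatMlbTeams, cols = c(wins, losses),
                           names_to = "variable_names", values_to = "values")
longFormat
# Long to wide: names_from and values_from play the roles of key and value in spread()
wideFormat <- pivot_wider(longFormat, names_from = variable_names, values_from = values)
wideFormat
```

Both functions return tibbles; as.data.frame() converts them back to base data frames if needed.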
Introduction to dplyr functions
• Data for the data frames below (these definitions are omitted from the later examples):
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook", "Jinbe")
gender <- c("male", "male", "female", "male", "male", "male", "female", "male", "male", "male")
age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 46)
• filter() keeps the observations that meet a condition:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
filter(strawHatDf, gender == "female")
# The same query in base R syntax
strawHatDf[strawHatDf$gender == "female", ]
• select() keeps the chosen variables (and can rename them):
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
select(strawHatDf, crew_name = name, gender)
# The same query in base R syntax, keeping the result as a data frame
names(strawHatDf)[1] <- "crew_name"
strawHatDf[, c("crew_name", "gender"), drop = FALSE]
• mutate() adds a variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
mutate(strawHatDf, age_two_years_ago = age - 2)
• arrange() sorts observations by a variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
arrange(strawHatDf, age)
# Sort from largest to smallest
arrange(strawHatDf, desc(age))
• summarise() aggregates a variable, for example into a sum, mean, or standard deviation:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
summarise(strawHatDf, mean(age))
• group_by() groups observations by a categorical variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
group_by(strawHatDf, gender) %>%
  summarise(mean(age)) %>%
  # Convert the tibble back to a base data frame
  as.data.frame()
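As with filter() and select(), the grouped mean has a base R counterpart. A minimal sketch using aggregate() on the same gender and age values; meanAgeByGender is an illustrative name:

```r
gender <- c("male", "male", "female", "male", "male", "male", "female", "male", "male", "male")
age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 46)
strawHatDf <- data.frame(gender, age)
# The formula reads "age grouped by gender"; FUN is applied within each group
meanAgeByGender <- aggregate(age ~ gender, data = strawHatDf, FUN = mean)
meanAgeByGender
```

aggregate() returns an ordinary data frame with one row per group, so no conversion from a tibble is needed.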
Data frame processing efficiency
• Speed ranking: vectorized computation > the apply() family > loops:
# Build a data frame; runif() draws the requested number of values from a uniform distribution
heights <- ceiling(runif(500000) * 50) + 140
weights <- ceiling(runif(500000) * 50) + 40
hwDf <- data.frame(heights, weights)
# A loop is the slowest option
bmi <- rep(NA, times = nrow(hwDf))
system.time(
  for (i in 1:nrow(hwDf)) {
    bmi[i] <- hwDf[i, "weights"] / (hwDf[i, "heights"] / 100)^2
  }
)
# The apply() family is in the middle
bmiFunction <- function(x, y) {
  x / (y / 100)^2
}
system.time(
  bmi <- mapply(hwDf$weights, hwDf$heights, FUN = bmiFunction)
)
# Vectorized computation is the fastest option
system.time(
  bmi <- hwDf$weights / (hwDf$heights / 100)^2
)
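Before comparing timings it is worth confirming that the three approaches compute the same values. A minimal sketch on a smaller sample; the seed, the sample size of 1000, and the variable names are arbitrary choices for illustration:

```r
set.seed(1)  # arbitrary seed for a reproducible sketch
heights <- ceiling(runif(1000) * 50) + 140
weights <- ceiling(runif(1000) * 50) + 40
# Loop: one element at a time into a pre-allocated vector
bmiLoop <- numeric(length(heights))
for (i in seq_along(heights)) {
  bmiLoop[i] <- weights[i] / (heights[i] / 100)^2
}
# mapply(): the function is called once per pair of elements
bmiMapply <- mapply(function(w, h) w / (h / 100)^2, weights, heights)
# Vectorized: one arithmetic expression over the whole vectors
bmiVector <- weights / (heights / 100)^2
all.equal(bmiLoop, bmiVector)
all.equal(bmiMapply, bmiVector)
```

Only the amount of interpreted R code executed per element differs between the three, which is where the speed gap comes from.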
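• The apply() family of functions:
# Custom function that returns the number of distinct values
distinctCounts <- function(x) {
  return(length(unique(x)))
}
# apply()
# MARGIN = 2 applies the function over columns (variables), as in this example
apply(iris, MARGIN = 2, distinctCounts)
# MARGIN = 1 applies the function over rows (observations)
apply(iris, MARGIN = 1, distinctCounts)
# lapply() returns the results as a list
lapply(iris, FUN = distinctCounts)
# sapply() simplifies the results to a vector, matching apply() with MARGIN = 2 here
sapply(iris, FUN = distinctCounts)
# tapply() applies the function within groups, much as table() counts within groups
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = distinctCounts)
# mapply() is the multi-argument version of sapply()
mapply(iris, FUN = distinctCounts)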
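Base R also offers vapply(), a stricter relative of sapply() that checks the type and length of every result. A minimal sketch reusing the distinct-count function; counts is an illustrative name:

```r
distinctCounts <- function(x) {
  length(unique(x))
}
# FUN.VALUE = integer(1) declares that each call must return a single integer;
# vapply() raises an error if any call returns something else
counts <- vapply(iris, distinctCounts, FUN.VALUE = integer(1))
counts
```

Because the output type is fixed in advance, vapply() behaves predictably even when the input is empty, which makes it a common choice inside functions.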
11. Writing a Data Analysis Report
Creating an R Markdown file (.Rmd)
• An R Markdown file is a plain-text file that the knitr package can turn into a data analysis report.
• Create an R Markdown file:
Choose File → New File → R Markdown.... The first time you create an R Markdown file, a prompt offers to install the required packages. Then fill in the Document, Title, Author, and Default Output Format fields.
• Save the R Markdown file:
Click Knit, choose an encoding (usually UTF-8), and then choose where to save the file.
Basic elements of an analysis document
• Section headings, from level 1 to level 6:
# Level 1 heading
## Level 2 heading
### Level 3 heading
#### Level 4 heading
##### Level 5 heading
###### Level 6 heading
• Body text:
Type body text directly. This is **bold** and this is *italic*.
• Inline code:
Use the `q()` function to quit RStudio.
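• Code chunks:
```
myObj <- "Learn R the Easy Way"
```
Adding {r} after the opening fence makes the chunk run when the document is rendered:
```{r}
plot(cars)
```
Chunk options can follow the r inside the braces. The defaults are:
echo = TRUE: the code is shown in the document;
message = TRUE: messages returned by the code are shown in the document;
warning = TRUE: warnings are shown in the document;
results = 'markup': results are shown in the document (other values: 'asis', 'hold', 'hide');
error = FALSE: code that raises an error is not allowed, so rendering stops.
For example, to hide the code:
```{r echo = FALSE}
plot(cars)
```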
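To apply the same options to every chunk instead of repeating them, knitr provides knitr::opts_chunk$set(). A minimal sketch, assuming it is run in a chunk near the top of the .Rmd file; the particular option values are only an illustration:

```r
# Typically placed in the first chunk of the document
knitr::opts_chunk$set(
  echo = FALSE,     # hide the code of every chunk
  message = FALSE,  # hide messages
  warning = FALSE   # hide warnings
)
```

Options written in an individual chunk header still override these document-wide defaults.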
• Lists:
- Parent item 1
    - Child item 1.1
    - Child item 1.2
* Parent item 2
    * Child item 2.1
    * Child item 2.2
• Tables:
|Data format|Function|Package|
|-----------|--------|-------|
|Structured text|`read.table()`|`utils`|
|Unstructured text|`readLines()`|`base`|
|Excel spreadsheet|`read_excel()`|`readxl`|
|JSON|`fromJSON()`|`jsonlite`|
• Images:
![R_logo](https://storage.googleapis.com/learn-r-the-easy-way.appspot.com/screenshots_ch16/Rlogo.png)
• Links:
1. Install R: [CRAN](https://cran.r-project.org/)
2. Install RStudio: [RStudio](https://www.rstudio.com/products/rstudio/download/)
• Block quotes:
> R, at its heart, is a functional programming (FP) language.
By Hadley Wickham
12. A Collection of Practical R Techniques
Summing data
• Sum a matrix:
iceCream <- matrix(round(runif(15) * 100), nrow = 5)
colnames(iceCream) <- c("Vanilla", "Chocolate", "Strawberry")
rownames(iceCream) <- c("Mon", "Tue", "Wed", "Thu", "Fri")
# rowSums() adds up each row
iceCream <- cbind(iceCream, Total = rowSums(iceCream))
# colSums() adds up each column
iceCream <- rbind(iceCream, Total = colSums(iceCream))
iceCream
Returning index positions
• match() returns the index of the first element equal to a given value:
myVector <- c(11:20, 17)
match(17, myVector)
• which() returns the indices of every element for which the condition is TRUE:
myVector <- c(11:20, 17)
which(myVector == 17)
• which.min() returns the index of the first minimum:
myVector <- c(11:20, 11, 20)
which.min(myVector)
# Return the indices of all minimum values
which(myVector == min(myVector))
• which.max() returns the index of the first maximum:
myVector <- c(11:20, 11, 20)
which.max(myVector)
# Return the indices of all maximum values
which(myVector == max(myVector))
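The difference between match() and which() is clearest when a value occurs more than once or not at all. A minimal sketch; firstHit and allHits are illustrative names:

```r
myVector <- c(11:20, 17)
# 17 occurs twice: once inside 11:20 and once as the appended last element
firstHit <- match(17, myVector)
allHits <- which(myVector == 17)
firstHit
allHits
# When nothing matches, match() returns NA while which() returns an empty vector
match(99, myVector)
which(myVector == 99)
```

The NA from match() needs an explicit is.na() check, whereas the empty result of which() can be tested with length().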
Sorting data
• Sort a vector:
myVector <- round(runif(10) * 100)
# Unsorted vector
myVector
# Sort in increasing order
sort(myVector)
# Sort in decreasing order
sort(myVector, decreasing = TRUE)
• Sort a data frame:
# order() returns the row indices in sorted order
reorderedCars <- cars[order(cars$dist), ]
reorderedCars
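order() also accepts several keys, which is how ties in the first key are broken. A minimal sketch on the same built-in cars data; sorting by speed first is an arbitrary choice for illustration:

```r
# Sort by speed in increasing order; among equal speeds, by dist in decreasing order.
# Negating a numeric key reverses its direction.
reorderedCars <- cars[order(cars$speed, -cars$dist), ]
head(reorderedCars)
```

Negation only works for numeric keys; for other types, order() has a decreasing argument.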
Reading web page data
• Read HTML as lines of text:
url <- "https://www.imdb.com/title/tt7040874/"
aSimpleFavor <- readLines(url)
class(aSimpleFavor)
mode(aSimpleFavor)
• Install and load the rvest package:
install.packages("rvest")
library(rvest)
• Read HTML with the rvest package:
library(rvest)
url <- "https://www.imdb.com/title/tt7040874/"
aSimpleFavor <- read_html(url)
class(aSimpleFavor)
mode(aSimpleFavor)
# For the CSS selectors that html_nodes() expects, see https://www.w3.org/TR/2011/REC-css3-selectors-20110929/#selectors
# For regular expressions in R, see https://blog.yjtseng.info/post/regexpr/
# The selectors below depend on the page layout and may need updating if the site changes
# Extract the movie title
title <- aSimpleFavor %>%
  html_nodes(css = "h1") %>%
  html_text()
# Clean up and print the title; regexpr() returns the position of the first match,
# and fixed = TRUE treats ")" as a literal character rather than regex syntax
title <- regexpr(pattern = ")", title, fixed = TRUE) %>%
  substr(title, start = 1, stop = .)
title
# Extract the running time
time <- aSimpleFavor %>%
  html_nodes(css = "#title-overview-widget time") %>%
  html_text()
# Clean up and print the running time; gsub() finds and replaces text using a regular expression
time <- gsub(pattern = "\n\\s+", time, replacement = "")
time
# Extract and print the rating
rating <- aSimpleFavor %>%
  html_nodes(css = "strong span") %>%
  html_text() %>%
  as.numeric()
rating
Linear regression model
• Predict and plot:
# Sales data
temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
icedTeaSales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
# Print the intercept and the slope
lmFit <- lm(icedTeaSales ~ temperatures)
lmFit$coefficients
# Data to predict
toBePredicted <- data.frame(temperatures = 30)
predictedSales <- predict(lmFit, newdata = toBePredicted)
predictedSales
# Plot the sales data points (col sets the colour of solid symbols such as pch = 16)
plot(icedTeaSales ~ temperatures, col = "blue", pch = 16)
# Plot the predicted point
points(x = toBePredicted$temperatures, y = predictedSales, col = "red", cex = 2, pch = 17)
# Draw the regression line; abline() reads the coefficients from the fitted model
abline(reg = lmFit, col = "blue", lwd = 4)
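The value returned by predict() is simply the intercept plus the slope times the new temperature. A minimal sketch that checks this by hand with the same data; b, byHand, and byPredict are illustrative names:

```r
temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
icedTeaSales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
lmFit <- lm(icedTeaSales ~ temperatures)
# coef() returns the intercept first, then the slope
b <- coef(lmFit)
byHand <- unname(b[1] + b[2] * 30)
byPredict <- unname(predict(lmFit, newdata = data.frame(temperatures = 30)))
all.equal(byHand, byPredict)
# Share of the variance in sales that temperature explains
summary(lmFit)$r.squared
```

The column name in newdata must match the predictor name used in the formula, here temperatures.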
Decision tree classifier
• Install and load the rpart package:
install.packages("rpart")
library(rpart)
• Using the built-in iris data, split the data into training and test sets:
# Custom function: shuffle the data, then split it by the given proportion
trainTestSplit <- function(x, trainPercentage) {
  n <- nrow(x)
  dataShuffled <- x[sample(n), ]
  trainTestCut <- round(trainPercentage * n)
  trainData <- dataShuffled[1:trainTestCut, ]
  testData <- dataShuffled[(trainTestCut + 1):n, ]
  outputs <- list(Train = trainData, Test = testData)
  return(outputs)
}
# Split the data
irisTrainTest <- trainTestSplit(iris, trainPercentage = 0.7)
irisTrain <- irisTrainTest$Train
irisTest <- irisTrainTest$Test
# Build the decision tree classifier; Species ~ . means Species is explained by all other variables
irisClf <- rpart(Species ~ ., data = irisTrain, method = "class")
# Predict on the test data
predicted <- predict(irisClf, irisTest, type = "class")
# Compare irisTest$Species with predicted to get the classifier's accuracy
confMat <- table(irisTest$Species, predicted)
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy
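Because the split is random, the accuracy changes from run to run. A minimal sketch of a repeatable split in base R, assuming a fixed seed is acceptable; the seed value 42 and the index-based approach are illustrative choices:

```r
set.seed(42)  # arbitrary seed; any fixed value makes the shuffle repeatable
n <- nrow(iris)
shuffled <- sample(n)
# The first 70% of the shuffled row numbers form the training set
trainIdx <- shuffled[1:round(0.7 * n)]
irisTrain <- iris[trainIdx, ]
irisTest <- iris[-trainIdx, ]
c(train = nrow(irisTrain), test = nrow(irisTest))
# The class balance of the training set depends on the seed
table(irisTrain$Species)
```

Reporting the seed alongside the accuracy lets others reproduce the same number.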
K-means clustering
• Using the built-in iris data:
# Numeric columns only
irisKmeans <- iris[, -5]
# Print the clustering result (3 clusters; nstart = 20 tries 20 random starts and keeps the best)
kmeansFit <- kmeans(irisKmeans, nstart = 20, centers = 3)
kmeansFit
# Print the ratio of within-cluster to total sum of squares (total WSS / total SS)
ratio <- kmeansFit$tot.withinss / kmeansFit$totss
ratio
# Draw a scree plot of the same ratio for k = 2 to 10
ratio <- rep(NA, times = 10)
for (k in 2:length(ratio)) {
  kmeansFit <- kmeans(irisKmeans, nstart = 20, centers = k)
  ratio[k] <- kmeansFit$tot.withinss / kmeansFit$totss
}
plot(ratio, type = "b", xlab = "k")
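Since iris comes with known species labels, the clusters can be cross-tabulated against them as a sanity check. A minimal sketch; the seed and the name clusterVsSpecies are illustrative:

```r
set.seed(123)  # arbitrary seed so the random starts are repeatable
kmeansFit <- kmeans(iris[, -5], centers = 3, nstart = 20)
# Rows: true species; columns: cluster number (the numbering itself is arbitrary)
clusterVsSpecies <- table(iris$Species, kmeansFit$cluster)
clusterVsSpecies
```

A species concentrated in a single column was recovered cleanly; a species spread over two columns overlaps with a neighbouring cluster.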
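13. Statistical Probability Distribution Functions
• Meaning of the function name prefixes:
d stands for density and returns the probability density.
p stands for probability and returns the cumulative probability.
q stands for quantile and returns the quantile.
r stands for random and returns random values.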
• Meaning of the function name suffixes:
unif refers to the uniform distribution.
norm refers to the normal distribution.
binom refers to the binomial distribution.
pois refers to the Poisson distribution.
chisq refers to the chi-squared distribution.
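The four prefixes fit together: p and q are inverses of each other, d gives the height of the curve, and r draws samples. A minimal sketch using the standard normal distribution; the seed is arbitrary:

```r
# p-function: cumulative probability P(X <= 1.96) for a standard normal
p <- pnorm(1.96)
p
# q-function: the quantile that has that cumulative probability, recovering 1.96
qnorm(p)
# d-function: density at 0, which equals 1 / sqrt(2 * pi)
dnorm(0)
# r-function: random draws; set.seed() makes them repeatable
set.seed(1)
rnorm(3)
```

The same pattern applies to every suffix listed above, for example punif() with qunif() or ppois() with qpois().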
Uniform distribution
• By default the uniform distribution has a minimum of 0 and a maximum of 1; change them with the min and max arguments.
• dunif() function:
x <- seq(from = -2, to = 3, by = 0.01)
y <- dunif(x, min = -1, max = 2)
plot(x, y, type = "l", ylab = "Probability Density")
• punif() function:
punif(0.5)
• qunif() function:
qunif(0.5)
• runif() function:
x <- runif(1000)
hist(x, ylab = "Frequency")
Normal distribution
• By default the normal distribution is the standard normal, with mean 0 and standard deviation 1; change them with the mean and sd arguments.
• dnorm() function:
x <- seq(from = -3, to = 3, by = 0.01)
y <- dnorm(x)
plot(x, y, type = "l", ylab = "Probability Density")
• pnorm() function:
pnorm(1.96)
• qnorm() function:
qnorm(0.975)
• rnorm() function:
x <- rnorm(1000)
hist(x, ylab = "Frequency")
Binomial distribution
• Example: tossing a fair coin. The size argument is the number of tosses and the prob argument is the probability of success on each toss.
• dbinom() function:
x <- 0:100
y <- dbinom(x, size = 100, prob = 0.5)
plot(x, y, type = "l", ylab = "Probability Density")
• pbinom() function:
pbinom(50, size = 100, prob = 0.5)
• qbinom() function:
qbinom(0.53, size = 100, prob = 0.5)
• rbinom() function:
x <- rbinom(1000, size = 100, prob = 0.5)
hist(x, ylab = "Frequency")
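For a discrete distribution such as the binomial, the d, p, and q functions can be checked against each other directly. A minimal sketch with the same 100 fair coin tosses; pmf is an illustrative name:

```r
x <- 0:100
pmf <- dbinom(x, size = 100, prob = 0.5)
# The probabilities over all possible outcomes sum to 1
sum(pmf)
# pbinom(50, ...) is the running total of dbinom() from 0 through 50
all.equal(sum(pmf[x <= 50]), pbinom(50, size = 100, prob = 0.5))
# qbinom() returns the smallest count whose cumulative probability reaches p
qbinom(0.53, size = 100, prob = 0.5)
```

Because the distribution is discrete, a whole range of probabilities maps to the same count in qbinom().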
Poisson distribution
• The distribution of the number of events in a unit of time; the lambda argument (the mean number of events per unit of time) must be specified.
• dpois() function:
x <- 0:20
y <- dpois(x, lambda = 4)
plot(x, y, type = "l", ylab = "Probability Density")
• ppois() function:
ppois(4, lambda = 4)
• qpois() function:
qpois(0.62, lambda = 4)
• rpois() function:
x <- rpois(1000, lambda = 4)
hist(x, ylab = "Frequency")
Chi-squared distribution
• dchisq() function:
x <- 1:50
y <- dchisq(x, df = 5)
plot(x, y, type = "l", ylab = "Probability Density")
• pchisq() function:
pchisq(5, df = 5)
• qchisq() function:
qchisq(0.58, df = 5)
• rchisq() function:
x <- rchisq(1000, df = 5)
hist(x, ylab = "Frequency")
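14. Learning Resources Recommended by the Book's Author
|Resource|Type|Suitable for|
|--------|----|------------|
|R in Action|Website, print book|Beginners, intermediate users|
||E-book|Beginners, intermediate users, upper-intermediate users|
||E-book, print book|Intermediate users|
||Website, print book|Upper-intermediate users|
|The Art of R Programming|Print book|Upper-intermediate users|
||Course||
||Course||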