Timmy's Column: Study Notes on 輕鬆學習R語言 (Learn R the Easy Way)

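　　The book 《輕鬆學習R語言》 (Learn R the Easy Way) was written by 郭耀仁, who has put the full text online for anyone to read. I have attended several of the author's talks and seen his enthusiasm for and command of R, so I chose this book as my starting point for learning the language.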

1. Setting Up the R Development Environment

Download and Installation

• Full text of the book:
http://www.learn-r-the-easy-way.tw/

• Code downloads for the book:
http://books.gotop.com.tw/download/AEL018500

• Download and install R:
R is free, open-source software that combines statistical analysis with graphics.
https://cran.r-project.org/

• Download and install RStudio:
RStudio is an integrated development environment (IDE).
https://www.rstudio.com/products/rstudio/download/

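The RStudio IDE

• Top left: Source
Open it with File → New File → R Script. This is where you write and run code. Select the code you want to run before executing it; otherwise only the line at the cursor runs, one line at a time.

• Bottom left: Console
Where you type and run single lines of code.

• Top right: Environment and History

• Bottom right: Files, Plots, Packages, Help, and Viewer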

2. Understanding the Variable Types

Variable types

• The class() function returns the type of a variable:
> class(2) # numeric
[1] "numeric"
> class(2L) # integer
[1] "integer"
> class(TRUE) # logical; only TRUE, FALSE, T, and F are accepted
[1] "logical"
> class("Hello World") # character
[1] "character"
> class(Sys.Date()) # date
[1] "Date"
> class(Sys.time()) # date-time
[1] "POSIXct" "POSIXt"

Looking up documentation

• Either form works:
?class
help(class)

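Assignment

• Both forms work. Most other languages assign with =, while <- is characteristic of R. The author recommends <-: inside a function call, = is read as naming an argument, whereas <- is always an assignment: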
myNum <- 2
myNum = 2

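The difference is easiest to see inside a function call. The sketch below uses a hypothetical helper, square(), which is not from the book; it suggests that = only names an argument, while <- also creates an object in the calling environment.

```r
# square() is a hypothetical helper used only for this illustration
square <- function(value) {
  value ^ 2
}

# '=' inside a call only matches the argument named 'value';
# no object called 'value' is created where the call is made
resultEquals <- square(value = 3)
createdByEquals <- exists("value", envir = environment(), inherits = FALSE)

# '<-' inside a call first assigns 3 to 'value' in the calling
# environment, then passes that value to the function
resultArrow <- square(value <- 3)
createdByArrow <- exists("value", envir = environment(), inherits = FALSE)

print(c(createdByEquals, createdByArrow))
```

Both calls return 9; only the second one leaves an object named value behind.
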
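Conditions

• Greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=) work as in Python and most other languages. AND (&) and OR (|) combine logical values element by element:
> # Test whether 7 is contained in c(8, 7); c(8, 7) is a vector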
> 7 %in% c(8, 7)
[1] TRUE

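A short sketch of how these operators behave on whole vectors. The scores vector is made up for the illustration and does not come from the book.

```r
# Hypothetical data for the illustration
scores <- c(55, 72, 90)

# & and | compare element by element and return one value per element
passedNotTop <- scores >= 60 & scores < 90
lowOrTop <- scores < 60 | scores >= 90

# %in% tests membership for each element on the left-hand side
isListed <- c(72, 100) %in% scores

print(passedNotTop)
print(lowOrTop)
print(isListed)
```
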
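Arithmetic

• Addition (+), subtraction (-), multiplication (*), and division (/) work as in Python and most other languages; exponentiation is written ^ or ** (in Python only ** is exponentiation):
> 11 %% 10 # remainder
[1] 1
> 7 + 8L + TRUE # TRUE counts as 1 (or 1L)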
[1] 16
> class(7 + 8L + TRUE)
[1] "numeric"
> 7 + 8L + FALSE # FALSE counts as 0 (or 0L)
[1] 15
> class(7 + 8L + FALSE)
[1] "numeric"

Dates and Times

• Converting a date to a string or an integer:
> sysDate <- Sys.Date()
> # date to string
> as.character(sysDate)
[1] "2018-09-01"
> # date to integer (days since 1970-01-01)
> as.integer(sysDate)
[1] 17775

• Converting a string to a date, and date arithmetic:
> # string to date: the start of the Unix epoch
> dateOfOrigin <- as.Date("1970-01-01")
> # date arithmetic (adds one day)
> dateOfOrigin + 1
[1] "1970-01-02"

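• Converting a date-time to a string or an integer:
> sysTime <- Sys.time()
> # date-time to string
> as.character(sysTime)
[1] "2018-09-01 10:32:06"
> # date-time to integer (seconds since 1970-01-01 00:00:00 UTC)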
> as.integer(sysTime)
[1] 1535769126

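• Converting a string to a date-time, and date-time arithmetic:
> # string to date-time: the start of the Unix epoch in Greenwich Mean Time
> timeOfOrigin <- as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
> # date-time arithmetic (adds one second)
> timeOfOrigin + 1
[1] "1970-01-01 00:00:01 GMT"
> # string to date-time: the start of the Unix epoch in Chungyuan Standard Time (UTC+8, the session time zone here)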
> timeOfOriginCst = as.POSIXct("1970-01-01 08:00:00")
> as.integer(timeOfOriginCst)
[1] 0

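The integer shown for a date is a day count from the Unix epoch. A minimal sketch that checks this with date subtraction, using the date that appears in the transcript above:

```r
epoch <- as.Date("1970-01-01")
noteDate <- as.Date("2018-09-01")

# Subtracting two dates gives a difftime measured in days
daysSinceEpoch <- noteDate - epoch
print(daysSinceEpoch)

# The same count is what as.integer() returns for the date
print(as.integer(noteDate))

# Going the other way: rebuild the date from the day count
rebuilt <- as.Date(17775, origin = "1970-01-01")
print(rebuilt)
```
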
3. Checking and Converting Variable Types

Checking variable types

• The is.<type>() functions and inherits() return logical values:
> is.numeric(7.7)
[1] TRUE
> is.integer(7L)
[1] TRUE
> is.logical(FALSE)
[1] TRUE
> is.character("Hello World")
[1] TRUE
> inherits(Sys.Date(), what = "Date")
[1] TRUE
> inherits(Sys.time(), what = "POSIXct")
[1] TRUE

Converting variable types

• The as.<type>() functions convert between types:
> as.numeric(7L)
[1] 7
> as.integer(7)
[1] 7
> as.logical("TRUE")
[1] TRUE
> as.character(Sys.Date())
[1] "2018-09-01"
> as.Date("1970/01/01")
[1] "1970-01-01"
> as.Date("01/01/70", format = "%m/%d/%y")
[1] "1970-01-01"
> as.POSIXct("1970-01-01 00:00:00", tz = "GMT")
[1] "1970-01-01 GMT"
> as.POSIXct("1970-01-01 00:00:00")
[1] "1970-01-01 CST"

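• Date format codes:
%d   day of the month            01
%a   abbreviated weekday name    Mon
%A   full weekday name           Monday
%m   month                       01
%b   abbreviated month name      Jan
%B   full month name             January
%y   two-digit year              70
%Y   four-digit year             1970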

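The codes in the table work in both directions: format() turns a date into text and as.Date() parses text into a date. A small sketch; the variable names are chosen for the illustration.

```r
epoch <- as.Date("1970-01-01")

# Numeric codes give the same result in every locale
slashDate <- format(epoch, format = "%Y/%m/%d")
shortYear <- format(epoch, format = "%d-%m-%y")

# Name-based codes (%a, %A, %b, %B) follow the session locale,
# so the text they produce depends on the language settings
weekdayName <- format(epoch, format = "%A")

# The same codes parse strings back into dates
parsed <- as.Date("1970/01/01", format = "%Y/%m/%d")

print(slashDate)
print(shortYear)
print(parsed)
```
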
4. Grouping Variables Together

One-dimensional data structure: Vector

• Selecting vector elements:
> fourSeasons <- c("spring", "summer", "autumn", "winter")
> fourSeasons
[1] "spring" "summer" "autumn" "winter"
> fourSeasons[2]
[1] "summer"
> fourSeasons[c(-1, -3)]
[1] "summer" "winter"

• Testing vector elements:
> fourSeasons <- c("spring", "summer", "autumn", "winter")
> myFavoriteSeason <- fourSeasons == "summer"
> myFavoriteSeason
[1] FALSE  TRUE FALSE FALSE
> fourSeasons[myFavoriteSeason]
[1] "summer"

• rep() builds a vector by repeating a value:
> rep(7, times = 3)
[1] 7 7 7

• seq() builds an arithmetic sequence:
> seq(from = 7, to = 25, by = 7)
[1]  7 14 21
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10

One-dimensional data structure: Factor

• A factor stores strings together with level information.

• factor() converts a vector into a factor:
> fourSeasonsFactor <- factor(c("spring", "summer", "autumn", "winter"))
> fourSeasonsFactor
[1] spring summer autumn winter
Levels: autumn spring summer winter

• With ordered = TRUE but no levels = c() argument, factor() orders the levels alphabetically; pass levels to set the order yourself:
> fourSeasonsFactor <- factor(c("spring", "summer", "autumn", "winter"), ordered = TRUE, levels = c("winter", "spring", "autumn", "summer"))
> fourSeasonsFactor
[1] spring summer autumn winter
Levels: winter < spring < autumn < summer

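Once the levels are ordered, the factor can be compared and ranked. A minimal sketch that reuses the level order from the example above; the variable names are chosen for the illustration.

```r
seasons <- c("spring", "summer", "autumn", "winter")
seasonOrder <- c("winter", "spring", "autumn", "summer")
seasonsFactor <- factor(seasons, ordered = TRUE, levels = seasonOrder)

# Each element is stored as the integer position of its level
seasonCodes <- as.integer(seasonsFactor)

# Ordered factors support comparison: is spring ranked below summer?
springBeforeSummer <- seasonsFactor[1] < seasonsFactor[2]

# max() returns the element with the highest level
lastSeason <- as.character(max(seasonsFactor))

print(seasonCodes)
print(springBeforeSummer)
print(lastSeason)
```
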
Two-dimensional data structure: Matrix

• Data running horizontally form rows; data running vertically form columns.

• A matrix filled row by row:
> myMat <- matrix(1:6, nrow = 2, byrow = TRUE)
> myMat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

• A matrix filled column by column (the default), and selecting matrix elements:
> myMat <- matrix(1:6, nrow = 2)
> myMat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> myMat[2, 3]
[1] 6
> myMat[2, ]
[1] 2 4 6
> myMat[, 3]
[1] 5 6

• Testing matrix elements:
> myMat <- matrix(1:6, nrow = 2)
> filter <- myMat < 6 & myMat > 1
> filter
      [,1] [,2]  [,3]
[1,] FALSE TRUE  TRUE
[2,]  TRUE TRUE FALSE
> myMat[filter]
[1] 2 3 4 5

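Two-dimensional data structure: Data Frame

• Data running horizontally are observations; data running vertically are variables.

• The data used in the data frame; later examples omit these lines: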
> teamName <- c("Boston Red Sox", "Texas Rangers")
> wins <- c(97, 96)
> losses <- c(65, 66)
> isChampion <- c(TRUE, FALSE)
> season <- c(2013, 2011)

• Displaying a data frame:
> # Create the data frame; View(greatMlbTeams) opens it in the Source pane and can reopen it after the tab is closed
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> # Display the data frame in the Console
> greatMlbTeams
        teamName wins losses isChampion season
1 Boston Red Sox   97     65       TRUE   2013
2  Texas Rangers   96     66      FALSE   2011
> # Display the structure of the data frame in the Console
> str(greatMlbTeams)
'data.frame':   2 obs. of  5 variables:
 $ teamName  : Factor w/ 2 levels "Boston Red Sox",..: 1 2
 $ wins      : num  97 96
 $ losses    : num  65 66
 $ isChampion: logi  TRUE FALSE
 $ season    : num  2013 2011

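• Selecting data frame elements:
> # Strings are stored as factors by default in R before 4.0.0; since R 4.0.0 the default is stringsAsFactors = FALSE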
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> greatMlbTeams[1, 1]
[1] Boston Red Sox
Levels: Boston Red Sox Texas Rangers
> greatMlbTeams[1, ]
        teamName wins losses isChampion season
1 Boston Red Sox   97     65       TRUE   2013
> greatMlbTeams[, 1]
[1] Boston Red Sox Texas Rangers
Levels: Boston Red Sox Texas Rangers

> # Keeping strings as character, method 1
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
> greatMlbTeams[1, 1]
[1] "Boston Red Sox"
> greatMlbTeams[, 1]
[1] "Boston Red Sox" "Texas Rangers"

> # Keeping strings as character, method 2
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> as.character(greatMlbTeams[1, 1])
[1] "Boston Red Sox"
> as.character(greatMlbTeams[, 1])
[1] "Boston Red Sox" "Texas Rangers"

> # Selecting data frame elements by variable name
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
> greatMlbTeams$teamName
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[, "teamName"]
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[, c("teamName", "isChampion")]
        teamName isChampion
1 Boston Red Sox       TRUE
2  Texas Rangers      FALSE

• Filtering data frame rows:
> greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
> filter <- greatMlbTeams$isChampion == TRUE
> filter
[1] TRUE FALSE
> greatMlbTeams[filter, ]
        teamName wins losses isChampion season
1 Boston Red Sox   97     65       TRUE   2013

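Whether strings become factors depends on the R version, so it can help to state the choice explicitly. A minimal sketch using two of the vectors from this section; the names teamsAsFactor and teamsAsChar are chosen for the illustration.

```r
teamName <- c("Boston Red Sox", "Texas Rangers")
wins <- c(97, 96)

# Ask for factors explicitly
teamsAsFactor <- data.frame(teamName, wins, stringsAsFactors = TRUE)

# Ask for plain character strings explicitly
teamsAsChar <- data.frame(teamName, wins, stringsAsFactors = FALSE)

print(class(teamsAsFactor$teamName))
print(class(teamsAsChar$teamName))
```

Setting the argument in every data.frame() call makes the result independent of the installed R version.
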
Multi-dimensional data structure: Array

• Selecting array elements:
> myArr <- array(1:18, dim = c(3, 3, 2))
> myArr
, , 1
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
, , 2
     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18
> myArr[3, 2, 1]
[1] 6
> myArr[3, , ]
     [,1] [,2]
[1,]    3   12
[2,]    6   15
[3,]    9   18
> myArr[, 2, ]
     [,1] [,2]
[1,]    4   13
[2,]    5   14
[3,]    6   15
> myArr[, , 1]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Multi-dimensional data structure: List

• The data used in the list; later examples omit these lines:
> title <- "Great MLB Teams"
> teams <- c("Boston Red Sox", "Texas Rangers")
> wins <- c(97, 96)
> losses <- c(65, 66)
> winningPercentage <- wins / (wins + losses)
> season <- c(2013, 2011)
> winsLosses <- matrix(c(wins, losses), nrow = 2)
> df <- data.frame(Teams = teams, Winning_Percentage = winningPercentage, Season = season)

• Displaying a list:
> greatMlbTeams <- list(title, teams, winsLosses, df)
> # Display the list in the Source pane
> View(greatMlbTeams)
> # Display the list in the Console
> greatMlbTeams
[[1]]
[1] "Great MLB Teams"
[[2]]
[1] "Boston Red Sox" "Texas Rangers"
[[3]]
     [,1] [,2]
[1,]   97   65
[2,]   96   66
[[4]]
           Teams Winning_Percentage Season
1 Boston Red Sox          0.5987654   2013
2  Texas Rangers          0.5925926   2011

• Selecting list elements:
> greatMlbTeams <- list(title, teams, winsLosses, df)
> greatMlbTeams[[2]]
[1] "Boston Red Sox" "Texas Rangers"
> greatMlbTeams[[3]][1, ]
[1] 97 65
> greatMlbTeams[[4]]$Winning_Percentage
[1] 0.5987654 0.5925926

> # Selecting list elements by name
> greatMlbTeams <- list(Title = title, Teams = teams, Wins_Losses = winsLosses, DF = df)
> greatMlbTeams$Teams
[1] "Boston Red Sox" "Texas Rangers"

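Single and double brackets behave differently on a list, which is a common source of confusion. A minimal sketch with a small list made up for the illustration:

```r
# teamList is a hypothetical two-element list
teamList <- list(Title = "Great MLB Teams", Wins = c(97, 96))

# Single brackets keep the list wrapper: the result is a list of length 1
subList <- teamList["Wins"]

# Double brackets extract the element itself: here a numeric vector
winsVec <- teamList[["Wins"]]

# $ is shorthand for [[ ]] with a name
sameAsDollar <- identical(teamList$Wins, winsVec)

print(class(subList))
print(class(winsVec))
print(sameAsDollar)
```
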
• Most R functions return their output as a list:
> x <- 1:10
> y <- 2 * x + 5
> lmFit <- lm(formula = y ~ x)
> lmFit$coefficients
(Intercept)           x
          5           2
> lmFit$coefficients[1]
(Intercept)
          5
> lmFit$coefficients[2]
x
2

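5. Functions

Numeric functions

• abs() returns the absolute value:
> abs(-9)
[1] 9

• sqrt() returns the square root:
> sqrt(9)
[1] 3

• ceiling() rounds up to the next integer:
> ceiling(pi)
[1] 4

• floor() rounds down to the previous integer:
> floor(pi)
[1] 3

• round() rounds to the nearest value at the given number of digits:
> round(pi)
[1] 3
> round(pi, digits = 2)
[1] 3.14
> round(pi, digits = 4)
[1] 3.1416

• exp() returns e^x, where e = 2.7182818 is Euler's number:
> exp(2)
[1] 7.389056

• log() returns the natural logarithm (base e):
> log(exp(2))
[1] 2

• log10() returns the base-10 logarithm: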
> log10(10^3)
[1] 3

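Text functions

• toupper() converts text to upper case:
> toupper("Hello World")
[1] "HELLO WORLD"

• tolower() converts text to lower case:
> tolower("Hello World")
[1] "hello world"

• substr() extracts part of a string:
> substr("Hello World", start = 1, stop = 4)
[1] "Hell"

• grep() searches text; it returns the indices of the matching elements, or integer(0) when nothing matches: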
> grep(pattern = "Poor", c("Hello", "Poor", "World"))
[1] 2
> grep(pattern = "Hell", c("Hello", "Poor", "World"))
[1] 1
> grep(pattern = "poor", c("Hello", "Poor", "World"))
integer(0)
> grep(pattern = "poor", c("Hello", "Poor", "World"), ignore.case = TRUE)
[1] 2

• sub() replaces text:
> sub(pattern = "Hello", replacement = "Hell", c("Hello", "Poor", "World"))
[1] "Hell" "Poor" "World"
> sub(pattern = "hello", replacement = "Hell", c("Hello", "Poor", "World"), ignore.case = TRUE)
[1] "Hell" "Poor" "World"

• strsplit() splits text:
> strsplit("Hello Poor World", split = " ")
[[1]]
[1] "Hello" "Poor" "World"

• paste() joins text:
> paste("Hello", "Poor", "World")
[1] "Hello Poor World"
> paste("Hello", "Poor", "World", sep = "|")
[1] "Hello|Poor|World"

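sub() replaces only the first match in each string, while gsub() replaces every match. A short sketch of the difference, plus a split and rejoin round trip with the functions above:

```r
greeting <- "Hello Poor World"

# sub() stops after the first match in each string
firstOnly <- sub(pattern = "o", replacement = "0", greeting)

# gsub() replaces every match
everyMatch <- gsub(pattern = "o", replacement = "0", greeting)

# strsplit() returns a list, so [[1]] extracts the character vector
words <- strsplit(greeting, split = " ")[[1]]

# collapse joins the elements of one vector into a single string
rejoined <- paste(words, collapse = " ")

print(firstOnly)
print(everyMatch)
print(rejoined)
```
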
Descriptive statistics functions

• mean() returns the mean:
> mean(1:5)
[1] 3
> mean(c(1:5, NA)) # with a missing value
[1] NA
> mean(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 3

• sd() returns the standard deviation:
> sd(1:5)
[1] 1.581139
> sd(c(1:5, NA)) # with a missing value
[1] NA
> sd(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1.581139

• median() returns the median:
> median(1:5)
[1] 3
> median(c(1:5, NA)) # with a missing value
[1] NA
> median(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 3

• range() returns the minimum and the maximum:
> range(1:5)
[1] 1 5
> range(c(1:5, NA)) # with a missing value
[1] NA NA
> range(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1 5

• sum() returns the total:
> sum(1:5)
[1] 15
> sum(c(1:5, NA)) # with a missing value
[1] NA
> sum(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 15

• max() returns the maximum:
> max(1:5)
[1] 5
> max(c(1:5, NA)) # with a missing value
[1] NA
> max(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 5

• min() returns the minimum:
> min(1:5)
[1] 1
> min(c(1:5, NA)) # with a missing value
[1] NA
> min(c(1:5, NA), na.rm = TRUE) # missing values removed
[1] 1

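sd() computes the sample standard deviation, which divides by n - 1 rather than n. A minimal sketch that reproduces the 1.581139 shown above step by step:

```r
x <- 1:5
n <- length(x)

# Deviations from the mean: -2 -1 0 1 2
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1 (10 / 4)
sampleVariance <- sum(deviations ^ 2) / (n - 1)

# The standard deviation is the square root of the variance
manualSd <- sqrt(sampleVariance)

print(sampleVariance)
print(manualSd)
print(sd(x))
```
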
6. Loops and Flow Control

Loops

• for loop:
> for (month in month.name) {
+ print(month)
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "June"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"

• while loop:
> i <- 1
> while (i < 13) {
+   print(month.name[i])
+   i <- i + 1
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "June"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"

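Flow control

• if - else if - else:
> # sample() draws at random from a vector; size sets how many elements are drawn
> weather <- sample(c("Sunny", "Cloudy", "Rainy"), size = 1)
> # Unlike Java, else must follow the closing } on the same line; moving it to a new line causes an error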
> if (weather == "Sunny") {
+   print("Cycling")
+ } else if (weather == "Cloudy") {
+   print("Running")
+ } else {
+   print("Working Out in the Gym")
+ }
[1] "Cycling"

Combining loops and flow control

• break, like Python's break, exits the loop:
> for (month in month.name) {
+   if (month == "June") {
+     break
+   } else {
+     print(month)
+   }
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"

• next, like Python's continue, skips straight to the next iteration:
> for (month in month.name) {
+   if (month == "June") {
+     next
+   } else {
+     print(month)
+   }
+ }
[1] "January"
[1] "February"
[1] "March"
[1] "April"
[1] "May"
[1] "July"
[1] "August"
[1] "September"
[1] "October"
[1] "November"
[1] "December"

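When a loop needs the index rather than the element, seq_along() is a safer way to build it than 1:length(). A small sketch of the difference on an empty vector; the variable names are chosen for the illustration.

```r
emptyVec <- character(0)

# 1:length() counts down from 1 to 0 when the vector is empty
badIndex <- 1:length(emptyVec)

# seq_along() returns an empty index, so the loop body never runs
safeIndex <- seq_along(emptyVec)

visited <- 0
for (i in safeIndex) {
  visited <- visited + 1
}

# On a non-empty vector both forms give the same index
monthIndex <- seq_along(month.name)

print(badIndex)
print(length(safeIndex))
print(visited)
```
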
7. User-Defined Functions

User-defined functions

• A simple example:
> # define the function
> myFunc <- function(x, mode = TRUE) {
+   y <- x ^ 2
+   z <- x ^ 3
+   if (mode == TRUE) {
+     return(y)
+   } else {
+     return(z)
+   }
+ }
> # call the function
> myFunc(1:3, FALSE)
[1] 1 8 27

• A more involved example that cleans messy data:
> # the messy data
> messyData <- data.frame(c(1, 2, 3, 4, NA), c(1, 2, 3, NA, 5), c(1, 2, NA, 4, 5))
> names(messyData) <- c("a", "b", "c")
> messyData
   a  b  c
1  1  1  1
2  2  2  2
3  3  3 NA
4  4 NA  4
5 NA  5  5
> # define the function
> cleanData <- function(df, imputeValue) {
+   nRows <- nrow(df)
+   naSum <- rep(NA, times = nRows)
+   for (i in 1:nRows) {
+     naSum[i] <- sum(is.na(df[i, ]))
+     df[i, ][is.na(df[i, ])] <- imputeValue
+   }
+   dfList <- list(completeCases = df[as.logical(!naSum), ], imputedData = df)
+   return(dfList)
+ }
> # call the function
> cleanedData <- cleanData(messyData, imputeValue = 999)
> cleanedData$completeCases
  a b c
1 1 1 1
2 2 2 2
> cleanedData$imputedData
    a   b   c
1   1   1   1
2   2   2   2
3   3   3 999
4   4 999   4
5 999   5   5

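The same two results can be obtained without a loop. The sketch below is an alternative to the book's cleanData(), not a restatement of it; it relies on the base functions complete.cases() and is.na().

```r
messyData <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(1, 2, 3, NA, 5),
  c = c(1, 2, NA, 4, 5)
)

# complete.cases() flags the rows that contain no missing value
completeRows <- messyData[complete.cases(messyData), ]

# is.na() on a data frame returns a logical matrix that can be used
# to overwrite every missing cell in one step
imputedData <- messyData
imputedData[is.na(imputedData)] <- 999

print(completeRows)
print(imputedData)
```
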
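8. Data Input and Output

Built-in data

• List the built-in data sets that are available:
data()

Reading .txt and .csv files

• In R, the backslashes (\) in a Windows path must be replaced with forward slashes (/) or doubled as \\.

• Read tabular data from a .txt file on disk:
favoriteBands <- read.table("D:/favorite_bands.txt", header = TRUE, stringsAsFactors = FALSE)
View(favoriteBands)

• Read tabular data from a .csv file on disk:
# by default, sep treats one or more whitespace characters as the separator
favoriteBands <- read.table("D:/favorite_bands.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
View(favoriteBands)

• Read tabular data from a .csv file on the web:
url = "https://storage.googleapis.com/learn-r-the-easy-way.appspot.com/data_ch11/favorite_bands.csv"
favoriteBands <- read.table(url, header = TRUE, stringsAsFactors = FALSE, sep = ",")
View(favoriteBands)

• Read plain text from a .txt file on disk:
# n limits the number of lines read
lyricsScript <- readLines("D:/lyrics.txt", n = 5)
lyricsScript

Reading Excel and JSON data

• Method 1: install and load a package with code:
install.packages("packageName")
library(packageName)

• Method 2: install and load a package through the user interface:
Packages → Install → type the package name in the Packages field → Install → tick the checkbox next to the package to load it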

• Read Excel data:
# install and load the readxl package
install.packages("readxl")
library(readxl)
favoriteBands <- read_excel("D:/favorite_bands.xlsx")
View(favoriteBands)

• Read JSON data:
# install and load the jsonlite package
install.packages("jsonlite")
library(jsonlite)
favoriteBands <- fromJSON("D:/favorite_bands.json")
View(favoriteBands)

Writing .txt and .csv files

• Write tabular data to a .txt file:
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
# row.names controls whether the row index of each observation is written
write.table(favoriteBandsDf, file = "D:/favorite_bands.txt", row.names = FALSE)

• Write tabular data to a .csv file:
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
write.table(favoriteBandsDf, file = "D:/favorite_bands.csv", row.names = FALSE, sep = ",")

• Write the built-in cars data to a .csv file:
write.csv(cars, file = "D:/cars.csv", row.names = FALSE)

• Write plain text to a .txt file:
lyricsScript <- c("Side Effects", "It's 4AM, I don't know where to go", "Everywhere is closed, I should just go home, yeah", "My feet are takin' me to your front door", "I know I shouldn't though, heaven only knows")
writeLines(lyricsScript, con = "D:/lyrics.txt")

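To try writing and reading without depending on a D: drive, a temporary file can stand in for the path. A minimal round-trip sketch; the bandsDf name and the use of tempfile() are choices made for this illustration.

```r
bandsDf <- data.frame(
  band = c("Beyond", "Beatles"),
  formed = c(1983, 1960),
  stringsAsFactors = FALSE
)

# tempfile() returns a path inside the session's temporary directory
csvPath <- tempfile(fileext = ".csv")

# Write the data frame, then read it back from the same path
write.csv(bandsDf, file = csvPath, row.names = FALSE)
bandsBack <- read.csv(csvPath, stringsAsFactors = FALSE)

# Remove the temporary file when done
unlink(csvPath)

print(bandsBack)
```
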
Writing JSON data

• Write JSON data:
library(jsonlite)
favoriteBandsDf <- data.frame(band = c("Beyond", "Beatles"), lead_vocal = c("Wong Ka Kui", "John Lennon"), formed = c(1983, 1960))
writeLines(toJSON(favoriteBandsDf), con = "D:/favorite_bands.json")

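9. Exploratory Data Analysis (EDA)

Built-in functions (using the built-in iris data)

• Number of observations:
nrow(iris)

• Number of variables:
ncol(iris)

• Number of observations and number of variables:
dim(iris)

• Variable names and the first six observations:
head(iris)

• Variable names and the last six observations:
tail(iris)

• Variable names:
names(iris)

• Descriptive statistics and related information:
summary(iris)

• Data structure and related information:
str(iris)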

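Base Plotting System: R's built-in graphics

• Histogram:
# rnorm() draws the requested number of random values from the standard normal distribution
hist(rnorm(1000))

• Box plot:
boxplot(Sepal.Length ~ Species, data = iris)

• Line chart:
x <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-31"), by = 1)
# random seed
set.seed(123)
# replace = TRUE allows the same number to be drawn more than once
y <- sample(1:100, size = 31, replace = TRUE)
plot(x, y, type = "l")

# Using the built-in AirPassengers data: an object of class ts (time series) can be passed straight to plot(), with no need for type = "l"
class(AirPassengers)
plot(AirPassengers)

• Scatter plot:
# a single scatter plot, using the built-in cars data
plot(cars)
# a single scatter plot with the X and Y variables given explicitly
plot(cars$dist, cars$speed)
# a scatter plot matrix, using the built-in iris data
plot(iris)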

• Bar chart:
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
# table() tallies the data
barplot(table(iceCreamFlavor))

• Curve:
curve(sin, from = -pi, to = pi)

Base Plotting System: common customizations

• Custom title, X-axis label, Y-axis label, and grid lines:
# using a scatter plot of the built-in cars data
# main sets the title
# xlab sets the X-axis label
# ylab sets the Y-axis label
plot(cars, main = "Car Speed VS. Braking Distance", xlab = "Car Speed (mph)", ylab = "Braking Distance (ft)")
# grid() adds grid lines
grid()

• Changing orientation and text size:
# using a bar chart
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
    iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
# horiz = TRUE draws the bars horizontally
# las = 1 sets the orientation of the axis labels
# cex.names = 0.8 and cex.axis = 1.2 scale the bar labels and the axis tick labels
barplot(table(iceCreamFlavor), horiz = TRUE, las = 1, cex.names = 0.8, cex.axis = 1.2)

• Multiple plots on one page:
# using box plots of the built-in iris data
par(mfrow = c(2, 2))
boxplot(iris$Sepal.Length ~ iris$Species, main = "Sepal Length by Species")
boxplot(iris$Sepal.Width ~ iris$Species, main = "Sepal Width by Species")
boxplot(iris$Petal.Length ~ iris$Species, main = "Petal Length by Species")
boxplot(iris$Petal.Width ~ iris$Species, main = "Petal Width by Species")

• Density curve over a histogram:
normDist <- rnorm(1000)
hist(normDist, freq = FALSE)
lines(density(normDist))

• Shape and colour of scatter plot points:
# one shape and colour for all points
plot(cars, pch = 2, col = "red")
# shape and colour by category
plot(iris$Sepal.Length, iris$Sepal.Width, pch = as.numeric(iris$Species), col = iris$Species)

The ggplot2 package

• Install and load ggplot2:
install.packages("ggplot2")
library(ggplot2)

• Histogram:
library(ggplot2)
normNums <- rnorm(1000)
histDf <- data.frame(norm_nums = normNums)
ggplot(histDf, aes(x = norm_nums)) + geom_histogram()
# more bins (narrower bin width)
ggplot(histDf, aes(x = norm_nums)) + geom_histogram(binwidth = 0.1)
# fewer bins (wider bin width)
ggplot(histDf, aes(x = norm_nums)) + geom_histogram(binwidth = 0.5)

• Box plot:
library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

• Line chart:
library(ggplot2)
x <- seq(from = as.Date("2017-01-01"), to = as.Date("2017-01-31"), by = 1)
set.seed(123)
y <- sample(1:100, size = 31, replace = TRUE)
lineDf <- data.frame(x = x, y = y)
ggplot(lineDf, aes(x = x, y = y)) + geom_line()
# change the date label format; the default is %b %d
ggplot(lineDf, aes(x = x, y = y)) + geom_line() + scale_x_date(date_labels = "%m.%d")

• Scatter plot:
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point()

• Bar chart:
library(ggplot2)
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
# when the data passed in have not been tallied yet
iceCreamDf <- data.frame(ice_cream_flavor = iceCreamFlavor)
ggplot(iceCreamDf, aes(x = ice_cream_flavor)) + geom_bar()
# when the data passed in have already been tallied
flavor <- names(table(iceCreamFlavor))
votes <- as.vector(unname(table(iceCreamFlavor)))
iceCreamDf <- data.frame(flavor = flavor, votes = votes)
ggplot(iceCreamDf, aes(x = flavor, y = votes)) + geom_bar(stat = "identity")

• Curve:
library(ggplot2)
sinDf <- data.frame(x = c(-pi, pi))
ggplot(sinDf, aes(x = x)) + stat_function(fun = sin, geom = "line")

ggplot2: common customizations

• Custom title, X-axis label, Y-axis label, and hidden grid lines:
# using a scatter plot of the built-in cars data
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + geom_point() +
    ggtitle("Car Speed VS. Braking Distance") +
    xlab("Car Speed (mph)") +
    ylab("Braking Distance (ft)") +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
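# major grid lines: panel.grid.major
# minor grid lines: panel.grid.minor
# major grid lines on the X axis: panel.grid.major.x
# major grid lines on the Y axis: panel.grid.major.y
# minor grid lines on the X axis: panel.grid.minor.x
# minor grid lines on the Y axis: panel.grid.minor.y

• Changing orientation:
# using a bar chart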
library(ggplot2)
iceCreamFlavor <- rep(NA, times = 100)
for (i in 1:100) {
    iceCreamFlavor[i] <- sample(c("vanilla", "chocolate", "matcha", "other"), size = 1)
}
iceCreamDf <- data.frame(ice_cream_flavor = iceCreamFlavor)
ggplot(iceCreamDf, aes(x = ice_cream_flavor)) + geom_bar() +
    coord_flip()

• Multiple plots on one page:
# using box plots of the built-in iris data
library(ggplot2)
# install and load the gridExtra package
install.packages("gridExtra")
library(gridExtra)
g1 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
g2 <- ggplot(iris, aes(x = Species, y = Sepal.Width)) + geom_boxplot()
g3 <- ggplot(iris, aes(x = Species, y = Petal.Length)) + geom_boxplot()
g4 <- ggplot(iris, aes(x = Species, y = Petal.Width)) + geom_boxplot()
grid.arrange(g1, g2, g3, g4, nrow = 2, ncol = 2)

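• Density curve over a histogram:
library(ggplot2)
normNums <- rnorm(1000)
histDf <- data.frame(norm_nums = normNums)
ggplot(histDf, aes(x = norm_nums)) +
    # alpha sets the transparency; after_stat(density), available in ggplot2 3.3.0 and later, replaces the older ..density.. notation
    geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +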
    geom_density()

• Shape and colour of scatter plot points:
library(ggplot2)
# one shape and colour for all points
ggplot(cars, aes(x = speed, y = dist)) +
    geom_point(shape = 2, colour = "red")
# shape and colour by category
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(shape = Species, colour = Species))

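Exporting plots

• Through the user interface:
Plots → Export

10. Data Wrangling Techniques

Querying data frames

• Selecting with logical values is the approach used most often in practice:
iris[iris$Petal.Length >= 6, c("Sepal.Length", "Petal.Length", "Species")]

Observations and variables of a data frame

• The data used in the data frames; later examples omit these lines: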
teamName <- c("Boston Red Sox", "Texas Rangers")
wins <- c(97, 96)
losses <- c(65, 66)
isChampion <- c(TRUE, FALSE)
season <- c(2013, 2011)
# a list keeps each value's type; c() would coerce every value to character
coloradoRockies2007 <- list("Colorado Rockies", 90, 73, FALSE, 2007)
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook", "Jinbe")
age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 46)

• Adding and removing observations:
# stringsAsFactors = FALSE avoids factor-level errors when a new team name is added
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season, stringsAsFactors = FALSE)
# add an observation
greatMlbTeams <- rbind(greatMlbTeams, coloradoRockies2007)
greatMlbTeams
# remove an observation
greatMlbTeams <- greatMlbTeams[-3, ]
greatMlbTeams

• Adding and removing variables:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion)
# add a variable
greatMlbTeams$season <- season
greatMlbTeams
# remove a variable
greatMlbTeams$season <- NULL
greatMlbTeams

• Renaming a variable:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
names(greatMlbTeams)[4] <- "isWorldSeriesChampion"
greatMlbTeams

• Reordering variables:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
greatMlbTeams <- greatMlbTeams[, c("teamName", "season", "isChampion", "wins", "losses")]
greatMlbTeams

• Recoding a categorical variable:
greatMlbTeams <- data.frame(teamName, wins, losses, isChampion, season)
greatMlbTeams$isChampion[greatMlbTeams$isChampion == TRUE] <- "Y"
greatMlbTeams$isChampion[greatMlbTeams$isChampion == FALSE] <- "N"
greatMlbTeams

• Recoding a numeric variable as a categorical variable:
strawHatDf <- data.frame(name, age)
strawHatDf$ageCategory <- cut(strawHatDf$age, breaks = c(0, 20, 30, 40, Inf), labels = c("0 < Age <= 20", "20 < Age <= 30", "30 < Age <= 40", "Age > 40"))
strawHatDf

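By default cut() builds intervals that are open on the left and closed on the right, which is why the labels above read "0 < Age <= 20". A small sketch with a handful of made-up ages that sit on and next to the break points:

```r
# Hypothetical ages chosen to land on and around the breaks
ages <- c(19, 20, 21, 30, 90)

ageCategory <- cut(
  ages,
  breaks = c(0, 20, 30, 40, Inf),
  labels = c("0 < Age <= 20", "20 < Age <= 30", "30 < Age <= 40", "Age > 40")
)

# table() counts how many ages fall in each category,
# including categories that received no value
counts <- table(ageCategory)

print(as.character(ageCategory))
print(counts)
```

An age of exactly 20 lands in the first category and an age of exactly 30 in the second, because each interval includes its upper bound.
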
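Combining data frames

• Stacking data frames vertically:
carsUpper <- cars[1:25, ]
carsBottom <- cars[26:50, ]
carsCombined <- rbind(carsUpper, carsBottom)

• Joining data frames side by side:
# drop = FALSE keeps each single column as a data frame instead of a vector
carsLeft <- cars[, 1, drop = FALSE]
carsRight <- cars[, 2, drop = FALSE]
carsCombined <- cbind(carsLeft, carsRight)

• Merging data frames that share a variable name:
# the left data frame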
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Tony Tony Chopper")
age <- c(19, 21, 20, 17)
leftDf <- data.frame(name, age)
# the right data frame
name <- c("Monkey·D·Luffy", "Tony Tony Chopper", "Nico Robin", "Brook")
devilFruit <- c("Gum-Gum Fruit", "Human-Human Fruit", "Flower-Flower Fruit", "Revive-Revive Fruit")
rightDf <- data.frame(name, devilFruit)

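# keep only the rows found in both data frames (the default)
mergedDf <- merge(leftDf, rightDf)
mergedDf
# keep every row of the left data frame
mergedDfX <- merge(leftDf, rightDf, all.x = TRUE)
mergedDfX
# keep every row of the right data frame
mergedDfY <- merge(leftDf, rightDf, all.y = TRUE)
mergedDfY
# keep every row of both data frames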
mergedDfXY <- merge(leftDf, rightDf, all.x = TRUE, all.y = TRUE)
mergedDfXY

• Merging data frames whose key variables have different names:
# the left data frame
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Tony Tony Chopper")
age <- c(19, 21, 20, 17)
leftDf <- data.frame(name, age)
# the right data frame
character <- c("Monkey·D·Luffy", "Tony Tony Chopper", "Nico Robin", "Brook")
devilFruit <- c("Gum-Gum Fruit", "Human-Human Fruit", "Flower-Flower Fruit", "Revive-Revive Fruit")
rightDf <- data.frame(character, devilFruit)

# keep only the rows found in both data frames (the default)
mergedDf <- merge(leftDf, rightDf, by.x = "name", by.y = "character")
mergedDf

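Counting the rows each kind of merge returns makes the four options easier to compare. A minimal sketch that rebuilds the two data frames from this section; the row-count variable names are chosen for the illustration.

```r
leftDf <- data.frame(
  name = c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Tony Tony Chopper"),
  age = c(19, 21, 20, 17),
  stringsAsFactors = FALSE
)
rightDf <- data.frame(
  name = c("Monkey·D·Luffy", "Tony Tony Chopper", "Nico Robin", "Brook"),
  devilFruit = c("Gum-Gum Fruit", "Human-Human Fruit", "Flower-Flower Fruit", "Revive-Revive Fruit"),
  stringsAsFactors = FALSE
)

# Two names appear in both data frames
innerRows <- nrow(merge(leftDf, rightDf))
# Every left row is kept; unmatched rows get NA for devilFruit
leftRows <- nrow(merge(leftDf, rightDf, all.x = TRUE))
# Every right row is kept; unmatched rows get NA for age
rightRows <- nrow(merge(leftDf, rightDf, all.y = TRUE))
# all = TRUE is shorthand for all.x = TRUE together with all.y = TRUE
fullRows <- nrow(merge(leftDf, rightDf, all = TRUE))

print(c(inner = innerRows, left = leftRows, right = rightRows, full = fullRows))
```
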
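The tidyverse package

• Install and load tidyverse:
install.packages("tidyverse")
library(tidyverse)

• tidyverse bundles several packages; one of them, magrittr, provides the %>% operator:
# calling a function the traditional way
summary(cars)
# using the %>% operator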
library(tidyverse)
cars %>% summary()

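• tidyverse bundles several packages; one of them, tidyr, converts between long and wide tables.

• tidyverse bundles several packages; one of them, dplyr, provides many functions modelled on Structured Query Language (SQL).

The %>% operator from magrittr

• %>% is useful when several function calls are chained:
# Method 1: create several objects and call functions the traditional way
sysDate <- Sys.Date()
sysDateYr <- format(sysDate, format = "%Y")
sysDateNum <- as.numeric(sysDateYr)
sysDateNum
# Method 2: create as few objects as possible and nest the traditional calls
sysDateNum <- as.numeric(format(Sys.Date(), format = "%Y"))
sysDateNum
# Method 3: create as few objects as possible and use the %>% operator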
library(tidyverse)
sysDateNum <- Sys.Date() %>%
    format(format = "%Y") %>%
    as.numeric()
sysDateNum

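• Adding an arithmetic operator to the chain:
library(tidyverse)
beyondStart <- 1983
beyondYr <- Sys.Date() %>%
    format(format = "%Y") %>%
    as.numeric() %>%
    # the ` symbol is called a backtick; `-` calls the subtraction operator as a function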
    `-` (beyondStart)
beyondYr

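• By default %>% passes its input as the first argument of the function; when needed, use . to say where the input should go:
# calling lm() the traditional way
carsLm <- lm(formula = dist ~ speed, data = cars)
# using %>% with . to mark where cars is passed in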
library(tidyverse)
carsLm <- cars %>%
lm(formula = dist ~ speed, data = .)

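Converting between long and wide tables with tidyr

• The data used in the data frame; later examples omit these lines:
teamName <- c("Boston Red Sox", "Texas Rangers")
wins <- c(97, 96)
losses <- c(65, 66)

• Converting between long and wide tables:
library(tidyverse)
greatMlbTeams <- data.frame(teamName, wins, losses)
# wide to long: key names the new categorical variable, value names the new numeric variable
longFormat <- gather(greatMlbTeams, key = variable_names, value = values, wins, losses)
longFormat
# long to wide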
wideFormat <- spread(longFormat, key = variable_names, value = values)
wideFormat

Functions in the dplyr package

• The data used in the data frame; later examples omit these lines:
name <- c("Monkey·D·Luffy", "Roronoa Zoro", "Nami", "Usopp", "Vinsmoke Sanji", "Tony Tony Chopper", "Nico Robin", "Franky", "Brook", "Jinbe")
gender <- c("male", "male", "female", "male", "male", "male", "female", "male", "male", "male")
age <- c(19, 21, 20, 19, 21, 17, 30, 36, 90, 46)

• filter() keeps the observations that meet a condition:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
filter(strawHatDf, gender == "female")
# the same filter in base R syntax
strawHatDf[strawHatDf$gender == "female", ]

• select() keeps the chosen variables:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
select(strawHatDf, crew_name = name, gender)
# the same selection in base R syntax, keeping the result a data frame
names(strawHatDf)[1] <- "crew_name"
strawHatDf[, c("crew_name", "gender"), drop = FALSE]

• mutate() adds a variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
mutate(strawHatDf, age_two_years_ago = age - 2)

• arrange() sorts the observations by a variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
arrange(strawHatDf, age)
# sort from largest to smallest
arrange(strawHatDf, desc(age))

• summarise() aggregates a variable, for example as a sum, a mean, or a standard deviation:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
summarise(strawHatDf, mean(age))

• group_by() groups the observations by a categorical variable:
library(tidyverse)
strawHatDf <- data.frame(name, gender, age, stringsAsFactors = FALSE)
group_by(strawHatDf, gender) %>%
    summarise(mean(age)) %>%
    # convert the tibble back to a base data frame
    as.data.frame()

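Data frame processing speed

• Vectorized computation > the apply() family > loops:
# the data frame; runif() draws the requested number of random values from a uniform distribution
heights <- ceiling(runif(500000) * 50) + 140
weights <- ceiling(runif(500000) * 50) + 40
hwDf <- data.frame(heights, weights)

# a loop is the slowest option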
bmi <- rep(NA, times = nrow(hwDf))
system.time(
    for (i in 1:nrow(hwDf)) {
        bmi[i] <- hwDf[i, "weights"] / (hwDf[i, "heights"] / 100)^2
    }
)

# the apply() family falls in between
bmiFunction <- function(x, y) {
    x / (y / 100)^2
}
system.time(
    bmi <- mapply(hwDf$weights, hwDf$heights, FUN = bmiFunction)
)

# vectorized computation is the fastest option
system.time(
    bmi <- hwDf$weights / (hwDf$heights / 100)^2
)

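• The apply() family:
# a custom function that returns the number of distinct values
distinctCounts <- function(x) {
    return(length(unique(x)))
}

# apply()
# MARGIN = 2 applies the function along the variables (columns), as in this call
apply(iris, MARGIN = 2, distinctCounts)
# MARGIN = 1 applies the function along the observations (rows)
apply(iris, MARGIN = 1, distinctCounts)

# lapply() returns its output as a list
lapply(iris, FUN = distinctCounts)

# sapply() returns its output as a vector, matching apply() with MARGIN = 2
sapply(iris, FUN = distinctCounts)

# tapply() combines apply-style computation with the grouping that table() performs
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = distinctCounts)

# mapply() is the multivariate version of sapply()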
mapply(iris, FUN = distinctCounts)

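One more member of the family that the notes above do not cover is vapply(), which works like sapply() but requires the type and length of each result to be declared. A minimal sketch reusing distinctCounts():

```r
distinctCounts <- function(x) {
  length(unique(x))
}

# FUN.VALUE = integer(1) declares that every call must return
# exactly one integer; anything else raises an error
countsVec <- vapply(iris, FUN = distinctCounts, FUN.VALUE = integer(1))

print(countsVec)
```

Declaring the result type means a surprising return value fails immediately instead of silently changing the shape of the output.
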
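11. Writing a Data Analysis Report

Creating an R Markdown file (.Rmd)

• An R Markdown file is plain text; the knitr package turns it into a data analysis report.

• Creating an R Markdown file:
Choose File → New File → R Markdown... The first time you create an R Markdown file, RStudio prompts you to install the required packages; then fill in the Document, Title, Author, and Default Output Format fields.

• Saving an R Markdown file:
Click Knit, choose an encoding (usually UTF-8), then choose where to save the file.

Basic elements of an analysis document

• Section headings, from level one to level six:
# Level-one heading
## Level-two heading
### Level-three heading
#### Level-four heading
##### Level-five heading
###### Level-six heading

• Body text:
Type body text directly; this is **bold** and this is *italic*.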

• Inline code:
Use the `q()` function to quit RStudio.

• Code chunks:
```
myObj <- "輕鬆學習R語言"
```
Adding {r} next to the opening fence makes the chunk run when the document is rendered:
```{r}
plot(cars)
```
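Chunk options can follow the r inside the braces:
echo = TRUE          show the code in the document (default);
message = TRUE       show messages in the document (default);
warning = TRUE       show warnings in the document (default);
results = 'markup'   show results in the document (default); alternatives are 'asis', 'hold', and 'hide';
error = FALSE        stop rendering when the code raises an error (default):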
```{r echo = FALSE}
plot(cars)
```

• Lists:
- Parent item 1
    - Child item 1.1
    - Child item 1.2
* Parent item 2
    * Child item 2.1
    * Child item 2.2

• Tables:
|Data format|Function|Package|
|--------|----|----|
|Structured text|`read.table()`|`utils`|
|Unstructured text|`readLines()`|`base`|
|Excel spreadsheet|`read_excel()`|`readxl`|
|JSON|`fromJSON()`|`jsonlite`|

• Images:
![R_logo](https://storage.googleapis.com/learn-r-the-easy-way.appspot.com/screenshots_ch16/Rlogo.png)

• Links:
1. Install R: [CRAN](https://cran.r-project.org/)
2. Install RStudio: [RStudio](https://www.rstudio.com/products/rstudio/download/)

• Block quotes:
> R, at its heart, is a functional programming (FP) language.
By Hadley Wickham

12. A Collection of Handy R Techniques

Summing data

• Summing a matrix:
iceCream <- matrix(round(runif(15) * 100), nrow = 5)
colnames(iceCream) <- c("Vanilla", "Chocolate", "Strawberry")
rownames(iceCream) <- c("Mon", "Tue", "Wed", "Thu", "Fri")
# rowSums() totals each row
iceCream <- cbind(iceCream, Total = rowSums(iceCream))
# colSums() totals each column
iceCream <- rbind(iceCream, Total = colSums(iceCream))
iceCream

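Returning indices

• match() returns the index of the first element equal to a given value:
myVector <- c(11:20, 17)
match(17, myVector)

• which() returns the indices of all elements for which a condition is TRUE:
myVector <- c(11:20, 17)
which(myVector == 17)

• which.min() returns the index of the first minimum:
myVector <- c(11:20, 11, 20)
which.min(myVector)
# the indices of all minima
which(myVector == min(myVector))

• which.max() returns the index of the first maximum:
myVector <- c(11:20, 11, 20)
which.max(myVector)
# the indices of all maxima
which(myVector == max(myVector))

Sorting data

• Sorting a vector:
myVector <- round(runif(10) * 100)
# the unsorted vector
myVector
# sorted in increasing order
sort(myVector)
# sorted in decreasing order
sort(myVector, decreasing = TRUE)

• Sorting a data frame:
# order() returns the indices of the observations in sorted order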
reorderedCars <- cars[order(cars$dist), ]
reorderedCars

Reading web pages

• Reading HTML:
url <- "https://www.imdb.com/title/tt7040874/"
aSimpleFavor <- readLines(url)
class(aSimpleFavor)
mode(aSimpleFavor)

• Install and load the rvest package:
install.packages("rvest")
library(rvest)

• Reading HTML with rvest:
library(rvest)
url <- "https://www.imdb.com/title/tt7040874/"
aSimpleFavor <- read_html(url)
class(aSimpleFavor)
mode(aSimpleFavor)

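# For the CSS selectors that html_nodes() expects, see https://www.w3.org/TR/2011/REC-css3-selectors-20110929/#selectors
# For regular expressions in R, see https://blog.yjtseng.info/post/regexpr/

# extract the movie title (the selectors below assume the IMDb page layout at the time these notes were written)
title <- aSimpleFavor %>%
    html_nodes(css = "h1") %>%
    html_text()
# clean and print the title; regexpr() returns the position of the first match, and fixed = TRUE treats ")" as a literal character
title <- regexpr(pattern = ")", title, fixed = TRUE) %>%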
    substr(title, start = 1, stop = .)
title

# extract the running time
time <- aSimpleFavor %>%
    html_nodes(css = "#title-overview-widget time") %>%
    html_text()
# clean and print the running time; gsub() finds and replaces text using a regular expression
time <- gsub(pattern = "\n\\s+", time, replacement = "")
time

# extract and print the rating
rating <- aSimpleFavor %>%
    html_nodes(css = "strong span") %>%
    html_text() %>%
    as.numeric()
rating
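
The cleaning steps above can be tried without a network connection. The sketch below applies regexpr(), substr(), and gsub() to hand-written strings; rawTitle and rawTime are hypothetical stand-ins for scraped text, not actual output from the page.

```r
# Hypothetical raw strings standing in for scraped text
rawTitle <- "A Simple Favor (2018) "
rawTime <- "\n                        1h 57min\n                    "
# Position of the first ")" (fixed = TRUE: literal match, not a regular expression)
closeParen <- regexpr(pattern = ")", text = rawTitle, fixed = TRUE)
# Keep everything up to and including the closing parenthesis
cleanTitle <- substr(rawTitle, start = 1, stop = closeParen)
# Remove each newline together with the whitespace that follows it
cleanTime <- gsub(pattern = "\n\\s+", replacement = "", x = rawTime)
cleanTitle
cleanTime
```

Testing the string handling on fixed inputs like this makes it easier to tell whether a later failure comes from the cleaning code or from a change in the page itself.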

Linear Regression Model

• Predicting and plotting:
# Sales data
temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
icedTeaSales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)

# Print the intercept and the slope coefficient
lmFit <- lm(icedTeaSales ~ temperatures)
lmFit$coefficients
# Predict sales for a new temperature
toBePredicted <- data.frame(temperatures = 30)
predictedSales <- predict(lmFit, newdata = toBePredicted)
predictedSales
# Plot the sales data points (col sets the color of solid symbols such as pch = 16; bg only affects pch 21 to 25)
plot(icedTeaSales ~ temperatures, col = "blue", pch = 16)
# Plot the predicted point
points(x = toBePredicted$temperatures, y = predictedSales, col = "red", cex = 2, pch = 17)
# Draw the regression line for the sales data
abline(reg = lmFit, col = "blue", lwd = 4)
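
To see where the coefficients come from, the intercept and slope of a one-predictor regression can be computed by hand from the closed-form least-squares formulas and compared with lm(). This is a sketch for checking understanding; the names slope, intercept, manualCoef, and lmCoef are my own.

```r
temperatures <- c(29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30)
icedTeaSales <- c(77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84)
# Closed-form least-squares estimates for a single predictor
slope <- cov(temperatures, icedTeaSales) / var(temperatures)
intercept <- mean(icedTeaSales) - slope * mean(temperatures)
manualCoef <- c(intercept, slope)
# Coefficients from lm(), with the names dropped for comparison
lmCoef <- unname(coef(lm(icedTeaSales ~ temperatures)))
all.equal(manualCoef, lmCoef)
```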

Decision Tree Classifier

• Installing and loading the rpart package:
install.packages("rpart")
library(rpart)

• Using the built-in iris data as an example, split the data into training and test sets:
# Custom function: shuffle the rows, then split them by the given proportion
trainTestSplit <- function(x, trainPercentage) {
    n <- nrow(x)
    dataShuffled <- x[sample(n), ]
    trainTestCut <- round(trainPercentage * n)
    trainData <- dataShuffled[1:trainTestCut, ]
    testData <- dataShuffled[(trainTestCut + 1):n, ]
    outputs <- list(Train = trainData, Test = testData)
    return(outputs)
}
# Split the data
irisTrainTest <- trainTestSplit(iris, trainPercentage = 0.7)
irisTrain <- irisTrainTest$Train
irisTest <- irisTrainTest$Test

# Build the decision tree classifier; Species ~ . means Species is explained by all other variables
irisClf <- rpart(Species ~ ., data = irisTrain, method = "class")
# Predict on the test data
predicted <- predict(irisClf, irisTest, type = "class")
# Compare irisTest$Species with predicted to obtain the classifier's accuracy
confMat <- table(irisTest$Species, predicted)
accuracy <- sum(diag(confMat)) / sum(confMat)
accuracy
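
The accuracy calculation above depends on the confusion matrix being square, with correct predictions on the diagonal. The sketch below works through it on a handful of made-up labels (they are not the output of any fitted model) and adds a per-class hit rate as an optional extra.

```r
# Hypothetical labels, used only to illustrate the arithmetic
classes <- c("setosa", "versicolor", "virginica")
actual <- factor(c("setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"),
                 levels = classes)
predicted <- factor(c("setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"),
                    levels = classes)
# Rows are actual classes, columns are predicted classes
confMat <- table(actual, predicted)
confMat
# Correct predictions sit on the diagonal
accuracy <- sum(diag(confMat)) / sum(confMat)
# Share of each actual class that was predicted correctly
perClassRecall <- diag(confMat) / rowSums(confMat)
accuracy
perClassRecall
```

Giving both factors the same levels keeps the table square even when a class is never predicted, which diag() relies on.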

K-Means Clustering

• Using the built-in iris data as an example:
# Numeric columns only
irisKmeans <- iris[, -5]
# Print the clustering result (3 clusters; nstart = 20 tries 20 random starts and keeps the best one)
kmeansFit <- kmeans(irisKmeans, nstart = 20, centers = 3)
kmeansFit
# Print the ratio of within-cluster variation to total variation (Total WSS / Total SS)
ratio <- kmeansFit$tot.withinss / kmeansFit$totss
ratio
# Draw a scree plot of the same ratio for k = 2 to 10
ratio <- rep(NA, times = 10)
for (k in 2:length(ratio)) {
    kmeansFit <- kmeans(irisKmeans, nstart = 20, centers = k)
    ratio[k] <- kmeansFit$tot.withinss / kmeansFit$totss
}
plot(ratio, type = "b", xlab = "k")
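
The three sums of squares stored in a kmeans() result are related: the total splits into a within-cluster part and a between-cluster part. A short sketch to confirm this, where the seed value is arbitrary and only there for reproducibility:

```r
set.seed(42)  # arbitrary seed so the random starts are reproducible
fit <- kmeans(iris[, -5], centers = 3, nstart = 20)
# Total SS = total within-cluster SS + between-cluster SS
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
# The two shares therefore add up to 1
wssShare <- fit$tot.withinss / fit$totss
bssShare <- fit$betweenss / fit$totss
wssShare + bssShare
```

A smaller within-cluster share means tighter clusters, which is why the ratio falls as k grows in the scree plot.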

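13. Statistical Probability Distribution Functions

Probability Distribution Functions

• Meaning of the function name prefixes:
d stands for density and returns the probability density (the probability mass for discrete distributions).
p stands for probability and returns the cumulative probability.
q stands for quantile and returns the quantile.
r stands for random and returns random values.

• Meaning of the function name suffixes:
unif refers to the uniform distribution.
norm refers to the normal distribution.
binom refers to the binomial distribution.
pois refers to the Poisson distribution.
chisq refers to the chi-squared distribution.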
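
The four prefixes can be seen side by side on the normal distribution. In this sketch the p and q functions undo each other; the seed value is arbitrary.

```r
p <- pnorm(1.96)      # cumulative probability P(X <= 1.96)
x <- qnorm(p)         # quantile: the inverse of the cumulative probability
all.equal(x, 1.96)
density0 <- dnorm(0)  # density at 0, equal to 1 / sqrt(2 * pi)
density0
set.seed(1)           # arbitrary seed for reproducible draws
draws <- rnorm(5)     # five random values
draws
```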

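Uniform Distribution

• By default the uniform distribution has a minimum of 0 and a maximum of 1; these can be changed with the min and max arguments.

• The dunif() function:
x <- seq(from = -2, to = 3, by = 0.01)
y <- dunif(x, min = -1, max = 2)
plot(x, y, type = "l", ylab = "Probability Density")

• The punif() function:
punif(0.5)

• The qunif() function:
qunif(0.5)

• The runif() function: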
x <- runif(1000)
hist(x, ylab = "Frequency")
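
With non-default limits, the uniform functions follow simple formulas. The sketch below reuses min = -1 and max = 2 from the dunif() example; the variable names are my own.

```r
dens <- dunif(0, min = -1, max = 2)       # 1 / (max - min)
cumProb <- punif(0.5, min = -1, max = 2)  # (0.5 - min) / (max - min)
quant <- qunif(0.5, min = -1, max = 2)    # min + 0.5 * (max - min)
dens
cumProb
quant
```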

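Normal Distribution

• The default is the standard normal distribution, with a mean of 0 and a standard deviation of 1; these can be changed with the mean and sd arguments.

• The dnorm() function:
x <- seq(from = -3, to = 3, by = 0.01)
y <- dnorm(x)
plot(x, y, type = "l", ylab = "Probability Density")

• The pnorm() function:
pnorm(1.96)

• The qnorm() function:
qnorm(0.975)

• The rnorm() function: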
x <- rnorm(1000)
hist(x, ylab = "Frequency")
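
Two quick checks help build intuition for pnorm(): the share of values within one or two standard deviations of the mean, and the effect of the mean and sd arguments. The values 130, 100, and 15 below are arbitrary numbers chosen for illustration.

```r
# Probability of falling within 1 and 2 standard deviations of the mean
within1 <- pnorm(1) - pnorm(-1)
within2 <- pnorm(2) - pnorm(-2)
within1
within2
# Standardizing with z = (x - mean) / sd gives the same cumulative probability
pScaled <- pnorm(130, mean = 100, sd = 15)
pStandard <- pnorm((130 - 100) / 15)
all.equal(pScaled, pStandard)
```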

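Binomial Distribution

• For example, when tossing a fair coin, the size argument is the number of tosses and the prob argument is the probability of success (here, heads) on each toss.

• The dbinom() function:
x <- 0:100
y <- dbinom(x, size = 100, prob = 0.5)
plot(x, y, type = "l", ylab = "Probability")

• The pbinom() function:
pbinom(50, size = 100, prob = 0.5)

• The qbinom() function:
qbinom(0.53, size = 100, prob = 0.5)

• The rbinom() function: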
x <- rbinom(1000, size = 100, prob = 0.5)
hist(x, ylab = "Frequency")
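
The values returned by dbinom() can be checked against the binomial formula, and pbinom() against a running total of dbinom(). A sketch using the same size and prob as above:

```r
size <- 100
prob <- 0.5
# Probabilities of every possible outcome, 0 to 100 heads
pmf <- dbinom(0:size, size = size, prob = prob)
sum(pmf)
# P(X = 50) from the formula choose(n, k) * p^k * (1 - p)^(n - k)
byFormula <- choose(size, 50) * prob^50 * (1 - prob)^(size - 50)
all.equal(dbinom(50, size = size, prob = prob), byFormula)
# pbinom(50, ...) is the sum of dbinom() over 0 to 50
all.equal(pbinom(50, size = size, prob = prob), sum(pmf[1:51]))
```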

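Poisson Distribution

• The probability distribution of the number of events occurring in a unit of time; the lambda argument (the mean number of events per unit of time) must be specified.

• The dpois() function:
x <- 0:20
y <- dpois(x, lambda = 4)
plot(x, y, type = "l", ylab = "Probability")

• The ppois() function:
ppois(4, lambda = 4)

• The qpois() function:
qpois(0.62, lambda = 4)

• The rpois() function: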
x <- rpois(1000, lambda = 4)
hist(x, ylab = "Frequency")
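
For the Poisson distribution, both the mean and the variance equal lambda, and dpois() follows the formula exp(-lambda) * lambda^k / k!. The sketch below checks both; the seed and the sample size of 10000 are arbitrary choices.

```r
lambda <- 4
k <- 0:20
# Probability of exactly k events from the Poisson formula
byFormula <- exp(-lambda) * lambda^k / factorial(k)
all.equal(dpois(k, lambda = lambda), byFormula)
set.seed(123)  # arbitrary seed for reproducible draws
draws <- rpois(10000, lambda = lambda)
mean(draws)    # should be close to lambda
var(draws)     # should also be close to lambda
```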

Chi-Squared Distribution

• The dchisq() function:
x <- 1:50
y <- dchisq(x, df = 5)
plot(x, y, type = "l", ylab = "Probability Density")

• The pchisq() function:
pchisq(5, df = 5)

• The qchisq() function:
qchisq(0.58, df = 5)

• The rchisq() function:
x <- rchisq(1000, df = 5)
hist(x, ylab = "Frequency")
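
The df argument (degrees of freedom) is easier to interpret once the link to the normal distribution is clear: a chi-squared variable with df = k is the sum of k squared standard normal variables. A sketch of that relationship, with an arbitrary seed and sample size:

```r
# With df = 1, the 0.95 quantile equals the squared 0.975 normal quantile
all.equal(qchisq(0.95, df = 1), qnorm(0.975)^2)
set.seed(2018)  # arbitrary seed for reproducible draws
# 10000 rows of 5 standard normal values each
z <- matrix(rnorm(5 * 10000), ncol = 5)
# Sum of 5 squared standard normals per row
simulated <- rowSums(z^2)
mean(simulated)  # should be close to df = 5
```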

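14. Learning Resources Recommended by the Book's Author

Resource	Type	Suitable for
Quick R / R in Action	Website / Print book	Beginners, intermediate users
R Programming for Data Science	E-book	Beginners, intermediate users, intermediate-to-advanced users
R for Data Science	E-book / Print book	Intermediate users
Advanced R	Website / Print book	Intermediate-to-advanced users
The Art of R Programming	Print book	Intermediate-to-advanced users
swirl package	Course
DataCamp	Course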

Saturday, September 29, 2018
