H Solutions ch. 9 - Decision trees and random forests

Solutions to exercises of chapter 9.

H.1 Exercise 1

Load the necessary packages
readr to read in the data
dplyr to process data
party and rpart for the classification tree algorithms

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
library(rpart)
library(rpart.plot)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
set.seed(100)

Select features that may explain survival

Each row in the data is a passenger. Columns are features:

survived: 0 if died, 1 if survived
embarked: Port of Embarkation (Cherbourg, Queenstown,Southampton)
sex: Gender
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
fare: Fare Payed

Make categorical features should be made into factors

titanic3 <- "https://goo.gl/At238b" %>%
  read_csv %>% # read in the data
  select(survived, embarked, sex, 
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))
## Parsed with column specification:
## cols(
##   pclass = col_character(),
##   survived = col_integer(),
##   name = col_character(),
##   sex = col_character(),
##   age = col_double(),
##   sibsp = col_integer(),
##   parch = col_integer(),
##   ticket = col_character(),
##   fare = col_double(),
##   cabin = col_character(),
##   embarked = col_character(),
##   boat = col_character(),
##   body = col_integer(),
##   home.dest = col_character()
## )
#load("/Users/robertness/Downloads/titanic.Rdata")

Split data into training and test sets

.data <- c("training", "test") %>%
  sample(nrow(titanic3), replace = T) %>%
  split(titanic3, .)

Recursive partitioning is implemented in “rpart” package

rtree_fit <- rpart(survived ~ ., 
          .data$training) 
rpart.plot(rtree_fit)
## Warning: Bad 'data' field in model 'call' (expected a data.frame or a matrix).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Conditional partitioning is implemented in the “ctree” method

tree_fit <- ctree(survived ~ ., 
                  data = .data$training)
plot(tree_fit)

Use ROCR package to visualize ROC Curve and compare methods

tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%
  prediction(.data$test$survived) %>%
  performance("tpr", "fpr")
plot(tree_roc)

Acknowledgement: the code for this excersise is from http://bit.ly/2fqWKvK