E Solutions ch. 6 - Decision trees and random forests

Solutions to exercises of chapter 6.

E.1 Exercise 1

Load the necessary packages:
readr to read in the data
dplyr to process the data
party and rpart for the classification tree algorithms
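A minimal sketch of this loading step, assuming all packages are already installed. ROCR is included because it is used for the ROC curve later in this exercise.

```r
# Load the packages used in this exercise (assumes they are installed,
# e.g. via install.packages(c("readr", "dplyr", "party", "rpart", "ROCR")))
library(readr)  # read_csv()
library(dplyr)  # select(), mutate() and the %>% pipe
library(party)  # ctree()
library(rpart)  # rpart()
library(ROCR)   # prediction(), performance()
```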

Select features that may explain survival

Each row in the data is a passenger. Columns are features:

survived: 0 if died, 1 if survived
embarked: Port of Embarkation (Cherbourg, Queenstown, Southampton)
sex: Gender
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
fare: Fare Paid

Categorical features should be made into factors

titanic3 <- "https://goo.gl/At238b" %>%
  read_csv %>% # read in the data
  select(survived, embarked, sex, 
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))
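
To illustrate what the conversion to factors does, here is a small base-R sketch on a made-up data frame. The toy values are invented for illustration and are not taken from the Titanic data.

```r
# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(sex = c("male", "female", "female"),
                  fare = c(7.25, 71.28, 8.05),
                  stringsAsFactors = FALSE)

# Convert the character column to a factor so that tree algorithms
# treat it as a categorical feature
toy$sex <- factor(toy$sex)

sex_levels <- levels(toy$sex)  # levels are sorted alphabetically
print(sex_levels)
```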

Split data into training and test sets

.data <- c("training", "test") %>%
  sample(nrow(titanic3), replace = TRUE) %>%
  split(titanic3, .)
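
The split above works by drawing one random label per row and then partitioning the data frame by those labels. The same idea can be sketched in base R on a toy data frame. The names toy, labels and parts and the fixed seed are assumptions made for this illustration.

```r
set.seed(1)  # assumption: fixed seed so the illustration is reproducible

# Hypothetical toy data standing in for titanic3
toy <- data.frame(id = 1:20, value = rnorm(20))

# Draw one label per row, with replacement, so each row is assigned
# to "training" or "test" with equal probability
labels <- sample(c("training", "test"), nrow(toy), replace = TRUE)

# split() returns a named list of data frames, one per label
parts <- split(toy, labels)

print(names(parts))
print(sapply(parts, nrow))
```

This gives a roughly 50/50 split, and the group sizes change from run to run unless a seed is set.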

Recursive partitioning is implemented in the “rpart” package

rtree_fit <- rpart(survived ~ ., 
                   data = .data$training)

Conditional partitioning is implemented in the “ctree” function of the “party” package

tree_fit <- ctree(survived ~ ., 
                  data = .data$training)
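
Because survived is coded as the numbers 0 and 1, rpart treats it as a numeric response by default and fits a regression tree. If a classification tree is wanted instead, one option is to pass method = "class". The sketch below shows this on the kyphosis data set that ships with rpart, used here only as a stand-in for the Titanic data. The name class_fit is an assumption made for this illustration.

```r
library(rpart)

# kyphosis ships with rpart; it is a stand-in here, not the Titanic data
class_fit <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis,
                   method = "class")  # request a classification tree

# Class probabilities, one column per class level
probs <- predict(class_fit, newdata = kyphosis, type = "prob")
print(head(probs))
```

An alternative with the same effect would be to convert the response to a factor before fitting.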

Use the ROCR package to visualize the ROC curve and compare methods

tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%
  prediction(.data$test$survived) %>%
  performance("tpr", "fpr")
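
To make clear what a "tpr" versus "fpr" curve represents, the following base-R sketch computes the true and false positive rates by hand on made-up scores and labels. It is an illustration of the concept under these toy assumptions, not ROCR's implementation.

```r
# Toy scores and true labels (made-up values for illustration only)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Sort by decreasing score, then lower the threshold one observation at a time
ord <- order(scores, decreasing = TRUE)
sorted_labels <- labels[ord]

# Fraction of positives (tpr) and negatives (fpr) captured at each threshold
tpr <- c(0, cumsum(sorted_labels == 1) / sum(labels == 1))
fpr <- c(0, cumsum(sorted_labels == 0) / sum(labels == 0))

# Area under the curve via the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)  # 0.8125
```

Calling plot(fpr, tpr, type = "s") on these vectors draws the corresponding step curve.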

Acknowledgement: the code for this exercise is from http://bit.ly/2fqWKvK