15.071 | Spring 2017 | Graduate

# The Analytics Edge

4 Trees

## 4.5 Assignment 4

### Predicting Earnings from Census Data

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns – in particular, whether the person earns more than \$50,000 per year. This data comes from the UCI Machine Learning Repository.

The file census (CSV - 3.3MB) contains 1994 census data for 31,978 individuals in the United States.

The dataset includes the following 13 variables:

• age = the age of the individual in years
• workclass = the classification of the individual’s working status (does the person work for the federal government, work for the local government, work without pay, and so on)
• education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, so on)
• maritalstatus = the marital status of the individual
• occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)
• relationship = relationship of individual to his/her household
• race = the individual’s race
• sex = the individual’s sex
• capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
• capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
• hoursperweek = the number of hours the individual works per week
• nativecountry = the native country of the individual
• over50k = whether or not the individual earned more than \$50,000 in 1994

### Problem 1.1 - A Logistic Regression Model

Let’s begin by building a logistic regression model to predict whether an individual’s earnings are above \$50,000 (the variable “over50k”) using all of the other variables as independent variables. First, read the dataset census.csv into R.

Then, split the data randomly into a training set and a testing set, setting the seed to 2000 before creating the split. Split the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.

Next, build a logistic regression model to predict the dependent variable “over50k”, using all of the other variables in the dataset as independent variables. Use the training set to build the model.

Which variables are significant, or have factors that have at least one significant dummy variable? (Use 0.1 as your significance threshold, so variables with a period or dot in the stars column should be counted too. You might see a warning message here - you can ignore it and proceed. This message is a warning that we might be overfitting our model to the training set.) Select all that apply.

Exercise 1

age

workclass

education

maritalstatus

occupation

relationship

race

sex

capitalgain

capitalloss

¨C10Choursperweek

¨C11Cnativecountry

Explanation

To read census.csv, set your working directory to the same directory that census.csv is in, and run the following command:

We now need to split the data. Load the caTools package, and set the seed to 2000:

library(caTools)

set.seed(2000)

Split the data set according to the over50k variable using filter from the dplyr package:

spl = sample.split(census\$over50k, SplitRatio = 0.6)

library(dplyr)

censusTrain = census %>% filter(spl == TRUE)

censusTest = census %>% filter(spl == FALSE)

¨C76C ¨C77C ¨C78C ¨C79C

### Problem 1.2 - A Logistic Regression Model

What is the accuracy of the model on the testing set? Use a threshold of p = 0.5. (You might see a warning message when you make predictions on the test set - you can safely ignore it.)

Exercise 2

Numerical Response

Explanation

Generate the predictions for the test set:

predictTest = predict(censusglm, newdata = censusTest, type = “response”)

Then we can generate the confusion matrix:

table(censusTest\$over50k, predictTest >= 0.5)

If we divide the sum of the main diagonal by the sum of all of the entries in the matrix, we obtain the accuracy:

(9051+1888)/(9051+662+1190+1888) = 0.8552107

### Problem 1.3 - A Logistic Regression Model

What is the accuracy of the baseline model on the testing set?

Exercise 3

Numerical Response

Explanation

We need to first determine the most frequent outcome in the training set. To do that, we table the dependent variable in the training set:

table(censusTrain\$over50k)

“<=50K” is the more frequent outcome (14570 observations), so this is what the baseline predicts. To generate the accuracy of the baseline on the test set, we can table the dependent variable in the test set:

table(censusTest\$over50k)

The baseline accuracy is

9713/(9713+3078) = 0.7593621.

### Problem 1.4 - A Logistic Regression Model

What is the area under the curve (AUC) for the logistic regression model on the test set?

Exercise 4

Numerical Response

Explanation

library(ROCR)

Then you can use the following commands to compute the AUC (assuming your test set predictions are called “predictTest”):

ROCRpred = prediction(predictTest, censusTest\$over50k)

as.numeric(performance(ROCRpred, “auc”)@y.values)

### Problem 2.1 - A CART Model

We have just seen how a logistic regression model for this data achieves a high accuracy. Moreover, the significances of the variables give us a way to gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others, especially due to the large number of factor variables in this problem.

Let us now build a CART model to predict “over50k”. Use the training set to build the model, with all of the other variables as independent variables. Use the default parameters (with the default loss table), so don’t set a value for minbucket or cp and do not specify a loss table. Remember to specify method=“class” as an argument to rpart, since this is a classification problem. After you are done building the model, plot the resulting tree.

How many splits does the tree have in total?

Exercise 5

Numerical Response

Explanation

To get started, load the rpart and rpart.plot packages:

library(rpart)

library(rpart.plot)

Estimate the CART tree:

censustree = rpart( over50k ~ . , method=“class”, data = censusTrain)

Plot the tree:

prp(censustree)

There are four splits in total.

### Problem 2.2 - A CART Model

Which variable does the tree split on at the first level (the very first split of the tree)?

Exercise 6

age

workclass

education

maritalstatus

occupation

relationship

race

sex

capitalgain

capitalloss

¨C34Choursperweek

¨C35Cnativecountry

Explanation

Plot the tree and examine the first split:

prp(censustree)

The first split uses the variable relationship.

### Problem 2.3 - A CART Model

Which variables does the tree split on at the second level (immediately after the first split of the tree)? Select all that apply.

Exercise 7

age

workclass

education

maritalstatus

occupation

relationship

race

sex

capitalgain

capitalloss

¨C46Choursperweek

¨C47Cnativecountry

Explanation

Plot the tree and examine the second splits:

prp(censustree)

The second splits are on capitalgains and education.

### Problem 2.4 - A CART Model

What is the accuracy of the model on the testing set?

Exercise 8

Numerical Response

This highlights a very regular phenomenon when comparing CART and logistic regression. CART often performs a little worse than logistic regression in out-of-sample accuracy. However, as is the case here, the CART model is often much simpler to describe and understand.

Explanation

First, generate predictions on the test set using the CART tree:

predictTest = predict(censustree, newdata = censusTest, type = “class”)

Then create the confusion matrix:

table(censusTest\$over50k, predictTest)

To compute the accuracy, sum the diagonal entries and divide by the sum of all of the terms:

(9243+1596)/(9243+470+1482+1596) = 0.8473927

### Problem 3.1 - A Random Forest Model

Before building a random forest model, we’ll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model since random forests is much more computationally intensive than CART or Logistic Regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, that contains 2000 randomly selected obervations from the original training set. Do this by running the following commands in your R console (assuming your training set is called “censusTrain”):

set.seed(1)

trainSmall = censusTrain[sample(nrow(censusTrain), 2000), ]

Let us now build a random forest model to predict “over50k”, using the dataset “trainSmall” as the data used to build the model. Set the seed to 1 again right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest “can not handle categorical predictors with more than 32 categories”, re-build the model without the nativecountry variable as one of the independent variables.)

Then, make predictions using this model on the entire test set. What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that you don’t need a “type” argument when making predictions with a random forest model if you want to use a threshold of 0.5. Also, note that your accuracy might be different from the one reported here, since random forest models can still differ depending on your operating system, even when the random seed is set. )

Exercise 9

Numerical Response

Explanation

To generate the random forest model with all of the variables, just run:

library(randomForest)

set.seed(1)

censusrf = randomForest(over50k ~ . , data = trainSmall)

And then you can make predictions on the test set by using the following command:

predictTest = predict(censusrf, newdata=censusTest)

And to compute the accuracy, you can create the confusion matrix:

table(censusTest\$over50k, predictTest)

The accuracy of the model should be around

(9614+1050)/nrow(censusTest) = 0.8337112

You may be surprised that the test-set accuracy of the random forest model is lower than the test-set accuracies of some of the other, simpler models we’ve trained. This can occur when the parameters of the random forest have not been tuned via cross-validation and the default parameters are ill suited for the dataset.

### Problem 4.1 - Selecting cp by Cross-Validation

We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.

Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. Do this by using the train function. Set the seed beforehand to 2. Test cp values from 0.002 to 0.092 in 0.01 increments, by using the following command:

cartGrid = data.frame( .cp = seq(0.002,0.092,0.01))

After you create the cartGrid variable, you can pass it on to the train function by letting tuneGrid = cartGrid.

Also, remember to use the entire training set “censusTrain” when building this model. The train function might take some time to run.

Which value of cp does the train function recommend?

Exercise 10

Numerical Response

Explanation

Before doing anything, load the caret package:

library(caret)

Set the seed to 2:

set.seed(2)

Specify that we are going to use k-fold cross validation with 10 folds:

fitControl = trainControl( method = “cv”, number = 10 )

Specify the grid of cp values that we wish to evaluate:

cartGrid = data.frame( .cp = seq(0.002,0.092,0.01))

Finally, run the train function and view the result:

cartCV = train( over50k ~ . , data = censusTrain, method = “rpart”, trControl = fitControl, tuneGrid = cartGrid )

¨C155C ¨C156C ¨C157C ¨C158C ¨C159C

### Problem 4.2 - Selecting cp by Cross-Validation

Recall that the caret package causes categorical independent variables to be split into factor levels. To generate a version of the test set with dummy variables, please run the following command (which assumes your test set is stored in variable “censusTest”):

censusTestMM = as.data.frame(model.matrix(over50k~.+0, data=censusTest))

Use the final model selected via cross-validation to make predictions on censusTestMM. What is the prediction accuracy on the test set?

Exercise 11

Numerical Response

Explanation

Assuming you stored the cross-validation results in a variable called “cartCV”, you can access the final selected model with the following command:

model = cartCV\$finalModel

You can make predictions on the test set using the following commands:

censusTestMM = as.data.frame(model.matrix(~.+0, data=censusTest))

predictTest = predict(model, newdata=censusTestMM, type=“class”)

Then you can generate the confusion matrix with the command

table(censusTest\$over50k, predictTest)

The accuracy is (9167+1821)/(9167+546+1257+1821) = 0.8590415.

### Problem 4.3 - Selecting cp by Cross-Validation

Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one – or should we? Plot the CART tree for this model. How many splits are there?

Exercise 12

Numerical Response

This highlights one important tradeoff in building predictive models. By tuning cp, we improved our accuracy, but our tree became significantly more complex. In some applications, such an improvement in accuracy would be worth the loss in interpretability. In others, we may prefer a less accurate model that is simpler to understand and describe over a more accurate – but more complicated – model.

Explanation

If you plot the tree with prp(model), where “model” is the name of your CART tree, you should see that there are 17 splits.

### State Data Revisited

We will be revisiting the “state” dataset from one of the optional problems in Unit 2. This dataset has, for each of the fifty U.S. states, the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation. This dataset comes from the U.S. Department of Commerce, Bureau of the Census.

Load the dataset into R and convert it to a data frame by running the following two commands in R:

data(state) statedata = data.frame(state.x77)

If you can’t access the state dataset in R, here is a CSV file with the same data that you can load into R using the read.csv function: statedataSimple (CSV).  Be sure to call the output of the read.csv function “statedata”.

After you have loaded the data into R, inspect the data set using the command: str(statedata)

This dataset has 50 observations (one for each US state) and the following 8 variables:

• Population - the population estimate of the state in 1975
• Income - per capita income in 1974
• Illiteracy - illiteracy rates in 1970, as a percent of the population
• Life.Exp - the life expectancy in years of residents of the state in 1970
• Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976
• Frost - the mean number of days with minimum temperature below freezing from 1931–1960 in the capital or a large city of the state
• Area - the land area (in square miles) of the state

We will try to build a model for life expectancy using regression trees, and employ cross-validation to improve our tree’s performance.

### Problem 1.1 - Linear Regression Models

Let’s recreate the linear regression models we made in the previous homework question. First, predict Life.Exp using all of the other variables as the independent variables (Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area ). Use the entire dataset to build the model.

What is the adjusted R-squared of the model?

Exercise 1

Numerical Response

Explanation

To build the regression model, type the following command in your R console:

RegModel = lm(Life.Exp ~ ., data=statedata)

Then, if you look at the output of summary(RegModel), you should see that the Adjusted R-squared is 0.6922.

### Problem 1.2 - Linear Regression Models

Calculate the sum of squared errors (SSE) between the predicted life expectancies using this model and the actual life expectancies:

Exercise 2

Numerical Response

Explanation

To make predictions, type in your R console:

Predictions = predict(RegModel)

where “RegModel” is the name of your regression model. You can then compute the sum of squared errors by typing the following in your R console:

sum((statedata\$Life.Exp - Predictions)^2)

The SSE is 23.29714.

Alternatively, you can use the following command to get the SSE:

sum(RegModel\$residuals^2)

### Problem 1.3 - Linear Regression Models

Build a second linear regression model using just Population, Murder, Frost, and HS.Grad as independent variables (the best 4 variable model from the previous homework). What is the adjusted R-squared for this model?

Exercise 3

Numerical Response

Explanation

You can create this regression model by typing the following into your R console:

RegModel2 = lm(Life.Exp ~ Population + Murder + Frost + HS.Grad, data=statedata)

Then, if you type:

summary(RegModel2)

The Adjusted R-squared is at the bottom right of the output, and is 0.7126

### Problem 1.4 - Linear Regression Models

Calculate the sum of squared errors again, using this reduced model:

Exercise 4

Numerical Response

Explanation

The sum of squared errors (SSE) can be computed by first making predictions:

Predictions2 = predict(RegModel2)

and then computing the sum of the squared difference between the actual values and the predictions:

sum((statedata\$Life.Exp - Predictions2)^2).

Alternatively, you can compute the SSE with the following command:

SSE = sum(RegModel2\$residuals^2)

### Problem 1.5 - Linear Regression Models

Which of the following is correct?

Exercise 5

Trying different combinations of variables in linear regression is like trying different numbers of splits in a tree - this controls the complexity of the model.

Using many variables in a linear regression is always better than using just a few.

The variables we removed were uncorrelated with Life.Exp

Explanation

The correct answer is the first one. Trying different combinations of variables in linear regression controls the complexity of the model. This is similar to trying different numbers of splits in a tree, which is also controlling the complexity of the model.

The second answer is incorrect because as we see here, a model with fewer variables actually has a higher adjusted R-squared. If your accuracy is just as good, a model with fewer variables is almost always better.

The third answer is incorrect because the variables we removed have non-zero correlations with the dependent variable Life.Exp. Illiteracy and Area are negatively correlated with Life.Exp, with correlations of -0.59 and -0.11. Income is positively correlated with Life.Exp, with a correlation of 0.34. These correlations can be computed by typing the following into your R console:

cor(statedata\$Life.Exp, statedata\$Income)

cor(statedata\$Life.Exp, statedata\$Illiteracy)

cor(statedata\$Life.Exp, statedata\$Area)

### Problem 2.1 - CART Models

Let’s now build a CART model to predict Life.Exp using all of the other variables as independent variables (Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area). We’ll use the default minbucket parameter, so don’t add the minbucket argument. Remember that in this problem we are not as interested in predicting life expectancies for new observations as we are understanding how they relate to the other variables we have, so we’ll use all of the data to build our model. You shouldn’t use the method=“class” argument since this is a regression tree.

Plot the tree. Which of these variables appear in the tree? Select all that apply.

Exercise 6

Population

Murder

Frost

¨C19CArea

Explanation

You can create the tree in R by typing the following command:

CARTmodel = rpart(Life.Exp ~ ., data=statedata)

Be sure to load the “rpart” and “rpart.plot” packages with the library command if they are not already loaded.

You can then plot the tree by typing:

prp(CARTmodel)

We can see that the only variable used in the tree is “Murder”.

### Problem 2.2 - CART Models

Use the regression tree you just built to predict life expectancies (using the predict function), and calculate the sum-of-squared-errors (SSE) like you did for linear regression. What is the SSE?

Exercise 7

Numerical Response

Explanation

You can make predictions using the CART model by typing the following line in your R console (assuming your model is called “CARTmodel”):

PredictionsCART = predict(CARTmodel)

Then, you can compute the sum of squared errors (SSE) by typing the following:

sum((statedata\$Life.Exp - PredictionsCART)^2)

The SSE is 28.99848.

### Problem 2.3 - CART Models

The error is higher than for the linear regression models. One reason might be that we haven’t made the tree big enough. Set the minbucket parameter to 5, and recreate the tree.

Which variables appear in this new tree? Select all that apply.

Exercise 8

Population

Murder

Frost

Area

Explanation

You can create a tree with a minbucket value of 5 with the following command:

CARTmodel2 = rpart(Life.Exp ~ ., data=statedata, minbucket=5)

Then, if you plot the tree using prp(CARTmodel2), you can see that Murder, HS.Grad, and Area are all used in this new tree.

### Problem 2.4 - CART Models

Do you think the default minbucket parameter is smaller or larger than 5 based on the tree that was built?

Exercise 9

Smaller

Larger

Explanation

Since the tree now has more splits, it must be true that the default minbucket parameter was limiting the tree from splitting more before. So the default minbucket parameter must be larger than 5.

### Problem 2.5 - CART Models

What is the SSE of this tree?

Exercise 10

Numerical Response

Explanation

You can compute the SSE of this tree by first making predictions:

PredictionsCART2 = predict(CARTmodel2)

and then computing the sum of the squared differences between the actual values and the predicted values:

sum((statedata\$Life.Exp - PredictionsCART2)^2)

The SSE is 23.64283

This is much closer to the linear regression model’s error. By changing the parameters we have improved the fit of our model.

### Problem 2.6 - CART Models

Can we do even better? Create a tree that predicts Life.Exp using only Area, with the minbucket parameter to 1. What is the SSE of this newest tree?

Exercise 11

Numerical Response

Explanation

You can create this third tree by typing:

CARTmodel3 = rpart(Life.Exp ~ Area, data=statedata, minbucket=1)

Then to compute the SSE, first make predictions:

PredictionsCART3 = predict(CARTmodel3)

And then compute the sum of squared differences between the actual values and the predicted values:

sum((statedata\$Life.Exp - PredictionsCART3)^2)

The SSE is 9.312442.

Note that the SSE is not zero here - we still make some mistakes. This is because there are other parameters in rpart that are also trying to prevent the tree from overfitting by setting default values. So our tree doesn’t necessarily have one observation in each bucket - by setting minbucket=1 we are just allowing the tree to have one observation in each bucket.

### Problem 2.7 - CART Models

This is the lowest error we have seen so far. What would be the best interpretation of this result?

Exercise 12

Trees are much better than linear regression for this problem because they can capture nonlinearities that linear regression misses.

We can build almost perfect models given the right parameters, even if they violate our intuition of what a good model should be.

Area is obviously a very meaningful predictor of life expectancy, given we were able to get such low error using just Area as our independent variable.

Explanation

The correct answer is the second one. By making the minbucket parameter very small, we could build an almost perfect model using just one variable, that is not even our most significant variable. However, if you plot the tree using prp(CARTmodel3), you can see that the tree has 22 splits! This is not a very interpretable model, and will not generalize well.

The first answer is incorrect because our tree model that was not overfit performed similarly to the linear regression model. Trees only look better than linear regression here because we are overfitting the model to the data.

The third answer is incorrect because Area is not actually a very meaningful predictor. Without overfitting the tree, our model would not be very accurate only using Area.

### Problem 3.1 - Cross-validation

Adjusting the variables included in a linear regression model is a form of model tuning. In Problem 1 we showed that by removing variables in our linear regression model (tuning the model), we were able to maintain the fit of the model while using a simpler model. A rule of thumb is that simpler models are more interpretable and generalizeable. We will now tune our regression tree to see if we can improve the fit of our tree while keeping it as simple as possible.

Load the caret library, and set the seed to 111. Set up the controls exactly like we did in the lecture (10-fold cross-validation) with cp varying over the range 0.01 to 0.50 in increments of 0.01. Use the train function to determine the best cp value for a CART model using all of the available independent variables, and the entire dataset statedata. What value of cp does the train function recommend? (Remember that the train function tells you to pick the largest value of cp with the lowest error when there are ties, and explains this at the bottom of the output.)

Exercise 13

Numerical Response

Explanation

You can load the library caret and set the seed by typing the following commands:

library(caret)

set.seed(111)

Then, you can set up the cross-validation controls by typing:

fitControl = trainControl(method = “cv”, number = 10)

cartGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01) )

You can then use the train function to find the best value of cp by typing:

train(Life.Exp ~ ., data=statedata, method=“rpart”, trControl = fitControl, tuneGrid = cartGrid)

At the bottom of the output, it says that the best value of cp is 0.12.

### Problem 3.2 - Cross-Validation

Create a tree with the value of cp you found in the previous problem, all of the available independent variables, and the entire dataset “statedata” as the training data. Then plot the tree. You’ll notice that this is actually quite similar to the first tree we created with the initial model. Interpret the tree: we predict the life expectancy to be 70 if the murder rate is greater than or equal to

Exercise 14

Numerical Response

and is less than

Exercise 15

Numerical Response

Explanation

You can create a new tree with cp=0.12 by typing:

CARTmodel4 = rpart(Life.Exp ~ ., data=statedata, cp=0.12)

Then, if you plot the tree using prp(CARTmodel4), you can see that the life expectancy is predicted to be 70 if Murder is greater than or equal to 6.6 (the first split) and less than 11 (the second split).

### Problem 3.3 - Cross-Validation

Calculate the SSE of this tree:

Exercise 16

Numerical Response

Explanation

To compute the SSE, first make predictions:

PredictionsCART4 = predict(CARTmodel4)

and then compute the sum of squared differences between the actual values and the predicted values:

sum((statedata\$Life.Exp - PredictionsCART4)^2)

The SSE is 32.86549

### Problem 3.4 - Cross-Validation

Recall the first tree (default parameters), second tree (minbucket = 5), and the third tree (selected with cross validation) we made. Given what you have learned about cross-validation, which of the three models would you expect to be better if we did use it for prediction on a test set? For this question, suppose we had actually set aside a few observations (states) in a test set, and we want to make predictions on those states.

Exercise 17

The first model

The second model

The model we just made with the “best” cp

Explanation

The purpose of cross-validation is to pick the tree that will perform the best on a test set. So we would expect the model we made with the “best” cp to perform best on a test set.

### Problem 3.5 - Cross-Validation

At the end of Problem 2 we made a very complex tree using just Area. Use train with the same parameters as before but just using Area as an independent variable to find the best cp value (set the seed to 111 first). Then build a new tree using just Area and this value of cp.

How many splits does the tree have?

Exercise 18

Numerical Response

Explanation

To find the best value of cp when using only Area, use the following command:

set.seed(111)

train(Life.Exp ~ Area, data=statedata, method=“rpart”, trControl = fitControl, tuneGrid = cartGrid )

Then, build a new CART tree by typing:

CARTmodel5 = rpart(Life.Exp ~ Area, data=statedata, cp=0.02)

You can plot the tree with prp(CARTmodel5), and see that the tree has 4 splits.

### Problem 3.6 - Cross-Validation

The lower left leaf (or bucket) corresponds to the lowest predicted Life.Exp of 70. Observations in this leaf correspond to states with area greater than or equal to

Exercise 19

Numerical Response

and area less than

Exercise 20

Numerical Response

Explanation

To get to this leaf, we go through 3 splits:

Area less than 62,000

Area greater than or equal to 9,579

Area less than 51,000

This means that this leaf is composed of states that have an area greater than 9,579 and less than 51,000.

### Problem 3.7 - Cross-Validation

We have simplified the previous “Area tree” considerably by using cross-validation. Calculate the SSE of the cross-validated “Area tree”, and select all of the following correct statements that apply:

Exercise 21

The best model in this whole question is the first “Area tree” because it had the lowest SSE.

The Area variable is not as predictive as Murder rate.

Cross-validation is intended to decrease the SSE for a model on the training data, compared to a tree that isn’t cross-validated.

Cross-validation will always improve the SSE of a model on unseen data, compared to a tree that isn’t cross-validated.

Explanation

You can calculate the SSE by first making predictions and then computing the SSE:

PredictionsCART5 = predict(CARTmodel5)

sum((statedata\$Life.Exp - PredictionsCART5)^2)

The original Area tree was overfitting the data - it was uninterpretable. Area is not as useful as Murder - if it was, it would have been in the cross-validated tree. Cross-validation is not designed to improve the fit on the training data, but it won’t necessarily make it worse either. Cross-validation cannot guarantee improving the SSE on unseen data, although it often helps.

Back: Understanding Why People Vote

### Understanding Why People Vote

In August 2006 three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large scale field experiment in Michigan, USA to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper (PDF):

Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that … “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of vote turnout…

In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.

### The Data

The researchers grouped about 344,000 voters into different groups randomly - about 191,000 voters were a “control” group, and the rest were categorized into one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.

1. “Civic Duty” (variable civicduty) group members were sent a letter that simply said “DO YOUR CIVIC DUTY — VOTE!”
2. “Hawthorne Effect” (variable hawthorne) group members were sent a letter that had the “Civic Duty” message plus the additional message “YOU ARE BEING STUDIED” and they were informed that their voting behavior would be examined by means of public records.
3. “Self” (variable self) group members received the “Civic Duty” message as well as the recent voting record of everyone in that household and a message stating that another message would be sent after the election with updated records.
4. “Neighbors” (variable neighbors) group members were given the same message as that for the “Self” group, except the message not only had the household voting records but also that of neighbors - maximizing social pressure.
5. “Control” (variable control) group members were not sent anything, and represented the typical voting situation.

Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).

### Problem 1.1 - Exploration and Logistic Regression

We will first get familiar with the data. Load the CSV file gerber (CSV - 6.2MB) into R. What proportion of people in this dataset voted in this election?

Exercise 1

Numerical Response

Explanation

Then we can compute the percentage of people who voted by using the table function:

table(gerber\$voting)

The output tells us that 235,388 people did not vote, and 108,696 people did vote. This means that 108696/(108696+235388) = 0.316 of all people voted in the election.

### Problem 1.2 - Exploration and Logistic Regression

Which of the four “treatment groups” had the largest percentage of people who actually voted (voting = 1)?

Exercise 2

Civic Duty

Hawthorne Effect

Self

Neighbors

Explanation

There are several ways to get this answer. One is to use the tapply function, and compute the mean value of “voting”, sorted by whether or not the people were in each group:

tapply(gerber\$voting, gerber\$civicduty, mean)

tapply(gerber\$voting, gerber\$hawthorne, mean)

tapply(gerber\$voting, gerber\$self, mean)

tapply(gerber\$voting, gerber\$neighbors, mean)

The variable with the largest value in the “1” column has the largest fraction of people voting in their group - this is the Neighbors group.

### Problem 1.3 - Exploration and Logistic Regression

Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.

Exercise 3

Civic Duty

Hawthorne Effect

Self

Neighbors

Explanation

You can build the logistic regression model with the following command:

LogModel = glm(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, family=“binomial”)

If you look at the output of summary(LogModel), you can see that all of the variables are significant.

### Problem 1.4 - Exploration and Logistic Regression

Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don’t need to use the newdata argument since we didn’t split our data.)

Exercise 4

Numerical Response

Explanation

First compute predictions:

predictLog = predict(LogModel, type=“response”)

Then, use the table function to make a confusion matrix:

table(gerber\$voting, predictLog > 0.3)

We can compute the accuracy of the sum of the true positives and true negatives, divided by the sum of all numbers in the table:

(134513+51966)/(134513+100875+56730+51966) = 0.542

### Problem 1.5 - Exploration and Logistic Regression

Using a threshold of 0.5, what is the accuracy of the logistic regression model?

Exercise 5

Numerical Response

Explanation

First compute predictions:

predictLog = predict(LogModel, type=“response”)

Then, use the table function to make a confusion matrix:

table(gerber\$voting, predictLog > 0.5)

We can compute the accuracy of the sum of the true positives and true negatives, divided by the sum of all numbers in the table:

(235388+0)/(235388+108696) = 0.684

### Problem 1.6 - Exploration and Logistic Regression

Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?

Exercise 6

Even though all of the variables are significant, this is a weak predictive model.

The model’s accuracy doesn’t improve over the baseline, but the AUC is high, so this is a strong predictive model.

Explanation

You can compute the AUC with the following commands (if your model’s predictions are called “predictLog”):

library(ROCR)

ROCRpred = prediction(predictLog, gerber\$voting)

as.numeric(performance(ROCRpred, “auc”)@y.values)

Even though all of our variables are significant, our model does not improve over the baseline model of just predicting that someone will not vote, and the AUC is low. So while the treatment groups do make a difference, this is a weak predictive model.

### Problem 2.1 - Trees

We will now try out trees. Build a CART tree for voting using all data and the same four treatment variables we used before. Don’t set the option method=“class” - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We’d like CART to split our groups if they have different probabilities of voting. If we used method=‘class’, CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.

Leave all the parameters at their default values. You can use the following command in R to build the tree:

CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)

Plot the tree. What happens, and if relevant, why?

Exercise 7

Only the “Neighbors” variable is used in the tree - it is the only one with a big enough effect.

All variables are used - they all make a difference.

No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on.

Explanation

If you plot the tree, with prp(CARTmodel), you should just see one leaf! There are no splits in the tree, because none of the variables make a big enough effect to be split on.

### Problem 2.2 - Trees

Now build the tree using the command:

CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)

to force the complete tree to be built. Then plot the tree. What do you observe about the order of the splits?

Exercise 8

Civic duty is the first split, neighbor is the last.

Neighbor is the first split, civic duty is the last.

Explanation

You can plot the tree with prp(CARTmodel2).

We saw in Problem 1 that the highest fraction of voters was in the Neighbors group, followed by the Self group, followed by the Hawthorne group, and lastly the Civic Duty group. And we see here that the tree detects this trend.

### Problem 2.3 - Trees

Using only the CART tree plot, determine what fraction (a number between 0 and 1) of “Civic Duty” people voted:

Exercise 9

Numerical Response

Explanation

You can find this answer by reading the tree - the people in the civic duty group correspond to the bottom right split, which has value 0.31 in the leaf.

### Problem 2.4 - Trees

Make a new tree that includes the “sex” variable, again with cp = 0.0. Notice that sex appears as a split that is of secondary importance to the treatment group.

In the control group, which gender is more likely to vote?

Exercise 10

Men (0)

Women (1)

In the “Civic Duty” group, which gender is more likely to vote?

Exercise 11

Men (0)

Women (1)

Explanation

You can generate the new tree using the command:

CARTmodel3 = rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)

Then, if you plot the tree with prp(CARTmodel3), you can see that there is a split on the “sex” variable after every treatment variable split. For the control group, which corresponds to the bottom left, sex = 0 (male) corresponds to a higher voting percentage.

For the civic duty group, which corresponds to the bottom right, sex = 0 (male) corresponds to a higher voting percentage.

### Problem 3.1 - Interaction Terms

We know trees can handle “nonlinear” relationships, e.g. “in the ‘Civic Duty’ group and female”, but as we will see in the next few questions, it is possible to do the same for logistic regression. First, let’s explore what trees can tell us some more.

Let’s just focus on the “Control” treatment group. Create a regression tree using just the “control” variable, then create another tree with the “control” and “sex” variables, both with cp=0.0.

In the “control” only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group? You can use the absolute value function to get answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument “digits = 6” to the prp command to get a more accurate estimate.

Exercise 12

Numerical Response

Explanation

You can build the two trees with the following two commands:

CARTcontrol = rpart(voting ~ control, data=gerber, cp=0.0)

CARTsex = rpart(voting ~ control + sex, data=gerber, cp=0.0)

Then, plot the “control” tree with the following command:

prp(CARTcontrol, digits=6)

The split says that if control = 1, predict 0.296638, and if control = 0, predict 0.34. The absolute difference between these is 0.043362.

### Problem 3.2 - Interaction Terms

Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups):

Exercise 13

Men, by a margin of more than 0.001

Women, by a margin of more than 0.001

They are affected about the same (change in probability within 0.001 of each other).

Explanation

You can plot the second tree using the command:

prp(CARTsex, digits=6)

The first split says that if control = 1, go left. Then, if sex = 1 (female) predict 0.290456, and if sex = 0 (male) predict 0.302795. On the other side of the tree, where control = 0, if sex = 1 (female) predict 0.334176, and if sex = 0 (male) predict 0.345818. So for women, not being in the control group increases the fraction voting by 0.04372. For men, not being in the control group increases the fraction voting by 0.04302. So men and women are affected about the same.

### Problem 3.3 - Interaction Terms

Going back to logistic regression now, create a model using “sex” and “control”. Interpret the coefficient for “sex”:

Exercise 14

Coefficient is negative, reflecting that women are less likely to vote

Coefficient is negative, reflecting that women are more likely to vote

Coefficient is positive, reflecting that women are less likely to vote

Coefficient is positive, reflecting that women are more likely to vote

Explanation

You can create the logistic regression model by using the following command:

LogModelSex = glm(voting ~ control + sex, data=gerber, family=“binomial”)

If you look at the summary of the model, you can see that the coefficient for the “sex” variable is -0.055791. This means that women are less likely to vote, since women have a larger value in the sex variable, and a negative coefficient means that larger values are predictive of 0.

### Problem 3.4 - Interaction Terms

The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). Logistic regression has attempted to do the same, although it wasn’t able to do as well because it can’t consider exactly the joint possibility of being a women and in the control group.

We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where “LogModelSex” is the name of your logistic regression model that uses both control and sex):

`Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1)) predict(LogModelSex, newdata=Possibilities, type=&quot;response&quot;)`

The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ). What is the absolute difference between the tree and the logistic regression for the (Woman, Control) case? Give an answer with five numbers after the decimal point.

Exercise 15

Numerical Response

Explanation

The CART tree predicts 0.290456 for the (Woman, Control) case, and the logistic regression model predicts 0.2908065. So the absolute difference, to five decimal places, is 0.00035.

### Problem 3.5 - Interaction Terms

So the difference is not too big for this dataset, but it is there. We’re going to add a new term to our logistic regression now, that is the combination of the “sex” and “control” variables - so if this new variable is 1, that means the person is a woman AND in the control group. We can do that with the following command:

`````` LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family=&quot;binomial&quot;)
``````

How do you interpret the coefficient for the new variable in isolation? That is, how does it relate to the dependent variable?

Exercise 16

If a person is a woman or in the control group, the chance that she voted goes up.

If a person is a woman and in the control group, the chance that she voted goes up.

If a person is a woman or in the control group, the chance that she voted goes down.

If a person is a woman and in the control group, the chance that she voted goes down.

Explanation

This coefficient is negative, so that means that a value of 1 in this variable decreases the chance of voting. This variable will have variable 1 if the person is a woman and in the control group.

### Problem 3.6 - Interaction Terms

Run the same code as before to calculate the average for each group:

predict(LogModel2, newdata=Possibilities, type=“response”)

Now what is the difference between the logistic regression model and the CART model for the (Woman, Control) case? Again, give your answer with five numbers after the decimal point.

Exercise 17

Numerical Response

Explanation

The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.

### Problem 3.7 - Interaction Terms

This example has shown that trees can capture nonlinear relationships that logistic regression can not, but that we can get around this sometimes by using variables that are the combination of two variables. Should we always include all possible interaction terms of the independent variables when building a logistic regression model?

Exercise 18

Yes

No

Explanation

We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.