15.071 | Spring 2017 | Graduate

The Analytics Edge

4 Trees

4.2 Judge, Jury, and Classifier: An Introduction to Trees

Quick Question

How much data do you think Andrew Martin should use to build his model?

Explanation Andrew Martin should use all data from the cases with the same set of justices. The justices do not change every year, and typically you want to use as much data as you have available.

Quick Question

Suppose that you have the following CART tree:

[Figure: example of a CART tree with splits on X and Y.]

How many splits are in this tree?

Exercise 1

 Numerical Response 

 

For which data observations should we predict “Red”, according to this tree? Select all that apply.

Exercise 2

 If X is less than 60, and Y is any value. 

 If X is greater than or equal to 60, and Y is greater than or equal to 20. 

 If X is greater than or equal to 85, and Y is less than 20. 

 If X is greater than or equal to 60 and less than 85, and Y is less than 20. 

 

Explanation

This tree has three splits. The first split says to predict “Red” if X is less than 60, regardless of the value of Y. Otherwise, we move to the second split. The second split checks the value of Y: if it is greater than or equal to 20, predict “Gray”. Otherwise, we move to the third split. This split checks the value of X again. If X is less than 85 (and greater than or equal to 60, by the first split) and Y is less than 20, then we predict “Red”. Otherwise, we predict “Gray”.
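The decision logic above can be written out as a small R function (a hypothetical sketch; the function name predictColor and the variables X and Y are ours, matching the tree in the figure):

```r
# Hypothetical sketch of the tree's decision rules from the figure
predictColor <- function(X, Y) {
  if (X < 60) {
    return("Red")     # first split: X < 60 predicts Red regardless of Y
  } else if (Y >= 20) {
    return("Gray")    # second split: Y >= 20 predicts Gray
  } else if (X < 85) {
    return("Red")     # third split: 60 <= X < 85 and Y < 20 predicts Red
  } else {
    return("Gray")    # X >= 85 and Y < 20 predicts Gray
  }
}

predictColor(50, 100)  # "Red"
predictColor(70, 10)   # "Red"
predictColor(90, 10)   # "Gray"
```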


Quick Question

 

Suppose you have a subset of 20 observations, where 14 have outcome A and 6 have outcome B. What proportion of observations have outcome A?

Exercise 1

 Numerical Response 

 

Explanation

The fraction of observations that have outcome A is 14/20 = 0.7.


 

Quick Question

 

The following questions ask about the subset of 20 observations from the previous question.

If we set the threshold to 0.25 when computing predictions of outcome A, will we predict A or B for these observations?

Exercise 2

 A 

 B 

If we set the threshold to 0.5 when computing predictions of outcome A, will we predict A or B for these observations?

Exercise 3

 A 

 B 

If we set the threshold to 0.75 when computing predictions of outcome A, will we predict A or B for these observations?

Exercise 4

 A 

 B 

Explanation

Since 70% of these observations have outcome A, we will predict A if the threshold is below 0.7, and we will predict B if the threshold is above 0.7.
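In R, this comes down to comparing the proportion of outcome A against each threshold (a minimal sketch using the numbers from this question):

```r
# Proportion of outcome A in the subset: 14 of 20 observations
propA <- 14 / 20   # 0.7

# Predict A whenever the proportion exceeds the threshold
for (threshold in c(0.25, 0.5, 0.75)) {
  prediction <- ifelse(propA > threshold, "A", "B")
  cat("threshold =", threshold, "-> predict", prediction, "\n")
}
# threshold = 0.25 -> predict A
# threshold = 0.5  -> predict A
# threshold = 0.75 -> predict B
```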


 

Quick Question

 

Compute the AUC of the CART model from the previous video, using the following command in your R console:

as.numeric(performance(pred, "auc")@y.values)

What is the AUC?

Exercise 1

 Numerical Response 

 

Explanation

If you run the command given above after going through the commands in Video 4, you get an AUC of 0.6927.
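For reference, the steps leading up to that command look roughly like the following (a sketch assuming the StevensTree model and Test set built in Video 4; your script may differ slightly):

```r
library(ROCR)

# Get predicted probabilities (not class labels) on the test set
PredictROC = predict(StevensTree, newdata = Test)

# The second column holds the probability of outcome 1 (Reverse)
pred = prediction(PredictROC[,2], Test$Reverse)

# AUC: area under the ROC curve
as.numeric(performance(pred, "auc")@y.values)
```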

Now, recall that in Video 4, our tree had 7 splits. Let’s see how this changes if we change the value of minbucket.

First build a CART model that is similar to the one we built in Video 4, except change the minbucket parameter to 5. Plot the tree.

How many splits does the tree have?

Exercise 2

 Numerical Response 

 

Explanation

You can build a CART model with minbucket=5 by using the following command:

StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, method="class", data = Train, minbucket=5)

If you plot the tree with prp(StevensTree), you can see that the tree has 16 splits! This tree is probably overfit to the training data, and is not as interpretable.

Now build a CART model that is similar to the one we built in Video 4, except change the minbucket parameter to 100. Plot the tree.

How many splits does the tree have?

Exercise 3

 Numerical Response 

 

Explanation

You can build a CART model with minbucket=100 by using the following command:

StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, method="class", data = Train, minbucket=100)

If you plot the tree with prp(StevensTree), you can see that the tree only has one split! This tree probably does not fit the training data well enough.


 

Quick Question

Important Note: When creating random forest models, you might still get different answers from the ones you see here even if you set the random seed. This has to do with different operating systems and the random forest implementation.

Let’s see what happens if we set the seed to two different values and create two different random forest models.

First, set the seed to 100, and then re-build the random forest model, exactly like we did in the previous video (Video 5). Then make predictions on the test set. What is the accuracy of the model on the test set?

Exercise 1

 Numerical Response 

 

Now, set the seed to 200, and then re-build the random forest model, exactly like we did in the previous video (Video 5). Then make predictions on the test set. What is the accuracy of this model on the test set?

Exercise 2

 Numerical Response 

 

Explanation

You can create the models and compute the accuracies with the following commands in R:

set.seed(100)

StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25)

PredictForest = predict(StevensForest, newdata = Test)

table(Test$Reverse, PredictForest)

and then repeat it, but with set.seed(200) first.
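The accuracy can then be read off the confusion matrix by dividing the correct predictions (the diagonal entries) by the total number of observations (a sketch; the variable t is ours, holding the table computed above):

```r
# Confusion matrix: rows are actual outcomes, columns are predictions
t = table(Test$Reverse, PredictForest)

# Accuracy = correct predictions / total observations
accuracy = (t[1,1] + t[2,2]) / sum(t)
accuracy
```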

As we see here, the random component of the random forest method can change the accuracy. The accuracy for a more stable dataset will not change very much, but a noisy dataset can be significantly affected by the random samples.


Quick Question

 

Plot the tree that we created using cross-validation. How many splits does it have?

Exercise 1

 Numerical Response 

 

Explanation

If you follow the R commands from the previous video, you can plot the tree with prp(StevensTreeCV).

The tree with the best accuracy only has one split! When we were picking different minbucket parameters before, it seemed like this tree was probably not doing a good job of fitting the data. However, this tree with one split gives us the best out-of-sample accuracy. This reminds us that sometimes the simplest models are the best!
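For reference, the cross-validation commands from the previous video look roughly like the following (a sketch; the cp grid and the final cp value shown are examples based on the lecture, and your selected value may differ):

```r
library(caret)
library(e1071)

# 10-fold cross-validation over a grid of cp values
numFolds = trainControl(method = "cv", number = 10)
cpGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01))

# Pick the cp value with the best cross-validated accuracy
train(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
      data = Train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)

# Rebuild the tree with the selected cp value, then plot it
StevensTreeCV = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
                      method = "class", data = Train, cp = 0.18)
prp(StevensTreeCV)
```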


 

Video 4: CART in R

In the next few videos, we’ll be using the dataset stevens (CSV) to build trees in R. Please download the dataset to follow along. This data comes from the Supreme Court Forecasting Project website.

An R script file with all of the R commands used in this lecture can be downloaded here: Unit4_SupremeCourt (R).

Video 6: Cross-Validation

Important Note: In this video, we install and load two new packages so that we can perform cross-validation: “caret”, and “e1071”. You may need to additionally install and load the following packages for cross-validation to work on your computer: “class” and “ggplot2”. If you receive an error message after trying to load caret and e1071, please try installing and loading these two additional packages.

Cross-Validation for Random Forests

You might be wondering why we used cross-validation on our CART model, but not on our random forest model. According to the creators of the random forest algorithm, the model is not very sensitive to the parameters and therefore does not easily overfit to the training set. You can read more on the Random Forests website.

However, if you are interested in experimenting with the parameters of the random forest model more, you can read about the parameters and cross-validation for random forests in the documentation for the randomForest package (PDF).
