15.071 | Spring 2017 | Graduate

The Analytics Edge

4 Trees

4.3 Keeping an Eye on Healthcare Costs: The D2Hawkeye Story

Quick Question

In what ways do you think an analytics approach to predicting healthcare cost will improve upon the previous approach of human judgment? Select all that apply.

 
 
 

Explanation All of the above are correct. There are many advantages to having an analytics approach to predict cost. However, it is important that the models are interpretable, so trees are a great model to use in this situation.

Video 2: Claims Data

Quick Question

A common problem in analytics is that you have some data available, but it’s not the ideal dataset. This is the case for this problem, where we only have claims data. Which of the following pieces of information would we ideally like to have in our dataset, but are not included in claims data? (Select all that apply.)

Explanation In claims data, we have drugs prescribed to the patient, but we don't have blood test results or physical exam results.

Quick Question

While we don’t have all of the data we would ideally like to have in this problem (like test results), we can define new variables using the data we do have. Which of the following were new variables defined to help predict healthcare cost? Select all that apply.

Explanation All of these variables were defined using the claims data to improve cost predictions. This shows how the intuition of experts can be used to define new variables and improve the model.

Quick Question

The image below shows the penalty error matrix that we discussed in the previous video.

Penalty error matrix showing forecast vs. outcome.

We can interpret this matrix as follows. Suppose the actual outcome for an observation is 3, and we predict 2. We find 3 on the top of the matrix, and go down to the second row (since we forecasted 2). The penalty error for this mistake is 2. If for another observation we predict (forecast) 4, but the actual outcome is 1, that is a penalty error of 3.
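
Based on the two examples above, the full matrix can be written out explicitly in R. The sketch below is a reconstruction from the description rather than a copy of the image: rows index the forecast bucket (1 through 5), columns index the actual outcome, under-forecasting by k buckets costs 2k, and over-forecasting by k buckets costs k.

PenaltyMatrix = matrix(c(0, 2, 4, 6, 8,
                         1, 0, 2, 4, 6,
                         2, 1, 0, 2, 4,
                         3, 2, 1, 0, 2,
                         4, 3, 2, 1, 0),
                       byrow = TRUE, nrow = 5)

PenaltyMatrix[2, 3]  # forecast 2, actual 3: penalty 2
PenaltyMatrix[4, 1]  # forecast 4, actual 1: penalty 3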

What is the worst mistake we can make, according to the penalty error matrix?

 
 

Explanation The highest penalty is 8, which occurs when the forecast is 1 (very low cost) but the actual cost bucket is 5 (very high cost). It would be much worse for us to ignore an actual high-cost patient than to accidentally predict high cost for someone who turns out to be low cost.

What are the “best” types of mistakes we can make, according to the penalty error matrix?

 
 

Explanation We are happier with mistakes where we predict one cost bucket higher than the actual outcome, since this just means we are being a little overly cautious.

Quick Question

What were the most important factors in the CART trees to predict cost?

Explanation The most important variables in a CART tree are at the top of the tree; in this case, they are the cost ranges from the previous year.

Quick Question

 

What is the average age of patients in the training set, ClaimsTrain?

Exercise 1


 

What proportion of people in the training set (ClaimsTrain) had at least one diagnosis code for diabetes?

Exercise 2


 

Explanation

Both of these answers can be found by looking at summary(ClaimsTrain). The mean age should be listed under the age variable, and since diabetes is a binary variable, the mean value of diabetes gives the proportion of people with at least one diagnosis code for diabetes.

Alternatively, you could use the mean, table, and nrow functions:

mean(ClaimsTrain$age)

table(ClaimsTrain$diabetes)/nrow(ClaimsTrain)


 

Quick Question

 

Suppose that instead of the baseline method discussed in the previous video, we used the baseline method of predicting the most frequent outcome for all observations. This new baseline method would predict cost bucket 1 for everyone.

What would the accuracy of this baseline method be on the test set?

Exercise 1


 

What would the penalty error of this baseline method be on the test set?

Exercise 2


 

Explanation

To compute the accuracy, you can create a table of the variable ClaimsTest$bucket2009:

table(ClaimsTest$bucket2009)

According to the table output, this baseline method would get 122978 observations correct, and all other observations wrong. So the accuracy of this baseline method is 122978/nrow(ClaimsTest) = 0.67127.

For the penalty error, since this baseline method predicts 1 for all observations, it would have a penalty error of:

(0*122978 + 2*34840 + 4*16390 + 6*7937 + 8*1057)/nrow(ClaimsTest) = 1.044301
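
These numbers can also be computed directly in R. The sketch below assumes bucket2009 is coded as the integers 1 through 5 and reuses the PenaltyMatrix written out earlier; since every observation is forecast as bucket 1, the penalty for an observation whose actual bucket is j is the entry in row 1, column j of the matrix.

# Accuracy of always predicting bucket 1
sum(ClaimsTest$bucket2009 == 1)/nrow(ClaimsTest)

# Penalty error of always predicting bucket 1
sum(PenaltyMatrix[1, ClaimsTest$bucket2009])/nrow(ClaimsTest)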


 

Quick Question

In the previous video, we constructed two CART models. The first CART model, without the loss matrix, predicted bucket 1 for 78.6% of the observations in the test set. Did the second CART model, with the loss matrix, predict bucket 1 for more or fewer of the observations, and why?

Explanation If you look at the classification matrix for the second CART model, we predicted bucket 1 less frequently. This is because, according to the penalty matrix, some of the worst types of errors are to predict bucket 1 when the actual cost bucket is higher.
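
For reference, a loss matrix is passed to rpart through the parms argument. The sketch below is a minimal illustration rather than the exact model from the lecture: the predictor list is abbreviated, and the variable names (ClaimsTrain, bucket2009, bucket2008) are assumed from the lecture.

library(rpart)

# CART model that uses the penalty matrix as a loss matrix, so that
# under-forecasting cost is treated as more expensive than over-forecasting
ClaimsTree = rpart(bucket2009 ~ age + diabetes + bucket2008,
                   data = ClaimsTrain, method = "class",
                   parms = list(loss = PenaltyMatrix))

Because predicting bucket 1 for a patient whose actual bucket is higher now carries the largest penalties, a tree trained this way predicts bucket 1 less often, which is exactly the behavior described above.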

Video 6: Claims Data in R

In the next few videos, we'll be using the dataset ClaimsData (ZIP - 2.2MB), which contains one CSV file. Please download the dataset to follow along. Note that the file is in ZIP format due to its large size, so you will need to decompress (or unzip) it before loading the data into R.
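
As a quick sketch, loading the data might look like the following; the file names used here are assumptions, so substitute the actual names of the downloaded archive and the CSV inside it.

# Unzip the downloaded archive and read the CSV it contains
unzip("ClaimsData.zip")
Claims = read.csv("ClaimsData.csv")
str(Claims)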

This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).

An R script file with all of the R commands used in this lecture can be downloaded here: Unit4_D2Hawkeye (R)
