15.071 | Spring 2017 | Graduate

The Analytics Edge

3 Logistic Regression

3.2 Modeling the Expert: An Introduction to Logistic Regression

Quick Question

Which of the following dependent variables are categorical? (Select all that apply.)

 
 
 
 
 
 

Explanation

The weekly revenue of a company is not categorical, since it has a large number of possible values, on a continuous range. The number of daily car thefts in New York City is also not categorical because the number of car thefts could range from 0 to hundreds. On the other hand, the other options each have a limited number of possible outcomes.

Which of the following dependent variables are binary? (Select all that apply.)

 
 
 
 
 
 

Explanation

The only variables with two possible outcomes are the winner of an election with two candidates, and whether or not revenue will exceed $50,000.

Quick Question

 

Suppose the coefficients of a logistic regression model with two independent variables are as follows:

\( \beta_0 = -1.5 , \enspace \beta_1 = 3 , \enspace \beta_2 = -0.5 \)

 

And we have an observation with the following values for the independent variables:

\( x_1 = 1 , \enspace x_2 = 5 \)

 

What is the value of the Logit for this observation? Recall that the Logit is log(Odds).

Exercise 1

 Numerical Response 

 

Explanation

The Logit is just log(Odds), and looks like the linear regression equation. So the Logit is -1.5 + 3(1) - 0.5(5) = -1.

What is the value of the Odds for this observation? Note that you can compute e^x, for some number x, in your R console by typing exp(x). The function exp() computes the exponential of its argument.

Exercise 2

 Numerical Response 

 

Explanation

Using the value of the Logit from the previous question, we have that Odds = e^(-1) = 0.3678794.

What is the value of P(y = 1) for this observation?

Exercise 3

 Numerical Response 

 

Explanation

Using the Logistic Response Function, we can compute that P(y = 1) = 1/(1 + e^(-Logit)) = 1/(1 + e^(1)) = 0.2689414.
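All three quantities from these exercises can be checked in the R console; the coefficient and observation values below are the ones given in the question:

```r
# Coefficients and observation values from the question
b0 <- -1.5; b1 <- 3; b2 <- -0.5
x1 <- 1; x2 <- 5

logit <- b0 + b1*x1 + b2*x2   # log(Odds) = -1
odds  <- exp(logit)           # 0.3678794
prob  <- 1/(1 + exp(-logit))  # P(y = 1) = 0.2689414
```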


Quick Question

 

In R, create a logistic regression model to predict “PoorCare” using the independent variables “StartedOnCombination” and “ProviderCount”. Use the training set we created in the previous video to build the model.

Note: If you haven’t already loaded and split the data in R, please run these commands in your R console to load and split the data set. Remember to first navigate to the directory where you have saved “quality.csv”.

quality = read.csv("quality.csv")

install.packages("caTools")

library(caTools)

set.seed(88)

split = sample.split(quality$PoorCare, SplitRatio = 0.75)

qualityTrain = subset(quality, split == TRUE)

qualityTest = subset(quality, split == FALSE)

Then recall that we built a logistic regression model to predict PoorCare using the R command:

QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)

You will need to adjust this command to answer this question, and then look at the summary(QualityLog) output.

What is the coefficient for “StartedOnCombination”?

Exercise 1

 Numerical Response 

 

Explanation

To construct this model in R, use the command:

Model = glm(PoorCare ~ StartedOnCombination + ProviderCount, data=qualityTrain, family=binomial)

If you look at the output of summary(Model), the value of the coefficient (Estimate) for StartedOnCombination is 1.95230.


 

Quick Question

 

StartedOnCombination is a binary variable, which equals 1 if the patient is started on a combination of drugs to treat their diabetes, and equals 0 if the patient is not started on a combination of drugs. All else being equal, does this model imply that starting a patient on a combination of drugs is indicative of poor care, or good care?

Exercise 2

 Poor Care 

 Good Care 

Explanation

The coefficient value is positive, meaning that positive values of the variable make the outcome of 1 more likely. This corresponds to Poor Care.
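A quick numeric illustration of how a positive coefficient pushes the predicted probability toward the outcome 1. The coefficient values below are hypothetical (1.95 echoes the estimate from the previous question; the intercept and ProviderCount coefficient are made up for illustration):

```r
# Hypothetical coefficients; only the sign of bComb matters for this point
b0 <- -2; bComb <- 1.95; bProv <- 0.03

# Logistic response function
p <- function(comb, prov) 1 / (1 + exp(-(b0 + bComb*comb + bProv*prov)))

# Holding ProviderCount fixed at 20, switching StartedOnCombination
# from 0 to 1 raises the predicted probability of PoorCare (outcome 1)
p(0, 20)
p(1, 20)
```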


 

Quick Question

This question asks about the following two confusion matrices:

Confusion Matrix #1:

             Predicted = 0   Predicted = 1
Actual = 0        15              10
Actual = 1         5              20

 

Confusion Matrix #2:

             Predicted = 0   Predicted = 1
Actual = 0        20               5
Actual = 1        10              15

What is the sensitivity of Confusion Matrix #1?

Exercise 1

 Numerical Response 

 

Explanation

The sensitivity of a confusion matrix is the true positives, divided by the true positives plus the false negatives. In this case, it is 20/(20+5) = 0.8

What is the specificity of Confusion Matrix #1?

Exercise 2

 Numerical Response 

 

Explanation

The specificity of a confusion matrix is the true negatives, divided by the true negatives plus the false positives. In this case, it is 15/(15+10) = 0.6
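Both quantities can be computed in R from the entries of Confusion Matrix #1:

```r
# Confusion Matrix #1: rows are actual outcomes, columns are predictions
cm <- matrix(c(15, 5, 10, 20), nrow = 2,
             dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))

TN <- cm["0", "0"]; FP <- cm["0", "1"]   # true negatives, false positives
FN <- cm["1", "0"]; TP <- cm["1", "1"]   # false negatives, true positives

sensitivity <- TP / (TP + FN)  # 20/(20+5) = 0.8
specificity <- TN / (TN + FP)  # 15/(15+10) = 0.6
```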


Quick Question

To go from Confusion Matrix #1 to Confusion Matrix #2, did we increase or decrease the threshold value?

Exercise 3

 We increased the threshold value. 

 We decreased the threshold value. 

Explanation

We predict the outcome 1 less often in Confusion Matrix #2. This means we must have increased the threshold.
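The direction of this effect can be seen with a small sketch using made-up predicted probabilities:

```r
# Made-up predicted probabilities for five observations
pred <- c(0.2, 0.45, 0.55, 0.7, 0.9)

# A higher threshold classifies fewer observations as 1
sum(pred > 0.3)  # 4 observations predicted to be 1
sum(pred > 0.6)  # 2 observations predicted to be 1
```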


Quick Question

This question will ask about the following ROC curve:

Figure: Plot of the receiver operating characteristic (ROC) curve, showing the true positive rate against the false positive rate at each threshold.

Given this ROC curve, which threshold would you pick if you wanted to correctly identify a small group of patients who are receiving the worst care with high confidence?

   
   
   
   

Explanation

The threshold 0.7 is best to identify a small group of patients who are receiving the worst care with high confidence, since at this threshold we make very few false positive mistakes, and identify about 35% of the true positives. The threshold t = 0.8 is not a good choice, since it makes about the same number of false positives, but only identifies 10% of the true positives. The thresholds 0.2 and 0.3 both identify more of the true positives, but they make more false positive mistakes, so our confidence decreases.

Which threshold would you pick if you wanted to correctly identify half of the patients receiving poor care, while making as few errors as possible?

   
   
   
   

Explanation

The threshold 0.3 is the best choice in this scenario. The threshold 0.2 also identifies over half of the patients receiving poor care, but it makes many more false positive mistakes. The thresholds 0.7 and 0.8 don't identify at least half of the patients receiving poor care.

Quick Question

 

Important Note: This question uses the original model with the independent variables “OfficeVisits” and “Narcotics”. Be sure to use this model, instead of the model you built in Quick Question 4.

Compute the test set predictions in R by running the command:

predictTest = predict(QualityLog, type="response", newdata=qualityTest)

You can compute the test set AUC by running the following commands in R (the prediction() and performance() functions come from the ROCR package, so load it first):

library(ROCR)

ROCRpredTest = prediction(predictTest, qualityTest$PoorCare)

auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)

What is the AUC of this model on the test set?

Exercise 1

 Numerical Response 

 

Explanation

If you run the commands given above in your R console, you can see the value of the AUC of this model on the test set by just typing auc in your console. The value should be 0.7994792.


 

The AUC of a model has the following nice interpretation: given a random patient from the dataset who actually received poor care, and a random patient from the dataset who actually received good care, the AUC is the percentage of the time that our model will correctly identify which is which.
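This pairwise interpretation can be verified directly with a small made-up example (the predicted probabilities and outcomes below are illustrative, not from the quality dataset):

```r
# Made-up predicted probabilities and actual outcomes
pred   <- c(0.1, 0.4, 0.35, 0.8)
actual <- c(0,   0,   1,    1)

poor <- pred[actual == 1]  # predictions for poor-care patients
good <- pred[actual == 0]  # predictions for good-care patients

# Fraction of (poor, good) pairs ranked correctly; ties count as half
aucManual <- mean(outer(poor, good, function(p, g) (p > g) + 0.5 * (p == g)))
aucManual  # 0.75
```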

Video 4: Logistic Regression in R

In this video, we’ll be using the dataset quality (CSV) to build a logistic regression model in R. Please download this file to follow along.

An R script file with all of the commands used in this lecture can be downloaded here: Unit3_ModelingExpert (R).

The variables in the dataset quality.csv are as follows:

  • MemberID numbers the patients from 1 to 131, and is just an identifying number.
  • InpatientDays is the number of inpatient visits, or number of days the person spent in the hospital.
  • ERVisits is the number of times the patient visited the emergency room.
  • OfficeVisits is the number of times the patient visited any doctor’s office.
  • Narcotics is the number of prescriptions the patient had for narcotics.
  • DaysSinceLastERVisit is the number of days between the patient’s last emergency room visit and the end of the study period (set to the length of the study period if they never visited the ER). 
  • Pain is the number of visits for which the patient complained about pain.
  • TotalVisits is the total number of times the patient visited any healthcare provider.
  • ProviderCount is the number of providers that served the patient.
  • MedicalClaims is the number of days on which the patient had a medical claim.
  • ClaimLines is the total number of medical claims.
  • StartedOnCombination is whether or not the patient was started on a combination of drugs to treat their diabetes (TRUE or FALSE).
  • AcuteDrugGapSmall is the fraction of acute drugs that were refilled quickly after the prescription ran out.
  • PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.

In this video we learned how to use the sample.split() function from the caTools package to split data for a classification problem, balancing the positive and negative observations in the training and testing sets.

If you wanted to instead split a data frame data, where the dependent variable is a continuous outcome (this was the case for all the datasets we used last week), you could instead use the sample() function. Here is how to select 70% of observations for the training set (called “train”) and 30% of observations for the testing set (called “test”):

spl = sample(1:nrow(data), size=0.7 * nrow(data))

train = data[spl,]

test = data[-spl,]
