15.071 | Spring 2017 | Graduate

The Analytics Edge

3 Logistic Regression

3.3 The Framingham Heart Study: Evaluating Risk Factors to Save Lives

Quick Question

Why was the city of Framingham, Massachusetts selected for this study? Select all that apply.

Explanation

The reasons Framingham was selected for this study are listed on Slide 4 of the previous video: it had an appropriate size, it had a stable population, and the doctors and residents in the town were willing to participate. However, the city did not represent all types of people in the United States (we'll see later in the lecture how to extend the model to different populations), and there was not an abnormally large number of people with heart disease.

Continue: Video 2: Risk Factors

Quick Question

Are “risk factors” the independent variables or the dependent variables in our model?

Explanation

Risk factors are the independent variables in our model, and are what we will use to predict the dependent variable.
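To make the roles concrete, here is a minimal sketch of the logistic response function, in which the risk factors enter as inputs (independent variables) and the output is the predicted probability of the outcome (the dependent variable). The coefficient values, the intercept, and the names `chd_risk`, `age`, and `cigs_per_day` are hypothetical, chosen only for illustration; they are not the fitted Framingham values.

```python
import math

def chd_risk(intercept, coefficients, patient):
    """Logistic response: P(y = 1) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    # Linear combination of the independent variables (the risk factors).
    linear = intercept + sum(coefficients[name] * patient[name] for name in coefficients)
    # Squash the linear score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients, for illustration only (NOT the fitted model).
intercept = -5.0
coefficients = {"age": 0.06, "cigs_per_day": 0.02}

# Two hypothetical patients who differ in a single risk factor.
non_smoker = {"age": 50, "cigs_per_day": 0}
smoker = {"age": 50, "cigs_per_day": 20}

risk_non_smoker = chd_risk(intercept, coefficients, non_smoker)
risk_smoker = chd_risk(intercept, coefficients, smoker)

print(f"non-smoker: {risk_non_smoker:.3f}")
print(f"smoker:     {risk_smoker:.3f}")
```

With a positive coefficient on a risk factor, raising that factor while holding the others fixed raises the predicted probability, which is why the smoker's predicted risk is higher in this sketch.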

In many situations, a dataset is handed to you and you are tasked with discovering which variables are important. But for the Framingham Heart Study, the researchers had to collect data from patients. In a situation like this one, where data needs to be collected by the researchers, should the potential risk factors be defined before or after the data is collected?

Explanation

The researchers should first hypothesize potential risk factors, and then collect data corresponding to those risk factors. Of course, they could always define more risk factors later and collect more data, but this data would take longer to collect.

Quick Question

In the previous video, we computed the following confusion matrix for our logistic regression model on our test set with a threshold of 0.5:

            Predicted FALSE   Predicted TRUE
Actual 0               1069                6
Actual 1                187               11

Using this confusion matrix, answer the following questions.

What is the sensitivity of our logistic regression model on the test set, using a threshold of 0.5?

What is the specificity of our logistic regression model on the test set, using a threshold of 0.5?

Explanation

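Using this confusion matrix, we can compute that the sensitivity (true positives divided by all actual positives) is 11/(11+187) ≈ 0.0556 and the specificity (true negatives divided by all actual negatives) is 1069/(1069+6) ≈ 0.9944.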
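The same arithmetic can be checked with a short script. This is a sketch in Python rather than the R used in the lecture; the counts are taken from the confusion matrix above, and the variable names are chosen here for readability.

```python
# Confusion matrix from the test set at a threshold of 0.5.
# Rows are actual outcomes; columns are predictions (FALSE, TRUE).
true_negatives, false_positives = 1069, 6   # actual 0
false_negatives, true_positives = 187, 11   # actual 1

# Sensitivity: share of actual positives that the model flags.
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: share of actual negatives that the model clears.
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity = {sensitivity:.4f}")  # 0.0556
print(f"specificity = {specificity:.4f}")  # 0.9944
```

At this threshold the model almost never raises a false alarm, but it catches only about 1 in 18 of the patients who actually developed CHD.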

Quick Question

For which of the following models should external validation be used? Consider both the population used to train the model, and the population that the model will be used on. (Select all that apply.)

Explanation

In the first and third models, we are using a special sub-population to build the model. While we can use the model for that sub-population, we should use external validation to test the model on other populations. The second model uses data from a special sub-population, but the model is only intended for that sub-population, so external validation is not necessary.

Quick Question

In Video 3, we built a logistic regression model and found that the following variables were significant (or almost significant) for predicting ten-year risk of CHD: male, age, number of cigarettes per day, whether or not the patient previously had a stroke, whether or not the patient is currently hypertensive, total cholesterol level, systolic blood pressure, and blood glucose level. Which one of the following variables would be the most dramatically affected by a behavioral intervention? HINT: Think about how much control the patient has over each of the variables.

Explanation

The number of cigarettes smoked per day would be the most dramatically affected by a behavioral intervention. Of the variables listed, it is the one over which the patient has the most control.

Video 3: A Logistic Regression Model

In this video, we’ll use the dataset framingham (CSV) to build a logistic regression model. Please download this dataset to follow along. This data comes from the BioLINCC website.

An R script file with all of the commands used in this lecture can be downloaded here: Unit3_Framingham (R).

Video 4: Validating the Model

In this video, we mention that the Framingham Risk Model was tested on diverse cohorts. The original Framingham Risk Model was actually computed by a different sort of regression, called a Cox Proportional Hazards Model. This method is different from, but related to, logistic regression, and it returns a similar estimate of 10-year CHD risk.
