15.071 | Spring 2017 | Graduate

The Analytics Edge

2 Linear Regression

2.2 The Statistical Sommelier: An Introduction to Linear Regression

Quick Question

Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply.

Explanation The formula for R² is R² = 1 - SSE/SST, where SSE is the sum of squared errors of the model's predictions on the test set and SST is the sum of squared errors of the baseline prediction, which uses the average value of the dependent variable on the training set. Since SSE and SST are sums of squared terms, both are non-negative, so SSE/SST must be greater than or equal to zero and R² can never exceed 1. This means it is not possible to have an out-of-sample R² value of 2.4. However, all other values are valid (even the negative ones!): on the test set, SSE can be smaller or larger than SST, because the model was fit to the training data rather than to the test data.
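To see how a negative out-of-sample R² can arise, here is a minimal R sketch. The training mean, test values, and predictions are made-up numbers chosen for illustration; they are not taken from the wine data.

```r
# Hypothetical numbers for illustration only (not from the wine data).
train_mean  <- 4            # baseline: mean of the dependent variable on the training set
test_actual <- c(1, 2, 3)   # observed values on the test set
test_pred   <- c(5, 6, 7)   # the model's predictions on the test set

SSE <- sum((test_actual - test_pred)^2)    # squared error of the model on the test set
SST <- sum((test_actual - train_mean)^2)   # squared error of the baseline on the test set
R2_test <- 1 - SSE / SST

R2_test   # negative here, because the model does worse than the baseline
```

Because SSE/SST is never negative, no choice of numbers in this sketch can push R2_test above 1.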

Quick Question

The plots below show the relationship between two of the independent variables considered by Ashenfelter and the price of wine.

Plot of price vs. average growing season temperature. Plot of price vs. harvest rainfall.

 

What is the correct relationship between harvest rain, average growing season temperature, and wine prices?

 
 
 
 

Explanation The plots show a positive trend between average growing season temperature and the wine price. While the trend is less clear between harvest rain and price, there is a slight negative association.

Continue: Video 2: One-Variable Linear Regression

Quick Question

 

The following figure shows three data points and the best fit line y = 3x + 2.

The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.

Figure showing three data points and the best fit line.

Please answer the following questions using this figure.

What is the baseline prediction?


Explanation

The baseline prediction is the average value of the dependent variable. Since our dependent variable takes values 2, 2, and 8 in our data set, the average is (2+2+8)/3 = 4.

What is the Sum of Squared Errors (SSE) ?


Explanation

The SSE is computed by summing the squared errors between the actual values and our predictions. For each value of the independent variable (x), our best fit line makes the following predictions:

If x = 0, y = 3(0) + 2 = 2,

If x = 1, y = 3(1) + 2 = 5.

Thus we make an error of 0 for the data point (0,2), an error of 3 for the data point (1,2), and an error of 3 for the data point (1,8). So we have

SSE = 0² + 3² + 3² = 18.

What is the Total Sum of Squares (SST) ?


Explanation

The SST is computed by summing the squared errors between the actual values and the baseline prediction. From the first question, we computed the baseline prediction to be 4. Thus the SST is:

SST = (2 - 4)² + (2 - 4)² + (8 - 4)² = 24.

What is the R² of the model?


Explanation

The R² formula is:

R² = 1 - SSE/SST

Thus using our answers to the previous questions, we have that

R² = 1 - 18/24 = 0.25.
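The four answers above can be reproduced in R. This is a small sketch that uses only the three data points and the best fit line from the figure; the variable names are chosen here for illustration.

```r
# The three data points from the figure: (0,2), (1,2), (1,8).
x <- c(0, 1, 1)
y <- c(2, 2, 8)

baseline <- mean(y)            # baseline prediction: average of the dependent variable
pred <- 3 * x + 2              # predictions from the best fit line y = 3x + 2

SSE <- sum((y - pred)^2)       # sum of squared errors of the model
SST <- sum((y - baseline)^2)   # total sum of squares around the baseline
R2  <- 1 - SSE / SST

c(baseline = baseline, SSE = SSE, SST = SST, R2 = R2)

# Cross-check: lm() fits the same line and reports the same R-squared.
fit <- lm(y ~ x)
coef(fit)
summary(fit)$r.squared
```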


Quick Question

Suppose we add another variable, Average Winter Temperature, to our model to predict wine price. Is it possible for the model’s R² value to go down from 0.83 to 0.80?

Explanation The model's R² value can never decrease when new variables are added to the model. It is always possible to set the coefficient of the new variable to zero, which reproduces the old model exactly, so the new model can always do at least as well as the old one. Linear regression picks the coefficients that minimize the SSE, which is the same as maximizing R² on the training data, so the new coefficient will be non-zero only if that lowers the SSE and therefore raises R².
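This property can be checked on simulated data. The sketch below does not use the wine dataset; the variables x1, noise, and y are invented for the demonstration, and noise is generated independently of y.

```r
set.seed(1)                     # arbitrary seed so the simulated data is reproducible
n     <- 25
x1    <- rnorm(n)
y     <- 1 + 2 * x1 + rnorm(n)  # y depends on x1 only
noise <- rnorm(n)               # unrelated to y by construction

r2_small <- summary(lm(y ~ x1))$r.squared
r2_large <- summary(lm(y ~ x1 + noise))$r.squared

c(r2_small = r2_small, r2_large = r2_large)
r2_large >= r2_small            # adding a variable does not lower the training R-squared
```

Even a variable with no real relationship to the outcome leaves the training R² unchanged or slightly higher, which is why a higher R² alone does not show that the added variable is useful.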

Quick Question

In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:

What is the “Multiple R-squared” value of your model?


What is the coefficient for HarvestRain?


What is the intercept coefficient?


Explanation

In R, first load the data (for example, wine = read.csv("wine.csv")), then create the model by typing the following line into your R console:

modelQQ4 = lm(Price ~ HarvestRain + WinterRain, data=wine)

Then, look at the output of summary(modelQQ4). The Multiple R-squared is listed at the bottom of the output, and the coefficients can be found in the coefficients table.
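The same numbers can also be pulled out of the fitted model directly rather than read off the printed summary. The sketch below assumes wine.csv is in the current working directory; fit_stats is a hypothetical helper name introduced here, not part of the course script.

```r
# Hypothetical helper: collect the headline numbers from a fitted lm object.
fit_stats <- function(model) {
  list(
    r_squared    = summary(model)$r.squared,  # the "Multiple R-squared" value
    coefficients = coef(model)                # intercept and slopes, by name
  )
}

# Assumes wine.csv (from the course page) is in the current working directory.
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  modelQQ4 <- lm(Price ~ HarvestRain + WinterRain, data = wine)
  print(fit_stats(modelQQ4))
}
```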


Quick Question

Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:

Is the coefficient for HarvestRain significant?

 
 
 

Explanation You can create the model and look at the summary output with the following commands:

model = lm(Price ~ WinterRain + HarvestRain, data=wine)

summary(model)

From the summary output, you can see that HarvestRain is significant (two stars), but WinterRain is not (no stars).
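The stars in the summary output are shorthand for the p-value in the Pr(>|t|) column of the coefficients table. As a rough sketch, the p-values can be extracted and mapped to the same star codes that R prints in its "Signif. codes" legend. The helper name significance_table is hypothetical, and the usage lines assume wine.csv is in the working directory.

```r
# Hypothetical helper: p-value and significance stars for each coefficient.
significance_table <- function(model) {
  p <- coef(summary(model))[, "Pr(>|t|)"]   # p-values from the coefficients table
  # Cutpoints follow R's legend: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  stars <- cut(p,
               breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
               labels = c("***", "**", "*", ".", " "),
               include.lowest = TRUE)
  data.frame(p_value = p, stars = as.character(stars), row.names = names(p))
}

# Assumes wine.csv (from the course page) is in the current working directory.
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  model <- lm(Price ~ WinterRain + HarvestRain, data = wine)
  print(significance_table(model))
}
```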

Is the coefficient for WinterRain significant?

 
 
 


Note that you will need to answer both questions before checking your answers.

Video 4: Linear Regression in R

Before starting this video, please download the datasets wine (CSV) and wine_test (CSV). Save them to a folder on your computer that you will remember, and in R, navigate to this folder (File->Change dir… on a PC, and Misc->Change Working Directory on a Mac). This data comes from Liquid Assets.

A script file containing all of the R commands used in this lecture can be downloaded here: Unit2_WineRegression (R).
