Video 1: Predicting the Quality of Wine
The slides from all videos in this Lecture Sequence can be downloaded here: Introduction to Linear Regression (PDF - 1.3MB).
Which of the following are NOT valid values for an out-of-sample (test set) R² ? Select all that apply.
Explanation The formula for R² is R² = 1 - SSE/SST, where SST is calculated using the average value of the dependent variable on the training set. Since SSE and SST are sums of squared terms, neither can be negative, so SSE/SST must be greater than or equal to zero and R² can be at most 1. This means it is not possible to have an out-of-sample R² value of 2.4. However, all other values are valid (even the negative ones!), since SSE can be more or less than SST on the test set: this is an out-of-sample R², not the model's R² on the training set.
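The calculation behind this answer can be sketched in R. The function name out_of_sample_r2 and all numbers below are made up for illustration; only the formula and the use of the training-set mean as the baseline come from the explanation above.

```r
# Out-of-sample R^2: SST uses the TRAINING-set mean as the baseline.
out_of_sample_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - train_mean)^2)
  1 - sse / sst
}

# Illustrative (made-up) test-set values, with an assumed training mean of 5.
actual <- c(4, 6, 5, 7)
good_predictions <- c(4.5, 5.5, 5, 6.5)  # close to the actual values
bad_predictions <- c(8, 2, 9, 3)         # worse than predicting the baseline

r2_good <- out_of_sample_r2(actual, good_predictions, train_mean = 5)
r2_bad <- out_of_sample_r2(actual, bad_predictions, train_mean = 5)

print(r2_good)  # between 0 and 1
print(r2_bad)   # negative, because SSE exceeds SST
```

Predictions that do worse than the baseline give a negative value, but no set of predictions can push the result above 1.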
The plots below show the relationship between two of the independent variables considered by Ashenfelter and the price of wine.
What is the correct relationship between harvest rain, average growing season temperature, and wine prices?
Explanation The plots show a positive trend between average growing season temperature and the wine price. While the trend is less clear between harvest rain and price, there is a slight negative association.
The following figure shows three data points and the best fit line y = 3x + 2.
The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.
Please answer the following questions using this figure.
What is the baseline prediction?
Exercise 1
Numerical Response
Explanation
The baseline prediction is the average value of the dependent variable. Since our dependent variable takes values 2, 2, and 8 in our data set, the average is (2+2+8)/3 = 4.
What is the Sum of Squared Errors (SSE) ?
Exercise 2
Numerical Response
Explanation
The SSE is computed by summing the squared errors between the actual values and our predictions. For each value of the independent variable (x), our best fit line makes the following predictions:
If x = 0, y = 3(0) + 2 = 2,
If x = 1, y = 3(1) + 2 = 5.
Thus we make an error of 0 for the data point (0,2), an error of 3 for the data point (1,2), and an error of 3 for the data point (1,8). So we have
SSE = 0² + 3² + 3² = 18.
What is the Total Sum of Squares (SST) ?
Exercise 3
Numerical Response
Explanation
The SST is computed by summing the squared errors between the actual values and the baseline prediction. From the first question, we computed the baseline prediction to be 4. Thus the SST is:
SST = (2 - 4)² + (2 - 4)² + (8 - 4)² = 24.
What is the R² of the model?
Exercise 4
Numerical Response
Explanation
The R² formula is:
R² = 1 - SSE/SST
Thus using our answers to the previous questions, we have that
R² = 1 - 18/24 = 0.25.
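The four answers above can be reproduced in a few lines of R. This is a minimal sketch using the three data points from the explanations, (0,2), (1,2), and (1,8); the variable names are chosen here for illustration. The last lines cross-check the hand calculation against lm().

```r
# The three data points from the figure: (0, 2), (1, 2), (1, 8).
x <- c(0, 1, 1)
y <- c(2, 2, 8)

baseline <- mean(y)            # baseline prediction: average of y
predicted <- 3 * x + 2         # predictions from the best fit line y = 3x + 2
sse <- sum((y - predicted)^2)  # Sum of Squared Errors
sst <- sum((y - baseline)^2)   # Total Sum of Squares
r2 <- 1 - sse / sst

print(c(baseline = baseline, SSE = sse, SST = sst, R2 = r2))

# Cross-check: fitting the same points with lm() should give the same
# line (intercept 2, slope 3) and the same R^2.
fit <- lm(y ~ x)
print(coef(fit))
print(summary(fit)$r.squared)
```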
Suppose we add another variable, Average Winter Temperature, to our model to predict wine price. Is it possible for the model’s R² value to go down from 0.83 to 0.80?
Explanation The model's R² value can never decrease when new variables are added to the model. It is always possible to set the coefficient for the new variable to zero in the new model, which gives exactly the old model. Linear regression picks the coefficients that minimize the sum of squared errors, which is the same as maximizing the R², so the coefficient will be non-zero only if that improves the R² value of the model.
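This property can be checked on simulated data. The sketch below does not use the wine data: the variables x1, noise_var, and y are randomly generated for illustration, and noise_var is unrelated to y by construction. The R² compared here is the one reported by summary(), computed on the same data used to fit the model.

```r
set.seed(1)
n <- 30
x1 <- rnorm(n)
noise_var <- rnorm(n)        # generated independently of y
y <- 1 + 2 * x1 + rnorm(n)

small_model <- lm(y ~ x1)
large_model <- lm(y ~ x1 + noise_var)

r2_small <- summary(small_model)$r.squared
r2_large <- summary(large_model)$r.squared

print(c(small = r2_small, large = r2_large))
print(r2_large >= r2_small)  # adding a variable does not lower R^2
```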
In R, use the dataset wine (CSV) to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:
What is the “Multiple R-squared” value of your model?
Exercise 1
Numerical Response
What is the coefficient for HarvestRain?
Exercise 2
Numerical Response
What is the intercept coefficient?
Exercise 3
Numerical Response
Explanation
In R, create the model by typing the following line into your R console:
modelQQ4 = lm(Price ~ HarvestRain + WinterRain, data=wine)
Then, look at the output of summary(modelQQ4). The Multiple R-squared is listed at the bottom of the output, and the coefficients can be found in the coefficients table.
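The same values can also be pulled out of the model object directly instead of being read from the printed summary. This is a sketch only: fit_rain_model is a hypothetical helper name, and the code assumes wine.csv sits in the current working directory and has the columns Price, HarvestRain, and WinterRain.

```r
# Hypothetical helper: fit the quick-question model and return the values
# the questions ask about. Expects columns Price, HarvestRain, WinterRain.
fit_rain_model <- function(wine) {
  model <- lm(Price ~ HarvestRain + WinterRain, data = wine)
  list(
    model = model,
    r_squared = summary(model)$r.squared,  # "Multiple R-squared"
    coefficients = coef(model)             # named vector of estimates
  )
}

# Runs only if wine.csv is present in the working directory.
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  result <- fit_rain_model(wine)
  print(result$r_squared)
  print(result$coefficients["HarvestRain"])
  print(result$coefficients["(Intercept)"])
}
```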
Use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables, like you did in the previous quick question. Using the summary output of this model, answer the following questions:
Is the coefficient for HarvestRain significant?
Explanation You can create the model and look at the summary output with the following commands:
model = lm(Price ~ WinterRain + HarvestRain, data=wine)
summary(model)
From the summary output, you can see that HarvestRain is significant (two stars), but WinterRain is not (no stars).
Is the coefficient for WinterRain significant?
Note that you will need to answer both questions before checking your answers.
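The stars in the summary output are shorthand for p-value ranges (three stars for p below 0.001, two for p below 0.01, one for p below 0.05), so the p-values themselves can be inspected as well. The sketch below uses a hypothetical helper, coef_p_values, and assumes wine.csv is in the working directory; the 0.05 cutoff is one common choice, not something fixed by the lecture.

```r
# Hypothetical helper: p-values from the coefficients table of summary().
coef_p_values <- function(model) {
  summary(model)$coefficients[, "Pr(>|t|)"]
}

# Runs only if wine.csv is present in the working directory.
if (file.exists("wine.csv")) {
  wine <- read.csv("wine.csv")
  model <- lm(Price ~ WinterRain + HarvestRain, data = wine)
  p <- coef_p_values(model)
  print(p)
  print(p < 0.05)  # TRUE marks coefficients significant at the 5% level
}
```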
Using the data set wine (CSV), what is the correlation between HarvestRain and WinterRain?
Exercise 1
Numerical Response
Explanation
You can compute the correlation between HarvestRain and WinterRain by typing the following command into your R console:
cor(wine$HarvestRain, wine$WinterRain)
Before starting this video, please download the datasets wine (CSV) and wine_test (CSV). Save them to a folder on your computer that you will remember, and in R, navigate to this folder (File->Change dir… on a PC, and Misc->Change Working Directory on a Mac). This data comes from Liquid Assets.
A script file containing all of the R commands used in this lecture can be downloaded here: Unit2_WineRegression (R).