15.071 | Spring 2017 | Graduate

The Analytics Edge

2 Linear Regression

2.5 Assignment 2

Climate Change

There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file climate_change.csv contains climate data from May 1983 to December 2008. The available variables include:

 

  • Year: the observation year.
  • Month: the observation month.
  • Temp: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.
  • CO2, N2O, CH4, CFC.11, CFC.12: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.
  • CO2, N2O and CH4 are expressed in ppmv (parts per million by volume  – i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere)
  • CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume). 
  • Aerosols: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.
  • TSI: the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.
     
  • MEI: multivariate El Niño Southern Oscillation index (MEI), a measure of the strength of the El Niño/La Niña-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.

Problem 1.1 - Creating Our First Model

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.

Then, split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.

Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables (Year and Month should NOT be used in the model). Use the training set to build the model.

Enter the model R2 (the “Multiple R-squared” value):

Exercise 1

 Numerical Response 

 

Explanation

First, read in the data and split it using the subset command:

climate = read.csv("climate_change.csv")

train = subset(climate, Year <= 2006)

test = subset(climate, Year > 2006)

Then, you can create the model using the command:

climatelm = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)

Lastly, look at the model using summary(climatelm). The Multiple R-squared value is 0.7509.


Problem 1.2 - Creating Our First Model

Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05. (Select all that apply.)

Exercise 2

 MEI 

 CO2 

 CH4 

 N2O 

 CFC.11 

 CFC.12 

 TSI 

 Aerosols 

 

Explanation

If you look at the model we created in the previous problem using summary(climatelm), all of the variables have at least one star except for CH4 and N2O. So MEI, CO2, CFC.11, CFC.12, TSI, and Aerosols are all significant.


Problem 2.1 - Understanding the Model

Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative, indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

Which of the following is the simplest correct explanation for this contradiction?

Exercise 3

 Climate scientists are wrong that N2O and CFC-11 are greenhouse gases - this regression analysis constitutes part of a disproof. 

 There is not enough data, so the regression coefficients being estimated are not accurate. 

 All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set. 

Explanation

The linear correlation of N2O and CFC.11 with other variables in the data set is quite large. The first explanation does not seem correct, as the warming effects of nitrous oxide and CFC-11 are well documented, and our regression analysis is not enough to disprove them. The second explanation is unlikely, as we have estimated eight coefficients and the intercept from 284 observations.


Problem 2.2 - Understanding the Model

Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.

Exercise 4

 MEI 

 CO2 

 CH4 

 CFC.11 

 CFC.12 

 Aerosols 

 TSI 

 

Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.

Exercise 5

 MEI 

 CO2 

 CH4 

 N2O 

 CFC.12 

 Aerosols 

 TSI 

 

Explanation

You can calculate all correlations at once using cor(train) where train is the name of the training data set.
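As a sketch of how to scan the correlation matrix programmatically (shown here on a small synthetic data frame; with the assignment data you would pass train to cor() instead), the following lists the columns whose absolute correlation with a given variable exceeds 0.7:

```r
# Synthetic stand-in for the training set: A and B move together, C is noise
set.seed(1)
x <- 1:100
df <- data.frame(A = x + rnorm(100),
                 B = 2 * x + rnorm(100),
                 C = rnorm(100))

cors <- cor(df)

# Columns whose absolute correlation with A exceeds 0.7, excluding A itself
high <- setdiff(names(which(abs(cors[, "A"]) > 0.7)), "A")
high
# "B"
```

The same pattern applied to the climate data, e.g. names(which(abs(cor(train)[, "N2O"]) > 0.7)), answers the questions above directly.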


Problem 3 - Simplifying the Model

Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

Enter the coefficient of N2O in this reduced model:

Exercise 6

 Numerical Response 

 

(How does this compare to the coefficient in the previous model with all of the variables?)

Enter the model R2:

Exercise 7

 Numerical Response 

 

Explanation

We can create this simplified model with the command:

LinReg = lm(Temp ~ MEI + N2O + TSI + Aerosols, data=train)

You can get the coefficient for N2O and the model R-squared by typing summary(LinReg).

We have observed that, for this problem, when we remove many of the highly correlated variables, the sign of the N2O coefficient flips. The model has not lost much explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. As discussed in lecture, this type of behavior is typical when building a model in which many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.


Detecting Flu Epidemics via Search Engine Query Data 

Flu epidemics constitute a major public health concern, causing respiratory illnesses, hospitalizations, and deaths. According to the National Vital Statistics Reports published in October 2012, influenza ranked as the eighth leading cause of death in 2011 in the United States. Each year, 250,000 to 500,000 deaths are attributed to influenza-related diseases throughout the world.

The U.S. Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS) detect influenza activity through virologic and clinical data, including Influenza-like Illness (ILI) physician visits. National and regional data, however, are published with a 1-2 week lag.

The Google Flu Trends project was initiated to see if faster reporting can be made possible by considering flu-related online search queries – data that is available almost immediately.

 

Problem 1.1 - Understanding the Data

We would like to estimate influenza-like illness (ILI) activity using Google web search logs. Fortunately, one can easily access this data online:

ILI Data - The CDC publishes on its website the official regional and state-level percentage of patient visits to healthcare providers for ILI purposes on a weekly basis.

 

Google Search Queries - Google Trends allows public retrieval of weekly counts for every query searched by users around the world. For each location, the counts are normalized by dividing the count for each query in a particular week by the total number of online search queries submitted in that location during the week. Then, the values are adjusted to be between 0 and 1.
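One plausible reading of the normalization described above can be sketched with hypothetical weekly counts (the numbers and the scale-by-maximum step here are illustrative, not Google's exact procedure):

```r
# Hypothetical weekly counts of a flu-related query and of all queries
query_counts <- c(120, 300, 80, 450)
total_counts <- c(1.0e6, 1.2e6, 0.9e6, 1.5e6)

frac   <- query_counts / total_counts  # fraction of queries that week
scaled <- frac / max(frac)             # adjusted to lie between 0 and 1
scaled
```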

 

The file FluTrain.csv aggregates this data from January 1, 2004 until December 31, 2011 as follows:

“Week” - The range of dates represented by this observation, in year/month/day format.

“ILI” - This column lists the percentage of ILI-related physician visits for the corresponding week.

“Queries” - This column lists the fraction of queries that are ILI-related for the corresponding week, adjusted to be between 0 and 1 (higher values correspond to more ILI-related search queries).

Before applying analytics tools on the training set, we first need to understand the data at hand. Load “FluTrain.csv” into a data frame called FluTrain. Looking at the time period 2004-2011, which week corresponds to the highest percentage of ILI-related physician visits? Select the day of the month corresponding to the start of this week.

Exercise 1

October

Exercise 2

18

Exercise 3

2009

Explanation

We can limit FluTrain to the observations that obtain the maximum ILI value with subset(FluTrain, ILI == max(ILI)). From here, we can read information about the week at which the maximum was obtained. Alternatively, you can use which.max(FluTrain$ILI) to find the row number corresponding to the observation with the maximum value of ILI, which is 303. Then, you can output the corresponding week using FluTrain$Week[303].

Which week corresponds to the highest percentage of ILI-related query fraction?

Exercise 4

October

Exercise 5

18

Exercise 6

2009


Explanation

We can limit FluTrain to the observations that obtain the maximum Queries value with subset(FluTrain, Queries == max(Queries)). From here, we can read information about the week at which the maximum was obtained. Alternatively, you can use which.max(FluTrain$Queries) to find the row number corresponding to the observation with the maximum value of Queries, which is 303. Then, you can output the corresponding week using FluTrain$Week[303].

Problem 1.2 - Understanding the Data

Let us now understand the data at an aggregate level. Plot the histogram of the dependent variable, ILI. What best describes the distribution of values of ILI?

Exercise 7

 Most of the ILI values are small, with a relatively small number of much larger values (in statistics, this sort of data is called “skew right”).  

 The ILI values are balanced, with equal numbers of unusually large and unusually small values.  

 Most of the ILI values are large, with a relatively small number of much smaller values (in statistics, this sort of data is called “skew left”).  

Explanation

The histogram of ILI can be obtained with hist(FluTrain$ILI). Visually, the data is skew right.


Problem 1.3 - Understanding the Data

When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself – this prevents the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models. In this problem, we will predict the natural log of the ILI variable, which can be computed in R using the log() function.
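The effect of the transform can be sketched on synthetic data (rlnorm draws a right-skewed sample, standing in for the ILI values):

```r
# Synthetic right-skewed sample, similar in shape to the ILI values
set.seed(7)
skewed <- rlnorm(1000)

# For right-skewed data the mean sits well above the median ...
mean(skewed) > median(skewed)
# ... and taking logs makes the distribution roughly symmetric
# (compare hist(skewed) with hist(log(skewed)))
```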

Plot the natural logarithm of ILI versus Queries. What does the plot suggest?

Exercise 8

 There is a negative, linear relationship between log(ILI) and Queries.  

 There is no apparent linear relationship between log(ILI) and Queries.  

 There is a positive, linear relationship between log(ILI) and Queries.  

Explanation

The plot can be obtained with plot(FluTrain$Queries, log(FluTrain$ILI)). Visually, there is a positive, linear relationship between log(ILI) and Queries.


Problem 2.1 - Linear Regression Model

Based on the plot we just made, it seems that a linear regression model could be a good modeling choice. Based on our understanding of the data from the previous subproblem, which model best describes our estimation problem?

Exercise 9

 ILI = intercept + coefficient x Queries, where the coefficient is negative 

 Queries = intercept + coefficient x ILI, where the coefficient is negative 

 ILI = intercept + coefficient x Queries, where the coefficient is positive 

 Queries = intercept + coefficient x ILI, where the coefficient is positive 

 log(ILI) = intercept + coefficient x Queries, where the coefficient is negative 

 Queries = intercept + coefficient x log(ILI), where the coefficient is negative 

 log(ILI) = intercept + coefficient x Queries, where the coefficient is positive 

 Queries = intercept + coefficient x log(ILI), where the coefficient is positive 

Explanation

From the previous subproblem, we are predicting log(ILI) using the Queries variable. From the plot in the previous subproblem, we expect the coefficient on Queries to be positive.


Problem 2.2 - Linear Regression Model

Let’s call the regression model from the previous problem (Problem 2.1) FluTrend1 and run it in R. Hint: to take the logarithm of a variable Var in a regression equation, you simply use log(Var) when specifying the formula to the lm() function.

What is the training set R-squared value for the FluTrend1 model (the “Multiple R-squared”)?

Exercise 10

 Numerical Response 

 

Explanation

The model can be trained with:

FluTrend1 = lm(log(ILI)~Queries, data=FluTrain)

From summary(FluTrend1), we read that the R-squared value is 0.709.

 


Problem 2.3 - Linear Regression Model

For a single variable linear regression model, there is a direct relationship between the R-squared and the correlation between the independent and the dependent variables. What is the relationship we infer from our problem? (Don’t forget that you can use the cor function to compute the correlation between two variables.)

Exercise 11

 R-squared = Correlation^2 

 R-squared = log(1/Correlation) 

 R-squared = exp(-0.5*Correlation) 

Explanation

To test these hypotheses, we first need to compute the correlation between the independent variable used in the model (Queries) and the dependent variable (log(ILI)). This can be done with

Correlation = cor(FluTrain$Queries, log(FluTrain$ILI))

The values of the three expressions are then:

Correlation^2 = 0.7090201

log(1/Correlation) = 0.1719357

exp(-0.5*Correlation) = 0.6563792

It appears that Correlation^2 is equal to the R-squared value. It can be proved that this is always the case.
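This identity is easy to check on simulated data: for any single-variable linear regression, the R-squared reported by summary() matches the squared sample correlation.

```r
# Check that R-squared equals Correlation^2 for a one-variable model
set.seed(42)
x <- rnorm(50)
y <- 3 * x + rnorm(50)

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

stopifnot(abs(r2 - cor(x, y)^2) < 1e-12)  # identical up to rounding
```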

Note that the “exp” function stands for the exponential function. The exponential can be computed in R using the function exp().


Problem 3.1 - Performance on the Test Set

The csv file FluTest.csv provides the 2012 weekly data of the ILI-related search queries and the observed weekly percentage of ILI-related physician visits. Load this data into a data frame called FluTest.

Normally, we would obtain test-set predictions from the model FluTrend1 using the code

PredTest1 = predict(FluTrend1, newdata=FluTest)

However, the dependent variable in our model is log(ILI), so PredTest1 would contain predictions of the log(ILI) value. We are instead interested in obtaining predictions of the ILI value. We can convert from predictions of log(ILI) to predictions of ILI via exponentiation, or the exp() function. The new code, which predicts the ILI value, is

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))

What is our estimate for the percentage of ILI-related physician visits for the week of March 11, 2012? (HINT: You can either just output FluTest$Week to find which element corresponds to March 11, 2012, or you can use the “which” function in R. To learn more about the which function, type “?which” in your R console.)

Exercise 12

 Numerical Response 

 

Explanation

To obtain the predictions, we can run

PredTest1 = exp(predict(FluTrend1, newdata=FluTest))

Next, we need to determine which element in the test set is for March 11, 2012. We can determine this with:

which(FluTest$Week == "2012-03-11 - 2012-03-17")

Now we know we are looking for prediction number 11. This can be accessed with:

PredTest1[11]


Problem 3.2 - Performance on the Test Set

What is the relative error between the estimate (our prediction) and the observed value for the week of March 11, 2012? Note that the relative error is calculated as

(Observed ILI - Estimated ILI)/Observed ILI

Exercise 13

 Numerical Response 

 

Explanation

From the previous problem, we know the predicted value is 2.187378. The actual value is the 11th testing set ILI value or FluTest$ILI[11], which has value 2.293422. Finally we compute the relative error to be (2.293422 - 2.187378)/2.293422.


Problem 3.3 - Performance on the Test Set

What is the Root Mean Square Error (RMSE) between our estimates and the actual observations for the percentage of ILI-related physician visits, on the test set?

Exercise 14

 Numerical Response 

 

Explanation

The RMSE can be calculated by first computing the SSE:

SSE = sum((PredTest1-FluTest$ILI)^2)

and then dividing by the number of observations and taking the square root:

RMSE = sqrt(SSE / nrow(FluTest))

Alternatively, you could use the following command:

sqrt(mean((PredTest1-FluTest$ILI)^2))


Problem 4.1 - Training a Time Series Model

The observations in this dataset are consecutive weekly measurements of the dependent and independent variables. This sort of dataset is called a “time series.” Often, statistical models can be improved by predicting the current value of the dependent variable using the value of the dependent variable from earlier weeks. In our models, this means we will predict the ILI variable in the current week using values of the ILI variable from previous weeks.

First, we need to decide the amount of time to lag the observations. Because the ILI variable is reported with a 1- or 2-week lag, a decision maker cannot rely on the previous week’s ILI value to predict the current week’s value. Instead, the decision maker will only have data available from 2 or more weeks ago. We will build a variable called ILILag2 that contains the ILI value from 2 weeks before the current observation.

To do so, we will use the “zoo” package, which provides a number of helpful methods for time series models. While many functions are built into R, you need to add new packages to use some functions. New packages can be installed and loaded easily in R, and we will do this many times in this class. Run the following two commands to install and load the zoo package. In the first command, you will be prompted to select a CRAN mirror to use for your download. Select a mirror near you geographically.

install.packages("zoo")

library(zoo)

After installing and loading the zoo package, run the following commands to create the ILILag2 variable in the training set:

ILILag2 = lag(zoo(FluTrain$ILI), -2, na.pad=TRUE)

FluTrain$ILILag2 = coredata(ILILag2)

In these commands, the value of -2 passed to lag means to return 2 observations before the current one; a positive value would have returned future observations. The parameter na.pad=TRUE means to add missing values for the first two weeks of our dataset, where we can’t compute the data from 2 weeks earlier.
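What the lag call does can also be reproduced in base R without zoo: shift the series down two positions and pad the front with two NAs. This is shown on a toy vector; with the assignment data, ili would be FluTrain$ILI.

```r
ili <- c(1.5, 2.0, 2.5, 3.0, 3.5)   # toy weekly ILI values

# Drop the last two values and pad the front with two NAs
ili_lag2 <- c(NA, NA, head(ili, -2))
ili_lag2
# NA NA 1.5 2.0 2.5
```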

How many values are missing in the new ILILag2 variable?

Exercise 15

 Numerical Response 

 

Explanation

This can be read from the output of summary(FluTrain$ILILag2).


Problem 4.2 - Training a Time Series Model

Use the plot() function to plot the log of ILILag2 against the log of ILI. Which best describes the relationship between these two variables?

Exercise 16

 There is a strong negative relationship between log(ILILag2) and log(ILI). 

 There is a weak or no relationship between log(ILILag2) and log(ILI). 

 There is a strong positive relationship between log(ILILag2) and log(ILI). 

Explanation

From plot(log(FluTrain$ILILag2), log(FluTrain$ILI)), we observe a strong positive relationship.


Problem 4.3 - Training a Time Series Model

Train a linear regression model on the FluTrain dataset to predict the log of the ILI variable using the Queries variable as well as the log of the ILILag2 variable. Call this model FluTrend2.

Which coefficients are significant at the p=0.05 level in this regression model? (Select all that apply.)

Exercise 17

 Intercept 

 Queries 

 log(ILILag2) 

 

What is the R^2 value of the FluTrend2 model?

Exercise 18

 Numerical Response 

 

Explanation

The following code builds and summarizes the FluTrend2 model:

FluTrend2 = lm(log(ILI)~Queries+log(ILILag2), data=FluTrain)

summary(FluTrend2)

As can be seen, all three coefficients are highly significant, and the R^2 value is 0.9063.


Problem 4.4 - Training a Time Series Model

On the basis of R-squared value and significance of coefficients, which statement is the most accurate?

Exercise 19

 Due to overfitting, FluTrend2 is a weaker model than FluTrend1 on the training set. 

 FluTrend2 is about the same quality as FluTrend1 on the training set. 

 FluTrend2 is a stronger model than FluTrend1 on the training set. 

Explanation

Moving from FluTrend1 to FluTrend2, in-sample R^2 improved from 0.709 to 0.9063, and the new variable is highly significant. As a result, there is no sign of overfitting, and FluTrend2 is superior to FluTrend1 on the training set.


Problem 5.1 - Evaluating the Time Series Model in the Test Set

So far, we have only added the ILILag2 variable to the FluTrain data frame. To make predictions with our FluTrend2 model, we will also need to add ILILag2 to the FluTest data frame (note that adding variables before splitting into a training and testing set can prevent this duplication of effort).

Modify the code from the previous subproblem to add an ILILag2 variable to the FluTest data frame. How many missing values are there in this new variable?

Exercise 20

 Numerical Response 

 

Explanation

We can add the new variable with:

ILILag2 = lag(zoo(FluTest$ILI), -2, na.pad=TRUE)

FluTest$ILILag2 = coredata(ILILag2)

From summary(FluTest$ILILag2), we can see that we’re missing two values of the new variable.


Problem 5.2 - Evaluating the Time Series Model in the Test Set

In this problem, the training and testing sets are split sequentially – the training set contains all observations from 2004-2011 and the testing set contains all observations from 2012. There is no time gap between the two datasets, meaning the first observation in FluTest was recorded one week after the last observation in FluTrain. From this, we can identify how to fill in the missing values for the ILILag2 variable in FluTest.

Which value should be used to fill in the ILILag2 variable for the first observation in FluTest?

Exercise 21

 The ILI value of the second-to-last observation in the FluTrain data frame. 

 The ILI value of the last observation in the FluTrain data frame. 

 The ILI value of the first observation in the FluTest data frame. 

 The ILI value of the second observation in the FluTest data frame. 

Explanation

The time two weeks before the first week of 2012 is the second-to-last week of 2011. This corresponds to the second-to-last observation in FluTrain.

Which value should be used to fill in the ILILag2 variable for the second observation in FluTest?

Exercise 22

 The ILI value of the second-to-last observation in the FluTrain data frame. 

 The ILI value of the last observation in the FluTrain data frame. 

 The ILI value of the first observation in the FluTest data frame. 

 The ILI value of the second observation in the FluTest data frame. 

Explanation

The time two weeks before the second week of 2012 is the last week of 2011. This corresponds to the last observation in FluTrain.


Problem 5.3 - Evaluating the Time Series Model in the Test Set

Fill in the missing values for ILILag2 in FluTest. In terms of syntax, you could set the value of ILILag2 in row “x” of the FluTest data frame to the value of ILI in row “y” of the FluTrain data frame with “FluTest$ILILag2[x] = FluTrain$ILI[y]”. Use the answer to the previous questions to determine the appropriate values of “x” and “y”. It may be helpful to check the total number of rows in FluTrain using str(FluTrain) or nrow(FluTrain).

Explanation

From nrow(FluTrain), we see that there are 417 observations in the training set. Therefore, we need to run the following two commands:

FluTest$ILILag2[1] = FluTrain$ILI[416]

FluTest$ILILag2[2] = FluTrain$ILI[417]

What is the new value of the ILILag2 variable in the first row of FluTest?

Exercise 23

 Numerical Response 

 

Explanation

This can be read from FluTest$ILILag2[1].

What is the new value of the ILILag2 variable in the second row of FluTest?

Exercise 24

 Numerical Response 

 

Explanation

This can be read from FluTest$ILILag2[2].


Problem 5.4 - Evaluating the Time Series Model in the Test Set

Obtain test set predictions of the ILI variable from the FluTrend2 model, again remembering to call the exp() function on the result of the predict() function to obtain predictions for ILI instead of log(ILI).

What is the test-set RMSE of the FluTrend2 model?

Exercise 25

 Numerical Response 

 

Explanation

We can obtain the test-set predictions with:

PredTest2 = exp(predict(FluTrend2, newdata=FluTest))

And then we can compute the RMSE with the following commands:

SSE = sum((PredTest2-FluTest$ILI)^2)

RMSE = sqrt(SSE / nrow(FluTest))

Alternatively, you could use the following command to compute the RMSE:

sqrt(mean((PredTest2-FluTest$ILI)^2))

The test-set RMSE of FluTrend2 is 0.294.


Problem 5.5 - Evaluating the Time Series Model in the Test Set

Which model obtained the best test-set RMSE?

Exercise 26

 FluTrend1 

 FluTrend2 

Explanation

The test-set RMSE of FluTrend2 is 0.294, as opposed to the 0.749 value obtained by the FluTrend1 model.

In this problem, we used a simple time series model with a single lag term. ARIMA models are a more general form of the model we built, which can include multiple lag terms as well as more complicated combinations of previous values of the dependent variable. If you’re interested in learning more, check out “?arima” or the available online tutorials for these sorts of models.
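As a small illustration of the built-in arima() function (run on a simulated series rather than the flu data; with the assignment data the series would be log(FluTrain$ILI)):

```r
# Simulate an AR(2) series and fit an ARIMA(2,0,0) model,
# i.e., a model with two lag terms like the one sketched in this problem
set.seed(123)
y   <- arima.sim(model = list(ar = c(0.5, 0.3)), n = 200)
fit <- arima(y, order = c(2, 0, 0))
coef(fit)   # estimated ar1 and ar2 coefficients plus the intercept
```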


Reading Test Scores

The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets pisa2009train (CSV) and pisa2009test (CSV) contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

male: Whether the student is male (1/0)

raceeth: The race/ethnicity composite of the student

preschool: Whether the student attended preschool (1/0)

expectBachelors: Whether the student expects to obtain a bachelor’s degree (1/0)

motherHS: Whether the student’s mother completed high school (1/0)

motherBachelors: Whether the student’s mother obtained a bachelor’s degree (1/0)

motherWork: Whether the student’s mother has part-time or full-time work (1/0)

fatherHS: Whether the student’s father completed high school (1/0)

fatherBachelors: Whether the student’s father obtained a bachelor’s degree (1/0)

fatherWork: Whether the student’s father has part-time or full-time work (1/0)

selfBornUS: Whether the student was born in the United States of America (1/0)

motherBornUS: Whether the student’s mother was born in the United States of America (1/0)

fatherBornUS: Whether the student’s father was born in the United States of America (1/0)

englishAtHome: Whether the student speaks English at home (1/0)

computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)

read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)

minutesPerWeekEnglish: The number of minutes per week the student spends in English class

studentsInEnglish: The number of students in this student’s English class at school

schoolHasLibrary: Whether this student’s school has a library (1/0)

publicSchool: Whether this student attends a public school (1/0)

urban: Whether this student’s school is in an urban area (1/0)

schoolSize: The number of students in this student’s school

readingScore: The student’s reading score, on a 1000-point scale

 

Problem 1.1 - Dataset size

Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.

How many students are there in the training set?

Exercise 1

 Numerical Response 

 

Explanation

The datasets can be loaded with:

pisaTrain = read.csv("pisa2009train.csv")

pisaTest = read.csv("pisa2009test.csv")

We can then access the number of rows in the training set with str(pisaTrain) or nrow(pisaTrain).


Problem 1.2 - Summarizing the dataset

Using tapply() on pisaTrain, what is the average reading test score of males?

Exercise 2

 Numerical Response 

 

Of females?

Exercise 3

 Numerical Response 

 

Explanation

The correct invocation of tapply() here is:

tapply(pisaTrain$readingScore, pisaTrain$male, mean)


Problem 1.3 - Locating missing values

Which variables are missing data in at least one observation in the training set? Select all that apply.

Exercise 4

 grade 

 male 

 raceeth 

 preschool 

 expectBachelors 

 motherHS 

 motherBachelors 

 motherWork 

 fatherHS 

 fatherBachelors 

 fatherWork 

 selfBornUS 

 motherBornUS 

 fatherBornUS 

 englishAtHome 

 computerForSchoolwork 

 read30MinsADay 

 minutesPerWeekEnglish 

 studentsInEnglish 

 schoolHasLibrary 

 publicSchool 

 urban 

 schoolSize 

 readingScore 

 

Explanation

We can read which variables have missing values from summary(pisaTrain). Because most variables are collected from study participants via survey, it is expected that most questions will have at least one missing value.


Problem 1.4 - Removing missing values

Linear regression discards observations with missing data, so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.

Type the following commands into your R console to remove observations with any missing value from pisaTrain and pisaTest:

pisaTrain = na.omit(pisaTrain)

pisaTest = na.omit(pisaTest)
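As a quick toy illustration of what na.omit() does (synthetic data, not the PISA files):

```r
# Synthetic example: na.omit() drops every row that contains at least one NA.
df <- data.frame(score = c(500, NA, 620), grade = c(10, 11, NA))

na.omit(df)        # keeps only row 1, the only fully observed row
nrow(na.omit(df))  # 1
```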

How many observations are now in the training set?

Exercise 5

 Numerical Response 

 

How many observations are now in the testing set?

Exercise 6

 Numerical Response 

 

Explanation

After running the provided commands we can use str(pisaTrain) and str(pisaTest), or nrow(pisaTrain) and nrow(pisaTest), to check the new number of rows in the datasets.


Problem 2.1 - Factor variables

Factor variables are variables that take on a discrete set of values, like the “Region” variable in the WHO dataset from the second lecture of Unit 1. This is an unordered factor because there isn’t any natural ordering between the levels. An ordered factor has a natural ordering between the levels (an example would be the classifications “large,” “medium,” and “small”).
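The distinction can be seen directly in R; the variable names below come from this dataset, but the values are made up for illustration:

```r
# Unordered factor: the levels have no inherent order.
raceeth <- factor(c("White", "Asian", "Black"))
is.ordered(raceeth)   # FALSE

# Ordered factor: the levels have a natural order, so comparisons are meaningful.
grade <- factor(c(9, 11, 10), levels = 8:12, ordered = TRUE)
is.ordered(grade)     # TRUE
grade[1] < grade[2]   # TRUE (grade 9 comes before grade 11)
```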

Which of the following variables is an unordered factor with at least 3 levels? (Select all that apply.)

Exercise 7

 grade 

 male 

 raceeth 

 

Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)

Exercise 8

 grade 

 male 

 raceeth 

 

Explanation

The variable male has only 2 levels (1 and 0). There is no natural ordering between the different values of raceeth, so it is an unordered factor. Meanwhile, grades can be ordered (8, 9, 10, 11, 12), so grade is an ordered factor.


Problem 2.2 - Unordered factors in regression models

To include unordered factors in a linear regression model, we define one level as the “reference level” and add a binary variable for each of the remaining levels. In this way, a factor with n levels is replaced by n-1 binary variables. The reference level is typically selected to be the most frequently occurring level in the dataset.

As an example, consider the unordered factor variable “color”, with levels “red”, “green”, and “blue”. If “green” were the reference level, then we would add binary variables “colorred” and “colorblue” to a linear regression problem. All red examples would have colorred=1 and colorblue=0. All blue examples would have colorred=0 and colorblue=1. All green examples would have colorred=0 and colorblue=0.
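A minimal sketch of this coding in R, using the hypothetical "color" factor from the paragraph above (model.matrix() shows the binary variables that lm() would generate):

```r
color <- factor(c("red", "green", "blue", "green"))
color <- relevel(color, ref = "green")   # make "green" the reference level

model.matrix(~ color)
#   (Intercept) colorblue colorred
# 1           1         0        1   # red
# 2           1         0        0   # green (reference: all zeros)
# 3           1         1        0   # blue
# 4           1         0        0   # green
```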

Now, consider the variable “raceeth” in our problem, which has levels “American Indian/Alaska Native”, “Asian”, “Black”, “Hispanic”, “More than one race”, “Native Hawaiian/Other Pacific Islander”, and “White”. Because it is the most common in our population, we will select White as the reference level.

Which binary variables will be included in the regression model? (Select all that apply.)

Exercise 9

 raceethAmerican Indian/Alaska Native 

 raceethAsian 

 raceethBlack 

 raceethHispanic 

 raceethMore than one race 

 raceethNative Hawaiian/Other Pacific Islander 

 raceethWhite 

 

Explanation

We create a binary variable for each level except the reference level, so we would create all these variables except for raceethWhite.


Problem 2.3 - Example unordered factors

Consider again adding our unordered factor race to the regression model with reference level “White”.

For a student who is Asian, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)

Exercise 10

 raceethAmerican Indian/Alaska Native 

 raceethAsian 

 raceethBlack 

 raceethHispanic 

 raceethMore than one race 

 raceethNative Hawaiian/Other Pacific Islander 

 

For a student who is white, which binary variables would be set to 0? All remaining variables will be set to 1. (Select all that apply.)

Exercise 11

 raceethAmerican Indian/Alaska Native 

 raceethAsian 

 raceethBlack 

 raceethHispanic 

 raceethMore than one race 

 raceethNative Hawaiian/Other Pacific Islander 

 

Explanation

An Asian student will have raceethAsian set to 1 and all other raceeth binary variables set to 0. Because “White” is the reference level, a white student will have all raceeth binary variables set to 0.


Problem 3.1 - Building a model

Because the race variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv() – you can see this when you run str(pisaTrain) or str(pisaTest). However, by default R selects the first level alphabetically (“American Indian/Alaska Native”) as the reference level of our factor instead of the most common level (“White”). Set the reference level of the factor by typing the following two lines in your R console:

pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")

pisaTest$raceeth = relevel(pisaTest$raceeth, "White")

Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

It would be time-consuming to type all the variables, but R provides the shorthand notation “readingScore ~ .” to mean “predict readingScore using all the other variables in the data frame.” The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called “Y”, your independent variables are called “X1”, “X2”, and “X3”, and your training data set is called “Train”, instead of the regular notation:

LinReg = lm(Y ~ X1 + X2 + X3, data = Train)

You would use the following command to build your model:

LinReg = lm(Y ~ ., data = Train)

What is the Multiple R-squared value of lmScore on the training set?

Exercise 12

 Numerical Response 

 

Explanation

We can train the model with:

lmScore = lm(readingScore~., data=pisaTrain)

We can then read the training set R^2 from the “Multiple R-squared” value of summary(lmScore).

Note that this R-squared is lower than the ones for the models we saw in the lectures and recitation. This does not necessarily imply that the model is of poor quality. More often than not, it simply means that the prediction problem at hand (predicting a student’s test score based on demographic and school-related variables) is more difficult than other prediction problems (like predicting a team’s number of wins from their runs scored and allowed, or predicting the quality of wine from weather conditions).


Problem 3.2 - Computing the root-mean squared error of the model

What is the training-set root-mean squared error (RMSE) of lmScore?

Exercise 13

 Numerical Response 

 

Explanation

The training-set RMSE can be computed by first computing the SSE:

SSE = sum(lmScore$residuals^2)

and then dividing by the number of observations and taking the square root:

RMSE = sqrt(SSE / nrow(pisaTrain))

An alternative way of getting this answer is:

sqrt(mean(lmScore$residuals^2))
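The equivalence of the two computations can be checked on any fitted model; here is a self-contained sketch using R's built-in mtcars data rather than the PISA data:

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)   # any linear model will do
SSE  <- sum(fit$residuals^2)
RMSE <- sqrt(SSE / nrow(mtcars))

all.equal(RMSE, sqrt(mean(fit$residuals^2)))   # TRUE: the two formulas agree
```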


Problem 3.3 - Comparing predictions for similar students

Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?

Exercise 14

 -59.09 

 -29.54 

 0 

 29.54 

 59.09 

 The difference cannot be determined without more information about the two students 

Explanation

The coefficient 29.54 on grade is the difference in reading score between two students who are identical other than having a difference in grade of 1. Because A and B have a difference in grade of 2, the model predicts that student A has a reading score that is 2*29.54 larger.


Problem 3.4 - Interpreting model coefficients

What is the meaning of the coefficient associated with variable raceethAsian?

Exercise 15

 Predicted average reading score of an Asian student 

 Difference between the average reading score of an Asian student and the average reading score of a white student 

 Difference between the average reading score of an Asian student and the average reading score of all the students in the dataset 

 Predicted difference in the reading score between an Asian student and a white student who is otherwise identical 

Explanation

The only difference between an Asian student and white student with otherwise identical variables is that the former has raceethAsian=1 and the latter has raceethAsian=0. The predicted reading score for these two students will differ by the coefficient on the variable raceethAsian.


Problem 3.5 - Identifying variables lacking statistical significance

Based on the significance codes, which variables are candidates for removal from the model? Select all that apply. (We’ll assume that the factor variable raceeth should only be removed if none of its levels are significant.)

Exercise 16

 grade 

 male 

 raceeth 

 preschool 

 expectBachelors 

 motherHS 

 motherBachelors 

 motherWork 

 fatherHS 

 fatherBachelors 

 fatherWork 

 selfBornUS 

 motherBornUS 

 fatherBornUS 

 englishAtHome 

 computerForSchoolwork 

 read30MinsADay 

 minutesPerWeekEnglish 

 studentsInEnglish 

 schoolHasLibrary 

 publicSchool 

 urban 

 schoolSize 

 

Explanation

From summary(lmScore), we can see which variables were significant at the 0.05 level. Because several of the binary variables generated from the race factor variable are significant, we should not remove this variable.


Problem 4.1 - Predicting on unseen data

Using the “predict” function and supplying the “newdata” argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions “predTest”. Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.

What is the range between the maximum and minimum predicted reading score on the test set?

Exercise 17

 Numerical Response 

 

Explanation

We can obtain the predictions with:

predTest = predict(lmScore, newdata=pisaTest)

From summary(predTest), we see that the maximum predicted reading score is 637.7, and the minimum predicted score is 353.2. Therefore, the range is 284.5.


Problem 4.2 - Test set SSE and RMSE

What is the sum of squared errors (SSE) of lmScore on the testing set?

Exercise 18

 Numerical Response 

 

Explanation

This can be calculated with sum((predTest-pisaTest$readingScore)^2).

What is the root-mean squared error (RMSE) of lmScore on the testing set?

Exercise 19

 Numerical Response 

 

Explanation

This can be calculated with sqrt(mean((predTest-pisaTest$readingScore)^2)).


Problem 4.3 - Baseline prediction and test-set SSE

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

Exercise 20

 Numerical Response 

 

Explanation

This can be computed with:

baseline = mean(pisaTrain$readingScore)

What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).

Exercise 21

 Numerical Response 

 

Explanation

This can be computed with sum((baseline-pisaTest$readingScore)^2).


Problem 4.4 - Test-set R-squared

What is the test-set R-squared value of lmScore?

Exercise 22

 Numerical Response 

 

Explanation

The test-set R^2 is defined as 1-SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model. For this model, the R^2 is then computed to be 1-5762082/7802354.
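As a sketch, the formula can be applied to any vectors of predictions and actual values (the numbers below are made up for illustration, not taken from the assignment):

```r
actual   <- c(500, 420, 610, 480)   # hypothetical test-set reading scores
pred     <- c(510, 430, 580, 500)   # hypothetical model predictions
baseline <- 475                     # mean of a hypothetical training set

SSE <- sum((pred - actual)^2)       # 1500
SST <- sum((baseline - actual)^2)   # 21900
R2  <- 1 - SSE / SST                # about 0.93
```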


State Data

We often take data for granted. However, one of the hardest parts about analyzing a problem you’re interested in can be to find good data to answer the questions you want to ask. As you’re learning R, though, there are many datasets that R has built in that you can take advantage of.

In this problem, we will be examining the “state” dataset, which has data from the 1970s on all fifty US states. For each state, the dataset includes the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation.

Load the dataset and convert it to a data frame by running the following two commands in R:

data(state)

statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center,  state.division, state.name, state.region)

If you can’t access the state dataset in R, here is a CSV file with the same data that you can load into R using the read.csv function: statedata (CSV).

After you have loaded the data into R, inspect the data set using the command: str(statedata)

This dataset has 50 observations (one for each US state) and the following 15 variables:

 

  • Population - the population estimate of the state in 1975
  • Income - per capita income in 1974
  • Illiteracy - illiteracy rates in 1970, as a percent of the population
  • Life.Exp - the life expectancy in years of residents of the state in 1970
  • Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976 
  • HS.Grad - percent of high-school graduates in 1970
  • Frost - the mean number of days with minimum temperature below freezing from 1931–1960 in the capital or a large city of the state
  • Area - the land area (in square miles) of the state
  • state.abb - a 2-letter abbreviation for each state
  • state.area - the area of each state, in square miles
  • x - the longitude of the center of the state
  • y - the latitude of the center of the state
  • state.division - the division each state belongs to (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, or Pacific)
  • state.name - the full names of each state
  • state.region - the region each state belongs to (Northeast, South, North Central, or West)

Problem 1.1 - Data Exploration

We begin by exploring the data. Plot all of the states’ centers with latitude on the y axis (the “y” variable in our dataset) and longitude on the x axis (the “x” variable in our dataset). The shape of the plot should look like the outline of the United States! Note that Alaska and Hawaii have had their coordinates adjusted to appear just off of the west coast.

In the R command you used to generate this plot, which variable name did you use as the first argument?

Exercise 1

 statedata$y 

 statedata$x 

 I used a different variable name. 

Explanation

To generate the described plot, you should type plot(statedata$x, statedata$y) in your R console. The first variable here is statedata$x.


Problem 1.2 - Data Exploration

Using the tapply command, determine which region of the US (West, North Central, South, or Northeast) has the highest average high school graduation rate of all the states in the region:

Exercise 2

 West 

 North Central 

 South 

 Northeast 

Explanation

You can find the average high school graduation rate of all states in each of the regions by typing the following command in your R console:

tapply(statedata$HS.Grad, statedata$state.region, mean)

The highest value is for the West region.


Problem 1.3 - Data Exploration

Now, make a boxplot of the murder rate by region (for more information about creating boxplots in R, type ?boxplot in your console).

Which region has the highest median murder rate?

Exercise 3

 Northeast 

 South 

 North Central 

 West 

Explanation

To generate the boxplot, you should type boxplot(statedata$Murder ~ statedata$state.region) in your R console. You can see that the region with the highest median murder rate (the one with the highest solid line in the box) is the South.


Problem 1.4 - Data Exploration

You should see that there is an outlier in the Northeast region of the boxplot you just generated. Which state does this correspond to? (Hint: There are many ways to find the answer to this question, but one way is to use the subset command to only look at the Northeast data.)

Exercise 4

 Delaware 

 Rhode Island 

 Maine 

 New York 

Explanation

The correct answer is New York. If you first use the subset command:

NortheastData = subset(statedata, state.region == "Northeast")

You can then look at NortheastData$Murder together with NortheastData$state.abb to identify the outlier.


Problem 2.1 - Predicting Life Expectancy - An Initial Model

We would like to build a model to predict life expectancy by state using the state statistics we have in our dataset.

Build the model with all potential variables included (Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area). Note that you should use the variable “Area” in your model, NOT the variable “state.area”.

What is the coefficient for “Income” in your linear regression model?

Exercise 5

 Numerical Response 

 

Explanation

You can build the linear regression model with the following command:

LinReg = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data=statedata)

Then, to find the coefficient for income, you can look at the summary of the regression with summary(LinReg).

 


Problem 2.2 - Predicting Life Expectancy - An Initial Model

Call the coefficient for income x (the answer to Problem 2.1). What is the interpretation of the coefficient x?

Exercise 6

 For a one unit increase in income, predicted life expectancy increases by |x| 

 For a one unit increase in income, predicted life expectancy decreases by |x| 

 For a one unit increase in predicted life expectancy, income decreases by |x| 

 For a one unit increase in predicted life expectancy, income increases by |x| 

Explanation

If we increase income by one unit, then our model’s prediction will increase by the coefficient of income, x. Because x is negative, this is the same as predicted life expectancy decreasing by |x|.


Problem 2.3 - Predicting Life Expectancy - An Initial Model

Now plot a graph of life expectancy vs. income using the command:

plot(statedata$Income, statedata$Life.Exp)

Visually observe the plot. What appears to be the relationship?

Exercise 7

 Life expectancy is somewhat positively correlated with income. 

 Life expectancy is somewhat negatively correlated with income. 

 Life expectancy is not correlated with income. 

Explanation

Although the point in the lower right hand corner of the plot appears to be an outlier, we observe a positive linear relationship in the plot.


Problem 2.4 - Predicting Life Expectancy - An Initial Model

The model we built does not display the relationship we saw from the plot of life expectancy vs. income. Which of the following explanations seems the most reasonable?

Exercise 8

 Income is not related to life expectancy. 

 Multicollinearity 

Explanation

Although income is an insignificant variable in the model, this does not mean that there is no association between income and life expectancy. However, in the presence of all of the other variables, income does not add statistically significant explanatory power to the model. This means that multicollinearity is probably the issue.


Problem 3.1 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Recall that we discussed the principle of simplicity: a model with fewer variables is preferable to a model with many unnecessary variables. Experiment with removing independent variables from the original model. Remember to use the significance of the coefficients to decide which variables to remove (remove the one with the largest "p-value" first, or the one with the "t value" closest to zero), and to remove them one at a time (this is called "backwards variable selection"). This is important due to multicollinearity issues: removing one insignificant variable may make another previously insignificant variable become significant.
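A generic sketch of one elimination step, using R's built-in mtcars data so it runs on its own (the assignment itself uses statedata, and the variables chosen here are illustrative):

```r
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
summary(full)$coefficients[, "Pr(>|t|)"]   # p-value for each coefficient

# Refit after dropping whichever variable has the largest p-value
# (we suppose here that it is drat; check the printed p-values in practice):
reduced <- lm(mpg ~ wt + hp + qsec, data = mtcars)
```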

You should be able to find a good model with only 4 independent variables, instead of the original 7. Which variables does this model contain?

Exercise 9

 Income, HS.Grad, Frost, Murder 

 HS.Grad, Population, Income, Frost 

 Frost, Murder, HS.Grad, Illiteracy 

 Population, Murder, Frost, HS.Grad 

Explanation

We would eliminate the variable “Area” first (since it has the highest p-value, or probability, with a value of 0.9649), by adjusting our lm command to the following:

LinReg = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost, data=statedata)

Looking at summary(LinReg) now, we would choose to eliminate “Illiteracy” since it now has the highest p-value of 0.9340, using the following command:

LinReg = lm(Life.Exp ~ Population + Income + Murder + HS.Grad + Frost, data=statedata)

Looking at summary(LinReg) again, we would next choose to eliminate “Income”, since it has a p-value of 0.9153. This gives the following four variable model:

LinReg = lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data=statedata)

This model with 4 variables is a good model. However, we can see that the variable “Population” is not quite significant. In practice, it would be up to you whether or not to keep the variable “Population” or eliminate it for a 3-variable model. Population does not add much statistical significance in the presence of murder, high school graduation rate, and frost days. However, for the remainder of this question, we will analyze the 4-variable model.


Problem 3.2 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Removing insignificant variables changes the Multiple R-squared value of the model. By looking at the summary output for both the initial model (all independent variables) and the simplified model (only 4 independent variables) and using what you learned in class, which of the following correctly explains the change in the Multiple R-squared value?

Exercise 10

 We expect the “Multiple R-squared” value of the simplified model to be slightly worse than that of the initial model. It can’t be better than the “Multiple R-squared” value of the initial model. 

 We expect the “Multiple R-squared” value of the simplified model to be slightly better than that of the initial model. It can’t be worse than the “Multiple R-squared” value of the initial model.  

 We expect the "Multiple R-squared" of the simplified model to be about the same as the initial model (we have no way of knowing if it will be slightly worse or slightly better than the Multiple R-squared of the initial model). 

Explanation

When we remove insignificant variables, the "Multiple R-squared" will always be worse, but only slightly worse. This is due to the nature of a linear regression model. It is always possible for the regression model to make a coefficient zero, which would be the same as removing the variable from the model. The fact that the coefficient is not zero in the initial model means it must be helping the R-squared value, even if it is only a very small improvement. So when we force the variable to be removed, it will decrease the R-squared a little bit. However, this small decrease is worth it to have a simpler model.

Conversely, when we remove insignificant variables, the "Adjusted R-squared" will frequently be better. This value accounts for the complexity of the model, and thus tends to increase as insignificant variables are removed, and decrease as insignificant variables are added.
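This can be verified on any nested pair of models; a sketch with R's built-in mtcars data (the variables are illustrative):

```r
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp,        data = mtcars)

# Multiple R-squared can only go down when a variable is removed:
summary(full)$r.squared >= summary(reduced)$r.squared   # TRUE

# Adjusted R-squared penalizes model size, so compare it separately:
c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
```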


Problem 3.3 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Using the simplified 4 variable model that we created, we’ll now take a look at how our predictions compare to the actual values.

Take a look at the vector of predictions by using the predict function (since we are just looking at predictions on the training set, you don’t need to pass a “newdata” argument to the predict function).

Which state do we predict to have the lowest life expectancy? (Hint: use the sort function)

Exercise 11

 South Carolina 

 Mississippi 

 Alabama 

 Georgia 

Explanation

If your simplified 4-variable model is called “LinReg”, you can answer this question by typing

sort(predict(LinReg))

in your R console. The first state listed has the lowest predicted life expectancy, which is Alabama.

Which state actually has the lowest life expectancy? (Hint: use the which.min function)

Exercise 12

 South Carolina 

 Mississippi 

 Alabama 

 Georgia 

Explanation

You can find the row number of the state with the lowest life expectancy by typing which.min(statedata$Life.Exp) into your R console. This returns 40. The 40th state name in the vector statedata$state.name is South Carolina.


Problem 3.4 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Which state do we predict to have the highest life expectancy?

Exercise 13

 Massachusetts 

 Maine 

 Washington 

 Hawaii 

Explanation

If your simplified 4-variable model is called “LinReg”, you can answer this question by typing “sort(predict(LinReg))” in your R console. The last state listed has the highest predicted life expectancy, which is Washington.

Which state actually has the highest life expectancy?

Exercise 14

 Massachusetts 

 Maine 

 Washington 

 Hawaii 

Explanation

You can find the row number of the state with the highest life expectancy by typing which.max(statedata$Life.Exp) into your R console. This returns 11. The 11th state name in the vector statedata$state.name is Hawaii.


Problem 3.5 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Take a look at the vector of residuals (the difference between the predicted and actual values).

For which state do we make the smallest absolute error?

Exercise 15

 Maine 

 Florida 

 Indiana 

 Illinois 

Explanation

You can look at the sorted list of absolute errors by typing

sort(abs(model$residuals))

into your R console (where “model” is the name of your model). Alternatively, you can compute the residuals manually by typing

sort(abs(statedata$Life.Exp - predict(model)))

in your R console. The smallest absolute error is for Indiana.

For which state do we make the largest absolute error?

Exercise 16

 Hawaii 

 Maine 

 Texas 

 South Carolina 

Explanation

You can look at the sorted list of absolute errors by typing

sort(abs(model$residuals))

into your R console (where “model” is the name of your model). Alternatively, you can compute the residuals manually by typing

sort(abs(statedata$Life.Exp - predict(model)))

in your R console. The largest absolute error is for Hawaii.

