2.5 Assignment 2

State Data

We often take data for granted. However, one of the hardest parts about analyzing a problem you’re interested in can be to find good data to answer the questions you want to ask. As you’re learning R, though, there are many datasets that R has built in that you can take advantage of.

In this problem, we will be examining the “state” dataset, which has data from the 1970s on all fifty US states. For each state, the dataset includes the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation.

Load the dataset and convert it to a data frame by running the following two commands in R:

data(state)

statedata = cbind(data.frame(state.x77), state.abb, state.area, state.center, state.division, state.name, state.region)

If you can’t access the state dataset in R, here is a CSV file with the same data that you can load into R using the read.csv function: statedata (CSV).

After you have loaded the data into R, inspect the data set using the command: str(statedata)

This dataset has 50 observations (one for each US state) and the following 15 variables:

Population - the population estimate of the state in 1975
Income - per capita income in 1974
Illiteracy - illiteracy rates in 1970, as a percent of the population
Life.Exp - the life expectancy in years of residents of the state in 1970
Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976
HS.Grad - percent of high-school graduates in 1970
Frost - the mean number of days with minimum temperature below freezing from 1931–1960 in the capital or a large city of the state
Area - the land area (in square miles) of the state
state.abb - a 2-letter abreviation for each state
state.area - the area of each state, in square miles
x - the longitude of the center of the state
y - the latitude of the center of the state
state.division - the division each state belongs to (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, or Pacific)
state.name - the full names of each state
state.region - the region each state belong to (Northeast, South, North Central, or West)

Problem 1.1 - Data Exploration

We begin by exploring the data. Plot all of the states’ centers with latitude on the y axis (the “y” variable in our dataset) and longitude on the x axis (the “x” variable in our dataset). The shape of the plot should look like the outline of the United States! Note that Alaska and Hawaii have had their coordinates adjusted to appear just off of the west coast.

In the R command you used to generate this plot, which variable name did you use as the first argument?

A. statedata$y

B. statedata$x

C. I used a different variable name.

Problem 1.2 - Data Exploration

Using the tapply command, determine which region of the US (West, North Central, South, or Northeast) has the highest average high school graduation rate of all the states in the region:

West

North Central

South

Northeast

Problem 1.3 - Data Exploration

Now, make a boxplot of the murder rate by region (for more information about creating boxplots in R, type ?boxplot in your console).

Which region has the highest median murder rate?

Northeast

South

North Central

West

Problem 1.4 - Data Exploration

You should see that there is an outlier in the Northeast region of the boxplot you just generated. Which state does this correspond to? (Hint: There are many ways to find the answer to this question, but one way is to use the subset command to only look at the Northeast data.)

Delaware

Rhode Island

Maine

New York

Problem 2.1 - Predicting Life Expectancy - An Initial Model

We would like to build a model to predict life expectancy by state using the state statistics we have in our dataset.

Build the model with all potential variables included (Population, Income, Illiteracy, Murder, HS.Grad, Frost, and Area). Note that you should use the variable “Area” in your model, NOT the variable “state.area”.

What is the coefficient for “Income” in your linear regression model?

Problem 2.2 - Predicting Life Expectancy - An Initial Model

Call the coefficient for income x (the answer to Problem 2.1). What is the interpretation of the coefficient x?

For a one unit increase in income, predicted life expectancy increases by |x|

For a one unit increase in income, predicted life expectancy decreases by |x|

For a one unit increase in predicted life expectancy, income decreases by |x|

For a one unit increase in predicted life expectancy, income increases by |x|

Problem 2.3 - Predicting Life Expectancy - An Initial Model

Now plot a graph of life expectancy vs. income using the command:

plot(statedata$Income, statedata$Life.Exp)

Visually observe the plot. What appears to be the relationship?

Life expectancy is somewhat positively correlated with income.

Life expectancy is somewhat negatively correlated with income.

Life expectancy is not correlated with income.

Problem 2.4 - Predicting Life Expectancy - An Initial Model

The model we built does not display the relationship we saw from the plot of life expectancy vs. income. Which of the following explanations seems the most reasonable?

Income is not related to life expectancy.

Multicollinearity

Problem 3.1 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Recall that we discussed the principle of simplicity: that is, a model with fewer variables is preferable to a model with many unnnecessary variables. Experiment with removing independent variables from the original model. Remember to use the significance of the coefficients to decide which variables to remove (remove the one with the largest “p-value” first, or the one with the “t value” closest to zero), and to remove them one at a time (this is called “backwards variable selection”). This is important due to multicollinearity issues - removing one insignificant variable may make another previously insignificant variable become significant.

You should be able to find a good model with only 4 independent variables, instead of the original 7. Which variables does this model contain?

Income, HS.Grad, Frost, Murder

HS.Grad, Population, Income, Frost

Frost, Murder, HS.Grad, Illiteracy

Population, Murder, Frost, HS.Grad

Problem 3.2 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Removing insignificant variables changes the Multiple R-squared value of the model. By looking at the summary output for both the initial model (all independent variables) and the simplified model (only 4 independent variables) and using what you learned in class, which of the following correctly explains the change in the Multiple R-squared value?

We expect the “Multiple R-squared” value of the simplified model to be slightly worse than that of the initial model. It can’t be better than the “Multiple R-squared” value of the initial model.

We expect the “Multiple R-squared” value of the simplified model to be slightly better than that of the initial model. It can’t be worse than the “Multiple R-squared” value of the initial model.

We expect the “Multiple R-squared” of the simplified model to be about the same as the intial model (we have no way of knowing if it will be slightly worse or slightly better than the Multiple R-squared of the intial model).

Problem 3.3 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Using the simplified 4 variable model that we created, we’ll now take a look at how our predictions compare to the actual values.

Take a look at the vector of predictions by using the predict function (since we are just looking at predictions on the training set, you don’t need to pass a “newdata” argument to the predict function).

Which state do we predict to have the lowest life expectancy? (Hint: use the sort function)

South Carolina

Mississippi

Alabama

Georgia

Which state actually has the lowest life expectancy? (Hint: use the which.min function)

South Carolina

Mississippi

Alabama

Georgia

Problem 3.4 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Which state do we predict to have the highest life expectancy?

Massachusetts

Maine

Washington

Hawaii

Which state actually has the highest life expectancy?

Massachusetts

Maine

Washington

Hawaii

Problem 3.5 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Take a look at the vector of residuals (the difference between the predicted and actual values).

For which state do we make the smallest absolute error?

Maine

Florida

Indiana

Illinois

For which state do we make the largest absolute error?

Hawaii

Maine

Texas

South Carolina

Back: Detecting Flu Epidemics via Search Engine Query Data

Browse Course Material

Course Info

Instructor

Departments

As Taught In

Level

Topics

Learning Resource Types

The Analytics Edge

State Data

Problem 1.1 - Data Exploration

Problem 1.2 - Data Exploration

Problem 1.3 - Data Exploration

Problem 1.4 - Data Exploration

Problem 2.1 - Predicting Life Expectancy - An Initial Model

Problem 2.2 - Predicting Life Expectancy - An Initial Model

Problem 2.3 - Predicting Life Expectancy - An Initial Model

Problem 2.4 - Predicting Life Expectancy - An Initial Model

Problem 3.1 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Problem 3.2 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Problem 3.3 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Problem 3.4 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Problem 3.5 - Predicting Life Expectancy - Refining the Model and Analyzing Predictions

Course Info

Instructor

Departments

As Taught In

Level

Topics

Learning Resource Types