1.5 Assignment 1

Demographics and Employment in the United States

In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset.

The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData (CSV), which has the following variables:

PeopleInHousehold: The number of people in the interviewee’s household.

Region: The census region where the interviewee lives.

State: The state where the interviewee lives.

MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes (CSV).

Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and older.

Married: The marriage status of the interviewee.

Sex: The sex of the interviewee.

Education: The maximum level of education obtained by the interviewee.

Race: The race of the interviewee.

Hispanic: Whether the interviewee is of Hispanic ethnicity.

CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes (CSV).

Citizenship: The United States citizenship status of the interviewee.

EmploymentStatus: The status of employment of the interviewee.

Industry: The industry of employment of the interviewee (only available if they are employed).

Load the dataset from CPSData (CSV) into a data frame called CPS, and view the dataset with the summary() and str() commands.

How many interviewees are in the dataset?

Exercise 1

Numerical Response

Explanation

You can load the data with the read.csv() function, storing the result in a data frame called CPS.

From str(CPS), we can read that there are 131302 interviewees.
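As a minimal sketch of the loading step, the snippet below reads a CSV file into a data frame and counts its rows. The file name CPSData.csv is an assumption based on the dataset's name, and the tiny stand-in file (with made-up rows) exists only so the example runs without the download. stringsAsFactors = TRUE is spelled out because R 4.0 and later no longer convert text columns to factors by default, which changes what summary() shows for them.

```r
# Hypothetical stand-in for CPSData.csv (assumed file name) so the sketch
# runs without the real download; the rows below are made up.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("PeopleInHousehold,Region,Age,Industry",
             "1,South,85,NA",
             "3,South,21,Professional and business services",
             "3,West,37,Educational and health services"),
           demo_path)

# With the real data this would be read.csv("CPSData.csv", ...).
CPS_demo <- read.csv(demo_path, stringsAsFactors = TRUE)

str(CPS_demo)      # the first line reports the number of observations (rows)
summary(CPS_demo)  # per-column summaries, including NA counts

n_interviewees <- nrow(CPS_demo)  # one row per interviewee
n_interviewees
```

On the full dataset, nrow(CPS) should agree with the observation count printed by str(CPS).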

Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

Exercise 2

Text Response

Answer: Educational and health services

Explanation

The output of summary(CPS) orders the levels of a factor variable like Industry from largest to smallest, so we can see that “Educational and health services” is the most common Industry. table(CPS$Industry) would have provided the breakdown across all industries.

Problem 1.3 - Loading and Summarizing the Dataset

Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.
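The sort(table()) pattern can be tried on a small vector before running it on the full data. The sketch below uses a toy stand-in for CPS$Industry; the values are illustrative and are not real counts.

```r
# Toy stand-in for CPS$Industry; values are illustrative only.
industry <- factor(c("Trade", "Educational and health services", NA,
                     "Educational and health services", "Construction",
                     "Trade", "Educational and health services"))

# table() leaves out NA values by default, so only reported industries count.
counts <- sort(table(industry), decreasing = TRUE)
counts

# After sorting in decreasing order, the first name is the most common level.
most_common <- names(counts)[1]
most_common
```

Sorting in decreasing order puts the largest group first; the default (increasing) order puts it last, which is often easier to read at the bottom of long console output.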

Which state has the fewest interviewees?

Exercise 3

Which state has the largest number of interviewees?

Exercise 4

Explanation

These can be read from sort(table(CPS$State)).

Problem 1.4 - Loading and Summarizing the Dataset

What proportion of interviewees are citizens of the United States?

Exercise 5

Numerical Response

Explanation

From table(CPS$Citizenship), we see that 123,712 of the 131,302 interviewees are citizens of the United States (either native or naturalized). This is a proportion of 123712/131302 = 0.942.
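The citizenship proportion can be checked directly from the two counts quoted above. The second half of this sketch shows one way to compute the same kind of proportion from a column; the level names in the toy vector are an assumption about how the Citizenship variable is coded.

```r
# Counts quoted in the explanation.
citizens <- 123712
total <- 131302
prop_from_counts <- round(citizens / total, 3)
prop_from_counts

# Toy stand-in for CPS$Citizenship; the level names are assumed.
citizenship <- c("Citizen, Native", "Non-Citizen", "Citizen, Naturalized",
                 "Citizen, Native", "Citizen, Native")

# The mean of a TRUE/FALSE vector is the share of TRUE values.
prop_citizen <- mean(citizenship != "Non-Citizen")
prop_citizen
```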

The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity? (Select all that apply.)

Exercise 6

American Indian

Asian

Black

Multiracial

Pacific Islander

White

Explanation

The breakdown of race and Hispanic ethnicity can be obtained with table(CPS$Race, CPS$Hispanic).
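To illustrate how such a two-way table is read, the sketch below builds one from toy vectors and picks out the races whose Hispanic count meets a threshold. The data are made up, and the threshold is 2 only because the toy sample is tiny; the exercise uses 250.

```r
# Toy stand-ins for CPS$Race and CPS$Hispanic (Hispanic is coded 0/1).
race <- c("White", "White", "Black", "Asian", "White", "Black", "Multiracial")
hispanic <- c(1, 1, 0, 0, 0, 1, 1)

# Rows are races, columns are the values of the Hispanic variable.
breakdown <- table(race, hispanic)
breakdown

# Counts of Hispanic interviewees per race come from the "1" column.
hispanic_counts <- breakdown[, "1"]
hispanic_races <- names(hispanic_counts)[hispanic_counts >= 2]
hispanic_races
```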

Problem 2.1 - Evaluating Missing Values

Which variables have at least one interviewee with a missing (NA) value? (Select all that apply.)

Exercise 7

PeopleInHousehold

Region

State

MetroAreaCode

Age

Married

Sex

Education

Race

Hispanic

CountryOfBirthCode

Citizenship

EmploymentStatus

Industry

Explanation

This can be read from the output of summary(CPS).
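Reading NA counts off summary() works, but a programmatic check can be easier to scan when there are many columns. The sketch below is one possible alternative, shown on a made-up data frame with three of the dataset's column names.

```r
# Toy data frame; the values are made up for illustration.
toy <- data.frame(
  Age = c(85, 21, 37),
  Married = c("Widowed", NA, "Never Married"),
  MetroAreaCode = c(NA, 26620, NA),
  stringsAsFactors = TRUE
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
na_counts

# Columns with at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na
```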

Problem 3.4 - Integrating Metropolitan Area Data

Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.

Exercise 19

Explanation

The correct application of tapply here is

tapply(CPS$Hispanic, CPS$MetroArea, mean)

It will be easiest to obtain the maximum by actually using the sorted output:

sort(tapply(CPS$Hispanic, CPS$MetroArea, mean))

As we can see, 96.6% of the interviewees from Laredo, TX, are of Hispanic ethnicity, the highest proportion among metropolitan areas in the United States.
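The tapply() call above groups a 0/1 variable by metropolitan area and averages it, which yields a proportion per area. A small sketch on made-up rows (the proportions here are not the real ones) shows the mechanics:

```r
# Toy stand-in for the merged data; rows are made up.
toy <- data.frame(
  MetroArea = c("Laredo, TX", "Laredo, TX",
                "Boston-Cambridge-Quincy, MA-NH",
                "Boston-Cambridge-Quincy, MA-NH", NA),
  Hispanic = c(1, 1, 0, 1, 1)
)

# Mean of a 0/1 variable within each group is the group's proportion of 1s.
# Rows with a missing MetroArea are left out of the grouping.
prop_hispanic <- sort(tapply(toy$Hispanic, toy$MetroArea, mean))
prop_hispanic

# sort() is increasing by default, so the largest proportion is last.
top_area <- names(prop_hispanic)[length(prop_hispanic)]
top_area
```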

Problem 4.3 - Integrating Country of Birth Data

What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.

Exercise 25

Numerical Response

Explanation

From table(CPS$MetroArea == "New York-Northern New Jersey-Long Island, NY-NJ-PA", CPS$Country != "United States"), we can see that 1668 interviewees from this metropolitan area were born outside the United States and 3736 were born in the United States (an additional 5 have a missing country of origin). Therefore, the proportion is 1668/(1668+3736) = 0.309.
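The same calculation can be sketched on toy data to show how the missing values drop out. The metropolitan area names below are hypothetical placeholders, and the rows are made up.

```r
# Toy stand-in for the merged data; "Metro A" and "Metro B" are hypothetical.
toy <- data.frame(
  MetroArea = c("Metro A", "Metro A", "Metro A", "Metro A", "Metro B"),
  Country = c("United States", "India", NA, "United States", "Brazil")
)

in_metro <- toy$MetroArea == "Metro A"
born_abroad <- toy$Country != "United States"  # NA where Country is missing

# table() omits NA by default, so the missing country is not counted.
counts <- table(in_metro, born_abroad)
counts

# Proportion born abroad among Metro A interviewees with a known country.
prop_abroad <- counts["TRUE", "TRUE"] / sum(counts["TRUE", ])

# Equivalent shortcut: average the logical vector, dropping NA values.
prop_abroad_alt <- mean(born_abroad[in_metro], na.rm = TRUE)
```

Both forms exclude interviewees with a missing country of birth, which is what the problem asks for.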

Problem 4.4 - Integrating Country of Birth Data

Which metropolitan area has the largest number (note – not proportion) of interviewees with a country of birth in India? Hint – remember to include na.rm=TRUE if you are using tapply() to answer this question.

Exercise 26

Boston-Cambridge-Quincy, MA-NH

Minneapolis-St Paul-Bloomington, MN-WI

New York-Northern New Jersey-Long Island, NY-NJ-PA

Washington-Arlington-Alexandria, DC-VA-MD-WV

In Brazil?

Exercise 27

Boston-Cambridge-Quincy, MA-NH

Minneapolis-St Paul-Bloomington, MN-WI

New York-Northern New Jersey-Long Island, NY-NJ-PA

Washington-Arlington-Alexandria, DC-VA-MD-WV

In Somalia?

Exercise 28

Boston-Cambridge-Quincy, MA-NH

Minneapolis-St Paul-Bloomington, MN-WI

New York-Northern New Jersey-Long Island, NY-NJ-PA

Washington-Arlington-Alexandria, DC-VA-MD-WV

Explanation

To obtain the number of TRUE values in a vector of TRUE/FALSE values, you can use the sum() function. For instance, sum(c(TRUE, FALSE, TRUE, TRUE)) is 3. Therefore, we can obtain counts of people born in a particular country living in a particular metropolitan area with:

sort(tapply(CPS$Country == "India", CPS$MetroArea, sum, na.rm=TRUE))

sort(tapply(CPS$Country == "Brazil", CPS$MetroArea, sum, na.rm=TRUE))

sort(tapply(CPS$Country == "Somalia", CPS$MetroArea, sum, na.rm=TRUE))

We see that New York has the most interviewees born in India (96), Boston has the most born in Brazil (18), and Minneapolis has the most born in Somalia (17).
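To see why the hint asks for na.rm=TRUE, the sketch below runs the same counting pattern on made-up rows with and without it. The metropolitan area names are hypothetical placeholders.

```r
# Toy stand-in for the merged data; names and rows are made up.
toy <- data.frame(
  MetroArea = c("Metro A", "Metro A", "Metro B", "Metro B", "Metro B"),
  Country = c("India", NA, "India", "India", "Brazil")
)

# sum() of a TRUE/FALSE vector counts the TRUE values; na.rm = TRUE skips
# interviewees whose country of birth is missing.
with_na_rm <- sort(tapply(toy$Country == "India", toy$MetroArea, sum,
                          na.rm = TRUE))
with_na_rm

# Without na.rm, any group containing a missing country sums to NA.
without_na_rm <- tapply(toy$Country == "India", toy$MetroArea, sum)
without_na_rm
```

A single missing country in a group is enough to turn that group's count into NA, and sort() drops NA entries by default, so affected areas would silently disappear from the sorted output.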