15.071 | Spring 2017 | Graduate

The Analytics Edge

1 An Introduction to Analytics

1.3 Working with Data: An Introduction to R

Introduction to R

The slides for this sequence can be downloaded here: Introduction to the R Programming Language (PDF).

Before beginning this lecture, please download and install R for free from the following webpage: The R Project.

At the top of the page, select your operating system (Linux, Windows, or Mac) and then follow the instructions. 

Most Windows users should select “install R for the first time” and then “Download R 3.2.0 for Windows” at the top of the page.

Mac users should download R-3.2.0.pkg if OS X 10.9 (Mavericks) or higher is installed, and R-3.1.3-snowleopard.pkg for earlier versions of the operating system. Either version of R will work for this course.

Once you have downloaded and installed R, please start the application, and then answer the following quick questions. This is to make sure that you have correctly installed R.

 

Quick Questions for Installing R

When you start R, you should see a window titled “R Console”. In this window, there is some text, and then at the bottom there should be a > symbol (greater than symbol), followed by a blinking cursor. At the cursor, type:

sd(c(5,8,12))

and then hit enter. You should see [1] followed by a number. What is this number?

Exercise 1


 

Now type:

which.min(c(4,1,6))

at the cursor, and hit enter. You should again see [1], followed by a number. What is this number?

Exercise 2


 

If you correctly answered both of these questions in R, you are ready to start the lecture! If not, please go to the discussion forum to get help installing R.
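If you are curious what these two functions do, here is a minimal sketch on toy inputs (deliberately not the quiz values). It assumes only base R; the variable names are made up for illustration.

```r
# Toy inputs (not the quiz values) to show what each function returns.
x <- c(2, 4, 6)

# sd() is the sample standard deviation: it divides by n - 1, not n.
manual_sd  <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
builtin_sd <- sd(x)
print(c(manual = manual_sd, builtin = builtin_sd))

# which.min() returns the position of the smallest element, not its value.
y <- c(7, 3, 9)
pos <- which.min(y)
smallest <- y[pos]
print(pos)
print(smallest)
```

The [1] that R prints before each result is just the index of the first value shown on that line.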


Important Note

If you downloaded and installed R in a location other than the United States, you might encounter some formatting issues later in this class due to language differences. To fix this, you will need to type in your R console:

Sys.setlocale("LC_ALL", "C")

This will only change the locale for your current R session, so please make a note to run this command when you are working on any lectures or exercises that might depend on the English language (for example, the names for the days of the week).
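As a quick check that the command took effect, you could ask R for the name of a weekday. This is a small sketch, not part of the lecture; the date is arbitrary (January 2, 2017 was a Monday).

```r
# Switch the current session to the "C" locale, as described above.
Sys.setlocale("LC_ALL", "C")

# Ask for the weekday name of a known date. In the "C" locale the
# name is reported in English.
day_name <- weekdays(as.Date("2017-01-02"))
print(day_name)

# Inspect the time-related part of the locale now in use.
print(Sys.getlocale("LC_TIME"))
```

If the weekday prints in English, the session is set up as intended. Remember that a new R session starts with your system locale again.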

Quick Question

If you want to know the mean value of a numerical variable, which function could you use?

  
  
  

Explanation If you use the summary function (in the video, we typed summary(WHO) in our R console), you can see a statistical summary of each variable. For numerical variables, the mean value is listed.

If you want to know the standard deviation of a numerical variable, which function could you use?

  
  
  

Explanation Neither the str function nor the summary function provides the standard deviation value of a variable. We'll see how to compute this value in the next video.
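To see the difference for yourself, the sketch below uses a small made-up data frame (called toy here, with invented values) as a stand-in for WHO. It uses the sd function from the installation check to fill the gap that str and summary leave.

```r
# Hypothetical stand-in for the WHO data frame; all values are invented.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Over60  = c(5, 10, 15, 30)
)

# str() shows the structure: variable names, types, and the first values.
str(toy)

# summary() lists min, quartiles, median, and mean for a numeric variable...
over60_summary <- summary(toy$Over60)
print(over60_summary)

# ...but no standard deviation, so that needs its own function call.
over60_mean <- mean(toy$Over60)
over60_sd   <- sd(toy$Over60)
print(over60_sd)
```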

Quick Question

 

Please answer the following questions using the entire data frame WHO (and not one of the subsets we have created in R).

What is the mean value of the “Over60” variable?

Exercise 1


 

Explanation

You can compute this value by either typing mean(WHO$Over60) in your R console, or by typing summary(WHO$Over60) in your R console. The output is 11.16.

Which country has the smallest percentage of the population over 60?

Exercise 2

 Japan 

 United Arab Emirates (UAE) 

 Sierra Leone 

 Cuba 

 Luxembourg 

 Mali 

Explanation

To get this value, you should type which.min(WHO$Over60) in your R console. The output is 183. Then, to see the name of the 183rd country in your data frame, type WHO$Country[183] in your R console. The output is United Arab Emirates.

Which country has the largest literacy rate?

Exercise 3

 Japan 

 United Arab Emirates (UAE) 

 Sierra Leone 

 Cuba 

 Luxembourg 

 Mali 

Explanation

To get this value, you should type which.max(WHO$LiteracyRate) in your R console. The output is 44. Then, to see the name of the 44th country in your data frame, type WHO$Country[44] in your R console. The output is Cuba.
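The two-step pattern used in these explanations (find the row number, then look up the country in that row) can be sketched on a small made-up data frame. The name toy and all of its values are invented for illustration; they are not the WHO data.

```r
# Hypothetical stand-in for WHO; a missing value is included on purpose.
toy <- data.frame(
  Country      = c("A", "B", "C", "D"),
  Over60       = c(12.5, 0.8, 21.0, 4.3),
  LiteracyRate = c(91.2, NA, 99.8, 60.1),
  stringsAsFactors = FALSE
)

# Step 1: which.min() gives the row number of the smallest value.
row_min <- which.min(toy$Over60)
# Step 2: use that row number to look up the country name.
country_min <- toy$Country[row_min]

# which.max() works the same way and skips missing values (NA).
row_max <- which.max(toy$LiteracyRate)
country_max <- toy$Country[row_max]

print(c(smallest_over60 = country_min, largest_literacy = country_max))
```

The two steps can also be combined into one line, for example toy$Country[which.min(toy$Over60)].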


 

Quick Question

Which of the following are recommended variable names in R? (Select all that apply.)

   
   
   
   

Explanation SquareRoot2 and Square2.Root are recommended variable names. The second option is not recommended because it has a space, and the fourth option is not recommended because it starts with a number.
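One way to check names like these is to compare each candidate with what make.names would turn it into. This is only a sketch: the two rejected candidates below are assumed examples of "has a space" and "starts with a number", and the check covers syntactic validity only, not style.

```r
# Candidate names; the 2nd and 4th are assumed examples of the
# problems described above (a space, a leading digit).
candidates <- c("SquareRoot2", "Square Root2", "Square2.Root", "2SquareRoot")

# make.names() rewrites anything that is not a syntactically valid name,
# so a name is valid exactly when make.names() leaves it unchanged.
is_valid <- candidates == make.names(candidates)
names(is_valid) <- candidates
print(is_valid)

# The valid names can be assigned to directly.
SquareRoot2  <- sqrt(2)
Square2.Root <- sqrt(2)
```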

Quick Question

Use the tapply function to find the average child mortality rate of countries in each region.

Which region has the lowest average child mortality rate across all countries in that region?

  
  
  
  
  
  

Explanation You can find the average child mortality rate of countries in each region by using the following command: tapply(WHO$ChildMortality, WHO$Region, mean). The output tells us that Europe has the lowest average child mortality rate across all countries, with an average of 10.05.
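To see how tapply groups values, here is a minimal sketch on a made-up data frame. The region names and numbers are invented and are not the WHO data.

```r
# Hypothetical data: two regions with invented child mortality values.
toy <- data.frame(
  Region         = c("North", "North", "South", "South", "South"),
  ChildMortality = c(10, 20, 30, 40, 50)
)

# tapply(values, groups, function): split the values by group,
# then apply the function to each group.
avg_by_region <- tapply(toy$ChildMortality, toy$Region, mean)
print(avg_by_region)

# Pick out the name of the region with the lowest average.
lowest_region <- names(avg_by_region)[which.min(avg_by_region)]
print(lowest_region)
```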

Quick Question

If you want to add new observations to a data frame, what should you use?

   
   
   

Explanation To add new observations to a data frame that has the same variables, you should use rbind.

If you want to combine two vectors into a data frame, what should you use?

   
   
   

Explanation To combine two vectors into a data frame, you should use data.frame.

If you want to add a variable to your data frame, what should you use?

   
   
   

Explanation To add a variable to your data frame, you should use the dollar sign notation.
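The three answers above can be tried together in one short sketch. All names and values below are made up for illustration.

```r
# Two vectors of the same length (invented values).
Country        <- c("A", "B")
LifeExpectancy <- c(70, 80)

# data.frame() combines vectors into a data frame, one column per vector.
df1 <- data.frame(Country, LifeExpectancy)

# A second data frame with the same variables holds the new observations.
df2 <- data.frame(Country = c("C", "D"), LifeExpectancy = c(65, 75))

# rbind() stacks the rows of df2 underneath the rows of df1.
all_rows <- rbind(df1, df2)

# The dollar sign notation adds a new variable (one value per row).
all_rows$Population <- c(100, 200, 300, 400)

print(all_rows)
```

Note that rbind expects the two data frames to have matching variable names, and the new variable needs one value for each row.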

Quick Question

At which university was the first version of R developed?

  
  
  
  

Explanation The first version of R was developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the mid-1990s.

Video 5: Data Analysis - Summary Statistics and Scatterplots

This video covers a lot of material regarding basic data analysis in R. Don’t worry if you don’t absorb everything, as the focus of the recitation and the homework assignment is to go over all of these topics again. By the time you are done with the homework assignment for this week, you will hopefully feel much more comfortable performing basic data analysis in R.

Video 6: Data Analysis - Plots and Summary Tables

In the above video, we state that the outliers in a boxplot are computed as any points greater than the third quartile plus the inter-quartile range (IQR), or any points less than the first quartile minus the inter-quartile range (IQR). They are actually computed as any points greater than the third quartile plus 1.5*IQR, or less than the first quartile minus 1.5*IQR. 
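The difference between the two rules can be checked on a small made-up vector. This sketch assumes base R only. One detail to be aware of: R's boxplot code works with "hinges" from fivenum, which are versions of the first and third quartiles and can differ slightly from what the quantile function reports.

```r
# Invented data: 14 lies between the two cutoffs, 50 lies beyond both.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 50)

# fivenum() returns min, lower hinge, median, upper hinge, max.
five        <- fivenum(x)
lower_hinge <- five[2]
upper_hinge <- five[4]
spread      <- upper_hinge - lower_hinge  # the IQR as the boxplot code sees it

# Rule as stated in the video (1 * IQR): flags both 14 and 50.
flagged_video <- x[x > upper_hinge + spread | x < lower_hinge - spread]

# Corrected rule (1.5 * IQR): flags only 50.
flagged_actual <- x[x > upper_hinge + 1.5 * spread |
                    x < lower_hinge - 1.5 * spread]

# boxplot.stats() reports the points a boxplot would draw as outliers.
builtin_out <- boxplot.stats(x)$out

print(flagged_video)
print(flagged_actual)
print(builtin_out)
```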
