WEBVTT

00:00:04.860 --> 00:00:09.110
Often, you will need to load
an external data file into R

00:00:09.110 --> 00:00:11.920
to do some analysis
and modeling.

00:00:11.920 --> 00:00:15.410
In this class, we'll be
working with csv files,

00:00:15.410 --> 00:00:18.550
or comma separated value files.

00:00:18.550 --> 00:00:21.320
This is a common
format for data files

00:00:21.320 --> 00:00:24.120
and is easy to work with in R.

00:00:24.120 --> 00:00:27.570
The first thing you need to
do to read in a data file

00:00:27.570 --> 00:00:30.520
is to navigate to the
directory on your computer

00:00:30.520 --> 00:00:32.970
where the data file is saved.

00:00:32.970 --> 00:00:37.990
On a Mac, go to the Misc menu,
then select "Change Working

00:00:37.990 --> 00:00:39.390
Directory...".

00:00:39.390 --> 00:00:46.280
On a PC, go to the File menu
and select "Change dir...".

00:00:46.280 --> 00:00:50.240
This should pop up a browsing
or navigation window.

00:00:50.240 --> 00:00:57.390
Navigate to the folder where
you saved the data file WHO.csv

00:00:57.390 --> 00:00:59.630
that you've downloaded
for this class,

00:00:59.630 --> 00:01:01.260
and then select that folder.

00:01:11.930 --> 00:01:13.850
Nothing should
have happened in R,

00:01:13.850 --> 00:01:17.690
but if you type
getwd, and then empty

00:01:17.690 --> 00:01:20.970
parentheses and hit Enter,
you should see the path

00:01:20.970 --> 00:01:25.590
to the folder containing
the data set that you just

00:01:25.590 --> 00:01:28.250
selected.

00:01:28.250 --> 00:01:33.460
Now, read in the data
file by typing WHO =

00:01:33.460 --> 00:01:42.490
read.csv("WHO.csv") the name of
the data file we want to read

00:01:42.490 --> 00:01:44.030
in.

00:01:44.030 --> 00:01:50.160
If you hit Enter, this will
save the data set in WHO.csv

00:01:50.160 --> 00:01:53.240
to the data frame WHO.

00:01:53.240 --> 00:01:57.009
To look at our data, there
are two very useful commands.

00:01:57.009 --> 00:02:00.480
The first is the
str function, which

00:02:00.480 --> 00:02:03.000
shows us the
structure of the data.

00:02:03.000 --> 00:02:09.570
If you type str(WHO),
and hit Enter,

00:02:09.570 --> 00:02:12.070
you can see that we
have a data frame

00:02:12.070 --> 00:02:16.740
of 194 observations
and 13 variables.

00:02:16.740 --> 00:02:19.530
This data set contains
recent statistics

00:02:19.530 --> 00:02:22.190
from the World Health
Organization-- W,

00:02:22.190 --> 00:02:26.490
H, O, or WHO-- on all countries.

00:02:26.490 --> 00:02:29.329
The variables are the
name of the country,

00:02:29.329 --> 00:02:34.150
the region the country is in,
the population in thousands,

00:02:34.150 --> 00:02:38.980
the percentage of the
population under 15 and over 60,

00:02:38.980 --> 00:02:43.660
the fertility rate or average
number of children per woman,

00:02:43.660 --> 00:02:48.670
the life expectancy in years,
the child mortality rate which

00:02:48.670 --> 00:02:52.260
is the number of children
who die by age five per 1,000

00:02:52.260 --> 00:02:58.160
births, the number of cellular
subscribers per 100 population,

00:02:58.160 --> 00:03:01.110
the literacy rate among
adults aged greater than

00:03:01.110 --> 00:03:06.320
or equal to 15, the gross
national income per capita,

00:03:06.320 --> 00:03:09.960
the percentage of male children
enrolled in primary school,

00:03:09.960 --> 00:03:12.000
and the percentage
of female children

00:03:12.000 --> 00:03:14.160
enrolled in primary school.

00:03:14.160 --> 00:03:19.110
For each variable, str gives
us the name of the variable,

00:03:19.110 --> 00:03:22.300
and then a description of
the type of the variable

00:03:22.300 --> 00:03:25.550
followed by a first few
values of the variable.

00:03:25.550 --> 00:03:27.650
We see a couple
different types here.

00:03:27.650 --> 00:03:30.270
One is a factor variable.

00:03:30.270 --> 00:03:33.700
Country and Region are
both factor variables.

00:03:33.700 --> 00:03:35.579
This means that
the variables have

00:03:35.579 --> 00:03:40.210
several different categories,
not necessarily numerical.

00:03:40.210 --> 00:03:44.390
For example, the Region variable
has six different categories

00:03:44.390 --> 00:03:45.740
or levels.

00:03:45.740 --> 00:03:48.930
These include
Africa and Americas.

00:03:48.930 --> 00:03:52.790
So each observation
in the Region variable

00:03:52.790 --> 00:03:56.079
belongs to one of six
different categories.

00:03:56.079 --> 00:03:57.770
For variables like
Country, where

00:03:57.770 --> 00:04:01.760
there's 194 levels, which is
the same number of observations

00:04:01.760 --> 00:04:06.330
we have, each value in
this variable is different.

00:04:06.330 --> 00:04:09.150
In this case, it makes sense,
since each country name

00:04:09.150 --> 00:04:11.230
is different.

00:04:11.230 --> 00:04:14.340
Then we have two types of
numerical values-- integer

00:04:14.340 --> 00:04:16.040
and then general
numerical values.

00:04:18.930 --> 00:04:22.200
The other very useful function
to take a look at our data

00:04:22.200 --> 00:04:24.130
is the summary function.

00:04:24.130 --> 00:04:27.190
In your R console,
type summary and then,

00:04:27.190 --> 00:04:32.159
in parentheses, WHO, the name of
our data frame, and hit Enter.

00:04:32.159 --> 00:04:36.250
This gives a numerical summary
of each of our variables.

00:04:36.250 --> 00:04:39.010
For the factor variables,
country and region,

00:04:39.010 --> 00:04:41.330
it counts the number
of observations

00:04:41.330 --> 00:04:44.860
in each of the
levels or categories.

00:04:44.860 --> 00:04:48.120
So here, we see that we have
46 countries in the region

00:04:48.120 --> 00:04:53.110
Africa, 35 in the
region Americas, etc.

00:04:53.110 --> 00:04:55.010
For each of the
numerical values,

00:04:55.010 --> 00:05:01.420
we see the min, first quartile,
median, mean, third quartile,

00:05:01.420 --> 00:05:05.640
and maximum values
in that variable.

00:05:05.640 --> 00:05:09.430
We can also see in some of the
variables that we have this

00:05:09.430 --> 00:05:11.630
category called NA's.

00:05:11.630 --> 00:05:15.160
This means that
some observations

00:05:15.160 --> 00:05:17.640
are missing values
in that variable.

00:05:17.640 --> 00:05:21.200
So for FertilityRate,
there 11 observations that

00:05:21.200 --> 00:05:25.440
are missing the value
of FertilityRate.

00:05:25.440 --> 00:05:28.100
When working with data
in R, it can often

00:05:28.100 --> 00:05:30.540
be useful to subset your data.

00:05:30.540 --> 00:05:33.820
For example, suppose we
want to create a new data

00:05:33.820 --> 00:05:36.610
frame with only the
countries in Europe.

00:05:36.610 --> 00:05:42.990
Let's call it WHO_Europe
and use the subset function

00:05:42.990 --> 00:05:46.210
to subset the data
frame WHO to take

00:05:46.210 --> 00:05:49.290
only the observations
for which Region

00:05:49.290 --> 00:05:51.300
is exactly equal to Europe.

00:05:53.890 --> 00:05:57.030
The subset function
takes two arguments.

00:05:57.030 --> 00:05:58.560
The first is the
data frame we want

00:05:58.560 --> 00:06:01.870
to take a subset of,
in this case, WHO.

00:06:01.870 --> 00:06:04.200
And the second argument
is the criteria

00:06:04.200 --> 00:06:08.350
for which observations of WHO
should belong in our new data

00:06:08.350 --> 00:06:10.160
frame, WHO_Europe.

00:06:10.160 --> 00:06:12.760
In this case, we
want the observations

00:06:12.760 --> 00:06:17.500
for which the Region variable
is exactly equal to Europe.

00:06:17.500 --> 00:06:21.500
The double equal sign here
means exactly equal to.

00:06:21.500 --> 00:06:27.240
If we hit Enter and then look
at the structure of WHO_Europe,

00:06:27.240 --> 00:06:29.200
we can see that
we now have a data

00:06:29.200 --> 00:06:33.909
frame of 53 observations
of the same 13 variables.

00:06:33.909 --> 00:06:35.760
Does 53 sound right?

00:06:35.760 --> 00:06:39.450
Well, let's look back at
the summary output of WHO.

00:06:39.450 --> 00:06:42.200
We can see in the
Region output, there

00:06:42.200 --> 00:06:47.320
were 53 observations that
belonged in the region Europe.

00:06:47.320 --> 00:06:49.970
So we should expect
53 observations

00:06:49.970 --> 00:06:53.010
in our Europe subset,
which is right.

00:06:55.650 --> 00:06:58.060
Now, suppose we want
to save this new data

00:06:58.060 --> 00:07:01.500
frame, WHO_Europe,
to a csv file.

00:07:01.500 --> 00:07:05.430
You can use the write.csv
function to do this.

00:07:05.430 --> 00:07:10.440
Type write.csv, and
then in parentheses

00:07:10.440 --> 00:07:12.700
the name of the data
frame we want to save,

00:07:12.700 --> 00:07:17.280
WHO_Europe, comma,
and then in quotes

00:07:17.280 --> 00:07:19.740
the name of the file
we want to save it to.

00:07:19.740 --> 00:07:20.950
Let's call it WHO_Europe.csv.

00:07:24.810 --> 00:07:27.850
If you hit Enter,
nothing should happen,

00:07:27.850 --> 00:07:31.780
but you should now have a
file called WHO_Europe.csv

00:07:31.780 --> 00:07:35.040
in the same folder that
you saved WHO.csv in.

00:07:37.670 --> 00:07:40.980
And now that we've saved
this as a csv file,

00:07:40.980 --> 00:07:43.570
if we're not working
with it anymore in R,

00:07:43.570 --> 00:07:47.409
we can remove the data frame
from our current session in R.

00:07:47.409 --> 00:07:50.370
This is often useful if you're
working with a large data

00:07:50.370 --> 00:07:53.159
set that's taking
up a lot of space.

00:07:53.159 --> 00:07:58.110
First, let's type ls() to see
what variables we currently

00:07:58.110 --> 00:08:01.620
have in R. You could see
that WHO_Europe is one

00:08:01.620 --> 00:08:03.590
of our variables.

00:08:03.590 --> 00:08:09.990
Now, type rm for remove and
then the name WHO_Europe and hit

00:08:09.990 --> 00:08:11.210
Enter.

00:08:11.210 --> 00:08:14.930
If you type ls() again, you
should see that WHO_Europe is

00:08:14.930 --> 00:08:16.850
gone.

00:08:16.850 --> 00:08:21.430
In the next video, we'll
explore the WHO data set.