WEBVTT

00:00:04.490 --> 00:00:06.220
In this video, we
will see how we

00:00:06.220 --> 00:00:09.490
can add a new variable
to our data frame.

00:00:09.490 --> 00:00:13.070
Suppose that we want to add
a variable to our USDA data

00:00:13.070 --> 00:00:16.490
frame that takes a value
1 if the food has higher

00:00:16.490 --> 00:00:19.610
sodium than average,
and 0 if the food has

00:00:19.610 --> 00:00:21.710
lower sodium than average.

00:00:21.710 --> 00:00:24.090
Let's do this step by step.

00:00:24.090 --> 00:00:26.140
To check if the first
food in the dataset

00:00:26.140 --> 00:00:29.280
has a higher amount of sodium
compared to the average,

00:00:29.280 --> 00:00:34.270
we can simply ask R to dig up
the first value in the Sodium

00:00:34.270 --> 00:00:37.410
vector, using the square
brackets and the index 1.

00:00:37.410 --> 00:00:40.850
And then compare it using
the greater-than sign

00:00:40.850 --> 00:00:43.980
to the mean of
the Sodium vector,

00:00:43.980 --> 00:00:49.770
and then do not forget to remove
the non-available entries.

00:00:49.770 --> 00:00:51.670
And we obtain TRUE.

00:00:51.670 --> 00:00:53.300
How about the 50th food?

00:00:53.300 --> 00:00:55.580
Well, let's go back
using the Up arrow,

00:00:55.580 --> 00:01:00.950
and simply change the index 1
to 50, and now we get FALSE.

00:01:00.950 --> 00:01:03.320
This means that the first
food has higher sodium

00:01:03.320 --> 00:01:05.890
content than average,
and the 50th food

00:01:05.890 --> 00:01:09.090
has lower sodium
content than average.

00:01:09.090 --> 00:01:10.560
Now, we can write
the same command,

00:01:10.560 --> 00:01:12.860
but on all the vector Sodium.

00:01:12.860 --> 00:01:16.060
Let's use the Up arrow, and
delete the square brackets

00:01:16.060 --> 00:01:18.010
with the index 50.

00:01:18.010 --> 00:01:20.940
But we know we have 7,000
foods, and we really

00:01:20.940 --> 00:01:23.410
don't want to output
7,000 values right now.

00:01:23.410 --> 00:01:27.770
So how about instead, we just
save the output to a vector,

00:01:27.770 --> 00:01:30.370
and we're going to
call it HighSodium.

00:01:30.370 --> 00:01:33.460
And now let's look at the
structure of the HighSodium

00:01:33.460 --> 00:01:35.310
vector.

00:01:35.310 --> 00:01:38.080
And then we see that the
HighSodium vector indeed

00:01:38.080 --> 00:01:41.020
has all these values--
TRUE and FALSE--

00:01:41.020 --> 00:01:43.050
which are called logicals.

00:01:43.050 --> 00:01:49.080
So basically the type of the
HighSodium vector is logical.

00:01:49.080 --> 00:01:52.920
But remember, we said we
wanted values 1's and 0's.

00:01:52.920 --> 00:01:55.590
So instead of TRUE, we want 1.

00:01:55.590 --> 00:01:58.500
And instead of FALSE,
we want a value of 0.

00:01:58.500 --> 00:02:00.900
Well, to do this, we
need to change the data

00:02:00.900 --> 00:02:04.370
type of HighSodium
to numeric, and we

00:02:04.370 --> 00:02:07.300
can do this using the
as.numeric function.

00:02:07.300 --> 00:02:10.259
So let's use the Up
arrow twice, and then

00:02:10.259 --> 00:02:14.250
enclose this logical expression
by the as.numeric function.

00:02:14.250 --> 00:02:20.530
So as.numeric, and now look up
the structure of HighSodium,

00:02:20.530 --> 00:02:23.860
and now we see that we turned
it into a numerical vector with

00:02:23.860 --> 00:02:26.700
values 0's and 1's.

00:02:26.700 --> 00:02:28.690
Now, this vector,
HighSodium, is not

00:02:28.690 --> 00:02:31.760
associated with the
USDA data frame.

00:02:31.760 --> 00:02:35.940
How can we add a variable,
HighSodium, to our data frame?

00:02:35.940 --> 00:02:39.020
Well, simply we need to
use the dollar notation.

00:02:39.020 --> 00:02:43.010
So let's go back
twice to the command

00:02:43.010 --> 00:02:45.530
where we created the
HighSodium vector,

00:02:45.530 --> 00:02:47.900
and then simply right now,
instead of just calling

00:02:47.900 --> 00:02:51.360
it HighSodium, we associate
it with the USDA data

00:02:51.360 --> 00:02:53.880
frame using the dollar notation.

00:02:53.880 --> 00:02:56.780
Now, pressing Enter, and going
and checking the structure

00:02:56.780 --> 00:03:00.480
of the USDA data frame,
we see that we just added

00:03:00.480 --> 00:03:03.700
the HighSodium variable
that was not present before,

00:03:03.700 --> 00:03:09.230
and it's a numerical variable
with values 1's and 0's.

00:03:09.230 --> 00:03:13.040
Now we can do the same, and
add the variables HighProtein,

00:03:13.040 --> 00:03:16.770
HighCarbs, HighFat,
similarly to our data frame.

00:03:16.770 --> 00:03:19.620
Well, let's do this
quickly using the Up arrow,

00:03:19.620 --> 00:03:25.390
and then let's go and replace
Sodium now by Protein.

00:03:25.390 --> 00:03:28.870
So again, here Sodium
is replaced by Protein.

00:03:28.870 --> 00:03:34.200
And then we're going to call
this new variable HighProtein.

00:03:34.200 --> 00:03:36.980
And do the same with TotalFat.

00:03:36.980 --> 00:03:38.920
So instead of
Protein, we're going

00:03:38.920 --> 00:03:47.730
to have TotalFat, and then
replace it here again,

00:03:47.730 --> 00:03:53.240
and the variable name
is going to be HighFat.

00:03:53.240 --> 00:03:57.480
And finally,
Carbohydrates-- so here

00:03:57.480 --> 00:04:05.260
is the vector of Carbohydrates,
and this is, too,

00:04:05.260 --> 00:04:07.290
getting the
Carbohydrates vector.

00:04:10.150 --> 00:04:13.860
And finally, this last
variable that we want to add

00:04:13.860 --> 00:04:16.610
is called HighCarbs.

00:04:16.610 --> 00:04:20.570
And now looking at the structure
of the USDA data frame,

00:04:20.570 --> 00:04:22.980
we see that we
successfully added

00:04:22.980 --> 00:04:27.340
these three new variables, which
are high HighProtein, HighFat,

00:04:27.340 --> 00:04:30.270
and HighCarbs, in
addition to the HighSodium

00:04:30.270 --> 00:04:33.250
variable that we
added previously.

00:04:33.250 --> 00:04:35.990
So how can we now
find relationships

00:04:35.990 --> 00:04:40.140
between these variables, and
also the original variables

00:04:40.140 --> 00:04:43.190
that we had in the
USDA data frame?

00:04:43.190 --> 00:04:45.630
Well, we're going to
be using the table

00:04:45.630 --> 00:04:49.659
and the tapply functions
in our next video.