WEBVTT

00:00:00.580 --> 00:00:03.220
The covariance between
two random variables

00:00:03.220 --> 00:00:04.930
tells us something
about the strength

00:00:04.930 --> 00:00:07.160
of the dependence between them.

00:00:07.160 --> 00:00:09.890
But it is not so easy to
interpret qualitatively.

00:00:09.890 --> 00:00:13.100
For example, if I tell you
that the covariance of X and Y

00:00:13.100 --> 00:00:15.690
is equal to 5, this
does not tell you

00:00:15.690 --> 00:00:20.990
very much about whether X and
Y are closely related or not.

00:00:20.990 --> 00:00:25.300
Another difficulty is that
if X and Y are in units,

00:00:25.300 --> 00:00:27.830
let's say, of meters,
then the covariance

00:00:27.830 --> 00:00:30.760
will have units
of meters squared.

00:00:30.760 --> 00:00:32.820
And this is hard to interpret.

00:00:32.820 --> 00:00:36.160
A much more informative quantity
is the so-called correlation

00:00:36.160 --> 00:00:39.880
coefficient, which is
a dimensionless version

00:00:39.880 --> 00:00:41.960
of the covariance.

00:00:41.960 --> 00:00:45.750
It is defined by
this formula here.

00:00:45.750 --> 00:00:48.480
We just take the
covariance and divide it

00:00:48.480 --> 00:00:51.430
by the product of the
standard deviations of the two

00:00:51.430 --> 00:00:53.210
random variables.

00:00:53.210 --> 00:00:57.660
Now, if X has units of meters,
then the standard deviation

00:00:57.660 --> 00:01:00.140
also has units of meters.

00:01:00.140 --> 00:01:03.430
And so this ratio
will be dimensionless.

00:01:03.430 --> 00:01:06.300
And it is not affected by
the units that we're using.

00:01:06.300 --> 00:01:08.990
The same is true
for this ratio here,

00:01:08.990 --> 00:01:13.100
and this is why the correlation
coefficient does not

00:01:13.100 --> 00:01:16.090
have any units of its own.

00:01:16.090 --> 00:01:20.330
One remark-- if we're dealing
with a random variable whose

00:01:20.330 --> 00:01:23.450
standard deviation
is equal to 0--

00:01:23.450 --> 00:01:25.850
so its variance is
also equal to 0--

00:01:25.850 --> 00:01:27.890
then we have a random
variable, which

00:01:27.890 --> 00:01:30.460
is identically
equal to a constant.

00:01:30.460 --> 00:01:33.620
Well, for such cases of
degenerate random variables,

00:01:33.620 --> 00:01:35.700
then the correlation
coefficient is not

00:01:35.700 --> 00:01:40.450
defined, because it would
have involved a division by 0.

00:01:40.450 --> 00:01:43.500
A very important property of
the correlation coefficient

00:01:43.500 --> 00:01:44.940
is the following.

00:01:44.940 --> 00:01:47.220
It turns out that the
correlation coefficient

00:01:47.220 --> 00:01:50.440
is always between minus 1 and 1.

00:01:50.440 --> 00:01:54.130
And this allows us to judge
whether a certain correlation

00:01:54.130 --> 00:01:56.930
coefficient is big or
not, because we now

00:01:56.930 --> 00:01:58.789
have an absolute scale.

00:01:58.789 --> 00:02:03.070
And so it does provide a measure
of the degree to which two

00:02:03.070 --> 00:02:05.530
random variables are associated.

00:02:05.530 --> 00:02:07.620
To interpret the
correlation coefficient,

00:02:07.620 --> 00:02:10.820
let's now look at
some extreme cases.

00:02:10.820 --> 00:02:14.820
Suppose that X and
Y are independent.

00:02:14.820 --> 00:02:17.570
In that case, we know
that the covariance

00:02:17.570 --> 00:02:19.490
is going to be equal to 0.

00:02:19.490 --> 00:02:21.480
And therefore, the
correlation coefficient

00:02:21.480 --> 00:02:23.780
is also going to be equal to 0.

00:02:23.780 --> 00:02:26.690
And in that case, we say
that the two random variables

00:02:26.690 --> 00:02:28.590
are uncorrelated.

00:02:28.590 --> 00:02:32.810
However, the converse
statement is not true.

00:02:32.810 --> 00:02:35.960
We have seen already
an example in which we

00:02:35.960 --> 00:02:39.310
have zero covariance and
therefore zero correlation,

00:02:39.310 --> 00:02:44.220
but yet the two random
variables were dependent.

00:02:44.220 --> 00:02:46.130
Let us now look at
the other extreme,

00:02:46.130 --> 00:02:48.400
where the two
random variables are

00:02:48.400 --> 00:02:51.180
as dependent as they can be.

00:02:51.180 --> 00:02:53.300
So let's look at the
correlation coefficient

00:02:53.300 --> 00:02:56.840
of one random
variable with itself.

00:02:56.840 --> 00:02:58.520
What is it going to be?

00:02:58.520 --> 00:03:01.650
The covariance of a random
variable with itself

00:03:01.650 --> 00:03:07.090
is just the variance of
that random variable, now,

00:03:07.090 --> 00:03:09.950
sigma X is going to be
the same as sigma Y,

00:03:09.950 --> 00:03:12.900
because we're taking
Y to be the same as X.

00:03:12.900 --> 00:03:15.260
So we're dividing
by sigma X squared.

00:03:15.260 --> 00:03:18.420
But the square of the standard
deviation is the variance.

00:03:18.420 --> 00:03:21.500
So we obtain a value of 1.

00:03:21.500 --> 00:03:23.579
So a correlation
coefficient of 1

00:03:23.579 --> 00:03:27.770
shows up in such a case
of an extreme dependence.

00:03:27.770 --> 00:03:31.220
If instead we had taken the
correlation coefficient of X

00:03:31.220 --> 00:03:34.510
with the negative
of X, in that case,

00:03:34.510 --> 00:03:37.200
we would have obtained a
correlation coefficient

00:03:37.200 --> 00:03:38.540
of minus 1.

00:03:41.540 --> 00:03:43.400
A somewhat more
general situation

00:03:43.400 --> 00:03:47.970
than the one we considered
here is the following.

00:03:47.970 --> 00:03:50.800
If we have two
random variables that

00:03:50.800 --> 00:03:54.560
have a linear relationship--
that is, if I know Y

00:03:54.560 --> 00:03:58.540
I can figure out the value
of X with absolute certainty,

00:03:58.540 --> 00:04:02.600
and I can figure it out
by using a linear formula.

00:04:02.600 --> 00:04:06.840
In this case, it turns out that
the correlation coefficient

00:04:06.840 --> 00:04:09.660
is either plus 1 or minus 1.

00:04:09.660 --> 00:04:11.380
And the converse is true.

00:04:11.380 --> 00:04:14.310
If the correlation coefficient
has absolute value of 1,

00:04:14.310 --> 00:04:15.820
then the two random
variables obey

00:04:15.820 --> 00:04:19.899
a deterministic linear
relation between them.

00:04:19.899 --> 00:04:24.380
So to conclude, an extreme value
for the correlation coefficient

00:04:24.380 --> 00:04:28.280
of plus or minus 1 is
equivalent to having

00:04:28.280 --> 00:04:31.730
a deterministic relation
between the two random variables

00:04:31.730 --> 00:04:33.820
involved.

00:04:33.820 --> 00:04:36.500
A final remark about
the algebraic properties

00:04:36.500 --> 00:04:40.530
of the correlation
coefficient- What can we

00:04:40.530 --> 00:04:42.250
say about the
correlation coefficient

00:04:42.250 --> 00:04:46.450
of a linear function of a
random variable with another?

00:04:46.450 --> 00:04:48.760
Well, we already know
something about what

00:04:48.760 --> 00:04:52.280
happens to the covariance when
we form a linear function.

00:04:52.280 --> 00:04:57.430
And the covariance of aX plus
b with Y is related this way

00:04:57.430 --> 00:05:01.980
to the covariance of X with Y.
Now, let us use this property

00:05:01.980 --> 00:05:04.830
and calculate the correlation
coefficient between aX

00:05:04.830 --> 00:05:06.630
plus b and Y.

00:05:06.630 --> 00:05:10.010
In the numerator, we
have the covariance of aX

00:05:10.010 --> 00:05:13.890
plus b with Y, which
is equal to a times

00:05:13.890 --> 00:05:20.640
the covariance of X with
Y. At the denominator,

00:05:20.640 --> 00:05:27.120
we have the standard deviation
of this random variable.

00:05:27.120 --> 00:05:30.310
Now, the standard deviation
of this random variable

00:05:30.310 --> 00:05:34.450
is equal to a times the
standard deviation of X,

00:05:34.450 --> 00:05:36.080
if a is positive.

00:05:36.080 --> 00:05:39.110
If a is negative, then we
need to put the minus sign.

00:05:39.110 --> 00:05:43.450
But in either case, we will
have here the absolute value

00:05:43.450 --> 00:05:45.940
of a times the
standard deviation

00:05:45.940 --> 00:05:49.030
of X. And then we divide
by the standard deviation

00:05:49.030 --> 00:05:52.370
of the second random
variable, which is Y.

00:05:52.370 --> 00:05:57.080
And so what we obtain
here is this ratio, which

00:05:57.080 --> 00:06:01.840
is a correlation coefficient
of X with Y times

00:06:01.840 --> 00:06:07.160
this quantity, which
is the sign of a.

00:06:07.160 --> 00:06:11.690
So we have the sign of a times
the correlation coefficient

00:06:11.690 --> 00:06:15.950
of X with Y. So in
particular, the magnitude

00:06:15.950 --> 00:06:18.050
of the correlation
coefficient is not

00:06:18.050 --> 00:06:22.370
going to change when we
replace X by aX plus b.

00:06:22.370 --> 00:06:24.210
And this essentially
means that if we

00:06:24.210 --> 00:06:27.280
change the units of
the random variable X,

00:06:27.280 --> 00:06:29.300
for example, suppose
that X was degrees

00:06:29.300 --> 00:06:34.180
Celsius and aX plus b is
degrees Fahrenheit, going

00:06:34.180 --> 00:06:37.850
from one set of units,
Celsius degrees,

00:06:37.850 --> 00:06:41.490
to another set of units,
degrees in Fahrenheit,

00:06:41.490 --> 00:06:43.570
is not going to
change the correlation

00:06:43.570 --> 00:06:46.909
coefficient of the temperature
with some other random

00:06:46.909 --> 00:06:48.690
variable.

00:06:48.690 --> 00:06:51.380
So this is a nice property of
the correlation coefficient,

00:06:51.380 --> 00:06:54.710
again, which reflects the
fact that it's dimensionless,

00:06:54.710 --> 00:06:57.490
it doesn't have any
units of its own,

00:06:57.490 --> 00:07:00.120
and it doesn't depend
on what kinds of units

00:07:00.120 --> 00:07:03.360
we use for each one of
the random variables.