WEBVTT
00:00:00.520 --> 00:00:03.180
We have seen that several
properties, such as, for
00:00:03.180 --> 00:00:06.320
example, linearity of
expectations, are common for
00:00:06.320 --> 00:00:08.670
discrete and continuous
random variables.
00:00:08.670 --> 00:00:11.570
For this reason, it would be
nice to have a way of talking
00:00:11.570 --> 00:00:14.890
about the distribution of all
kinds of random variables
00:00:14.890 --> 00:00:17.920
without having to keep making
a distinction between the
00:00:17.920 --> 00:00:19.150
different types--
00:00:19.150 --> 00:00:20.865
discrete or continuous.
00:00:20.865 --> 00:00:23.900
This leads us to describe the
distribution of a random
00:00:23.900 --> 00:00:27.510
variable in a new way, in
terms of a so-called
00:00:27.510 --> 00:00:31.660
cumulative distribution function
or CDF for short.
00:00:31.660 --> 00:00:35.370
A CDF is defined as follows.
00:00:35.370 --> 00:00:38.620
The CDF is a function of a
single argument, which we
00:00:38.620 --> 00:00:41.060
denote by little
x in this case.
00:00:41.060 --> 00:00:43.600
And it gives us the probability
that the random
00:00:43.600 --> 00:00:46.210
variable takes a value less
than or equal to this
00:00:46.210 --> 00:00:47.930
particular little x.
00:00:47.930 --> 00:00:52.880
We will always use uppercase
Fs to indicate CDFs.
00:00:52.880 --> 00:00:56.760
And we will always have some
subscripts that indicate which
00:00:56.760 --> 00:01:00.090
random variable we're
talking about.
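NOTE
In symbols, the definition and notation just described can be written as
follows. The probability symbol \mathbf{P} is a typesetting assumption; the
captions only describe the formula in words.
```latex
F_X(x) = \mathbf{P}(X \le x), \qquad x \in \mathbb{R}
```
Here the subscript X names the random variable and little x is the argument.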
00:01:00.090 --> 00:01:02.800
The beauty of the CDF is
that it just involves a
00:01:02.800 --> 00:01:03.600
probability--
00:01:03.600 --> 00:01:06.280
a concept that is well defined,
no matter what kind
00:01:06.280 --> 00:01:08.920
of random variable we're
dealing with.
00:01:08.920 --> 00:01:13.710
So in particular, if X is a
continuous random variable,
00:01:13.710 --> 00:01:17.289
the probability that X
is less than or equal
00:01:17.289 --> 00:01:18.789
to a certain number--
00:01:18.789 --> 00:01:24.539
this is just the integral of the
PDF over that range from
00:01:24.539 --> 00:01:27.670
minus infinity up
to that number.
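NOTE
For a continuous random variable, the statement above can be sketched as the
formula below. The symbol f_X for the PDF and the dummy variable t are
notation assumed here; the captions do not show the board.
```latex
F_X(x) = \mathbf{P}(X \le x) = \int_{-\infty}^{x} f_X(t)\, dt
```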
00:01:27.670 --> 00:01:31.039
As a more concrete example,
let us consider a uniform
00:01:31.039 --> 00:01:35.330
random variable that ranges
between a and b, and let us
00:01:35.330 --> 00:01:39.030
just try to plot the
corresponding CDF.
00:01:39.030 --> 00:01:41.630
The CDF is a function
of little x.
00:01:41.630 --> 00:01:44.780
And the form that it takes
depends on what kind of x
00:01:44.780 --> 00:01:46.080
we're talking about.
00:01:46.080 --> 00:01:52.009
If little x falls somewhere here
to the left of a, and we
00:01:52.009 --> 00:01:54.440
ask for the probability that
our random variable takes
00:01:54.440 --> 00:01:58.400
values in this interval, then
this probability will be 0
00:01:58.400 --> 00:02:01.150
because all of the probability
of this uniform is
00:02:01.150 --> 00:02:03.010
between a and b.
00:02:03.010 --> 00:02:09.038
Therefore, the CDF is going to
be 0 for values of x less than
00:02:09.038 --> 00:02:10.990
or equal to a.
00:02:10.990 --> 00:02:14.100
How about the case where
x lies somewhere
00:02:14.100 --> 00:02:16.620
between a and b?
00:02:16.620 --> 00:02:20.710
In that case, the probability
that our random variable falls
00:02:20.710 --> 00:02:23.650
to the left of here--
00:02:23.650 --> 00:02:29.680
this is whatever mass there is
under the PDF when we consider
00:02:29.680 --> 00:02:32.430
the integral up to this
particular point.
00:02:32.430 --> 00:02:37.310
So we're looking at the area
under the PDF up to this
00:02:37.310 --> 00:02:39.030
particular point x.
00:02:39.030 --> 00:02:43.490
This area is of the form the
base of the rectangle, which
00:02:43.490 --> 00:02:47.360
is x minus a, times the height
of the rectangle, which is 1
00:02:47.360 --> 00:02:49.340
over b minus a.
00:02:49.340 --> 00:02:53.500
This is a linear function in x
that takes the value of 0 when
00:02:53.500 --> 00:03:00.170
x is equal to a, grows linearly,
and when x reaches a
00:03:00.170 --> 00:03:03.130
value of b, it becomes
equal to 1.
00:03:08.060 --> 00:03:12.830
How about the case where x
lies to the right of b?
00:03:12.830 --> 00:03:14.840
We're talking about the
probability that our random
00:03:14.840 --> 00:03:18.050
variable takes values less
than or equal to this
00:03:18.050 --> 00:03:19.840
particular x.
00:03:19.840 --> 00:03:22.420
But this includes the
entire probability
00:03:22.420 --> 00:03:23.900
mass of this uniform.
00:03:23.900 --> 00:03:27.090
We have unit mass on this
particular interval, so the
00:03:27.090 --> 00:03:32.030
probability of falling to the
left of here is equal to 1.
00:03:32.030 --> 00:03:35.210
And this is the shape of the CDF
for the case of a uniform
00:03:35.210 --> 00:03:36.220
random variable.
00:03:36.220 --> 00:03:39.880
It starts at 0, eventually it
rises, and eventually it
00:03:39.880 --> 00:03:43.290
reaches a value of 1
and stays constant.
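NOTE
The three cases worked out above (x to the left of a, x between a and b, x to
the right of b) can be collected into one small function. This is a minimal
sketch; the name uniform_cdf is hypothetical and not part of the lecture.
```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """CDF of a uniform random variable on [a, b], evaluated at x."""
    if not a < b:
        raise ValueError("need a < b")
    if x <= a:
        return 0.0  # no probability mass to the left of a
    if x >= b:
        return 1.0  # the entire unit mass lies to the left of b
    # Area of a rectangle: base (x - a) times height 1 / (b - a).
    return (x - a) / (b - a)
```
For example, with a = 1 and b = 3, the midpoint x = 2 gives a base of 1 and a
height of 1/2, so the function returns one half.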
00:03:43.290 --> 00:03:47.150
Coming back to the general case,
CDFs are very useful,
00:03:47.150 --> 00:03:50.900
because once we know the CDF of
a random variable, we have
00:03:50.900 --> 00:03:54.140
enough information to calculate
anything we might
00:03:54.140 --> 00:03:55.720
want to calculate.
00:03:55.720 --> 00:04:00.050
For example, consider the
following calculation.
00:04:00.050 --> 00:04:05.160
Let us look at the range of
numbers from minus infinity to
00:04:05.160 --> 00:04:08.320
3 and then up to 4.
00:04:08.320 --> 00:04:12.750
If we want to calculate the
probability that X is less
00:04:12.750 --> 00:04:17.329
than or equal to 4, we can
break it down as the
00:04:17.329 --> 00:04:22.130
probability that X is less
than or equal to 3--
00:04:22.130 --> 00:04:25.040
this is one term--
00:04:25.040 --> 00:04:35.860
plus the probability that X
falls between 3 and 4, which
00:04:35.860 --> 00:04:40.340
would be this event here.
00:04:40.340 --> 00:04:43.320
So this equality is true because
of the additivity
00:04:43.320 --> 00:04:45.940
property of probabilities.
00:04:45.940 --> 00:04:49.409
This event is broken down into
two possible events.
00:04:49.409 --> 00:04:53.070
Either X is less than or equal
to 3 or X is bigger than 3 but
00:04:53.070 --> 00:04:55.120
less than or equal to 4.
00:04:55.120 --> 00:04:59.050
But now we recognize that if we
know the CDF of the random
00:04:59.050 --> 00:05:01.580
variable, then we know
this quantity.
00:05:01.580 --> 00:05:04.820
We also know this quantity,
and this allows us to
00:05:04.820 --> 00:05:06.390
calculate this quantity.
00:05:06.390 --> 00:05:08.810
So we can calculate the
probability of a
00:05:08.810 --> 00:05:10.370
more general interval.
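NOTE
The additivity argument above rearranges to: the probability that X is bigger
than 3 but at most 4 equals the CDF at 4 minus the CDF at 3. A minimal sketch
follows. The helper names and the choice of a uniform on [2, 6] as the example
are assumptions made purely for illustration.
```python
from typing import Callable
def interval_probability(cdf: Callable[[float], float], lo: float, hi: float) -> float:
    """P(lo < X <= hi), computed from the CDF alone."""
    if lo > hi:
        raise ValueError("need lo <= hi")
    # Additivity: P(X <= hi) = P(X <= lo) + P(lo < X <= hi).
    return cdf(hi) - cdf(lo)
# Illustrative CDF (an assumption): X uniform on [2, 6].
def example_cdf(x: float) -> float:
    return min(1.0, max(0.0, (x - 2.0) / 4.0))
p_3_to_4 = interval_probability(example_cdf, 3.0, 4.0)
```
For this example the CDF is 1/2 at 4 and 1/4 at 3, so p_3_to_4 is 1/4.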
00:05:10.370 --> 00:05:13.140
So in general, the CDF contains
all available
00:05:13.140 --> 00:05:15.770
probabilistic information
about a random variable.
00:05:15.770 --> 00:05:19.170
It is just a different way of
describing the probability
00:05:19.170 --> 00:05:20.280
distribution.
00:05:20.280 --> 00:05:22.900
From the CDF, we can recover
any quantity we
00:05:22.900 --> 00:05:24.530
might wish to know.
00:05:24.530 --> 00:05:27.430
And for continuous random
variables, the CDF actually
00:05:27.430 --> 00:05:32.680
has enough information for us to
be able to recover the PDF.
00:05:32.680 --> 00:05:34.230
How can we do that?
00:05:34.230 --> 00:05:37.050
Let's look at this relation
here, and let's take
00:05:37.050 --> 00:05:39.510
derivatives of both sides.
00:05:39.510 --> 00:05:44.040
On the left, we obtain the
derivative of the CDF.
00:05:44.040 --> 00:05:47.150
And let's evaluate it at
a particular point x.
00:05:47.150 --> 00:05:50.250
What do we get on the right?
00:05:50.250 --> 00:05:54.850
By basic calculus results, the
derivative of an integral,
00:05:54.850 --> 00:05:57.820
with respect to the upper limit
of the integration, is
00:05:57.820 --> 00:06:00.400
just the integrand itself.
00:06:00.400 --> 00:06:02.500
So it is the density itself.
00:06:05.690 --> 00:06:08.830
So this is a very useful
formula, which tells us that
00:06:08.830 --> 00:06:12.730
once we have the CDF, we
can calculate the PDF.
00:06:12.730 --> 00:06:16.830
And conversely, if we have the
PDF, we can find the CDF by
00:06:16.830 --> 00:06:18.120
integrating.
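NOTE
The differentiation step just described can be written out as below. As
before, f_X and the dummy variable t are notation assumed here, and the
identity holds at points x where the CDF is differentiable.
```latex
\frac{dF_X}{dx}(x) = \frac{d}{dx} \int_{-\infty}^{x} f_X(t)\, dt = f_X(x)
```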
00:06:18.120 --> 00:06:21.410
Of course, this formula can
only be correct at those
00:06:21.410 --> 00:06:24.920
places where the CDF
has a derivative.
00:06:24.920 --> 00:06:28.480
For example, at this corner
here, the derivative of the
00:06:28.480 --> 00:06:30.370
CDF is not well defined.
00:06:30.370 --> 00:06:33.150
We would get a different value
if we differentiate from the
00:06:33.150 --> 00:06:35.580
left, a different value when
we differentiate from the
00:06:35.580 --> 00:06:38.370
right, so we cannot apply
this formula.
00:06:38.370 --> 00:06:42.760
But at those places where the
CDF is differentiable, at
00:06:42.760 --> 00:06:44.940
those places we can find
the corresponding
00:06:44.940 --> 00:06:47.030
value of the PDF.
00:06:47.030 --> 00:06:51.159
For instance, in this diagram,
at this point the CDF is
00:06:51.159 --> 00:06:52.590
differentiable.
00:06:52.590 --> 00:06:56.850
The derivative is equal to the
slope, which is this quantity.
00:06:56.850 --> 00:07:01.070
And this quantity happens to
be exactly the same as the
00:07:01.070 --> 00:07:03.090
value of the PDF.
00:07:03.090 --> 00:07:07.430
So indeed, here, we see that the
PDF can be found by taking
00:07:07.430 --> 00:07:11.820
the derivative of the CDF.
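NOTE
A quick numerical check of the uniform example, as a sketch only: the slope of
the CDF at an interior point is compared with the PDF height 1/(b - a), and
the one-sided slopes at the corner a are computed to show that they disagree.
The endpoints a = 1, b = 3 and the step size h are hypothetical choices.
```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """CDF of a uniform random variable on [a, b]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))
a, b = 1.0, 3.0   # hypothetical endpoints
h = 1e-6          # small step for finite differences
F = lambda t: uniform_cdf(t, a, b)
# Symmetric difference at an interior point, where the CDF is differentiable.
x = 2.0
interior_slope = (F(x + h) - F(x - h)) / (2.0 * h)
pdf_height = 1.0 / (b - a)
# One-sided differences at the corner x = a, where the derivative is undefined.
left_slope = (F(a) - F(a - h)) / h
right_slope = (F(a + h) - F(a)) / h
```
The interior slope agrees with the PDF height up to rounding, while the left
and right slopes at the corner differ (0 versus 1/(b - a)).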
00:07:11.820 --> 00:07:15.390
Now, as we discussed earlier,
CDFs are relevant to all types
00:07:15.390 --> 00:07:16.510
of random variables.
00:07:16.510 --> 00:07:19.200
So in particular, they are
also relevant to discrete
00:07:19.200 --> 00:07:20.580
random variables.
00:07:20.580 --> 00:07:23.000
For a discrete random variable,
the CDF is, of
00:07:23.000 --> 00:07:27.230
course, defined the same way,
except that we calculate this
00:07:27.230 --> 00:07:31.460
probability by adding the
probabilities of all possible
00:07:31.460 --> 00:07:35.050
values of the random variable
that are less than
00:07:35.050 --> 00:07:35.115
[or equal to]
00:07:35.115 --> 00:07:38.650
the particular little x that
we're considering.
00:07:38.650 --> 00:07:41.720
So we have a summation instead
of an integral.
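NOTE
For a discrete random variable, the summation just described can be written
as below. The symbol p_X for the PMF and the index k are notation assumed
here.
```latex
F_X(x) = \mathbf{P}(X \le x) = \sum_{k \le x} p_X(k)
```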
00:07:41.720 --> 00:07:43.500
Let us look at an example.
00:07:43.500 --> 00:07:46.030
This is an example of a discrete
random variable
00:07:46.030 --> 00:07:47.940
described by a PMF.
00:07:47.940 --> 00:07:51.530
And let us try to calculate
the corresponding CDF.
00:07:51.530 --> 00:07:55.770
The probability of falling to
the left of this number, for
00:07:55.770 --> 00:07:58.020
example, is equal to 0.
00:07:58.020 --> 00:08:03.160
And all the way up to 1, there
is 0 probability of getting a
00:08:03.160 --> 00:08:06.470
value for the random variable
less than that.
00:08:06.470 --> 00:08:11.380
But now, if we let x be equal
to 1, then we're talking
00:08:11.380 --> 00:08:17.210
about the probability that the
random variable takes a value
00:08:17.210 --> 00:08:19.540
less than or equal to 1.
00:08:19.540 --> 00:08:23.670
And because this includes the
value of 1, this probability
00:08:23.670 --> 00:08:25.880
would be equal to 1/4.
00:08:25.880 --> 00:08:29.810
This means that once we reach
this point, the value of the
00:08:29.810 --> 00:08:34.320
CDF becomes 1/4.
00:08:34.320 --> 00:08:38.500
At this point, the
CDF makes a jump.
00:08:38.500 --> 00:08:42.440
At 1, the value of the
CDF is equal to 1/4.
00:08:42.440 --> 00:08:47.340
Just before 1, the value of
the CDF was equal to 0.
00:08:47.340 --> 00:08:50.080
Now what's the probability
of falling to the left
00:08:50.080 --> 00:08:52.320
of, let's say, 2?
00:08:52.320 --> 00:08:54.880
This probability is again 1/4.
00:08:54.880 --> 00:08:58.450
There's no change in the
probability as we keep moving
00:08:58.450 --> 00:09:00.280
inside this interval.
00:09:00.280 --> 00:09:04.590
So the CDF stays constant, until
at some point we reach
00:09:04.590 --> 00:09:07.730
the value of 3.
00:09:07.730 --> 00:09:11.410
And at that point, the
probability that the random
00:09:11.410 --> 00:09:15.750
variable takes a value less than
or equal to 3 is going to
00:09:15.750 --> 00:09:20.520
be the probability of a 3 plus
the probability of a 1, which
00:09:20.520 --> 00:09:22.675
becomes 3 over 4.
00:09:29.040 --> 00:09:32.940
For any other x in this
interval, the probability that
00:09:32.940 --> 00:09:36.300
the random variable takes a
value less than this number
00:09:36.300 --> 00:09:42.730
will stay at 1/4 plus 1/2, so
the CDF stays constant.
00:09:42.730 --> 00:09:47.200
And at this point, the
probability of being less than
00:09:47.200 --> 00:09:53.160
or equal to 4, this probability
becomes 1.
00:09:53.160 --> 00:09:59.130
And so the CDF jumps once
more to a value of 1.
00:09:59.130 --> 00:10:03.800
Again, at the places where the
CDF makes a jump, which one of
00:10:03.800 --> 00:10:05.470
the two is the correct value?
00:10:05.470 --> 00:10:08.060
The correct value is this one.
00:10:08.060 --> 00:10:13.250
And this is because the CDF is
defined by using a less than
00:10:13.250 --> 00:10:18.280
or equal sign in the probability
involved here.
00:10:18.280 --> 00:10:21.630
So in the case of discrete
random variables, the CDF
00:10:21.630 --> 00:10:24.310
takes the form of a staircase
function.
00:10:24.310 --> 00:10:25.530
It starts at 0.
00:10:25.530 --> 00:10:27.290
It ends up at 1.
00:10:27.290 --> 00:10:32.390
It has a jump at those points
where the PMF assigns a
00:10:32.390 --> 00:10:33.810
positive mass.
00:10:33.810 --> 00:10:39.630
And the size of the jump
is exactly equal to the
00:10:39.630 --> 00:10:43.870
corresponding value
of the PMF.
00:10:43.870 --> 00:10:49.450
Similarly, the size of the PMF
here is 1/4, and so the size
00:10:49.450 --> 00:10:52.190
of the corresponding jump
in the CDF will
00:10:52.190 --> 00:10:55.390
also be equal to 1/4.
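NOTE
The staircase above can be reproduced with a short sketch. The PMF values
(1/4 at 1, 1/2 at 3, 1/4 at 4) are read off the numbers quoted in the example;
the function names are hypothetical. Exact fractions are used so that the
jumps can be compared without rounding.
```python
from fractions import Fraction
# PMF from the example: mass 1/4 at 1, 1/2 at 3, 1/4 at 4.
pmf = {1: Fraction(1, 4), 3: Fraction(1, 2), 4: Fraction(1, 4)}
def discrete_cdf(x: float) -> Fraction:
    """P(X <= x): sum the PMF over all values k with k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))
def jump_size(k: int) -> Fraction:
    """CDF value at k minus its value just below k (the left limit)."""
    just_below = sum((p for v, p in pmf.items() if v < k), Fraction(0))
    return discrete_cdf(k) - just_below
```
Because the definition uses "less than or equal", discrete_cdf(1) already
includes the mass at 1, and jump_size(k) equals the PMF at k (zero where the
PMF assigns no mass).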
00:10:55.390 --> 00:10:59.140
CDFs have some general
properties, and we have seen a
00:10:59.140 --> 00:11:03.110
hint of those properties in
what we have done so far.
00:11:03.110 --> 00:11:07.060
So the CDF is, by definition,
the probability of obtaining a
00:11:07.060 --> 00:11:11.025
value less than or equal to
a certain number little x.
00:11:11.025 --> 00:11:13.810
It's the probability
of this interval.
00:11:13.810 --> 00:11:18.040
If I were to take a larger
interval, and go up to some
00:11:18.040 --> 00:11:20.380
larger number y, this
would be the
00:11:20.380 --> 00:11:22.100
probability of a bigger interval.
00:11:22.100 --> 00:11:25.770
So that probability would
only be bigger.
00:11:25.770 --> 00:11:28.990
And this translates into the
fact that the CDF is a
00:11:28.990 --> 00:11:31.340
non-decreasing function.
00:11:31.340 --> 00:11:36.970
If y is larger than or equal to
x, as in this picture, then
00:11:36.970 --> 00:11:41.690
the value of the CDF evaluated
at that point y is going to be
00:11:41.690 --> 00:11:45.830
larger than or equal to the
CDF evaluated at that
00:11:45.830 --> 00:11:47.700
point little x.
00:11:47.700 --> 00:11:52.270
Another property of the CDF
is that as x goes to
00:11:52.270 --> 00:11:56.060
infinity, we're talking about
the probability essentially of
00:11:56.060 --> 00:11:58.200
the entire real line.
00:11:58.200 --> 00:12:00.900
And so the CDF will
converge to 1.
00:12:00.900 --> 00:12:06.160
On the other hand, if x tends
to minus infinity, we're
00:12:06.160 --> 00:12:09.740
talking about the probability of
an interval to the left of
00:12:09.740 --> 00:12:14.650
a point that's all the way out,
further and further out.
00:12:14.650 --> 00:12:17.640
That probability has to
diminish, and eventually
00:12:17.640 --> 00:12:19.070
converge to 0.
00:12:19.070 --> 00:12:24.070
So in general, CDFs
asymptotically start at 0.
00:12:24.070 --> 00:12:25.500
They can never go down.
00:12:25.500 --> 00:12:27.410
They can only go up.
00:12:27.410 --> 00:12:32.560
And in the limit, as x goes to
infinity, the CDF has to
00:12:32.560 --> 00:12:34.230
approach 1.
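NOTE
The three general properties stated above, collected in symbols. This is a
summary sketch in the F_X notation used for CDFs in this lecture.
```latex
\begin{gather*}
y \ge x \;\Longrightarrow\; F_X(y) \ge F_X(x), \\
\lim_{x \to \infty} F_X(x) = 1, \\
\lim_{x \to -\infty} F_X(x) = 0.
\end{gather*}
```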
00:12:34.230 --> 00:12:37.800
Actually in the examples that we
saw earlier, it reaches the
00:12:37.800 --> 00:12:41.640
value of 1 after a certain
finite point.
00:12:41.640 --> 00:12:44.420
But in general, for general
random variables, it might
00:12:44.420 --> 00:12:47.270
only reach the value
1 asymptotically.
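NOTE
As an illustration of a CDF that approaches 1 only in the limit, here is a
sketch using an exponential random variable. That distribution is not one of
the lecture's examples; it is an assumption chosen for illustration, with CDF
1 - exp(-rate * x) for x at or above 0.
```python
import math
def exponential_cdf(x: float, rate: float = 1.0) -> float:
    """CDF of an exponential random variable: 0 for x < 0, else 1 - exp(-rate * x)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-rate * x)
points = [0.0, 1.0, 5.0, 10.0, 20.0]
values = [exponential_cdf(x) for x in points]
```
The values increase toward 1 but stay strictly below it at each of these
points. Mathematically the CDF never equals 1 at any finite x, although for
very large x floating-point rounding can return exactly 1.0.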