WEBVTT

00:00:00.600 --> 00:00:03.060
In this segment we look
into the probability

00:00:03.060 --> 00:00:05.570
that the sum of n
independent identically

00:00:05.570 --> 00:00:07.760
distributed random
variables takes

00:00:07.760 --> 00:00:10.790
an abnormally large value.

00:00:10.790 --> 00:00:14.460
We will get an upper bound
on this quantity, which

00:00:14.460 --> 00:00:16.800
is known as
Hoeffding's inequality.

00:00:16.800 --> 00:00:19.800
This is an upper bound that
applies to a special case,

00:00:19.800 --> 00:00:23.040
although the method
actually generalizes.

00:00:23.040 --> 00:00:25.800
Here is the special case
that we will consider.

00:00:25.800 --> 00:00:28.900
The random variables,
the Xi's, are

00:00:28.900 --> 00:00:33.030
equally likely to take the
values minus 1 and plus

00:00:33.030 --> 00:00:34.755
1, with equal probability.

00:00:37.420 --> 00:00:40.140
And we're interested
in the random variable,

00:00:40.140 --> 00:00:42.270
which is the sum of the X's.

00:00:46.300 --> 00:00:49.180
What do we know about
this random variable?

00:00:49.180 --> 00:00:54.000
Well, the expected value of each
one of the Xi's is equal to 0,

00:00:54.000 --> 00:00:56.410
because the distribution
is symmetric.

00:00:56.410 --> 00:01:00.340
And also, the distance
of Xi from the mean

00:01:00.340 --> 00:01:02.410
has always magnitude 1.

00:01:02.410 --> 00:01:04.379
And for this
reason, the variance

00:01:04.379 --> 00:01:08.130
of the Xi's is equal to 1.

00:01:08.130 --> 00:01:14.180
For this reason, the random
variable Y has a mean of 0

00:01:14.180 --> 00:01:16.690
and a variance equal to n.

00:01:21.050 --> 00:01:24.630
Now, what do we know about
the random variable Y?

00:01:24.630 --> 00:01:30.350
By the central limit theorem,
Y has an approximately normal

00:01:30.350 --> 00:01:32.370
distribution.

00:01:32.370 --> 00:01:35.990
The distribution
is centered at 0.

00:01:35.990 --> 00:01:42.670
And also, the random variable
Y over square root of n,

00:01:42.670 --> 00:01:45.700
this is a standardized
random variable.

00:01:45.700 --> 00:01:47.520
So it's approximately normal.

00:01:47.520 --> 00:01:50.910
And so the probability that this
number is larger than or equal

00:01:50.910 --> 00:01:57.840
to some a is approximately
1 minus the cumulative

00:01:57.840 --> 00:02:00.350
of the standard normal.

00:02:00.350 --> 00:02:04.815
So Phi here stands for
the standard normal CDF.

00:02:10.400 --> 00:02:12.310
What does this tell us?

00:02:12.310 --> 00:02:15.130
It tells us that if we
take somewhere here,

00:02:15.130 --> 00:02:22.860
the number square root of n
times a, then this probability

00:02:22.860 --> 00:02:26.940
down here in the tail is
approximately constant,

00:02:26.940 --> 00:02:29.250
no mater what n is.

00:02:29.250 --> 00:02:32.260
And this in particular
tells us that values

00:02:32.260 --> 00:02:38.940
of the order of square root
n are fairly likely to occur.

00:02:38.940 --> 00:02:42.190
However, what we're
interested in here is not

00:02:42.190 --> 00:02:44.680
being larger than
square root n times a.

00:02:44.680 --> 00:02:48.040
We're interested in being
larger than n times a.

00:02:48.040 --> 00:02:52.970
So we're talking about
what happens further down

00:02:52.970 --> 00:02:55.450
in the tail of the distribution.

00:02:55.450 --> 00:02:58.710
So if we take here
n times a, we're

00:02:58.710 --> 00:03:01.530
looking at this
probability down here.

00:03:01.530 --> 00:03:06.210
And we want to ask, how
small is that probability?

00:03:06.210 --> 00:03:08.430
Well, we have
Chebyshev's inequality.

00:03:08.430 --> 00:03:10.775
And Chebyshev's
inequality tells us

00:03:10.775 --> 00:03:14.950
that the probability of Y being
larger than a certain number

00:03:14.950 --> 00:03:19.850
is less than or equal to
the variance of Y divided

00:03:19.850 --> 00:03:21.829
by the square of that number.

00:03:24.510 --> 00:03:27.100
And in this case, since
the variance is n.

00:03:27.100 --> 00:03:30.470
This is 1 over n a squared.

00:03:30.470 --> 00:03:32.210
So Chebyshev's
inequality tells us

00:03:32.210 --> 00:03:35.460
that this probability
goes to 0, and it

00:03:35.460 --> 00:03:40.829
goes to 0 at least as
fast as 1/n goes to 0.

00:03:40.829 --> 00:03:44.640
However, it turns out that
this is extremely conservative.

00:03:44.640 --> 00:03:47.930
Hoeffding's inequality, which
we're going to establish,

00:03:47.930 --> 00:03:50.350
tells us something
much stronger.

00:03:50.350 --> 00:03:54.230
It tells us that this
tail probability down here

00:03:54.230 --> 00:03:59.110
falls exponentially with n.

00:03:59.110 --> 00:04:01.440
So this is what we want to show.

00:04:01.440 --> 00:04:07.130
And let us get started to
see how the derivation goes.

00:04:07.130 --> 00:04:10.900
The derivation relies
on a beautiful trick.

00:04:10.900 --> 00:04:13.900
Instead of looking
at this event here,

00:04:13.900 --> 00:04:17.440
we're going to look at the
following equivalent event.

00:04:20.399 --> 00:04:22.780
Let us fix some number s.

00:04:25.390 --> 00:04:30.520
We're going to leave the
choice of s free for now.

00:04:30.520 --> 00:04:33.130
It is only that
we're going to assume

00:04:33.130 --> 00:04:35.720
that s is a positive number.

00:04:38.680 --> 00:04:40.850
And throughout,
we're also assuming

00:04:40.850 --> 00:04:43.380
that a is also a
positive number.

00:04:46.070 --> 00:04:49.300
Now, we look at this
quantity, and the event

00:04:49.300 --> 00:04:52.240
that this quantity is
larger than or equal

00:04:52.240 --> 00:04:57.920
to e to the sn times a.

00:04:57.920 --> 00:05:02.460
Now, this sum is larger than
or equal to na if and only

00:05:02.460 --> 00:05:07.216
if this quantity is larger
than or equal to e to the sna.

00:05:07.216 --> 00:05:10.520
This is because a and
s are both positive.

00:05:10.520 --> 00:05:12.510
So the direction
of the inequalities

00:05:12.510 --> 00:05:14.840
does not get reversed,
and also because

00:05:14.840 --> 00:05:18.360
the exponential
function is monotonic.

00:05:18.360 --> 00:05:23.000
So this event is the
same as that event.

00:05:23.000 --> 00:05:26.050
So we will try to say
something about the probability

00:05:26.050 --> 00:05:27.660
of this event.

00:05:27.660 --> 00:05:29.680
How are we going to do it?

00:05:29.680 --> 00:05:31.880
We will use the
Markov inequality,

00:05:31.880 --> 00:05:37.740
where Z is the random
variable that appears here.

00:05:37.740 --> 00:05:41.450
So by Markov's inequality,
this probability

00:05:41.450 --> 00:05:45.150
is less than or equal
to the expected value

00:05:45.150 --> 00:05:46.930
of the random
variable that we are

00:05:46.930 --> 00:05:56.550
dealing with divided
by this value.

00:06:02.190 --> 00:06:05.660
Now, the exponential
of a sum, we

00:06:05.660 --> 00:06:08.410
can factor it as a
product of exponentials.

00:06:20.950 --> 00:06:26.220
And then we use the assumption
that the X's are independent.

00:06:26.220 --> 00:06:28.870
Since the X's are
independent, e to the sX1

00:06:28.870 --> 00:06:32.510
is independent from e
to the sX2 and so on.

00:06:32.510 --> 00:06:34.409
And so we have
the expected value

00:06:34.409 --> 00:06:37.950
of a product of independent
random variables.

00:06:37.950 --> 00:06:42.670
And so this is equal to the
product of the expectations.

00:06:42.670 --> 00:06:46.909
So we're going to multiply
the expected value of e

00:06:46.909 --> 00:06:52.230
to the sX1 with the expected
value of e to the sX2

00:06:52.230 --> 00:06:52.890
and so on.

00:06:52.890 --> 00:06:56.710
But because all the Xi's
are identically distributed,

00:06:56.710 --> 00:06:59.440
the terms we get
are all the same.

00:06:59.440 --> 00:07:04.880
So we get this term to the
nth power divided, again,

00:07:04.880 --> 00:07:05.990
by e to the sna.

00:07:09.320 --> 00:07:14.270
Or we can write this in more
suggestive form, as follows.

00:07:14.270 --> 00:07:21.720
It's the expected value of
e to the sX1 divided by e

00:07:21.720 --> 00:07:26.750
to the sa, all of
that to the nth power.

00:07:26.750 --> 00:07:31.590
So think of that as being some
number rho to the nth power.

00:07:34.330 --> 00:07:37.580
When is this bound
going to be interesting?

00:07:37.580 --> 00:07:41.930
It's going to be interesting
if rho is less than 1,

00:07:41.930 --> 00:07:47.450
because in that case, this bound
falls exponentially with n.

00:07:47.450 --> 00:07:49.470
And so this probability
in particular

00:07:49.470 --> 00:07:51.610
will fall exponentially with n.

00:07:55.050 --> 00:07:58.560
The key here is that we
have freedom to choose s.

00:07:58.560 --> 00:08:02.940
For any value of s, we
obtain an upper bound.

00:08:02.940 --> 00:08:07.390
We're going to choose s so that
we get the most informative

00:08:07.390 --> 00:08:10.860
or a most powerful upper bound.

00:08:10.860 --> 00:08:14.690
So let us continue to
see what we can do.

00:08:14.690 --> 00:08:19.600
First, let us write down
what this expected value is.

00:08:19.600 --> 00:08:27.110
Since X1 takes values minus 1 or
plus 1 with equal probability,

00:08:27.110 --> 00:08:29.380
this expectation
is the following.

00:08:29.380 --> 00:08:34.130
With probability 1/2,
X1 takes the value of 1.

00:08:34.130 --> 00:08:37.250
And so this random
variable is e to the s.

00:08:37.250 --> 00:08:39.440
And with probability
1/2, it takes the value

00:08:39.440 --> 00:08:43.159
minus one, in which case
this random variable is

00:08:43.159 --> 00:08:45.070
e to the minus s.

00:08:45.070 --> 00:08:47.650
So this is the expectation
in the numerator.

00:08:47.650 --> 00:08:52.400
And we write again the
term in the denominator.

00:08:52.400 --> 00:08:56.870
And we have all
this to the power n.

00:08:56.870 --> 00:09:00.820
If we can choose s so that
this quantity is less than 1,

00:09:00.820 --> 00:09:05.450
we will have achieved
our objective.

00:09:05.450 --> 00:09:07.730
Can we do that?

00:09:07.730 --> 00:09:08.700
Let's see.

00:09:08.700 --> 00:09:12.070
Let's look at the numerator
as a function of s.

00:09:14.720 --> 00:09:18.680
When s is equal to 0, we
have 1 plus 1 divided by 1/2.

00:09:18.680 --> 00:09:21.570
That gives us 1.

00:09:21.570 --> 00:09:25.640
And then as s moves away
from 0, this function

00:09:25.640 --> 00:09:28.520
will have this kind of shape.

00:09:28.520 --> 00:09:32.873
And it is symmetric around
0, because we have an s

00:09:32.873 --> 00:09:34.400
and a minus s here.

00:09:37.110 --> 00:09:42.260
In particular, the derivative
of this function is 0 at 0.

00:09:42.260 --> 00:09:44.990
Let's look at the
denominator term.

00:09:44.990 --> 00:09:49.490
The denominator term
is an exponential.

00:09:49.490 --> 00:09:52.850
a is a positive number,
so it's an exponential

00:09:52.850 --> 00:09:55.925
that has a shape of this kind.

00:09:58.930 --> 00:10:01.740
The important thing to notice
is that this exponential

00:10:01.740 --> 00:10:06.060
has a positive derivative at 0.

00:10:06.060 --> 00:10:07.200
What does that tell us?

00:10:07.200 --> 00:10:14.100
That at least in the vicinity of
0, this term, the denominator,

00:10:14.100 --> 00:10:17.490
is going to be larger
than the numerator term.

00:10:17.490 --> 00:10:21.260
And that implies that
in the vicinity of 0,

00:10:21.260 --> 00:10:25.350
this fraction is going
to be less than 1.

00:10:25.350 --> 00:10:27.890
And we will have
achieved our goal

00:10:27.890 --> 00:10:29.880
of an exponentially
decaying bound.

00:10:32.890 --> 00:10:41.090
So the conclusion
is that for small s,

00:10:41.090 --> 00:10:45.100
we have that rho is less than 1.

00:10:45.100 --> 00:10:49.190
Now, we would like to get
an explicit value for rho.

00:10:49.190 --> 00:10:53.300
And we will do that by fixing
a specific value for s.

00:10:53.300 --> 00:10:59.970
It turns out that if we set s
to be equal to a, then the bound

00:10:59.970 --> 00:11:05.710
that we get is going to be
that this probability here

00:11:05.710 --> 00:11:16.130
is less than or equal to e to
the minus na squared over 2.

00:11:16.130 --> 00:11:18.090
And this is the Hoeffding bound.

00:11:22.130 --> 00:11:26.190
At this point, you
may just pause.

00:11:26.190 --> 00:11:30.490
Or if you're curious, you
can continue with this video

00:11:30.490 --> 00:11:33.800
to see the algebraic
manipulations involved

00:11:33.800 --> 00:11:36.560
in order to show that this
expression is less than

00:11:36.560 --> 00:11:39.210
or equal to that expression.

00:11:39.210 --> 00:11:43.770
But before going there, I would
like to make a general comment.

00:11:43.770 --> 00:11:49.770
Even if the X's had a different
distribution but with 0 mean,

00:11:49.770 --> 00:11:53.640
the derivation up to this
point would go through,

00:11:53.640 --> 00:11:56.660
here you would have a
somewhat different expression

00:11:56.660 --> 00:12:00.590
for the expected
value of e to the sX1.

00:12:00.590 --> 00:12:04.380
However, it turns out that the
expression that you get here

00:12:04.380 --> 00:12:09.740
will always have this property
that it has a 0 derivative.

00:12:09.740 --> 00:12:11.840
This is a consequence
of the assumption

00:12:11.840 --> 00:12:14.480
that we assumed zero mean.

00:12:14.480 --> 00:12:16.520
And because of
that, we will still

00:12:16.520 --> 00:12:18.380
have a picture of this kind.

00:12:18.380 --> 00:12:22.210
And so this fraction will
always be less than 1

00:12:22.210 --> 00:12:25.810
when we choose s to
be suitably small.

00:12:25.810 --> 00:12:29.210
And so this is going to
give us a result for more

00:12:29.210 --> 00:12:30.770
general distributions.

00:12:30.770 --> 00:12:37.520
And that more general results
is known as the Chernoff bound.

00:12:37.520 --> 00:12:40.290
However, we will not
develop in this video

00:12:40.290 --> 00:12:43.850
the Chernoff bound in
its greater generality.

00:12:43.850 --> 00:12:46.190
We will just stay with
Hoeffding's inequality

00:12:46.190 --> 00:12:48.590
that gives us the basic idea.

00:12:48.590 --> 00:12:54.630
And what we will do next will
be to derive this inequality.

00:12:54.630 --> 00:12:57.870
So I'm carrying over
what we figured out

00:12:57.870 --> 00:13:02.150
in the previous slide-- and
this is the quantity here

00:13:02.150 --> 00:13:03.390
that we wish to bound.

00:13:06.210 --> 00:13:08.570
We will look at
the numerator term.

00:13:08.570 --> 00:13:11.620
And we're going to
use a Taylor series

00:13:11.620 --> 00:13:15.020
for the exponential function.

00:13:15.020 --> 00:13:17.440
Remember, the Taylor series
for the exponential function

00:13:17.440 --> 00:13:18.800
takes this form.

00:13:18.800 --> 00:13:24.960
And using that, we have 1/2
e to be s plus e to the minus

00:13:24.960 --> 00:13:28.840
s is equal to the following.

00:13:28.840 --> 00:13:31.870
We first write the Taylor
series for e to the s.

00:13:31.870 --> 00:13:33.620
I'm just copying from here.

00:13:33.620 --> 00:13:37.020
It's 1 plus s plus
s squared over

00:13:37.020 --> 00:13:43.170
2 factorial plus s
cubed over 3 factorial.

00:13:43.170 --> 00:13:45.730
And we continue similarly.

00:13:45.730 --> 00:13:48.845
And then for the
term e to the minus

00:13:48.845 --> 00:13:53.840
s, we have a similar expansion,
except that we put a minus s

00:13:53.840 --> 00:13:55.510
in the place of s.

00:13:55.510 --> 00:14:01.310
Now, minus s squared is the same
as s squared, with a plus sign.

00:14:01.310 --> 00:14:04.320
But for s cubed,
when we have minus s,

00:14:04.320 --> 00:14:08.050
this becomes minus
s cube and so on.

00:14:08.050 --> 00:14:11.460
And so we see that in
this expansion here,

00:14:11.460 --> 00:14:15.710
we will alternate between
positive and negative signs.

00:14:15.710 --> 00:14:19.280
This means that all
of the odd power terms

00:14:19.280 --> 00:14:21.630
will cancel each other.

00:14:21.630 --> 00:14:25.460
But the even power
terms will survive.

00:14:25.460 --> 00:14:33.295
So what we obtain is the
sum of all of those terms.

00:14:35.990 --> 00:14:39.210
But we only have the
even power terms.

00:14:39.210 --> 00:14:43.100
So we have powers
of the form 2i.

00:14:43.100 --> 00:14:46.270
These are the even integers.

00:14:46.270 --> 00:14:49.030
And in the denominators,
we will always

00:14:49.030 --> 00:14:53.300
have the factorial of whatever
exponent we have at the top.

00:14:59.100 --> 00:15:05.270
Now, let us get a bound on
this term in the denominator.

00:15:05.270 --> 00:15:13.490
2i factorial is 1 times 2
times 3, all the way up to i.

00:15:13.490 --> 00:15:22.810
And then we continue-- i plus 1,
i plus 2, all the way up to 2i.

00:15:22.810 --> 00:15:26.390
And what we have is,
first, i factorial.

00:15:29.120 --> 00:15:34.120
But then each one of these terms
is larger than or equal to 2.

00:15:34.120 --> 00:15:37.100
And we have i such terms.

00:15:37.100 --> 00:15:39.760
And this gives us
this inequality.

00:15:39.760 --> 00:15:42.170
So we're going to use
the substitution here.

00:15:42.170 --> 00:15:44.630
Because this term is
in the denominator,

00:15:44.630 --> 00:15:48.590
the direction of the inequality
is going to be reversed.

00:15:48.590 --> 00:15:49.780
And we obtain this.

00:16:05.380 --> 00:16:09.520
Now, we can rewrite this by
taking this term 2 to the i

00:16:09.520 --> 00:16:12.285
and combining it with the
other term in the numerator.

00:16:18.190 --> 00:16:22.710
And what we have is s
squared divided by 2--

00:16:22.710 --> 00:16:24.395
all of that to the i'th power.

00:16:34.810 --> 00:16:38.070
Now, does this
expression look familiar?

00:16:38.070 --> 00:16:41.870
It is of exactly the same
form as this expansion.

00:16:41.870 --> 00:16:45.990
But instead of s, we now
have s squared over 2.

00:16:45.990 --> 00:16:54.230
Therefore, this is equal to
e to the s squared over 2.

00:16:54.230 --> 00:16:58.840
So we managed to
bound this term.

00:16:58.840 --> 00:17:04.319
Using now this bound, we
go back to this inequality.

00:17:04.319 --> 00:17:09.890
And we have that this is
less than or equal to--

00:17:09.890 --> 00:17:17.060
in the numerator, we have
e to the s squared over 2.

00:17:17.060 --> 00:17:21.252
In the denominator,
we have e to the sa,

00:17:21.252 --> 00:17:26.118
and all that is raised
to the n'th power.

00:17:26.118 --> 00:17:29.880
Or another way to
write this is, e

00:17:29.880 --> 00:17:38.940
to the s squared
over 2 minus sa,

00:17:38.940 --> 00:17:42.850
and all that to the n'th power.

00:17:42.850 --> 00:17:53.460
And now, if I choose s equal
to a, what I obtain here

00:17:53.460 --> 00:17:58.110
is going to be e to the
a over 2 minus a squared.

00:17:58.110 --> 00:18:02.620
That leaves me with e to
the minus a squared over 2.

00:18:02.620 --> 00:18:05.590
And then I take this
factor of n as well.

00:18:05.590 --> 00:18:09.810
And the final conclusion
is that this quantity

00:18:09.810 --> 00:18:14.220
becomes equal to this term.

00:18:14.220 --> 00:18:16.900
And so we have
completed the derivation

00:18:16.900 --> 00:18:21.500
that this expression is less
than or equal to this quantity

00:18:21.500 --> 00:18:23.770
when we choose s equal to a.

00:18:23.770 --> 00:18:26.590
And this is
Hoeffding's inequality.