WEBVTT
00:00:00.000 --> 00:00:02.350
The following content is
provided under a Creative
00:00:02.350 --> 00:00:03.640
Commons license.
00:00:03.640 --> 00:00:06.590
Your support will help MIT
OpenCourseWare continue to
00:00:06.590 --> 00:00:09.970
offer high quality educational
resources for free.
00:00:09.970 --> 00:00:13.060
To make a donation or to view
additional materials from
00:00:13.060 --> 00:00:16.780
hundreds of MIT courses, visit
MIT OpenCourseWare at
00:00:16.780 --> 00:00:18.030
ocw.mit.edu.
00:00:21.570 --> 00:00:22.070
PROFESSOR: OK.
00:00:22.070 --> 00:00:26.430
Just to review where we are,
we've been talking about
00:00:26.430 --> 00:00:30.230
source coding as one of the
two major parts of digital
00:00:30.230 --> 00:00:31.000
communication.
00:00:31.000 --> 00:00:34.420
Remember, you take sources,
you turn them into bits.
00:00:34.420 --> 00:00:38.110
Then you take bits and you
transmit them over channels.
00:00:38.110 --> 00:00:40.470
And that sums up the
whole course.
00:00:40.470 --> 00:00:44.420
This is the part where you
transmit over channels.
00:00:44.420 --> 00:00:48.220
This is the part where you
process the sources.
00:00:48.220 --> 00:00:51.740
We're concentrating now on the
source side of things.
00:00:51.740 --> 00:00:54.760
Partly because by concentrating
on the source
00:00:54.760 --> 00:00:58.300
side of things, we will build
up the machinery we need to
00:00:58.300 --> 00:00:59.820
look at the channel
side of things.
00:00:59.820 --> 00:01:03.760
The channel side is really more
interesting, I think.
00:01:03.760 --> 00:01:07.090
Although there's been a great
deal of work on both of them.
00:01:07.090 --> 00:01:09.070
They're both important.
00:01:09.070 --> 00:01:12.950
And we said that we could
separate source coding into
00:01:12.950 --> 00:01:13.770
three pieces.
00:01:13.770 --> 00:01:17.030
If you start out with a waveform
source, the typical
00:01:17.030 --> 00:01:21.210
thing to do, and almost the only
thing to do, is to turn
00:01:21.210 --> 00:01:24.430
those waveforms into sequences
of numbers.
00:01:24.430 --> 00:01:27.130
Those sequences might
be samples.
00:01:27.130 --> 00:01:31.080
They might be numbers
in an expansion.
00:01:31.080 --> 00:01:32.450
They might be whatever.
00:01:32.450 --> 00:01:36.440
But the first thing you almost
always do is turn waveforms
00:01:36.440 --> 00:01:38.280
into a sequence of numbers.
00:01:38.280 --> 00:01:42.600
Because waveforms are just too
complicated to deal with.
00:01:42.600 --> 00:01:44.860
The next thing we do with
a sequence of numbers is
00:01:44.860 --> 00:01:46.240
quantize them.
00:01:46.240 --> 00:01:48.440
After we quantize them
we wind up with a
00:01:48.440 --> 00:01:50.170
finite set of symbols.
00:01:50.170 --> 00:01:51.900
And the next thing
we do is, we take
00:01:51.900 --> 00:01:54.270
that sequence of symbols.
00:01:54.270 --> 00:01:55.750
And --
00:01:58.510 --> 00:02:01.960
and what we do at that point
is to do data compression.
00:02:01.960 --> 00:02:05.940
So that we try to represent
those symbols with as few
00:02:05.940 --> 00:02:09.950
binary digits per source
symbol as possible.
00:02:09.950 --> 00:02:13.290
We want to do that in such
a way that we can receive
00:02:13.290 --> 00:02:15.760
it at the other end.
00:02:15.760 --> 00:02:19.060
So let's review a little bit
about what we've done in the
00:02:19.060 --> 00:02:22.230
last couple of lectures.
00:02:22.230 --> 00:02:26.680
We talked about the
Kraft inequality.
00:02:26.680 --> 00:02:30.820
And the Kraft inequality, you
remember, says that the
00:02:30.820 --> 00:02:35.350
lengths of the codewords in any
prefix-free code have to
00:02:35.350 --> 00:02:38.120
satisfy this funny
inequality here.
00:02:38.120 --> 00:02:42.580
And this funny inequality, in
some sense, says if you want
00:02:42.580 --> 00:02:47.550
to make one codeword short, by
making that one codeword
00:02:47.550 --> 00:02:53.590
short, it eats up a large
part of this fraction.
00:02:53.590 --> 00:02:56.840
Since this sum has to be less
than or equal to 1.
00:02:56.840 --> 00:03:01.150
If, for example, you make l sub
1 equal to 1, then that
00:03:01.150 --> 00:03:03.870
uses up a half of this
sum right there.
00:03:03.870 --> 00:03:06.990
And all the other codewords
have to be much longer.
00:03:06.990 --> 00:03:09.890
So what this is saying,
essentially, and we proved it,
00:03:09.890 --> 00:03:13.760
and we did a bunch of things
with it, and your homework you
00:03:13.760 --> 00:03:18.060
worked with it, we have
shown that that
00:03:18.060 --> 00:03:20.380
inequality has to hold.
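The inequality being described, that the codeword lengths l sub i of any prefix-free code satisfy the sum of 2 to the minus l sub i being at most 1, is easy to check numerically. A minimal sketch in Python (the codeword lengths are illustrative, not from the lecture):

```python
def kraft_sum(lengths):
    """Return the Kraft sum: sum of 2**(-l) over the codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

# A full prefix-free binary code {0, 10, 110, 111}: lengths 1, 2, 3, 3.
full = kraft_sum([1, 2, 3, 3])    # exactly 1.0

# Making one codeword short (length 1) eats up half the budget,
# so the remaining codewords must be longer.
slack = kraft_sum([1, 3, 3, 3])   # 0.5 + 3 * 0.125 = 0.875

print(full, slack)
```

Any set of lengths whose Kraft sum exceeds 1 (for example, two codewords of length 1 plus anything else) cannot come from a prefix-free code.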
00:03:20.380 --> 00:03:23.750
The next thing we did is, given
a set of probabilities
00:03:23.750 --> 00:03:29.120
on a source, for example, p1
up to p sub m, by this time
00:03:29.120 --> 00:03:32.170
you should feel very comfortable
in realizing that
00:03:32.170 --> 00:03:35.020
what you call these symbols
doesn't make any difference
00:03:35.020 --> 00:03:38.970
whatsoever as far
as any matter of
00:03:38.970 --> 00:03:41.180
encoding sources is concerned.
00:03:41.180 --> 00:03:44.520
The first thing you can do, if
you like to, is take whatever
00:03:44.520 --> 00:03:47.940
name somebody has given to a set
of symbols, replace them
00:03:47.940 --> 00:03:52.410
with your own symbols, and the
easiest set of symbols to use
00:03:52.410 --> 00:03:55.020
is the integers 1 to m.
00:03:55.020 --> 00:03:58.110
And that's what we
will usually do.
00:03:58.110 --> 00:04:01.820
So, given this set of
probabilities, and they have
00:04:01.820 --> 00:04:06.430
to add up to 1, the Huffman
algorithm is this ingenious
00:04:06.430 --> 00:04:10.410
algorithm, very, very, clever,
which constructs a prefix-free
00:04:10.410 --> 00:04:14.170
code of minimum expected
length.
00:04:14.170 --> 00:04:18.560
And that minimum expected length
is just defined as the
00:04:18.560 --> 00:04:22.080
sum over i, of p sub
i times l sub i.
00:04:22.080 --> 00:04:25.340
And the trick in the algorithm
is to find that set of l sub
00:04:25.340 --> 00:04:30.570
i's that satisfy this inequality
but minimize this
00:04:30.570 --> 00:04:32.810
expected value.
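The merge-the-two-least-likely idea behind the Huffman algorithm can be sketched in a few lines. This is a minimal illustration, not the lecture's own code, and the pmf is made up:

```python
import heapq
import itertools

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code: repeatedly merge
    the two least likely nodes; every symbol under a merge gets
    one bit deeper."""
    cnt = itertools.count()
    heap = [(p, next(cnt), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(cnt), s1 + s2))
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]      # illustrative pmf
lengths = huffman_lengths(probs)  # [1, 2, 3, 3]
lbar = sum(p * l for p, l in zip(probs, lengths))
print(lengths, lbar)
```

The resulting lengths satisfy the Kraft inequality, and the expected length of 1.9 bits sits between the entropy of this pmf (about 1.85 bits) and entropy plus 1, matching the bounds from the lecture.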
00:04:32.810 --> 00:04:35.110
Next thing we started to talk
about was a discrete
00:04:35.110 --> 00:04:36.790
memoryless source.
00:04:36.790 --> 00:04:40.150
A discrete memoryless source
is really a toy source.
00:04:40.150 --> 00:04:43.770
It's a toy source where you
assume that each letter in the
00:04:43.770 --> 00:04:48.700
sequence is independent
and identically
00:04:48.700 --> 00:04:49.820
distributed.
00:04:49.820 --> 00:04:51.880
In other words, every
letter is the same.
00:04:51.880 --> 00:04:55.000
Every letter is independent
of every other letter.
00:04:55.000 --> 00:04:58.590
That's more appropriate for a
gambling game than it is for
00:04:58.590 --> 00:04:59.940
real sources.
00:04:59.940 --> 00:05:02.830
But, on the other hand, by
understanding this problem,
00:05:02.830 --> 00:05:05.385
we're starting to see that we
understand the whole problem
00:05:05.385 --> 00:05:07.350
of source coding.
00:05:07.350 --> 00:05:10.050
So we'll get on with
that as we go.
00:05:10.050 --> 00:05:13.910
But, anyway, when we have a
discrete memoryless source,
00:05:13.910 --> 00:05:17.780
what we found -- first we
defined the entropy of such a
00:05:17.780 --> 00:05:24.460
source as H of x, which is the
sum over i of minus p sub i,
00:05:24.460 --> 00:05:27.580
of logarithm of p sub i.
00:05:27.580 --> 00:05:30.570
And that was just something that
came out of trying to do
00:05:30.570 --> 00:05:34.170
this optimization not the way
that Huffman did it but the
00:05:34.170 --> 00:05:35.940
way that Shannon did it.
00:05:35.940 --> 00:05:39.500
Namely, Shannon looked at this
optimization in terms of
00:05:39.500 --> 00:05:42.600
dealing with entropy and
things like that.
00:05:42.600 --> 00:05:45.610
Huffman dealt with it in terms
of finding the optimal code.
00:05:45.610 --> 00:05:49.100
One of the very surprising
things is that the way Shannon
00:05:49.100 --> 00:05:52.450
looked at it, in terms of
entropy, is the way that's
00:05:52.450 --> 00:05:55.120
really valuable, even though
it doesn't come up with an
00:05:55.120 --> 00:05:56.410
optimal code.
00:05:56.410 --> 00:06:00.420
I mean, here was poor Huffman,
who generated this beautiful
00:06:00.420 --> 00:06:03.610
algorithm, which is
extraordinarily simple, which
00:06:03.610 --> 00:06:06.350
solved what looked like
a hard problem.
00:06:06.350 --> 00:06:10.440
But yet, as far as information
theory is concerned, he used
00:06:10.440 --> 00:06:11.940
that algorithm, sure.
00:06:11.940 --> 00:06:14.900
But as far as all the
generalizations are concerned,
00:06:14.900 --> 00:06:17.960
it has almost nothing
to do with anything.
00:06:17.960 --> 00:06:21.790
But, anyway, when you look at
this entropy, what comes out
00:06:21.790 --> 00:06:25.710
of it is the fact that the
entropy of the source is less
00:06:25.710 --> 00:06:30.100
than or equal to the minimum
number of bits per source
00:06:30.100 --> 00:06:33.840
symbol that you can come up with
in a prefix-free code,
00:06:33.840 --> 00:06:36.930
which is less than
H of x plus 1.
00:06:36.930 --> 00:06:39.340
And the way we did that was just
to try to look at this
00:06:39.340 --> 00:06:40.860
minimization.
00:06:40.860 --> 00:06:42.720
And by looking at the
minimization, we
00:06:42.720 --> 00:06:45.900
showed it had to be greater
than or equal to H of x.
00:06:45.900 --> 00:06:48.810
And by looking at any code
which satisfied this
00:06:48.810 --> 00:06:51.050
inequality with any
set of length --
00:06:51.050 --> 00:06:55.630
well, after we looked at this,
this said that what we really
00:06:55.630 --> 00:06:59.680
wanted to do was to make the
length of each codeword minus
00:06:59.680 --> 00:07:01.700
log of p sub i.
00:07:01.700 --> 00:07:03.050
That's not an integer.
00:07:03.050 --> 00:07:06.140
So the thing we did to get
this inequality is to say, OK,
00:07:06.140 --> 00:07:09.630
if it's not an integer, we'll
raise it up to the next value.
00:07:09.630 --> 00:07:10.880
Make it an integer.
00:07:10.880 --> 00:07:14.810
And as soon as we do that, the
Kraft inequality is satisfied.
00:07:14.810 --> 00:07:17.170
And you can generate a code
with that set of lengths.
00:07:17.170 --> 00:07:21.710
So that's where you get
these two bounds.
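The construction just described, rounding minus log of p sub i up to the next integer, can be checked numerically. A sketch under the same definitions (the pmf is illustrative):

```python
from math import log2, ceil

def entropy(probs):
    """H(X) = sum over i of -p_i * log2(p_i)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def shannon_lengths(probs):
    """Round -log2(p_i) up to the next integer."""
    return [ceil(-log2(p)) for p in probs]

probs = [0.5, 0.3, 0.2]                    # illustrative pmf
lengths = shannon_lengths(probs)           # [1, 2, 3]
kraft = sum(2.0 ** -l for l in lengths)    # <= 1, so a prefix-free code exists
lbar = sum(p * l for p, l in zip(probs, lengths))
h = entropy(probs)
print(lengths, kraft, h, lbar)             # H <= lbar < H + 1
```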
00:07:21.710 --> 00:07:25.840
This bound here is usually
very, very weak.
00:07:25.840 --> 00:07:31.030
Can anybody suggest the almost
unique example where this is
00:07:31.030 --> 00:07:33.890
almost tight?
00:07:33.890 --> 00:07:36.600
It's the simplest example
you can think of.
00:07:36.600 --> 00:07:38.960
It's a binary source.
00:07:38.960 --> 00:07:44.840
And what binary source has the
property that this is almost
00:07:44.840 --> 00:07:46.810
equal to this?
00:07:52.000 --> 00:07:53.150
Anybody out there?
00:07:53.150 --> 00:07:57.540
AUDIENCE: [UNINTELLIGIBLE]
00:07:57.540 --> 00:08:00.310
PROFESSOR: Make it almost
probability 0
00:08:00.310 --> 00:08:01.700
and probability 1.
00:08:01.700 --> 00:08:05.390
You can't quite do that because
as soon as you make
00:08:05.390 --> 00:08:09.280
the probability of the 0, 0,
then you don't have to
00:08:09.280 --> 00:08:10.360
represent it.
00:08:10.360 --> 00:08:13.260
And you just know that it's
a sequence of all 1's.
00:08:13.260 --> 00:08:14.140
So you're all done.
00:08:14.140 --> 00:08:16.980
And you don't need any
bits to represent it.
00:08:16.980 --> 00:08:21.740
But if there's just some very
tiny epsilon probability of a
00:08:21.740 --> 00:08:26.910
0 and a big probability of a 1,
then the entropy is almost
00:08:26.910 --> 00:08:28.570
equal to 0.
00:08:28.570 --> 00:08:34.040
And this 1 here is needed
because l bar min is 1.
00:08:34.040 --> 00:08:37.250
You can't make it any
smaller than that.
00:08:37.250 --> 00:08:42.590
So, that's where that
comes from.
00:08:42.590 --> 00:08:47.390
Let's talk about entropy
just a little bit.
00:08:47.390 --> 00:08:53.450
If we have an alphabet which
has size m, which is the
00:08:53.450 --> 00:08:58.370
symbol we'll usually use
for the alphabet, x.
00:08:58.370 --> 00:09:01.670
And the probability that
x equals i, is p sub i.
00:09:01.670 --> 00:09:04.520
In other words, again, we're
using this convention so we're
00:09:04.520 --> 00:09:08.850
going to call the symbols the
integers 1 to capital M. Then
00:09:08.850 --> 00:09:11.340
the entropy is equal to this.
00:09:11.340 --> 00:09:15.050
And a nice way of representing
this is that the entropy is
00:09:15.050 --> 00:09:18.930
equal to the expected value
of minus the logarithm.
00:09:18.930 --> 00:09:22.090
Logarithms are always to the
base 2, in this course.
00:09:22.090 --> 00:09:24.950
When we want natural
logarithms we'll write ln;
00:09:24.950 --> 00:09:26.770
otherwise log means log
to the base 2.
00:09:26.770 --> 00:09:30.960
So it's log to the base
2 of the probability
00:09:30.960 --> 00:09:32.910
of the symbol x.
00:09:32.910 --> 00:09:37.630
We call this the log pmf
random variable.
00:09:37.630 --> 00:09:42.580
We started out with x being
a chance variable.
00:09:42.580 --> 00:09:44.800
I mean, we happen to have
turned it into a random
00:09:44.800 --> 00:09:46.840
variable because we've
given it numbers.
00:09:46.840 --> 00:09:48.360
But that's irrelevant.
00:09:48.360 --> 00:09:52.050
We really want to think of
it as a chance variable.
00:09:52.050 --> 00:09:56.720
But now this quantity is a
numerical function of the
00:09:56.720 --> 00:09:59.300
symbol which comes out
of the source.
00:09:59.300 --> 00:10:02.140
And, therefore, this
quantity is a
00:10:02.140 --> 00:10:03.860
well-defined random variable.
00:10:03.860 --> 00:10:06.690
And we call it the log
pmf random variable.
00:10:06.690 --> 00:10:08.980
Some people call it
self-information.
00:10:08.980 --> 00:10:10.850
We'll find out why later.
00:10:10.850 --> 00:10:13.180
I don't particularly
like that word.
00:10:13.180 --> 00:10:16.030
One, because what we're dealing
with here has nothing
00:10:16.030 --> 00:10:18.300
to do with information.
00:10:18.300 --> 00:10:20.780
Probably the thought that this
all has something to do with
00:10:20.780 --> 00:10:23.350
information, namely, that
information theory has
00:10:23.350 --> 00:10:26.700
something to do with
information, probably held up
00:10:26.700 --> 00:10:28.730
the field for about
five years.
00:10:28.730 --> 00:10:31.370
Because everyone tried to figure
out, what does this
00:10:31.370 --> 00:10:33.510
have to do with information.
00:10:33.510 --> 00:10:36.050
And, of course, it had nothing
to do with information.
00:10:36.050 --> 00:10:38.110
It really only had
to do with data.
00:10:38.110 --> 00:10:41.260
And with probabilities of
various things in the data.
00:10:41.260 --> 00:10:43.110
So, anyway.
00:10:43.110 --> 00:10:45.930
Some people call it
self-information and
00:10:45.930 --> 00:10:47.530
we'll see why later.
00:10:47.530 --> 00:10:50.130
But this is the quantity
we're interested in.
00:10:50.130 --> 00:10:52.350
And we call it log pmf,
we sort of forget
00:10:52.350 --> 00:10:54.750
about the minus sign.
00:10:54.750 --> 00:10:57.340
It's not good to forget
about the minus sign,
00:10:57.340 --> 00:10:58.240
but I always do it.
00:10:58.240 --> 00:11:02.660
So I sort of expect other
people to do it, too.
00:11:02.660 --> 00:11:05.060
One of the properties of entropy
is, it has to be
00:11:05.060 --> 00:11:06.920
greater than or equal to 0.
00:11:06.920 --> 00:11:09.050
Why is it greater than
or equal to 0?
00:11:09.050 --> 00:11:12.130
Well, because these
probabilities here have to be
00:11:12.130 --> 00:11:14.120
less than or equal to 1.
00:11:14.120 --> 00:11:16.070
And the logarithm of something
less than or
00:11:16.070 --> 00:11:18.000
equal to 1 is negative.
00:11:18.000 --> 00:11:21.480
And therefore minus the
logarithm has to be greater
00:11:21.480 --> 00:11:23.130
than or equal to 0.
00:11:23.130 --> 00:11:26.970
This entropy is also less than
or equal to log M, log capital
00:11:26.970 --> 00:11:28.950
M. I'm not going to
prove that here.
00:11:28.950 --> 00:11:32.370
It's proven in the notes, it's
a trivial thing to do.
00:11:32.370 --> 00:11:35.410
Or maybe it's proven in one
of the problems, I forget.
00:11:35.410 --> 00:11:39.860
But, anyway, you can do it using
the inequality log of x
00:11:39.860 --> 00:11:42.380
is less than or equal
to x minus 1.
00:11:42.380 --> 00:11:44.890
Just like all inequalities
can be proven with that
00:11:44.890 --> 00:11:47.300
inequality.
00:11:47.300 --> 00:11:52.240
So there's equality here
if x is equiprobable.
00:11:52.240 --> 00:11:56.420
Which is pretty clear, because
if all of these probabilities
00:11:56.420 --> 00:12:00.200
are equal to 1 over M, and you
take the expected value of
00:12:00.200 --> 00:12:04.140
logarithm of M, you get
logarithm of M. So there's
00:12:04.140 --> 00:12:07.030
nothing very surprising here.
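Both facts, that the entropy is at most log M with equality in the equiprobable case, are easy to verify numerically. A sketch with an illustrative alphabet of size 4:

```python
from math import log2

def entropy(probs):
    """H(X) = sum over i of -p_i * log2(p_i)."""
    return -sum(p * log2(p) for p in probs if p > 0)

M = 4                             # illustrative alphabet size
uniform = [1.0 / M] * M
skewed = [0.7, 0.1, 0.1, 0.1]
# Equiprobable symbols hit the log M ceiling; anything else falls short.
print(entropy(uniform), log2(M), entropy(skewed))
```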
00:12:07.030 --> 00:12:11.210
Now, the next thing -- and
here's where what we're going
00:12:11.210 --> 00:12:15.640
to do is, on one hand very
simple, and on the other hand
00:12:15.640 --> 00:12:17.400
very confusing.
00:12:17.400 --> 00:12:21.480
After you get the picture of
it, it becomes very simple.
00:12:21.480 --> 00:12:24.780
Before that, it all looks
rather strange.
00:12:24.780 --> 00:12:30.680
If you have two independent
chance variables, say x and y,
00:12:30.680 --> 00:12:37.030
then the choice where the sample
value of the chance
00:12:37.030 --> 00:12:41.660
variable x, and the choice of
the sample value y, together
00:12:41.660 --> 00:12:44.900
that's a pair of sample values
which we can view as one
00:12:44.900 --> 00:12:46.730
sample value.
00:12:46.730 --> 00:12:51.540
In other words, we can view xy
as a chance variable all in
00:12:51.540 --> 00:12:54.280
its own right.
00:12:54.280 --> 00:12:57.540
This isn't the sequence x,
followed by y, though you can
00:12:57.540 --> 00:12:58.730
think of it that way.
00:12:58.730 --> 00:13:01.150
But we want to think
of this here as a
00:13:01.150 --> 00:13:02.730
chance variable itself.
00:13:02.730 --> 00:13:04.690
Which takes on different
values.
00:13:04.690 --> 00:13:11.400
And the values it takes on are
pairs of sample values, one from
00:13:11.400 --> 00:13:21.290
the ensemble x, one from the
ensemble y. And xy takes on
00:13:21.290 --> 00:13:28.500
the sample value little xy with
probability p of x times p of y.
00:13:28.500 --> 00:13:31.510
As we move on in this
we'll become much less
00:13:31.510 --> 00:13:34.520
careful about putting these
subscripts down, which talk
00:13:34.520 --> 00:13:36.540
about random variables.
00:13:36.540 --> 00:13:40.510
And the arguments which talk
about sample values of those
00:13:40.510 --> 00:13:41.830
random variables.
00:13:41.830 --> 00:13:46.500
I want to keep doing it for a
while because most courses in
00:13:46.500 --> 00:13:50.250
probability, even 6.041, which
is the first course in
00:13:50.250 --> 00:13:53.910
probability, almost deliberately
fudges the
00:13:53.910 --> 00:13:57.360
difference between sample values
and random variables.
00:13:57.360 --> 00:13:59.800
And most people who work
with probability do
00:13:59.800 --> 00:14:00.870
this all the time.
00:14:00.870 --> 00:14:03.520
And you never know when you're
talking about a random
00:14:03.520 --> 00:14:06.210
variable and when you're talking
about a sample value
00:14:06.210 --> 00:14:07.500
of a random variable.
00:14:07.500 --> 00:14:11.040
And that's convenient for
getting insight about things.
00:14:11.040 --> 00:14:13.770
And you do it for a while and
then pretty soon you wonder,
00:14:13.770 --> 00:14:15.410
what the heck is going on.
00:14:15.410 --> 00:14:19.055
Because you have no idea what's
a random variable any
00:14:19.055 --> 00:14:22.180
more, and what's a sample
value of it.
00:14:22.180 --> 00:14:28.190
So, this entropy, H, of the
chance variable, xy, is then
00:14:28.190 --> 00:14:32.580
expected value of minus the
logarithm of the probability
00:14:32.580 --> 00:14:37.480
of the chance variable, xy.
00:14:37.480 --> 00:14:39.890
Namely, when you take the
expected value, you're taking
00:14:39.890 --> 00:14:43.610
the expected value of
a random variable.
00:14:43.610 --> 00:14:49.750
And the random variable
here is the log pmf of the
00:14:49.750 --> 00:14:51.000
chance variable xy.
00:14:51.000 --> 00:14:55.180
This is the expected value of
minus the logarithm of p of x
00:14:55.180 --> 00:14:57.750
times p of y.
00:14:57.750 --> 00:14:59.490
Which is the expected value
00:14:59.490 --> 00:15:03.320
of a sum, since they're
independent of each other.
00:15:03.320 --> 00:15:08.380
And that gives you H of x y is
equal to H of x plus H of y.
00:15:08.380 --> 00:15:11.350
This is really the reason why
you're interested in these
00:15:11.350 --> 00:15:14.870
chance variables which are
logarithms of probabilities.
00:15:14.870 --> 00:15:18.820
Because when you have
independent chance variables
00:15:18.820 --> 00:15:22.780
then you have the situation that
probabilities multiply
00:15:22.780 --> 00:15:26.810
and therefore log probabilities
add.
00:15:26.810 --> 00:15:29.780
All of the major theorems in
probability theory, in
00:15:29.780 --> 00:15:32.360
particular the law of large
numbers, which is the most
00:15:32.360 --> 00:15:35.700
important result in probability,
simple though it
00:15:35.700 --> 00:15:39.700
might be, that's the key to
everything in probability.
00:15:39.700 --> 00:15:44.700
That particular result talks
about sums of random variables
00:15:44.700 --> 00:15:47.090
and not about products
of random variables.
00:15:47.090 --> 00:15:50.400
So that's why Shannon did
everything in terms
00:15:50.400 --> 00:15:53.550
of these log pmfs.
00:15:53.550 --> 00:15:55.720
And we will soon be doing
everything in
00:15:55.720 --> 00:15:57.660
terms of log pmfs also.
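The additivity being described, that entropies of independent chance variables add because their log probabilities add, can be checked on a small example (the pmfs are made up):

```python
from math import log2

def entropy(probs):
    """H = expected value of -log2 of the pmf."""
    return -sum(p * log2(p) for p in probs if p > 0)

px = [0.5, 0.5]                   # illustrative pmfs
py = [0.25, 0.25, 0.5]
# Under independence the joint pmf is the product, so the log pmf
# of the pair is the sum of the individual log pmfs.
pxy = [p * q for p in px for q in py]
print(entropy(pxy), entropy(px) + entropy(py))
```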
00:16:01.120 --> 00:16:06.710
So now let's get back to
discrete memoryless sources.
00:16:06.710 --> 00:16:12.710
If you now have a block of n
chance variables, x1 to xn,
00:16:12.710 --> 00:16:18.500
and they're all IID, again we
can do this whole block as one
00:16:18.500 --> 00:16:21.640
big monster chance variable.
00:16:21.640 --> 00:16:27.100
If each one of these takes on
m possible values, then this
00:16:27.100 --> 00:16:29.630
monster chance variable
can take on m to
00:16:29.630 --> 00:16:32.400
the n possible values.
00:16:32.400 --> 00:16:37.190
Namely, each possible string
of symbols, each possible
00:16:37.190 --> 00:16:41.060
string of n symbols where each
one is a choice from the
00:16:41.060 --> 00:16:44.640
integers 1 to capital M. So
we're talking about tuples of
00:16:44.640 --> 00:16:46.420
numbers now.
00:16:46.420 --> 00:16:48.860
And those are the values that
this particular chance
00:16:48.860 --> 00:16:52.420
variable, x sub n, takes on.
00:16:52.420 --> 00:16:55.820
So it takes on these
values with the probability
00:16:55.820 --> 00:17:02.145
p of x n, which is the product
from i
00:17:02.145 --> 00:17:05.220
equals 1 to n, of the individual
probabilities.
00:17:05.220 --> 00:17:08.570
In other words, again, when you
have independent chance
00:17:08.570 --> 00:17:11.570
variables, the probabilities
multiply.
00:17:11.570 --> 00:17:13.160
Which is all I'm saying here.
00:17:13.160 --> 00:17:17.730
So the chance variable x sub
n has the entropy H of x n,
00:17:17.730 --> 00:17:20.930
which is the expected value of
minus the logarithm of that
00:17:20.930 --> 00:17:22.010
probability.
00:17:22.010 --> 00:17:25.260
Which is the expected value of
minus the logarithm of the
00:17:25.260 --> 00:17:28.570
product of probabilities, which
is the expected value of
00:17:28.570 --> 00:17:32.590
the sum of minus the log of the
probabilities, which is n
00:17:32.590 --> 00:17:34.590
times H of x.
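The chain of equalities above, giving H of x n equal to n times H of x for an IID block, can be verified directly by building the block pmf (the single-letter pmf is illustrative):

```python
from math import log2, prod
from itertools import product

def entropy(probs):
    """H = expected value of -log2 of the pmf."""
    return -sum(p * log2(p) for p in probs if p > 0)

px = [0.6, 0.3, 0.1]              # illustrative single-letter pmf
n = 3
# Each length-n string has probability equal to the product of its
# letters' probabilities, so the block entropy is n times H(X).
pn = [prod(t) for t in product(px, repeat=n)]
print(entropy(pn), n * entropy(px))
```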
00:17:34.590 --> 00:17:37.650
If you compare this with the
previous slide, you'll see I
00:17:37.650 --> 00:17:41.410
haven't said anything new.
00:17:41.410 --> 00:17:44.420
This argument and this
argument are
00:17:44.420 --> 00:17:46.120
really exactly the same.
00:17:46.120 --> 00:17:49.910
All I did was do it for two
chance variables first.
00:17:49.910 --> 00:17:51.390
And then observe.
00:17:51.390 --> 00:17:54.000
But it generalizes,
to an arbitrary
00:17:54.000 --> 00:17:56.510
number of chance variables.
00:17:56.510 --> 00:17:58.850
You can say that it generalizes
to an infinite
00:17:58.850 --> 00:18:00.550
number of chance
variables also.
00:18:00.550 --> 00:18:02.170
And in some sense it does.
00:18:02.170 --> 00:18:04.870
And I would advise you
not to go there.
00:18:04.870 --> 00:18:08.510
Because you just get tangled up
with a lot of mathematics
00:18:08.510 --> 00:18:09.760
that you don't need.
00:18:12.470 --> 00:18:16.410
So the next thing is, how do
we use fixed-to-variable-length
00:18:16.410 --> 00:18:20.400
prefix-free codes and what
do we gain by it?
00:18:20.400 --> 00:18:24.170
So the thing we're going to do
now, instead of trying to
00:18:24.170 --> 00:18:27.680
compress one symbol at a time
from the source, we're going
00:18:27.680 --> 00:18:32.270
to segment the source into
blocks of n symbols each.
00:18:32.270 --> 00:18:35.650
And after we segment it into
blocks of n symbols each,
00:18:35.650 --> 00:18:39.620
we're going to encode the
block of n symbols.
00:18:39.620 --> 00:18:41.750
Now, what's new there?
00:18:41.750 --> 00:18:44.280
Nothing whatsoever is new.
00:18:44.280 --> 00:18:48.230
A block of n symbols is just
a chance variable.
00:18:48.230 --> 00:18:51.290
We know how to do optimal
encoding of chance variables.
00:18:51.290 --> 00:18:53.430
Namely, we use the Huffman
algorithm.
00:18:53.430 --> 00:18:56.980
You can do that here
on these n blocks.
00:18:56.980 --> 00:19:00.520
We also have this nice theorem,
which says that the
00:19:00.520 --> 00:19:06.020
entropy -- well, first the
entropy of x n is n times the
00:19:06.020 --> 00:19:07.650
entropy of x.
00:19:07.650 --> 00:19:10.610
So, in other words, when you
have independent identically
00:19:10.610 --> 00:19:15.560
distributed chance variables,
this entropy is just n times
00:19:15.560 --> 00:19:16.440
this entropy.
00:19:16.440 --> 00:19:19.500
But the important thing
is this result
00:19:19.500 --> 00:19:21.180
of doing the encoding.
00:19:21.180 --> 00:19:23.930
Which is the same result
we had before.
00:19:23.930 --> 00:19:27.800
Namely, this is the result of
what happens when you take a
00:19:27.800 --> 00:19:31.530
set -- when you take an
alphabet, and the alphabet can
00:19:31.530 --> 00:19:33.780
be anything whatsoever.
00:19:33.780 --> 00:19:37.690
And you encode that alphabet in
an optimal way, according
00:19:37.690 --> 00:19:42.040
to the probabilities of each
symbol within the alphabet.
00:19:42.040 --> 00:19:47.030
And the result that you get is
the entropy of this big chance
00:19:47.030 --> 00:19:52.420
variable x sub n is less than
or equal to the minimum --
00:19:52.420 --> 00:19:57.730
well, it's less than or equal
to the expected value of the
00:19:57.730 --> 00:20:00.300
length of a code, whatever
code you have.
00:20:00.300 --> 00:20:05.090
But when you put the minimum on
here, this is less than the
00:20:05.090 --> 00:20:09.730
entropy of the chance variable
x super n plus 1.
00:20:09.730 --> 00:20:13.270
That's the same theorem
that we proved before.
00:20:13.270 --> 00:20:15.490
There's nothing new here.
00:20:15.490 --> 00:20:20.220
Now, if you divide this by n,
this by n, and this by n, you
00:20:20.220 --> 00:20:22.200
still have a valid inequality.
00:20:22.200 --> 00:20:25.710
When you divide this by
n, what do you get?
00:20:25.710 --> 00:20:27.620
You get H of x.
00:20:27.620 --> 00:20:34.110
When you divide this by n,
by definition L bar --
00:20:34.110 --> 00:20:38.210
what we mean is the number of
bits per source symbol.
00:20:38.210 --> 00:20:42.700
We have n source symbols here.
00:20:42.700 --> 00:20:46.990
So when we divide by n, we get
the number of bits per source
00:20:46.990 --> 00:20:51.300
symbol in this monster symbol.
00:20:51.300 --> 00:20:53.540
So L bar min over n is equal to this.
00:20:53.540 --> 00:20:56.260
When we divide this
by n, we get this.
00:20:56.260 --> 00:20:59.530
When we divide this by
n, we get H of x.
00:20:59.530 --> 00:21:03.110
And now the whole reason for
doing this is, this silly
00:21:03.110 --> 00:21:06.900
little 1 here, which we're
trying very hard to think of
00:21:06.900 --> 00:21:10.400
as being negligible or
unimportant, has suddenly
00:21:10.400 --> 00:21:12.090
become 1 over n.
00:21:12.090 --> 00:21:15.690
And by making n big enough,
this truly is unimportant.
00:21:18.610 --> 00:21:22.870
If you're thinking in terms of
encoding a binary source, this
00:21:22.870 --> 00:21:25.000
1 here is very important.
00:21:27.960 --> 00:21:30.860
In other words, when you're
trying to encode a binary
00:21:30.860 --> 00:21:34.230
source, if you're encoding one
letter at a time, there's
00:21:34.230 --> 00:21:36.110
nothing you can do.
00:21:36.110 --> 00:21:37.450
You're absolutely stuck.
00:21:37.450 --> 00:21:40.250
Because if both of those
letters have non-zero
00:21:40.250 --> 00:21:44.870
probabilities, and you want a
uniquely decodable code, and
00:21:44.870 --> 00:21:48.690
you want to find codewords for
each of those two symbols, the
00:21:48.690 --> 00:21:52.940
best you can do is to have
an expected length of 1.
00:21:52.940 --> 00:21:58.530
Namely, you need 1 to encode 1,
and you need 0 to encode 0.
00:21:58.530 --> 00:22:01.800
And there's nothing else,
there's no freedom at
00:22:01.800 --> 00:22:03.340
all that you have.
00:22:03.340 --> 00:22:06.930
So you say, OK, in that
situation, I really have to go
00:22:06.930 --> 00:22:08.300
to longer blocks.
00:22:08.300 --> 00:22:10.720
And when I go to longer
blocks, I
00:22:10.720 --> 00:22:12.330
can get this resolved.
00:22:12.330 --> 00:22:13.540
And I know how to do it.
00:22:13.540 --> 00:22:16.430
I use Huffman's algorithm
or whatever.
00:22:16.430 --> 00:22:21.110
So, suddenly, I can start to
get the expected number of
00:22:21.110 --> 00:22:23.850
bits per source symbol
00:22:23.850 --> 00:22:27.440
down as close to H of x
as I want to make it.
00:22:27.440 --> 00:22:30.040
And I can't make
it any smaller.
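The whole argument, that encoding blocks of n letters drives the expected number of bits per source symbol down toward H of x, can be illustrated by running Huffman coding on blocks of a skewed binary source. A sketch (p equal to 0.1 is an arbitrary choice):

```python
import heapq
import itertools
from math import log2, prod
from itertools import product

def entropy(probs):
    """H(X) = sum over i of -p_i * log2(p_i)."""
    return -sum(q * log2(q) for q in probs if q > 0)

def huffman_expected_length(probs):
    """Expected codeword length of a binary Huffman code:
    repeatedly merge the two least likely nodes."""
    cnt = itertools.count()
    heap = [(q, next(cnt), [i]) for i, q in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:   # every symbol under a merge gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(cnt), s1 + s2))
    return sum(q * l for q, l in zip(probs, lengths))

p = 0.1                           # skewed binary source
h = entropy([p, 1 - p])           # about 0.469 bits
results = {}
for n in (1, 2, 4):
    # pmf on blocks of n letters: letter probabilities multiply.
    block = [prod(t) for t in product([p, 1 - p], repeat=n)]
    # bits per source symbol when the block is Huffman-encoded
    results[n] = huffman_expected_length(block) / n
    print(n, results[n])
```

One letter at a time the best you can do is 1 bit per symbol; each per-symbol figure stays between H and H plus 1 over n, so longer blocks squeeze the overhead down.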
00:22:30.040 --> 00:22:35.080
Which says that H of x now has
a very clear interpretation,
00:22:35.080 --> 00:22:38.493
at least for prefix-free codes,
of being the number of
00:22:38.493 --> 00:22:42.430
bits per source symbol you need
when you
00:22:42.430 --> 00:22:44.940
allow the possibility
of encoding the symbols
00:22:44.940 --> 00:22:47.570
a block at a time.
00:22:47.570 --> 00:22:50.560
We're going to find later today
that the significance of
00:22:50.560 --> 00:22:53.640
this is far greater
than that, even.
00:22:53.640 --> 00:22:56.710
Because why use prefix-free
codes, we could use any old
00:22:56.710 --> 00:22:57.780
kind of code.
00:22:57.780 --> 00:23:00.840
When we study the Lempel-Ziv
codes tomorrow, we'll find out
00:23:00.840 --> 00:23:04.810
they aren't prefix-free
codes at all.
00:23:04.810 --> 00:23:07.450
They're really variable-length
to variable-length codes.
00:23:07.450 --> 00:23:10.170
So they aren't fixed
to variable length.
00:23:10.170 --> 00:23:13.050
And they do some pretty fancy
and tricky things.
00:23:13.050 --> 00:23:16.140
But they're still limited
to this same inequality.
00:23:16.140 --> 00:23:18.390
You can never beat the
entropy bound.
00:23:18.390 --> 00:23:21.170
If you want something to be
uniquely decodable, you're
00:23:21.170 --> 00:23:22.980
stuck with this bound.
00:23:22.980 --> 00:23:26.620
And we'll see why in a very
straightforward way, later.
00:23:26.620 --> 00:23:31.500
It's a very straightforward way
which I can guarantee all
00:23:31.500 --> 00:23:35.960
of you are going to look at it
and say, yes, that's obvious.
00:23:35.960 --> 00:23:38.540
And tomorrow you will look at it
and say, I don't understand
00:23:38.540 --> 00:23:39.730
that at all.
00:23:39.730 --> 00:23:41.500
And the next day you'll
look at it again and
00:23:41.500 --> 00:23:42.880
say, well, of course.
00:23:42.880 --> 00:23:45.740
And the day after that you'll
say, I don't understand that.
00:23:45.740 --> 00:23:48.690
And you'll go back and forth
like that for about two weeks.
00:23:48.690 --> 00:23:51.980
Don't be frustrated, because
it is simple.
00:23:51.980 --> 00:23:54.950
But at the same time it's
very sophisticated.
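Since the lecture says Huffman-coding blocks of symbols drives the expected bits per source symbol down toward H of x, here is a minimal sketch in Python. The two-symbol source with probabilities 0.9 and 0.1 is my own illustrative choice, not an example from the lecture.

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for the given probabilities."""
    # Heap items: (probability, tiebreak id, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # each merge adds one bit to every symbol inside
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next_id, s1 + s2))
        next_id += 1
    return lengths

# A two-symbol source; the 0.9/0.1 probabilities are an illustrative choice.
pmf = {'a': 0.9, 'b': 0.1}
H = -sum(p * math.log2(p) for p in pmf.values())  # entropy, about 0.469 bits

results = {}
for n in (1, 2, 4):
    blocks = list(itertools.product(pmf, repeat=n))
    probs = [math.prod(pmf[s] for s in blk) for blk in blocks]
    lengths = huffman_lengths(probs)
    # Expected codeword length per source symbol for this block length.
    results[n] = sum(p * l for p, l in zip(probs, lengths)) / n
print(H, results)
```

For this source the per-symbol expected length drops from 1.0 bits at block length 1 toward the entropy as the block length grows, and it can never go below H.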
00:23:59.370 --> 00:24:05.170
So, let's now review the weak
law of large numbers.
00:24:05.170 --> 00:24:09.000
I usually just call it the
law of large numbers.
00:24:09.000 --> 00:24:12.670
I bridle a little bit when
people call it weak because,
00:24:12.670 --> 00:24:16.070
in fact it's the centerpiece
of probability theory.
00:24:16.070 --> 00:24:19.310
But there is this other thing
called the strong law of large
00:24:19.310 --> 00:24:24.070
numbers, which mathematicians
love because it lets them look
00:24:24.070 --> 00:24:27.630
at all kinds of mathematical
minutiae.
00:24:27.630 --> 00:24:29.830
It's also important,
I shouldn't
00:24:29.830 --> 00:24:30.950
play it down too much.
00:24:30.950 --> 00:24:33.530
And there are places where
you truly need it.
00:24:33.530 --> 00:24:36.540
For what we'll be doing, we
don't need it at all.
00:24:36.540 --> 00:24:39.470
And the weak law of large
numbers does in fact hold in
00:24:39.470 --> 00:24:41.930
many places where the strong
law doesn't hold.
00:24:41.930 --> 00:24:48.130
So if you know what the strong
law is, temporarily forget it
00:24:48.130 --> 00:24:50.320
for the rest
of the term.
00:24:50.320 --> 00:24:52.580
And just focus on
the weak law.
00:24:52.580 --> 00:24:56.130
And the weak law is not
terribly complicated.
00:24:56.130 --> 00:25:00.310
We have a sequence of
random variables.
00:25:00.310 --> 00:25:04.230
And each of them has
a mean y bar.
00:25:04.230 --> 00:25:08.360
And each of them has a variance
sigma sub y squared.
00:25:08.360 --> 00:25:10.950
And let's assume that they're
independent and identically
00:25:10.950 --> 00:25:12.600
distributed for the
time being.
00:25:12.600 --> 00:25:15.900
Just to avoid worrying
about anything.
00:25:15.900 --> 00:25:20.020
If we look at the sum of those
random variables, namely a is
00:25:20.020 --> 00:25:23.700
equal to the sum of y1 up to y sub n.
00:25:23.700 --> 00:25:27.570
Then the expected value of a is
the expected value of this
00:25:27.570 --> 00:25:30.860
plus the expected value
of y2, and so forth.
00:25:30.860 --> 00:25:35.270
So the expected value of
a is n times y bar.
00:25:35.270 --> 00:25:37.990
And I think in one of the
homework problems, you found
00:25:37.990 --> 00:25:39.640
the variance of a.
00:25:39.640 --> 00:25:46.090
And the variance of a, well, the
easiest thing to do is to
00:25:46.090 --> 00:25:49.280
reduce this to its
fluctuation.
00:25:49.280 --> 00:25:51.790
Reduce all of these to
their fluctuation.
00:25:51.790 --> 00:25:54.570
Then look at the variance of the
fluctuation, which is just
00:25:54.570 --> 00:25:56.960
the expected value
of this squared.
00:25:56.960 --> 00:25:59.760
Which is the expected value of
this squared plus the expected
00:25:59.760 --> 00:26:02.250
value of this squared,
and so forth.
00:26:02.250 --> 00:26:06.940
So the variance of a is n times
sigma sub y squared.
00:26:06.940 --> 00:26:08.530
I want all of you to know that.
00:26:08.530 --> 00:26:13.230
That's sort of day two of
a probability course.
00:26:13.230 --> 00:26:15.700
As soon as you start talking
about random variables, that's
00:26:15.700 --> 00:26:17.630
one of the key things
that you do.
00:26:17.630 --> 00:26:21.320
One of the most important
things you do.
00:26:21.320 --> 00:26:23.270
The thing that we're interested
in here is more the
00:26:23.270 --> 00:26:26.600
sample average of y1
up to y sub n.
00:26:26.600 --> 00:26:29.570
And the sample average,
by definition, is the
00:26:29.570 --> 00:26:32.050
sum divided by n.
00:26:32.050 --> 00:26:35.870
So in other words, the thing
that you're interested in here
00:26:35.870 --> 00:26:39.990
is to add all of these
random variables up.
00:26:39.990 --> 00:26:43.360
Take one over n times it.
00:26:43.360 --> 00:26:44.950
Which is a thing we
do all the time.
00:26:44.950 --> 00:26:50.270
I mean, we sum up a lot of
events, we divide by n, and we
00:26:50.270 --> 00:26:54.790
hope by doing that to get some
sort of typical value.
00:26:54.790 --> 00:26:58.210
And, usually there is some sort
of typical value that
00:26:58.210 --> 00:26:59.200
arises from that.
00:26:59.200 --> 00:27:02.810
What the law of large numbers
says is that there in fact is
00:27:02.810 --> 00:27:05.620
a typical value that arises.
00:27:05.620 --> 00:27:08.600
So this sample value is
a over n, which is the
00:27:08.600 --> 00:27:10.410
sum divided by n.
00:27:10.410 --> 00:27:12.630
And we call that the
sample average.
00:27:12.630 --> 00:27:18.340
The mean of the sample average
is just the mean of a, which
00:27:18.340 --> 00:27:23.020
is n times y bar,
divided by n.
00:27:23.020 --> 00:27:27.780
So the mean of the sample
average is y bar itself.
00:27:27.780 --> 00:27:37.190
The variance of the sample
variance, --
00:27:37.190 --> 00:27:42.290
the variance of the sample
average, OK, that's, --
00:27:45.630 --> 00:27:48.880
I'm talking too fast.
00:27:48.880 --> 00:27:55.600
The sample average here, you
would like to think of it as
00:27:55.600 --> 00:27:58.380
something which is known
and specific, like
00:27:58.380 --> 00:27:59.680
the expected value.
00:27:59.680 --> 00:28:02.150
It, in fact, is a
random variable.
00:28:02.150 --> 00:28:05.140
It changes with different
sample values.
00:28:05.140 --> 00:28:07.840
It can change from almost
nothing to very large
00:28:07.840 --> 00:28:08.820
quantities.
00:28:08.820 --> 00:28:11.250
And what we're interested in
saying is that most of the
00:28:11.250 --> 00:28:14.480
time, it's close to the
expected value.
00:28:14.480 --> 00:28:15.830
And that's what we're
aiming at here.
00:28:15.830 --> 00:28:19.020
And that's what the law
of large numbers says.
00:28:19.020 --> 00:28:22.970
The sample average here, the
variance of this, is now equal
00:28:22.970 --> 00:28:28.000
to the variance of a divided
by n squared.
00:28:28.000 --> 00:28:31.530
In other words, we're trying to
take the expected value of
00:28:31.530 --> 00:28:33.080
this quantity squared.
00:28:33.080 --> 00:28:36.770
So there's a 1 over n squared
that comes in here.
00:28:36.770 --> 00:28:40.800
When you take the 1 over n
squared here, this variance
00:28:40.800 --> 00:28:44.080
then becomes sigma --
00:28:46.610 --> 00:28:50.230
I don't know why I
have the n there.
00:28:50.230 --> 00:28:52.100
Just take that n out,
if you will.
00:28:52.100 --> 00:28:54.640
I don't have my red
pen with me.
00:28:57.290 --> 00:29:03.390
And so it's the variance
of the random variable
00:29:03.390 --> 00:29:06.590
y, divided by n.
00:29:06.590 --> 00:29:12.170
In other words, the limit as n
goes to infinity of the
00:29:12.170 --> 00:29:16.980
variance of the sum is
equal to infinity.
00:29:16.980 --> 00:29:21.630
And the variance of the sample
average as n goes to infinity
00:29:21.630 --> 00:29:23.790
is equal to 0.
00:29:23.790 --> 00:29:27.890
And that's because of this 1
over n squared effect here.
00:29:27.890 --> 00:29:32.400
When you add up a lot of
independent random variables,
00:29:32.400 --> 00:29:35.990
what you wind up with is the
sample average has a variance,
00:29:35.990 --> 00:29:38.440
which is going to 0.
00:29:38.440 --> 00:29:44.150
And the sum has a variance which
is going to infinity.
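A small simulation makes this scaling concrete: for IID Y1 up to Yn, the variance of the sum grows like n times sigma squared while the variance of the sample average shrinks like sigma squared over n. The fair-die setup and trial counts below are my own choices, not from the lecture.

```python
import random
import statistics

random.seed(0)
# Variance of a single fair die roll, sigma^2 = 35/12.
sigma2 = statistics.pvariance([1, 2, 3, 4, 5, 6])

def empirical_variances(n, trials=20000):
    """Empirical variances of the sum and the sample average of n die rolls."""
    sums = [sum(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]
    return statistics.pvariance(sums), statistics.pvariance([s / n for s in sums])

results = {n: empirical_variances(n) for n in (10, 100)}
# Sum variance is near n*sigma2 and growing; sample-average variance is
# near sigma2/n and shrinking.
print(sigma2, results)
```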
00:29:44.150 --> 00:29:46.560
That's important.
00:29:46.560 --> 00:29:49.820
Aside from all of the theorems
you've ever heard, this is
00:29:49.820 --> 00:29:54.520
sort of the gross, simple-minded
thing which you
00:29:54.520 --> 00:29:57.690
always ought to keep foremost
in your mind.
00:29:57.690 --> 00:30:00.290
This is what's happening
in probability theory.
00:30:00.290 --> 00:30:03.350
When you talk about sample
averages, this variance is
00:30:03.350 --> 00:30:06.590
getting small.
00:30:06.590 --> 00:30:09.710
Let's look at a picture
of this.
00:30:09.710 --> 00:30:12.420
Let's look at the distribution
function
00:30:12.420 --> 00:30:14.500
of this random variable.
00:30:14.500 --> 00:30:18.110
The sample average as
a random variable.
00:30:18.110 --> 00:30:22.070
And what we're finding here
is that this distribution
00:30:22.070 --> 00:30:27.510
function, if we look at it for
some modest value of n, we get
00:30:27.510 --> 00:30:31.250
something which looks like
this upper curve here.
00:30:31.250 --> 00:30:33.360
Which is then the lower
curve here.
00:30:33.360 --> 00:30:37.580
It's spread out more, so it
has a larger variance.
00:30:37.580 --> 00:30:40.180
Namely, the sample average
has a larger variance.
00:30:40.180 --> 00:30:45.360
When you make n bigger, what's
happening to the variance?
00:30:45.360 --> 00:30:46.900
The variance is getting
smaller.
00:30:46.900 --> 00:30:51.850
The variance is getting smaller
by a factor of 1/2.
00:30:51.850 --> 00:30:55.800
So this quantity is supposed
to have a variance which is
00:30:55.800 --> 00:30:59.200
equal to 1/2 of the
variance of this.
00:30:59.200 --> 00:31:01.270
How you find a variance
in a distribution
00:31:01.270 --> 00:31:04.010
function is your problem.
00:31:04.010 --> 00:31:08.310
But you know that if something
has a small variance, it's
00:31:08.310 --> 00:31:10.600
very closely tucked
in around this.
00:31:10.600 --> 00:31:14.150
In other words, as the variance
goes to 0, and the
00:31:14.150 --> 00:31:17.620
mean is y bar, you have a
distribution function which
00:31:17.620 --> 00:31:20.910
approaches a unit step.
00:31:20.910 --> 00:31:23.410
And all that just comes from
this very, very simple
00:31:23.410 --> 00:31:27.260
argument that says, when you
have a sum of IID random
00:31:27.260 --> 00:31:31.050
variables and you take the
sample average of it, namely,
00:31:31.050 --> 00:31:34.780
you divide by n, the
variance goes to 0.
00:31:34.780 --> 00:31:37.850
Which says, no matter how you
look at it, you wind up with
00:31:37.850 --> 00:31:40.770
something that looks
like a unit step.
00:31:40.770 --> 00:31:45.480
Now, the Chebyshev inequality,
which is one of the simpler
00:31:45.480 --> 00:31:49.500
inequalities in probability
theory, and I don't prove it
00:31:49.500 --> 00:31:52.040
because it's something
you've all seen.
00:31:52.040 --> 00:31:55.800
I don't know of any course in
probability which avoids the
00:31:55.800 --> 00:31:57.780
Chebyshev inequality.
00:31:57.780 --> 00:32:02.190
And what it says is, for any
epsilon greater than 0, the
00:32:02.190 --> 00:32:05.280
probability that the difference
between the sample
00:32:05.280 --> 00:32:09.350
average and the true mean, the
probability that that quantity
00:32:09.350 --> 00:32:13.380
and magnitude is greater than
or equal to epsilon, is less
00:32:13.380 --> 00:32:17.070
than or equal to sigma
sub y squared divided
00:32:17.070 --> 00:32:18.580
by n epsilon squared.
00:32:18.580 --> 00:32:22.340
Oh, incidentally that thing that
was called sigma sub n
00:32:22.340 --> 00:32:26.310
before was really
sigma squared.
00:32:26.310 --> 00:32:31.420
Namely, the
variance of y.
00:32:31.420 --> 00:32:33.120
I hope it's right
in the notes.
00:32:33.120 --> 00:32:34.970
Might not be.
00:32:34.970 --> 00:32:35.930
It is?
00:32:35.930 --> 00:32:37.180
Good.
00:32:39.940 --> 00:32:42.520
So, that's what this
inequality says.
00:32:42.520 --> 00:32:46.540
There's an easy way to derive
this on the fly.
00:32:46.540 --> 00:32:49.480
Namely, if you're wondering what
all these constants are
00:32:49.480 --> 00:32:53.890
here, here's a way to do it.
00:32:53.890 --> 00:32:58.980
What we're looking at, in this
curve here, is we're trying to
00:32:58.980 --> 00:33:04.250
say, how much probability is
there outside of these plus
00:33:04.250 --> 00:33:06.360
and minus epsilon limits.
00:33:06.360 --> 00:33:09.550
And the Chebyshev inequality
says there can't be too much
00:33:09.550 --> 00:33:11.250
probability out here.
00:33:11.250 --> 00:33:14.780
And there can't be too much
probability out here.
00:33:14.780 --> 00:33:19.620
So, one way to get at this is
to say, OK, suppose I have
00:33:19.620 --> 00:33:22.970
some given probability
out here.
00:33:22.970 --> 00:33:25.950
And some given probability
out here.
00:33:25.950 --> 00:33:29.380
And suppose I want to minimize
the variance of a random
00:33:29.380 --> 00:33:32.960
variable which has that much
probability out here and that
00:33:32.960 --> 00:33:35.050
much probability out here.
00:33:35.050 --> 00:33:36.630
How do I do it?
00:33:36.630 --> 00:33:40.700
Well, the variance deals with
the square of how far you were
00:33:40.700 --> 00:33:42.160
away from the mean.
00:33:42.160 --> 00:33:44.840
So if I want to have a certain
amount of probability out
00:33:44.840 --> 00:33:49.750
here, I minimize my variance by
making this come straight,
00:33:49.750 --> 00:33:54.370
come up here with a little step,
then go across here.
00:33:54.370 --> 00:33:56.050
Go up here.
00:33:56.050 --> 00:33:59.560
And then, oops.
00:33:59.560 --> 00:34:02.160
Go up here.
00:34:02.160 --> 00:34:03.710
I wish I had my red pencil.
00:34:03.710 --> 00:34:07.060
Does anybody have a red pen?
00:34:07.060 --> 00:34:08.480
That will write on this stuff?
00:34:13.870 --> 00:34:14.360
Yes?
00:34:14.360 --> 00:34:15.610
No?
00:34:21.140 --> 00:34:21.670
Oh, great.
00:34:21.670 --> 00:34:23.990
Thank you.
00:34:23.990 --> 00:34:25.240
I will return it.
00:34:27.220 --> 00:34:31.330
So what we want is something
which goes over here.
00:34:31.330 --> 00:34:33.170
Comes up here.
00:34:33.170 --> 00:34:35.220
Goes across here.
00:34:35.220 --> 00:34:36.470
Goes up here.
00:34:39.400 --> 00:34:42.900
Goes across here, and
goes up again.
00:34:42.900 --> 00:34:44.535
That's the smallest you
can make the variance.
00:34:44.535 --> 00:34:46.790
It's squeezing everything
in as far as it
00:34:46.790 --> 00:34:48.130
can be squeezed in.
00:34:48.130 --> 00:34:50.830
Namely, everything out
here gets squeezed
00:34:50.830 --> 00:34:53.270
in to y minus epsilon.
00:34:53.270 --> 00:34:55.910
Everything in here gets
squeezed into 0.
00:34:55.910 --> 00:34:59.500
And everything out here gets
squeezed into epsilon.
00:34:59.500 --> 00:35:01.570
OK, calculate the
variance there.
00:35:01.570 --> 00:35:03.650
And that satisfies
the Chebyshev
00:35:03.650 --> 00:35:06.130
inequality with equality.
00:35:06.130 --> 00:35:10.410
So that's all the Chebyshev
inequality is.
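The extremal distribution sketched on the board is easy to write down: mass delta over 2 at y bar minus epsilon, mass delta over 2 at y bar plus epsilon, and everything else squeezed in at y bar. Computing its variance shows Chebyshev holding with equality; the numbers below are my own illustration.

```python
# Three-point distribution achieving the Chebyshev bound with equality.
ybar, eps, delta = 0.0, 2.0, 0.1
points = [(ybar - eps, delta / 2), (ybar, 1 - delta), (ybar + eps, delta / 2)]

mean = sum(x * q for x, q in points)
var = sum((x - mean) ** 2 * q for x, q in points)          # = delta * eps^2
tail = sum(q for x, q in points if abs(x - ybar) >= eps)   # = delta
# Chebyshev says tail <= var / eps^2; here it holds with equality.
print(mean, var, tail, var / eps**2)
```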
00:35:10.410 --> 00:35:13.210
And it's a loose inequality
usually, because usually these
00:35:13.210 --> 00:35:14.900
curves look very nice.
00:35:14.900 --> 00:35:17.410
Usually this looks like a
Gaussian distribution
00:35:17.410 --> 00:35:20.660
function, and the central limit
theorem says that we
00:35:20.660 --> 00:35:23.810
don't need the central limit
theorem here, and we don't
00:35:23.810 --> 00:35:26.980
want the central limit theorem
here, because this thing is an
00:35:26.980 --> 00:35:31.490
inequality that says, life can't
be any worse than this.
00:35:31.490 --> 00:35:33.890
And all the central limit
theorem is, is an
00:35:33.890 --> 00:35:35.570
approximation.
00:35:35.570 --> 00:35:37.550
And then we have to worry
about when it's a good
00:35:37.550 --> 00:35:41.510
approximation and when it's
not a good approximation.
00:35:41.510 --> 00:35:45.160
So this says, when we carry it
one piece further, it's for
00:35:45.160 --> 00:35:48.050
any epsilon and delta
greater than 0.
00:35:48.050 --> 00:35:51.500
If we make n large enough --
in other words, substitute
00:35:51.500 --> 00:35:53.330
delta for this.
00:35:53.330 --> 00:35:55.850
And then, when you make n large
enough, this quantity is
00:35:55.850 --> 00:35:57.600
smaller than delta.
00:35:57.600 --> 00:36:01.180
And that says that the
probability that s and y
00:36:01.180 --> 00:36:05.240
differ by more than epsilon is
less than or equal to delta
00:36:05.240 --> 00:36:08.180
when we make n big enough.
00:36:08.180 --> 00:36:10.960
So it says, you can make this
as small as you want.
00:36:10.960 --> 00:36:13.050
You can make this as
small as you want.
00:36:13.050 --> 00:36:15.530
And all you need to do
is make a sequence
00:36:15.530 --> 00:36:17.540
which is long enough.
00:36:17.540 --> 00:36:21.380
Now, the thing which is
mystifying about the law of
00:36:21.380 --> 00:36:24.720
large numbers is, you
need both the
00:36:24.720 --> 00:36:26.510
epsilon and the delta.
00:36:26.510 --> 00:36:29.470
You can't get rid of
either of them.
00:36:29.470 --> 00:36:33.260
In other words, you
can't say --
00:36:33.260 --> 00:36:36.500
you can't reduce this to 0.
00:36:36.500 --> 00:36:38.670
Because it won't
make any sense.
00:36:38.670 --> 00:36:41.550
In other words, this
curve here tends to
00:36:41.550 --> 00:36:44.520
spread out on its tails.
00:36:44.520 --> 00:36:49.070
And therefore, there's always a
probability of error there.
00:36:49.070 --> 00:36:54.160
You can't move epsilon to 0
because, for no finite n, do
00:36:54.160 --> 00:36:56.590
you really get a step
function here.
00:36:56.590 --> 00:37:00.580
So you need some wiggle
room on both ends.
00:37:00.580 --> 00:37:05.950
You need wiggle room here, and
you need wiggle room here.
00:37:05.950 --> 00:37:08.950
And once you recognize that you
need those two pieces of
00:37:08.950 --> 00:37:09.850
wiggle room.
00:37:09.850 --> 00:37:13.830
Namely, you cannot talk about
the probability that this is
00:37:13.830 --> 00:37:19.230
equal to y bar, because
that's usually 0.
00:37:19.230 --> 00:37:25.390
And you cannot talk about
reducing this to 0 either.
00:37:25.390 --> 00:37:27.750
So both of those are needed.
00:37:27.750 --> 00:37:29.590
Why did I go through
all of this?
00:37:29.590 --> 00:37:31.780
Well, partly because
it's important.
00:37:31.780 --> 00:37:36.140
But partly because I want to
talk about something which is
00:37:36.140 --> 00:37:39.890
called the asymptotic
equipartition property.
00:37:39.890 --> 00:37:43.520
And because of those long words
you believe this has to
00:37:43.520 --> 00:37:45.980
be very complicated.
00:37:45.980 --> 00:37:48.690
I hope to convince you that
what the asymptotic
00:37:48.690 --> 00:37:52.580
equipartition property is, is
simply the weak law of large
00:37:52.580 --> 00:37:57.570
numbers applied to
the log pmf.
00:37:57.570 --> 00:38:01.030
Because that, in fact,
is all it is.
00:38:01.030 --> 00:38:05.430
But it says some unusual
and fascinating things.
00:38:05.430 --> 00:38:11.020
So let's suppose that x1, x2,
and so forth is the output
00:38:11.020 --> 00:38:14.970
from a discrete memoryless
source.
00:38:14.970 --> 00:38:18.180
Look at the log pmf
of each of these.
00:38:18.180 --> 00:38:22.800
Namely, they each have the same
distribution function.
00:38:22.800 --> 00:38:26.090
So w of x is going to be equal
to minus the logarithm
00:38:26.090 --> 00:38:31.790
of p sub x of x, for each of
these chance variables.
00:38:31.790 --> 00:38:36.460
w of x maps source symbols
into real numbers.
00:38:36.460 --> 00:38:40.850
So there's a random variable,
capital W of x sub j, which is
00:38:40.850 --> 00:38:41.970
a random variable.
00:38:41.970 --> 00:38:46.140
We have a random variable for
each one of these symbols to
00:38:46.140 --> 00:38:47.660
come out of the source.
00:38:47.660 --> 00:38:51.050
So, for each one of these
symbols, there's this log pmf
00:38:51.050 --> 00:38:55.900
random variable, which takes
on different values.
00:38:55.900 --> 00:39:00.790
So the expected value of this
log pmf, for the j'th symbol
00:39:00.790 --> 00:39:05.820
out of the source is the sum
of p sub x of x, namely the
00:39:05.820 --> 00:39:09.560
probability that the source
takes on the value x, times
00:39:09.560 --> 00:39:12.290
minus the logarithm
of p sub x.
00:39:12.290 --> 00:39:14.530
And that's just the entropy.
00:39:14.530 --> 00:39:18.770
So, the strange feeling about
this log pmf random variable
00:39:18.770 --> 00:39:22.480
is its expected value
is entropy.
00:39:22.480 --> 00:39:27.440
And w of x1, w of x2, and so
forth, are a sequence of IID
00:39:27.440 --> 00:39:31.320
random variables, each one of
them which has a mean, which
00:39:31.320 --> 00:39:32.570
is the entropy.
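A tiny numerical check of that statement: define the log pmf random variable w of x as minus log base 2 of p of x and take its expected value. The three-symbol pmf is my own example, chosen so the arithmetic comes out in round numbers.

```python
import math

# The log-pmf random variable: w(x) = -log2 p(x).
pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}
w = {x: -math.log2(p) for x, p in pmf.items()}  # w('a') = 1, w('b') = w('c') = 2

# Its expected value is exactly the entropy H(X).
H = sum(p * w[x] for x, p in pmf.items())
print(w, H)  # H = 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
```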
00:39:35.320 --> 00:39:38.560
So it's just exactly the
situation we had before.
00:39:38.560 --> 00:39:42.460
Instead of y bar, we have
the entropy of x.
00:39:42.460 --> 00:39:48.840
And instead of the random
variable y sub j, we have this
00:39:48.840 --> 00:39:53.580
random variable w of x sub j,
which is defined by the symbol
00:39:53.580 --> 00:39:54.830
in an alphabet.
00:40:00.170 --> 00:40:04.240
And just to review this, but
it's what we said before, if
00:40:04.240 --> 00:40:09.900
capital X1, this little x1,
namely, if little x1 is the
00:40:09.900 --> 00:40:15.650
sample value for the chance
variable x1 and if x2 is a
00:40:15.650 --> 00:40:19.720
sample value for the chance
variable X2, then the outcome
00:40:19.720 --> 00:40:25.660
for w of x1 plus w of x2 --
00:40:28.350 --> 00:40:31.200
very hard to keep all these
little letters and capital
00:40:31.200 --> 00:40:32.450
letters straight.
00:40:35.030 --> 00:40:39.850
Is w of x1 plus w of x2 is minus
the logarithm of the
00:40:39.850 --> 00:40:43.570
probability of x1 minus the
logarithm of the probability
00:40:43.570 --> 00:40:47.790
of x2, which is minus the
logarithm of the product,
00:40:47.790 --> 00:40:51.620
which is minus the logarithm of
the joint probability of x1
00:40:51.620 --> 00:40:57.610
and x2, which is the random
variable w of x1 x2.
00:40:57.610 --> 00:41:03.870
So the sum here is the random
variable corresponding to log
00:41:03.870 --> 00:41:10.650
pmf of the joint outputs
x1 and x2.
00:41:10.650 --> 00:41:18.110
So w of x1 x2 is the log pmf of
the event that this joint
00:41:18.110 --> 00:41:21.740
chance variable takes
on the value x1 x2.
00:41:21.740 --> 00:41:27.820
And the random variable w of x1 x2
is the sum of w of x1 and w of x2.
00:41:27.820 --> 00:41:29.690
So, what have I done here?
00:41:29.690 --> 00:41:32.540
I said this at the end of the
last slide, and you won't
00:41:32.540 --> 00:41:34.050
believe me.
00:41:34.050 --> 00:41:39.690
So, again this is one of these
things where tomorrow you
00:41:39.690 --> 00:41:40.610
won't believe me.
00:41:40.610 --> 00:41:42.580
And you'll have to go back
and look at that.
00:41:42.580 --> 00:41:45.630
But, anyway, x1 x2 is
a chance variable.
00:41:45.630 --> 00:41:50.090
And probabilities multiply and
log pmf's add, which is what
00:41:50.090 --> 00:41:52.020
we've been saying for a
couple of days now.
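That one-line fact can be verified directly: probabilities multiply, so log pmfs add. The two probabilities below are an arbitrary illustration picked to give exact powers of two.

```python
import math

# w(x1) + w(x2) = -log2 p(x1) - log2 p(x2) = -log2 (p(x1) * p(x2)).
p1, p2 = 0.5, 0.125
w1, w2 = -math.log2(p1), -math.log2(p2)
w_joint = -math.log2(p1 * p2)
print(w1, w2, w_joint)  # 1.0 3.0 4.0
```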
00:41:55.460 --> 00:41:56.820
So.
00:41:56.820 --> 00:42:06.430
If I look at the sum of n of
these random variables, the
00:42:06.430 --> 00:42:11.000
sum of these log probabilities
is the sum of the log of
00:42:11.000 --> 00:42:16.370
pmf's, which is minus the
logarithm of the probability
00:42:16.370 --> 00:42:19.190
of the entire sequence.
00:42:19.190 --> 00:42:22.110
That's just saying the same
thing we said before, for two
00:42:22.110 --> 00:42:23.250
random variables.
00:42:23.250 --> 00:42:28.140
The sample average of the log
pmf's is the sum of the w's
00:42:28.140 --> 00:42:31.480
divided by n, which is minus
the logarithm of the
00:42:31.480 --> 00:42:33.830
probability divided by n.
00:42:33.830 --> 00:42:37.700
The weak law of large numbers
applies, and the probability
00:42:37.700 --> 00:42:42.840
that this sample average minus
the expected value of w of x
00:42:42.840 --> 00:42:46.170
is greater than or equal to
epsilon is less than or equal
00:42:46.170 --> 00:42:47.690
to this quantity here.
00:42:47.690 --> 00:42:51.610
This quantity is minus the
logarithm of the probability
00:42:51.610 --> 00:42:57.740
of x to the n, divided by n, minus
H of x, greater than or
00:42:57.740 --> 00:42:59.340
equal to epsilon.
00:43:07.210 --> 00:43:09.450
So this is the thing that
we really want.
00:43:15.610 --> 00:43:17.470
I'm going to spend a few
slides trying to
00:43:17.470 --> 00:43:18.620
say what this means.
00:43:18.620 --> 00:43:22.170
But let's try to just look
at it now, and see
00:43:22.170 --> 00:43:24.190
what it must mean.
00:43:24.190 --> 00:43:29.350
It says that with high
probability, this quantity is
00:43:29.350 --> 00:43:32.900
almost the same as
this quantity.
00:43:32.900 --> 00:43:35.810
It says that with high
probability, the thing which
00:43:35.810 --> 00:43:42.630
comes out of the source is going
to have a probability, a
00:43:42.630 --> 00:43:47.930
log probability, divided by n,
which is close to the entropy.
00:43:47.930 --> 00:43:52.350
It says in some sense that with
high probability, the
00:43:52.350 --> 00:43:54.240
probability of what
comes out of the
00:43:54.240 --> 00:43:55.940
source is almost a constant.
00:43:59.020 --> 00:44:02.060
And that's amazing.
00:44:02.060 --> 00:44:04.200
That's what you'll wake up in
the morning and say, I don't
00:44:04.200 --> 00:44:05.450
believe that.
00:44:07.900 --> 00:44:10.430
But it's true.
00:44:10.430 --> 00:44:12.870
But you have to be careful
to interpret it right.
00:44:15.450 --> 00:44:18.710
So, we're going to define
the typical set.
00:44:18.710 --> 00:44:22.680
Namely, this is the typical set
of x's, which come out of
00:44:22.680 --> 00:44:23.630
the source.
00:44:23.630 --> 00:44:26.520
Namely, the typical
set of blocks of n
00:44:26.520 --> 00:44:29.490
symbols out of the source.
00:44:29.490 --> 00:44:32.510
And when you talk about a
typical set, you want
00:44:32.510 --> 00:44:36.180
something which includes most
of the probability.
00:44:36.180 --> 00:44:40.560
So what I'm going to include in
this typical set is all of
00:44:40.560 --> 00:44:43.160
these things that we talked
about before.
00:44:43.160 --> 00:44:46.520
Namely, we showed that the
probability that this quantity
00:44:46.520 --> 00:44:49.960
is greater than or equal to
epsilon is very small.
00:44:49.960 --> 00:44:53.980
So, with high probability what
comes out of the source
00:44:53.980 --> 00:44:57.030
satisfies this inequality
here.
00:44:57.030 --> 00:45:00.840
So I can write down the
distribution function of this
00:45:00.840 --> 00:45:02.480
random variable here.
00:45:02.480 --> 00:45:09.070
It's just this w --
00:45:12.840 --> 00:45:14.750
this is a random variable w.
00:45:14.750 --> 00:45:17.550
I'm looking at the distribution
of that random
00:45:17.550 --> 00:45:20.170
variable w.
00:45:20.170 --> 00:45:25.340
And this quantity in here is
the probability of this
00:45:25.340 --> 00:45:28.260
typical set.
00:45:28.260 --> 00:45:31.090
In other words, when I draw this
distribution function for
00:45:31.090 --> 00:45:34.820
this combined random variable,
I've defined this typical set
00:45:34.820 --> 00:45:40.020
to be all the sequences which
lie between this point and
00:45:40.020 --> 00:45:41.070
this point.
00:45:41.070 --> 00:45:43.690
Namely, this is H
minus epsilon.
00:45:43.690 --> 00:45:47.360
And this is H plus epsilon,
moving H out here.
00:45:47.360 --> 00:45:50.580
So these are all the sequences
which satisfy
00:45:50.580 --> 00:45:52.510
this inequality here.
00:45:52.510 --> 00:45:54.550
So that's what I mean
by the typical set.
00:45:54.550 --> 00:45:59.290
It's all things which are
clustered around H in this
00:45:59.290 --> 00:46:00.540
distribution function.
00:46:03.450 --> 00:46:07.320
And as n approaches infinity,
this typical set approaches
00:46:07.320 --> 00:46:09.170
probability 1.
00:46:09.170 --> 00:46:11.560
In the same way that the
law of large numbers
00:46:11.560 --> 00:46:12.420
behaves that way.
00:46:12.420 --> 00:46:18.090
The probability that x to the n
is in this typical set is
00:46:18.090 --> 00:46:23.180
greater than or equal to 1 minus
sigma squared divided by
00:46:23.180 --> 00:46:25.080
n epsilon squared.
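That convergence can be watched in a simulation: draw blocks from a discrete memoryless source, compute minus log base 2 of the block probability divided by n, and count how often it lands within epsilon of H. The source pmf, epsilon, and trial counts are my own parameters.

```python
import math
import random

random.seed(1)
pmf = {'a': 0.7, 'b': 0.2, 'c': 0.1}
H = -sum(p * math.log2(p) for p in pmf.values())
symbols, weights = zip(*pmf.items())

def per_symbol_log_pmf(n):
    """-log2 p(x^n) / n for one random length-n output of the source."""
    xs = random.choices(symbols, weights=weights, k=n)
    return -sum(math.log2(pmf[x]) for x in xs) / n

eps, trials = 0.1, 2000
typical_fraction = {}
for n in (10, 1000):
    hits = sum(abs(per_symbol_log_pmf(n) - H) < eps for _ in range(trials))
    typical_fraction[n] = hits / trials
# The fraction of outputs in the typical set climbs toward 1 as n grows.
print(round(H, 3), typical_fraction)
```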
00:46:30.800 --> 00:46:34.230
Let's try to express that in
a bunch of other ways.
00:46:40.400 --> 00:46:44.880
If you're getting lost,
please ask questions.
00:46:44.880 --> 00:46:49.800
But I hope to come back to this
in a little bit, after we
00:46:49.800 --> 00:46:52.850
finish a little more
of the story.
00:46:52.850 --> 00:47:03.060
So, another way of expressing
this typical set -- let me
00:47:03.060 --> 00:47:05.760
look at that as the
typical set.
00:47:05.760 --> 00:47:10.920
If I take this inequality here
and I rewrite this, namely,
00:47:10.920 --> 00:47:16.190
this is the set of x's for
which this is less than
00:47:16.190 --> 00:47:20.970
epsilon plus H of x
and greater than
00:47:20.970 --> 00:47:23.330
H of x minus epsilon.
00:47:23.330 --> 00:47:26.900
So that's what I've
expressed here.
00:47:26.900 --> 00:47:31.880
It's the set of x's for which
n times H of x minus epsilon
00:47:31.880 --> 00:47:36.260
is less than minus the logarithm of
the probability, which is less than
00:47:36.260 --> 00:47:38.830
n times H of x plus epsilon.
00:47:38.830 --> 00:47:43.630
Namely, I'm looking at this
range of epsilon around H,
00:47:43.630 --> 00:47:46.980
which is this and this.
00:47:46.980 --> 00:47:50.840
If I write it again, if I
exponentiate all of this, it's
00:47:50.840 --> 00:47:55.810
the set of x's for which 2 to
the minus n, H of x, plus
00:47:55.810 --> 00:47:59.740
epsilon, that's this term
exponentiated, is less than
00:47:59.740 --> 00:48:03.170
this is less than this
term exponentiated.
00:48:03.170 --> 00:48:05.610
And what's going on here
is, I've taken care of
00:48:05.610 --> 00:48:08.170
the minus sign also.
00:48:08.170 --> 00:48:10.400
And if you can follow that in
your head, you're a better
00:48:10.400 --> 00:48:12.400
person than I am.
00:48:12.400 --> 00:48:14.700
But, anyway, it works.
00:48:14.700 --> 00:48:16.790
And if you fiddle around with
that, you'll see that that's
00:48:16.790 --> 00:48:17.860
what it is.
00:48:17.860 --> 00:48:24.300
So this typical set is a bound
on the probabilities of all
00:48:24.300 --> 00:48:26.090
these typical sequences.
00:48:26.090 --> 00:48:31.980
The typical sequences all are
enclosed in this range of
00:48:31.980 --> 00:48:33.230
probabilities.
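The exponentiated form can be checked by brute force on a tiny source: a length-n string is typical exactly when its probability lies between 2 to the minus n times H plus epsilon and 2 to the minus n times H minus epsilon. The binary source, n, and epsilon below are my own small parameters.

```python
import itertools
import math

pmf = {'a': 0.7, 'b': 0.3}
H = -sum(p * math.log2(p) for p in pmf.values())
n, eps = 8, 0.15

# Typicality in exponentiated form: 2^(-n(H+eps)) < p(x^n) < 2^(-n(H-eps)).
lo, hi = 2 ** (-n * (H + eps)), 2 ** (-n * (H - eps))
typical_count, typical_prob = 0, 0.0
for xs in itertools.product(pmf, repeat=n):
    px = math.prod(pmf[x] for x in xs)
    if lo < px < hi:
        typical_count += 1
        typical_prob += px
print(typical_count, typical_prob)
```

Even at n equal to 8 the typical strings are pinned near 2 to the minus n H, but they carry only a bit over half the probability, which is why the statement is an asymptotic one.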
00:48:35.680 --> 00:48:39.440
So the typical elements are
approximately equiprobable, in
00:48:39.440 --> 00:48:42.130
this strange sense above.
00:48:42.130 --> 00:48:45.100
Why do I say this is
a strange sense?
00:48:45.100 --> 00:48:49.690
Well, as n gets large,
what happens here?
00:48:49.690 --> 00:48:52.810
This is 2 to the minus
n times H of x.
00:48:52.810 --> 00:48:55.360
Which is the important
part of this.
00:48:55.360 --> 00:48:59.820
This epsilon here is
multiplied by n.
00:48:59.820 --> 00:49:02.400
And we're trying to say, as n
gets very, very big, we can
00:49:02.400 --> 00:49:04.700
make epsilon very, very small.
00:49:04.700 --> 00:49:09.130
But we really can't make n times
epsilon very negligible.
00:49:09.130 --> 00:49:12.410
But the point is, the important
thing here is, 2 to
00:49:12.410 --> 00:49:15.680
the minus n times H of x.
00:49:15.680 --> 00:49:19.640
So, in some sense, this
is close to 2 to the
00:49:19.640 --> 00:49:21.750
minus n H of x.
00:49:21.750 --> 00:49:23.140
In what sense is it true?
00:49:23.140 --> 00:49:28.140
Well, it's true in that sense.
00:49:28.140 --> 00:49:31.980
Where that, in fact is,
a valid inequality.
00:49:31.980 --> 00:49:35.130
Namely in terms of
sample averages,
00:49:35.130 --> 00:49:37.160
these things are close.
00:49:37.160 --> 00:49:40.210
When I do the exponentiation and
get rid of the n and all
00:49:40.210 --> 00:49:43.820
that stuff, they aren't
very close.
00:49:43.820 --> 00:49:48.760
But saying this sort of thing is
sort of like saying that 10
00:49:48.760 --> 00:49:52.502
to the minus 23 is approximately
equal to 10 to
00:49:52.502 --> 00:49:54.950
the minus 25.
00:49:54.950 --> 00:49:57.060
And they're approximately equal
because they're both
00:49:57.060 --> 00:49:59.170
very, very small.
00:49:59.170 --> 00:50:02.510
And that's the kind of thing
that's going on here.
00:50:02.510 --> 00:50:05.330
And you're trying to distinguish
10 to the minus 23
00:50:05.330 --> 00:50:10.540
and 10 to the minus 25 from 10
to the minus 60th and from 10
00:50:10.540 --> 00:50:12.950
to the minus three.
00:50:12.950 --> 00:50:16.500
So that's the kind of
approximations we're using.
00:50:16.500 --> 00:50:20.040
Namely, we're using
approximations on a log scale,
00:50:20.040 --> 00:50:23.510
instead of approximations
of ordinary numbers.
00:50:23.510 --> 00:50:27.800
But, still it's convenient to
think of these typical x's,
00:50:27.800 --> 00:50:31.270
typical sequences, as being
sequences which are
00:50:31.270 --> 00:50:33.900
constrained in probability
in this way.
00:50:33.900 --> 00:50:37.290
And this is the thing which
is easy to work with.
00:50:37.290 --> 00:50:41.910
The atypical set of strings,
namely, the complement of this
00:50:41.910 --> 00:50:45.990
set, the thing we know about
that is the entire set doesn't
00:50:45.990 --> 00:50:48.030
have much probability.
00:50:48.030 --> 00:50:53.080
Namely, if you fix epsilon and
you let n get bigger and
00:50:53.080 --> 00:50:57.860
bigger, this atypical set
becomes totally negligible.
00:50:57.860 --> 00:50:59.110
And you can ignore it.
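What the law of large numbers is doing here can be checked numerically. Here's a minimal sketch, assuming an illustrative three-letter discrete memoryless source (the alphabet and pmf are made up for the example); membership in the typical set is just a test on the sample average of the log pmf.

```python
import math
import random

# A hypothetical three-letter discrete memoryless source; the pmf
# values are illustrative assumptions, not from the lecture.
pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}
H = -sum(p * math.log2(p) for p in pmf.values())  # entropy H(X) = 1.5 bits

def is_typical(x, eps):
    """True if x is in T_eps^n: the sample average of -log2 pmf
    over the string is within eps of the entropy H(X)."""
    avg = -sum(math.log2(pmf[s]) for s in x) / len(x)
    return abs(avg - H) < eps

random.seed(1)
x = random.choices(list(pmf), weights=list(pmf.values()), k=10_000)
# For large n, the law of large numbers makes a sampled string
# epsilon-typical with probability close to 1.
print(is_typical(x, 0.05))
```

With n this large the sample average sits within a few thousandths of H, so the atypical event essentially never shows up.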
00:51:02.330 --> 00:51:06.940
So let's plow ahead.
00:51:06.940 --> 00:51:12.220
Stop for an example pretty
soon, but --
00:51:12.220 --> 00:51:15.830
If I have a sequence which is
in the typical set, we then
00:51:15.830 --> 00:51:20.400
know that its probability is
greater than 2 to the minus n
00:51:20.400 --> 00:51:23.520
times H of x plus epsilon.
00:51:23.520 --> 00:51:26.150
That's what we said before.
00:51:26.150 --> 00:51:29.330
And, therefore, when I use
this inequality, the
00:51:29.330 --> 00:51:34.900
probability of x to the n, for
something in the typical set,
00:51:34.900 --> 00:51:37.940
is greater than this
quantity here.
00:51:37.940 --> 00:51:47.950
In other words, this is
greater than that.
00:51:47.950 --> 00:51:50.420
For everything in
a typical set.
00:51:50.420 --> 00:51:53.640
So now I'm adding over things
in the typical set.
00:51:53.640 --> 00:51:56.170
So I need to include the
number of things
00:51:56.170 --> 00:51:57.590
in a typical set.
00:51:57.590 --> 00:52:01.190
So what I have is this sum.
00:52:01.190 --> 00:52:02.470
And what is this sum?
00:52:02.470 --> 00:52:06.000
This is the probability
of the typical set.
00:52:06.000 --> 00:52:08.960
Because I'm adding over all
elements in the typical set.
00:52:08.960 --> 00:52:11.880
And it's greater than or equal
to the number of elements in a
00:52:11.880 --> 00:52:15.660
typical set times these
small probabilities.
00:52:15.660 --> 00:52:19.230
If I turn this around, it says
that the number of elements in
00:52:19.230 --> 00:52:22.460
a typical set is less
than 2 to the n
00:52:22.460 --> 00:52:25.820
times H of x plus epsilon.
00:52:25.820 --> 00:52:30.000
For any epsilon, no matter how
small I want to make it.
00:52:30.000 --> 00:52:33.710
Which says that the elements
in a typical set have
00:52:33.710 --> 00:52:38.200
probabilities which are about 2
to the minus n times H of x.
00:52:38.200 --> 00:52:41.480
And the number of them is
approximately 2 to the
00:52:41.480 --> 00:52:44.110
n times H of x.
00:52:44.110 --> 00:52:47.910
In other words, what it says is
that this typical set is a
00:52:47.910 --> 00:52:53.900
bunch of essentially uniform
probabilities.
00:52:53.900 --> 00:52:58.550
So what I've done is to take
this very complicated source.
00:52:58.550 --> 00:53:05.360
And when I look at these very
humongous chance variables,
00:53:05.360 --> 00:53:10.670
which are very large sequences
out of the source, what I find
00:53:10.670 --> 00:53:14.510
is that there's a bunch of
things which collectively have
00:53:14.510 --> 00:53:16.410
zilch probability.
00:53:16.410 --> 00:53:18.980
There's a bunch of other things
which all have equal
00:53:18.980 --> 00:53:20.090
probability.
00:53:20.090 --> 00:53:24.650
And the number of them is
enough to add up to one.
00:53:24.650 --> 00:53:28.820
So I have turned this source,
when I look at it over a long
00:53:28.820 --> 00:53:38.080
enough sequence, into a source
of equiprobable events.
00:53:38.080 --> 00:53:41.470
And each of those events has
this probability here.
00:53:41.470 --> 00:53:46.540
Now, we know how to encode
equiprobable events.
00:53:46.540 --> 00:53:48.140
And that's the whole
point of this.
00:53:50.770 --> 00:53:55.820
So, this is less than
or equal to that.
00:53:55.820 --> 00:53:59.000
On the other side, we know that
1 minus delta is less
00:53:59.000 --> 00:54:04.970
than or equal to this
probability of a typical set.
00:54:04.970 --> 00:54:09.590
And this is less than the
number of elements in a
00:54:09.590 --> 00:54:13.860
typical set times 2 to the minus
n H of x minus epsilon.
00:54:13.860 --> 00:54:16.320
This is an upper
bound on this.
00:54:16.320 --> 00:54:24.240
This is less than this.
00:54:27.600 --> 00:54:30.570
So I just add all these things
up and I get this bound.
00:54:30.570 --> 00:54:34.200
So it says, the size of the
typical set is greater than 1
00:54:34.200 --> 00:54:37.360
minus delta, times
this quantity.
00:54:37.360 --> 00:54:41.520
In other words, this is a pretty
exact sort of thing.
00:54:41.520 --> 00:54:44.870
If you don't mind dealing
with this 2 to the n
00:54:44.870 --> 00:54:47.270
epsilon factor here.
00:54:47.270 --> 00:54:50.150
If you agree that that's
negligible in some strange
00:54:50.150 --> 00:54:53.860
sense, then all of this
makes good sense.
00:54:53.860 --> 00:54:57.760
And if it is negligible, let me
start talking about source
00:54:57.760 --> 00:55:01.420
coding, which is why
this all works out.
00:55:01.420 --> 00:55:05.460
So the summary is that the
probability of the complement
00:55:05.460 --> 00:55:10.650
of the typical set
is essentially 0.
00:55:10.650 --> 00:55:14.340
The number of elements in a
typical set is approximately 2
00:55:14.340 --> 00:55:16.130
to the n times H of x.
00:55:16.130 --> 00:55:18.610
I'm getting rid of all the
deltas and epsilons here, to
00:55:18.610 --> 00:55:22.380
get sort of the broad view
of what's important here.
00:55:22.380 --> 00:55:25.650
Each of the elements in a
typical set has probability 2
00:55:25.650 --> 00:55:28.170
to the minus n times H of x.
00:55:28.170 --> 00:55:32.175
So I've turned a source
into a source of
00:55:32.175 --> 00:55:34.230
equiprobable elements.
00:55:34.230 --> 00:55:37.070
And there are 2 to the n
times H of x of them.
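Those three summary statements can be verified exactly in a small case by summing over the number of 1's rather than over all 2 to the n strings. A sketch, assuming an illustrative biased coin with p equal to 1/4 and block length n equal to 20:

```python
import math

p, n, eps = 0.25, 20, 0.1      # illustrative biased coin and block length
H = -p*math.log2(p) - (1 - p)*math.log2(1 - p)   # binary entropy, ~0.811 bits

# A string with k 1's has probability p^k (1-p)^(n-k); it is eps-typical
# when -(1/n) log2 of that probability is within eps of H.
typ_count, typ_prob = 0, 0.0
for k in range(n + 1):
    logp = k*math.log2(p) + (n - k)*math.log2(1 - p)
    if abs(-logp/n - H) < eps:
        typ_count += math.comb(n, k)
        typ_prob += math.comb(n, k) * 2.0**logp

print(typ_count, 2**(n*H))   # count is within a factor 2^(n*eps) of 2^(nH)
print(typ_prob)              # probability of the typical set
```

At n equal to 20 the typical set here carries only a bit over half the probability; the 1 minus delta guarantee only bites at much larger n, which is exactly the asymptotic part of the story.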
00:55:43.100 --> 00:55:46.320
Let's do an example of this.
00:55:46.320 --> 00:55:48.890
It's an example that you'll work
on more in the homework
00:55:48.890 --> 00:55:52.810
and do it a little
more cleanly.
00:55:52.810 --> 00:55:57.120
Let's look at a binary discrete
memoryless source,
00:55:57.120 --> 00:56:02.310
where the probability that x is
equal to 1 is p, which is
00:56:02.310 --> 00:56:03.920
less than 1/2.
00:56:03.920 --> 00:56:07.070
And the probability of 0
is greater than 1/2.
00:56:07.070 --> 00:56:12.640
So, this is what you get when
you have a biased coin.
00:56:12.640 --> 00:56:17.420
And the biased coin has a
1 on one side and a 0
00:56:17.420 --> 00:56:19.340
on the other side.
00:56:19.340 --> 00:56:23.070
And it's more likely to
come up 0's than 1's.
00:56:23.070 --> 00:56:26.080
I always used to wonder how
to make a biased coin.
00:56:26.080 --> 00:56:28.240
And I can give you a little
experiment which shows you that you
00:56:28.240 --> 00:56:30.400
can make a biased coin.
00:56:30.400 --> 00:56:34.140
I mean, a coin is a little
round thing which is flat on
00:56:34.140 --> 00:56:35.840
the top and bottom.
00:56:35.840 --> 00:56:40.070
Suppose instead of that you
make a triangular coin.
00:56:40.070 --> 00:56:43.140
And instead of making it flat on
top and bottom, you turn it
00:56:43.140 --> 00:56:45.800
into a tetrahedron.
00:56:45.800 --> 00:56:50.630
So in fact, what this is now is
a coin which is built up on
00:56:50.630 --> 00:56:54.090
one side into a very
massive thing.
00:56:54.090 --> 00:56:57.070
And is flat on the other side.
00:56:57.070 --> 00:56:59.700
Since it's a tetrahedron
and it's an equilateral
00:56:59.700 --> 00:57:04.730
tetrahedron, the probability of
1 is going to be 1/4, and
00:57:04.730 --> 00:57:07.850
the probability of 0
is going to be 3/4.
00:57:07.850 --> 00:57:10.760
So you can make biased coins.
00:57:10.760 --> 00:57:12.760
So when you get into
coin-tossing games with
00:57:12.760 --> 00:57:15.045
people, watch the coin
that they're using.
00:57:15.045 --> 00:57:19.120
It probably won't be a
tetrahedron, but anyway.
00:57:21.820 --> 00:57:28.520
So the entropy here, the log pmf
random variable, takes on
00:57:28.520 --> 00:57:32.300
the value of minus log
p with probability p.
00:57:32.300 --> 00:57:35.950
And it takes on the value minus
log 1 minus p, with
00:57:35.950 --> 00:57:37.490
probability 1 minus p.
00:57:37.490 --> 00:57:40.080
This is a probability of a 1.
00:57:40.080 --> 00:57:42.700
This is a probability of a 0.
00:57:42.700 --> 00:57:46.270
So, the entropy is
equal to this.
00:57:46.270 --> 00:57:48.980
Used to be that in information
theory courses, people would
00:57:48.980 --> 00:57:52.050
almost memorize what this
curve looked like.
00:57:52.050 --> 00:57:53.250
And they'd draw pictures
of it.
00:57:53.250 --> 00:57:56.140
There were famous curves
of this function,
00:57:56.140 --> 00:57:58.950
which looks like this.
00:58:07.280 --> 00:58:17.620
[Draws the binary entropy curve on the board; it is 0 at p equals 0 and at p equals 1, and peaks at 1.]
00:58:17.620 --> 00:58:20.800
Turns out, that's not all that
important a distribution.
00:58:20.800 --> 00:58:24.510
It's a nice example
to talk about.
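That curve is the binary entropy function. A small sketch of its shape, with arbitrary sample points:

```python
import math

def h2(p):
    """Binary entropy in bits; h2(0) = h2(1) = 0 by the usual convention."""
    if p in (0, 1):
        return 0.0
    return -p*math.log2(p) - (1 - p)*math.log2(1 - p)

# Zero at both endpoints, symmetric about 1/2, peak of 1 bit at p = 1/2.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, round(h2(p), 3))
```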
00:58:24.510 --> 00:58:28.400
The typical set, T epsilon n,
is the set of strings with
00:58:28.400 --> 00:58:34.710
about pn 1's and about 1
minus p times n 0's.
00:58:34.710 --> 00:58:38.770
In other words, that's the
typical thing to happen.
00:58:38.770 --> 00:58:41.900
And it's the typical thing in
terms of this law of large
00:58:41.900 --> 00:58:42.690
numbers here.
00:58:42.690 --> 00:58:46.520
Because you get 1's with
probability p.
00:58:46.520 --> 00:58:48.700
And therefore in a long
sequence, you're going to get
00:58:48.700 --> 00:58:53.190
about pn 1's and
1 minus p times n 0's.
00:58:53.190 --> 00:58:58.520
The probability of a typical
string is, if you get a string
00:58:58.520 --> 00:59:01.940
with this many 1's and
this many 0's, its
00:59:01.940 --> 00:59:04.500
probability is p to the pn.
00:59:04.500 --> 00:59:08.280
Namely, the probability of a 1
raised to the number of 1's you
00:59:08.280 --> 00:59:10.610
get, which is pn.
00:59:10.610 --> 00:59:13.420
Times the probability
of a 0, raised to the
00:59:13.420 --> 00:59:16.210
number of 0's you get.
00:59:16.210 --> 00:59:19.170
And if you look at what this
is, if you take p up in the
00:59:19.170 --> 00:59:22.850
exponent and 1 minus p up in
the exponent, this becomes
00:59:22.850 --> 00:59:27.700
2 to the minus n times H of x,
just like what it should be.
00:59:27.700 --> 00:59:31.780
So these typical strings, with
about pn 1's and 1 minus p times n
00:59:31.780 --> 00:59:34.720
0's, are in fact typical
in the sense we've
00:59:34.720 --> 00:59:36.560
been talking about.
00:59:36.560 --> 00:59:43.100
The number of n strings with pn
1's is n factorial divided
00:59:43.100 --> 00:59:47.760
by pn factorial, divided by the
factorial of n times 1 minus p.
00:59:52.070 --> 00:59:54.960
I mean I hope you learned that
a long time ago, but you
00:59:54.960 --> 00:59:56.910
should learn it in probability
anyway.
00:59:56.910 --> 01:00:01.260
It's just very simple
combinatorics.
01:00:01.260 --> 01:00:04.270
So you have that many
different strings.
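That count can be compared directly with 2 to the n times H of x. A sketch with illustrative values of p and n, chosen so that pn is an integer:

```python
import math

# Compare the exact count n! / ((pn)! (n(1-p))!) of strings with pn 1's
# against 2^{n H(p)}; p and the block lengths n are illustrative.
p = 0.25
H = -p*math.log2(p) - (1 - p)*math.log2(1 - p)
for n in (40, 400, 4000):
    k = int(p * n)                    # pn is an integer for these n
    exact = math.comb(n, k)           # the binomial coefficient above
    # On a log scale the two agree to within order log n bits.
    print(n, round(math.log2(exact), 1), round(n * H, 1))
```

As raw numbers the two are far apart, but on a log scale the gap per symbol vanishes as n grows, in the same spirit as the 10 to the minus 23 versus 10 to the minus 25 comparison earlier.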
01:00:04.270 --> 01:00:07.430
So what I'm trying to get across
here is, there are a
01:00:07.430 --> 01:00:10.580
bunch of different things
going on here.
01:00:10.580 --> 01:00:13.600
We can talk about the random
variable which is the number
01:00:13.600 --> 01:00:16.990
of 1's that occur in
this long sequence.
01:00:16.990 --> 01:00:20.460
And with high probability, the
number of 1's that occur is
01:00:20.460 --> 01:00:22.970
close to pn.
01:00:22.970 --> 01:00:26.470
But if pn 1's occur, there's
still an awful lot of
01:00:26.470 --> 01:00:28.400
randomness left.
01:00:28.400 --> 01:00:33.310
Because we have to worry about
where those pn 1's appear.
01:00:33.310 --> 01:00:36.140
And those are the sequences
we're talking about.
01:00:36.140 --> 01:00:41.520
So, there are this many
sequences, all of which have
01:00:41.520 --> 01:00:44.940
that many 1's in them.
01:00:44.940 --> 01:00:48.850
And there's a similar number of
sequences for all similar
01:00:48.850 --> 01:00:50.160
numbers of 1's.
01:00:50.160 --> 01:00:54.510
Namely, if you take pn plus 1
and pn plus 2, pn minus 1, pn
01:00:54.510 --> 01:00:57.780
minus 2, you get similar
numbers here.
01:00:57.780 --> 01:01:00.890
So those are the typical
sequences.
01:01:00.890 --> 01:01:03.980
Now, the important thing to
observe here is that you
01:01:03.980 --> 01:01:08.890
really have 2 to the n binary
strings altogether.
01:01:08.890 --> 01:01:13.270
And what this result is saying
is that collectively those
01:01:13.270 --> 01:01:14.490
don't make any difference.
01:01:14.490 --> 01:01:17.820
The law of large numbers says,
OK, there's just a humongous
01:01:17.820 --> 01:01:20.080
number of strings.
01:01:20.080 --> 01:01:23.780
You get the largest number of
strings which have about half
01:01:23.780 --> 01:01:25.510
1's and half 0's.
01:01:25.510 --> 01:01:29.100
But their probability
is zilch.
01:01:29.100 --> 01:01:32.540
So the thing which is probable
is getting pn 1's
01:01:32.540 --> 01:01:34.750
and 1 minus p times n 0's.
01:01:34.750 --> 01:01:37.290
Now, we have this typical set.
01:01:37.290 --> 01:01:41.410
What is the most likely sequence
of all, in this
01:01:41.410 --> 01:01:42.660
experiment?
01:01:45.450 --> 01:01:48.130
How do I maximize the
probability of
01:01:48.130 --> 01:01:49.620
a particular sequence?
01:01:49.620 --> 01:02:03.910
The probability of the sequence
is p to the i times 1
01:02:03.910 --> 01:02:07.420
minus p to the n minus i.
01:02:07.420 --> 01:02:11.050
And 1 minus p is the
probability of 0.
01:02:11.050 --> 01:02:14.240
And p is the probability
of a 1.
01:02:14.240 --> 01:02:15.970
How do I choose i to
maximize this?
01:02:15.970 --> 01:02:16.300
Yeah?
01:02:16.300 --> 01:02:18.150
AUDIENCE: [UNINTELLIGIBLE]
all 0's.
01:02:18.150 --> 01:02:19.540
PROFESSOR: You make
them all 0's.
01:02:19.540 --> 01:02:23.750
So the most likely sequence
is all 0's.
01:02:23.750 --> 01:02:25.860
But that's not a typical
sequence.
01:02:29.700 --> 01:02:33.290
Why isn't it a typical
sequence?
01:02:33.290 --> 01:02:36.060
Because we chose to define
typical sequence in a
01:02:36.060 --> 01:02:37.880
different way.
01:02:37.880 --> 01:02:41.180
Namely, it's only one of those, and
there are only n of them
01:02:41.180 --> 01:02:43.650
with only a single one.
01:02:43.650 --> 01:02:46.920
So, in other words, what's going
on is that we have an
01:02:46.920 --> 01:02:49.640
enormous number of sequences
which have around half
01:02:49.640 --> 01:02:50.890
1's and half 0's.
01:02:53.430 --> 01:02:55.240
But they don't have
any probability.
01:02:55.240 --> 01:02:57.840
And collectively they don't
have any probability.
01:02:57.840 --> 01:03:01.380
We have a very small number of
sequences which have a very
01:03:01.380 --> 01:03:03.750
large number of 0's.
01:03:03.750 --> 01:03:07.960
But there aren't enough of those
to make any difference.
01:03:07.960 --> 01:03:10.750
And, therefore, the things that
make a difference are
01:03:10.750 --> 01:03:14.710
these typical things which
have about np 1's
01:03:14.710 --> 01:03:18.270
and 1 minus p times n 0's.
01:03:18.270 --> 01:03:20.680
And that all sounds
very strange.
01:03:20.680 --> 01:03:22.800
But if I phrase this a different
way, you would all
01:03:22.800 --> 01:03:27.470
say that's exactly the way
you ought to do things.
01:03:27.470 --> 01:03:32.210
Because, in fact, when we look
at very, very long sequences,
01:03:32.210 --> 01:03:35.175
you know with extraordinarily
high probability what's going
01:03:35.175 --> 01:03:39.050
to come out of the source is
something with about pn 1's
01:03:39.050 --> 01:03:42.430
and about 1 minus
p times n 0's.
01:03:42.430 --> 01:03:46.410
So that's the likely set of
things to have happen.
01:03:46.410 --> 01:03:47.590
And it's just that there
are an enormous
01:03:47.590 --> 01:03:49.200
number of those things.
01:03:49.200 --> 01:03:51.890
There are this many of them.
01:03:51.890 --> 01:03:56.150
So, here what we're dealing with
is a balance between the
01:03:56.150 --> 01:04:01.090
number of elements of a
particular type, and the
01:04:01.090 --> 01:04:03.520
probability of them.
01:04:03.520 --> 01:04:07.030
And it turns out that this
number and its probability
01:04:07.030 --> 01:04:10.650
balance out to say that usually
what you get is about
01:04:10.650 --> 01:04:13.780
pn 1's and 1 minus
p times n 0's.
01:04:13.780 --> 01:04:16.730
Which is what the law of large
numbers said to begin with.
01:04:16.730 --> 01:04:20.300
All we're doing is interpreting
that here.
01:04:20.300 --> 01:04:25.210
But the thing that you see from
this example is, all of
01:04:25.210 --> 01:04:28.680
these things with exactly pn 1's
in them, assuming that pn
01:04:28.680 --> 01:04:31.270
is an integer, are
all equiprobable.
01:04:31.270 --> 01:04:34.940
They're all exactly
equiprobable.
01:04:34.940 --> 01:04:37.990
So what we're doing when we're
talking about this typical
01:04:37.990 --> 01:04:42.140
set, is first throwing out all
the things which have too many
01:04:42.140 --> 01:04:44.570
1's or too few
1's in them.
01:04:44.570 --> 01:04:48.560
We're keeping only the ones
which are typical in the sense
01:04:48.560 --> 01:04:50.920
that they obey the law
of large numbers.
01:04:50.920 --> 01:04:54.100
And in this case, they obey the
law of large numbers for
01:04:54.100 --> 01:04:56.730
log pmf's also.
01:04:56.730 --> 01:05:01.770
And then all of those things
are about equally probable.
01:05:01.770 --> 01:05:05.460
So the idea in source coding
is, one of the ways to deal
01:05:05.460 --> 01:05:10.430
with source coding is, you want
to assign codewords to
01:05:10.430 --> 01:05:13.570
only these typical things.
01:05:13.570 --> 01:05:16.240
Now, maybe you might want to
assign codewords to something
01:05:16.240 --> 01:05:17.870
like all 0's also.
01:05:17.870 --> 01:05:20.570
Because it hardly
costs anything.
01:05:20.570 --> 01:05:23.810
And a Huffman code would
certainly do that.
01:05:23.810 --> 01:05:27.310
But it's not very important
whether you do or not.
01:05:27.310 --> 01:05:30.300
The important thing is, you
assign codewords to all of
01:05:30.300 --> 01:05:31.910
these typical sequences.
01:05:37.770 --> 01:05:41.280
So let's go back to
fixed-to-fixed
01:05:41.280 --> 01:05:42.660
length source codes.
01:05:42.660 --> 01:05:45.500
We talked a little bit about
fixed-to-fixed length source
01:05:45.500 --> 01:05:46.940
codes before.
01:05:46.940 --> 01:05:48.980
Do you remember what we did
with fixed-to-fixed length
01:05:48.980 --> 01:05:50.720
source codes before?
01:05:50.720 --> 01:05:53.520
We said we have an alphabet
of size m.
01:05:53.520 --> 01:05:56.250
We want something which
is uniquely decodable.
01:05:56.250 --> 01:05:59.020
And since we want something
which is uniquely decodable,
01:05:59.020 --> 01:06:02.510
we have to provide codewords
for everything.
01:06:02.510 --> 01:06:07.780
And, therefore, if we want to
choose a block length of n,
01:06:07.780 --> 01:06:11.730
we've got to generate m
to the n codewords.
01:06:11.730 --> 01:06:14.700
Here we say, wow, maybe we
don't have to provide
01:06:14.700 --> 01:06:17.250
codewords for everything.
01:06:17.250 --> 01:06:20.520
Maybe we're willing to tolerate
a certain small
01:06:20.520 --> 01:06:23.070
probability that the whole
thing fails and
01:06:23.070 --> 01:06:24.320
falls on its face.
01:06:27.040 --> 01:06:30.280
Now, does that make any sense?
01:06:30.280 --> 01:06:32.330
Well, view things the
following way.
01:06:32.330 --> 01:06:36.090
We said, when we started out
all of this, that we were
01:06:36.090 --> 01:06:38.880
going to look at prefix-free
codes.
01:06:38.880 --> 01:06:42.640
Where some codewords had a
longer length and some
01:06:42.640 --> 01:06:44.730
codewords had a shorter
length.
01:06:44.730 --> 01:06:48.040
And we were thinking of encoding
either single letters
01:06:48.040 --> 01:06:52.340
at a time, or a small block
of letters at a time.
01:06:52.340 --> 01:06:55.960
So think of encoding, say,
10 letters at a time.
01:06:55.960 --> 01:07:02.250
And think of doing this for
10 to the 20th letters.
01:07:02.250 --> 01:07:05.740
So you have the source here
which is pumping out letters
01:07:05.740 --> 01:07:08.280
at a regular rate.
01:07:08.280 --> 01:07:12.540
You're blocking them into
n letters at a time.
01:07:12.540 --> 01:07:15.540
You're encoding in a
prefix-free code.
01:07:15.540 --> 01:07:17.790
Out comes something.
01:07:17.790 --> 01:07:22.560
What comes out is not coming
out at a regular rate.
01:07:22.560 --> 01:07:25.670
What is coming out, sometimes
you get a lot of bits out.
01:07:25.670 --> 01:07:28.450
Sometimes a small number
of bits out.
01:07:28.450 --> 01:07:30.730
So, in other words, if you want
to send things over a
01:07:30.730 --> 01:07:34.970
channel, you need a buffer
there to save things.
01:07:34.970 --> 01:07:39.000
If, in fact, we decide that the
expected number of bits
01:07:39.000 --> 01:07:43.960
per source letter is, say, five
bits per source letter,
01:07:43.960 --> 01:07:48.540
then we expect over a very long
time to be producing five
01:07:48.540 --> 01:07:50.830
bits per source letter.
01:07:50.830 --> 01:07:54.460
And if we turn our channel on
for one year, to transmit all
01:07:54.460 --> 01:07:59.010
of these things, what's going
to happen is this very
01:07:59.010 --> 01:08:02.080
unlikely sequence occurs.
01:08:02.080 --> 01:08:05.910
Which in fact requires not one
year to transmit, but two
01:08:05.910 --> 01:08:09.520
years to transmit.
01:08:09.520 --> 01:08:13.150
In fact, what do we do if it
takes one year and five
01:08:13.150 --> 01:08:18.140
minutes to transmit instead
of one year?
01:08:18.140 --> 01:08:19.050
Well, we've got a failure.
01:08:19.050 --> 01:08:22.520
Somehow or other, the network
is going to fail us.
01:08:22.520 --> 01:08:25.350
I mean we all know that networks
fail all the time
01:08:25.350 --> 01:08:28.530
despite what engineers say.
01:08:28.530 --> 01:08:32.120
I mean, all of us who use
networks know that they do
01:08:32.120 --> 01:08:33.820
crazy things.
01:08:33.820 --> 01:08:36.590
And one of those crazy things
is that unusual things
01:08:36.590 --> 01:08:38.270
sometimes happen.
01:08:38.270 --> 01:08:42.640
So, we develop this very nice
theory of prefix-free codes.
01:08:42.640 --> 01:08:46.580
But prefix-free codes,
in fact, fail also.
01:08:46.580 --> 01:08:50.880
And they fail also because
buffers overflow.
01:08:50.880 --> 01:08:54.160
In other words, we are counting
on encoding things
01:08:54.160 --> 01:08:58.020
with a certain number of
bits per source symbol.
01:08:58.020 --> 01:09:00.770
And if these unusual things
occur, and we have too many
01:09:00.770 --> 01:09:04.780
bits per source symbol,
then we fail.
01:09:04.780 --> 01:09:08.960
So the idea that we're trying
to get at now is that
01:09:08.960 --> 01:09:13.560
prefix-free codes and
fixed-to-fixed length source
01:09:13.560 --> 01:09:16.640
codes which only encode
typical things are,
01:09:16.640 --> 01:09:20.710
in fact, sort of the same
if you look at them over a
01:09:20.710 --> 01:09:22.860
very, very large sequence
length.
01:09:22.860 --> 01:09:26.980
In other words, if you look at
a prefix-free code which is
01:09:26.980 --> 01:09:31.190
dealing with blocks of 10
letters, and you look at a
01:09:31.190 --> 01:09:34.120
fixed-to-fixed length code which
is only dealing with
01:09:34.120 --> 01:09:39.320
typical things but is looking at
a length of 10 to the 20th,
01:09:39.320 --> 01:09:43.570
then over that length of 10 to
the 20th, your variable length
01:09:43.570 --> 01:09:47.020
code is going to have a bunch of
things which are about the
01:09:47.020 --> 01:09:48.630
length they ought to be.
01:09:48.630 --> 01:09:50.970
And a bunch of other
things which are
01:09:50.970 --> 01:09:53.090
extraordinarily long.
01:09:53.090 --> 01:09:56.360
The bunch of things which are
extraordinarily long are
01:09:56.360 --> 01:09:59.910
extraordinarily unpopular, but
there are an extraordinarily
01:09:59.910 --> 01:10:02.020
large number of them.
01:10:02.020 --> 01:10:05.760
Just like with a fixed-to-fixed
length code,
01:10:05.760 --> 01:10:07.700
you are going to fail.
01:10:07.700 --> 01:10:10.200
And you're going to fail on
an extraordinary number of
01:10:10.200 --> 01:10:12.500
different sequences.
01:10:12.500 --> 01:10:15.290
But, collectively, that set of
sequences don't have any
01:10:15.290 --> 01:10:17.850
probability.
01:10:17.850 --> 01:10:20.720
So the point that I'm trying to
get across is that, really,
01:10:20.720 --> 01:10:24.020
these two situations come
together when we look very
01:10:24.020 --> 01:10:25.630
long lengths.
01:10:25.630 --> 01:10:30.030
Namely, prefix-free codes are
just a way of generating codes
01:10:30.030 --> 01:10:33.260
that work for typical sequences
and over a very
01:10:33.260 --> 01:10:37.390
large, long period of time, will
generate about the right
01:10:37.390 --> 01:10:40.550
number of symbols.
01:10:40.550 --> 01:10:42.420
And that's what I'm trying
to get at here.
01:10:42.420 --> 01:10:45.980
Or what I'm trying to get
at in the next slide.
01:10:45.980 --> 01:10:50.650
So the fixed-to-fixed length
source code, I'm going to pick
01:10:50.650 --> 01:10:52.860
some epsilon and some delta.
01:10:52.860 --> 01:10:55.770
Namely, that epsilon and delta
which appeared in the law of
01:10:55.770 --> 01:10:58.280
large numbers.
01:10:58.280 --> 01:11:01.400
I'm going to make n as big as
I have to make it for that
01:11:01.400 --> 01:11:03.220
epsilon and that delta.
01:11:03.220 --> 01:11:07.120
And calculate how large it
has to be, but we won't.
01:11:07.120 --> 01:11:12.150
Then I'm going to assign fixed
length codewords to each
01:11:12.150 --> 01:11:15.390
sequence in the typical set.
01:11:15.390 --> 01:11:16.490
Now, am I going to really build
01:11:16.490 --> 01:11:18.410
something which does this?
01:11:18.410 --> 01:11:20.210
Of course not.
01:11:20.210 --> 01:11:23.140
I mean, I'm talking about
truly humongous lengths.
01:11:23.140 --> 01:11:25.620
So, this is really a conceptual
tool to understand
01:11:25.620 --> 01:11:27.070
what's going on.
01:11:27.070 --> 01:11:30.100
It's not something we're
going to implement.
01:11:30.100 --> 01:11:32.490
So I'm going to assign
codewords to all
01:11:32.490 --> 01:11:34.910
these typical elements.
01:11:34.910 --> 01:11:40.900
And then what I find is that
since the typical set, since
01:11:40.900 --> 01:11:44.730
the number of elements in it is
less than 2 to the n times
01:11:44.730 --> 01:11:51.200
H of x plus epsilon, if I choose
L bar, namely, the
01:11:51.200 --> 01:11:56.980
number of bits I'm going to use
for encoding these things,
01:11:56.980 --> 01:12:00.470
it's going to have to be H of
x plus epsilon in length.
01:12:00.470 --> 01:12:02.190
Because I need to provide
codewords for
01:12:02.190 --> 01:12:05.600
each of these things.
01:12:05.600 --> 01:12:08.930
And it needs to be an extra 1
over n because of this integer
01:12:08.930 --> 01:12:11.460
constraint that we've been
dealing with all along, which
01:12:11.460 --> 01:12:14.120
doesn't make any difference.
01:12:14.120 --> 01:12:17.830
So if I choose L bar, that big,
in other words, if I make
01:12:17.830 --> 01:12:21.670
it just a little bit bigger
than the entropy, the
01:12:21.670 --> 01:12:23.790
probability of failure
is going to be less
01:12:23.790 --> 01:12:25.640
than or equal to delta.
01:12:25.640 --> 01:12:27.910
And I can make delta -- and I
can make the probability of
01:12:27.910 --> 01:12:30.110
failure as small as I want.
01:12:30.110 --> 01:12:32.960
So I can make this epsilon here
which is the extra bits
01:12:32.960 --> 01:12:36.710
per source symbol as
small as I want.
01:12:36.710 --> 01:12:39.790
So it says I can come as close
to the entropy bound in doing
01:12:39.790 --> 01:12:43.350
this, and come as close to
unique decodability as I want
01:12:43.350 --> 01:12:45.140
in doing this.
01:12:45.140 --> 01:12:48.720
And I have a fixed-to-fixed
length code, which after one
01:12:48.720 --> 01:12:50.880
year is going to stop.
01:12:50.880 --> 01:12:53.730
And I can turn my decoder off.
01:12:53.730 --> 01:12:55.950
I can turn my encoder off.
01:12:55.950 --> 01:12:59.160
I can go buy a new encoder
and a new decoder, which
01:12:59.160 --> 01:13:01.770
presumably works a little
bit better.
01:13:01.770 --> 01:13:04.150
And there isn't any problem
about when to turn it off.
01:13:04.150 --> 01:13:05.730
Because I know I can
turn it off.
01:13:05.730 --> 01:13:09.630
Because everything will
have come in by then.
01:13:09.630 --> 01:13:12.420
Here's a more interesting
story.
01:13:12.420 --> 01:13:18.250
Suppose I choose the number of
bits per source symbol that
01:13:18.250 --> 01:13:23.390
I'm going to use to be less than
or equal to the entropy
01:13:23.390 --> 01:13:24.420
minus 2 epsilon.
01:13:24.420 --> 01:13:25.670
Why 2 epsilon?
01:13:25.670 --> 01:13:29.110
Well, just wait a second.
01:13:29.110 --> 01:13:31.830
I mean, 2 epsilon is small
and epsilon is small.
01:13:31.830 --> 01:13:34.145
But I want to compare with this
other epsilon and my law
01:13:34.145 --> 01:13:35.590
of large numbers.
01:13:35.590 --> 01:13:39.430
And I'm going to pick
n large enough.
01:13:39.430 --> 01:13:43.480
The number of typical sequences,
we said before, was
01:13:43.480 --> 01:13:48.300
greater than 1 minus delta times
2 to the n times H of x
01:13:48.300 --> 01:13:48.950
minus epsilon.
01:13:48.950 --> 01:13:52.430
I'm going to make this epsilon
the same as that epsilon,
01:13:52.430 --> 01:13:54.170
which is why I wanted this
to be 2 epsilon.
01:13:56.700 --> 01:14:01.680
So my typical set is this big
when I choose n large enough.
01:14:01.680 --> 01:14:04.890
And this says that most
of the typical set
01:14:04.890 --> 01:14:07.440
can't be assigned codewords.
01:14:07.440 --> 01:14:15.510
In other words, this number
here is humongously larger
01:14:15.510 --> 01:14:35.870
than 2 to the L bar, which is on
the order of 2 to the n H of
01:14:35.870 --> 01:14:42.200
x minus 2 epsilon n.
01:14:42.200 --> 01:14:45.660
So the fraction of typical
elements that I can provide
01:14:45.660 --> 01:14:52.040
codewords for, between this and
this, I can only provide
01:14:52.040 --> 01:14:54.660
codewords for a fraction
2 to the minus
01:14:54.660 --> 01:14:58.670
epsilon n of the codewords.
01:14:58.670 --> 01:15:01.770
We have this big sea of
codewords, which are all
01:15:01.770 --> 01:15:04.200
essentially equally likely.
01:15:04.200 --> 01:15:07.230
And I can't provide codewords
for even a
01:15:07.230 --> 01:15:09.860
small fraction of them.
01:15:09.860 --> 01:15:13.130
So the probability of failure is
going to be 1 minus delta.
01:15:13.130 --> 01:15:15.460
The 1 minus delta's the
probability that I get
01:15:15.460 --> 01:15:17.950
something atypical.
01:15:17.950 --> 01:15:24.190
Plus, well, minus in this case,
2 to the minus epsilon
01:15:24.190 --> 01:15:28.280
n, which is the probability that
I can't encode a typical
01:15:28.280 --> 01:15:30.670
codeword that comes out.
01:15:30.670 --> 01:15:34.550
And this quantity goes to 1.
01:15:34.550 --> 01:15:37.995
So this says that if I'm willing
to use a number of
01:15:37.995 --> 01:15:42.690
bits bigger than the entropy, I
can succeed with probability
01:15:42.690 --> 01:15:45.010
very close to 1.
01:15:45.010 --> 01:15:48.150
And if I want to use a smaller
number of bits, I fail with
01:15:48.150 --> 01:15:49.400
probability 1.
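A small numerical sketch of this two-sided threshold (not from the lecture; a hypothetical Bernoulli source with entropy of about half a bit, where the fixed-length code simply enumerates the most likely strings):

```python
import math

def success_prob(n, p, rate):
    """Probability that a length-n Bernoulli(p) string is one of the
    2^(n*rate) most likely strings (fewest ones first, since p < 1/2)."""
    budget = 2 ** round(n * rate)      # number of fixed-length codewords
    prob = 0.0
    for k in range(n + 1):             # layer of sequences with k ones
        count = math.comb(n, k)
        pk = p**k * (1 - p)**(n - k)   # probability of each such sequence
        if count <= budget:
            prob += count * pk
            budget -= count
        else:                          # budget runs out inside this layer
            prob += budget * pk
            break
    return prob

p = 0.11                               # entropy H(X) is about 0.5 bits
for n in (50, 200, 800):
    print(n, round(success_prob(n, p, 0.6), 3),   # rate above entropy
             round(success_prob(n, p, 0.4), 3))   # rate below entropy
```

As n grows, the success probability at rate 0.6 (above the entropy) climbs toward 1, while at rate 0.4 (below the entropy) it collapses toward 0.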
01:15:52.810 --> 01:15:56.320
Which is the same as saying that
if I'm using a prefix-free
01:15:56.320 --> 01:16:01.950
code with expected length less
than the entropy, I'm going to
run out of buffer space
01:16:01.950 --> 01:16:05.730
eventually if I run long enough.
01:16:05.730 --> 01:16:11.650
If I have something that
I'm encoding --
01:16:11.650 --> 01:16:13.980
well, just erase that.
01:16:13.980 --> 01:16:15.570
I'll say it more carefully
later.
01:16:18.150 --> 01:16:22.210
I do want to talk a little bit
about this Kraft inequality
01:16:22.210 --> 01:16:23.610
for unique decodability.
01:16:23.610 --> 01:16:26.780
You remember we proved the
Kraft inequality for
01:16:26.780 --> 01:16:29.460
prefix-free codes.
01:16:29.460 --> 01:16:32.930
I now want to talk about the
Kraft inequality for uniquely
01:16:32.930 --> 01:16:36.060
decodable codes.
01:16:36.060 --> 01:16:39.330
And you might think that I've
done all of this development
01:16:39.330 --> 01:16:45.990
of the AEP, the asymptotic
equipartition property.
01:16:45.990 --> 01:16:49.560
Incidentally, you now know where
those words come from.
01:16:49.560 --> 01:16:53.500
It's asymptotic because this
result is valid asymptotically
01:16:53.500 --> 01:16:55.960
as n goes to infinity.
01:16:55.960 --> 01:17:01.260
It's equipartition because
everything is equally likely.
01:17:01.260 --> 01:17:03.480
And it's property, because
it's a property.
01:17:03.480 --> 01:17:08.490
So it's the asymptotic
equipartition property.
01:17:08.490 --> 01:17:12.260
And I didn't do it so I could
prove the Kraft inequality.
01:17:12.260 --> 01:17:14.850
It's just that that's an extra
bonus that we get.
01:17:14.850 --> 01:17:20.070
And understanding why the
Kraft inequality has to hold
01:17:20.070 --> 01:17:28.890
for uniquely decodable codes
is one application of the AEP
01:17:28.890 --> 01:17:32.470
which lets you see a little
bit about how to use it.
01:17:32.470 --> 01:17:36.520
OK, so the argument is an
argument by contradiction.
01:17:36.520 --> 01:17:43.010
Suppose you generate a set
of lengths for codewords.
01:17:43.010 --> 01:17:44.550
And you want this -- yeah?
01:17:55.250 --> 01:17:58.090
And the thing you would like to
do is to assign codewords
01:17:58.090 --> 01:18:01.220
of these lengths.
01:18:01.220 --> 01:18:04.860
And what we want to do is to
set this equal to some
01:18:04.860 --> 01:18:05.630
quantity b.
01:18:05.630 --> 01:18:09.020
In other words, suppose we beat
the Kraft inequality.
01:18:09.020 --> 01:18:12.130
Suppose we can make the lengths
even shorter than
01:18:12.130 --> 01:18:15.730
Kraft says we can make them.
01:18:15.730 --> 01:18:17.905
I mean, he was only a graduate
student, so we've got to be
01:18:17.905 --> 01:18:21.480
able to beat his inequality
somehow.
01:18:21.480 --> 01:18:24.460
So we're going to try to
make this equal to b.
01:18:24.460 --> 01:18:27.930
We're going to assume that
b is greater than 1.
01:18:27.930 --> 01:18:30.890
And then what we're going to
do is to show that we get a
01:18:30.890 --> 01:18:32.470
contradiction here.
01:18:32.470 --> 01:18:36.090
And this same argument can
work whether we have a
01:18:36.090 --> 01:18:39.600
discrete memoryless source or
a source with memory, or
01:18:39.600 --> 01:18:40.420
anything else.
01:18:40.420 --> 01:18:42.830
It can work with blocks, it can
work with variable length
01:18:42.830 --> 01:18:46.000
to variable length codes.
01:18:46.000 --> 01:18:49.560
It's all essentially
the same argument.
01:18:49.560 --> 01:18:52.390
So what I want to do is to
get a contradiction.
01:18:52.390 --> 01:18:56.230
I'm going to choose a discrete
memoryless source.
01:18:56.230 --> 01:18:58.900
And I'm going to make the
probabilities equal to 1 over
01:18:58.900 --> 01:19:02.300
b times 2 to the minus l sub i.
01:19:02.300 --> 01:19:04.800
In other words, I can generate
a discrete memoryless source
01:19:04.800 --> 01:19:07.270
for talking about it with
any probabilities I
01:19:07.270 --> 01:19:08.800
want to give it.
01:19:08.800 --> 01:19:12.650
So I'm going to generate one
with these probabilities.
01:19:12.650 --> 01:19:16.530
So the lengths are going to
be equal to minus log of
01:19:16.530 --> 01:19:19.220
b times p sub i.
01:19:19.220 --> 01:19:22.920
Which says that the expected
length of the codewords is
01:19:22.920 --> 01:19:27.820
equal to the sum of p sub i l
sub i, which is equal to the
01:19:27.820 --> 01:19:31.780
entropy minus the
logarithm of b.
01:19:31.780 --> 01:19:34.450
Which means I can get an
expected length which is a
01:19:34.450 --> 01:19:37.440
little bit less than
the entropy.
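Writing out the bookkeeping just described (same quantities as in the lecture; b is the assumed Kraft sum greater than 1):

```latex
\sum_i 2^{-l_i} = b > 1,
\qquad
p_i \,\triangleq\, \frac{2^{-l_i}}{b}
\;\Longrightarrow\;
l_i = -\log_2(b\,p_i),
```
```latex
\bar{L} \,=\, \sum_i p_i\,l_i
\,=\, -\sum_i p_i \log_2 p_i - \log_2 b
\,=\, H(X) - \log_2 b \,<\, H(X).
```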
01:19:37.440 --> 01:19:40.600
So now what I'm going to do is
to consider strings of n
01:19:40.600 --> 01:19:41.330
source letters.
01:19:41.330 --> 01:19:43.460
I'm going to make these strings
very, very long.
01:19:46.270 --> 01:19:50.430
When I concatenate all these
codewords, I'm going to wind
01:19:50.430 --> 01:19:54.290
up with a length that's less
than n times H of x
01:19:54.290 --> 01:19:59.400
minus log b over 2,
with high probability.
01:20:13.510 --> 01:20:18.940
And as a fixed-length code of
this length it's going to have
01:20:18.940 --> 01:20:21.810
a low failure probability.
01:20:21.810 --> 01:20:26.740
And, therefore, what this says
is that, using this
01:20:26.740 --> 01:20:32.670
remarkable code with unique
decodability, and generating
01:20:32.670 --> 01:20:37.500
very long strings from it, I
can generate a fixed-length
01:20:37.500 --> 01:20:41.550
code which has a low failure
probability.
01:20:41.550 --> 01:20:45.640
And I just showed you
in the last slide
01:20:45.640 --> 01:20:46.530
that I can't do that.
01:20:46.530 --> 01:20:49.830
The probability of failure with
such a code has to be
01:20:49.830 --> 01:20:51.540
essentially 1.
01:20:51.540 --> 01:20:54.870
So that's a contradiction, which
says you can't have these
01:20:54.870 --> 01:20:57.460
uniquely decodable codes that beat the Kraft inequality.
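One concrete way to check the conclusion (not part of the lecture; a brute-force experiment using the standard Sardinas-Patterson test for unique decodability, with a hypothetical set of lengths that violates the Kraft inequality):

```python
from itertools import combinations

def uniquely_decodable(code):
    """Sardinas-Patterson test: a code is uniquely decodable iff no
    'dangling suffix' generated from it is itself a codeword."""
    code = set(code)

    def dangle(x, y):
        # suffix left over when x is a proper prefix of y
        if x != y and y.startswith(x):
            return y[len(x):]
        return None

    # initial dangling suffixes, from pairs of distinct codewords
    pending = {d for a in code for b in code
               if (d := dangle(a, b)) is not None}
    seen = set()
    while pending - seen:
        s = (pending - seen).pop()
        seen.add(s)
        if s in code:
            return False              # some string has two parsings
        for c in code:
            for d in (dangle(c, s), dangle(s, c)):
                if d is not None:
                    pending.add(d)
    return True

# Lengths {1, 2, 2, 2} give Kraft sum 1/2 + 3/4 = 1.25 > 1, so no
# choice of distinct binary codewords with these lengths should be
# uniquely decodable.
for short in "01":
    for pair_words in combinations(["00", "01", "10", "11"], 3):
        assert not uniquely_decodable({short, *pair_words})

# A prefix-free code (Kraft sum exactly 1) passes.
assert uniquely_decodable({"0", "10", "110", "111"})
print("every Kraft-violating code failed the test")
```

Every assignment of distinct binary codewords with lengths {1, 2, 2, 2} fails the test, while a prefix-free code such as {0, 10, 110, 111} passes, in line with the theorem.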
01:20:57.460 --> 01:21:01.670
If you didn't get that in what
I said, don't be surprised.
01:21:01.670 --> 01:21:06.200
Because all I'm trying to do is
to steer you towards how to
01:21:06.200 --> 01:21:09.610
look at the section in the
notes that does that.
01:21:09.610 --> 01:21:12.430
It was a little too fast
and a little too late.
01:21:12.430 --> 01:21:15.570
But, anyway, that is the Kraft
inequality for unique
01:21:15.570 --> 01:21:16.650
decodability.
01:21:16.650 --> 01:21:18.170
OK, thanks.