1
00:00:00,000 --> 00:00:02,350
The following content is
provided under a Creative
2
00:00:02,350 --> 00:00:03,640
Commons license.
3
00:00:03,640 --> 00:00:06,590
Your support will help MIT
OpenCourseWare continue to
4
00:00:06,590 --> 00:00:09,970
offer high quality educational
resources for free.
5
00:00:09,970 --> 00:00:13,060
To make a donation or to view
additional materials from
6
00:00:13,060 --> 00:00:16,780
hundreds of MIT courses, visit
MIT OpenCourseWare at
7
00:00:16,780 --> 00:00:18,030
ocw.mit.edu.
8
00:00:21,570 --> 00:00:22,070
PROFESSOR: OK.
9
00:00:22,070 --> 00:00:26,430
Just to review where we are,
we've been talking about
10
00:00:26,430 --> 00:00:30,230
source coding as one of the
two major parts of digital
11
00:00:30,230 --> 00:00:31,000
communication.
12
00:00:31,000 --> 00:00:34,420
Remember, you take sources,
you turn them into bits.
13
00:00:34,420 --> 00:00:38,110
Then you take bits and you
transmit them over channels.
14
00:00:38,110 --> 00:00:40,470
And that sums up the
whole course.
15
00:00:40,470 --> 00:00:44,420
This is the part where you
transmit over channels.
16
00:00:44,420 --> 00:00:48,220
This is the part where you
process the sources.
17
00:00:48,220 --> 00:00:51,740
We're concentrating now on the
source side of things.
18
00:00:51,740 --> 00:00:54,760
Partly because by concentrating
on the source
19
00:00:54,760 --> 00:00:58,300
side of things, we will build
up the machinery we need to
20
00:00:58,300 --> 00:00:59,820
look at the channel
side of things.
21
00:00:59,820 --> 00:01:03,760
The channel side is really more
interesting, I think.
22
00:01:03,760 --> 00:01:07,090
Although there's been a great
deal of work on both of them.
23
00:01:07,090 --> 00:01:09,070
They're both important.
24
00:01:09,070 --> 00:01:12,950
And we said that we could
separate source coding into
25
00:01:12,950 --> 00:01:13,770
three pieces.
26
00:01:13,770 --> 00:01:17,030
If you start out with a waveform
source, the typical
27
00:01:17,030 --> 00:01:21,210
thing to do, and almost the only
thing to do, is to turn
28
00:01:21,210 --> 00:01:24,430
those waveforms into sequences
of numbers.
29
00:01:24,430 --> 00:01:27,130
Those sequences might
be samples.
30
00:01:27,130 --> 00:01:31,080
They might be numbers
in an expansion.
31
00:01:31,080 --> 00:01:32,450
They might be whatever.
32
00:01:32,450 --> 00:01:36,440
But the first thing you almost
always do is turn waveforms
33
00:01:36,440 --> 00:01:38,280
into a sequence of numbers.
34
00:01:38,280 --> 00:01:42,600
Because waveforms are just too
complicated to deal with.
35
00:01:42,600 --> 00:01:44,860
The next thing we do with
a sequence of numbers is
36
00:01:44,860 --> 00:01:46,240
quantize them.
37
00:01:46,240 --> 00:01:48,440
After we quantize them
we wind up with a
38
00:01:48,440 --> 00:01:50,170
finite set of symbols.
39
00:01:50,170 --> 00:01:51,900
And the next thing
we do is, we take
40
00:01:51,900 --> 00:01:54,270
that sequence of symbols.
41
00:01:54,270 --> 00:01:55,750
And --
42
00:01:58,510 --> 00:02:01,960
and what we do at that point
is to do data compression.
43
00:02:01,960 --> 00:02:05,940
So that we try to represent
those symbols with as few
44
00:02:05,940 --> 00:02:09,950
binary digits per source
symbol as possible.
45
00:02:09,950 --> 00:02:13,290
We want to do that in such
a way that we can receive
46
00:02:13,290 --> 00:02:15,760
it the other end.
47
00:02:15,760 --> 00:02:19,060
So let's review a little bit
about what we've done in the
48
00:02:19,060 --> 00:02:22,230
last couple of lectures.
49
00:02:22,230 --> 00:02:26,680
We talked about the
Kraft inequality.
50
00:02:26,680 --> 00:02:30,820
And the Kraft inequality, you
remember, says that the
51
00:02:30,820 --> 00:02:35,350
lengths of the codewords in any
prefix-free code have to
52
00:02:35,350 --> 00:02:38,120
satisfy this funny
inequality here.
53
00:02:38,120 --> 00:02:42,580
And this funny inequality, in
some sense, says if you want
54
00:02:42,580 --> 00:02:47,550
to make one codeword short, by
making that one codeword
55
00:02:47,550 --> 00:02:53,590
short, it eats up a large
part of this fraction.
56
00:02:53,590 --> 00:02:56,840
Since this sum has to be less
than or equal to 1.
57
00:02:56,840 --> 00:03:01,150
If, for example, you make l sub
1 equal to 1, then that
58
00:03:01,150 --> 00:03:03,870
uses up a half of this
sum right there.
59
00:03:03,870 --> 00:03:06,990
And all the other codewords
have to be much longer.
60
00:03:06,990 --> 00:03:09,890
So what this is saying,
essentially, and we proved it,
61
00:03:09,890 --> 00:03:13,760
and we did a bunch of things
with it, and in your homework you
62
00:03:13,760 --> 00:03:18,060
worked with it, we have
shown that that
63
00:03:18,060 --> 00:03:20,380
inequality has to hold.
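To make the Kraft sum concrete, here is a small Python sketch (an added illustration with made-up lengths, not from the lecture itself):

```python
# Evaluate the Kraft sum, sum over i of 2^(-l_i), for a set of
# codeword lengths. A prefix-free code with these lengths exists
# if and only if the sum is at most 1.
def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

# The code {0, 10, 110, 111} has lengths 1, 2, 3, 3: the sum is exactly 1.
print(kraft_sum([1, 2, 3, 3]))   # 1.0

# A length-1 codeword eats up half the budget, so these lengths
# overflow it: no prefix-free code exists for them.
print(kraft_sum([1, 1, 2]))      # 1.25 > 1
```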
64
00:03:20,380 --> 00:03:23,750
The next thing we did is, given
a set of probabilities
65
00:03:23,750 --> 00:03:29,120
on a source, for example, p1
up to p sub m, by this time
66
00:03:29,120 --> 00:03:32,170
you should feel very comfortable
in realizing that
67
00:03:32,170 --> 00:03:35,020
what you call these symbols
doesn't make any difference
68
00:03:35,020 --> 00:03:38,970
whatsoever as far
as any matter of
69
00:03:38,970 --> 00:03:41,180
encoding sources is concerned.
70
00:03:41,180 --> 00:03:44,520
The first thing you can do, if
you like, is take whatever
71
00:03:44,520 --> 00:03:47,940
name somebody has given to a set
of symbols, replace them
72
00:03:47,940 --> 00:03:52,410
with your own symbols, and the
easiest set of symbols to use
73
00:03:52,410 --> 00:03:55,020
is the integers 1 to m.
74
00:03:55,020 --> 00:03:58,110
And that's what we
will usually do.
75
00:03:58,110 --> 00:04:01,820
So, given this set of
probabilities, and they have
76
00:04:01,820 --> 00:04:06,430
to add up to 1, the Huffman
algorithm is this ingenious
77
00:04:06,430 --> 00:04:10,410
algorithm, very, very clever,
which constructs a prefix-free
78
00:04:10,410 --> 00:04:14,170
code of minimum expected
length.
79
00:04:14,170 --> 00:04:18,560
And that minimum expected length
is just defined as the
80
00:04:18,560 --> 00:04:22,080
sum over i, of p sub
i times l sub i.
81
00:04:22,080 --> 00:04:25,340
And the trick in the algorithm
is to find that set of l sub
82
00:04:25,340 --> 00:04:30,570
i's that satisfy this inequality
but minimize this
83
00:04:30,570 --> 00:04:32,810
expected value.
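Here is a minimal sketch of that construction in Python (my own illustration; the symbols are indexed 0 to M-1 and the pmf is invented):

```python
import heapq

def huffman_lengths(probs):
    """Optimal prefix-free codeword lengths: repeatedly merge the two
    least probable nodes; every symbol under a merge gains one bit."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, id(syms1), syms1 + syms2))
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_lengths(probs)
print(lengths)                                     # [1, 2, 3, 3]
print(sum(p * l for p, l in zip(probs, lengths)))  # L-bar = 1.9 bits
```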
84
00:04:32,810 --> 00:04:35,110
Next thing we started to talk
about was a discrete
85
00:04:35,110 --> 00:04:36,790
memoryless source.
86
00:04:36,790 --> 00:04:40,150
A discrete memoryless source
is really a toy source.
87
00:04:40,150 --> 00:04:43,770
It's a toy source where you
assume that each letter in the
88
00:04:43,770 --> 00:04:48,700
sequence is independent and
89
00:04:48,700 --> 00:04:49,820
identically distributed.
90
00:04:49,820 --> 00:04:51,880
In other words, every
letter is the same.
91
00:04:51,880 --> 00:04:55,000
Every letter is independent
of every other letter.
92
00:04:55,000 --> 00:04:58,590
That's more appropriate for a
gambling game than it is for
93
00:04:58,590 --> 00:04:59,940
real sources.
94
00:04:59,940 --> 00:05:02,830
But, on the other hand, by
understanding this problem,
95
00:05:02,830 --> 00:05:05,385
we're starting to see that we
understand the whole problem
96
00:05:05,385 --> 00:05:07,350
of source coding.
97
00:05:07,350 --> 00:05:10,050
So we'll get on with
that as we go.
98
00:05:10,050 --> 00:05:13,910
But, anyway, when we have a
discrete memoryless source,
99
00:05:13,910 --> 00:05:17,780
what we found -- first we
defined the entropy of such a
100
00:05:17,780 --> 00:05:24,460
source as H of x, which is the
sum over i of minus p sub i
101
00:05:24,460 --> 00:05:27,580
times the logarithm of p sub i.
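Written as a small Python helper (my phrasing; logs are base 2, as in the lecture):

```python
from math import log2

def entropy(pmf):
    """H(X) = sum over i of -p_i * log2(p_i), in bits."""
    return sum(-p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit
print(entropy([0.25] * 4))     # 2.0 bits -- equiprobable gives log2(M)
print(entropy([0.99, 0.01]))   # about 0.081 bits -- nearly deterministic
```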
102
00:05:27,580 --> 00:05:30,570
And that was just something that
came out of trying to do
103
00:05:30,570 --> 00:05:34,170
this optimization not the way
that Huffman did it but the
104
00:05:34,170 --> 00:05:35,940
way that Shannon did it.
105
00:05:35,940 --> 00:05:39,500
Namely, Shannon looked at this
optimization in terms of
106
00:05:39,500 --> 00:05:42,600
dealing with entropy and
things like that.
107
00:05:42,600 --> 00:05:45,610
Huffman dealt with it in terms
of finding the optimal code.
108
00:05:45,610 --> 00:05:49,100
One of the very surprising
things is that the way Huffman
109
00:05:49,100 --> 00:05:52,450
looked at it, in terms of
entropy, is the way this
110
00:05:52,450 --> 00:05:55,120
really valuable, even though
it doesn't come up with an
111
00:05:55,120 --> 00:05:56,410
optimal code.
112
00:05:56,410 --> 00:06:00,420
I mean, here was poor Huffman,
who generated this beautiful
113
00:06:00,420 --> 00:06:03,610
algorithm, which is
extraordinarily simple, which
114
00:06:03,610 --> 00:06:06,350
solved what looked like
a hard problem.
115
00:06:06,350 --> 00:06:10,440
But yet, as far as information
theory is concerned, he used
116
00:06:10,440 --> 00:06:11,940
that algorithm, sure.
117
00:06:11,940 --> 00:06:14,900
But as far as all the
generalizations are concerned,
118
00:06:14,900 --> 00:06:17,960
it has almost nothing
to do with anything.
119
00:06:17,960 --> 00:06:21,790
But, anyway, when you look at
this entropy, what comes out
120
00:06:21,790 --> 00:06:25,710
of it is the fact that the
entropy of the source is less
121
00:06:25,710 --> 00:06:30,100
than or equal to the minimum
number of bits per source
122
00:06:30,100 --> 00:06:33,840
symbol that you can come up with
in a prefix-free code,
123
00:06:33,840 --> 00:06:36,930
which is less than
H of x plus 1.
124
00:06:36,930 --> 00:06:39,340
And the way we did that was just
to try to look at this
125
00:06:39,340 --> 00:06:40,860
minimization.
126
00:06:40,860 --> 00:06:42,720
And by looking at the
minimization, we actually
127
00:06:42,720 --> 00:06:45,900
showed it had to be greater
than or equal to H of x.
128
00:06:45,900 --> 00:06:48,810
And by looking at any code
which satisfied this
129
00:06:48,810 --> 00:06:51,050
inequality with any
set of lengths --
130
00:06:51,050 --> 00:06:55,630
well, after we looked at this,
this said that what we really
131
00:06:55,630 --> 00:06:59,680
wanted to do was to make the
length of each codeword minus
132
00:06:59,680 --> 00:07:01,700
log of p sub i.
133
00:07:01,700 --> 00:07:03,050
That's not an integer.
134
00:07:03,050 --> 00:07:06,140
So the thing we did to get this
inequality is said, OK,
135
00:07:06,140 --> 00:07:09,630
if it's not an integer, we'll
raise it up to the next value.
136
00:07:09,630 --> 00:07:10,880
Make it an integer.
137
00:07:10,880 --> 00:07:14,810
And as soon as we do that, the
Kraft inequality is satisfied.
138
00:07:14,810 --> 00:07:17,170
And you can generate a code
with that set of lengths.
139
00:07:17,170 --> 00:07:21,710
So that's where you get
these two bounds.
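A sketch of that rounding-up construction (often called a Shannon code; the name and the example pmf are my additions):

```python
from math import ceil, log2

pmf = [0.4, 0.3, 0.2, 0.1]

# Round each ideal length -log2(p_i) up to the next integer.
lengths = [ceil(-log2(p)) for p in pmf]

kraft = sum(2.0 ** -l for l in lengths)
lbar = sum(p * l for p, l in zip(pmf, lengths))
H = sum(-p * log2(p) for p in pmf)

print(lengths, kraft)   # [2, 2, 3, 4], Kraft sum 0.6875 <= 1
print(H, lbar, H + 1)   # H <= L-bar < H + 1: about 1.846 <= 2.4 < 2.846
```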
140
00:07:21,710 --> 00:07:25,840
This bound here is usually
very, very weak.
141
00:07:25,840 --> 00:07:31,030
Can anybody suggest the almost
unique example where this is
142
00:07:31,030 --> 00:07:33,890
almost tight?
143
00:07:33,890 --> 00:07:36,600
It's the simplest example
you can think of.
144
00:07:36,600 --> 00:07:38,960
It's a binary source.
145
00:07:38,960 --> 00:07:44,840
And what binary source has the
property that this is almost
146
00:07:44,840 --> 00:07:46,810
equal to this?
147
00:07:52,000 --> 00:07:53,150
Anybody out there?
148
00:07:53,150 --> 00:07:57,540
AUDIENCE: [UNINTELLIGIBLE]
149
00:07:57,540 --> 00:08:00,310
PROFESSOR: Make it almost
probability 0
150
00:08:00,310 --> 00:08:01,700
and probability 1.
151
00:08:01,700 --> 00:08:05,390
You can't quite do that because
as soon as you make
152
00:08:05,390 --> 00:08:09,280
the probability of the 0, 0,
then you don't have to
153
00:08:09,280 --> 00:08:10,360
represent it.
154
00:08:10,360 --> 00:08:13,260
And you just know that it's
a sequence of all 1's.
155
00:08:13,260 --> 00:08:14,140
So you're all done.
156
00:08:14,140 --> 00:08:16,980
And you don't need any
bits to represent it.
157
00:08:16,980 --> 00:08:21,740
But if there's just some very
tiny epsilon probability of a
158
00:08:21,740 --> 00:08:26,910
0 and a big probability of a 1,
then the entropy is almost
159
00:08:26,910 --> 00:08:28,570
equal to 0.
160
00:08:28,570 --> 00:08:34,040
And this 1 here is needed
because l bar min is 1.
161
00:08:34,040 --> 00:08:37,250
You can't make it any
smaller than that.
162
00:08:37,250 --> 00:08:42,590
So, that's where that
comes from.
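In numbers (a quick check, with epsilon = 0.001 as an assumed value):

```python
from math import log2

eps = 0.001   # probability of a 0; probability of a 1 is 1 - eps
H = -eps * log2(eps) - (1 - eps) * log2(1 - eps)
print(H)      # about 0.0114 bits: the entropy is nearly 0

# Yet any single-letter prefix-free binary code still needs one bit
# per symbol, so L-bar min = 1, which is nearly H + 1.
```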
163
00:08:42,590 --> 00:08:47,390
Let's talk about entropy
just a little bit.
164
00:08:47,390 --> 00:08:53,450
If we have an alphabet which
has size capital M -- x is the
165
00:08:53,450 --> 00:08:58,370
symbol we'll usually use
for the alphabet.
166
00:08:58,370 --> 00:09:01,670
And the probability that
x equals i, is p sub i.
167
00:09:01,670 --> 00:09:04,520
In other words, again, we're
using this convention so we're
168
00:09:04,520 --> 00:09:08,850
going to call the symbols the
integers 1 to capital M. Then
169
00:09:08,850 --> 00:09:11,340
the entropy is equal to this.
170
00:09:11,340 --> 00:09:15,050
And a nice way of representing
this is that the entropy is
171
00:09:15,050 --> 00:09:18,930
equal to the expected value
of minus the logarithm.
172
00:09:18,930 --> 00:09:22,090
Logarithms are always to the
base 2, in this course.
173
00:09:22,090 --> 00:09:24,950
When we want natural
logarithms we'll write ln;
174
00:09:24,950 --> 00:09:26,770
otherwise log means log
to the base 2.
175
00:09:26,770 --> 00:09:30,960
So it's log to the base
2 of the probability
176
00:09:30,960 --> 00:09:32,910
of the symbol x.
177
00:09:32,910 --> 00:09:37,630
We call this the log pmf
random variable.
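One way to see this random variable at work is to sample it (a simulation sketch; the pmf is invented for illustration):

```python
import random
from math import log2

pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}

# Draw source symbols, then evaluate W = -log2 p(x) on each outcome.
symbols = random.choices(list(pmf), weights=list(pmf.values()), k=100_000)
w = [-log2(pmf[x]) for x in symbols]

H = sum(-p * log2(p) for p in pmf.values())
print(H)                 # 1.75 bits
print(sum(w) / len(w))   # close to 1.75: the mean of W is the entropy
```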
178
00:09:37,630 --> 00:09:42,580
We started out with x being
a chance variable.
179
00:09:42,580 --> 00:09:44,800
I mean, we happen to have
turned it into a random
180
00:09:44,800 --> 00:09:46,840
variable because we've
given it numbers.
181
00:09:46,840 --> 00:09:48,360
But that's irrelevant.
182
00:09:48,360 --> 00:09:52,050
We really want to think of
it as a chance variable.
183
00:09:52,050 --> 00:09:56,720
But now this quantity is a
numerical function of the
184
00:09:56,720 --> 00:09:59,300
symbol which comes out
of the source.
185
00:09:59,300 --> 00:10:02,140
And, therefore, this
quantity is a
186
00:10:02,140 --> 00:10:03,860
well-defined random variable.
187
00:10:03,860 --> 00:10:06,690
And we call it the log
pmf random variable.
188
00:10:06,690 --> 00:10:08,980
Some people call it
self-information.
189
00:10:08,980 --> 00:10:10,850
We'll find out why later.
190
00:10:10,850 --> 00:10:13,180
I don't particularly
like that word.
191
00:10:13,180 --> 00:10:16,030
One, because what we're dealing
with here has nothing
192
00:10:16,030 --> 00:10:18,300
to do with information.
193
00:10:18,300 --> 00:10:20,780
Probably the thought that this
all has something to do with
194
00:10:20,780 --> 00:10:23,350
information, namely, that
information theory has
195
00:10:23,350 --> 00:10:26,700
something to do with
information, probably held up
196
00:10:26,700 --> 00:10:28,730
the field for about
five years.
197
00:10:28,730 --> 00:10:31,370
Because everyone tried to figure
out, what does this
198
00:10:31,370 --> 00:10:33,510
have to do with information.
199
00:10:33,510 --> 00:10:36,050
And, of course, it had nothing
to do with information.
200
00:10:36,050 --> 00:10:38,110
It really only had
to do with data.
201
00:10:38,110 --> 00:10:41,260
And with probabilities of
various things in the data.
202
00:10:41,260 --> 00:10:43,110
So, anyway.
203
00:10:43,110 --> 00:10:45,930
Some people call it
self-information and
204
00:10:45,930 --> 00:10:47,530
we'll see why later.
205
00:10:47,530 --> 00:10:50,130
But this is the quantity
we're interested in.
206
00:10:50,130 --> 00:10:52,350
And we call it log pmf,
we sort of forget
207
00:10:52,350 --> 00:10:54,750
about the minus sign.
208
00:10:54,750 --> 00:10:57,340
It's not good to forget
about the minus sign,
209
00:10:57,340 --> 00:10:58,240
but I always do it.
210
00:10:58,240 --> 00:11:02,660
So I sort of expect other
people to do it, too.
211
00:11:02,660 --> 00:11:05,060
One of the properties of entropy
is, it has to be
212
00:11:05,060 --> 00:11:06,920
greater than or equal to 0.
213
00:11:06,920 --> 00:11:09,050
Why is it greater than
or equal to 0?
214
00:11:09,050 --> 00:11:12,130
Well, because these
probabilities here have to be
215
00:11:12,130 --> 00:11:14,120
less than or equal to 1.
216
00:11:14,120 --> 00:11:16,070
And the logarithm of something
less than or
217
00:11:16,070 --> 00:11:18,000
equal to 1 is negative.
218
00:11:18,000 --> 00:11:21,480
And therefore minus the
logarithm has to be greater
219
00:11:21,480 --> 00:11:23,130
than or equal to 0.
220
00:11:23,130 --> 00:11:26,970
This entropy is also less than
or equal to log M, log capital
221
00:11:26,970 --> 00:11:28,950
M. I'm not going to
prove that here.
222
00:11:28,950 --> 00:11:32,370
It's proven in the notes, it's
a trivial thing to do.
223
00:11:32,370 --> 00:11:35,410
Or maybe it's proven in one
of the problems, I forget.
224
00:11:35,410 --> 00:11:39,860
But, anyway, you can do it using
the inequality log of x
225
00:11:39,860 --> 00:11:42,380
is less than or equal
to x minus 1.
226
00:11:42,380 --> 00:11:44,890
Just like all inequalities
can be proven with that
227
00:11:44,890 --> 00:11:47,300
inequality.
228
00:11:47,300 --> 00:11:52,240
So there's equality here
if x is equiprobable.
229
00:11:52,240 --> 00:11:56,420
Which is pretty clear, because
if all of these probabilities
230
00:11:56,420 --> 00:12:00,200
are equal to 1 over M, and you
take the expected value of
231
00:12:00,200 --> 00:12:04,140
logarithm of M, you get
logarithm of M. So there's
232
00:12:04,140 --> 00:12:07,030
nothing very surprising here.
233
00:12:07,030 --> 00:12:11,210
Now, the next thing -- and
here's where what we're going
234
00:12:11,210 --> 00:12:15,640
to do is, on one hand very
simple, and on the other hand
235
00:12:15,640 --> 00:12:17,400
very confusing.
236
00:12:17,400 --> 00:12:21,480
After you get the picture of
it, it becomes very simple.
237
00:12:21,480 --> 00:12:24,780
Before that, it all looks
rather strange.
238
00:12:24,780 --> 00:12:30,680
If you have two independent
chance variables, say x and y,
239
00:12:30,680 --> 00:12:37,030
then the choice of the
value of the chance
240
00:12:37,030 --> 00:12:41,660
variable x, and the choice of
the sample value of y, together
241
00:12:41,660 --> 00:12:44,900
that's a pair of sample values
which we can view as one
242
00:12:44,900 --> 00:12:46,730
sample value.
243
00:12:46,730 --> 00:12:51,540
In other words, we can view xy
as a chance variable all in
244
00:12:51,540 --> 00:12:54,280
its own right.
245
00:12:54,280 --> 00:12:57,540
This isn't the sequence x,
followed by y -- well, you can
246
00:12:57,540 --> 00:12:58,730
think of it that way.
247
00:12:58,730 --> 00:13:01,150
But we want to think
of this here as a
248
00:13:01,150 --> 00:13:02,730
chance variable itself.
249
00:13:02,730 --> 00:13:04,690
Which takes on different
values.
250
00:13:04,690 --> 00:13:11,400
And the values it takes on are
pairs of sample values, 1 from
251
00:13:11,400 --> 00:13:21,290
x, 1 from ensemble y, and xy
takes on the sample value of
252
00:13:21,290 --> 00:13:28,500
little xy with the probability
p of x times p of y.
253
00:13:28,500 --> 00:13:31,510
As we move on with this
we'll become much less
254
00:13:31,510 --> 00:13:34,520
careful about putting these
subscripts down, which talk
255
00:13:34,520 --> 00:13:36,540
about random variables.
256
00:13:36,540 --> 00:13:40,510
And the arguments which talk
about sample values of those
257
00:13:40,510 --> 00:13:41,830
random variables.
258
00:13:41,830 --> 00:13:46,500
I want to keep doing it for a
while because most courses in
259
00:13:46,500 --> 00:13:50,250
probability, even 6.041, which
is the first course in
260
00:13:50,250 --> 00:13:53,910
probability, almost deliberately
fudges the
261
00:13:53,910 --> 00:13:57,360
difference between sample values
and random variables.
262
00:13:57,360 --> 00:13:59,800
And most people who work
with probability do
263
00:13:59,800 --> 00:14:00,870
this all the time.
264
00:14:00,870 --> 00:14:03,520
And you never know when you're
talking about a random
265
00:14:03,520 --> 00:14:06,210
variable and when you're talking
about a sample value
266
00:14:06,210 --> 00:14:07,500
of a random variable.
267
00:14:07,500 --> 00:14:11,040
And that's convenient for
getting insight about things.
268
00:14:11,040 --> 00:14:13,770
And you do it for a while and
then pretty soon you wonder,
269
00:14:13,770 --> 00:14:15,410
what the heck is going on.
270
00:14:15,410 --> 00:14:19,055
Because you have no idea what's
a random variable any
271
00:14:19,055 --> 00:14:22,180
more, and what's a sample
value of it.
272
00:14:22,180 --> 00:14:28,190
So, this entropy, H, of the
chance variable, xy, is then
273
00:14:28,190 --> 00:14:32,580
expected value of minus the
logarithm of the probability
274
00:14:32,580 --> 00:14:37,480
of the chance variable, xy.
275
00:14:37,480 --> 00:14:39,890
Namely, when you take the
expected value, you're taking
276
00:14:39,890 --> 00:14:43,610
the expected value of
a random variable.
277
00:14:43,610 --> 00:14:49,750
And the random variable
here is the log pmf of the
278
00:14:49,750 --> 00:14:51,000
chance variable xy.
279
00:14:51,000 --> 00:14:55,180
This is the expected value of
minus the logarithm of p of x
280
00:14:55,180 --> 00:14:57,750
times the probability p of y.
281
00:14:57,750 --> 00:14:59,490
Which is the expected value of a sum of log pmfs.
282
00:14:59,490 --> 00:15:03,320
Since they're independent of
each other it's the sum.
283
00:15:03,320 --> 00:15:08,380
And that gives you H of x y is
equal to H of x plus H of y.
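A quick numerical check of that additivity (a sketch; the two pmfs are arbitrary):

```python
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

px = [0.5, 0.5]
py = [0.7, 0.2, 0.1]

# Joint pmf of independent x and y: probabilities multiply.
pxy = [a * b for a in px for b in py]

print(entropy(px) + entropy(py))   # H(X) + H(Y)
print(entropy(pxy))                # the same number: H(XY) = H(X) + H(Y)
```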
284
00:15:08,380 --> 00:15:11,350
This is really the reason why
you're interested in these
285
00:15:11,350 --> 00:15:14,870
chance variables which are
logarithms of probabilities.
286
00:15:14,870 --> 00:15:18,820
Because when you have
independent chance variables
287
00:15:18,820 --> 00:15:22,780
then you have the situation that
probabilities multiply
288
00:15:22,780 --> 00:15:26,810
and therefore log probabilities
add.
289
00:15:26,810 --> 00:15:29,780
All of the major theorems in
probability theory, in
290
00:15:29,780 --> 00:15:32,360
particular the law of large
numbers, which is the most
291
00:15:32,360 --> 00:15:35,700
important result in probability,
simple though it
292
00:15:35,700 --> 00:15:39,700
might be, that's the key to
everything in probability.
293
00:15:39,700 --> 00:15:44,700
That particular result talks
about sums of random variables
294
00:15:44,700 --> 00:15:47,090
and not about products
of random variables.
295
00:15:47,090 --> 00:15:50,400
So that's why Shannon did
everything in terms
296
00:15:50,400 --> 00:15:53,550
of these log pmfs.
297
00:15:53,550 --> 00:15:55,720
And we will soon be doing
everything in
298
00:15:55,720 --> 00:15:57,660
terms of log pmfs also.
299
00:16:01,120 --> 00:16:06,710
So now let's get back to
discrete memoryless sources.
300
00:16:06,710 --> 00:16:12,710
If you now have a block of n
chance variables, x1 to xn,
301
00:16:12,710 --> 00:16:18,500
and they're all IID, again we
can treat this whole block as one
302
00:16:18,500 --> 00:16:21,640
big monster chance variable.
303
00:16:21,640 --> 00:16:27,100
If each one of these takes on
m possible values, then this
304
00:16:27,100 --> 00:16:29,630
monster chance variable
can take on m to
305
00:16:29,630 --> 00:16:32,400
the n possible values.
306
00:16:32,400 --> 00:16:37,190
Namely, each possible string
of symbols, each possible
307
00:16:37,190 --> 00:16:41,060
string of n symbols where each
one is a choice from the
308
00:16:41,060 --> 00:16:44,640
integers 1 to capital M. So
we're talking about tuples of
309
00:16:44,640 --> 00:16:46,420
numbers now.
310
00:16:46,420 --> 00:16:48,860
And those are the values that
this particular chance
311
00:16:48,860 --> 00:16:52,420
variable, x super n, takes on.
312
00:16:52,420 --> 00:16:55,820
So it takes on these
probabilities, takes on these
313
00:16:55,820 --> 00:17:02,145
values with the probability p
of x n is the product from i
314
00:17:02,145 --> 00:17:05,220
equals 1 to n, of the individual
probabilities.
315
00:17:05,220 --> 00:17:08,570
In other words, again, when you
have independent chance
316
00:17:08,570 --> 00:17:11,570
variables, the probabilities
multiply.
317
00:17:11,570 --> 00:17:13,160
Which is all I'm saying here.
318
00:17:13,160 --> 00:17:17,730
So the chance variable x super
n has the entropy H of x n,
319
00:17:17,730 --> 00:17:20,930
which is the expected value of
minus the logarithm of that
320
00:17:20,930 --> 00:17:22,010
probability.
321
00:17:22,010 --> 00:17:25,260
Which is the expected value of
minus the logarithm of the
322
00:17:25,260 --> 00:17:28,570
product of probabilities, which
is the expected value of
323
00:17:28,570 --> 00:17:32,590
the sum of minus the log of the
probabilities, which is n
324
00:17:32,590 --> 00:17:34,590
times H of x.
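The same kind of check for an n-block (a sketch with n = 3 and an invented per-letter pmf):

```python
from itertools import product
from math import log2

def entropy(pmf):
    return sum(-p * log2(p) for p in pmf if p > 0)

px = [0.6, 0.3, 0.1]
n = 3

# One probability per n-tuple: the product of per-letter probabilities.
pxn = [p1 * p2 * p3 for p1, p2, p3 in product(px, repeat=n)]

print(n * entropy(px))   # n times H(X)
print(entropy(pxn))      # equal: H(X^n) = n H(X)
```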
325
00:17:34,590 --> 00:17:37,650
If you compare this with the
previous slide, you'll see I
326
00:17:37,650 --> 00:17:41,410
haven't said anything new.
327
00:17:41,410 --> 00:17:44,420
This argument and this
argument are
328
00:17:44,420 --> 00:17:46,120
really exactly the same.
329
00:17:46,120 --> 00:17:49,910
All I did was do it for two
chance variables first.
330
00:17:49,910 --> 00:17:51,390
And then observe that
331
00:17:51,390 --> 00:17:54,000
it generalizes to an
arbitrary
332
00:17:54,000 --> 00:17:56,510
number of chance variables.
333
00:17:56,510 --> 00:17:58,850
You can say that it generalizes
to an infinite
334
00:17:58,850 --> 00:18:00,550
number of chance
variables also.
335
00:18:00,550 --> 00:18:02,170
And in some sense it does.
336
00:18:02,170 --> 00:18:04,870
And I would advise you
not to go there.
337
00:18:04,870 --> 00:18:08,510
Because you just get tangled up
with a lot of mathematics
338
00:18:08,510 --> 00:18:09,760
that you don't need.
339
00:18:12,470 --> 00:18:16,410
So the next thing is, how
do we fix the variable
340
00:18:16,410 --> 00:18:20,400
prefix-free codes and what
do we gain by it?
341
00:18:20,400 --> 00:18:24,170
So the thing we're going to do
now, instead of trying to
342
00:18:24,170 --> 00:18:27,680
compress one symbol at a time
from the source, we're going
343
00:18:27,680 --> 00:18:32,270
to segment the source into
blocks of n symbols each.
344
00:18:32,270 --> 00:18:35,650
And after we segment it into
blocks of n symbols each,
345
00:18:35,650 --> 00:18:39,620
we're going to encode the
block of n symbols.
346
00:18:39,620 --> 00:18:41,750
Now, what's new there?
347
00:18:41,750 --> 00:18:44,280
Nothing whatsoever is new.
348
00:18:44,280 --> 00:18:48,230
A block of n symbols is just
a chance variable.
349
00:18:48,230 --> 00:18:51,290
We know how to do optimal
encoding of chance variables.
350
00:18:51,290 --> 00:18:53,430
Namely, we use the Huffman
algorithm.
351
00:18:53,430 --> 00:18:56,980
You can do that here
on these n blocks.
352
00:18:56,980 --> 00:19:00,520
We also have this nice theorem,
which says that the
353
00:19:00,520 --> 00:19:06,020
entropy -- well, first the
entropy of x n as n times the
354
00:19:06,020 --> 00:19:07,650
entropy of x.
355
00:19:07,650 --> 00:19:10,610
So, in other words, when you
have independent identically
356
00:19:10,610 --> 00:19:15,560
distributed chance variables,
this entropy is just n times
357
00:19:15,560 --> 00:19:16,440
this entropy.
358
00:19:16,440 --> 00:19:19,500
But the important thing
is this result
359
00:19:19,500 --> 00:19:21,180
of doing the encoding.
360
00:19:21,180 --> 00:19:23,930
Which is the same result
we had before.
361
00:19:23,930 --> 00:19:27,800
Namely, this is the result of
what happens when you take a
362
00:19:27,800 --> 00:19:31,530
set -- when you take an
alphabet, and the alphabet can
363
00:19:31,530 --> 00:19:33,780
be anything whatsoever.
364
00:19:33,780 --> 00:19:37,690
And you encode that alphabet in
an optimal way, according
365
00:19:37,690 --> 00:19:42,040
to the probabilities of each
symbol within the alphabet.
366
00:19:42,040 --> 00:19:47,030
And the result that you get is
the entropy of this big chance
367
00:19:47,030 --> 00:19:52,420
variable x sub n is less than
or equal to the minimum --
368
00:19:52,420 --> 00:19:57,730
well, it's less than or equal
to the expected value of the
369
00:19:57,730 --> 00:20:00,300
length of a code, whatever
code you have.
370
00:20:00,300 --> 00:20:05,090
But when you put the minimum on
here, this is less than the
371
00:20:05,090 --> 00:20:09,730
entropy of the chance variable
x super n plus 1.
372
00:20:09,730 --> 00:20:13,270
That's the same theorem
that we proved before.
373
00:20:13,270 --> 00:20:15,490
There's nothing new here.
374
00:20:15,490 --> 00:20:20,220
Now, if you divide this by n,
this by n, and this by n, you
375
00:20:20,220 --> 00:20:22,200
still have a valid inequality.
376
00:20:22,200 --> 00:20:25,710
When you divide this by
n, what do you get?
377
00:20:25,710 --> 00:20:27,620
You get H of x.
378
00:20:27,620 --> 00:20:34,110
When you divide this by n,
by definition L bar --
379
00:20:34,110 --> 00:20:38,210
what we mean is the number of
bits per source symbol.
380
00:20:38,210 --> 00:20:42,700
We have n source symbols here.
381
00:20:42,700 --> 00:20:46,990
So when we divide by n, we get
the number of bits per source
382
00:20:46,990 --> 00:20:51,300
symbol in this monster symbol.
383
00:20:51,300 --> 00:20:53,540
So l bar min is equal to this.
384
00:20:53,540 --> 00:20:56,260
When we divide this
by n, we get this.
385
00:20:56,260 --> 00:20:59,530
When we divide this by
n, we get H of x.
386
00:20:59,530 --> 00:21:03,110
And now the whole reason for
doing this is, this silly
387
00:21:03,110 --> 00:21:06,900
little 1 here, which we're
trying very hard to think of
388
00:21:06,900 --> 00:21:10,400
as being negligible or
unimportant, has suddenly
389
00:21:10,400 --> 00:21:12,090
become 1 over n.
390
00:21:12,090 --> 00:21:15,690
And by making n big enough,
this truly is unimportant.
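To watch the 1 over n effect numerically, here is a sketch that Huffman-encodes n-blocks of a biased binary source (it uses the standard fact that the expected Huffman length equals the sum of the merged-node probabilities):

```python
import heapq
from itertools import product
from math import log2

def huffman_lbar(pmf):
    """Expected codeword length: the sum of all merged-node probabilities."""
    heap = list(pmf)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

p = 0.9   # probability of a 1 for the biased binary source
H = -p * log2(p) - (1 - p) * log2(1 - p)
print("H =", H)   # about 0.469 bits

for n in (1, 2, 4, 8):
    block_pmf = []
    for block in product((p, 1 - p), repeat=n):
        q = 1.0
        for factor in block:
            q *= factor
        block_pmf.append(q)
    # Bits per source symbol: L-bar of the block code divided by n.
    print(n, huffman_lbar(block_pmf) / n)   # shrinks toward H as n grows
```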
391
00:21:18,610 --> 00:21:22,870
If you're thinking in terms of
encoding a binary source, this
392
00:21:22,870 --> 00:21:25,000
1 here is very important.
393
00:21:27,960 --> 00:21:30,860
In other words, when you're
trying to encode a binary
394
00:21:30,860 --> 00:21:34,230
source, if you're encoding one
letter at a time, there's
395
00:21:34,230 --> 00:21:36,110
nothing you can do.
396
00:21:36,110 --> 00:21:37,450
You're absolutely stuck.
397
00:21:37,450 --> 00:21:40,250
Because if both of those
letters have non-zero
398
00:21:40,250 --> 00:21:44,870
probabilities, and you want a
uniquely decodable code, and
399
00:21:44,870 --> 00:21:48,690
you want to find codewords for
each of those two symbols, the
400
00:21:48,690 --> 00:21:52,940
best you can do is to have
an expected length of 1.
401
00:21:52,940 --> 00:21:58,530
Namely, you need 1 to encode 1,
and you need 0 to encode 0.
402
00:21:58,530 --> 00:22:01,800
And there's nothing else,
there's no freedom at
403
00:22:01,800 --> 00:22:03,340
all that you have.
404
00:22:03,340 --> 00:22:06,930
So you say, OK, in that
situation, I really have to go
405
00:22:06,930 --> 00:22:08,300
to longer blocks.
406
00:22:08,300 --> 00:22:10,720
And when I go to longer
blocks, I
407
00:22:10,720 --> 00:22:12,330
can get this resolved.
408
00:22:12,330 --> 00:22:13,540
And I know how to do it.
409
00:22:13,540 --> 00:22:16,430
I use Huffman's algorithm
or whatever.
410
00:22:16,430 --> 00:22:21,110
So, suddenly, I can start to
get the expected number of
411
00:22:21,110 --> 00:22:23,850
bits per source symbol
412
00:22:23,850 --> 00:22:27,440
down as close to H of x
as I want to make it.
413
00:22:27,440 --> 00:22:30,040
And I can't make
it any smaller.
414
00:22:30,040 --> 00:22:35,080
Which says that H of x now has
a very clear interpretation,
415
00:22:35,080 --> 00:22:38,493
at least for prefix-free codes,
of being the number of
416
00:22:38,493 --> 00:22:42,430
bits you need for encoding
with prefix-free codes when you
417
00:22:42,430 --> 00:22:44,940
allow the possibility
of encoding them
418
00:22:44,940 --> 00:22:47,570
a block at a time.
419
00:22:47,570 --> 00:22:50,560
We're going to find later today
that the significance of
420
00:22:50,560 --> 00:22:53,640
this is far greater
than that, even.
421
00:22:53,640 --> 00:22:56,710
Because why use prefix-free
codes, we could use any old
422
00:22:56,710 --> 00:22:57,780
kind of code.
423
00:22:57,780 --> 00:23:00,840
When we study the Lempel-Ziv
codes tomorrow, we'll find out
424
00:23:00,840 --> 00:23:04,810
they aren't prefix-free
codes at all.
425
00:23:04,810 --> 00:23:07,450
They're really variable-length
to variable-length codes.
426
00:23:07,450 --> 00:23:10,170
So they aren't fixed
to variable length.
427
00:23:10,170 --> 00:23:13,050
And they do some pretty fancy
and tricky things.
428
00:23:13,050 --> 00:23:16,140
But they're still limited
to this same inequality.
429
00:23:16,140 --> 00:23:18,390
You can never beat the
entropy bound.
430
00:23:18,390 --> 00:23:21,170
If you want something to be
uniquely decodable, you're
431
00:23:21,170 --> 00:23:22,980
stuck with this bound.
432
00:23:22,980 --> 00:23:26,620
And we'll see why in a very
straightforward way, later.
433
00:23:26,620 --> 00:23:31,500
It's a very straightforward way
which I can guarantee all
434
00:23:31,500 --> 00:23:35,960
of you are going to look at it
and say, yes, that's obvious.
435
00:23:35,960 --> 00:23:38,540
And tomorrow you will look at it
and say, I don't understand
436
00:23:38,540 --> 00:23:39,730
that at all.
437
00:23:39,730 --> 00:23:41,500
And the next day you'll
look at it again and
438
00:23:41,500 --> 00:23:42,880
say, well, of course.
439
00:23:42,880 --> 00:23:45,740
And the day after that you'll
say, I don't understand that.
440
00:23:45,740 --> 00:23:48,690
And you'll go back and forth
like that for about two weeks.
441
00:23:48,690 --> 00:23:51,980
Don't be frustrated, because
it is simple.
442
00:23:51,980 --> 00:23:54,950
But at the same time it's
very sophisticated.
443
00:23:59,370 --> 00:24:05,170
So, let's now review the weak
law of large numbers.
444
00:24:05,170 --> 00:24:09,000
I usually just call it the
law of large numbers.
445
00:24:09,000 --> 00:24:12,670
I bridle a little bit when
people call it weak because,
446
00:24:12,670 --> 00:24:16,070
in fact it's the centerpiece
of probability theory.
447
00:24:16,070 --> 00:24:19,310
But there is this other thing
called the strong law of large
448
00:24:19,310 --> 00:24:24,070
numbers, which mathematicians
love because it lets them look
449
00:24:24,070 --> 00:24:27,630
at all kinds of mathematical
minutiae.
450
00:24:27,630 --> 00:24:29,830
It's also important,
I shouldn't
451
00:24:29,830 --> 00:24:30,950
play it down too much.
452
00:24:30,950 --> 00:24:33,530
And there are places where
you truly need it.
453
00:24:33,530 --> 00:24:36,540
For what we'll be doing, we
don't need it at all.
454
00:24:36,540 --> 00:24:39,470
And the weak law of large
numbers does in fact hold in
455
00:24:39,470 --> 00:24:41,930
many places where the strong
law doesn't hold.
456
00:24:41,930 --> 00:24:48,130
So if you know what the strong
law, is temporarily forget it
457
00:24:48,130 --> 00:24:50,320
for the for the rest
of the term.
458
00:24:50,320 --> 00:24:52,580
And just focus on
the weak law.
459
00:24:52,580 --> 00:24:56,130
And the weak law is not
terribly complicated.
460
00:24:56,130 --> 00:25:00,310
We have a sequence of
random variables.
461
00:25:00,310 --> 00:25:04,230
And each of them has
a mean y bar.
462
00:25:04,230 --> 00:25:08,360
And each of them has a variance
sigma sub y squared.
463
00:25:08,360 --> 00:25:10,950
And let's assume that they're
independent and identically
464
00:25:10,950 --> 00:25:12,600
distributed for the
time being.
465
00:25:12,600 --> 00:25:15,900
Just to avoid worrying
about anything.
466
00:25:15,900 --> 00:25:20,020
If we look at the sum of those
random variables, namely a is
467
00:25:20,020 --> 00:25:23,700
equal to the sum of y1 up to y sub n.
468
00:25:23,700 --> 00:25:27,570
Then the expected value of a is
the expected value of this
469
00:25:27,570 --> 00:25:30,860
plus the expected valuable
of y2, and so forth.
470
00:25:30,860 --> 00:25:35,270
So the expected value of
a is n times y bar.
471
00:25:35,270 --> 00:25:37,990
And I think in one of the
homework problems, you found
472
00:25:37,990 --> 00:25:39,640
the variance of a.
473
00:25:39,640 --> 00:25:46,090
And the variance of a, well, the
easiest thing to do is to
474
00:25:46,090 --> 00:25:49,280
reduce this to its
fluctuation.
475
00:25:49,280 --> 00:25:51,790
Reduce all of these to
their fluctuation.
476
00:25:51,790 --> 00:25:54,570
Then look at the variance of the
fluctuation, which is just
477
00:25:54,570 --> 00:25:56,960
the expected value
of this squared.
478
00:25:56,960 --> 00:25:59,760
Which is the expected value of
this squared plus the expected
479
00:25:59,760 --> 00:26:02,250
value of this squared,
and so forth.
480
00:26:02,250 --> 00:26:06,940
So the variance of a is n times
sigma sub y squared.
481
00:26:06,940 --> 00:26:08,530
I want all of you to know that.
482
00:26:08,530 --> 00:26:13,230
That's sort of day two of
a probability course.
483
00:26:13,230 --> 00:26:15,700
As soon as you start talking
about random variables, that's
484
00:26:15,700 --> 00:26:17,630
one of the key things
that you do.
485
00:26:17,630 --> 00:26:21,320
One of the most important
things you do.
486
00:26:21,320 --> 00:26:23,270
The thing that we're interested
in here is more the
487
00:26:23,270 --> 00:26:26,600
sample average of y1
up to y sub n.
488
00:26:26,600 --> 00:26:29,570
And the sample average,
by definition, is the
489
00:26:29,570 --> 00:26:32,050
sum divided by n.
490
00:26:32,050 --> 00:26:35,870
So in other words, the thing
that you're interested in here
491
00:26:35,870 --> 00:26:39,990
is to add all of these
random variables up.
492
00:26:39,990 --> 00:26:43,360
Take one over n times it.
493
00:26:43,360 --> 00:26:44,950
Which is a thing we
do all the time.
494
00:26:44,950 --> 00:26:50,270
I mean, we sum up a lot of
events, we divide by n, and we
495
00:26:50,270 --> 00:26:54,790
hope by doing that to get some
sort of typical value.
496
00:26:54,790 --> 00:26:58,210
And, usually there is some sort
of typical value that
497
00:26:58,210 --> 00:26:59,200
arises from that.
498
00:26:59,200 --> 00:27:02,810
What the law of large numbers
says is that there in fact is
499
00:27:02,810 --> 00:27:05,620
a typical value that arises.
500
00:27:05,620 --> 00:27:08,600
So this sample value is
a over n, which is the
501
00:27:08,600 --> 00:27:10,410
sum divided by n.
502
00:27:10,410 --> 00:27:12,630
And we call that the
sample average.
503
00:27:12,630 --> 00:27:18,340
The mean of the sample average
is just the mean of a, which
504
00:27:18,340 --> 00:27:23,020
is n times y bar,
divided by n.
505
00:27:23,020 --> 00:27:27,780
So the mean of the sample
average is y bar itself.
506
00:27:27,780 --> 00:27:37,190
The variance of the sample
variance, --
507
00:27:37,190 --> 00:27:42,290
the variance of the sample
average, OK, that's, --
508
00:27:45,630 --> 00:27:48,880
I'm talking too fast.
509
00:27:48,880 --> 00:27:55,600
The sample average here, you
would like to think of it as
510
00:27:55,600 --> 00:27:58,380
something which is known
and specific, like
511
00:27:58,380 --> 00:27:59,680
the expected value.
512
00:27:59,680 --> 00:28:02,150
It, in fact, is a
random variable.
513
00:28:02,150 --> 00:28:05,140
It changes with different
sample values.
514
00:28:05,140 --> 00:28:07,840
It can change from almost
nothing to very large
515
00:28:07,840 --> 00:28:08,820
quantities.
516
00:28:08,820 --> 00:28:11,250
And what we're interested in
saying is that most of the
517
00:28:11,250 --> 00:28:14,480
time, it's close to the
expected value.
518
00:28:14,480 --> 00:28:15,830
And that's what we're
aiming at here.
519
00:28:15,830 --> 00:28:19,020
And that's what the law
of large numbers says.
520
00:28:19,020 --> 00:28:22,970
The sample average here, the
variance of this, is now equal
521
00:28:22,970 --> 00:28:28,000
to the variance of a divided
by n squared.
522
00:28:28,000 --> 00:28:31,530
In other words, we're trying to
take the expected value of
523
00:28:31,530 --> 00:28:33,080
this quantity squared.
524
00:28:33,080 --> 00:28:36,770
So there's a 1 over n squared
that comes in here.
525
00:28:36,770 --> 00:28:40,800
When you take the 1 over n
squared here, this variance
526
00:28:40,800 --> 00:28:44,080
then becomes sigma --
527
00:28:46,610 --> 00:28:50,230
I don't know why I
have the n there.
528
00:28:50,230 --> 00:28:52,100
Just take that n out,
if you will.
529
00:28:52,100 --> 00:28:54,640
I don't have my red
pen with me.
530
00:28:57,290 --> 00:29:03,390
And so it's the variance
of the random variable
531
00:29:03,390 --> 00:29:06,590
y, divided by n.
532
00:29:06,590 --> 00:29:12,170
In other words, the limit as n
goes to infinity of the of the
533
00:29:12,170 --> 00:29:16,980
variance of the sum is
equal to infinity.
534
00:29:16,980 --> 00:29:21,630
And the variance of the sample
average as n goes to infinity
535
00:29:21,630 --> 00:29:23,790
is equal to 0.
536
00:29:23,790 --> 00:29:27,890
And that's because of this 1
over n squared effect here.
537
00:29:27,890 --> 00:29:32,400
When you add up a lot of
independent random variables,
538
00:29:32,400 --> 00:29:35,990
what you wind up with is the
sample average has a variance,
539
00:29:35,990 --> 00:29:38,440
which is going to 0.
540
00:29:38,440 --> 00:29:44,150
And the sum has a variance which
is going to infinity.
541
00:29:44,150 --> 00:29:46,560
That's important.
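A simulation sketch of those two scalings (uniform random variables assumed, purely for concreteness):

```python
import random

def empirical_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

trials = 20_000
for n in (1, 10, 100):
    sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]
    avgs = [s / n for s in sums]
    # A uniform on [0, 1) has variance 1/12: expect about n/12 for the
    # sum and 1/(12 n) for the sample average.
    print(n, empirical_variance(sums), empirical_variance(avgs))
```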
542
00:29:46,560 --> 00:29:49,820
Aside from all of the theorems
you've ever heard, this is
543
00:29:49,820 --> 00:29:54,520
sort of the gross, simple-minded
thing which you
544
00:29:54,520 --> 00:29:57,690
always ought to keep foremost
in your mind.
545
00:29:57,690 --> 00:30:00,290
This is what's happening
in probability theory.
546
00:30:00,290 --> 00:30:03,350
When you talk about sample
averages, this variance is
547
00:30:03,350 --> 00:30:06,590
getting small.
548
00:30:06,590 --> 00:30:09,710
Let's look at a picture
of this.
549
00:30:09,710 --> 00:30:12,420
Let's look at the distribution
function
550
00:30:12,420 --> 00:30:14,500
of this random variable.
551
00:30:14,500 --> 00:30:18,110
The sample average as
a random variable.
552
00:30:18,110 --> 00:30:22,070
And what we're finding here
is that this distribution
553
00:30:22,070 --> 00:30:27,510
function, if we look at it for
some modest value of n, we get
554
00:30:27,510 --> 00:30:31,250
something which looks like
this upper curve here.
555
00:30:31,250 --> 00:30:33,360
Which is then the lower
curve here.
556
00:30:33,360 --> 00:30:37,580
It's spread out more, so it
has a larger variance.
557
00:30:37,580 --> 00:30:40,180
Namely, the sample average
has a larger variance.
558
00:30:40,180 --> 00:30:45,360
When you make n bigger, what's
happening to the variance?
559
00:30:45,360 --> 00:30:46,900
The variance is getting
smaller.
560
00:30:46,900 --> 00:30:51,850
The variance is getting smaller
by a factor of 1/2.
561
00:30:51,850 --> 00:30:55,800
So this quantity is supposed
to have a variance which is
562
00:30:55,800 --> 00:30:59,200
equal to 1/2 of the
variance of this.
563
00:30:59,200 --> 00:31:01,270
How you find a variance
in a distribution
564
00:31:01,270 --> 00:31:04,010
function is your problem.
565
00:31:04,010 --> 00:31:08,310
But you know that if something
has a small variance, it's
566
00:31:08,310 --> 00:31:10,600
very closely tucked
in around this.
567
00:31:10,600 --> 00:31:14,150
In other words, as the variance
goes to 0, and the
568
00:31:14,150 --> 00:31:17,620
mean is y bar, you have a
distribution function which
569
00:31:17,620 --> 00:31:20,910
approaches a unit step.
570
00:31:20,910 --> 00:31:23,410
And all that just comes from
this very, very simple
571
00:31:23,410 --> 00:31:27,260
argument that says, when you
have a sum of IID random
572
00:31:27,260 --> 00:31:31,050
variables and you take the
sample average of it, namely,
573
00:31:31,050 --> 00:31:34,780
you divide by n, the
variance goes to 0.
574
00:31:34,780 --> 00:31:37,850
Which says, no matter how you
look at it, you wind up with
575
00:31:37,850 --> 00:31:40,770
something that looks
like a unit step.
576
00:31:40,770 --> 00:31:45,480
Now, the Chebyshev inequality,
which is one of the simpler
577
00:31:45,480 --> 00:31:49,500
inequalities in probability
theory, and I don't prove it
578
00:31:49,500 --> 00:31:52,040
because it's something
you've all seen.
579
00:31:52,040 --> 00:31:55,800
I don't know of any course in
probability which avoids the
580
00:31:55,800 --> 00:31:57,780
Chebyshev inequality.
581
00:31:57,780 --> 00:32:02,190
And what it says is, for any
epsilon greater than 0, the
582
00:32:02,190 --> 00:32:05,280
probability that the difference
between the sample
583
00:32:05,280 --> 00:32:09,350
average and the true mean, the
probability that that quantity
584
00:32:09,350 --> 00:32:13,380
in magnitude is greater than
or equal to epsilon, is less
585
00:32:13,380 --> 00:32:17,070
than or equal to sigma
sub y squared divided
586
00:32:17,070 --> 00:32:18,580
by n epsilon squared.
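Stated as a simulation sketch (uniform variables again assumed), comparing the empirical probability with the Chebyshev bound:

```python
import random

n, eps, trials = 100, 0.1, 20_000
ybar, var_y = 0.5, 1.0 / 12.0   # mean and variance of a uniform on [0, 1)

bad = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n)) / n
    if abs(s - ybar) >= eps:
        bad += 1

print(bad / trials)              # empirical P(|S - ybar| >= eps), tiny here
print(var_y / (n * eps ** 2))    # Chebyshev bound, about 0.083: loose but valid
```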
587
00:32:18,580 --> 00:32:22,340
Oh, incidentally that thing that
was called sigma sub n
588
00:32:22,340 --> 00:32:26,310
before was really
sigma squared.
589
00:32:26,310 --> 00:32:31,420
That's, namely, the
variance of y.
590
00:32:31,420 --> 00:32:33,120
I hope it's right
in the notes.
591
00:32:33,120 --> 00:32:34,970
Might not be.
592
00:32:34,970 --> 00:32:35,930
It is?
593
00:32:35,930 --> 00:32:37,180
Good.
594
00:32:39,940 --> 00:32:42,520
So, that's what this
inequality says.
595
00:32:42,520 --> 00:32:46,540
There's an easy way to derive
this on the fly.
596
00:32:46,540 --> 00:32:49,480
Namely, if you're wondering what
all these constants are
597
00:32:49,480 --> 00:32:53,890
here, here's a way to do it.
598
00:32:53,890 --> 00:32:58,980
What we're looking at, in this
curve here, is we're trying to
599
00:32:58,980 --> 00:33:04,250
say, how much probability is
there outside of these plus
600
00:33:04,250 --> 00:33:06,360
and minus epsilon limits.
601
00:33:06,360 --> 00:33:09,550
And the Chebyshev inequality
says there can't be too much
602
00:33:09,550 --> 00:33:11,250
probability out here.
603
00:33:11,250 --> 00:33:14,780
And there can't be too much
probability out here.
604
00:33:14,780 --> 00:33:19,620
So, one way to get at this is
to say, OK, suppose I have
605
00:33:19,620 --> 00:33:22,970
some given probability
out here.
606
00:33:22,970 --> 00:33:25,950
And some given probability
out here.
607
00:33:25,950 --> 00:33:29,380
And suppose I want to minimize
the variance of a random
608
00:33:29,380 --> 00:33:32,960
variable which has that much
probability out here and that
609
00:33:32,960 --> 00:33:35,050
much probability out here.
610
00:33:35,050 --> 00:33:36,630
How do I do it?
611
00:33:36,630 --> 00:33:40,700
Well, the variance deals with
the square of how far you were
612
00:33:40,700 --> 00:33:42,160
away from the mean.
613
00:33:42,160 --> 00:33:44,840
So if I want to have a certain
amount of probability out
614
00:33:44,840 --> 00:33:49,750
here, I minimize my variance by
making this come straight,
615
00:33:49,750 --> 00:33:54,370
come up here with a little step,
then go across here.
616
00:33:54,370 --> 00:33:56,050
Go up here.
617
00:33:56,050 --> 00:33:59,560
And then, oops.
618
00:33:59,560 --> 00:34:02,160
Go up here.
619
00:34:02,160 --> 00:34:03,710
I wish I had my red pencil.
620
00:34:03,710 --> 00:34:07,060
Does anybody have a red pen?
621
00:34:07,060 --> 00:34:08,480
That will write on this stuff?
622
00:34:13,870 --> 00:34:14,360
Yes?
623
00:34:14,360 --> 00:34:15,610
No?
624
00:34:21,140 --> 00:34:21,670
Oh, great.
625
00:34:21,670 --> 00:34:23,990
Thank you.
626
00:34:23,990 --> 00:34:25,240
I will return it.
627
00:34:27,220 --> 00:34:31,330
So what we want is something
which goes over here.
628
00:34:31,330 --> 00:34:33,170
Comes up here.
629
00:34:33,170 --> 00:34:35,220
Goes across here.
630
00:34:35,220 --> 00:34:36,470
Goes up here.
631
00:34:39,400 --> 00:34:42,900
Goes across here, and
goes up again.
632
00:34:42,900 --> 00:34:44,535
That's the smallest you
can make the variance.
633
00:34:44,535 --> 00:34:46,790
It's squeezing everything
in as far as it
634
00:34:46,790 --> 00:34:48,130
can be squeezed in.
635
00:34:48,130 --> 00:34:50,830
Namely, everything out
here gets squeezed
636
00:34:50,830 --> 00:34:53,270
in to y minus epsilon.
637
00:34:53,270 --> 00:34:55,910
Everything in here gets
squeezed into 0.
638
00:34:55,910 --> 00:34:59,500
And everything out here gets
squeezed into epsilon.
639
00:34:59,500 --> 00:35:01,570
OK, calculate the
variance there.
640
00:35:01,570 --> 00:35:03,650
And that satisfies
the Chebyshev
641
00:35:03,650 --> 00:35:06,130
inequality with equality.
642
00:35:06,130 --> 00:35:10,410
So that's all the Chebyshev
inequality is.
643
00:35:10,410 --> 00:35:13,210
And it's a loose inequality
usually, because usually these
644
00:35:13,210 --> 00:35:14,900
curves look very nice.
645
00:35:14,900 --> 00:35:17,410
Usually this looks like a
Gaussian distribution
646
00:35:17,410 --> 00:35:20,660
function, and the central limit
theorem says that we
647
00:35:20,660 --> 00:35:23,810
don't need the central limit
theorem here, and we don't
648
00:35:23,810 --> 00:35:26,980
want the central limit theorem
here, because this thing is an
649
00:35:26,980 --> 00:35:31,490
inequality that says, life can't
be any worse than this.
650
00:35:31,490 --> 00:35:33,890
And all the central limit
theorem is, is an
651
00:35:33,890 --> 00:35:35,570
approximation.
652
00:35:35,570 --> 00:35:37,550
And then we have to worry
about when it's a good
653
00:35:37,550 --> 00:35:41,510
approximation and when it's
not a good approximation.
654
00:35:41,510 --> 00:35:45,160
So this says, when we carry it
one piece further, it's for
655
00:35:45,160 --> 00:35:48,050
any epsilon and delta
greater than 0.
656
00:35:48,050 --> 00:35:51,500
If we make n large enough --
in other words, substitute
657
00:35:51,500 --> 00:35:53,330
delta for this.
658
00:35:53,330 --> 00:35:55,850
And then, when you make n large
enough, this quantity is
659
00:35:55,850 --> 00:35:57,600
smaller than delta.
660
00:35:57,600 --> 00:36:01,180
And that says that the
probability that s and y
661
00:36:01,180 --> 00:36:05,240
differ by more than epsilon is
less than or equal to delta
662
00:36:05,240 --> 00:36:08,180
when we make n big enough.
663
00:36:08,180 --> 00:36:10,960
So it says, you can make this
as small as you want.
664
00:36:10,960 --> 00:36:13,050
You can make this as
small as you want.
665
00:36:13,050 --> 00:36:15,530
And all you need to do
is make a sequence
666
00:36:15,530 --> 00:36:17,540
which is long enough.
667
00:36:17,540 --> 00:36:21,380
Now, the thing which is
mystifying about the law of
668
00:36:21,380 --> 00:36:24,720
large numbers is, you
need both the
669
00:36:24,720 --> 00:36:26,510
epsilon and the delta.
670
00:36:26,510 --> 00:36:29,470
You can't get rid of
either of them.
671
00:36:29,470 --> 00:36:33,260
In other words, you
can't say --
672
00:36:33,260 --> 00:36:36,500
you can't reduce this to 0.
673
00:36:36,500 --> 00:36:38,670
Because it won't
make any sense.
674
00:36:38,670 --> 00:36:41,550
In other words, this
curve here tends to
675
00:36:41,550 --> 00:36:44,520
spread out on its tails.
676
00:36:44,520 --> 00:36:49,070
And therefore, there's always a
probability of error there.
677
00:36:49,070 --> 00:36:54,160
You can't move epsilon into 0
because, for no finite end, do
678
00:36:54,160 --> 00:36:56,590
you really get a step
function here.
679
00:36:56,590 --> 00:37:00,580
So you need some wiggle
room on both ends.
680
00:37:00,580 --> 00:37:05,950
You need wiggle room here, and
you need wiggle room here.
681
00:37:05,950 --> 00:37:08,950
And once you recognize that you
need those two pieces of
682
00:37:08,950 --> 00:37:09,850
wiggle room.
683
00:37:09,850 --> 00:37:13,830
Namely, you cannot talk about
the probability that this is
684
00:37:13,830 --> 00:37:19,230
equal to y bar, because
that's usually 0.
685
00:37:19,230 --> 00:37:25,390
And you cannot talk about
reducing this to 0 either.
686
00:37:25,390 --> 00:37:27,750
So both of those are needed.
687
00:37:27,750 --> 00:37:29,590
Why did I go through
all of this?
688
00:37:29,590 --> 00:37:31,780
Well, partly because
it's important.
689
00:37:31,780 --> 00:37:36,140
But partly because I want to
talk about something which is
690
00:37:36,140 --> 00:37:39,890
called the asymptotic
equipartition property.
691
00:37:39,890 --> 00:37:43,520
And because of those long words
you believe this has to
692
00:37:43,520 --> 00:37:45,980
be very complicated.
693
00:37:45,980 --> 00:37:48,690
I hope to convince you that
what the asymptotic
694
00:37:48,690 --> 00:37:52,580
equipartition property is, is
simply the weak law of large
695
00:37:52,580 --> 00:37:57,570
numbers applied to
the log pmf.
696
00:37:57,570 --> 00:38:01,030
Because that, in fact,
is all it is.
697
00:38:01,030 --> 00:38:05,430
But it says some unusual
and fascinating things.
698
00:38:05,430 --> 00:38:11,020
So let's suppose that x1, x2,
and so forth is the output
699
00:38:11,020 --> 00:38:14,970
from a discrete memoryless
source.
700
00:38:14,970 --> 00:38:18,180
Look at the log pmf
of each of these.
701
00:38:18,180 --> 00:38:22,800
Namely, they each have the same
distribution function.
702
00:38:22,800 --> 00:38:26,090
So w of x is going to be equal
to minus the logarithm
703
00:38:26,090 --> 00:38:31,790
of p sub x of x, for each of
these chance variables.
704
00:38:31,790 --> 00:38:36,460
w of x maps source symbols
into real numbers.
705
00:38:36,460 --> 00:38:40,850
So there's a random variable,
capital W of x sub j, which is
706
00:38:40,850 --> 00:38:41,970
a random variable.
707
00:38:41,970 --> 00:38:46,140
We have a random variable for
each one of these symbols to
708
00:38:46,140 --> 00:38:47,660
come out of the source.
709
00:38:47,660 --> 00:38:51,050
So, for each one of these
symbols, there's this log pmf
710
00:38:51,050 --> 00:38:55,900
random variable, which takes
on different values.
711
00:38:55,900 --> 00:39:00,790
So the expected value of this
log pmf, for the j'th symbol
712
00:39:00,790 --> 00:39:05,820
out of the source is the sum
of p sub x of x, namely the
713
00:39:05,820 --> 00:39:09,560
probability that the source
takes on the value x, times
714
00:39:09,560 --> 00:39:12,290
minus the logarithm
of p sub x.
715
00:39:12,290 --> 00:39:14,530
And that's just the entropy.
716
00:39:14,530 --> 00:39:18,770
So, the strange feeling about
this log pmf random variable
717
00:39:18,770 --> 00:39:22,480
is its expected value
is entropy.
718
00:39:22,480 --> 00:39:27,440
And w of x1, w of x2, and so
forth, are a sequence of IID
719
00:39:27,440 --> 00:39:31,320
random variables, each one of
them which has a mean, which
720
00:39:31,320 --> 00:39:32,570
is the entropy.
721
00:39:35,320 --> 00:39:38,560
So it's just exactly the
situation we had before.
722
00:39:38,560 --> 00:39:42,460
Instead of y bar, we have
the entropy of x.
723
00:39:42,460 --> 00:39:48,840
And instead of the random
variable y sub j, we have this
724
00:39:48,840 --> 00:39:53,580
random variable w of x sub j,
which is defined by the symbol
725
00:39:53,580 --> 00:39:54,830
in an alphabet.
726
00:40:00,170 --> 00:40:04,240
And just to review this, but
it's what we said before, if
727
00:40:04,240 --> 00:40:09,900
capital X1 -- this little x1 --
namely, if little x1 is the
728
00:40:09,900 --> 00:40:15,650
sample value for the chance
variable x1 and if x2 is a
729
00:40:15,650 --> 00:40:19,720
sample value for the chance
variable X2, then the outcome
730
00:40:19,720 --> 00:40:25,660
for w of x1 plus w of x2 --
731
00:40:28,350 --> 00:40:31,200
very hard to keep all these
little letters and capital
732
00:40:31,200 --> 00:40:32,450
letters straight.
733
00:40:35,030 --> 00:40:39,850
w of x1 plus w of x2 is minus
the logarithm of the
734
00:40:39,850 --> 00:40:43,570
probability of x1 minus the
logarithm of the probability
735
00:40:43,570 --> 00:40:47,790
of x2, which is minus the
logarithm of the product,
736
00:40:47,790 --> 00:40:51,620
which is minus the logarithm of
the joint probability of x1
737
00:40:51,620 --> 00:40:57,610
and x2, which is the random
variable w of x1 x2.
738
00:40:57,610 --> 00:41:03,870
So the sum here is the random
variable corresponding to log
739
00:41:03,870 --> 00:41:10,650
pmf of the joint outputs
x1 and x2.
740
00:41:10,650 --> 00:41:18,110
So w of x1 x2 is the log pmf of
the event that this joint
741
00:41:18,110 --> 00:41:21,740
chance variable takes
on the value x1 x2.
742
00:41:21,740 --> 00:41:27,820
And the random variable w of x1 x2
is the sum of w of x1 and w of x2.
743
00:41:27,820 --> 00:41:29,690
So, what have I done here?
744
00:41:29,690 --> 00:41:32,540
I said this at the end of the
last slide, and you won't
745
00:41:32,540 --> 00:41:34,050
believe me.
746
00:41:34,050 --> 00:41:39,690
So, again this is one of these
things where tomorrow you
747
00:41:39,690 --> 00:41:40,610
won't believe me.
748
00:41:40,610 --> 00:41:42,580
And you'll have to go back
and look at that.
749
00:41:42,580 --> 00:41:45,630
But, anyway, x1 x2 is
a chance variable.
750
00:41:45,630 --> 00:41:50,090
And probabilities multiply and
log pmf's add, which is what
751
00:41:50,090 --> 00:41:52,020
we've been saying for a
couple of days now.
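Written out, the chain of equalities just described is

\[
w(x_1) + w(x_2) = -\log p_X(x_1) - \log p_X(x_2)
= -\log\bigl(p_X(x_1)\,p_X(x_2)\bigr)
= -\log p_{X_1 X_2}(x_1 x_2) = w(x_1 x_2),
\]

where the middle step uses the memorylessness of the source.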
752
00:41:55,460 --> 00:41:56,820
So.
753
00:41:56,820 --> 00:42:06,430
If I look at the sum of n of
these random variables, the
754
00:42:06,430 --> 00:42:11,000
sum of these log probabilities
is the sum of the log of
755
00:42:11,000 --> 00:42:16,370
pmf's, which is minus the
logarithm of the probability
756
00:42:16,370 --> 00:42:19,190
of the entire sequence.
757
00:42:19,190 --> 00:42:22,110
That's just saying the same
thing we said before, for two
758
00:42:22,110 --> 00:42:23,250
random variables.
759
00:42:23,250 --> 00:42:28,140
The sample average of the log
pmf's is the sum of the w's
760
00:42:28,140 --> 00:42:31,480
divided by n, which is minus
the logarithm of the
761
00:42:31,480 --> 00:42:33,830
probability divided by n.
762
00:42:33,830 --> 00:42:37,700
The weak law of large numbers
applies, and the probability
763
00:42:37,700 --> 00:42:42,840
that this sample average minus
the expected value of w of x
764
00:42:42,840 --> 00:42:46,170
is greater than or equal to
epsilon is less than or equal
765
00:42:46,170 --> 00:42:47,690
to this quantity here.
766
00:42:47,690 --> 00:42:51,610
This quantity is minus the
logarithm of the probability
767
00:42:51,610 --> 00:42:57,740
of x to the n, divided by n, minus
H of x, greater than or
768
00:42:57,740 --> 00:42:59,340
equal to epsilon.
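In symbols, as I read the slide, with sigma sub W squared denoting the variance of the log pmf random variable, the bound just stated is

\[
\Pr\!\left( \left| \frac{-\log p_{X^n}(X^n)}{n} - H(X) \right| \ge \epsilon \right)
\le \frac{\sigma_W^2}{n\epsilon^2}.
\]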
769
00:43:07,210 --> 00:43:09,450
So this is the thing that
we really want.
770
00:43:15,610 --> 00:43:17,470
I'm going to spend a few
slides trying to
771
00:43:17,470 --> 00:43:18,620
say what this means.
772
00:43:18,620 --> 00:43:22,170
But let's try to just look
at it now, and see
773
00:43:22,170 --> 00:43:24,190
what it must mean.
774
00:43:24,190 --> 00:43:29,350
It says that with high
probability, this quantity is
775
00:43:29,350 --> 00:43:32,900
almost the same as
this quantity.
776
00:43:32,900 --> 00:43:35,810
It says that with high
probability, the thing which
777
00:43:35,810 --> 00:43:42,630
comes out of the source is going
to have a probability, a
778
00:43:42,630 --> 00:43:47,930
log probability, divided by n,
which is close to the entropy.
779
00:43:47,930 --> 00:43:52,350
It says in some sense that with
high probability, the
780
00:43:52,350 --> 00:43:54,240
probability of what
comes out of the
781
00:43:54,240 --> 00:43:55,940
source is almost a constant.
782
00:43:59,020 --> 00:44:02,060
And that's amazing.
783
00:44:02,060 --> 00:44:04,200
That's what you'll wake up in
the morning and say, I don't
784
00:44:04,200 --> 00:44:05,450
believe that.
785
00:44:07,900 --> 00:44:10,430
But it's true.
786
00:44:10,430 --> 00:44:12,870
But you have to be careful
to interpret it right.
787
00:44:15,450 --> 00:44:18,710
So, we're going to define
the typical set.
788
00:44:18,710 --> 00:44:22,680
Namely, this is the typical set
of x's, which come out of
789
00:44:22,680 --> 00:44:23,630
the source.
790
00:44:23,630 --> 00:44:26,520
Namely, the typical
set of blocks of n
791
00:44:26,520 --> 00:44:29,490
symbols out of the source.
792
00:44:29,490 --> 00:44:32,510
And when you talk about a
typical set, you want
793
00:44:32,510 --> 00:44:36,180
something which includes most
of the probability.
794
00:44:36,180 --> 00:44:40,560
So what I'm going to include in
this typical set is all of
795
00:44:40,560 --> 00:44:43,160
these things that we talked
about before.
796
00:44:43,160 --> 00:44:46,520
Namely, we showed that the
probability that this quantity
797
00:44:46,520 --> 00:44:49,960
is greater than or equal to
epsilon is very small.
798
00:44:49,960 --> 00:44:53,980
So, with high probability what
comes out of the source
799
00:44:53,980 --> 00:44:57,030
satisfies this inequality
here.
800
00:44:57,030 --> 00:45:00,840
So I can write down the
distribution function of this
801
00:45:00,840 --> 00:45:02,480
random variable here.
802
00:45:02,480 --> 00:45:09,070
It's just this w --
803
00:45:12,840 --> 00:45:14,750
this is a random variable w.
804
00:45:14,750 --> 00:45:17,550
I'm looking at the distribution
of that random
805
00:45:17,550 --> 00:45:20,170
variable w.
806
00:45:20,170 --> 00:45:25,340
And this quantity in here is
the probability of this
807
00:45:25,340 --> 00:45:28,260
typical set.
808
00:45:28,260 --> 00:45:31,090
In other words, when I draw this
distribution function for
809
00:45:31,090 --> 00:45:34,820
this combined random variable,
I've defined this typical set
810
00:45:34,820 --> 00:45:40,020
to be all the sequences which
lie between this point and
811
00:45:40,020 --> 00:45:41,070
this point.
812
00:45:41,070 --> 00:45:43,690
Namely, this is H
minus epsilon.
813
00:45:43,690 --> 00:45:47,360
And this is H plus epsilon,
moving H out here.
814
00:45:47,360 --> 00:45:50,580
So these are all the sequences
which satisfy
815
00:45:50,580 --> 00:45:52,510
this inequality here.
816
00:45:52,510 --> 00:45:54,550
So that's what I mean
by the typical set.
817
00:45:54,550 --> 00:45:59,290
It's all things which are
clustered around H in this
818
00:45:59,290 --> 00:46:00,540
distribution function.
819
00:46:03,450 --> 00:46:07,320
And as n approaches infinity,
this typical set approaches
820
00:46:07,320 --> 00:46:09,170
probability 1.
821
00:46:09,170 --> 00:46:11,560
In the same way that the
law of large numbers
822
00:46:11,560 --> 00:46:12,420
behaves that way.
823
00:46:12,420 --> 00:46:18,090
The probability that x to the n
is in this typical set is
824
00:46:18,090 --> 00:46:23,180
greater than or equal to 1 minus
sigma squared divided by
825
00:46:23,180 --> 00:46:25,080
n epsilon squared.
826
00:46:30,800 --> 00:46:34,230
Let's try to express that in
a bunch of other ways.
827
00:46:40,400 --> 00:46:44,880
If you're getting lost,
please ask questions.
828
00:46:44,880 --> 00:46:49,800
But I hope to come back to this
in a little bit, after we
829
00:46:49,800 --> 00:46:52,850
finish a little more
of the story.
830
00:46:52,850 --> 00:47:03,060
So, another way of expressing
this typical set -- let me
831
00:47:03,060 --> 00:47:05,760
look at that as the
typical set.
832
00:47:05,760 --> 00:47:10,920
If I take this inequality here
and I rewrite this, namely,
833
00:47:10,920 --> 00:47:16,190
this is the set of x's for
which this is less than
834
00:47:16,190 --> 00:47:20,970
epsilon plus H of x
and greater than
835
00:47:20,970 --> 00:47:23,330
H of x minus epsilon.
836
00:47:23,330 --> 00:47:26,900
So that's what I've
expressed here.
837
00:47:26,900 --> 00:47:31,880
It's the set of x's for which
n times H of x minus epsilon
838
00:47:31,880 --> 00:47:36,260
is less than this log
probability, which is less than
839
00:47:36,260 --> 00:47:38,830
n times H of x plus epsilon.
840
00:47:38,830 --> 00:47:43,630
Namely, I'm looking at this
range of epsilon around H,
841
00:47:43,630 --> 00:47:46,980
which is this and this.
842
00:47:46,980 --> 00:47:50,840
If I write it again, if I
exponentiate all of this, it's
843
00:47:50,840 --> 00:47:55,810
the set of x's for which 2 to
the minus n, H of x, plus
844
00:47:55,810 --> 00:47:59,740
epsilon, that's this term
exponentiated, is less than
845
00:47:59,740 --> 00:48:03,170
the probability of the sequence, which is
less than this term exponentiated.
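Written out, the typical set just described is

\[
T_\epsilon^n = \left\{\, x^n : 2^{-n(H(X)+\epsilon)} < p_{X^n}(x^n) < 2^{-n(H(X)-\epsilon)} \,\right\}.
\]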
846
00:48:03,170 --> 00:48:05,610
And what's going on here
is, I've taken care of
847
00:48:05,610 --> 00:48:08,170
the minus sign also.
848
00:48:08,170 --> 00:48:10,400
And if you can follow that in
your head, you're a better
849
00:48:10,400 --> 00:48:12,400
person than I am.
850
00:48:12,400 --> 00:48:14,700
But, anyway, it works.
851
00:48:14,700 --> 00:48:16,790
And if you fiddle around with
that, you'll see that that's
852
00:48:16,790 --> 00:48:17,860
what it is.
853
00:48:17,860 --> 00:48:24,300
So this defines a bound
on the probabilities of all
854
00:48:24,300 --> 00:48:26,090
these typical sequences.
855
00:48:26,090 --> 00:48:31,980
The typical sequences all are
enclosed in this range of
856
00:48:31,980 --> 00:48:33,230
probabilities.
857
00:48:35,680 --> 00:48:39,440
So the typical elements are
approximately equiprobable, in
858
00:48:39,440 --> 00:48:42,130
this strange sense above.
859
00:48:42,130 --> 00:48:45,100
Why do I say this is
a strange sense?
860
00:48:45,100 --> 00:48:49,690
Well, as n gets large,
what happens here?
861
00:48:49,690 --> 00:48:52,810
This is 2 to the minus
n times H of x.
862
00:48:52,810 --> 00:48:55,360
Which is the important
part of this.
863
00:48:55,360 --> 00:48:59,820
This epsilon here is
multiplied by n.
864
00:48:59,820 --> 00:49:02,400
And we're trying to say, as n
gets very, very big, we can
865
00:49:02,400 --> 00:49:04,700
make epsilon very, very small.
866
00:49:04,700 --> 00:49:09,130
But we really can't make n times
epsilon very negligible.
867
00:49:09,130 --> 00:49:12,410
But the point is, the important
thing here is, 2 to
868
00:49:12,410 --> 00:49:15,680
the minus n times H of x.
869
00:49:15,680 --> 00:49:19,640
So, in some sense, this
is close to 2 to the
870
00:49:19,640 --> 00:49:21,750
minus n H of x.
871
00:49:21,750 --> 00:49:23,140
In what sense is it true?
872
00:49:23,140 --> 00:49:28,140
Well, it's true in that sense.
873
00:49:28,140 --> 00:49:31,980
Where that, in fact, is
a valid inequality.
874
00:49:31,980 --> 00:49:35,130
Namely in terms of
sample averages,
875
00:49:35,130 --> 00:49:37,160
these things are close.
876
00:49:37,160 --> 00:49:40,210
When I do the exponentiation and
get rid of the n and all
877
00:49:40,210 --> 00:49:43,820
that stuff, they aren't
very close.
878
00:49:43,820 --> 00:49:48,760
But saying this sort of thing is
sort of like saying that 10
879
00:49:48,760 --> 00:49:52,502
to the minus 23 is approximately
equal to 10 to
880
00:49:52,502 --> 00:49:54,950
the minus 25.
881
00:49:54,950 --> 00:49:57,060
And they're approximately equal
because they're both
882
00:49:57,060 --> 00:49:59,170
very, very small.
883
00:49:59,170 --> 00:50:02,510
And that's the kind of thing
that's going on here.
884
00:50:02,510 --> 00:50:05,330
And you're trying to distinguish
10 to the minus 23
885
00:50:05,330 --> 00:50:10,540
and 10 to the minus 25 from 10
to the minus 60th and from 10
886
00:50:10,540 --> 00:50:12,950
to the minus three.
887
00:50:12,950 --> 00:50:16,500
So that's the kind of
approximations we're using.
888
00:50:16,500 --> 00:50:20,040
Namely, we're using
approximations on a log scale,
889
00:50:20,040 --> 00:50:23,510
instead of approximations
of ordinary numbers.
890
00:50:23,510 --> 00:50:27,800
But, still it's convenient to
think of these typical x's,
891
00:50:27,800 --> 00:50:31,270
typical sequences, as being
sequences which are
892
00:50:31,270 --> 00:50:33,900
constrained in probability
in this way.
893
00:50:33,900 --> 00:50:37,290
And this is the thing which
is easy to work with.
894
00:50:37,290 --> 00:50:41,910
The atypical set of strings,
namely, the complement of this
895
00:50:41,910 --> 00:50:45,990
set -- what we know about
it is that the entire set doesn't
896
00:50:45,990 --> 00:50:48,030
have much probability.
897
00:50:48,030 --> 00:50:53,080
Namely, if you fix epsilon and
you let n get bigger and
898
00:50:53,080 --> 00:50:57,860
bigger, this atypical set
becomes totally negligible.
899
00:50:57,860 --> 00:50:59,110
And you can ignore it.
900
00:51:02,330 --> 00:51:06,940
So let's plow ahead.
901
00:51:06,940 --> 00:51:12,220
We'll stop for an example pretty
soon, but --
902
00:51:12,220 --> 00:51:15,830
If I have a sequence which is
in the typical set, we then
903
00:51:15,830 --> 00:51:20,400
know that its probability is
greater than 2 to the minus n
904
00:51:20,400 --> 00:51:23,520
times H of x plus epsilon.
905
00:51:23,520 --> 00:51:26,150
That's what we said before.
906
00:51:26,150 --> 00:51:29,330
And, therefore, when I use
this inequality, the
907
00:51:29,330 --> 00:51:34,900
probability of x to the n, for
something in the typical set,
908
00:51:34,900 --> 00:51:37,940
is greater than this
quantity here.
909
00:51:37,940 --> 00:51:47,950
In other words, this is
greater than that.
910
00:51:47,950 --> 00:51:50,420
For everything in
a typical set.
911
00:51:50,420 --> 00:51:53,640
So now I'm adding over things
in the typical set.
912
00:51:53,640 --> 00:51:56,170
So I need to include the
number of things
913
00:51:56,170 --> 00:51:57,590
in a typical set.
914
00:51:57,590 --> 00:52:01,190
So what I have is this sum.
915
00:52:01,190 --> 00:52:02,470
And what is this sum?
916
00:52:02,470 --> 00:52:06,000
This is the probability
of the typical set.
917
00:52:06,000 --> 00:52:08,960
Because I'm adding over all
elements in the typical set.
918
00:52:08,960 --> 00:52:11,880
And it's greater than or equal
to the number of elements in a
919
00:52:11,880 --> 00:52:15,660
typical set times these
small probabilities.
920
00:52:15,660 --> 00:52:19,230
If I turn this around, it says
that the number of elements in
921
00:52:19,230 --> 00:52:22,460
a typical set is less
than 2 to the n
922
00:52:22,460 --> 00:52:25,820
times H of x plus epsilon.
923
00:52:25,820 --> 00:52:30,000
For any epsilon, no matter how
small I want to make it.
924
00:52:30,000 --> 00:52:33,710
Which says that the elements
in a typical set have
925
00:52:33,710 --> 00:52:38,200
probabilities which are about 2
to the minus n times H of x.
926
00:52:38,200 --> 00:52:41,480
And the number of them is
approximately 2 to the
927
00:52:41,480 --> 00:52:44,110
n times H of x.
928
00:52:44,110 --> 00:52:47,910
In other words, what it says is
that this typical set is a
929
00:52:47,910 --> 00:52:53,900
bunch of essentially uniform
probabilities.
930
00:52:53,900 --> 00:52:58,550
So what I've done is to take
this very complicated source.
931
00:52:58,550 --> 00:53:05,360
And when I look at these very
humongous chance variables,
932
00:53:05,360 --> 00:53:10,670
which are very large sequences
out of the source, what I find
933
00:53:10,670 --> 00:53:14,510
is that there's a bunch of
things which collectively have
934
00:53:14,510 --> 00:53:16,410
zilch probability.
935
00:53:16,410 --> 00:53:18,980
There's a bunch of other things
which all have equal
936
00:53:18,980 --> 00:53:20,090
probability.
937
00:53:20,090 --> 00:53:24,650
And there are enough of them
to add up to probability 1.
938
00:53:24,650 --> 00:53:28,820
So I have turned this source,
when I look at it over a long
939
00:53:28,820 --> 00:53:38,080
enough sequence, into a source
of equiprobable events.
940
00:53:38,080 --> 00:53:41,470
And each of those events has
this probability here.
941
00:53:41,470 --> 00:53:46,540
Now, we know how to encode
equiprobable events.
942
00:53:46,540 --> 00:53:48,140
And that's the whole
point of this.
943
00:53:50,770 --> 00:53:55,820
So, this is less than
or equal to that.
944
00:53:55,820 --> 00:53:59,000
On the other side, we know that
1 minus delta is less
945
00:53:59,000 --> 00:54:04,970
than or equal to this
probability of a typical set.
946
00:54:04,970 --> 00:54:09,590
And this is less than the
number of elements in a
947
00:54:09,590 --> 00:54:13,860
typical set times 2 to the minus
n h of x minus epsilon.
948
00:54:13,860 --> 00:54:16,320
This is an upper
bound on this.
949
00:54:16,320 --> 00:54:24,240
This is less than this.
950
00:54:27,600 --> 00:54:30,570
So I just add all these things
up and I get this bound.
951
00:54:30,570 --> 00:54:34,200
So it says, the size of the
typical set is greater than 1
952
00:54:34,200 --> 00:54:37,360
minus delta, times
this quantity.
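Putting the two directions of the counting argument together, as I read the slides, the bounds are

\[
(1-\delta)\,2^{\,n(H(X)-\epsilon)} \;\le\; |T_\epsilon^n| \;\le\; 2^{\,n(H(X)+\epsilon)}.
\]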
953
00:54:37,360 --> 00:54:41,520
In other words, this is a pretty
exact sort of thing.
954
00:54:41,520 --> 00:54:44,870
If you don't mind dealing
with this 2 to the n
955
00:54:44,870 --> 00:54:47,270
epsilon factor here.
956
00:54:47,270 --> 00:54:50,150
If you agree that that's
negligible in some strange
957
00:54:50,150 --> 00:54:53,860
sense, then all of this
makes good sense.
958
00:54:53,860 --> 00:54:57,760
And if it is negligible, let me
start talking about source
959
00:54:57,760 --> 00:55:01,420
coding, which is why
this all works out.
960
00:55:01,420 --> 00:55:05,460
So the summary is that the
probability of the complement
961
00:55:05,460 --> 00:55:10,650
of the typical set
is essentially 0.
962
00:55:10,650 --> 00:55:14,340
The number of elements in a
typical set is approximately 2
963
00:55:14,340 --> 00:55:16,130
to the n times h of x.
964
00:55:16,130 --> 00:55:18,610
I'm getting rid of all the
deltas and epsilons here, to
965
00:55:18,610 --> 00:55:22,380
get sort of the broad view
of what's important here.
966
00:55:22,380 --> 00:55:25,650
Each of the elements in a
typical set has probability 2
967
00:55:25,650 --> 00:55:28,170
to the minus n times H of x.
968
00:55:28,170 --> 00:55:32,175
So I've turned a source
into a source of
969
00:55:32,175 --> 00:55:34,230
equiprobable elements.
970
00:55:34,230 --> 00:55:37,070
And there are 2 to the n
times h of x of them.
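A minimal sketch in Python that checks this summary by brute force on a small binary source (the parameters are assumed for illustration, not from the lecture):

```python
# Enumerate all length-n binary strings, mark the epsilon-typical ones, and
# compare their count and total probability with the AEP predictions.
import math
from itertools import product

p, n, eps = 0.25, 20, 0.1                   # assumed parameters
H = -p*math.log2(p) - (1-p)*math.log2(1-p)

count, prob = 0, 0.0
for seq in product((0, 1), repeat=n):
    ones = sum(seq)
    pr = p**ones * (1-p)**(n-ones)
    if abs(-math.log2(pr)/n - H) < eps:     # the typicality test
        count += 1
        prob += pr

print(count, 2**(n*H))   # count is close to 2^(nH) on a log scale
print(prob)              # tends to 1 as n grows, for fixed eps
```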
971
00:55:43,100 --> 00:55:46,320
Let's do an example of this.
972
00:55:46,320 --> 00:55:48,890
It's an example that you'll work
on more in the homework
973
00:55:48,890 --> 00:55:52,810
and do it a little
more cleanly.
974
00:55:52,810 --> 00:55:57,120
Let's look at a binary discrete
memoryless source,
975
00:55:57,120 --> 00:56:02,310
where the probability that x is
equal to 1 is p, which is
976
00:56:02,310 --> 00:56:03,920
less than 1/2.
977
00:56:03,920 --> 00:56:07,070
And the probability of 0
is greater than 1/2.
978
00:56:07,070 --> 00:56:12,640
So, this is what you get when
you have a biased coin.
979
00:56:12,640 --> 00:56:17,420
And the biased coin has a
1 on one side and a 0
980
00:56:17,420 --> 00:56:19,340
on the other side.
981
00:56:19,340 --> 00:56:23,070
And it's more likely to
come up 0's than 1's.
982
00:56:23,070 --> 00:56:26,080
I always used to wonder how
to make a biased coin.
983
00:56:26,080 --> 00:56:28,240
And I can give you a little
experiment which shows you you
984
00:56:28,240 --> 00:56:30,400
can make a biased coin.
985
00:56:30,400 --> 00:56:34,140
I mean, a coin is a little
round thing which is flat on
986
00:56:34,140 --> 00:56:35,840
the top and bottom.
987
00:56:35,840 --> 00:56:40,070
Suppose instead of that you
make a triangular coin.
988
00:56:40,070 --> 00:56:43,140
And instead of making it flat on
top and bottom, you turn it
989
00:56:43,140 --> 00:56:45,800
into a tetrahedron.
990
00:56:45,800 --> 00:56:50,630
So in fact, what this is now is
a coin which is built up on
991
00:56:50,630 --> 00:56:54,090
one side into a very
massive thing.
992
00:56:54,090 --> 00:56:57,070
And is flat on the other side.
993
00:56:57,070 --> 00:56:59,700
Since it's a tetrahedron
and it's an equilateral
994
00:56:59,700 --> 00:57:04,730
tetrahedron, the probability of
1 is going to be 1/4, and
995
00:57:04,730 --> 00:57:07,850
the probability of 0
is going to be 3/4.
996
00:57:07,850 --> 00:57:10,760
So you can make biased coins.
997
00:57:10,760 --> 00:57:12,760
So when you get into
coin-tossing games with
998
00:57:12,760 --> 00:57:15,045
people, watch the coin
that they're using.
999
00:57:15,045 --> 00:57:19,120
It probably won't be a
tetrahedron, but anyway.
1000
00:57:21,820 --> 00:57:28,520
So here, the log pmf
random variable takes on
1001
00:57:28,520 --> 00:57:32,300
the value of minus log
p with probability p.
1002
00:57:32,300 --> 00:57:35,950
And it takes on the value minus
log 1 minus p, with
1003
00:57:35,950 --> 00:57:37,490
probability 1 minus p.
1004
00:57:37,490 --> 00:57:40,080
This is the probability of a 1.
1005
00:57:40,080 --> 00:57:42,700
This is the probability of a 0.
1006
00:57:42,700 --> 00:57:46,270
So, the entropy is
equal to this.
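That is, the binary entropy function referred to here is

\[
H(X) = -p\log_2 p - (1-p)\log_2(1-p).
\]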
1007
00:57:46,270 --> 00:57:48,980
Used to be that in information
theory courses, people would
1008
00:57:48,980 --> 00:57:52,050
almost memorize what this
curve looked like.
1009
00:57:52,050 --> 00:57:53,250
And they'd draw pictures
of it.
1010
00:57:53,250 --> 00:57:56,140
There were famous curves
of this function,
1011
00:57:56,140 --> 00:57:58,950
which looks like this.
1012
00:58:07,280 --> 00:58:17,620
0, 1, 1.
1013
00:58:17,620 --> 00:58:20,800
Turns out, that's not all that
important a distribution.
1014
00:58:20,800 --> 00:58:24,510
It's a nice example
to talk about.
1015
00:58:24,510 --> 00:58:28,400
The typical set, t epsilon n,
is the set of strings with
1016
00:58:28,400 --> 00:58:34,710
about pn 1's and about 1
minus p times n 0's.
1017
00:58:34,710 --> 00:58:38,770
In other words, that's the
typical thing to happen.
1018
00:58:38,770 --> 00:58:41,900
And it's the typical thing in
terms of this law of large
1019
00:58:41,900 --> 00:58:42,690
numbers here.
1020
00:58:42,690 --> 00:58:46,520
Because you get 1's with
probability p.
1021
00:58:46,520 --> 00:58:48,700
And therefore in a long
sequence, you're going to get
1022
00:58:48,700 --> 00:58:53,190
about pn 1's and
about (1 minus p) times n 0's.
1023
00:58:53,190 --> 00:58:58,520
The probability of a typical
string is, if you get a string
1024
00:58:58,520 --> 00:59:01,940
with this many 1's and
this many 0's, its
1025
00:59:01,940 --> 00:59:04,500
probability is p to the pn.
1026
00:59:04,500 --> 00:59:08,280
Namely, the probability of a 1
raised to the number of 1's you
1027
00:59:08,280 --> 00:59:10,610
get, which is pn.
1028
00:59:10,610 --> 00:59:13,420
Times the probability
of a 0 raised to the
1029
00:59:13,420 --> 00:59:16,210
number of 0's you get.
1030
00:59:16,210 --> 00:59:19,170
And if you look at what this
is, if you take p up in the
1031
00:59:19,170 --> 00:59:22,850
exponent and 1 minus p up in
the exponent, this becomes
1032
00:59:22,850 --> 00:59:27,700
2 to the minus n times h of x,
just like what it should be.
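In symbols, the calculation just described is

\[
p^{pn}(1-p)^{(1-p)n}
= 2^{\,n\left( p\log_2 p + (1-p)\log_2(1-p) \right)}
= 2^{-nH(X)}.
\]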
1033
00:59:27,700 --> 00:59:31,780
So these typical strings, with
about pn 1's and (1 minus p)n
1034
00:59:31,780 --> 00:59:34,720
0's, are in fact typical
in the sense we've
1035
00:59:34,720 --> 00:59:36,560
been talking about.
1036
00:59:36,560 --> 00:59:43,100
The number of length-n strings with
pn 1's is n factorial divided
1037
00:59:43,100 --> 00:59:47,760
by pn factorial, divided by
the factorial of n times (1 minus p).
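A minimal check in Python of this count against the entropy (the numbers are assumed, chosen so that pn is an integer):

```python
# The number of length-n strings with exactly pn ones is the binomial
# coefficient n!/((pn)!(n(1-p))!), and it grows roughly like 2^(n*H(p)).
import math

p, n = 0.25, 1000
k = int(p * n)                             # number of 1's
count = math.comb(n, k)                    # n! / (k! (n-k)!)
H = -p*math.log2(p) - (1-p)*math.log2(1-p)

print(math.log2(count) / n)   # about 0.806
print(H)                      # about 0.811; the gap shrinks like log(n)/n
```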
1038
00:59:52,070 --> 00:59:54,960
I mean I hope you learned that
a long time ago, but you
1039
00:59:54,960 --> 00:59:56,910
should learn it in probability
anyway.
1040
00:59:56,910 --> 01:00:01,260
It's just very simple
combinatorics.
1041
01:00:01,260 --> 01:00:04,270
So you have that many
different strings.
1042
01:00:04,270 --> 01:00:07,430
So what I'm trying to get across
here is, there are a
1043
01:00:07,430 --> 01:00:10,580
bunch of different things
going on here.
1044
01:00:10,580 --> 01:00:13,600
We can talk about the random
variable which is the number
1045
01:00:13,600 --> 01:00:16,990
of 1's that occur in
this long sequence.
1046
01:00:16,990 --> 01:00:20,460
And with high probability, the
number of 1's that occur is
1047
01:00:20,460 --> 01:00:22,970
close to pn.
1048
01:00:22,970 --> 01:00:26,470
But if pn 1's occur, there's
still an awful lot of
1049
01:00:26,470 --> 01:00:28,400
randomness left.
1050
01:00:28,400 --> 01:00:33,310
Because we have to worry about
where those pn 1's appear.
1051
01:00:33,310 --> 01:00:36,140
And those are the sequences
we're talking about.
1052
01:00:36,140 --> 01:00:41,520
So, there are this many
sequences, all of which have
1053
01:00:41,520 --> 01:00:44,940
that many 1's in them.
1054
01:00:44,940 --> 01:00:48,850
And there's a similar number of
sequences for all similar
1055
01:00:48,850 --> 01:00:50,160
numbers of 1's.
1056
01:00:50,160 --> 01:00:54,510
Namely, if you take pn plus 1
and pn plus 2, pn minus 1, pn
1057
01:00:54,510 --> 01:00:57,780
minus 2, you get similar
numbers here.
1058
01:00:57,780 --> 01:01:00,890
So those are the typical
sequences.
1059
01:01:00,890 --> 01:01:03,980
Now, the important thing to
observe here is that you
1060
01:01:03,980 --> 01:01:08,890
really have 2 to the n binary
strings altogether.
1061
01:01:08,890 --> 01:01:13,270
And what this result is saying
is that collectively those
1062
01:01:13,270 --> 01:01:14,490
don't make any difference.
1063
01:01:14,490 --> 01:01:17,820
The law of large numbers says,
OK, there's just a humongous
1064
01:01:17,820 --> 01:01:20,080
number of strings.
1065
01:01:20,080 --> 01:01:23,780
You get the largest number of
strings, which have about half
1066
01:01:23,780 --> 01:01:25,510
1's and half 0's.
1067
01:01:25,510 --> 01:01:29,100
But their probability
is zilch.
1068
01:01:29,100 --> 01:01:32,540
So the thing which is probable
is getting pn 1's
1069
01:01:32,540 --> 01:01:34,750
and (1 minus p)n 0's.
1070
01:01:34,750 --> 01:01:37,290
Now, we have this typical set.
1071
01:01:37,290 --> 01:01:41,410
What is the most likely sequence
of all, in this
1072
01:01:41,410 --> 01:01:42,660
experiment?
1073
01:01:45,450 --> 01:01:48,130
How do I maximize the
probability of
1074
01:01:48,130 --> 01:01:49,620
a particular sequence?
1075
01:01:49,620 --> 01:02:03,910
The probability of a
sequence with i 1's is p to the i, times 1
1076
01:02:03,910 --> 01:02:07,420
minus p to the n minus i.
1077
01:02:07,420 --> 01:02:11,050
And 1 minus p is the
probability of 0.
1078
01:02:11,050 --> 01:02:14,240
And p is the probability
of a 1.
1079
01:02:14,240 --> 01:02:15,970
How do I choose i to
maximize this?
1080
01:02:15,970 --> 01:02:16,300
Yeah?
1081
01:02:16,300 --> 01:02:18,150
AUDIENCE: [UNINTELLIGIBLE]
all 0's.
1082
01:02:18,150 --> 01:02:19,540
PROFESSOR: You make
them all 0's.
1083
01:02:19,540 --> 01:02:23,750
So the most likely sequence
is all 0's.
1084
01:02:23,750 --> 01:02:25,860
But that's not a typical
sequence.
1085
01:02:29,700 --> 01:02:33,290
Why isn't it a typical
sequence?
1086
01:02:33,290 --> 01:02:36,060
Because we chose to define
typical sequence in a
1087
01:02:36,060 --> 01:02:37,880
different way.
1088
01:02:37,880 --> 01:02:41,180
Namely, there is only one of those,
there are only n of them
1089
01:02:41,180 --> 01:02:43,650
with only a single one.
1090
01:02:43,650 --> 01:02:46,920
So, in other words, what's going
on is that we have an
1091
01:02:46,920 --> 01:02:49,640
enormous number of sequences
which have around half
1092
01:02:49,640 --> 01:02:50,890
1's and half 0's.
1093
01:02:53,430 --> 01:02:55,240
But they don't have
any probability.
1094
01:02:55,240 --> 01:02:57,840
And collectively they don't
have any probability.
1095
01:02:57,840 --> 01:03:01,380
We have a very small number of
sequences which have a very
1096
01:03:01,380 --> 01:03:03,750
large number of 0's.
1097
01:03:03,750 --> 01:03:07,960
But there aren't enough of those
to make any difference.
1098
01:03:07,960 --> 01:03:10,750
And, therefore, the things that
make a difference are
1099
01:03:10,750 --> 01:03:14,710
these typical things which
have about np 1's
1100
01:03:14,710 --> 01:03:18,270
and (1 minus p)n 0's.
1101
01:03:18,270 --> 01:03:20,680
And that all sounds
very strange.
1102
01:03:20,680 --> 01:03:22,800
But if I phrase this a different
way, you would all
1103
01:03:22,800 --> 01:03:27,470
say that's exactly the way
you ought to do things.
1104
01:03:27,470 --> 01:03:32,210
Because, in fact, when we look
at very, very long sequences,
1105
01:03:32,210 --> 01:03:35,175
you know with extraordinarily
high probability what's going
1106
01:03:35,175 --> 01:03:39,050
to come out of the source is
something with about pn 1's
1107
01:03:39,050 --> 01:03:42,430
and about 1 minus
p times n 0's.
1108
01:03:42,430 --> 01:03:46,410
So that's the likely set of
things to have happen.
1109
01:03:46,410 --> 01:03:47,590
And it's just that there
are an enormous
1110
01:03:47,590 --> 01:03:49,200
number of those things.
1111
01:03:49,200 --> 01:03:51,890
There are this many of them.
1112
01:03:51,890 --> 01:03:56,150
So, here what we're dealing with
is a balance between the
1113
01:03:56,150 --> 01:04:01,090
number of elements of a
particular type, and the
1114
01:04:01,090 --> 01:04:03,520
probability of them.
1115
01:04:03,520 --> 01:04:07,030
And it turns out that this
number and its probability
1116
01:04:07,030 --> 01:04:10,650
balance out to say that usually
what you get is about
1117
01:04:10,650 --> 01:04:13,780
pn 1's and 1 minus
p times n 0's.
1118
01:04:13,780 --> 01:04:16,730
Which is what the law of large
numbers said to begin with.
1119
01:04:16,730 --> 01:04:20,300
All we're doing is interpreting
that here.
1120
01:04:20,300 --> 01:04:25,210
But the thing that you see from
this example is, all of
1121
01:04:25,210 --> 01:04:28,680
these things with exactly pn 1's
in them, assuming that pn
1122
01:04:28,680 --> 01:04:31,270
is an integer, are
all equiprobable.
1123
01:04:31,270 --> 01:04:34,940
They're all exactly
equiprobable.
1124
01:04:34,940 --> 01:04:37,990
So what we're doing when we're
talking about this typical
1125
01:04:37,990 --> 01:04:42,140
set, is first throwing out all
the things which have too many
1126
01:04:42,140 --> 01:04:44,570
1's or too few
1's in them.
1127
01:04:44,570 --> 01:04:48,560
We're keeping only the ones
which are typical in the sense
1128
01:04:48,560 --> 01:04:50,920
that they obey the law
of large numbers.
1129
01:04:50,920 --> 01:04:54,100
And in this case, they obey the
law of large numbers for
1130
01:04:54,100 --> 01:04:56,730
log pmf's also.
1131
01:04:56,730 --> 01:05:01,770
And then all of those things
are about equally probable.
1132
01:05:01,770 --> 01:05:05,460
So the idea in source coding
is, one of the ways to deal
1133
01:05:05,460 --> 01:05:10,430
with source coding is, you want
to assign codewords to
1134
01:05:10,430 --> 01:05:13,570
only these typical things.
1135
01:05:13,570 --> 01:05:16,240
Now, maybe you might want to
assign codewords to something
1136
01:05:16,240 --> 01:05:17,870
like all 0's also.
1137
01:05:17,870 --> 01:05:20,570
Because it hardly
costs anything.
1138
01:05:20,570 --> 01:05:23,810
And a Huffman code would
certainly do that.
1139
01:05:23,810 --> 01:05:27,310
But it's not very important
whether you do or not.
1140
01:05:27,310 --> 01:05:30,300
The important thing is, you
assign codewords to all of
1141
01:05:30,300 --> 01:05:31,910
these typical sequences.
1142
01:05:37,770 --> 01:05:41,280
So let's go back to
fixed-to-fixed
1143
01:05:41,280 --> 01:05:42,660
length source codes.
1144
01:05:42,660 --> 01:05:45,500
We talked a little bit about
fixed-to-fixed length source
1145
01:05:45,500 --> 01:05:46,940
codes before.
1146
01:05:46,940 --> 01:05:48,980
Do you remember what we did
with fixed-to-fixed length
1147
01:05:48,980 --> 01:05:50,720
source codes before?
1148
01:05:50,720 --> 01:05:53,520
We said we have an alphabet
of size m.
1149
01:05:53,520 --> 01:05:56,250
We want something which
is uniquely decodable.
1150
01:05:56,250 --> 01:05:59,020
And since we want something
which is uniquely decodable,
1151
01:05:59,020 --> 01:06:02,510
we have to provide codewords
for everything.
1152
01:06:02,510 --> 01:06:07,780
And, therefore, if we want to
choose a block length of n,
1153
01:06:07,780 --> 01:06:11,730
we've got to generate m
to the n codewords.
1154
01:06:11,730 --> 01:06:14,700
Here we say, wow, maybe we
don't have to provide
1155
01:06:14,700 --> 01:06:17,250
codewords for everything.
1156
01:06:17,250 --> 01:06:20,520
Maybe we're willing to tolerate
a certain small
1157
01:06:20,520 --> 01:06:23,070
probability that the whole
thing fails and
1158
01:06:23,070 --> 01:06:24,320
falls on its face.
1159
01:06:27,040 --> 01:06:30,280
Now, does that make any sense?
1160
01:06:30,280 --> 01:06:32,330
Well, view things the
following way.
1161
01:06:32,330 --> 01:06:36,090
We said, when we started out
all of this, that we were
1162
01:06:36,090 --> 01:06:38,880
going to look at prefix-free
codes.
1163
01:06:38,880 --> 01:06:42,640
Where some codewords had a
longer length and some
1164
01:06:42,640 --> 01:06:44,730
codewords had a shorter
length.
1165
01:06:44,730 --> 01:06:48,040
And we were thinking of encoding
either single letters
1166
01:06:48,040 --> 01:06:52,340
at a time, or a small block
of letters at a time.
1167
01:06:52,340 --> 01:06:55,960
So think of encoding, say,
10 letters at a time.
1168
01:06:55,960 --> 01:07:02,250
And think of doing this for
10 to the 20th letters.
1169
01:07:02,250 --> 01:07:05,740
So you have the source here
which is pumping out letters
1170
01:07:05,740 --> 01:07:08,280
at a regular rate.
1171
01:07:08,280 --> 01:07:12,540
You're blocking them into
n letters at a time.
1172
01:07:12,540 --> 01:07:15,540
You're encoding in a
prefix-free code.
1173
01:07:15,540 --> 01:07:17,790
Out comes something.
1174
01:07:17,790 --> 01:07:22,560
What comes out is not coming
out at a regular rate.
1175
01:07:22,560 --> 01:07:25,670
What is coming out, sometimes
you get a lot of bits out.
1176
01:07:25,670 --> 01:07:28,450
Sometimes a small number
of bits out.
1177
01:07:28,450 --> 01:07:30,730
So, in other words, if you want
to send things over a
1178
01:07:30,730 --> 01:07:34,970
channel, you need a buffer
there to save things.
1179
01:07:34,970 --> 01:07:39,000
If, in fact, we decide that the
expected number of bits
1180
01:07:39,000 --> 01:07:43,960
per source letter is, say, five
bits per source letter,
1181
01:07:43,960 --> 01:07:48,540
then we expect over a very long
time to be producing five
1182
01:07:48,540 --> 01:07:50,830
bits per source letter.
1183
01:07:50,830 --> 01:07:54,460
And if we turn our channel on
for one year, to transmit all
1184
01:07:54,460 --> 01:07:59,010
of these things, what's going
to happen is this very
1185
01:07:59,010 --> 01:08:02,080
unlikely sequence occurs.
1186
01:08:02,080 --> 01:08:05,910
Which in fact requires not one
year to transmit, but two
1187
01:08:05,910 --> 01:08:09,520
years to transmit.
1188
01:08:09,520 --> 01:08:13,150
In fact, what do we do if it
takes one year and five
1189
01:08:13,150 --> 01:08:18,140
minutes to transmit instead
of one year?
1190
01:08:18,140 --> 01:08:19,050
Well, we've got a failure.
1191
01:08:19,050 --> 01:08:22,520
Somehow or other, the network
is going to fail us.
1192
01:08:22,520 --> 01:08:25,350
I mean we all know that networks
fail all the time
1193
01:08:25,350 --> 01:08:28,530
despite what engineers say.
1194
01:08:28,530 --> 01:08:32,120
I mean, all of us who use
networks know that they do
1195
01:08:32,120 --> 01:08:33,820
crazy things.
1196
01:08:33,820 --> 01:08:36,590
And one of those crazy things
is that unusual things
1197
01:08:36,590 --> 01:08:38,270
sometimes happen.
1198
01:08:38,270 --> 01:08:42,640
So, we develop this very nice
theory of prefix-free codes.
1199
01:08:42,640 --> 01:08:46,580
But prefix-free codes,
in fact, fail also.
1200
01:08:46,580 --> 01:08:50,880
And they fail also because
buffers overflow.
1201
01:08:50,880 --> 01:08:54,160
In other words, we are counting
on encoding things
1202
01:08:54,160 --> 01:08:58,020
with a certain number of
bits per source symbol.
1203
01:08:58,020 --> 01:09:00,770
And if these unusual things
occur, and we have too many
1204
01:09:00,770 --> 01:09:04,780
bits per source symbol,
then we fail.
1205
01:09:04,780 --> 01:09:08,960
So the idea that we're trying
to get at now is that
1206
01:09:08,960 --> 01:09:13,560
prefix-free codes and
fixed-to-fixed length source
1207
01:09:13,560 --> 01:09:16,640
codes which only encode
typical things are,
1208
01:09:16,640 --> 01:09:20,710
in fact, sort of the same
if you look at them over a
1209
01:09:20,710 --> 01:09:22,860
very, very large sequence
length.
1210
01:09:22,860 --> 01:09:26,980
In other words, if you look at
a prefix-free code which is
1211
01:09:26,980 --> 01:09:31,190
dealing with blocks of 10
letters, and you look at a
1212
01:09:31,190 --> 01:09:34,120
fixed-to-fixed length code which
is only dealing with
1213
01:09:34,120 --> 01:09:39,320
typical things but is looking at
a length of 10 to the 20th,
1214
01:09:39,320 --> 01:09:43,570
then over that length of 10 to
the 20th, your variable length
1215
01:09:43,570 --> 01:09:47,020
code is going to have a bunch of
things which are about the
1216
01:09:47,020 --> 01:09:48,630
length they ought to be.
1217
01:09:48,630 --> 01:09:50,970
And a bunch of other
things which are
1218
01:09:50,970 --> 01:09:53,090
extraordinarily long.
1219
01:09:53,090 --> 01:09:56,360
The bunch of things which are
extraordinarily long are
1220
01:09:56,360 --> 01:09:59,910
extraordinarily unpopular, but
there are an extraordinarily
1221
01:09:59,910 --> 01:10:02,020
large number of them.
1222
01:10:02,020 --> 01:10:05,760
Just like with a fixed-to-fixed
length code,
1223
01:10:05,760 --> 01:10:07,700
you are going to fail.
1224
01:10:07,700 --> 01:10:10,200
And you're going to fail on
an extraordinary number of
1225
01:10:10,200 --> 01:10:12,500
different sequences.
1226
01:10:12,500 --> 01:10:15,290
But, collectively, that set of
sequences doesn't have any
1227
01:10:15,290 --> 01:10:17,850
probability.
1228
01:10:17,850 --> 01:10:20,720
So the point that I'm trying to
get across is that, really,
1229
01:10:20,720 --> 01:10:24,020
these two situations come
together when we look very
1230
01:10:24,020 --> 01:10:25,630
long lengths.
1231
01:10:25,630 --> 01:10:30,030
Namely, prefix-free codes are
just a way of generating codes
1232
01:10:30,030 --> 01:10:33,260
that work for typical sequences
and over a very
1233
01:10:33,260 --> 01:10:37,390
large, long period of time, will
generate about the right
1234
01:10:37,390 --> 01:10:40,550
number of symbols.
1235
01:10:40,550 --> 01:10:42,420
And that's what I'm trying
to get at here.
1236
01:10:42,420 --> 01:10:45,980
Or what I'm trying to get
at in the next slide.
1237
01:10:45,980 --> 01:10:50,650
So the fixed-to-fixed length
source code, I'm going to pick
1238
01:10:50,650 --> 01:10:52,860
some epsilon and some delta.
1239
01:10:52,860 --> 01:10:55,770
Namely, that epsilon and delta
which appeared in the law of
1240
01:10:55,770 --> 01:10:58,280
large numbers.
1241
01:10:58,280 --> 01:11:01,400
I'm going to make n as big as
I have to make it for that
1242
01:11:01,400 --> 01:11:03,220
epsilon and that delta.
1243
01:11:03,220 --> 01:11:07,120
One could calculate how large it
has to be, but we won't.
1244
01:11:07,120 --> 01:11:12,150
Then I'm going to assign fixed
length codewords to each
1245
01:11:12,150 --> 01:11:15,390
sequence in the typical set.
1246
01:11:15,390 --> 01:11:16,490
Now, am I going to really build
1247
01:11:16,490 --> 01:11:18,410
something which does this?
1248
01:11:18,410 --> 01:11:20,210
Of course not.
1249
01:11:20,210 --> 01:11:23,140
I mean, I'm talking about
truly humongous lengths.
1250
01:11:23,140 --> 01:11:25,620
So, this is really a conceptual
tool to understand
1251
01:11:25,620 --> 01:11:27,070
what's going on.
1252
01:11:27,070 --> 01:11:30,100
It's not something we're
going to implement.
1253
01:11:30,100 --> 01:11:32,490
So I'm going to assign
codewords to all
1254
01:11:32,490 --> 01:11:34,910
these typical elements.
1255
01:11:34,910 --> 01:11:40,900
And then what I find is that
since the typical set, since
1256
01:11:40,900 --> 01:11:44,730
the number of elements in it is
less than 2 to the n times
1257
01:11:44,730 --> 01:11:51,200
H of x plus epsilon, if I choose
L bar, namely, the
1258
01:11:51,200 --> 01:11:56,980
number of bits I'm going to use
for encoding these things,
1259
01:11:56,980 --> 01:12:00,470
it's going to have to be H of
x plus epsilon in length.
1260
01:12:00,470 --> 01:12:02,190
Because I need to provide
codewords for
1261
01:12:02,190 --> 01:12:05,600
each of these things.
1262
01:12:05,600 --> 01:12:08,930
And it needs to be an extra 1
over n because of this integer
1263
01:12:08,930 --> 01:12:11,460
constraint that we've been
dealing with all along, which
1264
01:12:11,460 --> 01:12:14,120
doesn't make any difference.
1265
01:12:14,120 --> 01:12:17,830
So if I choose L bar, that big,
in other words, if I make
1266
01:12:17,830 --> 01:12:21,670
it just a little bit bigger
than the entropy, the
1267
01:12:21,670 --> 01:12:23,790
probability of failure
is going to be less
1268
01:12:23,790 --> 01:12:25,640
than or equal to delta.
1269
01:12:25,640 --> 01:12:27,910
And I can make delta -- and I
can make the probability of
1270
01:12:27,910 --> 01:12:30,110
failure as small as I want.
1271
01:12:30,110 --> 01:12:32,960
So I can make this epsilon here
which is the extra bits
1272
01:12:32,960 --> 01:12:36,710
per source symbol as
small as I want.
1273
01:12:36,710 --> 01:12:39,790
So it says I can come as close
to the entropy bound in doing
1274
01:12:39,790 --> 01:12:43,350
this, and come as close to
unique decodability as I want
1275
01:12:43,350 --> 01:12:45,140
in doing this.
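A minimal sketch in Python of this fixed-to-fixed length scheme, on a toy scale (all parameters are assumed; at realistic block lengths you would never enumerate the typical set like this):

```python
# Assign fixed-length indices only to epsilon-typical sequences; anything
# atypical is a failure event, whose probability is at most delta.
import math
from itertools import product

p, n, eps = 0.25, 12, 0.15
H = -p*math.log2(p) - (1-p)*math.log2(1-p)

def typical(seq):
    pr = p**sum(seq) * (1-p)**(n - sum(seq))
    return abs(-math.log2(pr)/n - H) < eps

index = {seq: i for i, seq in enumerate(s for s in product((0, 1), repeat=n)
                                        if typical(s))}
bits = math.ceil(n*(H + eps))              # n*(H + eps) bits, rounded up

def encode(seq):
    # Returns a fixed-length codeword, or None on the (rare) failure event.
    if seq not in index:
        return None
    return format(index[seq], '0{}b'.format(bits))

print(bits / n)                            # bits per symbol, just above H
print(encode((0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0)))  # about pn ones: typical
```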
1276
01:12:45,140 --> 01:12:48,720
And I have a fixed-to-fixed
length code, which after one
1277
01:12:48,720 --> 01:12:50,880
year is going to stop.
1278
01:12:50,880 --> 01:12:53,730
And I can turn my decoder off.
1279
01:12:53,730 --> 01:12:55,950
I can turn my encoder off.
1280
01:12:55,950 --> 01:12:59,160
I can go buy a new encoder
and a new decoder, which
1281
01:12:59,160 --> 01:13:01,770
presumably works a little
bit better.
1282
01:13:01,770 --> 01:13:04,150
And there isn't any problem
about when to turn it off.
1283
01:13:04,150 --> 01:13:05,730
Because I know I can
turn it off.
1284
01:13:05,730 --> 01:13:09,630
Because everything will
have come in by then.
1285
01:13:09,630 --> 01:13:12,420
Here's a more interesting
story.
1286
01:13:12,420 --> 01:13:18,250
Suppose I choose the number of
bits per source symbol that
1287
01:13:18,250 --> 01:13:23,390
I'm going to use to be less than
or equal to the entropy
1288
01:13:23,390 --> 01:13:24,420
minus 2 epsilon.
1289
01:13:24,420 --> 01:13:25,670
Why 2 epsilon?
1290
01:13:25,670 --> 01:13:29,110
Well, just wait a second.
1291
01:13:29,110 --> 01:13:31,830
I mean, 2 epsilon is small
and epsilon is small.
1292
01:13:31,830 --> 01:13:34,145
But I want to compare with this
other epsilon and my law
1293
01:13:34,145 --> 01:13:35,590
of large numbers.
1294
01:13:35,590 --> 01:13:39,430
And I'm going to pick
n large enough.
1295
01:13:39,430 --> 01:13:43,480
The number of typical sequences,
we said before, was
1296
01:13:43,480 --> 01:13:48,300
greater than 1 minus delta times
2 to the n times h of x
1297
01:13:48,300 --> 01:13:48,950
minus epsilon.
1298
01:13:48,950 --> 01:13:52,430
I'm going to make this epsilon
the same as that epsilon,
1299
01:13:52,430 --> 01:13:54,170
which is why I wanted this
to be 2 epsilon.
1300
01:13:56,700 --> 01:14:01,680
So my typical set is this big
when I choose n large enough.
1301
01:14:01,680 --> 01:14:04,890
And this says that most
of the typical set
1302
01:14:04,890 --> 01:14:07,440
can't be assigned codewords.
1303
01:14:07,440 --> 01:14:15,510
In other words, this number
here is humongously larger
1304
01:14:15,510 --> 01:14:35,870
than 2 to the L bar, which is on
the order of 2 to the nH of
1305
01:14:35,870 --> 01:14:42,200
x minus 2 epsilon n.
1306
01:14:42,200 --> 01:14:45,660
So the fraction of typical
elements that I can provide
1307
01:14:45,660 --> 01:14:52,040
codewords for, between this and
this, I can only provide
1308
01:14:52,040 --> 01:14:54,660
codewords for a fraction
2 to the minus
1309
01:14:54,660 --> 01:14:58,670
epsilon n of the codewords.
1310
01:14:58,670 --> 01:15:01,770
We have this big sea of
codewords, which are all
1311
01:15:01,770 --> 01:15:04,200
essentially equally likely.
1312
01:15:04,200 --> 01:15:07,230
And I can't provide codewords
for even a
1313
01:15:07,230 --> 01:15:09,860
small fraction of them.
1314
01:15:09,860 --> 01:15:13,130
So the probability of failure is
going to be 1 minus delta.
1315
01:15:13,130 --> 01:15:15,460
The 1 minus delta's the
probability that I get
1316
01:15:15,460 --> 01:15:17,950
something atypical.
1317
01:15:17,950 --> 01:15:24,190
Plus, well, minus in this case,
2 to the minus epsilon
1318
01:15:24,190 --> 01:15:28,280
n, which is the probability that
I can't encode a typical
1319
01:15:28,280 --> 01:15:30,670
codeword that comes out.
1320
01:15:30,670 --> 01:15:34,550
And this quantity goes to 1.
1321
01:15:34,550 --> 01:15:37,995
So this says that if I'm willing
to use a number of
1322
01:15:37,995 --> 01:15:42,690
bits bigger than the entropy, I
can succeed with probability
1323
01:15:42,690 --> 01:15:45,010
very close to 1.
1324
01:15:45,010 --> 01:15:48,150
And if I want to use a smaller
number of bits, I fail with
1325
01:15:48,150 --> 01:15:49,400
probability 1.
1326
01:15:52,810 --> 01:15:56,320
Which is the same as saying that
I'm using a prefix-free
1327
01:15:56,320 --> 01:16:01,950
code, I'm going to run out of
buffer space eventually if I
1328
01:16:01,950 --> 01:16:05,730
run long enough.
1329
01:16:05,730 --> 01:16:11,650
If I have something that
I'm encoding --
1330
01:16:11,650 --> 01:16:13,980
well, just erase that.
1331
01:16:13,980 --> 01:16:15,570
I'll say it more carefully
later.
1332
01:16:18,150 --> 01:16:22,210
I do want to talk a little bit
about this Kraft inequality
1333
01:16:22,210 --> 01:16:23,610
for unique decodability.
1334
01:16:23,610 --> 01:16:26,780
You remember we proved the
Kraft inequality for
1335
01:16:26,780 --> 01:16:29,460
prefix-free codes.
1336
01:16:29,460 --> 01:16:32,930
I now want to talk about the
Kraft inequality for uniquely
1337
01:16:32,930 --> 01:16:36,060
decodable codes.
1338
01:16:36,060 --> 01:16:39,330
And you might think that I've
done all of this development
1339
01:16:39,330 --> 01:16:45,990
of the AEP, the asymptotic
equipartition property, just to prove this.
1340
01:16:45,990 --> 01:16:49,560
Incidentally, you now know where
those words come from.
1341
01:16:49,560 --> 01:16:53,500
It's asymptotic because this
result is valid asymptotically
1342
01:16:53,500 --> 01:16:55,960
as n goes to infinity.
1343
01:16:55,960 --> 01:17:01,260
It's equipartition because
everything is equally likely.
1344
01:17:01,260 --> 01:17:03,480
And it's property, because
it's a property.
1345
01:17:03,480 --> 01:17:08,490
So it's the asymptotic
equipartition property.
1346
01:17:08,490 --> 01:17:12,260
And I didn't do it so I could
prove the Kraft inequality.
1347
01:17:12,260 --> 01:17:14,850
It's just that that's an extra
bonus that we get.
1348
01:17:14,850 --> 01:17:20,070
And by understanding why the
Kraft inequality has to hold
1349
01:17:20,070 --> 01:17:28,890
for uniquely decodable codes -- it
is one application of the AEP
1350
01:17:28,890 --> 01:17:32,470
which lets you see a little
bit about how to use it.
1351
01:17:32,470 --> 01:17:36,520
OK, so the argument is an
argument by contradiction.
1352
01:17:36,520 --> 01:17:43,010
Suppose you generate a set
of lengths for codewords.
1353
01:17:43,010 --> 01:17:44,550
And you want this -- yeah?
1354
01:17:55,250 --> 01:17:58,090
And the thing you would like to
do is to assign codewords
1355
01:17:58,090 --> 01:18:01,220
of these lengths.
1356
01:18:01,220 --> 01:18:04,860
And what we want to do is to
set this sum of 2 to the minus l sub i equal to some
1357
01:18:04,860 --> 01:18:05,630
quantity b.
1358
01:18:05,630 --> 01:18:09,020
In other words, suppose we beat
the Kraft inequality.
1359
01:18:09,020 --> 01:18:12,130
Suppose we can make the lengths
even shorter than
1360
01:18:12,130 --> 01:18:15,730
Kraft says we can make them.
1361
01:18:15,730 --> 01:18:17,905
I mean, he was only a graduate
student, so we've got to be
1362
01:18:17,905 --> 01:18:21,480
able to beat his inequality
somehow.
1363
01:18:21,480 --> 01:18:24,460
So we're going to try to
make this equal to b.
1364
01:18:24,460 --> 01:18:27,930
We're going to assume that
b is greater than 1.
1365
01:18:27,930 --> 01:18:30,890
And then what we're going to
do is to show that we get a
1366
01:18:30,890 --> 01:18:32,470
contradiction here.
1367
01:18:32,470 --> 01:18:36,090
And this same argument can
work whether we have a
1368
01:18:36,090 --> 01:18:39,600
discrete memoryless source or
a source with memory, or
1369
01:18:39,600 --> 01:18:40,420
anything else.
1370
01:18:40,420 --> 01:18:42,830
It can work with blocks, it can
work with variable length
1371
01:18:42,830 --> 01:18:46,000
to variable length codes.
1372
01:18:46,000 --> 01:18:49,560
It's all essentially
the same argument.
1373
01:18:49,560 --> 01:18:52,390
So what I want to do is to
get a contradiction.
1374
01:18:52,390 --> 01:18:56,230
I'm going to choose a discrete
memoryless source.
1375
01:18:56,230 --> 01:18:58,900
And I'm going to make the
probabilities equal to 1 over
1376
01:18:58,900 --> 01:19:02,300
b times 2 to the minus li.
1377
01:19:02,300 --> 01:19:04,800
In other words, I can generate
a discrete memoryless source
1378
01:19:04,800 --> 01:19:07,270
for purposes of argument, with
any probabilities I
1379
01:19:07,270 --> 01:19:08,800
want to give it.
1380
01:19:08,800 --> 01:19:12,650
So I'm going to generate one
with these probabilities.
1381
01:19:12,650 --> 01:19:16,530
So the lengths are going to
be equal to minus log of
1382
01:19:16,530 --> 01:19:19,220
b times p sub i.
1383
01:19:19,220 --> 01:19:22,920
Which says that the expected
length of the codewords is
1384
01:19:22,920 --> 01:19:27,820
equal to the sum of p sub i l
sub i, which is equal to the
1385
01:19:27,820 --> 01:19:31,780
entropy minus the
logarithm of b.
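In symbols, with p sub i chosen as above so that l sub i equals minus log2 of (b times p sub i),

\[
\bar{L} = \sum_i p_i\, l_i = -\sum_i p_i \log_2 p_i - \log_2 b = H(X) - \log_2 b,
\]

which is strictly less than H(X) whenever b is greater than 1.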
1386
01:19:31,780 --> 01:19:34,450
Which means I can get an
expected length which is a
1387
01:19:34,450 --> 01:19:37,440
little bit less than
the entropy.
1388
01:19:37,440 --> 01:19:40,600
So now what I'm going to do is
to consider strings of n
1389
01:19:40,600 --> 01:19:41,330
source letters.
1390
01:19:41,330 --> 01:19:43,460
I'm going to make these strings
very, very long.
1391
01:19:46,270 --> 01:19:50,430
When I concatenate all these
codewords, I'm going to wind
1392
01:19:50,430 --> 01:19:54,290
up with a length that's less
than n times H of x minus b
1393
01:19:54,290 --> 01:19:59,400
over 2, minus log b over 2
with high probability.
1394
01:20:13,510 --> 01:20:18,940
And as a fixed-length code of
this length it's going to have
1395
01:20:18,940 --> 01:20:21,810
a low failure probability.
1396
01:20:21,810 --> 01:20:26,740
And, therefore, what this says
is that, using this
1397
01:20:26,740 --> 01:20:32,670
remarkable code with unique
decodability, and generating
1398
01:20:32,670 --> 01:20:37,500
very long strings from it, I
can generate a fixed-length
1399
01:20:37,500 --> 01:20:41,550
code which has a low failure
probability.
1400
01:20:41,550 --> 01:20:45,640
And I just showed you
in the last slide
1401
01:20:45,640 --> 01:20:46,530
that I can't do that.
1402
01:20:46,530 --> 01:20:49,830
The probability of failure with
such a code has to be
1403
01:20:49,830 --> 01:20:51,540
essentially 1.
1404
01:20:51,540 --> 01:20:54,870
So that's a contradiction that
says you can't have these
1405
01:20:54,870 --> 01:20:57,460
uniquely decodable codes that beat the Kraft inequality.
1406
01:20:57,460 --> 01:21:01,670
If you didn't get that in what
I said, don't be surprised.
1407
01:21:01,670 --> 01:21:06,200
Because all I'm trying to do is
to steer you towards how to
1408
01:21:06,200 --> 01:21:09,610
look at the section in the
notes that does that.
1409
01:21:09,610 --> 01:21:12,430
It was a little too fast
and a little too late.
1410
01:21:12,430 --> 01:21:15,570
But, anyway, that is the Kraft
inequality for unique
1411
01:21:15,570 --> 01:21:16,650
decodability.
1412
01:21:16,650 --> 01:21:18,170
OK, thanks.