1 00:00:00,000 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,059 Commons license. 3 00:00:04,059 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,720 continue to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,350 To make a donation or view additional materials 6 00:00:13,350 --> 00:00:17,290 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,290 --> 00:00:18,294 at ocw.mit.edu. 8 00:00:28,437 --> 00:00:29,770 PROFESSOR: OK, let's get started. 9 00:00:32,390 --> 00:00:34,410 Let's get started, please. 10 00:00:34,410 --> 00:00:38,630 All right, last time we talked about information and entropy. 11 00:00:38,630 --> 00:00:40,970 The picture we had was of some kind 12 00:00:40,970 --> 00:00:44,435 of a source emitting symbols. 13 00:00:50,360 --> 00:00:54,230 Symbols-- let's say n of them. 14 00:00:54,230 --> 00:01:00,875 So it chooses from these symbols with probabilities P1 up to Pn. 15 00:01:04,110 --> 00:01:16,330 And then we talked about the expected information here, 16 00:01:16,330 --> 00:01:24,500 or the entropy, so the expected information 17 00:01:24,500 --> 00:01:27,800 you get when you see the symbol that's emitted by the source. 18 00:01:27,800 --> 00:01:31,670 And that was the average value of the information. 19 00:01:31,670 --> 00:01:36,590 So it was-- let's see, you take log of 1 20 00:01:36,590 --> 00:01:39,782 over P i for each of the possible symbols. 21 00:01:39,782 --> 00:01:41,240 And then you've got to weight it by 22 00:01:41,240 --> 00:01:44,960 the corresponding probability to get an expectation. 23 00:01:44,960 --> 00:01:48,480 And this was the entropy of the source. 24 00:01:48,480 --> 00:01:50,720 Or if you want to make explicit the source, 25 00:01:50,720 --> 00:01:55,190 you could say H of S for source-- 26 00:01:55,190 --> 00:01:59,413 capital S. All right? 27 00:01:59,413 --> 00:02:01,580 And then we were actually thinking of this operating 28 00:02:01,580 --> 00:02:02,360 repeatedly. 29 00:02:02,360 --> 00:02:06,440 So in the model we had last time, the source at each time 30 00:02:06,440 --> 00:02:08,870 chooses from one of these symbols with this probability. 31 00:02:08,870 --> 00:02:11,760 And it does it independently of choices at other times. 32 00:02:11,760 --> 00:02:14,390 So what the source actually generates 33 00:02:14,390 --> 00:02:20,810 is what's referred to as an iid sequence of symbols, 34 00:02:20,810 --> 00:02:25,852 independent, identically distributed. 35 00:02:25,852 --> 00:02:26,810 You'll see this a lot-- 36 00:02:32,570 --> 00:02:34,790 Or iid sequence of symbols. 37 00:02:39,860 --> 00:02:43,927 So the independent part of this refers to the fact 38 00:02:43,927 --> 00:02:45,510 that it makes the choice independently 39 00:02:45,510 --> 00:02:46,980 at each time instant. 40 00:02:46,980 --> 00:02:49,075 The identically distributed means 41 00:02:49,075 --> 00:02:50,700 that at each time instant, it goes back 42 00:02:50,700 --> 00:02:51,900 to these same probabilities. 43 00:02:51,900 --> 00:02:55,020 It's the same distribution that it uses each time. 44 00:02:55,020 --> 00:02:56,430 So that's what iid means-- 45 00:02:56,430 --> 00:02:59,220 so sort of a stationary probabilistic source 46 00:02:59,220 --> 00:03:01,560 with no dependence from one time instant to the next. 47 00:03:06,150 --> 00:03:10,320 Average information was measured in bits per symbol.
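Here is that entropy computation as a minimal Python sketch; the example distributions are made up for illustration.

```python
from math import log2

def entropy(probs):
    # H = sum over i of P_i * log2(1 / P_i); zero-probability
    # symbols contribute nothing to the sum.
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit per symbol: a fair binary choice
print(entropy([0.25] * 4))        # 2.0 bits per symbol: four equally likely symbols
print(entropy([0.7, 0.2, 0.1]))   # about 1.157 bits per symbol
```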
48 00:03:13,560 --> 00:03:16,260 And what we wanted to do was take those symbols 49 00:03:16,260 --> 00:03:24,450 and compress them to binary digits. 50 00:03:30,740 --> 00:03:32,780 OK, so we were going to-- 51 00:03:32,780 --> 00:03:34,790 you can compress them to other things. 52 00:03:34,790 --> 00:03:36,957 We were going to think of compressing them to binary 53 00:03:36,957 --> 00:03:41,000 digits because we're thinking of a channel that can take 0s and 1s 54 00:03:41,000 --> 00:03:43,710 or a signal that's in two possible states. 55 00:03:43,710 --> 00:03:46,820 So what we wanted to do was take each symbol or sequence 56 00:03:46,820 --> 00:03:50,640 of symbols and code it in the form of binary digits. 57 00:03:50,640 --> 00:03:51,140 Right? 58 00:03:53,960 --> 00:03:57,530 Now, each binary digit can, at most, 59 00:03:57,530 --> 00:03:59,300 carry one bit of information. 60 00:03:59,300 --> 00:04:03,525 If the binary digit is equally likely to be a 0 or a 1, 61 00:04:03,525 --> 00:04:05,150 then it carries one bit of information. 62 00:04:05,150 --> 00:04:07,400 So that tells you really that if you're going 63 00:04:07,400 --> 00:04:10,760 to code this, the code length-- 64 00:04:13,530 --> 00:04:17,190 let's see-- compress to binary digits, let's say, or encode. 65 00:04:20,140 --> 00:04:22,500 And what we need is the expected code length. 66 00:04:29,960 --> 00:04:37,880 L should be greater than or equal to H. So 67 00:04:37,880 --> 00:04:42,680 you need to transmit at least this many binary digits 68 00:04:42,680 --> 00:04:44,840 on average to convey the information that's 69 00:04:44,840 --> 00:04:46,430 coming out of the source-- 70 00:04:46,430 --> 00:04:50,015 per symbol or per time step. 71 00:04:50,015 --> 00:04:51,640 All right, so that was the basic setup. 72 00:04:56,040 --> 00:05:00,210 I've given you one of these bounds here. 73 00:05:00,210 --> 00:05:02,010 When we talked about codes, by the way, 74 00:05:02,010 --> 00:05:06,240 we decided that if we're talking about binary codes, 75 00:05:06,240 --> 00:05:12,670 we want to limit ourselves to what are called instantaneously 76 00:05:12,670 --> 00:05:19,420 decodable or prefix-free codes. 77 00:05:19,420 --> 00:05:21,220 And these are codes that correspond 78 00:05:21,220 --> 00:05:24,940 to the leaves of a code tree. 79 00:05:24,940 --> 00:05:27,820 So we had examples of this type. 80 00:05:27,820 --> 00:05:29,860 You want your symbols to be associated 81 00:05:29,860 --> 00:05:34,000 with the leaves of-- the end of the tree, not 82 00:05:34,000 --> 00:05:35,660 intermediate points. 83 00:05:35,660 --> 00:05:38,200 The reason being that, as you work 84 00:05:38,200 --> 00:05:40,990 your way down the tree-- 85 00:05:40,990 --> 00:05:44,800 by the way, I'm assuming that this picture makes sense 86 00:05:44,800 --> 00:05:48,520 to you in some fashion from recitation. 87 00:05:48,520 --> 00:05:50,800 But as you work your way down to the symbol, 88 00:05:50,800 --> 00:05:52,940 you don't encounter any other symbols on the way. 89 00:05:52,940 --> 00:05:54,398 So as soon as you hit the leaf, you 90 00:05:54,398 --> 00:05:55,840 know what symbol you've got. 91 00:05:55,840 --> 00:05:58,330 So we're limiting ourselves to codes 92 00:05:58,330 --> 00:06:00,190 of that type because some of the statements 93 00:06:00,190 --> 00:06:03,050 I make are not true if you don't have codes of this type. 94 00:06:03,050 --> 00:06:06,250 So I won't comment on that again.
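To make the prefix-free condition concrete, here is a small sketch that checks whether a candidate codebook has the property; the two codebooks shown are illustrative.

```python
def is_prefix_free(codewords):
    # Instantaneously decodable means no codeword is a prefix of another,
    # i.e. every codeword sits at a leaf of the code tree.
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free(["0", "10", "110", "111"]))  # True: all codewords are leaves
print(is_prefix_free(["0", "01", "11"]))          # False: "0" is a prefix of "01"
```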
95 00:06:06,250 --> 00:06:11,380 All right, so we've got that, the first inequality 96 00:06:11,380 --> 00:06:13,210 that I've put up there. 97 00:06:13,210 --> 00:06:16,000 And it turns out that Shannon showed 98 00:06:16,000 --> 00:06:21,078 how to actually construct codes that will give you 99 00:06:21,078 --> 00:06:22,120 a bound on the other side. 100 00:06:22,120 --> 00:06:26,840 Let me actually write it the way it is on the slide. 101 00:06:26,840 --> 00:06:30,730 So Shannon showed how to get codes that satisfy this-- so 102 00:06:30,730 --> 00:06:36,430 you can get codes satisfying this. 103 00:06:40,060 --> 00:06:41,830 So Shannon showed how to get within one 104 00:06:41,830 --> 00:06:44,740 of the lower bound in terms of the expected length 105 00:06:44,740 --> 00:06:45,260 of the code. 106 00:06:45,260 --> 00:06:47,860 So that was pretty good. 107 00:06:47,860 --> 00:06:51,250 But after coming up with this paper in '48 108 00:06:51,250 --> 00:06:54,760 and working on this for a while, neither he nor other luminaries 109 00:06:54,760 --> 00:06:58,300 in the field had found how to get the best such code, 110 00:06:58,300 --> 00:07:00,130 and that's what Huffman ended up doing. 111 00:07:00,130 --> 00:07:05,380 So we've talked about that already. 112 00:07:05,380 --> 00:07:07,900 OK, so Huffman showed how to get a code 113 00:07:07,900 --> 00:07:10,690 of minimum expected length per symbol 114 00:07:10,690 --> 00:07:12,632 with a very simple construction. 115 00:07:15,940 --> 00:07:22,890 Now, you can actually extend Huffman-- 116 00:07:22,890 --> 00:07:26,550 and maybe you talked about this in recitation as well. 117 00:07:26,550 --> 00:07:28,050 So you can code per symbol, or you 118 00:07:28,050 --> 00:07:31,890 can decide you're going to create super-symbols. 119 00:07:31,890 --> 00:07:37,590 Take the same source, but say that the symbols that it emits 120 00:07:37,590 --> 00:07:41,110 are the symbols from here grouped two at a time. 121 00:07:41,110 --> 00:07:43,650 So you're going to take the symbol emitted 122 00:07:43,650 --> 00:07:45,510 at some particular time and then the symbol 123 00:07:45,510 --> 00:07:48,930 at the following time and call that a super-symbol. 124 00:07:48,930 --> 00:07:51,750 And then take the next pair, and that's 125 00:07:51,750 --> 00:07:52,962 a super-symbol and so on. 126 00:07:52,962 --> 00:07:54,420 So you're doing the Huffman coding, 127 00:07:54,420 --> 00:07:57,225 but on pairs of symbols. 128 00:07:57,225 --> 00:08:00,480 So you can go through the same kind of construction. 129 00:08:00,480 --> 00:08:04,410 If you're assuming an iid source, then the probability 130 00:08:04,410 --> 00:08:07,983 of a paired super-symbol is easy to compute. 131 00:08:07,983 --> 00:08:09,900 It's just the product of the probabilities of the individual ones 132 00:08:09,900 --> 00:08:12,610 because they're independently emitted. 133 00:08:12,610 --> 00:08:16,530 And then the entropy of the resulting source 134 00:08:16,530 --> 00:08:19,440 here turns out to be twice the entropy of the source 135 00:08:19,440 --> 00:08:22,330 here because these are independent emissions, 136 00:08:22,330 --> 00:08:24,450 so the entropies will just add. 137 00:08:24,450 --> 00:08:28,200 So you can do the Huffman construction again. 138 00:08:28,200 --> 00:08:30,450 And what you discover is the same kind of thing 139 00:08:30,450 --> 00:08:37,830 except this is now the inequality, right?
140 00:08:37,830 --> 00:08:39,000 And the reason is-- 141 00:08:39,000 --> 00:08:42,750 well, here L is still the expected length per symbol. 142 00:08:42,750 --> 00:08:45,210 But you're doing pairs now, so the expected length 143 00:08:45,210 --> 00:08:47,310 for the pair is 2L. 144 00:08:47,310 --> 00:08:48,570 Right? 145 00:08:48,570 --> 00:08:50,670 The lower bound is the entropy of the source. 146 00:08:50,670 --> 00:08:52,110 That's 2H. 147 00:08:52,110 --> 00:08:55,090 The upper bound is the entropy of that source plus 1. 148 00:08:55,090 --> 00:08:56,940 So you can construct a code of that type. 149 00:08:56,940 --> 00:08:59,707 You can do it with Shannon's construction or Huffman's. 150 00:08:59,707 --> 00:09:01,290 And now see what you've managed to do. 151 00:09:01,290 --> 00:09:05,610 You've got a little tighter of a squeeze on the expected length. 152 00:09:14,100 --> 00:09:19,040 So we've gone from H plus 1 to H plus 1/2 153 00:09:19,040 --> 00:09:20,810 with this construction. 154 00:09:20,810 --> 00:09:25,910 If you took triples, this would just change to 1 over 3. 155 00:09:25,910 --> 00:09:29,030 If you took K-tuples, you'd get 1 over K. 156 00:09:29,030 --> 00:09:31,970 So if you encode larger and larger blocks, 157 00:09:31,970 --> 00:09:33,680 you can squeeze the expected length 158 00:09:33,680 --> 00:09:37,140 down to essentially what the entropy bound tells you. 159 00:09:41,250 --> 00:09:44,670 Now, Huffman-- you've spent time in recitation. 160 00:09:44,670 --> 00:09:48,380 I just thought I would quickly run through an example 161 00:09:48,380 --> 00:09:53,520 so that you have this fresh in your minds. 162 00:09:53,520 --> 00:09:56,440 So we start off with a set of symbols. 163 00:09:56,440 --> 00:09:58,890 This is kind of weak, but I hope you can see it. 164 00:09:58,890 --> 00:10:03,060 A set of symbols, A through D in this case, with probabilities 165 00:10:03,060 --> 00:10:04,560 associated with them. 166 00:10:04,560 --> 00:10:06,930 The Huffman process is to first sort 167 00:10:06,930 --> 00:10:09,570 these symbols in descending order of probability. 168 00:10:09,570 --> 00:10:11,580 So that's what I really start with. 169 00:10:11,580 --> 00:10:13,290 You take the two smallest ones and lump 170 00:10:13,290 --> 00:10:19,548 them together to get a paired symbol, rearrange, reorder. 171 00:10:19,548 --> 00:10:21,090 And then you do the same thing again. 172 00:10:21,090 --> 00:10:24,210 You take the two, combine them, reorder. 173 00:10:24,210 --> 00:10:27,660 Take the two smallest ones, combine them, reorder. 174 00:10:27,660 --> 00:10:31,440 And that's what you have for your reduction phase. 175 00:10:31,440 --> 00:10:33,210 And then you start to trace back. 176 00:10:33,210 --> 00:10:37,380 So when you trace back, you can pick the upper one to be 0, 177 00:10:37,380 --> 00:10:39,060 the lower one to be 1. 178 00:10:39,060 --> 00:10:41,950 And then every time you get a bifurcation, as you go back, 179 00:10:41,950 --> 00:10:46,110 you'll pick the upper one to be 0 and the lower one to be 1. 180 00:10:46,110 --> 00:10:48,990 And you start to build up your code word, right? 181 00:10:48,990 --> 00:10:51,070 So this one traces back. 182 00:10:51,070 --> 00:10:52,730 There's no bifurcation. 183 00:10:52,730 --> 00:10:53,620 This traces back. 184 00:10:53,620 --> 00:10:56,430 The 0 becomes 0001. 185 00:10:56,430 --> 00:10:59,410 And you go all the way like that. 186 00:10:59,410 --> 00:10:59,910 OK? 187 00:10:59,910 --> 00:11:06,475 So trace back-- let's see.
188 00:11:06,475 --> 00:11:07,430 Oh, was there a-- 189 00:11:07,430 --> 00:11:07,930 yeah. 190 00:11:07,930 --> 00:11:10,657 So the 1 here becomes a 1 0 and a 1 1. 191 00:11:10,657 --> 00:11:12,740 And then at the next step, you're all the way back 192 00:11:12,740 --> 00:11:15,590 with the Huffman code. 193 00:11:15,590 --> 00:11:16,640 Right? 194 00:11:16,640 --> 00:11:19,580 So that's the Huffman code for that set of symbols. 195 00:11:19,580 --> 00:11:20,960 It's a Huffman code. 196 00:11:20,960 --> 00:11:23,620 I shouldn't say the Huffman code because, if you notice, 197 00:11:23,620 --> 00:11:26,480 at various stages we had probabilities 198 00:11:26,480 --> 00:11:30,920 that were identical, like over here and over here and over 199 00:11:30,920 --> 00:11:31,700 here. 200 00:11:31,700 --> 00:11:34,790 And we could have chosen how to order these things 201 00:11:34,790 --> 00:11:37,787 and then how to do the subsequent grouping. 202 00:11:37,787 --> 00:11:39,620 And all of those will give you Huffman codes 203 00:11:39,620 --> 00:11:41,720 with the same minimum expected length. 204 00:11:47,270 --> 00:11:47,770 All right. 205 00:11:53,062 --> 00:11:55,270 All right, I want to give you another way of thinking 206 00:11:55,270 --> 00:11:59,245 about entropy and why it enters into coding. 207 00:12:02,470 --> 00:12:04,150 And here's the basic idea. 208 00:12:04,150 --> 00:12:07,570 All right, so we're still thinking about the source 209 00:12:07,570 --> 00:12:09,640 emitting independent symbols. 210 00:12:09,640 --> 00:12:11,630 It's an iid source. 211 00:12:11,630 --> 00:12:13,750 And we've got a very long string of emissions. 212 00:12:13,750 --> 00:12:23,910 So we've got a very long string of symbols emitted, 213 00:12:23,910 --> 00:12:31,800 maybe S1 at the first time, S17 here, S2 here, and so on. 214 00:12:31,800 --> 00:12:34,100 And the question is, in a very long string of symbols, 215 00:12:34,100 --> 00:12:37,130 how many times do you expect to see symbol S1? 216 00:12:37,130 --> 00:12:39,793 How many times do you expect to see symbol S2, and so on? 217 00:12:39,793 --> 00:12:41,210 Well, if you actually work it out, 218 00:12:41,210 --> 00:12:43,340 it turns out that the expected number 219 00:12:43,340 --> 00:12:56,150 of times we see SI in the K symbols 220 00:12:56,150 --> 00:13:01,260 is K times the probability of seeing SI. 221 00:13:01,260 --> 00:13:03,030 So it's what you'd expect. 222 00:13:03,030 --> 00:13:03,770 All right? 223 00:13:03,770 --> 00:13:06,140 So the expected number of times is that. 224 00:13:06,140 --> 00:13:09,470 Well, but that doesn't tell you the number of times 225 00:13:09,470 --> 00:13:12,117 you'll actually see it in any given experiment. 226 00:13:12,117 --> 00:13:14,450 We know that you need to think about standard deviations 227 00:13:14,450 --> 00:13:16,050 as well. 228 00:13:16,050 --> 00:13:24,910 So what this is saying is, for instance, for symbol SI, 229 00:13:24,910 --> 00:13:32,382 that we expect to get that many of symbol SI. 230 00:13:32,382 --> 00:13:34,340 But actually, there's a distribution around it. 231 00:13:34,340 --> 00:13:39,050 So you'll get a little histogram here. 232 00:13:39,050 --> 00:13:41,990 I'm not making any attempt to draw it very carefully, 233 00:13:41,990 --> 00:13:43,958 but there's a distribution. 234 00:13:43,958 --> 00:13:45,500 You run different experiments, you're 235 00:13:45,500 --> 00:13:50,300 going to get different numbers of SI in that run of K. Right?
236 00:13:50,300 --> 00:13:51,510 So there's a distribution. 237 00:13:51,510 --> 00:13:53,450 And it turns out you can actually 238 00:13:53,450 --> 00:13:56,360 write an explicit formula for the standard deviation. 239 00:14:02,820 --> 00:14:05,960 This is something you'll see if you do a probability course. 240 00:14:05,960 --> 00:14:08,540 It's actually very simple-- the square root of K times P i times 1 minus P i. 241 00:14:11,730 --> 00:14:13,170 So that's the standard deviation. 242 00:14:13,170 --> 00:14:18,750 So the standard deviation goes as root K. 243 00:14:18,750 --> 00:14:21,510 So the interesting thing is that the standard deviation, 244 00:14:21,510 --> 00:14:24,270 as a fraction of the expected number of successes, 245 00:14:24,270 --> 00:14:27,270 gets smaller and smaller 246 00:14:27,270 --> 00:14:29,490 as K becomes larger and larger. 247 00:14:29,490 --> 00:14:34,560 Or another way to see that is, if I normalize this, 248 00:14:34,560 --> 00:14:37,680 so I'm going to look at the number of successes 249 00:14:37,680 --> 00:14:43,990 divided by K. This histogram is going to cluster around P i. 250 00:14:47,340 --> 00:14:52,453 And the standard deviation now, because I've divided by K, 251 00:14:52,453 --> 00:14:54,120 actually ends up being the square root of 252 00:14:54,120 --> 00:15:02,040 P i times 1 minus P i, divided by the square root of K. All right? 253 00:15:04,800 --> 00:15:09,450 So what this says is if you get a run of K 254 00:15:09,450 --> 00:15:11,670 emissions of the symbol and you try to estimate 255 00:15:11,670 --> 00:15:16,560 the probability P i by taking the ratio of the number of times SI appears 256 00:15:16,560 --> 00:15:18,570 to the total run length K, you'll 257 00:15:18,570 --> 00:15:23,070 actually get a little spread here centered on P i. 258 00:15:23,070 --> 00:15:25,747 But the spread actually goes down as 1 over root K. 259 00:15:25,747 --> 00:15:28,330 So this is really what the law of large numbers is telling us. 260 00:15:28,330 --> 00:15:30,610 It's telling us that if you take a very long run, 261 00:15:30,610 --> 00:15:35,880 you almost certainly get a number of successes close to the expected number-- 262 00:15:35,880 --> 00:15:37,830 well, Kp i in this case. 263 00:15:37,830 --> 00:15:39,570 It's very tightly concentrated. 264 00:15:42,410 --> 00:15:44,910 All right, we don't want you to remember all these formulas. 265 00:15:44,910 --> 00:15:46,110 I have them on the slides. 266 00:15:46,110 --> 00:15:48,010 It's just there for fun. 267 00:15:48,010 --> 00:15:50,580 There's something else that I put on there 268 00:15:50,580 --> 00:15:51,900 that you can try out for fun. 269 00:15:51,900 --> 00:15:55,560 I don't want to talk through it, but you can use exactly this 270 00:15:55,560 --> 00:15:58,500 to analyze things like polling. 271 00:15:58,500 --> 00:16:02,460 Why is it that you can poll 2,500 people 272 00:16:02,460 --> 00:16:04,170 and say that I've got a margin of error 273 00:16:04,170 --> 00:16:07,770 of 1% as to how the election is going to turn out? 274 00:16:07,770 --> 00:16:10,778 Well, the answer is, actually, in exactly this. 275 00:16:10,778 --> 00:16:12,820 If we have time at the end, I'll come back to it. 276 00:16:12,820 --> 00:16:17,500 But it's easy enough that you can look at it yourself. 277 00:16:17,500 --> 00:16:20,730 So let's focus on what it is I wanted to show you. 278 00:16:25,150 --> 00:16:28,970 I picked Obama 0.55, but that was just as an illustration. 279 00:16:28,970 --> 00:16:29,670 [LAUGHTER] 280 00:16:29,670 --> 00:16:31,710 No. 281 00:16:31,710 --> 00:16:35,070 No political views to be imputed to that.
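Here is a quick simulation of that 1-over-root-K concentration, which is also the arithmetic behind the polling claim; the 0.55 and the sample sizes are illustrative choices, not figures from the lecture.

```python
import random

def empirical_spread(p, k, trials=1000):
    # Run many experiments of k iid draws each, and measure how much the
    # observed fraction of successes scatters around the true p.
    fracs = [sum(random.random() < p for _ in range(k)) / k
             for _ in range(trials)]
    mean = sum(fracs) / trials
    sd = (sum((f - mean) ** 2 for f in fracs) / trials) ** 0.5
    return mean, sd

for k in (100, 2500):
    mean, sd = empirical_spread(0.55, k)
    predicted = (0.55 * 0.45 / k) ** 0.5   # sqrt(P(1 - P) / K)
    print(k, round(mean, 3), round(sd, 4), round(predicted, 4))
# At k = 2500 the predicted spread is about 0.01: the 1% margin of error.
```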
282 00:16:35,070 --> 00:16:37,650 All right, so what we're saying is you've 283 00:16:37,650 --> 00:16:43,200 got K emissions of this symbol. 284 00:16:43,200 --> 00:16:45,210 And with very high probability, you've 285 00:16:45,210 --> 00:16:56,260 got Kp1 of S1, Kp2 of S2, and so on. 286 00:16:56,260 --> 00:16:58,350 So this is really what you're expecting to get, 287 00:16:58,350 --> 00:17:05,030 provided you've tossed this a large number of times. 288 00:17:05,030 --> 00:17:09,319 What's the probability of getting a sequence that has Kp1 289 00:17:09,319 --> 00:17:13,349 of S1, Kp2 of S2, and so on? 290 00:17:13,349 --> 00:17:17,780 So you've got to get S1 in Kp1 positions. 291 00:17:17,780 --> 00:17:19,640 What's the probability of that? 292 00:17:19,640 --> 00:17:23,480 And you've got to get S2 in Kp2 positions. 293 00:17:23,480 --> 00:17:27,109 So how do you work out those probabilities? 294 00:17:27,109 --> 00:17:29,220 We're invoking independence of all the emissions. 295 00:17:29,220 --> 00:17:31,560 So you can multiply probabilities. 296 00:17:31,560 --> 00:17:35,690 So what you have is S1 occurring with probability 297 00:17:35,690 --> 00:17:39,830 P1 to the power Kp1, because P1 is 298 00:17:39,830 --> 00:17:41,640 the probability with which S1 occurs, 299 00:17:41,640 --> 00:17:43,370 and it's happening Kp1 times. 300 00:17:43,370 --> 00:17:51,410 So you take it to that power, and then P2 to the Kp2, 301 00:17:51,410 --> 00:17:55,660 all the way up to Pn to the Kpn. 302 00:17:55,660 --> 00:17:56,540 OK? 303 00:17:56,540 --> 00:18:03,140 So this is the probability of getting a sequence like this. 304 00:18:05,417 --> 00:18:07,750 And what we've said is this is the only kind of sequence 305 00:18:07,750 --> 00:18:09,160 you're typically going to get. 306 00:18:09,160 --> 00:18:12,920 All the rest have very low probability of occurrence. 307 00:18:12,920 --> 00:18:15,190 So it must be that when I add up all these sequences, 308 00:18:15,190 --> 00:18:17,920 I get, essentially, probability 1. 309 00:18:17,920 --> 00:18:21,760 So the question then is how many such sequences are there. 310 00:18:21,760 --> 00:18:23,800 If a single sequence of this type 311 00:18:23,800 --> 00:18:27,690 has this probability, and the only sequences I'm going to get 312 00:18:27,690 --> 00:18:31,290 are sequences of this type effectively, 313 00:18:31,290 --> 00:18:33,220 and the probabilities have to sum to 1, 314 00:18:33,220 --> 00:18:35,680 then how many sequences do I have of this type? 315 00:18:38,740 --> 00:18:40,825 Do you agree that it's 1 over the probability? 316 00:18:43,402 --> 00:18:44,610 The number of such sequences? 317 00:18:44,610 --> 00:18:49,620 Because the number of sequences times 318 00:18:49,620 --> 00:18:53,420 this individual probability has to come out to be 1. 319 00:18:53,420 --> 00:18:54,060 Right? 320 00:18:54,060 --> 00:18:57,012 The number of sequences-- let me write this down. 321 00:18:57,012 --> 00:18:58,470 So that you see it a little better. 322 00:19:07,800 --> 00:19:12,240 The number of such-- 323 00:19:12,240 --> 00:19:14,130 let me call them typical sequences-- 324 00:19:18,550 --> 00:19:21,700 times the probability of any such sequence 325 00:19:21,700 --> 00:19:24,560 has got to be approximately 1. 326 00:19:24,560 --> 00:19:27,420 I say approximately because there are a few other sequences 327 00:19:27,420 --> 00:19:28,740 whose probabilities I would 328 00:19:28,740 --> 00:19:30,430 have to take account of.
329 00:19:30,430 --> 00:19:32,210 But this is essentially it. 330 00:19:32,210 --> 00:19:35,300 So the number of such sequences is 1 over this number. 331 00:19:35,300 --> 00:19:43,530 So the number of such sequences is P1 to the minus Kp1, 332 00:19:43,530 --> 00:19:51,520 times P2 to the minus Kp2, and so on. 333 00:19:56,600 --> 00:19:58,297 That's the number of such sequences. 334 00:19:58,297 --> 00:20:00,130 And essentially, these are all the sequences 335 00:20:00,130 --> 00:20:01,047 that I'm going to get. 336 00:20:04,120 --> 00:20:06,070 Well, if I take the log of this-- 337 00:20:11,210 --> 00:20:13,190 just visualize how the log works. 338 00:20:13,190 --> 00:20:15,850 Now I've got the log of a product, 339 00:20:15,850 --> 00:20:18,480 so that's going to be a sum of the individual logs. 340 00:20:18,480 --> 00:20:20,300 I've got the log of a power of something, 341 00:20:20,300 --> 00:20:24,340 so the power will come down to multiply the log. 342 00:20:24,340 --> 00:20:29,870 This comes out to be K times H of S exactly. 343 00:20:29,870 --> 00:20:33,410 OK, so the log of the number of sequences 344 00:20:33,410 --> 00:20:37,940 is K times H of S, K times the entropy. 345 00:20:40,930 --> 00:20:44,310 This is log to the base 2. 346 00:20:44,310 --> 00:20:52,050 So the number of sequences is equal to 2 to the KH. 347 00:20:52,050 --> 00:20:53,225 I'm saying equal to. 348 00:20:53,225 --> 00:20:54,600 I should be putting approximately 349 00:20:54,600 --> 00:20:57,700 equal to signs everywhere, but you get the idea. 350 00:20:57,700 --> 00:21:03,150 So the number of typical sequences is 2 to the KH. 351 00:21:03,150 --> 00:21:07,290 How many binary digits does it take to count 2 to the KH things? 352 00:21:11,210 --> 00:21:12,760 KH, right? 353 00:21:12,760 --> 00:21:13,915 So what I need is-- 354 00:21:16,954 --> 00:21:26,890 so I just need K times H of S binary digits to count the typical sequences. 355 00:21:34,980 --> 00:21:37,935 So how many binary digits do I need per symbol? 356 00:21:40,790 --> 00:21:42,500 It's just that divided by K because I've 357 00:21:42,500 --> 00:21:44,370 got a string of K symbols. 358 00:21:44,370 --> 00:21:48,450 So I need a number of binary digits equal to the entropy. 359 00:21:48,450 --> 00:21:51,320 So this is a quick way of seeing that entropy 360 00:21:51,320 --> 00:21:54,950 is very relevant to minimal coding of sequences of outputs 361 00:21:54,950 --> 00:21:57,330 from a source like this. 362 00:21:57,330 --> 00:22:01,500 All right, now I swept a lot of math under the rug. 363 00:22:01,500 --> 00:22:04,650 The math that makes this rigorous exists. 364 00:22:04,650 --> 00:22:08,490 We don't want to have any part of it here. 365 00:22:08,490 --> 00:22:10,110 But for those of you who are inclined, 366 00:22:10,110 --> 00:22:14,390 you can look in a book on information theory. 367 00:22:14,390 --> 00:22:15,530 There's a nice name for it. 368 00:22:15,530 --> 00:22:24,740 It's called the Asymptotic 369 00:22:24,740 --> 00:22:32,220 Equipartition Property. 370 00:22:32,220 --> 00:22:34,300 OK? 371 00:22:34,300 --> 00:22:37,010 It's basically saying that, asymptotically, the probability 372 00:22:37,010 --> 00:22:39,140 partitions into equal probabilities for all 373 00:22:39,140 --> 00:22:41,380 these typical sequences. 374 00:22:41,380 --> 00:22:41,880 All right. 375 00:22:45,560 --> 00:22:52,430 So all that is for Huffman and its application 376 00:22:52,430 --> 00:22:57,680 to symbols emitted independently by a source over time.
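As a numerical illustration of this counting argument, here is a sketch with a made-up two-symbol source; it shows how few binary digits per symbol the typical set needs, and how tiny that set is compared with all 2 to the K strings.

```python
from math import log2

def entropy(probs):
    # H = sum of P_i * log2(1 / P_i).
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Illustrative source: two symbols with probabilities 0.9 and 0.1.
H = entropy([0.9, 0.1])
K = 100

print(H)                 # about 0.469 bits per symbol
print(K * H)             # about 46.9 binary digits cover the typical set
print(2 ** (K * H - K))  # typical set / all 2^K strings: about 1e-16
```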
377 00:22:57,680 --> 00:22:59,960 But there are limitations to this. 378 00:22:59,960 --> 00:23:05,270 We've been working with Huffman coding under the assumption 379 00:23:05,270 --> 00:23:07,340 that the probabilities are given to us. 380 00:23:07,340 --> 00:23:10,010 But it's typically the case that the probabilities are not 381 00:23:10,010 --> 00:23:15,170 known for some arbitrary source that you're trying to code for. 382 00:23:15,170 --> 00:23:17,420 The probabilities might change with time as the source 383 00:23:17,420 --> 00:23:18,830 characteristics change. 384 00:23:18,830 --> 00:23:22,040 So you would need to detect that and recode, 385 00:23:22,040 --> 00:23:24,430 if you're going to do Huffman. 386 00:23:24,430 --> 00:23:26,990 And then the more important point 387 00:23:26,990 --> 00:23:30,350 perhaps is that sources are generally not iid. 388 00:23:30,350 --> 00:23:32,690 The sources of interest are not really 389 00:23:32,690 --> 00:23:36,950 generating independent identically 390 00:23:36,950 --> 00:23:38,870 distributed symbols. 391 00:23:38,870 --> 00:23:42,810 What's perhaps more true is that-- 392 00:23:42,810 --> 00:23:43,310 let's see. 393 00:23:43,310 --> 00:23:50,600 Oh, here-- once you're done compressing your source 394 00:23:50,600 --> 00:23:52,910 to binary digits where each binary digit carries 395 00:23:52,910 --> 00:23:54,950 a bit of information, then you've 396 00:23:54,950 --> 00:24:02,210 got something that essentially is not correlated over time. 397 00:24:02,210 --> 00:24:04,400 You've managed to kind of decouple it. 398 00:24:04,400 --> 00:24:08,660 But before that compression, these symbols are not really independent 399 00:24:08,660 --> 00:24:10,890 in typical cases of interest. 400 00:24:10,890 --> 00:24:15,110 So one important case, of course, is just English text. 401 00:24:15,110 --> 00:24:17,810 You can still code it symbol by symbol, 402 00:24:17,810 --> 00:24:19,910 but it's a very inefficient coding. 403 00:24:19,910 --> 00:24:22,420 If you wanted to do it symbol by symbol, 404 00:24:22,420 --> 00:24:24,140 let's just ignore uppercase. 405 00:24:24,140 --> 00:24:27,010 You've got 26 letters plus a space. 406 00:24:27,010 --> 00:24:30,110 So that's 27 symbols. 407 00:24:30,110 --> 00:24:32,810 Well, you could certainly code that with five binary digits 408 00:24:32,810 --> 00:24:36,050 because that would give you 32 things to count. 409 00:24:36,050 --> 00:24:38,630 You can do better with a code that 410 00:24:38,630 --> 00:24:40,280 approaches the entropy associated 411 00:24:40,280 --> 00:24:42,710 with a source of this type. 412 00:24:42,710 --> 00:24:45,730 That would be 4.755 bits. 413 00:24:45,730 --> 00:24:50,780 OK, so if you ignored dependence in English text 414 00:24:50,780 --> 00:24:54,620 and just treated each symbol as equally likely, 415 00:24:54,620 --> 00:24:56,152 you'd say that that's the entropy, 416 00:24:56,152 --> 00:24:58,610 and you could attempt to code it with something approaching 417 00:24:58,610 --> 00:24:59,110 that. 418 00:24:59,110 --> 00:25:02,310 But actually, not all symbols are equally likely. 419 00:25:02,310 --> 00:25:04,700 If you look at a typical distribution of frequencies-- 420 00:25:04,700 --> 00:25:07,010 and we saw this with Morse already. 421 00:25:07,010 --> 00:25:12,660 The E is much more common, then T, then A, O, I, N, and so on. 422 00:25:12,660 --> 00:25:16,160 So there is a distribution to this.
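To see where figures like these come from, here is a sketch that computes both the equally-likely number and a frequency-weighted entropy; the letter frequencies below are commonly quoted approximate values for English, included purely as an illustration (the figure quoted next in the lecture presumably comes from its own table, which would also include the space).

```python
from math import log2

# Approximate relative frequencies of English letters, in percent.
# Illustrative values only -- not the table used in the lecture.
freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
        'q': 0.10, 'z': 0.07}

total = sum(freq.values())
probs = [f / total for f in freq.values()]

print(log2(27))                             # about 4.755: 27 equally likely symbols
print(sum(p * log2(1 / p) for p in probs))  # about 4.2: weighting by frequency helps
```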
423 00:25:16,160 --> 00:25:20,300 But you can take account of that distribution and compute 424 00:25:20,300 --> 00:25:23,960 the associated entropy, and you get something a little bit 425 00:25:23,960 --> 00:25:27,590 smaller, 4.177 instead of the 4.7-something that we 426 00:25:27,590 --> 00:25:29,480 had before. 427 00:25:29,480 --> 00:25:31,610 Because not all letters are equally likely. 428 00:25:31,610 --> 00:25:35,720 But this is still thinking of it symbol by symbol, 429 00:25:35,720 --> 00:25:38,390 not recognizing dependence over time. 430 00:25:41,640 --> 00:25:45,230 But English and other languages are full of context. 431 00:25:45,230 --> 00:25:45,730 Right? 432 00:25:45,730 --> 00:25:50,260 If you know the preceding part of the text, 433 00:25:50,260 --> 00:25:55,100 you have a very good way to guess the next letter. 434 00:25:55,100 --> 00:25:57,070 Nothing can be said to be certain except death 435 00:25:57,070 --> 00:25:58,528 and-- well, you can-- in this case, 436 00:25:58,528 --> 00:26:00,260 you can give me the next three letters. 437 00:26:00,260 --> 00:26:01,330 Right? 438 00:26:01,330 --> 00:26:02,060 Anyone? 439 00:26:02,060 --> 00:26:02,430 AUDIENCE: It's taxes. 440 00:26:02,430 --> 00:26:03,388 PROFESSOR: Taxes, yeah. 441 00:26:07,440 --> 00:26:10,520 So even though X taken in isolation 442 00:26:10,520 --> 00:26:12,740 has a very low probability of occurrence, 443 00:26:12,740 --> 00:26:15,290 if you look at the histogram on the previous page, 444 00:26:15,290 --> 00:26:19,190 you see that the probability is 0.0017. 445 00:26:19,190 --> 00:26:21,273 Letters are not independently generated. 446 00:26:21,273 --> 00:26:23,690 Now, it turns out Shannon was actually one of the earliest 447 00:26:23,690 --> 00:26:27,300 to study this in experiments with his wife. 448 00:26:27,300 --> 00:26:30,680 He presented her with bits of text 449 00:26:30,680 --> 00:26:32,145 from one particular book and asked 450 00:26:32,145 --> 00:26:33,770 her to guess the next letter and so on. 451 00:26:33,770 --> 00:26:37,820 And he had a 1951 paper that actually launched 452 00:26:37,820 --> 00:26:39,680 a lot of this, because he had now developed 453 00:26:39,680 --> 00:26:42,170 the tools for talking about it. 454 00:26:42,170 --> 00:26:45,100 His estimate was much lower than the 4-point-something. 455 00:26:45,100 --> 00:26:49,280 It was more in the vicinity of one bit, 1 to 1.5 bits. 456 00:26:49,280 --> 00:26:55,100 So there's a lot of compression possible with English text 457 00:26:55,100 --> 00:26:57,510 because there's this kind of a dependence here. 458 00:27:06,500 --> 00:27:09,250 And just to tell you what it is that we're 459 00:27:09,250 --> 00:27:11,980 trying to compute when we compute entropy 460 00:27:11,980 --> 00:27:14,200 for these long sequences of symbols, 461 00:27:14,200 --> 00:27:18,700 we're sort of saying what's the joint entropy of a sequence 462 00:27:18,700 --> 00:27:24,440 of K symbols divided by K in the limit of K going to infinity. 463 00:27:24,440 --> 00:27:27,490 So this is what you might call H under bar. 464 00:27:27,490 --> 00:27:28,990 It's not over bar because I couldn't 465 00:27:28,990 --> 00:27:31,030 see how to do an over bar on my PowerPoint. 466 00:27:31,030 --> 00:27:34,897 But it's usually an over bar in the books. 467 00:27:34,897 --> 00:27:36,730 But this is really the object that you would 468 00:27:36,730 --> 00:27:38,250 like to get your hands on.
469 00:27:38,250 --> 00:27:41,470 For sequential text that has context in it, 470 00:27:41,470 --> 00:27:43,510 this is the kind of entropy that you really 471 00:27:43,510 --> 00:27:46,480 would like to be working with. 472 00:27:46,480 --> 00:27:47,800 OK. 473 00:27:47,800 --> 00:27:51,130 So that brings us to an approach to coding 474 00:27:51,130 --> 00:27:53,092 that's really focused-- 475 00:27:53,092 --> 00:27:54,550 coding or compression that's really 476 00:27:54,550 --> 00:27:56,180 focused on sequential text. 477 00:27:56,180 --> 00:28:00,740 And this is the Lempel-Ziv-Welch algorithm that's in the notes. 478 00:28:00,740 --> 00:28:03,610 Turns out that Lempel and Ziv or Ziv and Lempel 479 00:28:03,610 --> 00:28:05,860 had two earlier papers. 480 00:28:05,860 --> 00:28:09,640 And then Welch improved on it in an '84 paper. 481 00:28:09,640 --> 00:28:12,700 And what's in blue over there is a bit of a mouthful. 482 00:28:12,700 --> 00:28:14,840 And each word kind of means something, 483 00:28:14,840 --> 00:28:17,320 so I thought I'd list it all there. 484 00:28:17,320 --> 00:28:19,600 Maybe I've used too many of these words-- 485 00:28:19,600 --> 00:28:23,470 universal lossless compression of sequential or streaming data 486 00:28:23,470 --> 00:28:25,630 by adaptive variable length coding. 487 00:28:25,630 --> 00:28:29,580 And I'll come to talk about those terms on the next slide. 488 00:28:29,580 --> 00:28:32,500 And it turns out that this is a very widely used compression 489 00:28:32,500 --> 00:28:35,050 algorithm for all sorts of files. 490 00:28:35,050 --> 00:28:36,880 Sometimes it's for a part of it. 491 00:28:36,880 --> 00:28:38,650 Sometimes it's optional. 492 00:28:38,650 --> 00:28:40,300 Sometimes it's combined with Huffman, 493 00:28:40,300 --> 00:28:43,870 but all of these things that do compression 494 00:28:43,870 --> 00:28:48,880 pay homage to Lempel and Ziv at least. 495 00:28:48,880 --> 00:28:50,260 They were also patented. 496 00:28:50,260 --> 00:28:54,565 Actually, Unisys owned the patent on LZW for many years. 497 00:28:54,565 --> 00:28:55,690 These have all expired now. 498 00:28:55,690 --> 00:29:00,370 But while the patents were held, it made for a lot of heartburn 499 00:29:00,370 --> 00:29:02,697 because there were things being done 500 00:29:02,697 --> 00:29:04,780 without knowledge of the existence of the patents. 501 00:29:04,780 --> 00:29:09,230 And then people got hit with lawsuits and so on. 502 00:29:09,230 --> 00:29:14,200 Jacob Ziv, again part of this incredible heritage from MIT 503 00:29:14,200 --> 00:29:17,140 of people working here in the early days of information 504 00:29:17,140 --> 00:29:17,950 theory. 505 00:29:17,950 --> 00:29:20,560 He was a graduate student here at the same time as Huffman 506 00:29:20,560 --> 00:29:23,500 and many other people whose names surface in all of this. 507 00:29:26,470 --> 00:29:29,020 I was actually at an award ceremony of the IEEE, 508 00:29:29,020 --> 00:29:32,870 where Lempel got an award for his compression work. 509 00:29:32,870 --> 00:29:36,980 And people were given a whole minute for a thank you speech, 510 00:29:36,980 --> 00:29:37,990 a mini thank you speech. 511 00:29:37,990 --> 00:29:41,860 And everyone took their minute to mention this person and that 512 00:29:41,860 --> 00:29:43,820 and talk about the origins of the work. 513 00:29:43,820 --> 00:29:45,403 It's a lot to say in a minute but they 514 00:29:45,403 --> 00:29:47,440 managed to convey a lot. 
515 00:29:47,440 --> 00:29:49,582 Lempel came up and said, "thank you." 516 00:29:49,582 --> 00:29:51,220 [LAUGHTER] 517 00:29:51,220 --> 00:29:53,220 It seemed kind of fitting for someone whose life 518 00:29:53,220 --> 00:29:54,303 is devoted to compression. 519 00:29:54,303 --> 00:29:57,790 [LAUGHTER] 520 00:29:57,790 --> 00:30:00,610 I just couldn't help but crack up there. 521 00:30:00,610 --> 00:30:04,060 That was-- all right. 522 00:30:04,060 --> 00:30:05,590 Now the interesting thing about this 523 00:30:05,590 --> 00:30:09,112 is that there are theoretical guarantees 524 00:30:09,112 --> 00:30:10,570 that, under appropriate assumptions 525 00:30:10,570 --> 00:30:14,290 on the source, asymptotically, this process will 526 00:30:14,290 --> 00:30:16,630 attain that bound. 527 00:30:16,630 --> 00:30:21,160 Now the thing is, the word asymptotically hides many sins. 528 00:30:21,160 --> 00:30:23,490 Lots of things happen at infinity that 529 00:30:23,490 --> 00:30:25,802 are not supposed to happen. 530 00:30:25,802 --> 00:30:27,760 Or lots of things happen at infinity that never 531 00:30:27,760 --> 00:30:29,290 happen when you're watching. 532 00:30:29,290 --> 00:30:31,870 So the theoretical performance perhaps 533 00:30:31,870 --> 00:30:34,720 is not as important as the fact that it works exceedingly well 534 00:30:34,720 --> 00:30:36,565 in practice. 535 00:30:36,565 --> 00:30:38,440 So we're going to talk a little bit about it. 536 00:30:38,440 --> 00:30:39,910 You've got a lab on it as well. 537 00:30:44,280 --> 00:30:48,050 So let me just say a little bit about what these words mean. 538 00:30:48,050 --> 00:30:50,150 So this is universal in the sense 539 00:30:50,150 --> 00:30:51,830 that it doesn't necessarily-- it doesn't 540 00:30:51,830 --> 00:30:54,230 need any knowledge of the particular statistics 541 00:30:54,230 --> 00:30:55,730 of the source that it's compressing. 542 00:30:55,730 --> 00:30:59,830 It's willing to try its hand at anything. 543 00:30:59,830 --> 00:31:01,790 OK? 544 00:31:01,790 --> 00:31:04,040 So it doesn't need to know the source statistics. 545 00:31:04,040 --> 00:31:05,990 It actually learns the source statistics 546 00:31:05,990 --> 00:31:09,650 in the course of implementing the algorithm. 547 00:31:09,650 --> 00:31:11,870 And it does that by actually building up 548 00:31:11,870 --> 00:31:14,300 a dictionary for strings of symbols 549 00:31:14,300 --> 00:31:16,790 that it discovers in the source text. 550 00:31:16,790 --> 00:31:21,500 So it's built around construction of a dictionary. 551 00:31:21,500 --> 00:31:23,960 What it then does is it compresses the text, 552 00:31:23,960 --> 00:31:27,950 not to things that we've seen here in Huffman, 553 00:31:27,950 --> 00:31:29,800 but actually to dictionary entries. 554 00:31:29,800 --> 00:31:32,210 So it's sort of like Morse's original idea, 555 00:31:32,210 --> 00:31:34,850 which was communicate the address in the dictionary 556 00:31:34,850 --> 00:31:36,660 rather than communicating the word itself 557 00:31:36,660 --> 00:31:38,900 or some compressed version of the word. 558 00:31:38,900 --> 00:31:40,712 So it compresses the text to sequences 559 00:31:40,712 --> 00:31:42,920 of dictionary addresses, and those are the code words 560 00:31:42,920 --> 00:31:46,650 that it sends to the receiver. 561 00:31:46,650 --> 00:31:49,380 It's also a variable length compression scheme.
562 00:31:49,380 --> 00:31:51,660 But it's interesting that it doesn't 563 00:31:51,660 --> 00:31:56,150 map fixed-length blocks of symbols to variable-length code 564 00:31:56,150 --> 00:31:56,650 words. 565 00:31:56,650 --> 00:31:58,140 It actually maps variable-length strings 566 00:31:58,140 --> 00:32:00,030 of symbols to fixed-length code words. 567 00:32:00,030 --> 00:32:01,680 So it's a little bit backwards. 568 00:32:01,680 --> 00:32:05,700 But it's still a variable length code in that sense. 569 00:32:05,700 --> 00:32:09,150 So the way this works is that the sender and the receiver 570 00:32:09,150 --> 00:32:12,210 start off with a core dictionary that they've both agreed on. 571 00:32:12,210 --> 00:32:17,790 And for our illustrations, we might 572 00:32:17,790 --> 00:32:20,100 say that they've agreed on the letters A 573 00:32:20,100 --> 00:32:30,390 through Z, lowercase A through Z. 574 00:32:30,390 --> 00:32:33,390 So what they have is these letters or this core dictionary 575 00:32:33,390 --> 00:32:35,340 stored in some register. 576 00:32:35,340 --> 00:32:39,280 Well, actually let me show you what it might look like. 577 00:32:39,280 --> 00:32:42,690 So there's the register-- let's say 578 00:32:42,690 --> 00:32:45,210 you have an 8-bit table. 579 00:32:45,210 --> 00:32:47,490 This is the dictionary that you have at both ends. 580 00:32:47,490 --> 00:32:50,550 So you can store 256 different things. 581 00:32:50,550 --> 00:32:54,000 And you've both agreed on what's going to go into those slots. 582 00:32:54,000 --> 00:32:56,700 So somewhere-- I think it's slot 97 in one 583 00:32:56,700 --> 00:32:59,820 of these particular codes-- you've got the letter A. 584 00:32:59,820 --> 00:33:01,920 And somewhere else you've got B, and so on. 585 00:33:01,920 --> 00:33:03,570 Or in the next position you've got B. 586 00:33:03,570 --> 00:33:06,840 You can store a bunch of standard symbols. 587 00:33:06,840 --> 00:33:09,090 So we'll consider that all the single letter 588 00:33:09,090 --> 00:33:14,580 symbols are already stored in designated positions 589 00:33:14,580 --> 00:33:17,880 in this dictionary that's known to the sender and the receiver. 590 00:33:17,880 --> 00:33:24,180 So if the sender just sends 252, the receiver 591 00:33:24,180 --> 00:33:27,030 knows what 252 refers to because they've 592 00:33:27,030 --> 00:33:29,963 got that core dictionary that they've agreed on. 593 00:33:29,963 --> 00:33:31,380 Some of the text here, by the way, 594 00:33:31,380 --> 00:33:34,710 is stuff I've said already. 595 00:33:34,710 --> 00:33:35,850 So I'll actually go back.
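In code, that agreed-on core dictionary is just a 256-slot table with one single-character string per slot; a minimal sketch, assuming ASCII slot positions:

```python
# Both sender and receiver start from the same 256-entry core dictionary,
# one single-character string per slot (under ASCII, 'a' lands in slot 97).
encode_table = {chr(i): i for i in range(256)}   # string -> address
decode_table = {i: chr(i) for i in range(256)}   # address -> string
next_code = 256                                  # first free slot for learned strings

print(encode_table['a'], decode_table[97])       # 97 a
```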
608 00:34:30,060 --> 00:34:34,400 So the transmitter or the source is building up the dictionary, 609 00:34:34,400 --> 00:34:39,380 looking at strings in the input sequence, 610 00:34:39,380 --> 00:34:42,080 communicating the address-- 611 00:34:42,080 --> 00:34:46,250 the addresses of the appropriate strings to the receiver, 612 00:34:46,250 --> 00:34:48,843 and the receiver is building up a dictionary in parallel. 613 00:34:48,843 --> 00:34:50,510 Now I think the easiest way to do this-- 614 00:34:50,510 --> 00:34:52,760 there's discussion in the text. 615 00:34:52,760 --> 00:34:54,050 There's also code fragments. 616 00:34:54,050 --> 00:34:55,467 But I think the easiest way for me 617 00:34:55,467 --> 00:34:57,980 to try and do this is to actually just show you 618 00:34:57,980 --> 00:35:01,900 how it works on a particular sequence. 619 00:35:04,450 --> 00:35:07,290 And you may not get all the details all at once. 620 00:35:07,290 --> 00:35:09,810 I do have a little animation that I need to tweak a bit, 621 00:35:09,810 --> 00:35:10,780 and I'll-- 622 00:35:10,780 --> 00:35:12,930 well, it's not an animation, but a set of slides 623 00:35:12,930 --> 00:35:15,030 that'll help you understand, actually, 624 00:35:15,030 --> 00:35:16,990 this particular example. 625 00:35:16,990 --> 00:35:18,510 So I'll have that posted as well. 626 00:35:22,022 --> 00:35:23,730 But for now, let's just work through this 627 00:35:23,730 --> 00:35:27,520 and see what it looks like. 628 00:35:27,520 --> 00:35:30,250 And I hope I don't trip over myself in the process. 629 00:35:30,250 --> 00:35:31,617 I hope you'll be forgiving. 630 00:35:41,150 --> 00:35:42,900 And I need these two blackboards to do it. 631 00:35:48,020 --> 00:35:48,520 OK. 632 00:35:51,570 --> 00:35:52,820 And I need some colored chalk. 633 00:35:56,320 --> 00:36:01,670 So what I'm going to have over here is the source. 634 00:36:01,670 --> 00:36:04,030 And over here is the receiver. 635 00:36:09,720 --> 00:36:16,590 And the source wants to send a message that I'll put here-- 636 00:36:16,590 --> 00:36:27,955 A-B-C. This is going to look incredibly boring. 637 00:36:31,030 --> 00:36:33,620 But the algorithm does different things at different stages, 638 00:36:33,620 --> 00:36:35,470 so that keeps it interesting. 639 00:36:35,470 --> 00:36:38,800 And let's see 1, 2, 3, 4, 5. 640 00:36:38,800 --> 00:36:42,937 And then we hit a special case somewhere near the end here 641 00:36:42,937 --> 00:36:44,020 that is worth sorting out. 642 00:36:44,020 --> 00:36:45,730 Because otherwise the fragment 643 00:36:45,730 --> 00:36:47,900 of the code that you see doesn't make sense. 644 00:36:47,900 --> 00:36:53,080 Gee, can you believe that I want to start this again here? 645 00:36:53,080 --> 00:36:53,740 Sorry. 646 00:36:53,740 --> 00:36:55,360 Let's start here. 647 00:36:55,360 --> 00:36:57,385 I want at least six replications of ABC. 648 00:37:04,823 --> 00:37:06,240 I want you to get comfortable also 649 00:37:06,240 --> 00:37:07,590 so you can settle into this. 650 00:37:11,900 --> 00:37:16,100 OK, here we go. 651 00:37:16,100 --> 00:37:18,630 All right. 652 00:37:18,630 --> 00:37:21,330 The receiver has no idea that this is the sequence. 653 00:37:21,330 --> 00:37:23,550 The source and the receiver both 654 00:37:23,550 --> 00:37:27,420 have A through Z sitting in their dictionaries 655 00:37:27,420 --> 00:37:30,970 at designated locations.
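Before the hand-worked trace, here is a minimal Python sketch of the whole scheme, encoder and decoder, including the special case that shows up near the end of the example; the function names are mine, and lowercase text stands in for the agreed core dictionary.

```python
def lzw_encode(text):
    # Shared core dictionary: every single character is pre-loaded.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    out = []
    s = ""
    for ch in text:
        if s + ch in table:
            s += ch                      # known string: keep extending it
        else:
            out.append(table[s])         # ship the address of the known part
            table[s + ch] = next_code    # learn the longer string
            next_code += 1
            s = ch                       # restart from the current character
    if s:
        out.append(table[s])             # flush the final string
    return out

def lzw_decode(codes):
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Special case: the sender used an entry it created one step
            # ago, which we haven't learned yet.  That string must start
            # with the first character of the previous string.
            entry = prev + prev[0]
        out.append(entry)
        table[next_code] = prev + entry[0]   # learn with a one-step delay
        next_code += 1
        prev = entry
    return "".join(out)

codes = lzw_encode("abcabcabcabcabcabc")
print(codes)               # 9 addresses for 18 characters
print(lzw_decode(codes))   # abcabcabcabcabcabc
```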
656 00:37:30,970 --> 00:37:37,649 So the source will first see the letter A 657 00:37:37,649 --> 00:37:40,160 and do nothing, because A is in its dictionary. 658 00:37:40,160 --> 00:37:43,010 It doesn't want to do anything yet. 659 00:37:43,010 --> 00:37:44,150 Then it looks at-- 660 00:37:44,150 --> 00:37:46,775 it pulls in B. So now it's looking at AB. 661 00:37:46,775 --> 00:37:49,700 AB is not in its dictionary because it's a symbol of-- 662 00:37:49,700 --> 00:37:52,410 it's a string of two symbols. 663 00:37:52,410 --> 00:37:55,550 So now it knows it needs to make a dictionary entry. 664 00:37:55,550 --> 00:37:58,560 I'm going to indicate dictionary entry with this. 665 00:37:58,560 --> 00:38:02,660 So the source is going to make a dictionary entry of AB. 666 00:38:02,660 --> 00:38:05,420 So what this means is somewhere in that register 667 00:38:05,420 --> 00:38:07,760 in a particular position, or in the next position 668 00:38:07,760 --> 00:38:13,490 actually from the agreed on table, it sticks in this. 669 00:38:13,490 --> 00:38:18,980 And then what it transmits to the receiver is not this, 670 00:38:18,980 --> 00:38:25,260 but the code for A. OK? 671 00:38:25,260 --> 00:38:29,280 So it enters the longer fragment here as a new dictionary word 672 00:38:29,280 --> 00:38:34,580 and sends the address for the piece that the receiver already knows. 673 00:38:34,580 --> 00:38:36,120 So what does the receiver get? 674 00:38:36,120 --> 00:38:40,140 The receiver sees A coming in and says, OK, 675 00:38:40,140 --> 00:38:44,530 that's the sequence A. That's the symbol. 676 00:38:44,530 --> 00:38:46,190 A, I'm all set. 677 00:38:46,190 --> 00:38:48,930 All right? 678 00:38:48,930 --> 00:38:54,420 Now what happens is that the source 679 00:38:54,420 --> 00:38:56,190 pulls in the next letter. 680 00:38:56,190 --> 00:38:59,040 It's done with the A, so you can essentially forget about that. 681 00:38:59,040 --> 00:39:00,900 It pulls in the next letter. 682 00:39:00,900 --> 00:39:05,280 Looks to see if it's got B-C in its dictionary. 683 00:39:05,280 --> 00:39:08,880 It doesn't have BC because it only has single letter entries, 684 00:39:08,880 --> 00:39:10,280 and it has AB. 685 00:39:10,280 --> 00:39:12,660 So it's got to put in BC. 686 00:39:12,660 --> 00:39:14,670 So it's going to put in an entry for BC. 687 00:39:19,880 --> 00:39:28,170 And then what it's going to transmit is the B. 688 00:39:28,170 --> 00:39:33,900 The receiver gets the B. Oh, sorry-- 689 00:39:33,900 --> 00:39:37,620 the dictionary entry for B. And so it knows that's the letter 690 00:39:37,620 --> 00:39:42,990 B. And now it enters AB 691 00:39:47,670 --> 00:39:51,010 in its dictionary, OK, in the next location. 692 00:39:51,010 --> 00:39:52,920 So you see, with a one-step delay, 693 00:39:52,920 --> 00:39:55,350 the AB that was in the dictionary here 694 00:39:55,350 --> 00:39:57,590 has ended up in the dictionary of the receiver. 695 00:40:00,740 --> 00:40:03,770 OK, we're done with this. 696 00:40:03,770 --> 00:40:06,830 We now pull in the next letter here. 697 00:40:06,830 --> 00:40:09,140 That's A. We haven't seen 698 00:40:09,140 --> 00:40:10,950 CA in our dictionary. 699 00:40:10,950 --> 00:40:19,550 So we make an entry for CA, ship out C. C comes here. 700 00:40:23,700 --> 00:40:29,100 I should say that this was done with the A. The C comes here, 701 00:40:29,100 --> 00:40:34,020 and the receiver knows to make an entry for BC. 702 00:40:38,630 --> 00:40:39,890 So with one delay it's got it.
703 00:40:43,160 --> 00:40:44,675 OK, we're done with this. 704 00:40:48,910 --> 00:40:51,670 We pull in the next letter, AB. 705 00:40:51,670 --> 00:40:52,910 That's in our dictionary. 706 00:40:52,910 --> 00:40:54,880 So we keep going, all right? 707 00:40:54,880 --> 00:40:59,740 So this algorithm doesn't look to ship out the dictionary 708 00:40:59,740 --> 00:41:02,718 address every time it sees a sequence that it recognizes. 709 00:41:02,718 --> 00:41:04,510 If it's got this already in its dictionary, 710 00:41:04,510 --> 00:41:07,420 it keeps going to try and learn a new word. 711 00:41:07,420 --> 00:41:09,430 So it's already got AB there, so it keeps going 712 00:41:09,430 --> 00:41:13,610 and it pulls in C. And now that's a new word. 713 00:41:13,610 --> 00:41:20,170 So it's got ABC as a new entry. 714 00:41:20,170 --> 00:41:23,050 It ships out AB-- 715 00:41:23,050 --> 00:41:24,280 the address for AB rather. 716 00:41:30,770 --> 00:41:37,650 This gets the address for AB, which is in its dictionary. 717 00:41:37,650 --> 00:41:39,105 It puts the AB down there. 718 00:41:41,760 --> 00:41:43,650 It takes the first letter of the string that 719 00:41:43,650 --> 00:41:45,280 came in and appends it to the last one 720 00:41:45,280 --> 00:41:46,905 that it had there and gives you the CA. 721 00:41:50,730 --> 00:41:53,340 So you see, it's keeping up but with a one-step delay. 722 00:41:56,220 --> 00:41:57,120 Let's keep going. 723 00:41:57,120 --> 00:42:00,590 So the AB is done with. 724 00:42:00,590 --> 00:42:04,234 We pull in A. We've got CA. 725 00:42:04,234 --> 00:42:08,260 We pull in the B. We don't have CAB, 726 00:42:08,260 --> 00:42:09,590 so let's enter that as well. 727 00:42:12,120 --> 00:42:14,120 By the time we've done this example, by the way, 728 00:42:14,120 --> 00:42:17,030 I'm hoping you'll know Lempel-Ziv. 729 00:42:17,030 --> 00:42:18,770 So bear with me. 730 00:42:21,680 --> 00:42:23,260 All right, dictionary entry-- and now 731 00:42:23,260 --> 00:42:25,895 what does it send out to the receiver? 732 00:42:25,895 --> 00:42:26,770 AUDIENCE: [INAUDIBLE] 733 00:42:26,770 --> 00:42:27,478 PROFESSOR: Sorry. 734 00:42:27,478 --> 00:42:28,120 AUDIENCE: C2 735 00:42:28,120 --> 00:42:35,030 PROFESSOR: CA-- the address for CA, right? 736 00:42:35,030 --> 00:42:36,055 The address for CA. 737 00:42:36,055 --> 00:42:37,700 So the address for CA comes in. 738 00:42:40,330 --> 00:42:42,160 It decodes the CA. 739 00:42:48,450 --> 00:42:49,740 And so let's see. 740 00:42:49,740 --> 00:42:51,780 We're done with these pieces, but this one 741 00:42:51,780 --> 00:42:56,670 has to build up its new dictionary entry. 742 00:42:56,670 --> 00:42:59,460 And so what it's got is the AB sitting there from before, 743 00:42:59,460 --> 00:43:01,650 and it pulls in the first letter. 744 00:43:01,650 --> 00:43:03,240 Instead of wrapping to the next board, 745 00:43:03,240 --> 00:43:05,970 let me start winding up again-- 746 00:43:05,970 --> 00:43:06,690 winding upwards. 747 00:43:09,650 --> 00:43:13,240 OK, so that's the new entry there, the receiver-- 748 00:43:13,240 --> 00:43:14,470 one step delayed from here. 749 00:43:18,660 --> 00:43:22,210 OK, I pull in the C. I have BC. 750 00:43:22,210 --> 00:43:24,140 I keep going. 751 00:43:24,140 --> 00:43:28,240 I pull in the A. I don't see that. 752 00:43:28,240 --> 00:43:29,015 So I need BCA. 753 00:43:33,650 --> 00:43:35,480 I ship out the address for BC. 754 00:43:38,250 --> 00:43:40,580 So I'm done with these.
755 00:43:40,580 --> 00:43:42,260 I get the address for BC here. 756 00:43:46,130 --> 00:43:50,120 I decode and get BC. 757 00:43:50,120 --> 00:43:53,405 I combine the first letter of the new fragment 758 00:43:53,405 --> 00:43:54,530 with what was sitting here. 759 00:43:54,530 --> 00:44:03,300 So I get CAB as my dictionary entry. 760 00:44:06,720 --> 00:44:08,400 And I keep going. 761 00:44:08,400 --> 00:44:10,565 All right, it's very systematic. 762 00:44:10,565 --> 00:44:12,190 I'm going to keep going because there's 763 00:44:12,190 --> 00:44:15,820 a special case that will trip you up if you don't get to it. 764 00:44:15,820 --> 00:44:19,300 And we need to proceed a couple more steps here. 765 00:44:19,300 --> 00:44:23,680 OK, I pull in the B. I've got AB. 766 00:44:23,680 --> 00:44:25,525 I pull in the C. I've got ABC. 767 00:44:29,580 --> 00:44:32,920 I pull in the A. I don't have ABCA. 768 00:44:32,920 --> 00:44:40,327 So I enter that in my dictionary. 769 00:44:43,430 --> 00:44:44,430 And then I ship out ABC. 770 00:44:54,200 --> 00:44:57,980 OK, so you're always building a new word, entering it 771 00:44:57,980 --> 00:45:00,643 in your dictionary, and then the part that's already 772 00:45:00,643 --> 00:45:02,060 known you're shipping out and then 773 00:45:02,060 --> 00:45:04,367 hanging onto the end of this to start 774 00:45:04,367 --> 00:45:05,450 building the new fragment. 775 00:45:08,150 --> 00:45:09,350 ABC arrives here. 776 00:45:17,140 --> 00:45:18,710 I had the BC from before. 777 00:45:18,710 --> 00:45:21,890 I pull in the first letter of that, 778 00:45:21,890 --> 00:45:28,160 and I get a BCA as my new entry, which is this one. 779 00:45:32,080 --> 00:45:33,220 OK. 780 00:45:33,220 --> 00:45:35,140 Now we pull in the AB. 781 00:45:35,140 --> 00:45:36,940 I mean, we pull in the B. We have AB. 782 00:45:36,940 --> 00:45:40,000 We pull in the C. We have ABC. 783 00:45:40,000 --> 00:45:47,770 We pull in the A, we have ABCA, so we pull in the B. 784 00:45:47,770 --> 00:45:50,135 We ship out ABCA-- 785 00:45:50,135 --> 00:45:55,950 A-B-C-A. Right? 786 00:45:55,950 --> 00:45:59,620 And now we're done with all those guys. 787 00:45:59,620 --> 00:46:00,650 And here comes ABCA. 788 00:46:06,370 --> 00:46:11,910 And I go to my dictionary, and I don't have ABCA-- 789 00:46:14,840 --> 00:46:15,370 big hiccup. 790 00:46:18,840 --> 00:46:22,190 So the reason that happened is that I'm 791 00:46:22,190 --> 00:46:27,080 discovering I need to send ABCA on the very next step 792 00:46:27,080 --> 00:46:30,230 after entering it in my dictionary 793 00:46:30,230 --> 00:46:32,080 on the transmitter side. 794 00:46:32,080 --> 00:46:35,990 And so the receiver hasn't yet had a chance to catch up. 795 00:46:35,990 --> 00:46:37,700 Now if you analyze this, it turns out 796 00:46:37,700 --> 00:46:43,010 that whenever this happens, the sequence involved 797 00:46:43,010 --> 00:46:46,980 has its last character equal to its first character. 798 00:46:46,980 --> 00:46:50,510 So looking at this, the dictionary 799 00:46:50,510 --> 00:46:52,340 here is waiting to build up. 800 00:46:52,340 --> 00:46:54,050 It's got the ABC here, and it's waiting 801 00:46:54,050 --> 00:46:58,370 to pull in the first letter from the sequence-- 802 00:46:58,370 --> 00:47:00,770 the sequence associated with this dictionary entry. 803 00:47:00,770 --> 00:47:02,390 It doesn't have that dictionary entry. 804 00:47:02,390 --> 00:47:05,300 So it can't pull in the A like it was doing all along.
805 00:47:05,300 --> 00:47:07,740 But if you analyze the cases under which this happens, 806 00:47:07,740 --> 00:47:09,980 it turns out that whenever the address you receive 807 00:47:09,980 --> 00:47:12,340 isn't in your dictionary, the missing letter 808 00:47:12,340 --> 00:47:14,090 that you want to pull into your dictionary 809 00:47:14,090 --> 00:47:16,760 is the same as the first letter of the string that's 810 00:47:16,760 --> 00:47:18,320 waiting to be built up. 811 00:47:18,320 --> 00:47:22,500 So it completes it with an A, and it's all set. 812 00:47:22,500 --> 00:47:28,010 Now it has ABCA, and it continues. 813 00:47:28,010 --> 00:47:30,200 So this happens under very particular conditions. 814 00:47:30,200 --> 00:47:31,250 It's a special case. 815 00:47:31,250 --> 00:47:36,350 If you actually look at the code that's in the notes, you'll see this. 816 00:47:36,350 --> 00:47:38,700 The encoding is straightforward, 817 00:47:38,700 --> 00:47:41,630 and it's really remarkable that a code fragment 818 00:47:41,630 --> 00:47:43,100 that short can do the encoding. 819 00:47:48,600 --> 00:47:49,440 Let's see here. 820 00:47:51,830 --> 00:47:52,830 I don't want to do this. 821 00:47:52,830 --> 00:47:53,747 I did another example. 822 00:48:00,210 --> 00:48:03,410 Let me just say what's on this before I dispense with it. 823 00:48:03,410 --> 00:48:05,930 Sorry. 824 00:48:05,930 --> 00:48:06,530 OK. 825 00:48:06,530 --> 00:48:12,390 So look at what's happened. 826 00:48:12,390 --> 00:48:15,090 In terms of the number of things we've sent, 827 00:48:15,090 --> 00:48:16,447 we've only sent these addresses. 828 00:48:16,447 --> 00:48:18,030 And there are fewer of them than there 829 00:48:18,030 --> 00:48:19,462 were symbols in the original. 830 00:48:19,462 --> 00:48:21,170 So that's where the compression comes in. 831 00:48:21,170 --> 00:48:25,595 And as you get longer strings, the benefit is higher. 832 00:48:25,595 --> 00:48:27,720 Actually, I'm going to pass over this and just tell you, 833 00:48:27,720 --> 00:48:33,060 when you look through the code fragment for decoding, 834 00:48:33,060 --> 00:48:35,070 this is the special case that we talked about. 835 00:48:35,070 --> 00:48:37,020 If the code is not in your dictionary, 836 00:48:37,020 --> 00:48:38,260 then do such and such. 837 00:48:38,260 --> 00:48:40,020 So that's the explanation. 838 00:48:40,020 --> 00:48:41,142 All right. 839 00:48:41,142 --> 00:48:42,600 And that's described in the slides. 840 00:48:42,600 --> 00:48:43,980 We'll put that up. 841 00:48:43,980 --> 00:48:47,500 I just wanted to end with a couple of things. 842 00:48:47,500 --> 00:48:50,160 One is actually-- LZW is a good example of something 843 00:48:50,160 --> 00:48:53,280 that you see in other contexts as well, where you're 844 00:48:53,280 --> 00:48:55,950 faced with transmitting data and you decide instead 845 00:48:55,950 --> 00:48:58,260 that you'll transmit the model, or your best model, 846 00:48:58,260 --> 00:48:59,550 for what generates that data. 847 00:48:59,550 --> 00:49:02,220 That can often be a much more efficient way to do things. 848 00:49:02,220 --> 00:49:04,650 And in fact, when you speak into your cell phone, 849 00:49:04,650 --> 00:49:07,290 you're not transmitting a raw speech waveform. 850 00:49:07,290 --> 00:49:10,890 There's actually a very sophisticated code there 851 00:49:10,890 --> 00:49:13,380 that's modeling your speech as the output 852 00:49:13,380 --> 00:49:14,840 of an autoregressive filter.
853 00:49:14,840 --> 00:49:19,470 And then it sends the filter tap weights to the receiver. 854 00:49:19,470 --> 00:49:21,690 So this kind of thing arises again and again. 855 00:49:21,690 --> 00:49:24,330 Sending the model, and the little bit of information 856 00:49:24,330 --> 00:49:26,430 you need to run the model at the receiving end, 857 00:49:26,430 --> 00:49:28,572 can be much more efficient than sending the data. 858 00:49:28,572 --> 00:49:30,030 The other thing is, everything we've 859 00:49:30,030 --> 00:49:32,590 talked about has been lossless compression-- 860 00:49:32,590 --> 00:49:33,930 Huffman and LZW. 861 00:49:33,930 --> 00:49:37,940 You can completely recover what was compressed. 862 00:49:37,940 --> 00:49:40,700 But there's a whole world of lossy compression, 863 00:49:40,700 --> 00:49:41,700 which is very important. 864 00:49:41,700 --> 00:49:44,760 And we'll find ways to sneak in a discussion of that as well. 865 00:49:44,760 --> 00:49:46,820 All right, thank you.