WEBVTT

00:00:00.000 --> 00:00:02.490
The following content is
provided under a Creative

00:00:02.490 --> 00:00:04.059
Commons license.

00:00:04.059 --> 00:00:06.360
Your support will help
MIT OpenCourseWare

00:00:06.360 --> 00:00:10.720
continue to offer high quality
educational resources for free.

00:00:10.720 --> 00:00:13.350
To make a donation or
view additional materials

00:00:13.350 --> 00:00:17.290
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.290 --> 00:00:18.294
at ocw.mit.edu.

00:00:28.437 --> 00:00:29.770
PROFESSOR: OK, let's get started.

00:00:32.390 --> 00:00:34.410
Let's get started, please.

00:00:34.410 --> 00:00:38.630
All right, last time we talked
about information and entropy.

00:00:38.630 --> 00:00:40.970
The picture we had
was of some kind

00:00:40.970 --> 00:00:44.435
of a source emitting symbols.

00:00:50.360 --> 00:00:54.230
Symbols-- let's say n of them.

00:00:54.230 --> 00:01:00.875
So it chooses from these symbols
with probabilities P1 up to Pn.

00:01:04.110 --> 00:01:16.330
And then we talked about the
expected information here,

00:01:16.330 --> 00:01:24.500
or the entropy, so the
expected information

00:01:24.500 --> 00:01:27.800
you get when you see the symbol
that's emitted by the source.

00:01:27.800 --> 00:01:31.670
And that was the average
value of the information.

00:01:31.670 --> 00:01:36.590
So it was-- let's see,
you take log of 1

00:01:36.590 --> 00:01:39.782
over P i for each of
the possible symbols.

00:01:39.782 --> 00:01:41.240
And then you've
got to weight it by

00:01:41.240 --> 00:01:44.960
the corresponding probability
to get an expectation.

00:01:44.960 --> 00:01:48.480
And this was the
entropy of the source.

00:01:48.480 --> 00:01:50.720
Or if you want to make
explicit the source,

00:01:50.720 --> 00:01:55.190
you could say H
of S for source--

00:01:55.190 --> 00:01:59.413
capital S. All right?
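
NOTE
Editor's sketch of the entropy formula just described, H(S) = sum over i of
P i times log2(1 / P i). The function name and the four example
probabilities are assumptions made for illustration, not values from the lecture.
```python
import math
def entropy(probs):
    """Expected information in bits per symbol: sum of p * log2(1/p)."""
    # Symbols with probability 0 never occur, so they contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
# Hypothetical four-symbol source, chosen so the logs come out as whole numbers.
example_probs = [0.5, 0.25, 0.125, 0.125]
print(entropy(example_probs))
```
For these probabilities the weighted sum is 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits per symbol.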

00:01:59.413 --> 00:02:01.580
And then we were actually
thinking of this operating

00:02:01.580 --> 00:02:02.360
repeatedly.

00:02:02.360 --> 00:02:06.440
So in the model we had last
time, the source at each time

00:02:06.440 --> 00:02:08.870
chooses from one of these
symbols with this probability.

00:02:08.870 --> 00:02:11.760
And it does it independently
of choices at other times.

00:02:11.760 --> 00:02:14.390
So what the source
actually generates

00:02:14.390 --> 00:02:20.810
is what's referred to as
an iid sequence of symbols,

00:02:20.810 --> 00:02:25.852
independent,
identically distributed.

00:02:25.852 --> 00:02:26.810
You'll see this a lot--

00:02:32.570 --> 00:02:34.790
Or iid sequence of symbols.

00:02:39.860 --> 00:02:43.927
So the independent part
of this refers to the fact

00:02:43.927 --> 00:02:45.510
that it makes the
choice independently

00:02:45.510 --> 00:02:46.980
at each time instant.

00:02:46.980 --> 00:02:49.075
The identically
distributed means

00:02:49.075 --> 00:02:50.700
that at each time
instant, it goes back

00:02:50.700 --> 00:02:51.900
to these same probabilities.

00:02:51.900 --> 00:02:55.020
It's the same distribution
that it uses each time.

00:02:55.020 --> 00:02:56.430
So that's what iid means--

00:02:56.430 --> 00:02:59.220
so sort of a stationary
probabilistic source

00:02:59.220 --> 00:03:01.560
with no dependence from one
time instant to the next.
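
NOTE
Editor's sketch of an iid source, assuming Python's standard random module.
The symbol names, probabilities, and seed are hypothetical. Every draw uses
the same distribution and ignores all earlier draws.
```python
import random
from collections import Counter
def emit_iid(symbols, probs, k, seed=0):
    """Draw k symbols independently, each from the same distribution."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    return rng.choices(symbols, weights=probs, k=k)
symbols = ["A", "B", "C", "D"]
probs = [0.5, 0.25, 0.125, 0.125]
sequence = emit_iid(symbols, probs, 10_000)
counts = Counter(sequence)
# Observed fractions should sit near the probabilities for a long run.
print({s: counts[s] / len(sequence) for s in symbols})
```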

00:03:06.150 --> 00:03:10.320
Average information was
measured in bits per symbol.

00:03:13.560 --> 00:03:16.260
And what we wanted to do
was take those symbols

00:03:16.260 --> 00:03:24.450
and compress them
to binary digits.

00:03:30.740 --> 00:03:32.780
OK, so we were going to--

00:03:32.780 --> 00:03:34.790
you can compress
them to other things.

00:03:34.790 --> 00:03:36.957
We were going to think of
compressing them to binary

00:03:36.957 --> 00:03:41.000
digits because we're thinking
of a channel that can take 0s and 1s

00:03:41.000 --> 00:03:43.710
or a signal that's in
two possible states.

00:03:43.710 --> 00:03:46.820
So what we wanted to do was
take each symbol or sequence

00:03:46.820 --> 00:03:50.640
of symbols and code it in
the form of binary digits.

00:03:50.640 --> 00:03:51.140
Right?

00:03:53.960 --> 00:03:57.530
Now, each binary
digit can, at most,

00:03:57.530 --> 00:03:59.300
carry one bit of information.

00:03:59.300 --> 00:04:03.525
If the binary digit is equally
likely to be a 0 or a 1,

00:04:03.525 --> 00:04:05.150
then it carries one
bit of information.

00:04:05.150 --> 00:04:07.400
So that tells you really
that if you're going

00:04:07.400 --> 00:04:10.760
to code this, the code length--

00:04:13.530 --> 00:04:17.190
let's see-- compress to binary
digits, let's say, or encode.

00:04:20.140 --> 00:04:22.500
And what we need is the
expected code length.

00:04:29.960 --> 00:04:37.880
L should be greater
than or equal to H. So

00:04:37.880 --> 00:04:42.680
you need to transmit at
least this many binary digits

00:04:42.680 --> 00:04:44.840
on average to convey
the information that's

00:04:44.840 --> 00:04:46.430
coming out of the source--

00:04:46.430 --> 00:04:50.015
per symbol or per time step.

00:04:50.015 --> 00:04:51.640
All right, so that
was the basic setup.

00:04:56.040 --> 00:05:00.210
I've given you one
of these bounds here.

00:05:00.210 --> 00:05:02.010
When we talked about
codes, by the way,

00:05:02.010 --> 00:05:06.240
we decided that if we're
talking about binary codes,

00:05:06.240 --> 00:05:12.670
we want to limit ourselves to
what are called instantaneously

00:05:12.670 --> 00:05:19.420
decodable or prefix-free codes.

00:05:19.420 --> 00:05:21.220
And these are codes
that correspond

00:05:21.220 --> 00:05:24.940
to the leaves of a code tree.

00:05:24.940 --> 00:05:27.820
So we had examples of this type.

00:05:27.820 --> 00:05:29.860
You want your symbols
to be associated

00:05:29.860 --> 00:05:34.000
with the leaves of--
the end of the tree, not

00:05:34.000 --> 00:05:35.660
intermediate points.

00:05:35.660 --> 00:05:38.200
The reason being
that, as you work

00:05:38.200 --> 00:05:40.990
your way down the tree--

00:05:40.990 --> 00:05:44.800
by the way, I'm assuming
that this picture makes sense

00:05:44.800 --> 00:05:48.520
to you in some fashion
from recitation.

00:05:48.520 --> 00:05:50.800
But as you work your
way down to the symbol,

00:05:50.800 --> 00:05:52.940
you don't encounter any
other symbols on the way.

00:05:52.940 --> 00:05:54.398
So as soon as you
hit the leaf, you

00:05:54.398 --> 00:05:55.840
know what symbol you've got.

00:05:55.840 --> 00:05:58.330
So we're limiting
ourselves to codes

00:05:58.330 --> 00:06:00.190
of that type because
some of the statements

00:06:00.190 --> 00:06:03.050
I make are not true if you
don't have codes of this type.

00:06:03.050 --> 00:06:06.250
So I won't comment
on that again.
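
NOTE
Editor's sketch of why prefix-free codes are instantaneously decodable. The
codebook below is a hypothetical example, not the one on the slide. Because
no codeword is a prefix of another, the decoder can emit a symbol the moment
the bits read so far match a codeword (a leaf of the code tree).
```python
def decode_prefix_free(bits, codebook):
    """Decode a bit string using a prefix-free codebook {symbol: codeword}."""
    inverse = {word: sym for sym, word in codebook.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # reached a leaf, so the symbol is known now
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ended in the middle of a codeword")
    return decoded
codebook = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode_prefix_free("0101100111", codebook))
```
The bit string splits as 0, 10, 110, 0, 111, so it decodes to A, B, C, A, D with no lookahead.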

00:06:06.250 --> 00:06:11.380
All right, so we've got
that, the first inequality

00:06:11.380 --> 00:06:13.210
that I've put up there.

00:06:13.210 --> 00:06:16.000
And it turns out
that Shannon showed

00:06:16.000 --> 00:06:21.078
how to actually construct
codes that will give you

00:06:21.078 --> 00:06:22.120
a bound on the other side.

00:06:22.120 --> 00:06:26.840
Let me actually write it
the way it is on the slide.

00:06:26.840 --> 00:06:30.730
So Shannon showed how to get
codes that satisfy this-- so

00:06:30.730 --> 00:06:36.430
can get codes satisfying this.

00:06:40.060 --> 00:06:41.830
So Shannon showed
how to get within one

00:06:41.830 --> 00:06:44.740
of the lower bound in terms
of the expected length

00:06:44.740 --> 00:06:45.260
of the code.

00:06:45.260 --> 00:06:47.860
So that was pretty good.
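
NOTE
Editor's sketch of the two-sided bound H <= L < H + 1. The lecture does not
spell out Shannon's construction, so this assumes the usual textbook choice
of codeword length ceil(log2(1 / P i)) for symbol i. The probabilities are
hypothetical.
```python
import math
def shannon_lengths(probs):
    """Codeword lengths ceil(log2(1/p)), the textbook Shannon-code choice."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]
probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(probs)
H = sum(p * math.log2(1.0 / p) for p in probs)
L = sum(p * n for p, n in zip(probs, lengths))
# Kraft sum <= 1 means a prefix-free code with these lengths exists.
kraft_sum = sum(2.0 ** -n for n in lengths)
print(lengths, H, L, kraft_sum)
```
Rounding each length up costs less than one binary digit per symbol, which is where the "+ 1" in the upper bound comes from.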

00:06:47.860 --> 00:06:51.250
But after coming up
with this paper in '48

00:06:51.250 --> 00:06:54.760
and working on this for a while,
neither he nor other luminaries

00:06:54.760 --> 00:06:58.300
in the field had found how
to get the best such code,

00:06:58.300 --> 00:07:00.130
and that's what
Huffman ended up doing.

00:07:00.130 --> 00:07:05.380
So we've talked
about that already.

00:07:05.380 --> 00:07:07.900
OK, so Huffman showed
how to get a code

00:07:07.900 --> 00:07:10.690
of minimum expected
length per symbol

00:07:10.690 --> 00:07:12.632
with a very simple construction.

00:07:15.940 --> 00:07:22.890
Now, you can actually
extend Huffman--

00:07:22.890 --> 00:07:26.550
and maybe you talked about
this in recitation as well.

00:07:26.550 --> 00:07:28.050
So you can code
per symbol, or you

00:07:28.050 --> 00:07:31.890
can decide you're going
to create super-symbols.

00:07:31.890 --> 00:07:37.590
Take the same source, but say
that the symbols that it emits

00:07:37.590 --> 00:07:41.110
are the symbols from here
grouped two at a time.

00:07:41.110 --> 00:07:43.650
So you're going to
take the symbol emitted

00:07:43.650 --> 00:07:45.510
at some particular time
and then the symbol

00:07:45.510 --> 00:07:48.930
at the following time and
call that a super-symbol.

00:07:48.930 --> 00:07:51.750
And then take the
next pair, and that's

00:07:51.750 --> 00:07:52.962
a super-symbol and so on.

00:07:52.962 --> 00:07:54.420
So you're doing
the Huffman coding,

00:07:54.420 --> 00:07:57.225
but on pairs of symbols.

00:07:57.225 --> 00:08:00.480
So you can go through the
same kind of construction.

00:08:00.480 --> 00:08:04.410
If you're assuming an iid
source, then the probability

00:08:04.410 --> 00:08:07.983
of a paired super-symbol
is easy to compute.

00:08:07.983 --> 00:08:09.900
It's just the product of the
probabilities of the individual ones

00:08:09.900 --> 00:08:12.610
because they're
independently emitted.

00:08:12.610 --> 00:08:16.530
And then the entropy
of the resulting source

00:08:16.530 --> 00:08:19.440
here turns out to be twice
the entropy of the source

00:08:19.440 --> 00:08:22.330
here because these are
independent emissions,

00:08:22.330 --> 00:08:24.450
so the entropies will just add.

00:08:24.450 --> 00:08:28.200
So you can do the Huffman
construction again.

00:08:28.200 --> 00:08:30.450
And what you discover is
the same kind of thing

00:08:30.450 --> 00:08:37.830
except this is now
the inequality, right?

00:08:37.830 --> 00:08:39.000
And the reason is--

00:08:39.000 --> 00:08:42.750
well, here L is still the
expected length per symbol.

00:08:42.750 --> 00:08:45.210
But you're doing pairs
now, so the expected length

00:08:45.210 --> 00:08:47.310
for the pair is 2L.

00:08:47.310 --> 00:08:48.570
Right?

00:08:48.570 --> 00:08:50.670
The lower bound is the
entropy of the source.

00:08:50.670 --> 00:08:52.110
That's 2H.

00:08:52.110 --> 00:08:55.090
The upper bound is the
entropy of that source plus 1.

00:08:55.090 --> 00:08:56.940
So you can construct
a code of that type.

00:08:56.940 --> 00:08:59.707
You can do it with Shannon's
construction or Huffman's.

00:08:59.707 --> 00:09:01.290
And now see what
you've managed to do.

00:09:01.290 --> 00:09:05.610
You've got a little tighter of a
squeeze on the expected length.

00:09:14.100 --> 00:09:19.040
So we've gone from H
plus 1 to H plus 1/2

00:09:19.040 --> 00:09:20.810
with this construction.

00:09:20.810 --> 00:09:25.910
If you took triples, this
would just change to 1 over 3.

00:09:25.910 --> 00:09:29.030
If you took K-tuples,
you'd get 1 over K.

00:09:29.030 --> 00:09:31.970
So if you encode larger
and larger blocks,

00:09:31.970 --> 00:09:33.680
you can squeeze
the expected length

00:09:33.680 --> 00:09:37.140
down to essentially what
the entropy band tells you.
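
NOTE
Editor's sketch of the block-coding squeeze H <= L < H + 1/K. It computes
only the expected length of a Huffman code, using the fact that each merge of
the two least likely entries adds one binary digit to everything beneath it.
The two-symbol source and the function names are assumptions for illustration.
```python
import heapq
import math
from itertools import product
def huffman_expected_length(probs):
    """Expected codeword length of a Huffman code for these probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # one more bit for every symbol under this merge
        heapq.heappush(heap, merged)
    return total
def per_symbol_length(probs, k):
    """Huffman-code blocks of k iid symbols; return binary digits per symbol."""
    block_probs = [math.prod(block) for block in product(probs, repeat=k)]
    return huffman_expected_length(block_probs) / k
probs = [0.9, 0.1]  # hypothetical, deliberately lopsided source
H = sum(p * math.log2(1.0 / p) for p in probs)
for k in (1, 2, 3):
    print(k, per_symbol_length(probs, k), "entropy:", H)
```
A lopsided source makes the effect visible: coding one symbol at a time can never use less than one binary digit per symbol, while longer blocks move toward H.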

00:09:41.250 --> 00:09:44.670
Now, Huffman-- you've
spent time in recitation.

00:09:44.670 --> 00:09:48.380
I just thought I would
quickly run through an example

00:09:48.380 --> 00:09:53.520
so that you have this
fresh in your minds.

00:09:53.520 --> 00:09:56.440
So we start off with
a set of symbols.

00:09:56.440 --> 00:09:58.890
This is kind of weak,
but I hope you can see it.

00:09:58.890 --> 00:10:03.060
A set of symbols, A through D
in this case, with probabilities

00:10:03.060 --> 00:10:04.560
associated with it.

00:10:04.560 --> 00:10:06.930
The Huffman process
is to first sort

00:10:06.930 --> 00:10:09.570
these symbols in descending
order of probability.

00:10:09.570 --> 00:10:11.580
So that's what I
really start with.

00:10:11.580 --> 00:10:13.290
You take the two
smallest ones and lump

00:10:13.290 --> 00:10:19.548
them together to get a paired
symbol, rearrange, reorder.

00:10:19.548 --> 00:10:21.090
And then you do the
same thing again.

00:10:21.090 --> 00:10:24.210
You take the two,
combine them, reorder.

00:10:24.210 --> 00:10:27.660
Take the two smallest ones,
combine them, reorder.

00:10:27.660 --> 00:10:31.440
And that's what you have
for your reduction phase.

00:10:31.440 --> 00:10:33.210
And then you start
to trace back.

00:10:33.210 --> 00:10:37.380
So when you trace back, you can
pick the upper one to be zero,

00:10:37.380 --> 00:10:39.060
the lower one to be 1.

00:10:39.060 --> 00:10:41.950
And then every time you get a
bifurcation, as you go back,

00:10:41.950 --> 00:10:46.110
you'll pick the upper one to be
zero and the lower one to be 1.

00:10:46.110 --> 00:10:48.990
And you start to build
up your code word, right?

00:10:48.990 --> 00:10:51.070
So this one traces back.

00:10:51.070 --> 00:10:52.730
There's no bifurcation.

00:10:52.730 --> 00:10:53.620
This traces back.

00:10:53.620 --> 00:10:56.430
The 0 becomes 0 0 and 0 1.

00:10:56.430 --> 00:10:59.410
And you go all
the way like that.

00:10:59.410 --> 00:10:59.910
OK?

00:10:59.910 --> 00:11:06.475
So trace back-- let's see.

00:11:06.475 --> 00:11:07.430
Oh, was there a--

00:11:07.430 --> 00:11:07.930
yeah.

00:11:07.930 --> 00:11:10.657
So the 1 here becomes
a 1 0 and a 1 1.

00:11:10.657 --> 00:11:12.740
And then at the next step,
you're all the way back

00:11:12.740 --> 00:11:15.590
with the Huffman code.

00:11:15.590 --> 00:11:16.640
Right?

00:11:16.640 --> 00:11:19.580
So that's the Huffman code
for that set of symbols.

00:11:19.580 --> 00:11:20.960
It's a Huffman code.

00:11:20.960 --> 00:11:23.620
I shouldn't say the Huffman
code because, if you notice,

00:11:23.620 --> 00:11:26.480
at various stages
we had probabilities

00:11:26.480 --> 00:11:30.920
that were identical, like over
here and over here and over

00:11:30.920 --> 00:11:31.700
here.

00:11:31.700 --> 00:11:34.790
And we could have chosen
how to order these things

00:11:34.790 --> 00:11:37.787
and then how to do the
subsequent grouping.

00:11:37.787 --> 00:11:39.620
And all of those will
give you Huffman codes

00:11:39.620 --> 00:11:41.720
with the same minimum
expected length.
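
NOTE
Editor's sketch of the reduction and trace-back just walked through, written
with a heap instead of repeated sorting. The slide's probabilities are not in
the transcript, so the values below are hypothetical. Ties are broken by
insertion order here, which is one arbitrary choice among several that all
give the same expected length.
```python
import heapq
from itertools import count
def huffman_code(probs):
    """Return {symbol: codeword} for a dict {symbol: probability}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)  # the two least likely entries
        p1, _, code1 = heapq.heappop(heap)
        # Trace-back step: prepend 0 on one branch and 1 on the other.
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]
probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}  # hypothetical values
code = huffman_code(probs)
expected_length = sum(probs[s] * len(w) for s, w in code.items())
print(code, expected_length)
```
Swapping which branch gets 0 and which gets 1 changes the codewords but not their lengths, which is why this is a Huffman code rather than the Huffman code.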

00:11:47.270 --> 00:11:47.770
All right.

00:11:53.062 --> 00:11:55.270
All right, I want to give
you another way of thinking

00:11:55.270 --> 00:11:59.245
about entropy and why
it enters into coding.

00:12:02.470 --> 00:12:04.150
And here's the basic idea.

00:12:04.150 --> 00:12:07.570
All right, so we're still
thinking about the source

00:12:07.570 --> 00:12:09.640
emitting independent symbols.

00:12:09.640 --> 00:12:11.630
It's an iid source.

00:12:11.630 --> 00:12:13.750
And we've got a very
long string of emissions.

00:12:13.750 --> 00:12:23.910
So we've got a very long
string of symbols emitted,

00:12:23.910 --> 00:12:31.800
maybe S1 at the first time,
S17 here, S2 here, and so on.

00:12:31.800 --> 00:12:34.100
And the question is, in a
very long string of symbols,

00:12:34.100 --> 00:12:37.130
how many times do you
expect to see symbol S1?

00:12:37.130 --> 00:12:39.793
How many times do you expect to
see a symbol S2 and so on?

00:12:39.793 --> 00:12:41.210
Well, if you
actually work it out,

00:12:41.210 --> 00:12:43.340
it turns out that
the expected number

00:12:43.340 --> 00:12:56.150
of times, number of times we
see SI in the [INAUDIBLE] symbol

00:12:56.150 --> 00:13:01.260
is K times the
probability of seeing SI.

00:13:01.260 --> 00:13:03.030
So it's what you'd expect.

00:13:03.030 --> 00:13:03.770
All right?

00:13:03.770 --> 00:13:06.140
So the expected number
of times is that.

00:13:06.140 --> 00:13:09.470
Well, but that doesn't tell
you what the number of times

00:13:09.470 --> 00:13:12.117
is that you'll see in
any given experiment.

00:13:12.117 --> 00:13:14.450
We know that you need to think
about standard deviations

00:13:14.450 --> 00:13:16.050
as well.

00:13:16.050 --> 00:13:24.910
So what this is saying is,
for instance, for symbol SI,

00:13:24.910 --> 00:13:32.382
that we expect to get
that many of symbol SI.

00:13:32.382 --> 00:13:34.340
But actually, there's a
distribution around it.

00:13:34.340 --> 00:13:39.050
So you'll get a
little histogram here.

00:13:39.050 --> 00:13:41.990
I'm not making any attempt
to draw it very carefully,

00:13:41.990 --> 00:13:43.958
but there's a distribution.

00:13:43.958 --> 00:13:45.500
You run different
experiments, you're

00:13:45.500 --> 00:13:50.300
going to get different numbers
of SI in that run of K. Right?

00:13:50.300 --> 00:13:51.510
So there's a distribution.

00:13:51.510 --> 00:13:53.450
And it turns out
you can actually

00:13:53.450 --> 00:13:56.360
write an explicit formula
for the standard deviation.

00:14:02.820 --> 00:14:05.960
This is something you'll see
if you do a probability course.

00:14:05.960 --> 00:14:08.540
It's actually very simple.

00:14:11.730 --> 00:14:13.170
So that's the
standard deviation.

00:14:13.170 --> 00:14:18.750
So the standard
deviation goes as root K.

00:14:18.750 --> 00:14:21.510
So the interesting thing is
that the standard deviation as

00:14:21.510 --> 00:14:24.270
a fraction of--

00:14:24.270 --> 00:14:27.270
the number of successes
gets smaller and smaller

00:14:27.270 --> 00:14:29.490
as K becomes larger and larger.

00:14:29.490 --> 00:14:34.560
Or another way to see that
is, if I normalize this,

00:14:34.560 --> 00:14:37.680
so I'm going to do a
number of successes

00:14:37.680 --> 00:14:43.990
divided by K. This histogram
is going to cluster around P i.

00:14:47.340 --> 00:14:52.453
And the standard deviation
now, because I've divided by K,

00:14:52.453 --> 00:14:54.120
the standard deviation
actually now ends

00:14:54.120 --> 00:15:02.040
up being the square root of P i times
1 minus P i, all over K. All right?

00:15:04.800 --> 00:15:09.450
So what this says is
if you get a run of K

00:15:09.450 --> 00:15:11.670
emissions of the symbol
and you try and estimate

00:15:11.670 --> 00:15:16.560
the probability, P i by taking
the ratio of times SI appears

00:15:16.560 --> 00:15:18.570
over the total number
of runs, you'll

00:15:18.570 --> 00:15:23.070
actually get a little
spread here centered on P i.

00:15:23.070 --> 00:15:25.747
But the spread actually
goes down as 1 over root K.

00:15:25.747 --> 00:15:28.330
So this is really what the law
of large numbers is telling us.

00:15:28.330 --> 00:15:30.610
It's telling us that if
you take a very long run,

00:15:30.610 --> 00:15:35.880
you almost certainly get
a number of successes,

00:15:35.880 --> 00:15:37.830
well, Kp i in this case.

00:15:37.830 --> 00:15:39.570
It's very tightly concentrated.
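
NOTE
Editor's sketch of the two standard deviations discussed above, using the
standard binomial formulas: sqrt(K p (1 - p)) for the count and
sqrt(p (1 - p) / K) for the count divided by K. The value p = 0.25 and the
function names are assumptions for illustration.
```python
import math
def count_std(k, p):
    """Standard deviation of the number of occurrences in k iid emissions."""
    return math.sqrt(k * p * (1 - p))
def fraction_std(k, p):
    """Standard deviation of the observed fraction (count divided by k)."""
    return math.sqrt(p * (1 - p) / k)
p = 0.25  # hypothetical symbol probability
for k in (100, 10_000, 1_000_000):
    # The count's spread grows like root K; the fraction's shrinks like 1/root K.
    print(k, count_std(k, p), fraction_std(k, p))
```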

00:15:42.410 --> 00:15:44.910
All right, we don't want you
to remember all these formulas.

00:15:44.910 --> 00:15:46.110
I have them on the slides.

00:15:46.110 --> 00:15:48.010
It's just there for fun.

00:15:48.010 --> 00:15:50.580
There's something else
that I put on there

00:15:50.580 --> 00:15:51.900
that you can try out for fun.

00:15:51.900 --> 00:15:55.560
I don't want to talk through
it, but you can use exactly this

00:15:55.560 --> 00:15:58.500
to analyze things like polling.

00:15:58.500 --> 00:16:02.460
Why is it that you
can poll 2,500 people

00:16:02.460 --> 00:16:04.170
and say that I've
got a margin of error

00:16:04.170 --> 00:16:07.770
of 1% as to how the election
is going to turn out?

00:16:07.770 --> 00:16:10.778
Well, the answer is,
actually, in exactly this.

00:16:10.778 --> 00:16:12.820
If we have time at the
end, I'll come back to it.

00:16:12.820 --> 00:16:17.500
But it's easy enough that
you can look at it yourself.
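
NOTE
Editor's sketch of the polling arithmetic, assuming each of the 2,500
responses is an independent draw and taking the lecture's illustrative
p = 0.55. It treats the margin of error as one standard deviation of the
observed fraction, which is an assumption (pollsters often quote about two).
```python
import math
def poll_std(p, n):
    """Standard deviation of the observed fraction in a poll of n people."""
    return math.sqrt(p * (1 - p) / n)
one_sigma = poll_std(0.55, 2500)
print(one_sigma)
```
The result is just under 0.01, that is, about 1%.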

00:16:17.500 --> 00:16:20.730
So let's focus on what it
is I wanted to show you.

00:16:25.150 --> 00:16:28.970
I picked Obama 0.55, but that
was just as an illustration.

00:16:28.970 --> 00:16:29.670
[LAUGHTER]

00:16:29.670 --> 00:16:31.710
No.

00:16:31.710 --> 00:16:35.070
No political views to
be imputed to that.

00:16:35.070 --> 00:16:37.650
All right, so what
we're saying is you've

00:16:37.650 --> 00:16:43.200
got K emissions of this symbol.

00:16:43.200 --> 00:16:45.210
And with very high
probability, you've

00:16:45.210 --> 00:16:56.260
got Kp1 of S1,
Kp2 of S2, and so on.

00:16:56.260 --> 00:16:58.350
So this is really what
you're expecting to get,

00:16:58.350 --> 00:17:05.030
provided you've tossed this
a large number of times.

00:17:05.030 --> 00:17:09.319
What's the probability of
getting a sequence that has Kp1

00:17:09.319 --> 00:17:13.349
of S1, Kp2 of S2, and so on?

00:17:13.349 --> 00:17:17.780
So you've got to get
S1 in Kp1 positions.

00:17:17.780 --> 00:17:19.640
What's the probability of that?

00:17:19.640 --> 00:17:23.480
And you've got to get
S2 in Kp2 positions.

00:17:23.480 --> 00:17:27.109
So how do you work out
those probabilities?

00:17:27.109 --> 00:17:29.220
We're invoking independence
of all the emissions.

00:17:29.220 --> 00:17:31.560
So you can multiply
probabilities.

00:17:31.560 --> 00:17:35.690
So what you have is S1
occurring with probability

00:17:35.690 --> 00:17:39.830
P1 to the power
Kp1, because P1 is

00:17:39.830 --> 00:17:41.640
the probability with
which S1 occurs,

00:17:41.640 --> 00:17:43.370
and it's happening Kp1 times.

00:17:43.370 --> 00:17:51.410
So you take it to that power,
and then P2 to the Kp2,

00:17:51.410 --> 00:17:55.660
all the way up to Pn to the Kpn.

00:17:55.660 --> 00:17:56.540
OK?

00:17:56.540 --> 00:18:03.140
So this is the probability of
getting a sequence like this.

00:18:05.417 --> 00:18:07.750
And what we've said is this
is the only kind of sequence

00:18:07.750 --> 00:18:09.160
you're typically going to get.

00:18:09.160 --> 00:18:12.920
All the rest have very low
probability of occurrence.

00:18:12.920 --> 00:18:15.190
So it must be that when I
add up all these sequences,

00:18:15.190 --> 00:18:17.920
I get, essentially,
probability 1.

00:18:17.920 --> 00:18:21.760
So the question then is how
many such sequences are there.

00:18:21.760 --> 00:18:23.800
If a single sequence
of this type

00:18:23.800 --> 00:18:27.690
has this probability, and the
only sequences I'm going to get

00:18:27.690 --> 00:18:31.290
are sequences of this
type effectively,

00:18:31.290 --> 00:18:33.220
and the probabilities
have to sum to 1.

00:18:33.220 --> 00:18:35.680
How many sequences do
I have of this type?

00:18:38.740 --> 00:18:40.825
Do you agree that it's
1 over the probability?

00:18:43.402 --> 00:18:44.610
The number of such sequences?

00:18:44.610 --> 00:18:49.620
Because I've got to take the
number of sequences times

00:18:49.620 --> 00:18:53.420
this individual probability
has to come up to be one.

00:18:53.420 --> 00:18:54.060
Right?

00:18:54.060 --> 00:18:57.012
The number of sequences--
let me write this down.

00:18:57.012 --> 00:18:58.470
So that you see it
a little better.

00:19:07.800 --> 00:19:12.240
The number of such--

00:19:12.240 --> 00:19:14.130
let me call them
typical sequences--

00:19:18.550 --> 00:19:21.700
times the probability
of any such sequence

00:19:21.700 --> 00:19:24.560
has got to be approximately 1.

00:19:24.560 --> 00:19:27.420
I say approximately because
there are a few other sequences

00:19:27.420 --> 00:19:28.740
whose probabilities might--

00:19:28.740 --> 00:19:30.430
I would have to take account of.

00:19:30.430 --> 00:19:32.210
But this is essentially it.

00:19:32.210 --> 00:19:35.300
So the number of such sequences
is 1 over this number.

00:19:35.300 --> 00:19:43.530
So the number of such sequences
is P1 to the minus Kp1,

00:19:43.530 --> 00:19:51.520
P2 to the minus
Kp2 and so on.

00:19:56.600 --> 00:19:58.297
That's the number
of such sequences.

00:19:58.297 --> 00:20:00.130
And essentially, these
are all the sequences

00:20:00.130 --> 00:20:01.047
that I'm going to get.

00:20:04.120 --> 00:20:06.070
Well, if I take
the log of this--

00:20:11.210 --> 00:20:13.190
just visualize
how the log works.

00:20:13.190 --> 00:20:15.850
Now I've got the
log of a product,

00:20:15.850 --> 00:20:18.480
so that's going to be a sum
of the individual logs.

00:20:18.480 --> 00:20:20.300
I've got the log of
a power of something,

00:20:20.300 --> 00:20:24.340
so the power will come
down to multiply the log.

00:20:24.340 --> 00:20:29.870
This comes out to be K
times H of S exactly.

00:20:29.870 --> 00:20:33.410
OK, so the log of the
number of sequences

00:20:33.410 --> 00:20:37.940
is K times H of S,
K times the entropy.

00:20:40.930 --> 00:20:44.310
This is log to the base 2.

00:20:44.310 --> 00:20:52.050
So the number of sequences
is equal to 2 to the KH.

00:20:52.050 --> 00:20:53.225
I'm saying equal to.

00:20:53.225 --> 00:20:54.600
I should be putting
approximately

00:20:54.600 --> 00:20:57.700
equal to signs everywhere,
but you get the idea.

00:20:57.700 --> 00:21:03.150
So the number of typical
sequences is 2 to the KH.

00:21:03.150 --> 00:21:07.290
How many binary digits does it
take to count 2 to the KH things?

00:21:11.210 --> 00:21:12.760
KH, right?

00:21:12.760 --> 00:21:13.915
So what I need is--

00:21:16.954 --> 00:21:26.890
so I just need K times H of S binary digits
to count the typical sequences.

00:21:34.980 --> 00:21:37.935
So how many binary digits
do I need per symbol?

00:21:40.790 --> 00:21:42.500
It's just that divided
by K because I've

00:21:42.500 --> 00:21:44.370
got a string of K symbols.

00:21:44.370 --> 00:21:48.450
So I need a number of binary
digits equal to the entropy.

00:21:48.450 --> 00:21:51.320
So this is a quick way
of seeing that entropy

00:21:51.320 --> 00:21:54.950
is very relevant to minimal
coding of sequences of outputs

00:21:54.950 --> 00:21:57.330
from a source like this.
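
NOTE
Editor's sketch of the typical-sequence count. It checks that the log of
1 / (P1^(K P1) ... Pn^(K Pn)) equals K times H, and compares that with the
exact number of orderings having exactly K P i copies of each symbol. The
source, K = 1000, and the function names are assumptions for illustration.
```python
import math
def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
def log2_typical_estimate(probs, k):
    """log2 of 1 / (p1**(k*p1) * ... * pn**(k*pn)), kept in the log domain."""
    return sum(k * p * math.log2(1.0 / p) for p in probs if p > 0)
def log2_exact_count(counts):
    """log2 of the number of distinct orderings with exactly these counts."""
    # lgamma(m + 1) is ln(m!), which avoids building huge factorials.
    nats = math.lgamma(sum(counts) + 1) - sum(math.lgamma(c + 1) for c in counts)
    return nats / math.log(2)
probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical source
k = 1000
counts = [round(k * p) for p in probs]  # 500, 250, 125, 125
estimate = log2_typical_estimate(probs, k)
exact = log2_exact_count(counts)
print(estimate, k * entropy(probs), exact)
```
The exact count comes out slightly below K times H, which is the sense in which the equalities in the argument are only approximate.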

00:21:57.330 --> 00:22:01.500
All right, now I swept a
lot of math under the rug.

00:22:01.500 --> 00:22:04.650
The math that makes
this rigorous exists.

00:22:04.650 --> 00:22:08.490
We don't want to have
any part of it here.

00:22:08.490 --> 00:22:10.110
But for those of you
who are inclined,

00:22:10.110 --> 00:22:14.390
you can look in a book
on information theory.

00:22:14.390 --> 00:22:15.530
There's a nice name to it.

00:22:15.530 --> 00:22:24.740
It's called the
Asymptotic Equipartition--

00:22:24.740 --> 00:22:32.220
Equipartition Property.

00:22:32.220 --> 00:22:34.300
OK?

00:22:34.300 --> 00:22:37.010
It's basically saying that,
asymptotically, the probability

00:22:37.010 --> 00:22:39.140
partitions into equal
probabilities for all

00:22:39.140 --> 00:22:41.380
these typical sequences.

00:22:41.380 --> 00:22:41.880
All right.

00:22:45.560 --> 00:22:52.430
So all that is for Huffman
and its application

00:22:52.430 --> 00:22:57.680
to symbols emitted independently
by a source over time.

00:22:57.680 --> 00:22:59.960
But there are
limitations to this.

00:22:59.960 --> 00:23:05.270
We've been working with Huffman
coding under the assumption

00:23:05.270 --> 00:23:07.340
that the probabilities
are given to us.

00:23:07.340 --> 00:23:10.010
But it's typically the case
that the probabilities are not

00:23:10.010 --> 00:23:15.170
known for some arbitrary source
that you're trying to code for.

00:23:15.170 --> 00:23:17.420
The probabilities might
change with time as the source

00:23:17.420 --> 00:23:18.830
characteristics change.

00:23:18.830 --> 00:23:22.040
So you would need to
detect that and recode,

00:23:22.040 --> 00:23:24.430
if you're going to do Huffman.

00:23:24.430 --> 00:23:26.990
And then the more
important point

00:23:26.990 --> 00:23:30.350
perhaps is that sources
are generally not iid.

00:23:30.350 --> 00:23:32.690
The sources of
interest are not really

00:23:32.690 --> 00:23:36.950
generating independent
identically

00:23:36.950 --> 00:23:38.870
distributed symbols.

00:23:38.870 --> 00:23:42.810
What's perhaps
more true is that--

00:23:42.810 --> 00:23:43.310
let's see.

00:23:43.310 --> 00:23:50.600
Oh, here-- once you're done
compressing your source

00:23:50.600 --> 00:23:52.910
to binary digits where
each binary digit carries

00:23:52.910 --> 00:23:54.950
a bit of information,
then you've

00:23:54.950 --> 00:24:02.210
got something that essentially
is not correlated over time.

00:24:02.210 --> 00:24:04.400
You've managed to
kind of decouple it.

00:24:04.400 --> 00:24:08.660
But at this stage, these symbols
are not really independent

00:24:08.660 --> 00:24:10.890
in typical cases of interest.

00:24:10.890 --> 00:24:15.110
So one important case, of
course, is just English text.

00:24:15.110 --> 00:24:17.810
You can still code
it symbol by symbol,

00:24:17.810 --> 00:24:19.910
but it's a very
inefficient coding.

00:24:19.910 --> 00:24:22.420
If you wanted to do
it symbol by symbol,

00:24:22.420 --> 00:24:24.140
let's just ignore uppercase.

00:24:24.140 --> 00:24:27.010
You've got 26
letters plus a space.

00:24:27.010 --> 00:24:30.110
So that's 27 symbols.

00:24:30.110 --> 00:24:32.810
Well, you could certainly code
that with five binary digits

00:24:32.810 --> 00:24:36.050
because that would give
you 32 things to count.

00:24:36.050 --> 00:24:38.630
You can do better
with a code that

00:24:38.630 --> 00:24:40.280
approaches the
entropy associated

00:24:40.280 --> 00:24:42.710
with a source of this type.

00:24:42.710 --> 00:24:45.730
That would be 4.755 bits.

00:24:45.730 --> 00:24:50.780
OK, so if you ignored
dependence in English text

00:24:50.780 --> 00:24:54.620
and just treated each
symbol as equally likely,

00:24:54.620 --> 00:24:56.152
you'd say that
that's the entropy,

00:24:56.152 --> 00:24:58.610
and you could attempt to code
it with something approaching

00:24:58.610 --> 00:24:59.110
that.
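
NOTE
Editor's sketch of the arithmetic for 27 equally likely symbols: a fixed-length
code needs ceil(log2(27)) binary digits, while the entropy is log2(27) bits.
The variable names are assumptions for illustration.
```python
import math
num_symbols = 27  # 26 letters plus space
entropy_equally_likely = math.log2(num_symbols)
fixed_length_digits = math.ceil(entropy_equally_likely)
print(round(entropy_equally_likely, 3), fixed_length_digits)
```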

00:24:59.110 --> 00:25:02.310
But actually, not all
symbols are equally likely.

00:25:02.310 --> 00:25:04.700
If you look at a typical
distribution of frequencies--

00:25:04.700 --> 00:25:07.010
and we saw this
with Morse already.

00:25:07.010 --> 00:25:12.660
The E is much more common than
T, than A, O, I, N and so on.

00:25:12.660 --> 00:25:16.160
So there is a
distribution to this.

00:25:16.160 --> 00:25:20.300
But you can take account of
that distribution and compute

00:25:20.300 --> 00:25:23.960
the associated entropy, and
you get something a little bit

00:25:23.960 --> 00:25:27.590
smaller, 4.177 instead of
the 4.7-something that we

00:25:27.590 --> 00:25:29.480
had before.

00:25:29.480 --> 00:25:31.610
Because not all letters
are equally likely.

00:25:31.610 --> 00:25:35.720
But this is still thinking
of it symbol by symbol,

00:25:35.720 --> 00:25:38.390
not recognizing
dependence over time.

00:25:41.640 --> 00:25:45.230
But English and other
languages are full of context.

00:25:45.230 --> 00:25:45.730
Right?

00:25:45.730 --> 00:25:50.260
If you know the preceding
part of the text,

00:25:50.260 --> 00:25:55.100
you have a very good way
to guess the next letter.

00:25:55.100 --> 00:25:57.070
Nothing can be said to
be certain except death

00:25:57.070 --> 00:25:58.528
and-- well, you
can-- in this case,

00:25:58.528 --> 00:26:00.260
you can give me the
next three letters.

00:26:00.260 --> 00:26:01.330
Right?

00:26:01.330 --> 00:26:02.060
Anyone?

00:26:02.060 --> 00:26:02.430
AUDIENCE: It's taxes.

00:26:02.430 --> 00:26:03.388
PROFESSOR: Taxes, yeah.

00:26:07.440 --> 00:26:10.520
So even though X
taken in isolation

00:26:10.520 --> 00:26:12.740
has a very low
probability of occurrence,

00:26:12.740 --> 00:26:15.290
if you look at the histogram
on the previous page,

00:26:15.290 --> 00:26:19.190
you see that the
probability is 0.0017.

00:26:19.190 --> 00:26:21.273
Letters are not
independently generated.

00:26:21.273 --> 00:26:23.690
Now, it turns out Shannon was
actually one of the earliest

00:26:23.690 --> 00:26:27.300
to study this in
experiments on his wife.

00:26:27.300 --> 00:26:30.680
He had her-- he presented
her with bits of text

00:26:30.680 --> 00:26:32.145
from one particular
book and asked

00:26:32.145 --> 00:26:33.770
her to guess the next
letter and so on.

00:26:33.770 --> 00:26:37.820
And he had a 1951 paper
that actually launched

00:26:37.820 --> 00:26:39.680
a lot of this, because
he had developed now

00:26:39.680 --> 00:26:42.170
the tools for talking about it.

00:26:42.170 --> 00:26:45.100
His estimate was much lower
than the 4-point-something.

00:26:45.100 --> 00:26:49.280
It was more in the vicinity
of one bit, 1 to 1.5 bits.

00:26:49.280 --> 00:26:55.100
So there's a lot of compression
possible with English text

00:26:55.100 --> 00:26:57.510
because there's this kind
of a dependence here.

00:27:06.500 --> 00:27:09.250
And just to tell you
what it is that we're

00:27:09.250 --> 00:27:11.980
trying to compute when
we compute entropy

00:27:11.980 --> 00:27:14.200
for these long
sequences of symbols,

00:27:14.200 --> 00:27:18.700
we're sort of saying what's
the joint entropy of a sequence

00:27:18.700 --> 00:27:24.440
of K symbols divided by K in the
limit of K going to infinity.

00:27:24.440 --> 00:27:27.490
So this is what you
might call H under bar.

00:27:27.490 --> 00:27:28.990
It's not over bar
because I couldn't

00:27:28.990 --> 00:27:31.030
see how to do an over
bar on my PowerPoint.

00:27:31.030 --> 00:27:34.897
But it's usually an
over bar in the books.

00:27:34.897 --> 00:27:36.730
But this is really the
object that you would

00:27:36.730 --> 00:27:38.250
like to get your hands on.

00:27:38.250 --> 00:27:41.470
For sequential text
that has context in it,

00:27:41.470 --> 00:27:43.510
this is the kind of
entropy that you really

00:27:43.510 --> 00:27:46.480
would like to be working with.
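
NOTE
Written out, the quantity described in words above can be sketched as
follows. The symbols X_1, ..., X_K for the successive source symbols are
notation assumed here, not taken from the slide.
```latex
\underline{H} \;=\; \lim_{K \to \infty} \frac{1}{K}\, H(X_1, X_2, \ldots, X_K)
```
If the symbols were independent and identically distributed, the joint
entropy would be K times the single-symbol entropy, and this would reduce
to the symbol-by-symbol entropy.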

00:27:46.480 --> 00:27:47.800
OK.

00:27:47.800 --> 00:27:51.130
So that brings us to
an approach to coding

00:27:51.130 --> 00:27:53.092
that's really focused--

00:27:53.092 --> 00:27:54.550
coding or compression
that's really

00:27:54.550 --> 00:27:56.180
focused on sequential text.

00:27:56.180 --> 00:28:00.740
And this is the Lempel-Ziv-Welch
algorithm that's in the notes.

00:28:00.740 --> 00:28:03.610
Turns out that Lempel
and Ziv or Ziv and Lempel

00:28:03.610 --> 00:28:05.860
had two earlier papers.

00:28:05.860 --> 00:28:09.640
And then Welch improved
on it in an '84 paper.

00:28:09.640 --> 00:28:12.700
And what's in blue over
there is a bit of a mouthful.

00:28:12.700 --> 00:28:14.840
And each word kind
of means something,

00:28:14.840 --> 00:28:17.320
so I thought I'd
list it all there.

00:28:17.320 --> 00:28:19.600
Maybe I've used too
many of these words--

00:28:19.600 --> 00:28:23.470
universal lossless compression
of sequential or streaming data

00:28:23.470 --> 00:28:25.630
by adaptive variable
length coding.

00:28:25.630 --> 00:28:29.580
And I'll come to talk about
those terms on the next slide.

00:28:29.580 --> 00:28:32.500
And it turns out that this is
a very widely used compression

00:28:32.500 --> 00:28:35.050
algorithm for all
sorts of files.

00:28:35.050 --> 00:28:36.880
Sometimes it's for a part of it.

00:28:36.880 --> 00:28:38.650
Sometimes it's optional.

00:28:38.650 --> 00:28:40.300
Sometimes it's
combined with Huffman,

00:28:40.300 --> 00:28:43.870
but all of these things
that do compression

00:28:43.870 --> 00:28:48.880
pay homage to Lempel
and Ziv at least.

00:28:48.880 --> 00:28:50.260
They were also patented.

00:28:50.260 --> 00:28:54.565
Actually, Unisys owned the
patent on LZW for many years.

00:28:54.565 --> 00:28:55.690
These have all expired now.

00:28:55.690 --> 00:29:00.370
But while the patents were held,
it made for a lot of heartburn

00:29:00.370 --> 00:29:02.697
because there were
things being done

00:29:02.697 --> 00:29:04.780
without knowledge of the
existence of the patents.

00:29:04.780 --> 00:29:09.230
And then people got hit
with lawsuits and so on.

00:29:09.230 --> 00:29:14.200
Jacob Ziv, again part of this
incredible heritage from MIT

00:29:14.200 --> 00:29:17.140
of people working here in
the early days of information

00:29:17.140 --> 00:29:17.950
theory.

00:29:17.950 --> 00:29:20.560
He was a graduate student here
at the same time as Huffman

00:29:20.560 --> 00:29:23.500
and many other people whose
names surface in all of this.

00:29:26.470 --> 00:29:29.020
I was actually at an award
ceremony of the IEEE,

00:29:29.020 --> 00:29:32.870
where Lempel got an award
for his compression work.

00:29:32.870 --> 00:29:36.980
And people were given a whole
minute for a thank you speech,

00:29:36.980 --> 00:29:37.990
a mini thank you speech.

00:29:37.990 --> 00:29:41.860
And everyone took their minute
to mention this person and that

00:29:41.860 --> 00:29:43.820
and talk about the
origins of the work.

00:29:43.820 --> 00:29:45.403
It's a lot to say
in a minute but they

00:29:45.403 --> 00:29:47.440
managed to convey a lot.

00:29:47.440 --> 00:29:49.582
Lempel came up and
said, "thank you."

00:29:49.582 --> 00:29:51.220
[LAUGHTER]

00:29:51.220 --> 00:29:53.220
It seemed kind of fitting
for someone whose life

00:29:53.220 --> 00:29:54.303
is devoted to compression.

00:29:54.303 --> 00:29:57.790
[LAUGHTER]

00:29:57.790 --> 00:30:00.610
I just couldn't help
but crack up there.

00:30:00.610 --> 00:30:04.060
That was-- all right.

00:30:04.060 --> 00:30:05.590
Now the interesting
thing about this

00:30:05.590 --> 00:30:09.112
is that there are
theoretical guarantees

00:30:09.112 --> 00:30:10.570
that, under
appropriate assumptions

00:30:10.570 --> 00:30:14.290
on the source, asymptotically,
this process will

00:30:14.290 --> 00:30:16.630
attain that bound.

00:30:16.630 --> 00:30:21.160
Now the thing is, the word
asymptotically hides many sins.

00:30:21.160 --> 00:30:23.490
Lots of things happen
at infinity that

00:30:23.490 --> 00:30:25.802
are not supposed to happen.

00:30:25.802 --> 00:30:27.760
Or lots of things happen
at infinity that never

00:30:27.760 --> 00:30:29.290
happen when you're watching.

00:30:29.290 --> 00:30:31.870
So the theoretical
performance perhaps

00:30:31.870 --> 00:30:34.720
is not as important as the fact
that it works exceedingly well

00:30:34.720 --> 00:30:36.565
in practice.

00:30:36.565 --> 00:30:38.440
So we're going to talk
a little bit about it.

00:30:38.440 --> 00:30:39.910
You've got a lab on it as well.

00:30:44.280 --> 00:30:48.050
So let me just say a little bit
about what these words mean.

00:30:48.050 --> 00:30:50.150
So this is universal
in the sense

00:30:50.150 --> 00:30:51.830
that it doesn't
necessarily-- it doesn't

00:30:51.830 --> 00:30:54.230
need any knowledge of
the particular statistics

00:30:54.230 --> 00:30:55.730
of the source that
it's compressing.

00:30:55.730 --> 00:30:59.830
It's willing to try
its hand at anything.

00:30:59.830 --> 00:31:01.790
OK?

00:31:01.790 --> 00:31:04.040
So it doesn't need to know
the source statistics.

00:31:04.040 --> 00:31:05.990
It actually learns
the source statistics

00:31:05.990 --> 00:31:09.650
in the course of
implementing the algorithm.

00:31:09.650 --> 00:31:11.870
And it does that by
actually building up

00:31:11.870 --> 00:31:14.300
a dictionary for
strings of symbols

00:31:14.300 --> 00:31:16.790
that it discovers
in the source text.

00:31:16.790 --> 00:31:21.500
So it's built around
construction of a dictionary.

00:31:21.500 --> 00:31:23.960
What it then does is
it compresses the text,

00:31:23.960 --> 00:31:27.950
not to things that we've
seen here in Huffman,

00:31:27.950 --> 00:31:29.800
but to actually
dictionary entries.

00:31:29.800 --> 00:31:32.210
So it's sort of like
Morse's original idea,

00:31:32.210 --> 00:31:34.850
which was communicate the
address in the dictionary

00:31:34.850 --> 00:31:36.660
rather than communicating
the word itself

00:31:36.660 --> 00:31:38.900
or some compressed
version of the word.

00:31:38.900 --> 00:31:40.712
So it compresses the
text to sequences

00:31:40.712 --> 00:31:42.920
of dictionary addresses,
and those are the code words

00:31:42.920 --> 00:31:46.650
that it sends to the receiver.

00:31:46.650 --> 00:31:49.380
It's also a variable
length compression scheme.

00:31:49.380 --> 00:31:51.660
But it's interesting
that it doesn't

00:31:51.660 --> 00:31:56.150
take a fixed length of symbols
to varying lengths of code

00:31:56.150 --> 00:31:56.650
words.

00:31:56.650 --> 00:31:58.140
It actually takes
varying lengths

00:31:58.140 --> 00:32:00.030
of symbols to fixed
length of code words.

00:32:00.030 --> 00:32:01.680
So it's a little bit backwards.

00:32:01.680 --> 00:32:05.700
But it's still a variable
length in that sense.

00:32:05.700 --> 00:32:09.150
So the way this works is that
the sender and the receiver

00:32:09.150 --> 00:32:12.210
start off with a core dictionary
that they both agreed on.

00:32:12.210 --> 00:32:17.790
And for our
illustrations, we might

00:32:17.790 --> 00:32:20.100
say that they've
agreed on the letters A

00:32:20.100 --> 00:32:30.390
through Z, lowercase
A through Z.

00:32:30.390 --> 00:32:33.390
So what they have is these
letters or this core dictionary

00:32:33.390 --> 00:32:35.340
stored in some register.

00:32:35.340 --> 00:32:39.280
Well, actually let me show
you what it might look like.

00:32:39.280 --> 00:32:42.690
So there's the register
with, let's say,

00:32:42.690 --> 00:32:45.210
you have an 8-bit table.

00:32:45.210 --> 00:32:47.490
This is the dictionary
that you have at both ends.

00:32:47.490 --> 00:32:50.550
So you can store 256
different things.

00:32:50.550 --> 00:32:54.000
And you've both agreed on what's
going to go into those slots.

00:32:54.000 --> 00:32:56.700
So somewhere-- I think
it's slot 97 in one

00:32:56.700 --> 00:32:59.820
of these particular codes,
you've got the letter A.

00:32:59.820 --> 00:33:01.920
And somewhere else
you've got B, and so on.

00:33:01.920 --> 00:33:03.570
Or the next position
you've got B.

00:33:03.570 --> 00:33:06.840
You can store a bunch
of standard symbols.

00:33:06.840 --> 00:33:09.090
So we'll consider that
all the single letter

00:33:09.090 --> 00:33:14.580
symbols are already stored
in designated positions

00:33:14.580 --> 00:33:17.880
in this dictionary that's known
to the sender and the receiver.

00:33:17.880 --> 00:33:24.180
So if the sender just
sends 252, the receiver

00:33:24.180 --> 00:33:27.030
knows what 252 refers
to because they've

00:33:27.030 --> 00:33:29.963
got that core dictionary
that they've agreed on.

00:33:29.963 --> 00:33:31.380
Some of the text
here, by the way,

00:33:31.380 --> 00:33:34.710
is stuff I've said already.

00:33:34.710 --> 00:33:35.850
So I'll actually go back.

00:33:48.260 --> 00:33:51.470
And then what happens is
that the source starts

00:33:51.470 --> 00:33:54.650
to sequentially scan
the text that's arriving

00:33:54.650 --> 00:34:01.160
or that it's looking at and
puts new strings that it's

00:34:01.160 --> 00:34:05.750
found into new
locations in this table

00:34:05.750 --> 00:34:09.199
and then communicates the
address for the receiver.

00:34:09.199 --> 00:34:12.380
The magic of this-- and I mean
it's fiendishly clever, very

00:34:12.380 --> 00:34:17.030
simple, but very clever, is
that the receiver can build up

00:34:17.030 --> 00:34:21.895
its dictionary in tandem with
the transmitter building up

00:34:21.895 --> 00:34:22.520
the dictionary.

00:34:22.520 --> 00:34:24.360
It's just a one-step delay.

00:34:24.360 --> 00:34:25.940
So one step later,
the receiver has

00:34:25.940 --> 00:34:30.060
figured out what that
dictionary entry is.

00:34:30.060 --> 00:34:34.400
So the transmitter or the source
is building up the dictionary,

00:34:34.400 --> 00:34:39.380
looking at strings in
the input sequence,

00:34:39.380 --> 00:34:42.080
communicating the address--

00:34:42.080 --> 00:34:46.250
the addresses of the appropriate
strings to the receiver,

00:34:46.250 --> 00:34:48.843
and the receiver is building
up a dictionary in parallel.

00:34:48.843 --> 00:34:50.510
Now I think the easiest
way to do this--

00:34:50.510 --> 00:34:52.760
there's discussion in the text.

00:34:52.760 --> 00:34:54.050
There's also code fragments.

00:34:54.050 --> 00:34:55.467
But I think the
easiest way for me

00:34:55.467 --> 00:34:57.980
to try and do this is to
actually just show you

00:34:57.980 --> 00:35:01.900
how it works on a
particular sequence.

00:35:04.450 --> 00:35:07.290
And you may not get all
the details all at once.

00:35:07.290 --> 00:35:09.810
I do have a little animation
that I need to tweak a bit,

00:35:09.810 --> 00:35:10.780
and I'll--

00:35:10.780 --> 00:35:12.930
well, it's not an animation,
but a set of slides

00:35:12.930 --> 00:35:15.030
that'll help you
understand, actually,

00:35:15.030 --> 00:35:16.990
this particular example.

00:35:16.990 --> 00:35:18.510
So I'll have that
posted as well.

00:35:22.022 --> 00:35:23.730
But for now, let's
just work through this

00:35:23.730 --> 00:35:27.520
and see what it looks like.

00:35:27.520 --> 00:35:30.250
And I hope I don't trip
over myself in the process.

00:35:30.250 --> 00:35:31.617
I hope you'll be forgiving.

00:35:41.150 --> 00:35:42.900
And I need these two
blackboards to do it.

00:35:48.020 --> 00:35:48.520
OK.

00:35:51.570 --> 00:35:52.820
And I need some colored chalk.

00:35:56.320 --> 00:36:01.670
So what I'm going to have
over here is the source.

00:36:01.670 --> 00:36:04.030
And over here is the receiver.

00:36:09.720 --> 00:36:16.590
And the source wants to send
a message that I'll put here--

00:36:16.590 --> 00:36:27.955
A-B-C. This is going to
look incredibly boring.

00:36:31.030 --> 00:36:33.620
But the algorithm does different
things at different stages,

00:36:33.620 --> 00:36:35.470
so that keeps it interesting.

00:36:35.470 --> 00:36:38.800
And let's see 1, 2, 3, 4, 5.

00:36:38.800 --> 00:36:42.937
And then we hit a special case
somewhere near the end here

00:36:42.937 --> 00:36:44.020
that is worth sorting out.

00:36:44.020 --> 00:36:45.730
Because otherwise
the fragment

00:36:45.730 --> 00:36:47.900
of the code that you
see doesn't make sense.

00:36:47.900 --> 00:36:53.080
Gee, can you believe that I
want to start this again here?

00:36:53.080 --> 00:36:53.740
Sorry.

00:36:53.740 --> 00:36:55.360
Let's start here.

00:36:55.360 --> 00:36:57.385
I want at least six
replications of ABC.

00:37:04.823 --> 00:37:06.240
I want you to get
comfortable also

00:37:06.240 --> 00:37:07.590
so you can settle into this.

00:37:11.900 --> 00:37:16.100
OK, here we go.

00:37:16.100 --> 00:37:18.630
All right.

00:37:18.630 --> 00:37:21.330
The receiver has no idea
that this is the sequence.

00:37:21.330 --> 00:37:23.550
The source and
the receiver both

00:37:23.550 --> 00:37:27.420
have A through Z sitting
in their dictionary

00:37:27.420 --> 00:37:30.970
at designated locations.

00:37:30.970 --> 00:37:37.649
So the source will
first see the letter A

00:37:37.649 --> 00:37:40.160
and does nothing because
A is in its dictionary.

00:37:40.160 --> 00:37:43.010
It doesn't want to
do anything yet.

00:37:43.010 --> 00:37:44.150
Then it looks at--

00:37:44.150 --> 00:37:46.775
it pulls in B. So now
it's looking at AB.

00:37:46.775 --> 00:37:49.700
AB is not in its dictionary
because it's a symbol of--

00:37:49.700 --> 00:37:52.410
it's a string of two symbols.

00:37:52.410 --> 00:37:55.550
So now it knows it needs
to make a dictionary entry.

00:37:55.550 --> 00:37:58.560
I'm going to indicate
dictionary entry with this.

00:37:58.560 --> 00:38:02.660
So the source is going to
make a dictionary entry of AB.

00:38:02.660 --> 00:38:05.420
So what this means is
somewhere in that register

00:38:05.420 --> 00:38:07.760
in a particular position,
or in the next position

00:38:07.760 --> 00:38:13.490
actually from the agreed on
table, it sticks in this.

00:38:13.490 --> 00:38:18.980
And then what it transmits
to the receiver is not this,

00:38:18.980 --> 00:38:25.260
but the code for A. OK?

00:38:25.260 --> 00:38:29.280
So it enters the longer fragment
here as a new dictionary word

00:38:29.280 --> 00:38:34.580
and sends the address for the
piece that the receiver sees.

00:38:34.580 --> 00:38:36.120
So what does the receiver get?

00:38:36.120 --> 00:38:40.140
The receiver sees A
coming in and says, OK,

00:38:40.140 --> 00:38:44.530
that's the sequence
A. That's the symbol.

00:38:44.530 --> 00:38:46.190
A, I'm all set.

00:38:46.190 --> 00:38:48.930
All right?

00:38:48.930 --> 00:38:54.420
Now what happens
is that the source

00:38:54.420 --> 00:38:56.190
pulls in the next letter.

00:38:56.190 --> 00:38:59.040
It's done with the A, so you can
essentially forget about that.

00:38:59.040 --> 00:39:00.900
It pulls in the next letter.

00:39:00.900 --> 00:39:05.280
Looks to see if it's got
B-C in its dictionary.

00:39:05.280 --> 00:39:08.880
It doesn't have BC because it
only has single letter entries,

00:39:08.880 --> 00:39:10.280
and it has AB.

00:39:10.280 --> 00:39:12.660
So it's got to put in BC.

00:39:12.660 --> 00:39:14.670
So it's going to put
in an entry for BC.

00:39:19.880 --> 00:39:28.170
And then what it's going
to transmit is the B.

00:39:28.170 --> 00:39:33.900
The receiver gets
the B. Oh, sorry--

00:39:33.900 --> 00:39:37.620
the dictionary entry for B. And
so it knows that's the letter

00:39:37.620 --> 00:39:42.990
B. And now it enters AB in its--

00:39:47.670 --> 00:39:51.010
in its dictionary, OK,
in the next location.

00:39:51.010 --> 00:39:52.920
So you see, with
a one-step delay,

00:39:52.920 --> 00:39:55.350
the AB that was in
the dictionary here

00:39:55.350 --> 00:39:57.590
has ended up in the
dictionary of the receiver.

00:40:00.740 --> 00:40:03.770
OK, we're done with this.

00:40:03.770 --> 00:40:06.830
We now pull in the
next letter here.

00:40:06.830 --> 00:40:09.140
That's A. We haven't seen A--

00:40:09.140 --> 00:40:10.950
we haven't seen CA
in our dictionary.

00:40:10.950 --> 00:40:19.550
So we make an entry for CA,
ship out C. C comes here.

00:40:23.700 --> 00:40:29.100
I should say that this was done
with the A. The C comes here,

00:40:29.100 --> 00:40:34.020
and the receiver knows
to make an entry for BC.

00:40:38.630 --> 00:40:39.890
So with one delay it's got it.

00:40:43.160 --> 00:40:44.675
OK, we're done with this.

00:40:48.910 --> 00:40:51.670
We pull in the next letter, AB.

00:40:51.670 --> 00:40:52.910
That's in our dictionary.

00:40:52.910 --> 00:40:54.880
So we keep going, all right?

00:40:54.880 --> 00:40:59.740
So this algorithm doesn't look
to ship out the dictionary

00:40:59.740 --> 00:41:02.718
address every time it sees a
sequence that it recognizes.

00:41:02.718 --> 00:41:04.510
If it's got this already
in its dictionary,

00:41:04.510 --> 00:41:07.420
it keeps going to try
and learn a new word.

00:41:07.420 --> 00:41:09.430
So it's already got AB
there, so it keeps going

00:41:09.430 --> 00:41:13.610
and it pulls in C. And
now that's a new word.

00:41:13.610 --> 00:41:20.170
So it's got ABC as a new entry.

00:41:20.170 --> 00:41:23.050
It ships out AB--

00:41:23.050 --> 00:41:24.280
the address for AB rather.

00:41:30.770 --> 00:41:37.650
This gets the address for AB,
which is in its dictionary.

00:41:37.650 --> 00:41:39.105
It puts the AB down there.

00:41:41.760 --> 00:41:43.650
It takes the first
letter of the string that

00:41:43.650 --> 00:41:45.280
came in and appends
it to the last one

00:41:45.280 --> 00:41:46.905
that it had there
and gives you the CA.

00:41:50.730 --> 00:41:53.340
So you see, it's keeping up
but with a one-step delay.

00:41:56.220 --> 00:41:57.120
Let's keep going.

00:41:57.120 --> 00:42:00.590
So the AB is done with.

00:42:00.590 --> 00:42:04.234
We pull in A. We've got CA.

00:42:04.234 --> 00:42:08.260
We pull in the B.
We don't have CAB,

00:42:08.260 --> 00:42:09.590
so let's enter that as well.

00:42:12.120 --> 00:42:14.120
By the time we've done
this example, by the way,

00:42:14.120 --> 00:42:17.030
I'm hoping you'll
know Lempel-Ziv.

00:42:17.030 --> 00:42:18.770
So bear with me.

00:42:21.680 --> 00:42:23.260
All right, dictionary
entry-- and now

00:42:23.260 --> 00:42:25.895
what does it send
out to the receiver?

00:42:25.895 --> 00:42:26.770
AUDIENCE: [INAUDIBLE]

00:42:26.770 --> 00:42:27.478
PROFESSOR: Sorry.

00:42:27.478 --> 00:42:28.120
AUDIENCE: CA?

00:42:28.120 --> 00:42:35.030
PROFESSOR: CA-- the
address for CA, right?

00:42:35.030 --> 00:42:36.055
The address for CA.

00:42:36.055 --> 00:42:37.700
So the address for CA comes in.

00:42:40.330 --> 00:42:42.160
It decodes the CA.

00:42:48.450 --> 00:42:49.740
And so let's see.

00:42:49.740 --> 00:42:51.780
We're done with these
pieces, but this one

00:42:51.780 --> 00:42:56.670
has to build up its new
dictionary entry.

00:42:56.670 --> 00:42:59.460
And so what it's got is
the AB sitting from before,

00:42:59.460 --> 00:43:01.650
and it pulls in
the first letter.

00:43:01.650 --> 00:43:03.240
Instead of wrapping
to the next board,

00:43:03.240 --> 00:43:05.970
let me start winding up again--

00:43:05.970 --> 00:43:06.690
winding upwards.

00:43:09.650 --> 00:43:13.240
OK, so that's the new
entry there, the receiver--

00:43:13.240 --> 00:43:14.470
one step delayed from here.

00:43:18.660 --> 00:43:22.210
OK, I pull in the C. I have BC.

00:43:22.210 --> 00:43:24.140
I keep going.

00:43:24.140 --> 00:43:28.240
I pull in the A.
I don't see that.

00:43:28.240 --> 00:43:29.015
So I need BCA.

00:43:33.650 --> 00:43:35.480
I ship out the address for BC.

00:43:38.250 --> 00:43:40.580
So I'm done with these.

00:43:40.580 --> 00:43:42.260
I get the address for BC here.

00:43:46.130 --> 00:43:50.120
I decode and get BC.

00:43:50.120 --> 00:43:53.405
I combine the first
letter of the new fragment

00:43:53.405 --> 00:43:54.530
with what was sitting here.

00:43:54.530 --> 00:44:03.300
So I get CAB as my
dictionary entry.

00:44:06.720 --> 00:44:08.400
And I keep going.

00:44:08.400 --> 00:44:10.565
All right, it's very systematic.

00:44:10.565 --> 00:44:12.190
I'm going to keep
going because there's

00:44:12.190 --> 00:44:15.820
a special case that will trip
you up if you don't get to it.

00:44:15.820 --> 00:44:19.300
And we need to proceed
a couple more here.

00:44:19.300 --> 00:44:23.680
OK, I pull in the
B. I've got AB.

00:44:23.680 --> 00:44:25.525
I pull in the C. I've got ABC.

00:44:29.580 --> 00:44:32.920
I pull in the A.
I don't have ABCA.

00:44:32.920 --> 00:44:40.327
So I enter that
in my dictionary.

00:44:43.430 --> 00:44:44.430
And then I ship out ABC.

00:44:54.200 --> 00:44:57.980
OK, so you're always building
a new word, entering it

00:44:57.980 --> 00:45:00.643
in your dictionary, and
then the part that's already

00:45:00.643 --> 00:45:02.060
known you're
shipping out and then

00:45:02.060 --> 00:45:04.367
hanging onto the
end of this to start

00:45:04.367 --> 00:45:05.450
building the new fragment.
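
NOTE
The source side of the walkthrough can be sketched as below. This is a
textbook-style LZW encoder and not necessarily the fragment in the course
notes. The name lzw_encode, the 256-entry starting table, and new addresses
starting at 256 with no cap on table size are all assumptions.
```python
def lzw_encode(text):
    # Core dictionary agreed on in advance: one entry per single character.
    table = {chr(i): i for i in range(256)}
    next_address = 256
    current = ""
    addresses = []
    for ch in text:
        candidate = current + ch
        if candidate in table:
            # Known string: keep extending it to try to learn a longer word.
            current = candidate
        else:
            # Ship the address of the known part, then enter the new word.
            addresses.append(table[current])
            table[candidate] = next_address
            next_address += 1
            # Hang on to the last symbol to start building the next fragment.
            current = ch
    if current:
        # Flush whatever is still pending when the input ends.
        addresses.append(table[current])
    return addresses
print(lzw_encode("abc" * 6))
```
On the blackboard message this ships the addresses for a, b, c, ab, ca, bc,
abc, abca, and a final bc when the input ends (that last flush is not shown
on the board).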

00:45:08.150 --> 00:45:09.350
ABC arrives here.

00:45:17.140 --> 00:45:18.710
I had the BC from before.

00:45:18.710 --> 00:45:21.890
I pull in the first
letter of that,

00:45:21.890 --> 00:45:28.160
and I get a BCA as my new
entry, which is this one.

00:45:32.080 --> 00:45:33.220
OK.

00:45:33.220 --> 00:45:35.140
Now we pull in the AB.

00:45:35.140 --> 00:45:36.940
I mean, we pull in
the B. We have AB.

00:45:36.940 --> 00:45:40.000
We pull in the C. We have ABC.

00:45:40.000 --> 00:45:47.770
We pull in the A, we have
ABCA, so we pull in the B.

00:45:47.770 --> 00:45:50.135
We ship out ABCA--

00:45:50.135 --> 00:45:55.950
A-B-C-A. Right?

00:45:55.950 --> 00:45:59.620
And now we're done
with all those guys.

00:45:59.620 --> 00:46:00.650
And here comes ABCA.

00:46:06.370 --> 00:46:11.910
And I go to my dictionary,
and I don't have ABCA--

00:46:14.840 --> 00:46:15.370
big hiccup.

00:46:18.840 --> 00:46:22.190
So the reason that
happened is that I'm

00:46:22.190 --> 00:46:27.080
discovering I need to send
ABCA on the very next step

00:46:27.080 --> 00:46:30.230
after entering it in my
dictionary on the receiver--

00:46:30.230 --> 00:46:32.080
on the transmitter side.

00:46:32.080 --> 00:46:35.990
And so the receiver hasn't
yet had a chance to catch up.

00:46:35.990 --> 00:46:37.700
Now if you analyze
this, it turns out

00:46:37.700 --> 00:46:43.010
that whenever this happens,
the sequence involved

00:46:43.010 --> 00:46:46.980
has its last character equal
to its first character.

00:46:46.980 --> 00:46:50.510
So looking at this,
the dictionary

00:46:50.510 --> 00:46:52.340
here is waiting to build up.

00:46:52.340 --> 00:46:54.050
It's got the ABC
here, and it's waiting

00:46:54.050 --> 00:46:58.370
to pull in the first
letter from the sequence--

00:46:58.370 --> 00:47:00.770
the sequence associated
with this dictionary entry.

00:47:00.770 --> 00:47:02.390
It doesn't have that
dictionary entry.

00:47:02.390 --> 00:47:05.300
So it can't pull in the A
like it was doing all along.

00:47:05.300 --> 00:47:07.740
But if you analyze the cases
under which this happens,

00:47:07.740 --> 00:47:09.980
it turns out that
whenever you don't have it

00:47:09.980 --> 00:47:12.340
in your dictionary
entry, the missing letter

00:47:12.340 --> 00:47:14.090
that you want to pull
into your dictionary

00:47:14.090 --> 00:47:16.760
is the same as the first
one in that string that's

00:47:16.760 --> 00:47:18.320
waiting to be built up.

00:47:18.320 --> 00:47:22.500
So it completes it with
an A, and it's all set.

00:47:22.500 --> 00:47:28.010
Now it says ABCA,
and it continues.

00:47:28.010 --> 00:47:30.200
So this happens under very
particular conditions.

00:47:30.200 --> 00:47:31.250
It's a special case.

00:47:31.250 --> 00:47:36.350
If you actually look at the code
that's in the notes you'll see.

00:47:36.350 --> 00:47:38.700
While the encoding
is straightforward,

00:47:38.700 --> 00:47:41.630
it's really remarkable that
a short fragment like this

00:47:41.630 --> 00:47:43.100
can do that encoding.

00:47:48.600 --> 00:47:49.440
Let's see here.

00:47:51.830 --> 00:47:52.830
I don't want to do this.

00:47:52.830 --> 00:47:53.747
I did another example.

00:48:00.210 --> 00:48:03.410
Let me just say what's on this
before I dispense with it.

00:48:03.410 --> 00:48:05.930
Sorry.

00:48:05.930 --> 00:48:06.530
OK.

00:48:06.530 --> 00:48:12.390
So look at what's happened.

00:48:12.390 --> 00:48:15.090
In terms of the number
of things we've sent,

00:48:15.090 --> 00:48:16.447
we've only sent these addresses.

00:48:16.447 --> 00:48:18.030
And there are fewer
of them than there

00:48:18.030 --> 00:48:19.462
were symbols in the original.

00:48:19.462 --> 00:48:21.170
So that's where the
compression comes in.

00:48:21.170 --> 00:48:25.595
And as you get the longer
strings, the benefit is higher.
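
NOTE
A small tally of the blackboard example, assuming the message is abc
repeated six times and that the source flushes a final bc when the input
ends (the lecture stops at abca). The variable names are illustrative.
```python
# Strings whose addresses the source ships out, in order.
emitted = ["a", "b", "c", "ab", "ca", "bc", "abc", "abca", "bc"]
# Concatenating them should reproduce the original message.
message = "".join(emitted)
symbols_in = len(message)
addresses_out = len(emitted)
print(symbols_in, addresses_out)
```
Whether this is a saving in bits depends on how wide each address is
compared with a raw symbol.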

00:48:25.595 --> 00:48:27.720
Actually, I'm going to pass
this and just tell you,

00:48:27.720 --> 00:48:33.060
when you look through the
code fragment for decoding,

00:48:33.060 --> 00:48:35.070
this is the special case
that we talked about.

00:48:35.070 --> 00:48:37.020
If the code is not
in your dictionary,

00:48:37.020 --> 00:48:38.260
then do such and such.

00:48:38.260 --> 00:48:40.020
So that's the explanation.
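
NOTE
A minimal sketch of the receiver, including the special case. It is a
textbook-style LZW decoder and may differ from the fragment in the notes.
The name lzw_decode, the 256-entry starting table, and new addresses
starting at 256 are assumptions.
```python
def lzw_decode(addresses):
    if not addresses:
        return ""
    # Same core dictionary as the sender, indexed by address.
    table = {i: chr(i) for i in range(256)}
    next_address = 256
    previous = table[addresses[0]]
    pieces = [previous]
    for address in addresses[1:]:
        if address in table:
            entry = table[address]
        elif address == next_address:
            # Special case: the sender used an entry it created one step ago.
            # That entry must be the previous string plus its own first symbol.
            entry = previous + previous[0]
        else:
            raise ValueError("address not valid for this dictionary")
        pieces.append(entry)
        # One-step delay: complete the entry the sender made on its last step.
        table[next_address] = previous + entry[0]
        next_address += 1
        previous = entry
    return "".join(pieces)
print(lzw_decode([97, 98, 99, 256, 258, 257, 259, 262, 257]))
```
In this address list, 262 arrives before the receiver has an entry 262,
which is the abca hiccup from the blackboard.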

00:48:40.020 --> 00:48:41.142
All right.

00:48:41.142 --> 00:48:42.600
And that's described
in the slides.

00:48:42.600 --> 00:48:43.980
We'll put that on.

00:48:43.980 --> 00:48:47.500
I just wanted to end
with a couple of things.

00:48:47.500 --> 00:48:50.160
One is actually-- LZW is a
good example of something

00:48:50.160 --> 00:48:53.280
that you see in other
contexts as well, where you're

00:48:53.280 --> 00:48:55.950
faced with transmitting
data and you decide instead

00:48:55.950 --> 00:48:58.260
that you'll transmit the
model or your best model

00:48:58.260 --> 00:48:59.550
for what generates that data.

00:48:59.550 --> 00:49:02.220
That can often be a much more
efficient way to do things.

00:49:02.220 --> 00:49:04.650
And in fact, when you
speak into your cell phone,

00:49:04.650 --> 00:49:07.290
you're not transmitting
a raw speech waveform.

00:49:07.290 --> 00:49:10.890
There's actually a very
sophisticated code there

00:49:10.890 --> 00:49:13.380
that's modeling your
speech as the output

00:49:13.380 --> 00:49:14.840
of an autoregressive filter.

00:49:14.840 --> 00:49:19.470
And then it sends the filter
tap weights to the receiver.

00:49:19.470 --> 00:49:21.690
So this kind of thing
arises again and again.

00:49:21.690 --> 00:49:24.330
Sending the model and
the little information

00:49:24.330 --> 00:49:26.430
you need to run the model
at the receiving end

00:49:26.430 --> 00:49:28.572
can be much more efficient
than sending the data.

00:49:28.572 --> 00:49:30.030
The other thing is
everything we've

00:49:30.030 --> 00:49:32.590
talked about has been
lossless compression--

00:49:32.590 --> 00:49:33.930
Huffman and LZW.

00:49:33.930 --> 00:49:37.940
You can completely recover
what was compressed.

00:49:37.940 --> 00:49:40.700
But there's a whole world
of lossy compression,

00:49:40.700 --> 00:49:41.700
which is very important.

00:49:41.700 --> 00:49:44.760
And we'll find ways to sneak
in discussion of that as well.

00:49:44.760 --> 00:49:46.820
All right, thank you.