WEBVTT
00:00:00.750 --> 00:00:04.059
This problems presents
Huffman Encoding
00:00:04.059 --> 00:00:06.950
which produces variable-length
encodings based
00:00:06.950 --> 00:00:08.820
on the probabilities
of the occurrence
00:00:08.820 --> 00:00:10.710
of each different choice.
00:00:10.710 --> 00:00:13.870
More likely choices will end
up with shorter encodings
00:00:13.870 --> 00:00:15.010
than less likely choices.
00:00:20.030 --> 00:00:22.160
The goal of this
problem is to produce
00:00:22.160 --> 00:00:26.280
a Huffman code to encode
student choices of majors.
00:00:26.280 --> 00:00:29.930
There are a total of 4 majors,
and each has a probability
00:00:29.930 --> 00:00:32.049
associated with it.
00:00:32.049 --> 00:00:35.790
To produce a Huffman encoding,
one begins with the 2 choices
00:00:35.790 --> 00:00:37.950
with lowest probability.
00:00:37.950 --> 00:00:43.560
In this example, major 6-7,
has a probability of 0.06,
00:00:43.560 --> 00:00:47.080
and major 6-1, has a
probability of 0.09.
00:00:47.080 --> 00:00:51.900
Since these are the two lowest
probabilities, these two
00:00:51.900 --> 00:00:54.350
choices are selected
as the starting point
00:00:54.350 --> 00:00:56.900
for building our encoding tree.
00:00:56.900 --> 00:00:59.400
The root node that
combines the two choices,
00:00:59.400 --> 00:01:02.750
has a probability equal to
the sum of the leaf nodes,
00:01:02.750 --> 00:01:06.440
or 0.15 in this example.
00:01:06.440 --> 00:01:09.170
We then label one side
of this tree with a 0
00:01:09.170 --> 00:01:10.290
and the other with a 1.
00:01:10.290 --> 00:01:14.880
The next step is to find the
next two smallest probabilities
00:01:14.880 --> 00:01:17.270
out of the remaining set
of probabilities where
00:01:17.270 --> 00:01:21.730
majors 6-1 and 6-7 have been
replaced by node A which
00:01:21.730 --> 00:01:25.450
has probability 0.15.
00:01:25.450 --> 00:01:28.180
In this case, our
lowest probabilities
00:01:28.180 --> 00:01:31.940
are 0.15 (which is the
probability of node A)
00:01:31.940 --> 00:01:36.760
and 0.41 (which is the
probability of major 6-3).
00:01:36.760 --> 00:01:41.990
So we create a new node B
that merges nodes A and 6-3.
00:01:41.990 --> 00:01:46.100
This new node has
probability 0.56.
00:01:46.100 --> 00:01:48.800
Again we label the two
branches, one with a 0
00:01:48.800 --> 00:01:50.920
and the other with a 1.
00:01:50.920 --> 00:01:53.690
We now repeat this
process one last time
00:01:53.690 --> 00:01:56.170
with the only two
remaining choices which
00:01:56.170 --> 00:01:59.290
are now node B and major 6-2.
00:01:59.290 --> 00:02:02.390
This means that we should
make a new node C that merges
00:02:02.390 --> 00:02:04.990
node B and major 6-2.
00:02:04.990 --> 00:02:07.030
Note that the
probability of this node
00:02:07.030 --> 00:02:11.550
is 1.0 because we have
reached the top of our tree.
00:02:11.550 --> 00:02:16.290
Our final step is to label
these last two branches.
00:02:16.290 --> 00:02:18.120
Now that all the
branches are labeled,
00:02:18.120 --> 00:02:21.750
we can traverse the tree from
the root node to each leaf node
00:02:21.750 --> 00:02:24.390
in order to identify
the encoding that
00:02:24.390 --> 00:02:28.750
has been assigned to the major
associated with that leaf node.
00:02:28.750 --> 00:02:33.950
We find that for major
6-1 the encoding is 101.
00:02:33.950 --> 00:02:38.770
For major 6-2, we end up
with a 1-bit encoding of 0.
00:02:38.770 --> 00:02:42.570
Next we traverse the tree to
identify the last two encodings
00:02:42.570 --> 00:02:47.590
and find that for major 6-3 the
encoding 11 has been assigned,
00:02:47.590 --> 00:02:53.560
and for major 6-7 the encoding
100 has been assigned.
00:02:53.560 --> 00:02:55.560
These encodings make
sense because we
00:02:55.560 --> 00:02:58.190
expect the major with
the highest probability,
00:02:58.190 --> 00:03:03.110
in this case major 6-2 to end
up with the shortest encoding.
00:03:03.110 --> 00:03:05.070
The next highest
probability major
00:03:05.070 --> 00:03:08.870
is 6-3 so it ends up with
the second shortest encoding,
00:03:08.870 --> 00:03:09.720
and so on.
00:03:12.990 --> 00:03:16.180
We just saw that the encodings
resulting from this Huffman
00:03:16.180 --> 00:03:22.440
encoding tree are: 101 for
major 6-1, a 0 for major 6-2,
00:03:22.440 --> 00:03:28.930
11 for major 6-3, and
100 for major 6-7.
00:03:28.930 --> 00:03:31.950
Note that the Huffman
encoding tree for this problem
00:03:31.950 --> 00:03:35.460
could have also been
drawn like this.
00:03:35.460 --> 00:03:38.390
These two trees are identical
in structure and result
00:03:38.390 --> 00:03:40.570
in the same encodings
for the four majors.
00:03:44.130 --> 00:03:47.420
Furthermore, a Huffman tree
can result in more than one
00:03:47.420 --> 00:03:48.930
valid encoding.
00:03:48.930 --> 00:03:51.500
The only constraint
in labeling the edges
00:03:51.500 --> 00:03:55.730
is that from each node, there is
both a 0 branch and a 1 branch
00:03:55.730 --> 00:03:58.480
but there are no constraints
about which side has to be
00:03:58.480 --> 00:04:01.390
labeled 0 and which side 1.
00:04:01.390 --> 00:04:03.800
So for example, we
could have chosen
00:04:03.800 --> 00:04:06.510
to label the left side
of the B node with a 1
00:04:06.510 --> 00:04:09.830
and the right side with
a 0 instead of the way we
00:04:09.830 --> 00:04:11.760
originally labeled them.
00:04:11.760 --> 00:04:13.880
Note, however, that
this would result
00:04:13.880 --> 00:04:17.209
in a different but also
valid Huffman encoding.
00:04:17.209 --> 00:04:21.600
In this case the encoding
for major 6-1 is 111,
00:04:21.600 --> 00:04:27.150
major 6-2 remains 0,
major 6-3 becomes 10,
00:04:27.150 --> 00:04:30.600
and major 6-7 becomes 110.
00:04:30.600 --> 00:04:34.300
As long as one
maintains consistency
00:04:34.300 --> 00:04:37.860
across the selected tree,
a valid Huffman encoding
00:04:37.860 --> 00:04:40.650
is produced.
00:04:40.650 --> 00:04:44.430
We now add one more column
to our table which gives p *
00:04:44.430 --> 00:04:48.850
log2(1/p) for each
of the majors.
00:04:48.850 --> 00:04:52.110
Using this information we
can calculate the entropy
00:04:52.110 --> 00:04:54.180
which is the average
amount of information
00:04:54.180 --> 00:04:56.360
contained in each message.
00:04:56.360 --> 00:05:02.850
This is calculated by taking
the sum of p * log2(1/p) for all
00:05:02.850 --> 00:05:04.750
choices of majors.
00:05:04.750 --> 00:05:07.150
For this problem,
the entropy is 1.6.
00:05:07.150 --> 00:05:11.020
We can now also calculate
the average bits
00:05:11.020 --> 00:05:14.430
per major of the encodings
that we have identified.
00:05:14.430 --> 00:05:16.210
This is calculated
by multiplying
00:05:16.210 --> 00:05:19.580
the number of bits in each
encoding times the probability
00:05:19.580 --> 00:05:20.980
of that major.
00:05:20.980 --> 00:05:23.310
Recall that our
encoding for major 6-1
00:05:23.310 --> 00:05:29.110
was 111, for major 6-2 it was
0, for major 6-3 it was 10,
00:05:29.110 --> 00:05:34.190
and finally for
major 6-7 it was 110.
00:05:34.190 --> 00:05:37.600
This means that the average
bits per major is 3 times
00:05:37.600 --> 00:05:41.020
0.09 plus 1 times
0.44 plus 2 times
00:05:41.020 --> 00:05:47.350
0.41 plus 3 times 0.06 =
0.27 plus 0.44 plus 0.82
00:05:47.350 --> 00:05:50.150
plus 0.18 which equals 1.71.
00:05:50.150 --> 00:05:56.820
Note that this is
slightly larger
00:05:56.820 --> 00:06:03.150
than the entropy which is 1.6.
00:06:03.150 --> 00:06:05.730
This occurs because
while Huffman encoding is
00:06:05.730 --> 00:06:08.380
an efficient encoding which
gets us most of the way
00:06:08.380 --> 00:06:11.510
there, there are still
some inefficiencies present
00:06:11.510 --> 00:06:14.330
in Huffman encoding
because it is only
00:06:14.330 --> 00:06:18.440
encoding one major at a time
rather than also considering
00:06:18.440 --> 00:06:20.760
the probabilities
associated with seeing
00:06:20.760 --> 00:06:24.050
a particular sequence of
majors in a message that
00:06:24.050 --> 00:06:28.471
conveys the major selected for
a large number of students.