WEBVTT
00:00:01.630 --> 00:00:07.820
Next we turn our attention to encoding data
as sequences of 0's and 1's, a string of bits.
00:00:07.820 --> 00:00:12.540
An encoding is an unambiguous mapping between
bit strings and the members of the set of
00:00:12.540 --> 00:00:14.410
data to be encoded.
00:00:14.410 --> 00:00:20.140
For example, suppose we have a set of four
symbols -- A, B, C, and D – and we want
00:00:20.140 --> 00:00:26.710
to use bit strings to encode messages constructed
of these symbols, for example "ABBA".
00:00:26.710 --> 00:00:32.189
If we choose to encode the message one character
at a time, our encoding would assign a unique
00:00:32.189 --> 00:00:34.800
bit string to each symbol.
00:00:34.800 --> 00:00:39.010
Since we have four symbols, we might choose
a unique two-bit string for each: "A" could
00:00:39.010 --> 00:00:45.890
be "00", B = "01", C = "10", and D = "11".
00:00:45.890 --> 00:00:50.460
This would be called a "fixed-length encoding"
since the bit strings used to represent the
00:00:50.460 --> 00:00:52.930
symbols all have the same length.
00:00:52.930 --> 00:01:00.610
The encoding for the message "ABBA" would
be "00-01-01-00".
00:01:00.610 --> 00:01:05.899
And we can run the process backwards: given
a bit string and the encoding key, we can
00:01:05.899 --> 00:01:12.270
look up the next bits in the bit string, using
the key to determine the symbol they represent.
00:01:12.270 --> 00:01:18.340
"00" would be decoded as "A", "01" as B and
so on.
00:01:18.340 --> 00:01:24.249
We can also use bit strings of different lengths
to encode symbols -- this is called a variable-length
00:01:24.249 --> 00:01:25.759
encoding.
00:01:25.759 --> 00:01:32.399
So A could be "01", B = "1", C = "000" and
D = "001".
00:01:32.399 --> 00:01:39.289
"ABBA" would be encoded as "01-1-1-01".
00:01:39.289 --> 00:01:43.990
We'll see that carefully constructed variable-length
encodings are useful for the efficient encoding
00:01:43.990 --> 00:01:48.740
of messages where the symbols occur with different
probabilities.
00:01:48.740 --> 00:01:52.049
We have to be careful that the encoding is
unambiguous!
00:01:52.049 --> 00:01:59.880
Suppose we had decided to encode A as "0",
B as "1", C as "10" and D as "11".
00:01:59.880 --> 00:02:04.659
The encoding for "ABBA" would be "0-1-1-0".
00:02:04.659 --> 00:02:09.850
Looking good since that encoding is shorter
than either of the previous two encodings.
00:02:09.850 --> 00:02:13.820
Now let's try to decode this bit string -- oops.
00:02:13.820 --> 00:02:19.100
Using the encoding key, we can unfortunately
arrive at several decodings: "ABBA" of course,
00:02:19.100 --> 00:02:24.380
but also "ADA" or "ABC" depending on how we
group the bits.
00:02:24.380 --> 00:02:28.740
This attempt at specifying an encoding has
failed since the message cannot be interpreted
00:02:28.740 --> 00:02:30.390
unambiguously.
00:02:30.390 --> 00:02:35.590
Graphically we can represent an unambiguous
encoding as a binary tree, labeling the branches
00:02:35.590 --> 00:02:40.700
from each tree node with "0" and "1", placing
the symbols to be encoded as the leaves of
00:02:40.700 --> 00:02:41.880
the tree.
00:02:41.880 --> 00:02:46.510
If you build a binary tree for a proposed
encoding and find that there are no symbols
00:02:46.510 --> 00:02:51.579
labeling interior nodes and exactly one symbol
at each leaf, then your encoding is good to
00:02:51.579 --> 00:02:52.660
go!
00:02:52.660 --> 00:02:55.940
For example, consider the encoding shown on
the left.
00:02:55.940 --> 00:02:59.460
It takes just a second to draw the corresponding
binary tree.
00:02:59.460 --> 00:03:04.020
The symbol B is distance 1 from the root of
the tree, along an arc labeled "0".
00:03:04.020 --> 00:03:09.040
A is distance two, and C and D are distance
3.
00:03:09.040 --> 00:03:15.910
If we receive an encoded message, for example
"01111", we can decode it using successive
00:03:15.910 --> 00:03:20.870
bits of the encoding to identify a path from
the root of tree, descending step-by-step
00:03:20.870 --> 00:03:25.900
until we come to leaf, then repeating the
process starting at the root again, until
00:03:25.900 --> 00:03:29.540
all the bits in the encoded message have been
consumed.
00:03:29.540 --> 00:03:34.930
So the message from the sheep is: "0" takes
us from the root to the leaf B, which is our
00:03:34.930 --> 00:03:36.750
first decoded symbol.
00:03:36.750 --> 00:03:41.320
Then "1-1" takes us to A and the next "1-1"
results in a second A.
00:03:41.320 --> 00:03:46.990
The final decoded message -- "BAA" -- is not
totally unexpected, at least from an American
00:03:46.990 --> 00:03:47.920
sheep.