1
00:00:01,069 --> 00:00:02,790
"Optimal" sounds pretty good!
2
00:00:02,790 --> 00:00:05,170
Does that mean we can't do any better?
3
00:00:05,170 --> 00:00:08,750
Well, not by encoding symbols one at a time.
4
00:00:08,750 --> 00:00:13,309
But if we want to encode long sequences of
symbols, we can reduce the expected length
5
00:00:13,309 --> 00:00:19,920
of the encoding by working with, say, pairs
of symbols instead of only single symbols.
6
00:00:19,920 --> 00:00:24,039
The table below shows the probabilities of
pairs of symbols from our example.
7
00:00:24,039 --> 00:00:29,490
If we use Huffman's algorithm to build the
optimal variable-length code from these probabilities,
8
00:00:29,490 --> 00:00:36,320
it turns out the expected length when encoding
pairs is 1.646 bits/symbol.
9
00:00:36,320 --> 00:00:43,460
This is a small improvement on the 1.667 bits/symbol
we get when encoding each symbol individually.
10
00:00:43,460 --> 00:00:48,670
And we'd do even better if we encoded sequences
of length 3, and so on.
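[Editor's note: to make the grouping idea concrete, here is a minimal Python sketch that builds a Huffman code with a heap and compares expected bits per symbol when coding single symbols, pairs, and triples. The probability table below is an illustrative stand-in, not the table from the video, and blocks are formed assuming the symbols are independent.]

import heapq
from itertools import product
from math import prod

def huffman_lengths(probs):
    # Each heap entry: (total probability, tie-breaker, {symbol: depth}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)   # two least-probable subtrees
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]                     # {symbol: code length in bits}

def bits_per_symbol(probs, group_size):
    lengths = huffman_lengths(probs)
    return sum(probs[s] * lengths[s] for s in probs) / group_size

# Hypothetical single-symbol distribution (expected length 1.667 bits).
single = {'A': 1/3, 'B': 1/2, 'C': 1/12, 'D': 1/12}

for n in (1, 2, 3):
    # Build n-symbol blocks, assuming independent symbols.
    grouped = {
        ''.join(s for s, _ in combo): prod(q for _, q in combo)
        for combo in product(single.items(), repeat=n)
    }
    print(n, 'symbol(s) per block:', round(bits_per_symbol(grouped, n), 4))

[Run as-is, this prints about 1.667 for single symbols and smaller values for pairs and triples, matching the trend described above; the exact numbers depend on the actual table from the video.]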
11
00:00:48,670 --> 00:00:53,129
Modern file compression algorithms use an
adaptive approach to determine on the fly
12
00:00:53,129 --> 00:00:57,340
which sequences occur frequently and hence
should have short encodings.
13
00:00:57,340 --> 00:01:02,079
They work quite well when the data has many
repeating sequences, for example, natural
14
00:01:02,079 --> 00:01:08,600
language data where some letter combinations
or even whole words occur again and again.
15
00:01:08,600 --> 00:01:12,630
Compression can achieve dramatic reductions
from the original file size.
16
00:01:12,630 --> 00:01:18,030
If you'd like to learn more, look up "LZW"
on Wikipedia to read about the Lempel-Ziv-Welch
17
00:01:18,030 --> 00:01:19,160
data compression algorithm.
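[Editor's note: for a sense of how such an adaptive scheme works, here is a compact Python sketch of the compression side of LZW. The dictionary starts with single characters and grows as the input is scanned, so frequently repeated sequences come to be emitted as single integer codes. This is a simplified illustration, not a production implementation.]

def lzw_compress(text):
    """Encode a string as a list of integer codes."""
    # Start with one entry per possible single character;
    # newly seen sequences get the next available code.
    table = {chr(i): i for i in range(256)}
    next_code = 256
    current = ''
    out = []
    for ch in text:
        if current + ch in table:
            current += ch                    # extend the longest known match
        else:
            out.append(table[current])       # emit code for the match so far
            table[current + ch] = next_code  # learn the new sequence
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

print(lzw_compress('abababab'))  # 8 characters compress to 5 codes

[On highly repetitive input the output contains far fewer codes than there are input characters, which is where the compression comes from.]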