WEBVTT
00:00:00.500 --> 00:00:02.730
Mathematicians like
to model uncertainty
00:00:02.730 --> 00:00:06.130
about a particular circumstance
by introducing the concept
00:00:06.130 --> 00:00:07.830
of a random variable.
00:00:07.830 --> 00:00:09.600
For our application,
we'll always
00:00:09.600 --> 00:00:12.380
be dealing with circumstances
where there are a finite number
00:00:12.380 --> 00:00:14.750
N of distinct
choices, so we'll be
00:00:14.750 --> 00:00:16.470
using a discrete
random variable that
00:00:16.470 --> 00:00:18.630
can take on one of
N possible values:
00:00:18.630 --> 00:00:22.980
x_1, x_2, and so on up to x_N.
00:00:22.980 --> 00:00:25.410
The probability that X
will take on the value x_1
00:00:25.410 --> 00:00:28.590
is given by the
probability p_1, the value
00:00:28.590 --> 00:00:33.010
x_2 by probability
p_2, and so on.
00:00:33.010 --> 00:00:35.490
The smaller the probability,
the more uncertain
00:00:35.490 --> 00:00:39.410
it is that X will take
on that particular value.
00:00:39.410 --> 00:00:41.690
Claude Shannon, in his
seminal work on the theory
00:00:41.690 --> 00:00:44.300
of information, defined the
information received when
00:00:44.300 --> 00:00:49.480
learning that X had taken on
the value x_i as the log-base-2
00:00:49.480 --> 00:00:51.790
of 1/p_i.
00:00:51.790 --> 00:00:53.520
Note that uncertainty
of a choice
00:00:53.520 --> 00:00:56.060
is inversely proportional
to its probability,
00:00:56.060 --> 00:00:59.390
so the term inside of the log
is basically the uncertainty
00:00:59.390 --> 00:01:01.700
of that particular choice.
00:01:01.700 --> 00:01:04.040
We use the log-base-2
to measure the magnitude
00:01:04.040 --> 00:01:07.120
of the uncertainty in bits --
where a bit is a quantity that
00:01:07.120 --> 00:01:10.770
can take on the value 0 or 1
-- think of the information
00:01:10.770 --> 00:01:14.380
content as the number of bits
we would require to encode this
00:01:14.380 --> 00:01:16.490
choice.
00:01:16.490 --> 00:01:18.540
Suppose the data
we receive doesn't
00:01:18.540 --> 00:01:20.280
resolve all the uncertainty.
00:01:20.280 --> 00:01:22.980
For example, when earlier
we received the data
00:01:22.980 --> 00:01:25.560
that the card was a
heart: some of uncertainty
00:01:25.560 --> 00:01:27.850
has been resolved since we
know more about the card
00:01:27.850 --> 00:01:29.810
than we did before the
receiving the data,
00:01:29.810 --> 00:01:32.100
but we don't yet
know the exact card,
00:01:32.100 --> 00:01:34.990
so some uncertainty
still remains.
00:01:34.990 --> 00:01:37.320
We can still use the formula
for information content
00:01:37.320 --> 00:01:40.210
from the previous slide,
using the probability
00:01:40.210 --> 00:01:43.910
we received to compute
the information content.
00:01:43.910 --> 00:01:46.230
In our example the
probability of learning
00:01:46.230 --> 00:01:49.860
that a card chosen randomly
from a 52-card deck is a heart
00:01:49.860 --> 00:01:54.670
is 13/52, the number of
hearts over the total number
00:01:54.670 --> 00:01:55.980
of choices.
00:01:55.980 --> 00:02:01.070
So p_data is 13/52, or 1/4
and the information content is
00:02:01.070 --> 00:02:03.030
computed as
log-base-2 of 1/(1/4),
00:02:03.030 --> 00:02:07.730
which figures out to be 2 bits.
00:02:07.730 --> 00:02:10.280
This example is one
we encounter often --
00:02:10.280 --> 00:02:13.850
we receive partial information
about N equally-probable
00:02:13.850 --> 00:02:18.220
choices (each choice has
probability 1/N) that narrows
00:02:18.220 --> 00:02:21.000
the number of choices down to M.
00:02:21.000 --> 00:02:24.740
The probability of receiving
such information is M*(1/N),
00:02:24.740 --> 00:02:31.740
so information received
is log-base-2 of N/M bits.
00:02:31.740 --> 00:02:33.600
Let's look at some examples.
00:02:33.600 --> 00:02:36.160
If we learn the result
(HEADS or TAILS)
00:02:36.160 --> 00:02:39.370
of a flip of a fair coin,
we go from 2 choices
00:02:39.370 --> 00:02:40.600
to a single choice.
00:02:40.600 --> 00:02:43.100
So the information received
is log-base-2 of 2/1,
00:02:43.100 --> 00:02:45.780
or a single bit.
00:02:45.780 --> 00:02:47.330
This makes sense:
it would take us
00:02:47.330 --> 00:02:49.880
one bit to encode which
of the two possibilities
00:02:49.880 --> 00:02:54.920
actually happened, say, "1"
for heads and "0" for tails.
00:02:54.920 --> 00:02:57.160
Reviewing the example
from the previous slide:
00:02:57.160 --> 00:02:59.970
learning that a card drawn
from a fresh deck is a heart
00:02:59.970 --> 00:03:05.590
gives us log-base-2 of 52/13,
or 2 bits of information.
00:03:05.590 --> 00:03:07.290
Again this makes
sense: it would take us
00:03:07.290 --> 00:03:10.200
two bits to encode which
of four possible card suits
00:03:10.200 --> 00:03:12.410
had turned up.
00:03:12.410 --> 00:03:14.190
Finally consider
what information
00:03:14.190 --> 00:03:18.220
we get from rolling two
dice, one red and one green.
00:03:18.220 --> 00:03:23.200
Each die has six faces, so there
are 36 possible combinations.
00:03:23.200 --> 00:03:25.580
Once we learn the exact
outcome of the roll,
00:03:25.580 --> 00:03:30.700
we've received log-base-2
of 36/1 or 5.17 bits
00:03:30.700 --> 00:03:32.320
of information.
00:03:32.320 --> 00:03:32.960
Hmm.
00:03:32.960 --> 00:03:35.470
What do those
fractional bits mean?
00:03:35.470 --> 00:03:38.740
Our circuitry will only
deal with whole bits!
00:03:38.740 --> 00:03:42.360
So to encode a single outcome,
we'd need to use six bits.
00:03:42.360 --> 00:03:44.670
But suppose we wanted to
record the outcome of 10
00:03:44.670 --> 00:03:46.550
successive rolls.
00:03:46.550 --> 00:03:50.760
At 6 bits per roll, we would
need a total of 60 bits.
00:03:50.760 --> 00:03:52.690
What this formula
is telling us is
00:03:52.690 --> 00:03:55.720
that we would need
not 60 bits, but only
00:03:55.720 --> 00:03:59.620
52 bits to unambiguously
encode the results.
00:03:59.620 --> 00:04:01.560
Whether we can come
with an encoding that
00:04:01.560 --> 00:04:04.310
achieves this lower bound
is an interesting question,
00:04:04.310 --> 00:04:07.850
which we will take up
later in the chapter.
00:04:07.850 --> 00:04:10.850
To wrap up, let's return
to our initial example.
00:04:10.850 --> 00:04:13.430
Here's table showing the
different choices for the data
00:04:13.430 --> 00:04:15.630
received, along
with the probability
00:04:15.630 --> 00:04:19.662
of that event and the
computed information content.
00:04:19.662 --> 00:04:21.120
We've already talked
about learning
00:04:21.120 --> 00:04:22.560
that the card was a heart.
00:04:22.560 --> 00:04:26.340
The probability of this event
is 13/52 with an information
00:04:26.340 --> 00:04:28.820
content of 2 bits.
00:04:28.820 --> 00:04:31.250
Learning that a card is
not the Ace of spades
00:04:31.250 --> 00:04:33.870
is quite likely, since
there's only one chance in 52
00:04:33.870 --> 00:04:36.600
that it is the Ace of spades.
00:04:36.600 --> 00:04:40.930
So we only get a small amount of
information from this event --
00:04:40.930 --> 00:04:42.790
.028 bits.
00:04:42.790 --> 00:04:44.990
There twelve face
cards in a card deck,
00:04:44.990 --> 00:04:47.360
so the probability of
this event is 12/52
00:04:47.360 --> 00:04:52.290
and we would receive 2.115 bits.
00:04:52.290 --> 00:04:55.040
A bit more information than
learning about the card's suit
00:04:55.040 --> 00:04:59.200
since there's slightly
less residual uncertainty.
00:04:59.200 --> 00:05:01.880
Finally, we get the most
information when all
00:05:01.880 --> 00:05:04.910
uncertainty is eliminated
-- a bit more than 5.7 bits.
00:05:04.910 --> 00:05:11.120
The results line up nicely with
our and Mr. Blue's intuition:
00:05:11.120 --> 00:05:14.310
the more uncertainty is
resolved, the more information
00:05:14.310 --> 00:05:16.090
we have received.
00:05:16.090 --> 00:05:19.040
Now try your hand at computing
the information for a few
00:05:19.040 --> 00:05:22.070
more examples in the
following exercises.