1
00:00:00,500 --> 00:00:02,650
We will now go through
an example that
2
00:00:02,650 --> 00:00:05,820
involves a continuous
unknown parameter,
3
00:00:05,820 --> 00:00:09,830
the unknown bias of a coin
and discrete observations,
4
00:00:09,830 --> 00:00:12,100
namely, the number
of heads that are
5
00:00:12,100 --> 00:00:14,360
observed in a sequence
of coin flips.
6
00:00:14,360 --> 00:00:17,880
This is an example that we
will start in some detail now,
7
00:00:17,880 --> 00:00:20,630
and we will also
revisit later on.
8
00:00:20,630 --> 00:00:23,620
And in the process, we will
also have the opportunity
9
00:00:23,620 --> 00:00:27,770
to introduce a new class of
probability distributions.
10
00:00:27,770 --> 00:00:30,860
This example is an
extension of an example
11
00:00:30,860 --> 00:00:33,320
that we have already
seen, when we first
12
00:00:33,320 --> 00:00:36,970
introduced the relevant
version of the Bayes rule.
13
00:00:36,970 --> 00:00:38,800
We have a coin.
14
00:00:38,800 --> 00:00:44,110
It has a certain bias between 0
and 1, but the bias is unknown.
15
00:00:44,110 --> 00:00:47,190
And consistent with the
Bayesian philosophy,
16
00:00:47,190 --> 00:00:50,470
we treat this unknown
bias as a random variable,
17
00:00:50,470 --> 00:00:54,240
and we assign a prior
probability distribution to it.
18
00:00:54,240 --> 00:00:57,130
We flip this coin n
times independently,
19
00:00:57,130 --> 00:00:59,360
where n is some
positive integer,
20
00:00:59,360 --> 00:01:02,440
and we record the number
of heads that are obtained.
21
00:01:02,440 --> 00:01:05,030
On the basis of the value
of this random variable,
22
00:01:05,030 --> 00:01:08,740
we would like to make
inferences about Theta.
23
00:01:08,740 --> 00:01:11,510
Now to make some more
concrete progress,
24
00:01:11,510 --> 00:01:13,280
let us make a
specific assumption.
25
00:01:13,280 --> 00:01:16,820
Let us assume that
the prior on Theta
26
00:01:16,820 --> 00:01:20,840
is uniform on the unit interval,
in some sense reflecting
27
00:01:20,840 --> 00:01:25,260
complete ignorance about
the true value of Theta.
28
00:01:25,260 --> 00:01:30,789
We observe the value of this
random variable, some little k,
29
00:01:30,789 --> 00:01:34,400
we fix that value, and we're
interested in the functional
30
00:01:34,400 --> 00:01:38,490
dependence on theta of
this particular quantity,
31
00:01:38,490 --> 00:01:41,140
when k is given to us.
32
00:01:41,140 --> 00:01:42,650
How do we do this?
33
00:01:42,650 --> 00:01:46,610
We use the appropriate form
of the Bayes rule, which
34
00:01:46,610 --> 00:01:49,740
in this setting is as follows.
35
00:01:49,740 --> 00:01:54,289
It is the usual
form, but we have
36
00:01:54,289 --> 00:01:57,620
f's indicating
densities whenever we're
37
00:01:57,620 --> 00:01:59,509
talking about the
distribution of Theta,
38
00:01:59,509 --> 00:02:01,440
because Theta is continuous.
39
00:02:01,440 --> 00:02:04,760
And whenever we talk about
the distribution of K, which
40
00:02:04,760 --> 00:02:07,020
is discrete, we
use the symbol p,
41
00:02:07,020 --> 00:02:10,600
because we're dealing with
probability mass functions.
42
00:02:10,600 --> 00:02:14,770
As always, the
denominator term is such
43
00:02:14,770 --> 00:02:19,490
that the integral of the
whole expression over theta
44
00:02:19,490 --> 00:02:20,670
is equal to 1.
45
00:02:20,670 --> 00:02:23,329
This is the necessary
normalization property,
46
00:02:23,329 --> 00:02:26,180
and because of this,
this denominator term
47
00:02:26,180 --> 00:02:29,650
has to be equal to the
integral of the numerator
48
00:02:29,650 --> 00:02:33,250
over all theta, which
is what we have here.
49
00:02:33,250 --> 00:02:36,990
So now let us move, and
let us apply this formula.
50
00:02:36,990 --> 00:02:41,320
We first have the prior,
which is equal to 1.
51
00:02:41,320 --> 00:02:45,530
Then we have the probability
that K is equal to little k.
52
00:02:45,530 --> 00:02:49,030
This is the probability of
obtaining exactly k heads,
53
00:02:49,030 --> 00:02:51,740
if I tell you the
bias of the coin.
54
00:02:51,740 --> 00:02:53,860
But if I tell you
the bias of the coin,
55
00:02:53,860 --> 00:02:57,410
we're dealing with the usual
model of independent coin
56
00:02:57,410 --> 00:03:00,270
flips, and the
probability of k heads
57
00:03:00,270 --> 00:03:04,610
is given by the binomial
probabilities, which
58
00:03:04,610 --> 00:03:05,890
take this form.
59
00:03:08,900 --> 00:03:14,520
And finally, we have
the denominator term,
60
00:03:14,520 --> 00:03:18,260
which we do not need to
evaluate at this point.
61
00:03:18,260 --> 00:03:21,760
Now, I said earlier that we're
interested in the dependence
62
00:03:21,760 --> 00:03:26,250
on theta, which comes
through these terms.
63
00:03:26,250 --> 00:03:29,550
On the other hand,
the remaining terms
64
00:03:29,550 --> 00:03:34,090
do not involve any
thetas, and so they
65
00:03:34,090 --> 00:03:38,420
can be lumped together
in just a constant.
66
00:03:38,420 --> 00:03:41,140
And so we can write
the answer that we
67
00:03:41,140 --> 00:03:44,980
have found in this
more suggestive form.
68
00:03:44,980 --> 00:03:47,160
We have some
normalizing constant,
69
00:03:47,160 --> 00:03:50,670
and here we keep separately
the dependence on theta.
70
00:03:50,670 --> 00:03:52,960
Of course, this
answer that we derived
71
00:03:52,960 --> 00:03:57,570
is valid for little theta
belonging to the unit interval.
72
00:03:57,570 --> 00:04:01,660
Outside the unit interval,
either the prior density
73
00:04:01,660 --> 00:04:07,370
or the posterior density of
Theta would be equal to 0.
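Written out, the calculation just described would look roughly like this. The symbol d(n, k) for the lumped constant is a label chosen here for illustration; the transcript does not show how the slide names it:

```latex
f_{\Theta \mid K}(\theta \mid k)
  = \frac{1 \cdot \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}}{p_{K}(k)}
  = \frac{1}{d(n,k)}\, \theta^{k} (1-\theta)^{n-k},
\qquad 0 \le \theta \le 1,
\quad\text{where } d(n,k) = \frac{p_{K}(k)}{\binom{n}{k}}
```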
74
00:04:07,370 --> 00:04:12,130
This particular form of
the posterior distribution
75
00:04:12,130 --> 00:04:15,500
for Theta is a certain
type of density,
76
00:04:15,500 --> 00:04:18,110
and it shows up in
various contexts.
77
00:04:18,110 --> 00:04:20,890
And for this reason,
it has a name.
78
00:04:20,890 --> 00:04:25,320
It is called a Beta distribution
with certain parameters,
79
00:04:25,320 --> 00:04:28,040
and the parameters
reflect the exponents
80
00:04:28,040 --> 00:04:32,390
that we have up here
in the two terms.
81
00:04:32,390 --> 00:04:36,150
Note that these parameters are
the exponents augmented by 1.
82
00:04:36,150 --> 00:04:39,730
This is for historical reasons
that do not concern us here.
83
00:04:39,730 --> 00:04:41,720
It is just a convention.
84
00:04:41,720 --> 00:04:45,840
The important thing is to be
able to recognize what it takes
85
00:04:45,840 --> 00:04:48,760
for a distribution to
be a Beta distribution.
86
00:04:48,760 --> 00:04:52,790
That is, that the dependence
on theta is of the form theta
87
00:04:52,790 --> 00:04:57,100
to some power times 1 minus
theta to some other power.
88
00:04:57,100 --> 00:05:01,060
Any distribution of this form
is called a Beta distribution.
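As a quick sanity check on the uniform-prior case, the sketch below integrates theta^k (1 - theta)^(n - k) numerically and compares the result with the known closed form k! (n - k)! / (n + 1)!, which is the Beta function evaluated at the "exponents plus 1" parameters. The function names and the values n = 10, k = 7 are hypothetical, chosen only for illustration:

```python
import math


def unnormalized_posterior(theta, n, k):
    """Theta-dependent part of the posterior under a uniform prior."""
    return theta**k * (1.0 - theta) ** (n - k)


def normalizing_constant(n, k, steps=100_000):
    """Midpoint-rule approximation of the integral over [0, 1]."""
    h = 1.0 / steps
    return h * sum(
        unnormalized_posterior((i + 0.5) * h, n, k) for i in range(steps)
    )


def exact_constant(n, k):
    """Closed form k! (n-k)! / (n+1)!, i.e. the Beta function B(k+1, n-k+1)."""
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)


n, k = 10, 7  # hypothetical data: 7 heads in 10 flips
numeric = normalizing_constant(n, k)
exact = exact_constant(n, k)
print(numeric, exact)
```

Dividing the unnormalized expression by this constant gives a density that integrates to 1 over the unit interval.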
89
00:05:01,060 --> 00:05:03,020
So now, let's
continue this example
90
00:05:03,020 --> 00:05:05,270
by considering a
different prior.
91
00:05:05,270 --> 00:05:10,530
Suppose that the prior is
itself a Beta distribution
92
00:05:10,530 --> 00:05:13,610
of this form where
alpha and beta are
93
00:05:13,610 --> 00:05:17,130
some non-negative numbers.
94
00:05:17,130 --> 00:05:20,250
What is the posterior
in this case?
95
00:05:20,250 --> 00:05:23,160
We just go through the
same calculation as before,
96
00:05:23,160 --> 00:05:27,150
but instead of using 1
in the place of the prior,
97
00:05:27,150 --> 00:05:30,850
we now use the prior
that's given to us.
98
00:05:35,950 --> 00:05:39,909
The probability of k
heads in the n tosses,
99
00:05:39,909 --> 00:05:43,350
when we know the bias,
is exactly as before.
100
00:05:43,350 --> 00:05:47,840
It is given by the
binomial probabilities.
101
00:05:47,840 --> 00:05:53,540
And finally, we need to divide
by the denominator term, which
102
00:05:53,540 --> 00:05:56,480
is the normalizing constant.
103
00:05:56,480 --> 00:05:58,670
What do we observe here?
104
00:05:58,670 --> 00:06:03,750
The dependence on theta
comes through these terms.
105
00:06:03,750 --> 00:06:07,610
The remaining terms
do not involve theta,
106
00:06:07,610 --> 00:06:11,710
and they can all be
absorbed in a constant.
107
00:06:11,710 --> 00:06:16,430
Let's call that constant d, and
collect the remaining terms.
108
00:06:16,430 --> 00:06:22,260
We have theta to the
power of alpha plus k,
109
00:06:22,260 --> 00:06:28,550
and then, 1 minus theta to the
power of beta plus n minus k.
110
00:06:33,530 --> 00:06:36,900
And once more, this is
the form of the posterior
111
00:06:36,900 --> 00:06:40,170
for thetas belonging
to this range.
112
00:06:40,170 --> 00:06:43,680
The posterior is 0
outside this range.
113
00:06:43,680 --> 00:06:45,180
So what do we see?
114
00:06:45,180 --> 00:06:47,390
We started with
a prior that came
115
00:06:47,390 --> 00:06:49,920
from the Beta
family of this form,
116
00:06:49,920 --> 00:06:54,830
and we came up with a
posterior that is still
117
00:06:54,830 --> 00:06:57,490
a function of
theta of this form,
118
00:06:57,490 --> 00:07:01,550
but with different values of
the parameters alpha and beta.
119
00:07:01,550 --> 00:07:03,970
Namely, alpha gets
replaced by alpha plus k,
120
00:07:03,970 --> 00:07:08,080
and beta gets replaced by
beta plus n minus k.
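In symbols, the calculation for the Beta prior would run along these lines. The name c for the prior's normalizing constant is an assumption made here; d is the constant named in the lecture:

```latex
f_{\Theta}(\theta) = \frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta},
\qquad
f_{\Theta \mid K}(\theta \mid k)
  = \frac{\frac{1}{c}\, \theta^{\alpha} (1-\theta)^{\beta}
          \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}}{p_{K}(k)}
  = d\, \theta^{\alpha + k} (1-\theta)^{\beta + n - k},
\qquad 0 \le \theta \le 1
```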
121
00:07:08,080 --> 00:07:10,340
So we see that if we
start with a prior
122
00:07:10,340 --> 00:07:12,890
from the family of
Beta distributions,
123
00:07:12,890 --> 00:07:17,720
the posterior will also
be in that same family.
124
00:07:17,720 --> 00:07:21,120
This is a beautiful property
of Beta distributions
125
00:07:21,120 --> 00:07:24,410
that can be exploited
in various ways.
126
00:07:24,410 --> 00:07:26,890
One of these is that
it actually allows
127
00:07:26,890 --> 00:07:31,170
for recursive ways of updating
the posterior of Theta
128
00:07:31,170 --> 00:07:34,159
as we get more and
more observations.
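A minimal sketch of that recursive updating, assuming the lecture's convention in which alpha and beta are the exponents themselves (so the uniform prior corresponds to alpha = beta = 0). The function name `update` and the batches of flips are hypothetical:

```python
def update(alpha, beta, heads, flips):
    """Return the posterior exponents after seeing `heads` heads in `flips` tosses."""
    if not 0 <= heads <= flips:
        raise ValueError("heads must be between 0 and flips")
    return alpha + heads, beta + (flips - heads)


prior = (0, 0)  # uniform prior: theta^0 * (1 - theta)^0

# Hypothetical batches of observations, each given as (heads, flips).
batches = [(3, 5), (1, 4), (6, 6)]

# Process the batches one at a time, feeding each posterior back in as the next prior.
state = prior
for heads, flips in batches:
    state = update(*state, heads, flips)

# Process all the data at once for comparison.
total_heads = sum(h for h, _ in batches)
total_flips = sum(f for _, f in batches)
all_at_once = update(*prior, total_heads, total_flips)

print(state, all_at_once)
```

Both routes end at the same pair of exponents, so only the current alpha and beta need to be stored between observations.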