1
00:00:01,210 --> 00:00:05,070
By this time, we know how to
construct confidence intervals
2
00:00:05,070 --> 00:00:08,380
when we try to estimate an
unknown mean of a certain
3
00:00:08,380 --> 00:00:12,340
distribution using the sample
mean as our estimator.
4
00:00:12,340 --> 00:00:14,980
Or actually, these are
approximate confidence
5
00:00:14,980 --> 00:00:18,850
intervals, because we are
using the approximation
6
00:00:18,850 --> 00:00:21,740
suggested by the central
limit theorem.
7
00:00:21,740 --> 00:00:25,690
But what if we do not know the
value of sigma, the standard
8
00:00:25,690 --> 00:00:27,790
deviation of the X's?
9
00:00:27,790 --> 00:00:29,940
Then we have a few options.
10
00:00:29,940 --> 00:00:33,253
One option is to use an
upper bound on sigma.
11
00:00:33,253 --> 00:00:36,600
So we will be using a
value that's larger
12
00:00:36,600 --> 00:00:38,326
than or equal to sigma.
13
00:00:38,326 --> 00:00:42,180
And this is going to make our
interval somewhat larger.
14
00:00:42,180 --> 00:00:44,670
So this is a conservative
choice, but it is
15
00:00:44,670 --> 00:00:45,890
definitely an option.
16
00:00:45,890 --> 00:00:48,330
For example, if we're dealing
with Bernoulli random
17
00:00:48,330 --> 00:00:51,320
variables, we know that the
standard deviation is less
18
00:00:51,320 --> 00:00:54,480
than or equal to 1/2, so we can
just plug in the value of
19
00:00:54,480 --> 00:00:56,840
1/2 at this point.
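The conservative option just described can be sketched in a few lines. The function name and the 95% value 1.96 are illustrative assumptions, not part of the lecture:

```python
import math

# Conservative approximate 95% confidence interval for a Bernoulli mean:
# for Bernoulli variables, sigma = sqrt(theta(1 - theta)) <= 1/2,
# so plugging in 1/2 can only make the interval larger.
def conservative_ci(sample_mean, n, z=1.96):
    half_width = z * 0.5 / math.sqrt(n)  # worst-case sigma = 1/2
    return (sample_mean - half_width, sample_mean + half_width)

lo, hi = conservative_ci(0.4, 100)  # e.g. 40 successes in 100 trials
```

Since the true sigma is at most 1/2, the resulting interval is guaranteed to be at least as wide as the one built from the true sigma.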
20
00:00:56,840 --> 00:01:00,220
Another option is to try
to estimate sigma.
21
00:01:00,220 --> 00:01:02,230
How do we estimate it?
22
00:01:02,230 --> 00:01:06,740
We can perhaps use an ad hoc
estimate of sigma that fits to
23
00:01:06,740 --> 00:01:08,980
the particular situation
at hand.
24
00:01:08,980 --> 00:01:12,800
So for example, in the Bernoulli
case, we know that
25
00:01:12,800 --> 00:01:17,900
sigma is given by this formula,
where theta is the
26
00:01:17,900 --> 00:01:19,830
mean of the Bernoulli.
27
00:01:19,830 --> 00:01:24,490
Using this, and since we do
have an estimate of theta--
28
00:01:24,490 --> 00:01:26,810
this is just the sample mean--
29
00:01:26,810 --> 00:01:29,990
we can plug in that particular
estimate.
30
00:01:29,990 --> 00:01:34,320
And that gives us an estimate
of the standard deviation.
31
00:01:34,320 --> 00:01:38,509
When n is large, this estimate
is going to be very close to
32
00:01:38,509 --> 00:01:39,770
the true value.
33
00:01:39,770 --> 00:01:43,229
And so this estimate of the
standard deviation will also
34
00:01:43,229 --> 00:01:47,090
be very close to
the true value.
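The Bernoulli plug-in estimate of sigma can be sketched as follows; the function name is an illustrative assumption:

```python
import math

# Bernoulli case: sigma = sqrt(theta(1 - theta)).
# We do not know theta, but the sample mean estimates it,
# so we plug the sample mean into the formula.
def bernoulli_sigma_hat(samples):
    theta_hat = sum(samples) / len(samples)  # sample mean estimates theta
    return math.sqrt(theta_hat * (1 - theta_hat))

samples = [1] * 40 + [0] * 60          # theta_hat = 0.4
sigma_hat = bernoulli_sigma_hat(samples)  # sqrt(0.4 * 0.6)
```

When n is large, theta_hat is close to theta, so sigma_hat is close to the true sigma.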
35
00:01:47,090 --> 00:01:51,130
Both of these options were
discussed for special cases
36
00:01:51,130 --> 00:01:53,780
where we have special structure
and we can derive an
37
00:01:53,780 --> 00:01:56,030
upper bound, or there is
a natural estimate
38
00:01:56,030 --> 00:01:57,810
that suggests itself.
39
00:01:57,810 --> 00:02:00,550
More generally, what
can we do?
40
00:02:00,550 --> 00:02:04,680
One general option is to
use a generic way of
41
00:02:04,680 --> 00:02:07,190
estimating the variance.
42
00:02:07,190 --> 00:02:10,419
And here's how it goes.
43
00:02:10,419 --> 00:02:14,060
The variance is, by definition,
the expected value
44
00:02:14,060 --> 00:02:17,290
of something, of this
expression.
45
00:02:17,290 --> 00:02:21,500
And we can estimate expected
values by taking several
46
00:02:21,500 --> 00:02:26,150
samples of this quantity, and
taking the average of them.
47
00:02:26,150 --> 00:02:30,530
So if we have n pieces of data,
for each piece of data,
48
00:02:30,530 --> 00:02:33,900
we calculate this quantity,
divide by n.
49
00:02:33,900 --> 00:02:39,290
This is the sample
mean of this particular
50
00:02:39,290 --> 00:02:42,100
random variable.
51
00:02:42,100 --> 00:02:45,480
And by the weak law of large
numbers, it converges to
52
00:02:45,480 --> 00:02:48,250
the expected value
of this random variable.
53
00:02:48,250 --> 00:02:51,590
So that's how we could estimate
the variance.
54
00:02:51,590 --> 00:02:53,470
But there is a catch.
55
00:02:53,470 --> 00:02:56,760
This expression here
involves the mean
56
00:02:56,760 --> 00:02:58,280
of the random variable.
57
00:02:58,280 --> 00:03:00,640
And this is something
that we do not know.
58
00:03:00,640 --> 00:03:02,570
So what can we do?
59
00:03:02,570 --> 00:03:05,630
Well, we have an estimate for
the mean, so we could just
60
00:03:05,630 --> 00:03:09,750
plug in that estimate instead
of the true value.
61
00:03:09,750 --> 00:03:13,440
And this gives us this
alternative expression.
62
00:03:13,440 --> 00:03:19,590
Now, as n increases,
this sample mean
63
00:03:19,590 --> 00:03:21,800
converges to the true mean.
64
00:03:21,800 --> 00:03:26,040
So this expression here would
become closer and closer to
65
00:03:26,040 --> 00:03:27,500
this expression.
66
00:03:27,500 --> 00:03:31,700
Now, this expression converges
to sigma squared, and we
67
00:03:31,700 --> 00:03:34,960
conclude from this that this
expression will also converge
68
00:03:34,960 --> 00:03:36,460
to sigma squared.
69
00:03:36,460 --> 00:03:40,360
And so here we have a way of
estimating sigma squared from
70
00:03:40,360 --> 00:03:43,840
the data, and by taking the
square root, we obtain an
71
00:03:43,840 --> 00:03:47,120
estimate of sigma as well that
we can plug into this
72
00:03:47,120 --> 00:03:49,290
expression.
73
00:03:49,290 --> 00:03:52,420
And this gives us a complete
way of coming up with
74
00:03:52,420 --> 00:03:56,880
confidence intervals when we
only have data available in
75
00:03:56,880 --> 00:04:03,460
our hands, but do not know ahead
of time what sigma is.
76
00:04:03,460 --> 00:04:05,570
Some remarks.
77
00:04:05,570 --> 00:04:10,040
This procedure of constructing
confidence intervals involves
78
00:04:10,040 --> 00:04:12,330
two separate approximations.
79
00:04:12,330 --> 00:04:15,620
One approximation has to do with
the fact that the sample
80
00:04:15,620 --> 00:04:18,810
mean is approximately normal
according to the
81
00:04:18,810 --> 00:04:20,540
central limit theorem.
82
00:04:20,540 --> 00:04:24,500
And then there is a second
approximation that comes in when
83
00:04:24,500 --> 00:04:26,930
using an estimate of sigma
instead of the
84
00:04:26,930 --> 00:04:29,010
true value of sigma.
85
00:04:29,010 --> 00:04:32,350
Now, when we estimate sigma
instead of using the true
86
00:04:32,350 --> 00:04:35,550
value, we're introducing
some additional
87
00:04:35,550 --> 00:04:37,890
randomness in this procedure.
88
00:04:37,890 --> 00:04:40,460
And because of this randomness,
the confidence
89
00:04:40,460 --> 00:04:43,900
intervals actually should
be a little larger.
90
00:04:43,900 --> 00:04:48,670
There is a systematic way of
doing that, and it involves
91
00:04:48,670 --> 00:04:52,860
using the so-called
t-distribution tables.
92
00:04:52,860 --> 00:04:56,500
And those tables are going to
give us certain numbers that
93
00:04:56,500 --> 00:04:59,310
are a little different from
what we have here.
94
00:04:59,310 --> 00:05:04,500
So instead of 1.96, we might
have a somewhat larger number.
95
00:05:04,500 --> 00:05:08,300
This correction is relevant
when n is a small number,
96
00:05:08,300 --> 00:05:11,050
let's say n smaller than 30.
97
00:05:11,050 --> 00:05:15,130
But for larger values of n, this
correction, where we use
98
00:05:15,130 --> 00:05:17,860
t tables instead of normal
tables, is rather
99
00:05:17,860 --> 00:05:21,060
insignificant and one doesn't
bother with it.
100
00:05:21,060 --> 00:05:23,910
In any case, we will not discuss
this additional correction
101
00:05:23,910 --> 00:05:27,720
any further, but it
is useful to know that it is
102
00:05:27,720 --> 00:05:31,450
something that statisticians
will often do.
103
00:05:31,450 --> 00:05:33,990
Finally, one last remark.
104
00:05:33,990 --> 00:05:38,610
One will often see an
alternative way of estimating
105
00:05:38,610 --> 00:05:43,870
the variance where instead of
this factor of 1/n, one uses a
106
00:05:43,870 --> 00:05:47,840
factor of 1 over n minus 1.
107
00:05:47,840 --> 00:05:51,190
With this alternative form, it
turns out that this is an
108
00:05:51,190 --> 00:05:54,990
unbiased estimator
of the variance.
109
00:05:54,990 --> 00:05:58,650
And that could be a reason for
preferring to use this
110
00:05:58,650 --> 00:06:00,110
alternative form.
111
00:06:00,110 --> 00:06:04,080
On the other hand, when n is
large, whether we use n or n
112
00:06:04,080 --> 00:06:07,410
minus 1 makes very little
difference.
113
00:06:07,410 --> 00:06:09,490
And this concludes
our discussion
114
00:06:09,490 --> 00:06:10,740
of confidence intervals.