By this time, we know how to construct confidence intervals when we try to estimate an unknown mean of a certain distribution using the sample mean as our estimator. Or actually, these are approximate confidence intervals, because we are using the approximation suggested by the central limit theorem.

But what if we do not know the value of sigma, the standard deviation of the X's? Then we have a few options.

One option is to use an upper bound on sigma. So we will be using a value that is larger than or equal to sigma, and this is going to make our interval somewhat larger. So this is a conservative choice, but it is definitely an option. For example, if we're dealing with Bernoulli random variables, we know that the standard deviation is less than or equal to 1/2, so we can just plug in the value of 1/2 at this point.

Another option is to try to estimate sigma. How do we estimate it? We can perhaps use an ad hoc estimate of sigma that fits the particular situation at hand. For example, in the Bernoulli case, we know that sigma is given by the formula sqrt(theta(1 - theta)), where theta is the mean of the Bernoulli. And since we do have an estimate of theta, namely the sample mean, we can plug in that particular estimate, and that gives us an estimate of the standard deviation. When n is large, the estimate of theta is going to be very close to the true value, and so this estimate of the standard deviation will also be very close to the true value.

Both of these options were discussed for special cases where we have special structure and we can derive an upper bound, or where there is a natural estimate that suggests itself. More generally, what can we do? One general option is to use a generic way of estimating the variance, and here is how it goes. The variance is, by definition, the expected value of a certain quantity, namely (X - mu) squared. And we can estimate expected values by taking several samples of this quantity and taking the average of them.
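To make the Bernoulli example concrete, here is a minimal sketch comparing the two options just described: the conservative interval using the upper bound sigma <= 1/2, and the plug-in interval using sqrt(theta_hat(1 - theta_hat)). The function name, the simulated data, and the choice of a 95% level with z = 1.96 are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def bernoulli_intervals(x, z=1.96):
    """Approximate 95% confidence intervals for the mean theta of Bernoulli
    data x, based on the CLT formula theta_hat +/- z * sigma / sqrt(n)."""
    x = np.asarray(x)
    n = len(x)
    theta_hat = x.mean()  # sample mean, our estimate of theta

    # Option 1: conservative upper bound sigma <= 1/2.
    half_width_bound = z * 0.5 / np.sqrt(n)

    # Option 2: plug-in estimate sigma_hat = sqrt(theta_hat * (1 - theta_hat)).
    sigma_hat = np.sqrt(theta_hat * (1 - theta_hat))
    half_width_plugin = z * sigma_hat / np.sqrt(n)

    return ((theta_hat - half_width_bound, theta_hat + half_width_bound),
            (theta_hat - half_width_plugin, theta_hat + half_width_plugin))

# Example usage with simulated coin flips (true theta = 0.3 is arbitrary).
rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.3, size=500)
conservative, plug_in = bernoulli_intervals(flips)
print(conservative, plug_in)
```

As expected, the conservative interval is at least as wide as the plug-in interval, since 1/2 is the largest possible value of sqrt(theta(1 - theta)).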
So if we have n pieces of data, for each piece of data we calculate this quantity, (X_i - mu) squared, add them up, and divide by n. By the weak law of large numbers, this is the sample mean of this particular random variable, and it converges to the expected value of that random variable. So that is how we could estimate the variance.

But there is a catch. This expression involves mu, the mean of the random variable, and this is something that we do not know. So what can we do? Well, we have an estimate for the mean, so we could just plug in that estimate instead of the true value. And this gives us an alternative expression, in which each X_i is compared to the sample mean rather than to mu.

Now, when n is very large, the sample mean converges to the true mean, so this alternative expression becomes closer and closer to the original one. The original expression converges to sigma squared, and we conclude from this that the alternative expression will also converge to sigma squared. And so here we have a way of estimating sigma squared from the data, and by taking the square root, we obtain an estimate of sigma as well, which we can plug into the confidence interval formula. This gives us a complete way of coming up with confidence intervals when we only have data available in our hands but do not know ahead of time what sigma is.

Some remarks. This procedure of constructing confidence intervals involves two separate approximations. One approximation has to do with the fact that the sample mean is approximately normal, according to the central limit theorem. And then there is a second approximation that comes in when we use an estimate of sigma instead of the true value of sigma.

Now, when we estimate sigma instead of using the true value, we are introducing some additional randomness into the procedure. And because of this randomness, the confidence intervals actually should be a little larger. There is a systematic way of doing that, and it involves using the so-called t-distribution tables.
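As a rough illustration of this generic recipe, the sketch below estimates sigma by averaging the squared deviations from the sample mean and then forms the approximate confidence interval for the unknown mean. The function name, the simulated data, and the 95% level are assumptions made for the example.

```python
import numpy as np

def mean_confidence_interval(x, z=1.96):
    """Approximate confidence interval for the unknown mean, with sigma
    replaced by the generic estimate sqrt((1/n) * sum (x_i - m_n)^2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m_n = x.mean()                                # sample mean M_n
    sigma_hat = np.sqrt(np.mean((x - m_n) ** 2))  # plug M_n in for the unknown mu
    half_width = z * sigma_hat / np.sqrt(n)
    return m_n - half_width, m_n + half_width

# Example: data from a distribution whose sigma we pretend not to know.
rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)
print(mean_confidence_interval(data))
```

Note that only the data enter the calculation; neither the true mean nor the true sigma is needed, which is exactly the point of the generic recipe.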
And those tables are going to give us certain numbers that are a little different from what we have here. So instead of 1.96, we might have a somewhat larger number. This correction is relevant when n is a small number, let's say n smaller than 30. But for larger values of n, this correction, where we use t tables instead of normal tables, is rather insignificant, and one doesn't bother with it. In any case, we will not discuss this additional correction any further, but it is useful to know that it is something that statisticians will often do.

Finally, one last remark. One will often see an alternative way of estimating the variance where, instead of this factor of 1/n, one uses a factor of 1/(n - 1). With this alternative form, it turns out that we obtain an unbiased estimator of the variance, and that could be a reason for preferring this alternative form. On the other hand, when n is large, whether we use n or n - 1 makes very little difference.

And this concludes our discussion of confidence intervals.
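For completeness, here is a small sketch contrasting the 1/n and 1/(n - 1) variance estimates, and showing one way the t correction mentioned above might be applied by replacing 1.96 with a Student t quantile when n is small. The sample size, the simulated data, and the use of scipy for the t quantile are assumptions for illustration, not prescriptions from the lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=3.0, size=20)   # small sample, n = 20
n = len(x)
m_n = x.mean()

var_biased = np.mean((x - m_n) ** 2)          # 1/n version from the lecture
var_unbiased = var_biased * n / (n - 1)       # 1/(n - 1) version, unbiased

# t-based interval: use the 97.5% quantile of the t distribution with
# n - 1 degrees of freedom, which is slightly larger than 1.96 for small n.
t_quantile = stats.t.ppf(0.975, df=n - 1)
s = np.sqrt(var_unbiased)
print(var_biased, var_unbiased)
print(m_n - t_quantile * s / np.sqrt(n), m_n + t_quantile * s / np.sqrt(n))
```

With n = 20 the two variance estimates differ by a factor of 20/19, and the t quantile is noticeably above 1.96; for n in the hundreds both differences become negligible, consistent with the remarks above.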