The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So welcome back. We're going to finish this chapter on maximum likelihood estimation. Last time, I briefly mentioned something called Fisher information. Fisher information, in general, is actually a matrix when you have a multivariate parameter theta. So if theta, for example, is of dimension d, then the Fisher information matrix is going to be a d by d matrix. You can see that because it's an outer product: it's of the form gradient times gradient transpose. The gradient is d-dimensional, and so gradient times gradient transpose is a d by d matrix. And this matrix -- well, it's called the Fisher information matrix because it's basically telling you how much information about theta is in your model. So for example, if your model is well parameterized, then you will have a lot of information -- let's think of it as being a scalar, just one number for now -- so you're going to have a larger information about your parameter in the same probability distribution. But if you start having a weird way of parameterizing your model, then the Fisher information is actually going to drop. As a concrete example, think of a parameter of interest in a Gaussian model where the mean is known to be zero, but what you're interested in is the variance, sigma squared. If I'm interested in sigma squared, I could parameterize my model by sigma, sigma squared, sigma to the fourth, sigma to the 24th -- I could parameterize it by whatever I want, and I would just have a simple transformation.
Then you could say that some of them are actually more or less informative, and you're going to have different values for your Fisher information.

So let's just review a few well-known computations. I will focus primarily on the one-dimensional case, as usual. And I claim that there are two definitions. So if theta is a real-valued parameter, then there are basically two definitions that you can think of for your Fisher information. One involves the first derivative of your log-likelihood, and the second one involves the second derivative. So the log-likelihood here, we're actually going to define as l of theta. And what is it? Well, it's simply the log-likelihood for one observation. So it's l -- and I'm going to write a subscript 1 just to make sure we all know we're talking about one observation -- of, in this order, X and theta. That's the log-likelihood, remember? So for example, if I have a density, what is it going to be? It's going to be log of f sub theta of X. So this guy is a random variable, because it's a function of a random variable. And that's why you see expectations of this thing. It's a random function of theta: if I view this as a function of theta, the function becomes random, because it depends on this random X.

And so I of theta is actually defined as the variance of l prime of theta -- the variance of the derivative of this function. And I also claim that it's equal to negative the expectation of l double prime of theta, the second derivative. And here the expectation and the variance are computed -- because this function, remember, is random -- so I need to tell you what the distribution of the X is with respect to which I'm computing the expectation and the variance. And it's P theta itself. So typically -- there's a Fisher information for all values of the parameter, but the one we're typically interested in is the true parameter, theta star.
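As a quick sanity check of these two definitions -- not something done in the lecture, and with a Bernoulli(p) model as a purely illustrative choice -- a few lines of symbolic computation confirm that the variance of l prime and negative the expectation of l double prime agree, both giving 1/(p(1-p)):

```python
# Symbolic sanity check (assumed Bernoulli(p) model, purely illustrative):
# verify that Var(l'(p)) and -E[l''(p)] agree, both giving 1/(p(1-p)).
import sympy as sp

p, x = sp.symbols("p x", positive=True)

# log-likelihood of a single observation x in {0, 1}
log_lik = x * sp.log(p) + (1 - x) * sp.log(1 - p)
score = sp.diff(log_lik, p)       # l'(p)
second = sp.diff(log_lik, p, 2)   # l''(p)

def E(expr):
    # expectation under Bernoulli(p): P(X=1) = p, P(X=0) = 1 - p
    return sp.simplify(p * expr.subs(x, 1) + (1 - p) * expr.subs(x, 0))

var_score = sp.simplify(E(score**2) - E(score)**2)   # Var(l'(p))
neg_exp_second = sp.simplify(-E(second))             # -E[l''(p)]

print(var_score)                                # equivalent to 1/(p*(1 - p))
print(neg_exp_second)                           # equivalent to 1/(p*(1 - p))
print(sp.simplify(var_score - neg_exp_second))  # 0
```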
But view this as a function of theta right now. So now I need to prove to you -- and this is not a trivial statement -- that the variance of the derivative is equal to negative the expectation of the second derivative. There's really quite a bit that goes into this, right? And it comes from the fact that this is the log not of anything; it's the log of a density. So let's just prove that, without bothering too much with some technical assumptions. The technical assumptions are the assumptions that allow me to permute derivative and integral. Because when I compute the variances and expectations, I'm actually integrating against the density, and what I want to do is make sure I can always do that. So my technical assumption is that I can always permute integrals and derivatives.

So let's just prove this. What I'm going to do is assume that X has density f theta. And I'm actually just going to write -- well, let me write it f theta right now. Let me try not to be lazy about writing. And the thing I'm going to use is the fact that the integral of this density is equal to what? 1. And this is where I'm going to start doing weird things. That means that if I take the derivative of this guy, it's equal to 0. So if I look at the derivative with respect to theta of the integral of f theta of X dX, this is equal to 0. And this is where I'm actually making the switch: I'm going to say that this is actually equal to the integral of the derivative. So that's the first thing I'm going to use. And of course, if it's true for the first derivative, it's going to be true for the second derivative, so I'm going to actually do it a second time. And the second thing I'm going to use is the fact that the integral of the second derivative is equal to 0.

So let's start from here. And let me start from, say, the expectation of the second derivative, l double prime of theta. So what is l double prime of theta?
Well, it's the second derivative of log of f theta of X. And we know that the derivative of the log -- sorry -- yeah, so the derivative of the log is 1 over -- well, it's the derivative of f divided by f itself. Everybody's with me? Just log of f, prime, is f prime over f. Here, it's just that for f, I view this as a function of theta and not as a function of X. So now I need to take another derivative of this thing. So that's going to be equal to -- well, we all know the formula for the derivative of a ratio. So I pick up the second derivative times f theta, minus the first derivative squared, divided by f theta squared -- basic calculus.

And now I need to check that negative the expectation of this guy is giving me back what I want. Well, what is negative the expectation of l double prime of theta? What we need to do is take negative the integral of this guy against f theta. So it's minus the integral of -- that's just the definition of the expectation; I take an integral against f theta. But here, I have something nice. What's happening is that those guys are canceling. And now that those guys are canceling, those guys are canceling too. So what I have is that the first term -- I'm going to break this difference here. I'm going to say that the integral of this difference is the difference of the integrals. So the first term is going to be the integral of the second derivative with respect to theta of f theta. And in the second one, the negative signs are going to cancel, and I'm going to be left with this.

Everybody's following? Anybody found the mistake? How about the other mistake? I don't know if there's a mistake. I'm just trying to get you to check what I'm doing. With me so far? So this guy here is the integral of the second derivative of f theta of X dX. What is this?

AUDIENCE: It's 0.

PHILIPPE RIGOLLET: It's 0. And that's because of this guy, which I will call frowny face.
So frowny face tells me this. And let's call this guy monkey that hides his eyes. No, let's just do something simpler. Let's call it star. And this guy, we will use later on.

So now I have to prove that this guy, which I have proved is equal to this, is now equal to the variance of l prime of theta. So now let's go the other way. We're going to meet halfway. I want to prove that this guy is equal to this guy, and I'm going to have a series of equalities so that we meet halfway. So let's start from the other end. We started from negative l double prime of theta; let's start with the variance part. The variance of l prime of theta -- that's the expectation of l prime of theta squared, minus the square of the expectation of l prime of theta.

Now, what is the square of the expectation of l prime of theta? Well, l prime of theta is equal to the partial with respect to theta of log of f theta of X, which we know from the first line over there -- that's what's in the bracket on the second line there -- is actually equal to the partial over theta of f theta of X divided by f theta of X. That's the derivative of the log. So when I look at the expectation of this guy, I'm going to have the integral of this against f theta. And the f thetas are going to cancel again, just like they did here. So this thing is actually equal to the integral of the partial over theta of f theta of X dX. And what is this equal to? 0, by the monkey hiding its eyes -- so that's star -- which tells me that this is equal to 0. So basically, when I compute the variance, this term is not going to matter. I only have to compute the first one.

So what is the first one? Well, the first one is the expectation of l prime squared. And so that guy is the integral of -- well, what is l prime? Again, it's the partial over partial theta of f theta of X divided by f theta of X.
Now, this time, this guy is squared against the density. So one of the f thetas cancels. But now I'm back to what I had before for this guy. So this guy is now equal to this guy. It's just the same formula, so they're the same thing. And so I've moved both ways. Starting from the expression that involves the expectation of the second derivative, I've come to this guy. And starting from the expression that tells me about the variance of the first derivative, I've come to the same guy. So that completes my proof. Are there any questions about the proof?

We have also, on the way, found an explicit formula for the Fisher information. So now that I have this thing, I can actually add that if X has a density, for example, this is also equal to the integral of the partial over theta of f theta of X, squared, divided by f theta of X, because I just proved that those two things were actually equal to the same thing, which was this guy. Now in practice, this is really going to be the useful one. The other two are going to be useful depending on what case you're in. So if I ask you to compute the Fisher information, you now have three ways to pick from. And basically, practice will tell you which one to choose if you want to save five minutes when you're doing your computations. Maybe you're the person who likes to take derivatives, and then you're going to go with the second-derivative one. Maybe you're the person who likes to expand squares, so you're going to take the one that involves the square of l prime. And maybe you're just a normal person, and you want to use that guy.

Why do I care? This is the Fisher information. And I could have defined the [? Hilbert ?] information by taking the square root of this guy plus the sine of this thing and just be super happy and have my name in textbooks. But this thing has a very particular meaning.
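As a quick aside, not from the lecture: the three formulas just listed can be checked numerically. Here is a rough Monte Carlo sketch in the N(0, sigma squared) model parameterized by theta = sigma squared -- the example from the start of the lecture -- where I(theta) = 1/(2 theta squared); the specific numbers are arbitrary choices:

```python
# Monte Carlo check of the three expressions for the Fisher information in the
# N(0, sigma^2) model parameterized by theta = sigma^2 (illustrative choice),
# where I(theta) = 1/(2*theta^2).  Rough sketch, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                    # true value of sigma^2
x = rng.normal(0.0, np.sqrt(theta), size=1_000_000)

# l(theta) = -0.5*log(2*pi*theta) - x^2/(2*theta)
score = -1.0 / (2 * theta) + x**2 / (2 * theta**2)   # l'(theta)
second = 1.0 / (2 * theta**2) - x**2 / theta**3      # l''(theta)

print(np.var(score))        # Var(l')        ~ 0.222
print(-np.mean(second))     # -E[l'']        ~ 0.222
print(np.mean(score**2))    # E[(l')^2], i.e. the integral of (df/dtheta)^2 / f
print(1 / (2 * theta**2))   # exact I(theta) = 0.2222...
```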
When we're doing maximum likelihood estimation -- so remember, maximum likelihood estimation is just an empirical version of trying to minimize the KL divergence. So what we're trying to do with maximum likelihood is really trying to minimize the KL divergence. And we're trying to minimize this function, remember? So now what we're going to do is plot this function. We said, let's place ourselves in cases where this KL is convex, so that its negative is concave. So it's going to look like this -- U-shaped, that's convex. So that's the true thing I'm trying to minimize. And what I said is that I'm going to actually try to estimate this guy. So in practice, I'm going to have something that looks like this, but it's not really this. And we're not going to do this, but you can show that you can control this uniformly over the entire space, so that there is no place where it just becomes huge. In particular, this is not a place where it just becomes super huge and the minimum of the dotted line becomes really far from this guy. So if those two functions are close to each other, then this implies that the minimum of the dotted line is close to the minimum of the solid line. And we know that this is theta star, and this is our MLE estimator, theta hat ML. So that's basically the principle: the more data we have, the closer the dotted line is to the solid line, and so its minimum is closer to the minimum.

But now, this is just one example where I drew a picture for you. There could be some really nasty examples. Think of this example, where I have a function which is convex, but it looks more like this. That's convex, it's U-shaped -- it's just a professional U. Now, I'm going to put a dotted line around it that has pretty much the same fluctuations. The band around it is of this size.
So do we agree that the distance between the solid line and the dotted line is pretty much the same in those two pictures? Now here, depending on how I tilt this guy, basically I can put the minimum theta star wherever I want. And let's say that here, I actually put it here. That's pretty much the minimum of this line. And now the minimum of the dotted line is this guy. So they're very far apart. The fact that I'm very flat at the bottom makes my requirements for staying close to the U-shaped solid curve much more stringent, if I want the minima to stay close. So this is the canonical case, this is the annoying case, and of course you have the awesome case -- it looks like this. And then when you deviate, you can have something that moves pretty far; it doesn't matter, it's always going to stay close.

Now, what is the quantity that measures how curved I am at a given point -- how curved the function is at a given point? The second derivative. And so the Fisher information is negative the second derivative. Why the negative? Well, here -- yeah, we're looking for a minimum, and this guy is really -- you should view this as a flipped function. Here we're trying to maximize the likelihood, which is basically maximizing the negative KL. So the picture I'm showing you is for trying to minimize the KL. The true picture that you should see for this guy is the same, except that it's just flipped over. But the curvature is the same whether I flip my sheet or not, so it's the same thing. So apart from this negative sign, which is just coming from the fact that we're maximizing instead of minimizing, this is just telling me how curved my likelihood is around the maximum. And therefore, it's actually telling me how good, how robust, my maximum likelihood estimator is. It's going to tell me how close my maximum likelihood estimator is going to be to the true parameter.
So I should be able to see that somewhere. There should be some statement that tells me that this Fisher information will play a role when assessing the precision of this estimator. And remember, how do we characterize a good estimator? Well, we look at its bias, or we look at its variance. And we can combine the two and form the quadratic risk. So essentially, what we're going to try to say is that one of those guys -- either the bias or the variance or the quadratic risk -- is going to be worse if my function is flatter, meaning that my Fisher information is smaller. And this is exactly the point of this last theorem.

So let's look at a couple of conditions. This is your typical 1950s statistics paper that has, like, one page of assumptions. And it was like that in the early days because people were trying to make theories that would be valid for as many models as possible. And now people are sort of abusing this, and they're just making these lists of assumptions so that their particular method works for their particular problem, because they just want to take shortcuts. But really, the maximum likelihood estimator is basically as old as modern statistics, and so these were really necessary conditions. We'll just parse them.

The model is identified. Well, it had better be, because I'm trying to estimate theta and not P theta. So this one is good. For all theta, the support of P theta does not depend on theta. That's just something we need to have; otherwise, things become really messy. In particular, I'm not going to be able to define likelihoods -- Kullback-Leibler divergences. Why can I not do that? Well, because the Kullback-Leibler divergence has a log of the ratio of two densities. And if the support is changing with theta, it might be that I have the log of a ratio of something that's 0 and something that's not 0. And the log of 0 is a slightly annoying quantity to play with. So we're just removing that case.
Nothing depends on theta -- think of the support as being basically the entire real line for a Gaussian, for example. Theta star is not on the boundary of theta. Can anybody tell me why this is important? We're talking about derivatives. So when I want to talk about derivatives, I'm talking about fluctuations around a certain point. And if I'm at the boundary, it's actually really annoying. I might have the derivative -- remember, I gave you this example -- where the maximum likelihood is just attained at the boundary, because the function cannot grow anymore at the boundary. But that does not mean that the first-order derivative is equal to 0. It does not mean anything. So all of this picture here is valid only if I'm actually achieving the minimum inside. Because if my theta space stops here and it's just this guy, I'm going to be here, and there are no questions about curvature or anything that comes into play. It's completely different. So here, it's inside. Again, think of theta as being the entire real line; then everything is inside.

I is invertible. What does it mean for a positive number, a 1 by 1 matrix, to be invertible? Yep.

AUDIENCE: It'd be equal to its [INAUDIBLE].

PHILIPPE RIGOLLET: A 1 by 1 matrix, that's a number, right? What is a characteristic -- if I give you a matrix with numbers and ask you if it's invertible, what are you going to do with it?

AUDIENCE: Check if the determinant is 0.

PHILIPPE RIGOLLET: Check if the determinant is 0. What is the determinant of a 1 by 1 matrix? It's just the number itself. So basically, you want to check whether this number is 0 or not. So we're going to think in the one-dimensional case here. And in the one-dimensional case, that just means that the curvature is not 0. Well, it had better not be 0, because otherwise I'm going to have no guarantees. If I'm totally flat, if I have no curvature, I'm basically totally flat at the bottom.
And then I'm going to get nasty things. Now, this is not quite true. I could have the curvature which grows like -- so here, basically, the second derivative is telling me -- if I do a Taylor expansion, it's telling me how I grow as a function of, say, x squared. It's the quadratic term that I'm controlling. It could be that this guy is 0, and then the term of order x to the fourth is picking up -- that could be the first one that's non-zero. But that would mean that my rate of convergence would not be square root of n. When the central limit theorem comes into play, it would become n to the 1/4. And if I have a bunch of zeros until the 16th order, I would have n to the 1/16, because that's really telling me how flat I am. So we're going to focus on the case where it's only the quadratic term, and the rates of the central limit theorem kick in.

And then a few other technical conditions -- we just used a couple of them. So I permuted limit and integral. And you can check that really what you want is that the integral of a derivative is equal to 0. Well, it just means that the values at the two ends are actually the same. So those are slightly different things.

So now, what we have is that the maximum likelihood estimator has the following two properties. The first one -- if I were to say that in words, what would I say? That theta hat is -- is what? Yeah, that's what I would say when I -- that's for mathematicians. But if I'm a statistician, what am I going to say? It's consistent. It's a consistent estimator of theta star. It converges in probability to theta star. And then we have this sort of central limit theorem statement. The central limit theorem tells me that if this was an average, and I remove the expectation of the average -- let's say it's 0, for example -- then square root of n times the average goes to some normal distribution.
This is telling me that this is actually true even if theta hat has nothing to do with an average. That's remarkable. Theta hat might not even have a closed form, and I still have basically the same properties that an average would get from a central limit theorem. And what is the asymptotic variance? So that's the variance in the end. Here, I'm thinking of those guys as being multivariate, and so I have the inverse of the Fisher information matrix that shows up as the variance-covariance matrix asymptotically. But if you think of just a one-dimensional parameter, it's one over the Fisher information, one over the curvature. So if the curvature is really flat, the variance becomes really big. If the function is really flat, curvature is low, variance is big. If the curvature is very high, the variance becomes very low. And so that illustrates everything that's happening with the pictures that we have. And if you look -- what's amazing here -- there is no square root of 2 pi, there are no fudge factors going on here. This is the asymptotic variance, nothing else. It's all in there, all in the curvature.

Are there any questions about this? So you can see here that theta star is the true parameter, and the information matrix is evaluated at theta star. That's the point that matters. When I drew this picture, the point that was at the very bottom was always theta star. It's the one that minimizes the KL divergence, as long as my model is identified. Yes?

AUDIENCE: So the higher the curvature, the higher the inverse of the Fisher information?

PHILIPPE RIGOLLET: No, the higher the Fisher information itself. So the inverse is going to be smaller. So small variance -- that's good. So now, what does it mean? If I look at what the quadratic risk of this guy is, asymptotically -- what is the asymptotic quadratic risk? Well, it's 0, actually.
But if I assume that this thing is true, that this thing is pretty much Gaussian, and I look at the quadratic risk -- well, it's the expectation of the square of this thing. And so it's going to scale like the variance divided by n. The bias goes to 0, just by this. And then the quadratic risk is going to scale like one over the Fisher information, divided by n. So here -- I'm not mentioning the constants. There must be constants, because everything is asymptotic, so for each finite n, I'm going to have some constants that show up.

Everybody just got their mind blown by this amazing theorem? I mean, if you think about it, the MLE can be anything. I'm sorry to tell you, in many instances the MLE is just going to be an average, which is just going to be slightly annoying. But there are some cases where it's not, and we have to resort to this theorem rather than resorting to the central limit theorem to prove this thing. And more importantly, even if this was an average, you don't even have to know how to compute the covariance matrix -- sorry, the variance of this thing -- to plug it into the central limit theorem. I'm telling you, it's actually given by the Fisher information matrix. So if it's an average, between you and me, you probably want to go the central limit theorem route if you want to prove this kind of stuff. But if it's not, then that's your best shot. But you have to check those conditions. I will give you point 5 for granted.

Ready? Any questions? We're going to wrap up this chapter four, so if you have questions, that's the time. Yes?

AUDIENCE: What was the quadratic risk up there?

PHILIPPE RIGOLLET: You mean the definition?

AUDIENCE: No -- what it was for this.

PHILIPPE RIGOLLET: Well, you see, the quadratic risk, if I think of it as being one-dimensional, is the expectation of the square of the difference between theta hat and theta star.
So that means that if I think of this as having that normal distribution, that's basically computing the expectation of the square of this Gaussian, divided by n. I just divided by square root of n on both sides. So it's the expectation of the square of this Gaussian. The Gaussian has mean 0, so the expectation of the square is just the variance. And so I'm left with 1 over the Fisher information, divided by n.

AUDIENCE: I see. OK.
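To make the theorem and this quadratic-risk calculation concrete, here is a small simulation sketch, not from the lecture. It reuses the N(0, sigma squared) model with theta = sigma squared as a purely illustrative choice: there the MLE is the average of the Xi squared, and 1/I(theta star) = 2 theta star squared, so sqrt(n) times (theta hat minus theta star) should look Gaussian with that variance, and the quadratic risk should scale like 1/(n I(theta star)):

```python
# Simulation sketch of the theorem (illustrative N(0, sigma^2) model, theta = sigma^2):
# the MLE is the average of the X_i^2, and 1/I(theta*) = 2*theta*^2, so
# sqrt(n)*(theta_hat - theta*) should look like N(0, 2*theta*^2), and the
# quadratic risk should scale like 1/(n*I(theta*)).  Constants are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.5
n, reps = 500, 20_000

x = rng.normal(0.0, np.sqrt(theta_star), size=(reps, n))
theta_hat = (x**2).mean(axis=1)              # MLE of sigma^2 in each replication

z = np.sqrt(n) * (theta_hat - theta_star)
print(z.mean())                              # ~ 0 (consistency)
print(z.var())                               # ~ 2*theta_star^2 = 4.5 = 1/I(theta*)
print(((theta_hat - theta_star)**2).mean())  # quadratic risk ~ 1/(n*I) = 0.009
```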
PHILIPPE RIGOLLET: So let's move on to chapter four. And this is the method of moments. So the method of moments is actually maybe a bit older than maximum likelihood. Maximum likelihood dates, say, to the early 20th century -- I mean, as a systematic thing -- because, as I said, many of those guys are going to be averages, and finding an average is probably a little older. The method of moments has some really nice uses. There's a paper by Pearson in 1904, I believe, or maybe 1894, I don't know. And in this paper, he was actually studying some species of crab on an island, and he was trying to make some measurements. That's how he came up with this model of mixtures of Gaussians, because there were actually two different populations in this population of crabs. And the way he actually fitted the parameters was by the method of moments, except that since there were a lot of parameters, he basically had to solve six equations with six unknowns. And that was a complete nightmare. The guy did it by hand, and we don't know how he did it, actually. But that is pretty impressive.

So I want to start -- and this first part is a little brutal. But this is a Course 18 class, and I do not want to give you -- so let's all agree that this course might be slightly more challenging than AP statistics. And that means that it's going to be challenging just during class: I'm not going to ask you about the Weierstrass Approximation Theorem during the exams. But what I want is to give you mathematical motivations for what we're doing. And I can promise you that maybe you will have a slightly higher body temperature during the lecture, but you will come out of this class smarter. I'm trying to motivate you to use mathematical tools and show you where interesting mathematical things that you might find dry elsewhere actually work very beautifully in the stats literature. One that we saw was using the Kullback-Leibler divergence as a motivation for maximum likelihood estimation, for example.

So the Weierstrass Approximation Theorem is something that comes from pure analysis. Maybe -- I mean, it took me a while before I saw that. Essentially, what it's telling you is that if you look at a function that is continuous on an interval a, b -- on a segment a, b -- then you can actually approximate it uniformly well by polynomials, as long as you're willing to take the degree of these polynomials large enough. So the formal statement is: for any epsilon, there exist a d that depends on epsilon, and coefficients a0 to ad -- so if you insist on having an accuracy which is 1/10,000, maybe you're going to need a polynomial of degree 100,000, who knows; it doesn't tell you anything about this. But it's telling you that at least you have only a finite number of parameters to approximate those functions that typically require an infinite number of parameters to be described. So that's actually quite nice, and that's the basis for many polynomial methods. And here it's uniform, so there's this max over x that shows up, which is actually nice as well. That's the Weierstrass Approximation Theorem.

Why is that useful to us? Well, in statistics, I have a sample X1 to Xn. I have, say, a unified statistical model -- not unified, identified; I'm not always going to remind you that it's identified. And I'm going to assume that it has a density.
You could think of it as having a PMF, but think of it as having a density for one second. Now, what I want is to find the distribution; I want to find theta. And finding theta, since the model is identified, is equivalent to finding P theta, which is equivalent to finding f theta. And knowing a function -- knowing a density -- is the same as knowing its integral against any test function h. So that means that if I want to make sure I know a density -- if I want to check whether two densities are the same -- all I have to do is compute their integrals against all bounded continuous functions. You already know that it would be true if I checked it for all functions h. But since f is a density, I can actually look only at functions h that are bounded, say between minus 1 and 1, and that are continuous. That's enough. Agreed? Well, just trust me on this. Yes, you have a question?

AUDIENCE: Why is this -- like, why shouldn't you just say that [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, I can do that. I'm just finding a characterization that's going to be useful for me later on. I can find a bunch of them, but here, this one is going to be useful. So all I need to say is that if f theta integrated against h of x agrees with f theta star integrated against h of x for all such h, then this implies that f theta is equal to f theta star -- not everywhere, but almost everywhere. And that's only true if I guarantee to you that f theta and f theta star are densities; this is not true for any function. So now, that means that if I wanted to estimate theta hat, all I would have to do is compute the average -- so this guy here, the integral -- let me clean up my board a bit.

So my goal is to find theta such that, if I look at f theta and I integrate it against h of x, this gives me the same thing as if I were to do it against f theta star. And I want this for any h which is continuous and bounded.
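An aside on the Weierstrass theorem invoked above, not from the lecture: one constructive version of it uses Bernstein polynomials. On [0, 1], the degree-d polynomial that averages f(k/d) with binomial weights converges uniformly to any continuous f as d grows. Here is a minimal sketch; the test function and the degrees are arbitrary choices:

```python
# Aside (not from the lecture): a constructive take on the Weierstrass theorem.
# On [0, 1], the Bernstein polynomial of degree d,
#   B_d(f)(x) = sum_k f(k/d) * C(d, k) * x^k * (1 - x)^(d - k),
# converges uniformly to any continuous f as d grows.  The test function
# and the degrees below are arbitrary choices.
import numpy as np
from math import comb

def bernstein(f, d, x):
    """Evaluate the degree-d Bernstein approximation of f at points x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for k in range(d + 1):
        out += f(k / d) * comb(d, k) * x**k * (1 - x)**(d - k)
    return out

f = lambda t: np.abs(t - 0.5)          # continuous, but not smooth at 0.5
grid = np.linspace(0.0, 1.0, 1001)

for d in (5, 20, 100, 400):
    err = np.max(np.abs(f(grid) - bernstein(f, d, grid)))
    print(d, err)                      # the uniform error shrinks as d grows
```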
So of course, I don't know what this quantity is; it depends on my unknown theta star. But I have data from this. And I'm going to do the usual, good old statistical trick, which is -- well, this I can write as the expectation with respect to P theta star of h of X. That's just the integral of a function against something. And so what I can do is say, well, now I don't know this guy, but my good old trick from the book is to replace expectations by averages. And what I get is the average of the h of Xi's, which is approximately this expectation by the law of large numbers. So if I can actually find an f theta such that, when I integrate it against h, it gives me pretty much the average of the evaluations of h over my data points -- for all h -- then that should be a good candidate.

The problem is that that's a lot of functions to try. Even if we reduced it from all possible functions to bounded and continuous ones, that's still a pretty large infinite number of them. And so what we can do is use our Weierstrass Approximation Theorem. It says, well, maybe I don't need to test against all h; maybe polynomials are enough for me. So what I'm going to do is look only at functions h of the form h of x equals the sum of ak x to the k-th, for k equal 0 to d -- only polynomials of degree d. So when I look at the average of my h's, I'm going to get a term like the first one. So the first one here, this guy, becomes 1/n, sum from i equal 1 to n, sum from k equal 0 to d, of ak Xi to the k-th. That's just the average of the values h of Xi. And now what I need to do is check that it's the same thing when I integrate an h of this form as well. I want this to hold for all polynomials of degree d. That's still a lot of them -- there's still an infinite number of polynomials, because there's an infinite number of numbers a0 to ad that describe those polynomials.
789 00:41:21,340 --> 00:41:23,410 But since those guys are polynomials, 790 00:41:23,410 --> 00:41:26,320 it's actually enough for me to look only at the terms 791 00:41:26,320 --> 00:41:28,420 of the form X to the k-th-- 792 00:41:28,420 --> 00:41:30,610 no linear combination, no nothing. 793 00:41:30,610 --> 00:41:32,290 So actually, it's enough to look only 794 00:41:32,290 --> 00:41:40,050 at h of x, which is equal to X to the k-th for k equal 0 to d. 795 00:41:43,260 --> 00:41:46,350 And now, how many of those guys are there? 796 00:41:46,350 --> 00:41:49,210 Just d plus 1, 0 to d. 797 00:41:49,210 --> 00:41:51,640 So that's actually a much easier thing for me to solve. 798 00:41:54,290 --> 00:42:01,970 Now, this quantity, which is the integral of f theta against X 799 00:42:01,970 --> 00:42:05,360 to the k-th-- so that the expectation of X to the k-th 800 00:42:05,360 --> 00:42:06,860 here-- 801 00:42:06,860 --> 00:42:12,940 it's called moment of order k, or k-th moment of P theta. 802 00:42:12,940 --> 00:42:13,620 That's a moment. 803 00:42:13,620 --> 00:42:16,210 A moment is just the expectation of the power. 804 00:42:16,210 --> 00:42:19,780 The mean is which moment? 805 00:42:19,780 --> 00:42:21,760 The first moment. 806 00:42:21,760 --> 00:42:24,720 And variance is not exactly the second moment. 807 00:42:24,720 --> 00:42:27,170 It's the second moment minus the first moment squared. 808 00:42:29,862 --> 00:42:30,695 That's the variance. 809 00:42:30,695 --> 00:42:34,691 It's E of X squared minus E of X squared. 810 00:42:34,691 --> 00:42:36,440 So those are things that you already know. 811 00:42:36,440 --> 00:42:37,564 And then you can go higher. 812 00:42:37,564 --> 00:42:40,200 You can go to E of X cube, E of X blah, blah. 813 00:42:40,200 --> 00:42:43,030 Here, I say go to E of X to the d-th. 814 00:42:43,030 --> 00:42:44,780 Now, as you can see, this is not something 815 00:42:44,780 --> 00:42:47,781 you can really put in action right now, 816 00:42:47,781 --> 00:42:50,030 because the Weierstrass Approximation Theorem does not 817 00:42:50,030 --> 00:42:52,070 tell you what d should be. 818 00:42:52,070 --> 00:42:54,020 Actually, we totally lost track of the epsilon 819 00:42:54,020 --> 00:42:54,978 I was even looking for. 820 00:42:54,978 --> 00:42:57,300 I just said approximately equal, approximately equal. 821 00:42:57,300 --> 00:42:59,300 And so all this thing is really just motivation. 822 00:42:59,300 --> 00:43:02,730 But it's essentially telling you that if you 823 00:43:02,730 --> 00:43:05,010 go to d large enough, technically 824 00:43:05,010 --> 00:43:08,730 you should be able to identify exactly your distribution up 825 00:43:08,730 --> 00:43:11,280 to epsilon. 826 00:43:11,280 --> 00:43:16,210 So I should be pretty good, if I go to d large enough. 827 00:43:16,210 --> 00:43:19,190 Now in practice, actually there should be much 828 00:43:19,190 --> 00:43:23,120 less than arbitrarily large d. 829 00:43:23,120 --> 00:43:25,460 Typically, we are going to need d which is 1 or 2. 830 00:43:28,150 --> 00:43:31,720 So there are some limitations to the Weierstrass Approximation 831 00:43:31,720 --> 00:43:32,440 Theorem. 832 00:43:32,440 --> 00:43:33,940 And there's a few. 833 00:43:33,940 --> 00:43:35,950 The first one is that it only works 834 00:43:35,950 --> 00:43:39,850 for continuous functions, which is not so much of a problem. 835 00:43:39,850 --> 00:43:42,740 That can be fixed. 836 00:43:42,740 --> 00:43:44,560 Well, we need bounded continuous functions. 
837 00:43:44,560 --> 00:43:45,961 It works only on intervals. 838 00:43:45,961 --> 00:43:47,460 That's annoying, because we're going 839 00:43:47,460 --> 00:43:51,080 to have random variables that are defined beyond intervals. 840 00:43:51,080 --> 00:43:53,560 So we need something that just goes beyond the intervals. 841 00:43:53,560 --> 00:43:55,840 And you can imagine that if you let your functions be huge, 842 00:43:55,840 --> 00:43:57,256 it's going to be very hard for you 843 00:43:57,256 --> 00:44:00,001 to have a polynomial approximately [INAUDIBLE] well. 844 00:44:00,001 --> 00:44:02,500 Things are going to start going up and down at the boundary, 845 00:44:02,500 --> 00:44:05,380 and it's going to be very hard. 846 00:44:05,380 --> 00:44:07,360 And again, as I said several times, 847 00:44:07,360 --> 00:44:09,160 it doesn't tell us what d should be. 848 00:44:09,160 --> 00:44:11,470 And as statisticians, we're looking for methods, not 849 00:44:11,470 --> 00:44:15,910 like principles of existence of a method that exists. 850 00:44:15,910 --> 00:44:21,840 So if E is discrete, I can actually 851 00:44:21,840 --> 00:44:23,720 get a handle on this d. 852 00:44:23,720 --> 00:44:26,730 If E is discrete and actually finite-- 853 00:44:26,730 --> 00:44:29,250 I'm going to actually look at a finite E, 854 00:44:29,250 --> 00:44:33,690 meaning that I have a PMF on, say, r possible values, x1 855 00:44:33,690 --> 00:44:34,404 and xr. 856 00:44:34,404 --> 00:44:35,820 My random variable, capital X, can 857 00:44:35,820 --> 00:44:37,290 take only r possible values. 858 00:44:37,290 --> 00:44:41,550 Let's think of them as being the integer numbers 1 to r. 859 00:44:41,550 --> 00:44:44,880 That's the number of success out of r trials 860 00:44:44,880 --> 00:44:46,590 that I get, for example. 861 00:44:46,590 --> 00:44:51,640 Binomial rp, that's exactly something like this. 862 00:44:51,640 --> 00:44:55,600 So now, clearly this entire distribution 863 00:44:55,600 --> 00:45:00,452 is defined by the PMF, which gives me exactly r numbers. 864 00:45:00,452 --> 00:45:02,410 So it can completely describe this distribution 865 00:45:02,410 --> 00:45:03,850 with r numbers. 866 00:45:03,850 --> 00:45:08,290 The question is, do I have an enormous amount of redundancy 867 00:45:08,290 --> 00:45:12,250 if I try to describe this distribution using moments? 868 00:45:12,250 --> 00:45:14,970 It might be that I need-- say, r is equal to 10, 869 00:45:14,970 --> 00:45:18,080 maybe I have only 10 numbers to describe this thing, 870 00:45:18,080 --> 00:45:20,980 but I actually need to compute moments up to the order of 100 871 00:45:20,980 --> 00:45:23,500 before I actually recover entirely the distribution. 872 00:45:23,500 --> 00:45:25,090 Maybe I need to go infinite. 873 00:45:25,090 --> 00:45:27,220 Maybe the Weierstrass Theorem is the only thing 874 00:45:27,220 --> 00:45:28,420 that actually saves me here. 875 00:45:28,420 --> 00:45:30,720 And I just cannot recover it exactly. 876 00:45:30,720 --> 00:45:33,340 I can go to epsilon if I'm willing to go to higher 877 00:45:33,340 --> 00:45:34,611 and higher polynomials. 878 00:45:34,611 --> 00:45:36,610 Oh, by the way, in the Weierstrass Approximation 879 00:45:36,610 --> 00:45:39,190 Theorem, I can promise you that as epsilon goes to 0, 880 00:45:39,190 --> 00:45:41,660 d goes to infinity. 881 00:45:41,660 --> 00:45:46,160 So now, really I don't even have actually r parameters. 
882 00:45:46,160 --> 00:45:50,070 I have only r minus 1 parameters, because the last one-- 883 00:45:50,070 --> 00:45:51,500 because they sum up to 1. 884 00:45:51,500 --> 00:45:53,630 So the last one I can always get by doing 885 00:45:53,630 --> 00:45:56,960 1 minus the sum of the first r minus 1. 886 00:45:56,960 --> 00:45:58,520 Agreed? 887 00:45:58,520 --> 00:46:01,100 So each distribution on r numbers is described 888 00:46:01,100 --> 00:46:04,700 by r minus 1 parameters. 889 00:46:04,700 --> 00:46:07,025 The question is, can I use only r minus 1 moments 890 00:46:07,025 --> 00:46:08,020 to describe this guy? 891 00:46:12,870 --> 00:46:16,380 This is something called Gaussian quadrature. 892 00:46:16,380 --> 00:46:18,930 The Gaussian quadrature tells you, yes, moments 893 00:46:18,930 --> 00:46:22,380 are actually a good way to reparametrize your distribution 894 00:46:22,380 --> 00:46:24,870 in the sense that if I give you the moments 895 00:46:24,870 --> 00:46:27,120 or if I give you the probability mass function, 896 00:46:27,120 --> 00:46:29,370 I'm basically giving you exactly the same information. 897 00:46:29,370 --> 00:46:32,770 You can recover all the probabilities from there. 898 00:46:32,770 --> 00:46:34,930 So here, I'm going to denote by-- 899 00:46:34,930 --> 00:46:37,870 I'm going to drop the notation in theta. 900 00:46:37,870 --> 00:46:38,770 I don't have theta. 901 00:46:38,770 --> 00:46:41,460 Here, I'm talking about any generic distribution. 902 00:46:41,460 --> 00:46:44,950 And so I'm going to call mk the k-th moment. 903 00:46:49,080 --> 00:46:54,610 And if I have a PMF, this is really the sum for j 904 00:46:54,610 --> 00:47:06,690 equals 1 to r of xj to the k-th times p of xj. 905 00:47:06,690 --> 00:47:10,450 And this is the PMF. 906 00:47:10,450 --> 00:47:12,100 So that's my k-th moment. 907 00:47:12,100 --> 00:47:15,340 So the k-th moment is a linear combination of the numbers 908 00:47:15,340 --> 00:47:16,430 that I am interested in. 909 00:47:19,750 --> 00:47:22,195 So that's one equation. 910 00:47:25,220 --> 00:47:27,350 And I have as many equations as I'm actually 911 00:47:27,350 --> 00:47:28,700 willing to look at moments. 912 00:47:28,700 --> 00:47:34,250 So if I'm looking at 25 moments, I have 25 equations. 913 00:47:34,250 --> 00:47:36,650 m1 equals blah with this to the power of 1, 914 00:47:36,650 --> 00:47:40,020 m2 equals blah with this to the power of 2, et cetera. 915 00:47:40,020 --> 00:47:41,850 And then I also have the equation 916 00:47:41,850 --> 00:47:51,240 that 1 is equal to the sum of the p of xj. 917 00:47:51,240 --> 00:47:55,190 That's just the definition of PMF. 918 00:47:55,190 --> 00:47:56,190 So this is r's. 919 00:47:56,190 --> 00:47:58,163 They're ugly, but those are r's. 920 00:48:00,790 --> 00:48:04,390 So now, this is a system of linear equations in p, 921 00:48:04,390 --> 00:48:07,390 and I can actually write it in its canonical form, which 922 00:48:07,390 --> 00:48:11,410 is of the form a matrix of those guys 923 00:48:11,410 --> 00:48:15,750 times my parameters of interest is equal to a right hand side. 924 00:48:15,750 --> 00:48:17,880 The right hand side is the moments.
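Written out, the system just described looks like this (a reconstruction from the description above, with the row of 1's, the moment of order 0, placed on top):

$$
\begin{pmatrix}
1 & 1 & \cdots & 1\\
x_1 & x_2 & \cdots & x_r\\
x_1^2 & x_2^2 & \cdots & x_r^2\\
\vdots & \vdots & & \vdots\\
x_1^{r-1} & x_2^{r-1} & \cdots & x_r^{r-1}
\end{pmatrix}
\begin{pmatrix}
p(x_1)\\ p(x_2)\\ \vdots\\ p(x_r)
\end{pmatrix}
=
\begin{pmatrix}
1\\ m_1\\ \vdots\\ m_{r-1}
\end{pmatrix},
\qquad
m_k = \sum_{j=1}^{r} x_j^k\, p(x_j).
$$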
925 00:48:17,880 --> 00:48:20,070 That means, if I did you the moments, 926 00:48:20,070 --> 00:48:22,740 can you come back and find what the PMF, 927 00:48:22,740 --> 00:48:24,870 because we know already from probability 928 00:48:24,870 --> 00:48:27,390 that the PMF is all I need to know to fully describe 929 00:48:27,390 --> 00:48:29,190 my distribution. 930 00:48:29,190 --> 00:48:32,010 Given the moments, that's unclear. 931 00:48:32,010 --> 00:48:37,830 Now, here, I'm going to actually take exactly r minus 1 moment 932 00:48:37,830 --> 00:48:39,960 and this extra condition that the sum of those guys 933 00:48:39,960 --> 00:48:41,640 should be 1. 934 00:48:41,640 --> 00:48:45,240 So that gives me r equations based on r minus 1 moments. 935 00:48:45,240 --> 00:48:47,260 And how many unknowns do I have? 936 00:48:47,260 --> 00:48:54,230 Well, I have my r unknown parameters 937 00:48:54,230 --> 00:48:59,060 for the PMF, the r values of the PMF. 938 00:48:59,060 --> 00:49:02,540 Now, of course, this is going to play a huge role 939 00:49:02,540 --> 00:49:06,920 in whether the are many p's that give me the same. 940 00:49:06,920 --> 00:49:09,620 The goal is to find if there are several p's that can give me 941 00:49:09,620 --> 00:49:10,552 the same moments. 942 00:49:10,552 --> 00:49:13,010 But if there's only one p that can give me a set of moment, 943 00:49:13,010 --> 00:49:15,260 that means that I have a one-to-one correspondence 944 00:49:15,260 --> 00:49:17,294 between PMF and moments. 945 00:49:17,294 --> 00:49:18,710 And so if you give me the moments, 946 00:49:18,710 --> 00:49:23,074 I can just go back to the PMF. 947 00:49:23,074 --> 00:49:23,990 Now, how do I go back? 948 00:49:23,990 --> 00:49:26,310 Well, by inverting this matrix. 949 00:49:26,310 --> 00:49:28,710 If I multiply this matrix by its inverse, 950 00:49:28,710 --> 00:49:32,600 I'm going to get the identity times the vector of p's equal 951 00:49:32,600 --> 00:49:36,890 the inverse of the matrix times the m's. 952 00:49:36,890 --> 00:49:41,150 So what we want to do is to say that p 953 00:49:41,150 --> 00:49:45,350 is equal to the inverse of this big matrix times the moments 954 00:49:45,350 --> 00:49:47,190 that you give me. 955 00:49:47,190 --> 00:49:49,380 And if I can actually talk about the inverse, 956 00:49:49,380 --> 00:49:52,410 then I have basically a one-to-one mapping 957 00:49:52,410 --> 00:49:55,930 between the m's, the moments, and the matrix. 958 00:49:55,930 --> 00:49:58,380 So what I need to show is that this matrix is invertible. 959 00:49:58,380 --> 00:50:01,350 And we just decided that the way to check 960 00:50:01,350 --> 00:50:05,670 if a matrix is invertible is by computing its determinant. 961 00:50:05,670 --> 00:50:10,300 Who has computed a determinant before? 962 00:50:10,300 --> 00:50:12,820 Who was supposed to compute a determinant at least than just 963 00:50:12,820 --> 00:50:15,004 to say, no, you know how to do it. 964 00:50:15,004 --> 00:50:16,670 So you know how to compute determinants. 965 00:50:16,670 --> 00:50:19,180 And if you've seen any determinant in class, 966 00:50:19,180 --> 00:50:22,660 there's one that shows up in the exercises that professors love. 967 00:50:22,660 --> 00:50:25,390 And it's called the Vandermonde determinant. 968 00:50:25,390 --> 00:50:26,890 And it's the determinant of a matrix 969 00:50:26,890 --> 00:50:28,900 that have a very specific form. 
970 00:50:28,900 --> 00:50:31,780 It looks like-- so there's basically only r parameters 971 00:50:31,780 --> 00:50:33,950 to this r by r matrix. 972 00:50:33,950 --> 00:50:36,340 The first row, or the first column-- sometimes, 973 00:50:36,340 --> 00:50:37,810 it's presented like that-- 974 00:50:37,810 --> 00:50:41,551 is this vector where each entry is to the power of 1. 975 00:50:41,551 --> 00:50:43,800 And the second one is each entry is to the power of 2, 976 00:50:43,800 --> 00:50:46,970 and to the power of 3, and to the power 4, et cetera. 977 00:50:46,970 --> 00:50:49,410 So that's exactly what we have-- x1 to the first, x2 978 00:50:49,410 --> 00:50:51,460 to the first, all the way to xr to the first, 979 00:50:51,460 --> 00:50:53,290 and then same thing to the power of 2, 980 00:50:53,290 --> 00:50:54,550 all the way to the last one. 981 00:50:54,550 --> 00:50:58,270 And I also need to add the row of all 1's, which 982 00:50:58,270 --> 00:51:01,210 you can think of those guys are to the power of 0, if you want. 983 00:51:01,210 --> 00:51:02,690 So I should really put it on top, 984 00:51:02,690 --> 00:51:05,430 if I wanted to have a nice ordering. 985 00:51:05,430 --> 00:51:07,290 So that was the matrix that I had. 986 00:51:07,290 --> 00:51:09,060 And I'm not asking you to check it. 987 00:51:09,060 --> 00:51:10,860 You can prove that by induction actually, 988 00:51:10,860 --> 00:51:14,190 typically by doing the usual let's eliminate 989 00:51:14,190 --> 00:51:15,810 some rows and columns type of tricks 990 00:51:15,810 --> 00:51:17,260 that you do for matrices. 991 00:51:17,260 --> 00:51:19,197 So you basically start from the whole matrix. 992 00:51:19,197 --> 00:51:21,780 And then you move onto a matrix that has only one 1's and then 993 00:51:21,780 --> 00:51:22,770 0's here. 994 00:51:22,770 --> 00:51:25,827 And then you have Vandermonde that's just slightly smaller. 995 00:51:25,827 --> 00:51:26,910 And then you just iterate. 996 00:51:26,910 --> 00:51:27,894 Yeah. 997 00:51:27,894 --> 00:51:31,502 AUDIENCE: I feel like there's a loss to either the supra index, 998 00:51:31,502 --> 00:51:35,274 or the sub index should have a k somewhere [INAUDIBLE].. 999 00:51:38,119 --> 00:51:39,702 [INAUDIBLE] the one I'm talking about? 1000 00:51:39,702 --> 00:51:41,285 PHILIPPE RIGOLLET: Yeah, I know, but I 1001 00:51:41,285 --> 00:51:45,280 don't think the answer to your question is yes. 1002 00:51:45,280 --> 00:51:48,330 So k is the general index, right? 1003 00:51:48,330 --> 00:51:51,180 So there's no k. k does not exist. k just is here for me 1004 00:51:51,180 --> 00:51:53,310 to tell me for k equals 1 to r. 1005 00:51:53,310 --> 00:51:56,250 So this is an r by r matrix. 1006 00:51:56,250 --> 00:51:58,290 And so there is no k there. 1007 00:51:58,290 --> 00:52:00,960 So if you wanted the generic term, 1008 00:52:00,960 --> 00:52:03,980 if I wanted to put 1 in the middle on the j-th row and k-th 1009 00:52:03,980 --> 00:52:07,740 column, that would be x-- 1010 00:52:07,740 --> 00:52:13,410 so j-th row would be x sub k to the power of j. 1011 00:52:13,410 --> 00:52:15,420 That would be the-- 1012 00:52:15,420 --> 00:52:19,090 And so now, this is basically the sum-- 1013 00:52:19,090 --> 00:52:20,630 well, that should not be strictly-- 1014 00:52:20,630 --> 00:52:25,000 So that would be for j and k between 1 and r. 1015 00:52:25,000 --> 00:52:26,920 So this is the formula that get when 1016 00:52:26,920 --> 00:52:29,470 you try to expand this Vandermonde determinant. 
1017 00:52:29,470 --> 00:52:32,110 You have to do it only once when you're a sophomore typically. 1018 00:52:32,110 --> 00:52:34,000 And then you can just go on Wikipedia to do it. 1019 00:52:34,000 --> 00:52:34,750 That's what I did. 1020 00:52:34,750 --> 00:52:36,700 I actually made a mistake copying it. 1021 00:52:36,700 --> 00:52:39,370 The first one should be 1 less than or equal to j. 1022 00:52:39,370 --> 00:52:42,370 And the last one should be k less than or equal to r. 1023 00:52:42,370 --> 00:52:43,870 And now what you have is the product 1024 00:52:43,870 --> 00:52:45,520 of the differences of xj and xk. 1025 00:52:47,204 --> 00:52:48,620 And for this thing to be non-zero, 1026 00:52:48,620 --> 00:52:51,259 you need all the terms to be non-zero. 1027 00:52:51,259 --> 00:52:52,800 And for all the terms to be non-zero, 1028 00:52:52,800 --> 00:52:58,412 you need to have no xi, xj, and no xj, xk that are identical. 1029 00:52:58,412 --> 00:52:59,870 If all those are different numbers, 1030 00:52:59,870 --> 00:53:03,094 then this product is going to be different from 0. 1031 00:53:03,094 --> 00:53:05,010 And those are different numbers, because those 1032 00:53:05,010 --> 00:53:09,050 are r possible values that your random verbal takes. 1033 00:53:09,050 --> 00:53:11,360 You're not going to say that it takes two 1034 00:53:11,360 --> 00:53:14,010 with probability 1.5-- 1035 00:53:14,010 --> 00:53:18,170 sorry, two with probability 0.5 and two with probability 0.25. 1036 00:53:18,170 --> 00:53:22,370 You're going to say it takes two with probability 0.75 directly. 1037 00:53:22,370 --> 00:53:24,350 So those xj's are different. 1038 00:53:24,350 --> 00:53:27,510 These are the different values that your random variable 1039 00:53:27,510 --> 00:53:28,010 can take. 1040 00:53:32,200 --> 00:53:37,404 Remember, xj, xk was just the different values x1 to xr-- 1041 00:53:37,404 --> 00:53:39,820 sorry-- was the different values that your random variable 1042 00:53:39,820 --> 00:53:41,020 can take. 1043 00:53:41,020 --> 00:53:43,796 Nobody in their right mind would write twice the same value 1044 00:53:43,796 --> 00:53:44,788 in this list. 1045 00:53:47,450 --> 00:53:49,119 So my Vandermonde is non-zero. 1046 00:53:49,119 --> 00:53:49,910 So I can invert it. 1047 00:53:49,910 --> 00:53:51,493 And I have a one-to-one correspondence 1048 00:53:51,493 --> 00:53:55,970 between my entire PMF and the first r minus 1's moments 1049 00:53:55,970 --> 00:54:00,390 to which I append the number 1, which is really 1050 00:54:00,390 --> 00:54:02,700 the moment of order 0 again. 1051 00:54:02,700 --> 00:54:05,550 It's E of X to the 0-th, which is 1. 1052 00:54:05,550 --> 00:54:10,110 So good news, I only need r minus 1 parameters 1053 00:54:10,110 --> 00:54:12,030 to describe r minus 1 parameters. 1054 00:54:12,030 --> 00:54:14,260 And I can choose either the values of my PMF. 1055 00:54:14,260 --> 00:54:16,360 Or I can choose the r minus 1 first moments. 1056 00:54:20,300 --> 00:54:22,580 So the moments tell me something. 1057 00:54:22,580 --> 00:54:26,450 Here, it tells me that if I have a discrete distribution 1058 00:54:26,450 --> 00:54:28,160 with r possible values, I only need 1059 00:54:28,160 --> 00:54:30,200 to compute r minus 1 moments. 1060 00:54:30,200 --> 00:54:34,471 So this is better than Weierstrass Approximation 1061 00:54:34,471 --> 00:54:34,970 Theorem. 1062 00:54:34,970 --> 00:54:37,970 This tells me exactly how many moments I need to consider. 
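To make the inversion concrete, here is a minimal sketch in Python (the support points and the PMF are hypothetical, chosen only for illustration): it recovers a PMF from the moment of order 0, which is 1, together with the first r minus 1 moments, by solving the Vandermonde-type system described above.

```python
import numpy as np

# Hypothetical discrete distribution on r = 4 known support points.
x = np.array([1.0, 2.0, 3.0, 4.0])
p_true = np.array([0.1, 0.2, 0.3, 0.4])   # the PMF we pretend not to know

# Vandermonde-type matrix: row k holds x_1^k, ..., x_r^k for k = 0, ..., r-1.
V = x[None, :] ** np.arange(len(x))[:, None]
m = V @ p_true                            # right-hand side: 1, m_1, ..., m_{r-1}

# The x_j are distinct, so the Vandermonde determinant is non-zero and the
# system has a unique solution: the PMF itself.
p_recovered = np.linalg.solve(V, m)
print(np.allclose(p_recovered, p_true))   # True
```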
1063 00:54:37,970 --> 00:54:39,410 And this is for any distribution. 1064 00:54:39,410 --> 00:54:41,100 This is not a distribution that's 1065 00:54:41,100 --> 00:54:43,790 parametrized by one parameter, like the Poisson 1066 00:54:43,790 --> 00:54:47,210 or the binomial or all this stuff. 1067 00:54:47,210 --> 00:54:50,250 This is for any distribution under a finite number. 1068 00:54:50,250 --> 00:54:53,810 So hopefully, if I reduce the family of PMFs 1069 00:54:53,810 --> 00:54:55,970 that I'm looking at to a one-parameter family, 1070 00:54:55,970 --> 00:54:58,430 I'm actually going to need to compute much less than r 1071 00:54:58,430 --> 00:55:01,110 minus 1 values. 1072 00:55:01,110 --> 00:55:02,640 But this is actually hopeful. 1073 00:55:02,640 --> 00:55:04,650 It tells you that the method of moments 1074 00:55:04,650 --> 00:55:06,775 is going to work for any distribution. 1075 00:55:06,775 --> 00:55:09,417 You just have to invert a Vandermonde matrix. 1076 00:55:13,220 --> 00:55:17,350 So just the conclusion-- the statistical conclusion-- 1077 00:55:17,350 --> 00:55:20,770 is that moments contain important information 1078 00:55:20,770 --> 00:55:24,880 about the PMF and the PDF. 1079 00:55:24,880 --> 00:55:26,890 If we can estimate these moments accurately, 1080 00:55:26,890 --> 00:55:30,820 we can solve for the parameters of the distribution 1081 00:55:30,820 --> 00:55:32,674 and recover the distribution. 1082 00:55:32,674 --> 00:55:34,090 And in a parametric setting, where 1083 00:55:34,090 --> 00:55:36,970 knowing P theta amounts to knowing theta, which 1084 00:55:36,970 --> 00:55:41,270 is identifiability-- this is not innocuous-- 1085 00:55:41,270 --> 00:55:44,260 it is often the case that even less moments are needed. 1086 00:55:44,260 --> 00:55:46,810 After all, if theta is a one dimensional parameter, 1087 00:55:46,810 --> 00:55:48,730 I have one parameter to estimate. 1088 00:55:48,730 --> 00:55:51,370 Why would I go and get 25 moments 1089 00:55:51,370 --> 00:55:52,870 to get this one parameter. 1090 00:55:52,870 --> 00:55:54,532 Typically, there is actually-- we 1091 00:55:54,532 --> 00:55:55,990 will see that the method of moments 1092 00:55:55,990 --> 00:55:58,480 just says if you have a d dimensional parameter, 1093 00:55:58,480 --> 00:56:02,110 just compute d moments, and that's it. 1094 00:56:02,110 --> 00:56:04,280 But this is only on a case-by-case basis. 1095 00:56:04,280 --> 00:56:07,610 I mean, maybe your model will totally screw up its parameters 1096 00:56:07,610 --> 00:56:09,950 and you actually need to get them. 1097 00:56:09,950 --> 00:56:16,710 I mean, think about it, if the function is parameterized just 1098 00:56:16,710 --> 00:56:19,047 by its 27th moment-- 1099 00:56:19,047 --> 00:56:21,630 like, that's the only thing that matters in this distribution, 1100 00:56:21,630 --> 00:56:24,187 I just describe the function, it's just a density, 1101 00:56:24,187 --> 00:56:26,520 and the only thing that can change from one distribution 1102 00:56:26,520 --> 00:56:28,484 to another is this 27th moment-- 1103 00:56:28,484 --> 00:56:30,900 well, then you're going to have to go get the 27th moment. 1104 00:56:30,900 --> 00:56:33,780 And that probably means that your modeling step was actually 1105 00:56:33,780 --> 00:56:34,686 pretty bad. 1106 00:56:37,680 --> 00:56:40,970 So the rule of thumb, if theta is in Rd, we need d moments. 1107 00:56:46,970 --> 00:56:48,430 So what is the method of moments? 
1108 00:56:52,800 --> 00:56:55,080 That's just a good old trick. 1109 00:56:55,080 --> 00:56:58,380 Replace the expectation by averages. 1110 00:56:58,380 --> 00:56:59,970 That's the beauty. 1111 00:56:59,970 --> 00:57:02,080 The moments are expectations. 1112 00:57:02,080 --> 00:57:04,710 So let's just replace the expectations by averages 1113 00:57:04,710 --> 00:57:07,620 and then do it with the average version, 1114 00:57:07,620 --> 00:57:10,200 as if it was the true one. 1115 00:57:10,200 --> 00:57:14,160 So for example, I'm going to talk about population moments, 1116 00:57:14,160 --> 00:57:16,357 when I'm computing them with the true distribution, 1117 00:57:16,357 --> 00:57:18,690 and I'm going to talk about them empirical moments, when 1118 00:57:18,690 --> 00:57:22,290 I talk about averages. 1119 00:57:22,290 --> 00:57:24,690 So those are the two quantities that I have. 1120 00:57:24,690 --> 00:57:28,430 And now, what I hope is that there is. 1121 00:57:28,430 --> 00:57:30,960 So this is basically-- 1122 00:57:30,960 --> 00:57:32,140 everything is here. 1123 00:57:32,140 --> 00:57:33,750 That's where all the money is. 1124 00:57:33,750 --> 00:57:36,960 I'm going to assume there's a function psi that maps 1125 00:57:36,960 --> 00:57:40,120 my parameters-- let's say they're in Rd-- 1126 00:57:40,120 --> 00:57:42,385 to the set of the first d moments. 1127 00:57:45,490 --> 00:57:48,040 Well, what I want to do is to come from this guy 1128 00:57:48,040 --> 00:57:49,070 back to theta. 1129 00:57:49,070 --> 00:57:50,980 So it better be that this function is-- 1130 00:57:54,850 --> 00:57:55,802 invertible. 1131 00:57:55,802 --> 00:57:57,385 I want this function to be invertible. 1132 00:57:57,385 --> 00:57:59,200 In the Vandermonde case, this function 1133 00:57:59,200 --> 00:58:03,610 with just a linear function-- multiply a matrix by theta. 1134 00:58:03,610 --> 00:58:06,610 Then inverting a linear function is inverting the matrix. 1135 00:58:06,610 --> 00:58:08,145 Then this is the same thing. 1136 00:58:08,145 --> 00:58:09,520 So now what I'm going to assume-- 1137 00:58:09,520 --> 00:58:14,470 and that's key for this method to work-- is that this theta-- 1138 00:58:14,470 --> 00:58:16,570 so this function psi is one to one. 1139 00:58:16,570 --> 00:58:24,360 There's only one theta that gets only one set of moments. 1140 00:58:24,360 --> 00:58:26,750 And so if it's one to one, I can talk about its inverse. 1141 00:58:26,750 --> 00:58:28,750 And so now, I'm going to be able to define theta 1142 00:58:28,750 --> 00:58:32,330 as the inverse of the moments-- 1143 00:58:32,330 --> 00:58:33,620 the reciprocal of the moments. 1144 00:58:33,620 --> 00:58:37,940 And so now, what I get is that the moment estimator is just 1145 00:58:37,940 --> 00:58:42,140 the thing where rather than taking the true guys in there, 1146 00:58:42,140 --> 00:58:44,780 I'm actually going to take the empirical moments in there. 1147 00:58:48,580 --> 00:58:50,530 Before we go any further, I'd like 1148 00:58:50,530 --> 00:58:53,980 to just go back and tell you that this is not 1149 00:58:53,980 --> 00:58:56,380 completely free. 1150 00:58:56,380 --> 00:58:58,382 How well-behaved your function psi 1151 00:58:58,382 --> 00:58:59,590 is going to play a huge role. 1152 00:59:02,490 --> 00:59:05,394 Can somebody tell me what the typical distance-- 1153 00:59:05,394 --> 00:59:06,810 if I have a sample of size n, what 1154 00:59:06,810 --> 00:59:10,062 is the typical distance between an average and the expectation? 
1155 00:59:12,790 --> 00:59:14,360 What is the typical distance? 1156 00:59:14,360 --> 00:59:18,920 What is the order of magnitude as a function of n between xn 1157 00:59:18,920 --> 00:59:23,024 bar and its expectation. 1158 00:59:23,024 --> 00:59:25,000 AUDIENCE: 1 over square root of n. 1159 00:59:25,000 --> 00:59:25,760 PHILIPPE RIGOLLET: 1 over square root n. 1160 00:59:25,760 --> 00:59:28,064 That's what the central limit theorem tells us, right? 1161 00:59:28,064 --> 00:59:29,480 The central limit theorem tells us 1162 00:59:29,480 --> 00:59:31,521 that those things are basically a Gaussian, which 1163 00:59:31,521 --> 00:59:34,490 is of order of 1 divided by its square of n. 1164 00:59:34,490 --> 00:59:37,670 And so basically, I start with something 1165 00:59:37,670 --> 00:59:41,530 which is 1 over square root of n away from the true thing. 1166 00:59:41,530 --> 00:59:49,730 Now, if my function psi inverse is super steep like this-- 1167 00:59:49,730 --> 00:59:54,970 that's psi inverse-- then just small fluctuations, even 1168 00:59:54,970 --> 00:59:57,310 if they're of order 1 square root of n, 1169 00:59:57,310 --> 01:00:04,090 can translate into giant fluctuations in the y-axis. 1170 01:00:04,090 --> 01:00:06,040 And that's going to be controlled 1171 01:00:06,040 --> 01:00:09,640 by how steep psi inverse is, which is the same 1172 01:00:09,640 --> 01:00:14,150 as saying how flat psi is-- 1173 01:00:14,150 --> 01:00:15,720 how flat is psi. 1174 01:00:15,720 --> 01:00:20,440 So if you go back to this Vandermonde inverse, 1175 01:00:20,440 --> 01:00:26,570 what it's telling you is that if this inverse matrix blows up 1176 01:00:26,570 --> 01:00:29,030 this guy a lot-- 1177 01:00:29,030 --> 01:00:32,566 so if I start from a small fluctuation of this thing 1178 01:00:32,566 --> 01:00:34,190 and then they're blowing up by applying 1179 01:00:34,190 --> 01:00:36,050 the inverse of this matrix, things 1180 01:00:36,050 --> 01:00:37,600 are not going to go well. 1181 01:00:37,600 --> 01:00:41,860 Anybody knows what is the number that I should be looking for? 1182 01:00:41,860 --> 01:00:45,080 So that's from, say, numerical linear algebra 1183 01:00:45,080 --> 01:00:47,270 numerical methods. 1184 01:00:47,270 --> 01:00:49,244 When I have a system of linear equations, 1185 01:00:49,244 --> 01:00:50,660 what is the actual number I should 1186 01:00:50,660 --> 01:00:53,510 be looking at to know how much I'm 1187 01:00:53,510 --> 01:00:54,950 blowing up the fluctuations? 1188 01:00:54,950 --> 01:00:55,090 Yeah. 1189 01:00:55,090 --> 01:00:55,776 AUDIENCE: Condition number? 1190 01:00:55,776 --> 01:00:57,280 PHILIPPE RIGOLLET: The condition number, right. 1191 01:00:57,280 --> 01:00:59,600 So what's important here is the condition number 1192 01:00:59,600 --> 01:01:00,680 of this matrix. 1193 01:01:00,680 --> 01:01:03,715 If the condition number of this matrix is small, 1194 01:01:03,715 --> 01:01:04,340 then it's good. 1195 01:01:04,340 --> 01:01:05,660 It's not going to blow up much. 1196 01:01:05,660 --> 01:01:07,280 But if the condition number is very large, 1197 01:01:07,280 --> 01:01:08,720 it's just going to blow up a lot. 1198 01:01:08,720 --> 01:01:10,310 And the condition number is the ratio 1199 01:01:10,310 --> 01:01:13,010 of the largest and the smallest eigenvalues. 1200 01:01:13,010 --> 01:01:14,720 So you'll have to know what it is. 1201 01:01:14,720 --> 01:01:17,180 But this is how all these things get together. 
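As a rough numerical illustration of that amplification (my own sketch, not from the lecture), the condition number of the same Vandermonde-type matrix blows up quickly as r grows, so small errors in the estimated moments can turn into large errors in the recovered PMF.

```python
import numpy as np

# numpy's cond() returns the 2-norm condition number, the ratio of the largest
# to the smallest singular value; it measures how much solving the linear
# system can amplify errors in the right-hand side (here, estimated moments).
for r in (3, 5, 8, 12):
    x = np.arange(1.0, r + 1.0)                 # hypothetical support points 1, ..., r
    V = x[None, :] ** np.arange(r)[:, None]     # rows x_j^0, x_j^1, ..., x_j^(r-1)
    print(r, np.linalg.cond(V))
```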
1202 01:01:17,180 --> 01:01:21,380 So the numerical stability translates 1203 01:01:21,380 --> 01:01:24,835 into statistical stability here. 1204 01:01:24,835 --> 01:01:26,684 And numerical means just if I had 1205 01:01:26,684 --> 01:01:28,350 errors in measuring the right hand side, 1206 01:01:28,350 --> 01:01:30,360 how much would they translate into errors on the left hand 1207 01:01:30,360 --> 01:01:31,060 side. 1208 01:01:31,060 --> 01:01:33,976 So the error here is intrinsic to statistical questions. 1209 01:01:38,610 --> 01:01:42,490 So that's my estimator, provided that it exists. 1210 01:01:42,490 --> 01:01:45,040 And I said it's a one to one, so it should exist, 1211 01:01:45,040 --> 01:01:48,520 if I assume that psi is invertible. 1212 01:01:48,520 --> 01:01:51,627 So how good is this guy? 1213 01:01:51,627 --> 01:01:53,460 That's going to be definitely our question-- 1214 01:01:53,460 --> 01:01:54,860 how good is this thing. 1215 01:01:54,860 --> 01:02:00,560 And as I said, there's chances that if psi is really steep, 1216 01:02:00,560 --> 01:02:02,800 then it should be not very good-- 1217 01:02:02,800 --> 01:02:06,140 if psi inverse is very steep, it should not be very good, 1218 01:02:06,140 --> 01:02:07,740 which means that it's-- 1219 01:02:07,740 --> 01:02:11,480 well, let's just leave it to that. 1220 01:02:11,480 --> 01:02:13,010 So that means that I should probably 1221 01:02:13,010 --> 01:02:16,460 see the derivative of psi showing up somewhere. 1222 01:02:16,460 --> 01:02:19,626 If the derivative of psi inverse, say, is very large, 1223 01:02:19,626 --> 01:02:21,500 then I should actually have a larger variance 1224 01:02:21,500 --> 01:02:22,520 in my estimator. 1225 01:02:22,520 --> 01:02:26,900 So hopefully, just like we had a theorem that told us 1226 01:02:26,900 --> 01:02:29,390 that the Fisher information was key in the variance 1227 01:02:29,390 --> 01:02:30,890 of the maximum likelihood estimator, 1228 01:02:30,890 --> 01:02:32,473 we should have a theorem that tells us 1229 01:02:32,473 --> 01:02:33,920 that the derivative of psi inverse 1230 01:02:33,920 --> 01:02:37,313 is going to have a key role in the method of moments. 1231 01:02:37,313 --> 01:02:38,792 So let's do it. 1232 01:02:57,540 --> 01:03:01,950 So I'm going to talk to you about matrices. 1233 01:03:01,950 --> 01:03:02,680 So now, I have-- 1234 01:03:10,150 --> 01:03:15,080 So since I have to manipulate d numbers at any given time, 1235 01:03:15,080 --> 01:03:17,610 I'm just going to concatenate them into a vector. 1236 01:03:17,610 --> 01:03:19,670 So I'm going to call capital M theta-- 1237 01:03:19,670 --> 01:03:24,570 so that's basically the population moment. 1238 01:03:24,570 --> 01:03:31,320 And I have M hat, which is just m hat 1 to m hat d. 1239 01:03:31,320 --> 01:03:32,715 And that's my empirical moment. 1240 01:03:39,100 --> 01:03:41,170 And what's going to play a role is 1241 01:03:41,170 --> 01:03:45,370 what is the variance-covariance of the random vector. 1242 01:03:45,370 --> 01:03:49,680 So I have this vector 1-- 1243 01:03:49,680 --> 01:03:50,440 do I have 1? 1244 01:03:50,440 --> 01:03:51,865 No, I don't have 1. 1245 01:03:59,240 --> 01:04:02,480 So that's a d dimensional vector. 1246 01:04:02,480 --> 01:04:04,940 And here, I take the successive powers. 1247 01:04:04,940 --> 01:04:08,780 Remember, that looks very much like a column of my Vandermonde 1248 01:04:08,780 --> 01:04:10,590 matrix. 1249 01:04:10,590 --> 01:04:12,120 So now, I have this random vector. 
1250 01:04:12,120 --> 01:04:15,570 It's just the successive powers of some random variable X. 1251 01:04:15,570 --> 01:04:19,480 And the variance-covariance matrix is the expectation-- 1252 01:04:19,480 --> 01:04:20,130 so sigma-- 1253 01:04:20,130 --> 01:04:21,695 of theta. 1254 01:04:21,695 --> 01:04:23,820 The theta just means I'm going to take expectations 1255 01:04:23,820 --> 01:04:26,310 with respect to theta. 1256 01:04:26,310 --> 01:04:28,350 That's the expectation with respect 1257 01:04:28,350 --> 01:04:31,316 to theta of this guy times this guy 1258 01:04:31,316 --> 01:04:40,575 transpose minus the same thing but with the expectation 1259 01:04:40,575 --> 01:04:41,075 inside. 1260 01:04:45,270 --> 01:04:46,760 Why do I do X, X1. 1261 01:04:46,760 --> 01:04:48,070 I have X, X2, X3. 1262 01:04:50,720 --> 01:05:04,384 X, X2, Xd times the expectation of X, X2, Xd. 1263 01:05:04,384 --> 01:05:05,550 Everybody sees what this is? 1264 01:05:05,550 --> 01:05:11,790 So this is a matrix where if I look at the ij-th term of this 1265 01:05:11,790 --> 01:05:13,530 matrix-- 1266 01:05:13,530 --> 01:05:20,980 or let's say, jk-th term, so on row j and column k, 1267 01:05:20,980 --> 01:05:26,130 I have sigma jk of theta. 1268 01:05:26,130 --> 01:05:30,960 And it's simply the expectation of X to the j 1269 01:05:30,960 --> 01:05:40,541 plus k-- well, Xj Xk minus expectation of Xj expectation 1270 01:05:40,541 --> 01:05:41,040 of Xk. 1271 01:05:45,170 --> 01:05:53,970 So I can write this as m j plus k of theta minus mj of theta 1272 01:05:53,970 --> 01:05:55,080 times mk of theta. 1273 01:06:00,840 --> 01:06:04,400 So that's my covariance matrix of this particular vector 1274 01:06:04,400 --> 01:06:06,870 that I define. 1275 01:06:06,870 --> 01:06:09,240 And now, I'm going to assume that psi inverse-- 1276 01:06:09,240 --> 01:06:11,070 well, if I want to talk about the slope 1277 01:06:11,070 --> 01:06:14,060 in an analytic fashion, I have to assume 1278 01:06:14,060 --> 01:06:16,250 that psi is differentiable. 1279 01:06:16,250 --> 01:06:18,650 And I will talk about the gradient 1280 01:06:18,650 --> 01:06:20,500 of psi, which is, if it's one dimensional, 1281 01:06:20,500 --> 01:06:22,340 it's just the derivative. 1282 01:06:22,340 --> 01:06:24,470 And here, that's where notation becomes annoying. 1283 01:06:24,470 --> 01:06:26,011 And I'm going to actually just assume 1284 01:06:26,011 --> 01:06:28,310 that so now I have a vector. 1285 01:06:28,310 --> 01:06:30,590 But it's a vector of functions and I 1286 01:06:30,590 --> 01:06:32,840 want to compute those functions at a particular value. 1287 01:06:32,840 --> 01:06:34,506 And the value I'm actually interested in 1288 01:06:34,506 --> 01:06:37,010 is at the m of theta parameter. 1289 01:06:37,010 --> 01:06:41,600 So psi inverse goes from the set of moments 1290 01:06:41,600 --> 01:06:43,710 to the set of parameters. 1291 01:06:43,710 --> 01:06:45,680 So when I look at the gradient of this guy, 1292 01:06:45,680 --> 01:06:48,740 it should be a function that takes as inputs moments. 1293 01:06:48,740 --> 01:06:51,700 And where do I want this function to be evaluated at? 1294 01:06:51,700 --> 01:06:54,352 At the true moment-- 1295 01:06:54,352 --> 01:06:58,100 at the population moment vector. 1296 01:06:58,100 --> 01:07:00,860 Just like when I computed my Fisher information, 1297 01:07:00,860 --> 01:07:04,250 I was computing it at the true parameter. 
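In symbols, restating what was just set up: the entries of the covariance matrix are

$$\sigma_{jk}(\theta) = \mathbb{E}_\theta\big[X^{j+k}\big] - \mathbb{E}_\theta\big[X^{j}\big]\,\mathbb{E}_\theta\big[X^{k}\big] = m_{j+k}(\theta) - m_j(\theta)\,m_k(\theta), \qquad \Sigma(\theta) = \operatorname{Cov}_\theta\!\big((X, X^2, \dots, X^d)^\top\big),$$

and the gradient of psi inverse is evaluated at the population moment vector, $\nabla \psi^{-1}\big(M(\theta)\big)$, a d by d matrix.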
1298 01:07:04,250 --> 01:07:08,400 So now, once I compute this guy-- 1299 01:07:08,400 --> 01:07:11,176 so now, why is this a d by d gradient matrix? 1300 01:07:15,840 --> 01:07:19,920 So I have a gradient vector when I have a function from Rd to R. 1301 01:07:19,920 --> 01:07:22,160 This is the partial derivatives. 1302 01:07:22,160 --> 01:07:25,900 But now, I have a function from Rd to Rd. 1303 01:07:25,900 --> 01:07:28,210 So I have to take the derivative with respect 1304 01:07:28,210 --> 01:07:32,457 to the arrival coordinate and the departure coordinate. 1305 01:07:35,260 --> 01:07:39,140 And so that's the gradient matrix. 1306 01:07:39,140 --> 01:07:41,360 And now, I have the following properties. 1307 01:07:41,360 --> 01:07:46,270 The first one is that the law of large numbers 1308 01:07:46,270 --> 01:07:52,720 tells me that theta hat is a weakly or strongly consistent 1309 01:07:52,720 --> 01:07:54,332 estimator. 1310 01:07:54,332 --> 01:07:56,290 So either I use the strong law of large numbers 1311 01:07:56,290 --> 01:07:57,665 or the weak law of large numbers, 1312 01:07:57,665 --> 01:08:01,300 and I get strong or weak consistency. 1313 01:08:01,300 --> 01:08:02,870 So what does that mean? 1314 01:08:02,870 --> 01:08:03,640 Why is that true? 1315 01:08:03,640 --> 01:08:12,470 Well, because now I really have the function-- 1316 01:08:12,470 --> 01:08:13,930 so what is my estimator? 1317 01:08:13,930 --> 01:08:23,689 Theta hat is psi inverse of m hat 1 to m hat d. 1318 01:08:23,689 --> 01:08:26,630 Now, by the law of large numbers, 1319 01:08:26,630 --> 01:08:28,890 let's look only at the weak one. 1320 01:08:28,890 --> 01:08:35,600 Law of large numbers tells me that each of the mj hat 1321 01:08:35,600 --> 01:08:38,750 is going to converge in probability as n 1322 01:08:38,750 --> 01:08:40,970 goes to infinity to the-- so the empirical moments 1323 01:08:40,970 --> 01:08:44,950 converge to the population moments. 1324 01:08:44,950 --> 01:08:48,189 That's what the good old trick is using, 1325 01:08:48,189 --> 01:08:49,750 the fact that the empirical moments 1326 01:08:49,750 --> 01:08:52,760 are close to the true moments as n becomes larger. 1327 01:08:52,760 --> 01:08:55,390 And that's because, well, just because the m hat j's 1328 01:08:55,390 --> 01:08:57,160 are averages, and the law of large numbers 1329 01:08:57,160 --> 01:08:59,229 works for averages. 1330 01:08:59,229 --> 01:09:04,930 So now, on top of that, if I use the continuous mapping theorem-- 1331 01:09:04,930 --> 01:09:10,700 well, I have that psi inverse is continuously differentiable. 1332 01:09:10,700 --> 01:09:12,279 So it's definitely continuous. 1333 01:09:12,279 --> 01:09:16,510 And so what I have is that psi inverse of m hat 1334 01:09:16,510 --> 01:09:28,740 1 to m hat d converges to psi inverse of m1 to md, which 1335 01:09:28,740 --> 01:09:33,950 is equal to theta star. 1336 01:09:33,950 --> 01:09:35,060 So that's theta star. 1337 01:09:35,060 --> 01:09:37,910 By definition, we assumed that that was the unique one that 1338 01:09:37,910 --> 01:09:40,189 was actually doing this. 1339 01:09:40,189 --> 01:09:43,109 Again, this is a very strong assumption. 1340 01:09:43,109 --> 01:09:46,100 I mean, it's basically saying, if the method of moments works, 1341 01:09:46,100 --> 01:09:47,540 it works. 1342 01:09:47,540 --> 01:09:51,710 So the fact that psi is one to one 1343 01:09:51,710 --> 01:09:55,280 is really the key to making this guy work.
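Here is a minimal numerical sketch of that consistency argument (my own example; the Gaussian model is an arbitrary choice). For N(mu, sigma squared), psi maps (mu, sigma squared) to (m1, m2) = (mu, sigma squared plus mu squared), so psi inverse of (m1, m2) is (m1, m2 minus m1 squared), which is continuous.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma2_star = 2.0, 3.0   # hypothetical true parameters

for n in (100, 10_000, 1_000_000):
    x = rng.normal(mu_star, np.sqrt(sigma2_star), size=n)
    m1_hat, m2_hat = x.mean(), (x ** 2).mean()   # empirical moments (LLN)
    theta_hat = (m1_hat, m2_hat - m1_hat ** 2)   # psi^{-1} of the empirical moments
    print(n, theta_hat)                          # approaches (2.0, 3.0) as n grows
```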
1344 01:09:55,280 --> 01:09:57,200 And then I also have a central limit theorem. 1345 01:09:57,200 --> 01:10:00,140 And the central limit theorem is basically 1346 01:10:00,140 --> 01:10:04,550 telling me that M hat is converging to M even 1347 01:10:04,550 --> 01:10:06,040 in the multivariate sense. 1348 01:10:06,040 --> 01:10:09,410 So if I look at the vector of M hat and the true vector of M, 1349 01:10:09,410 --> 01:10:11,870 then I actually make them go-- 1350 01:10:11,870 --> 01:10:14,570 I look at the difference for scale by square root of n. 1351 01:10:14,570 --> 01:10:15,951 It goes to some Gaussian. 1352 01:10:15,951 --> 01:10:18,200 And usually, we would see-- if it was one dimensional, 1353 01:10:18,200 --> 01:10:19,283 we would see the variance. 1354 01:10:19,283 --> 01:10:22,430 Then we see the variance-covariance matrix. 1355 01:10:22,430 --> 01:10:25,370 Who has never seen the-- well, nobody answers this question. 1356 01:10:25,370 --> 01:10:28,200 Who has already seen the multivariate central limit 1357 01:10:28,200 --> 01:10:28,700 theorem? 1358 01:10:31,339 --> 01:10:33,380 Who was never seen the multivariate central limit 1359 01:10:33,380 --> 01:10:35,410 theorem? 1360 01:10:35,410 --> 01:10:37,630 So the multivariate central limit theorem 1361 01:10:37,630 --> 01:10:41,860 is basically just the slight extension 1362 01:10:41,860 --> 01:10:43,630 of the univariate one. 1363 01:10:43,630 --> 01:10:48,160 It just says that if I want to think-- 1364 01:10:48,160 --> 01:10:51,020 so the univariate one would tell me something like this-- 1365 01:11:05,460 --> 01:11:06,270 and 0. 1366 01:11:06,270 --> 01:11:18,960 And then I would have basically the variance of X to the j-th. 1367 01:11:18,960 --> 01:11:22,240 So that's what the central limit theorem tells me. 1368 01:11:22,240 --> 01:11:23,350 This is an average. 1369 01:11:29,350 --> 01:11:31,150 So this is just for averages. 1370 01:11:31,150 --> 01:11:33,190 The central limit theorem tells me this. 1371 01:11:33,190 --> 01:11:36,560 Just think of X to the j-th as being y. 1372 01:11:36,560 --> 01:11:37,960 And that would be true. 1373 01:11:37,960 --> 01:11:40,092 Everybody agrees with me? 1374 01:11:40,092 --> 01:11:41,550 So now, this is actually telling me 1375 01:11:41,550 --> 01:11:45,610 what's happening for all these guys individually. 1376 01:11:45,610 --> 01:11:48,990 But what happens when those guys start to correlate together? 1377 01:11:48,990 --> 01:11:51,090 I'd like to know if they actually correlate 1378 01:11:51,090 --> 01:11:53,010 the same way asymptotically. 1379 01:11:53,010 --> 01:11:56,760 And so if I actually looked at the covariance matrix 1380 01:11:56,760 --> 01:11:57,465 of this vector-- 1381 01:12:03,440 --> 01:12:07,470 so now, I need to look at a matrix which is d by d-- 1382 01:12:07,470 --> 01:12:10,170 then would those univariate central limit theorems 1383 01:12:10,170 --> 01:12:12,896 tell me-- 1384 01:12:12,896 --> 01:12:16,890 so let me right like this, double bar. 1385 01:12:16,890 --> 01:12:19,560 So that's just the covariance matrix. 1386 01:12:19,560 --> 01:12:23,050 This notation, V double bar is the variance-covariance matrix. 1387 01:12:23,050 --> 01:12:26,010 So what this thing tells me-- so I know this thing 1388 01:12:26,010 --> 01:12:30,117 is a matrix, d by d. 1389 01:12:30,117 --> 01:12:31,950 Those univariate central limit theorems only 1390 01:12:31,950 --> 01:12:36,150 give me information about the diagonal terms. 
1391 01:12:36,150 --> 01:12:40,860 But here, I have no idea where the covariance matrix is. 1392 01:12:40,860 --> 01:12:46,020 This guy is telling me, for example, that this thing is 1393 01:12:46,020 --> 01:12:49,520 like variance of X to the j-th. 1394 01:12:49,520 --> 01:12:51,860 But what if I want to find off-diagonal elements 1395 01:12:51,860 --> 01:12:53,130 of this matrix? 1396 01:12:53,130 --> 01:12:55,130 Well, I need to use a multivariate central limit 1397 01:12:55,130 --> 01:12:56,150 theorem. 1398 01:12:56,150 --> 01:12:58,670 And really what it's telling me is that you can actually 1399 01:12:58,670 --> 01:13:00,200 replace this guy here-- 1400 01:13:10,450 --> 01:13:14,500 so that goes in distribution to some normal mean 0, again. 1401 01:13:14,500 --> 01:13:17,080 And now, what I have is just sigma 1402 01:13:17,080 --> 01:13:22,000 of theta, which is just the covariance matrix 1403 01:13:22,000 --> 01:13:26,696 of this vector X, X2, X3, X4, all the way to Xd. 1404 01:13:26,696 --> 01:13:27,514 And that's it. 1405 01:13:27,514 --> 01:13:28,930 So that's a multivariate Gaussian. 1406 01:13:28,930 --> 01:13:33,040 Who has never seen a multivariate Gaussian? 1407 01:13:33,040 --> 01:13:35,974 Please, just go on Wikipedia or something. 1408 01:13:35,974 --> 01:13:37,390 There's not much to know about it. 1409 01:13:37,390 --> 01:13:40,270 But I don't have time to redo probability here. 1410 01:13:40,270 --> 01:13:43,350 So we're going to have to live with it. 1411 01:13:43,350 --> 01:13:46,230 Now, to be fair, if your goal is not 1412 01:13:46,230 --> 01:13:48,970 to become a statistical savant, we 1413 01:13:48,970 --> 01:13:52,490 will stick to univariate questions 1414 01:13:52,490 --> 01:14:01,260 in the scope of homework and exams. 1415 01:14:01,260 --> 01:14:09,900 So now, what was the delta method telling me? 1416 01:14:09,900 --> 01:14:13,440 It was telling me that if I had a central limit theorem that 1417 01:14:13,440 --> 01:14:16,112 told me that theta hat was going to theta, 1418 01:14:16,112 --> 01:14:17,820 or square root of n theta hat minus theta 1419 01:14:17,820 --> 01:14:19,530 was going to some Gaussian, then I 1420 01:14:19,530 --> 01:14:23,220 could look at square root of Mg of theta hat minus g of theta. 1421 01:14:23,220 --> 01:14:25,110 And this thing was also going to a Gaussian. 1422 01:14:25,110 --> 01:14:27,030 But what it had to be is the square 1423 01:14:27,030 --> 01:14:32,700 of the derivative of g in the variance. 1424 01:14:32,700 --> 01:14:35,190 So the delta method, it was just a way 1425 01:14:35,190 --> 01:14:38,280 to go from square root of n theta 1426 01:14:38,280 --> 01:14:46,810 hat n minus theta goes to some N, say 0, sigma squared, to-- 1427 01:14:46,810 --> 01:14:50,410 so delta method was telling me that this was square root 1428 01:14:50,410 --> 01:14:56,030 Ng of theta hat N minus g of theta 1429 01:14:56,030 --> 01:15:01,770 was going in distribution to N0 sigma squared 1430 01:15:01,770 --> 01:15:04,200 g prime squared of theta. 1431 01:15:07,210 --> 01:15:09,130 That was the delta method. 1432 01:15:09,130 --> 01:15:12,700 Now, here, we have a function of those guys. 1433 01:15:12,700 --> 01:15:15,580 The central limit theorem, even the multivariate one, 1434 01:15:15,580 --> 01:15:20,180 is only guaranteeing something for me regarding the moments. 1435 01:15:20,180 --> 01:15:23,350 But now, I need to map the moments back into some theta, 1436 01:15:23,350 --> 01:15:26,230 so I have a function of the moments. 
1437 01:15:26,230 --> 01:15:31,950 And there is something called the multivariate delta 1438 01:15:31,950 --> 01:15:35,310 method, where derivatives are replaced by gradients. 1439 01:15:35,310 --> 01:15:39,310 Like, they always are in multivariate calculus. 1440 01:15:39,310 --> 01:15:43,080 And rather than multiplying, since things do not commute, 1441 01:15:43,080 --> 01:15:46,557 rather than choosing which side I want to put the square, 1442 01:15:46,557 --> 01:15:49,140 I'm actually just going to take half of the square on one side 1443 01:15:49,140 --> 01:15:51,810 and the other half of the square on the other side. 1444 01:15:51,810 --> 01:15:53,790 So the way you should view this, you 1445 01:15:53,790 --> 01:15:59,160 should think of sigma squared times g prime squared 1446 01:15:59,160 --> 01:16:02,490 as being g prime of theta times sigma 1447 01:16:02,490 --> 01:16:06,040 squared times g prime of theta. 1448 01:16:06,040 --> 01:16:08,640 And now, this is completely symmetric. 1449 01:16:08,640 --> 01:16:14,850 And the multivariate delta method 1450 01:16:14,850 --> 01:16:20,010 is basically telling you that you get the gradient here. 1451 01:16:20,010 --> 01:16:21,480 So you start from something that's 1452 01:16:21,480 --> 01:16:24,100 like that over there, a sigma-- 1453 01:16:24,100 --> 01:16:26,280 so that's my sigma squared, think of sigma squared. 1454 01:16:26,280 --> 01:16:29,040 And then I premultiply by the gradient and postmultiply 1455 01:16:29,040 --> 01:16:30,514 by the gradient. 1456 01:16:30,514 --> 01:16:31,680 The first one is transposed. 1457 01:16:31,680 --> 01:16:33,620 The second one is not. 1458 01:16:33,620 --> 01:16:36,140 But that's a very straightforward extension. 1459 01:16:36,140 --> 01:16:37,890 You don't even have to understand it. 1460 01:16:37,890 --> 01:16:41,780 Just think of what would be the natural generalization. 1461 01:16:41,780 --> 01:16:44,450 Here, by the way, I wrote explicitly 1462 01:16:44,450 --> 01:16:48,020 what the gradient of a multivariate function is. 1463 01:16:48,020 --> 01:16:53,930 So that's a function that goes from Rd to Rk. 1464 01:16:53,930 --> 01:16:56,050 So now, the gradient is a d by k matrix. 1465 01:16:58,920 --> 01:17:00,504 And so now, for this guy, we can do it 1466 01:17:00,504 --> 01:17:01,586 for the method of moments. 1467 01:17:01,586 --> 01:17:03,140 And we can see that basically we're 1468 01:17:03,140 --> 01:17:04,765 going to have this scaling that depends 1469 01:17:04,765 --> 01:17:08,300 on the gradient of the reciprocal of psi, which 1470 01:17:08,300 --> 01:17:08,810 is normal. 1471 01:17:08,810 --> 01:17:13,137 Because if psi is super steep, if psi inverse is super steep, 1472 01:17:13,137 --> 01:17:14,720 then the gradient is going to be huge, 1473 01:17:14,720 --> 01:17:17,120 which is going to translate into having a huge variance 1474 01:17:17,120 --> 01:17:18,203 for the method of moments. 1475 01:17:21,180 --> 01:17:24,127 So this is actually the end. 1476 01:17:24,127 --> 01:17:26,460 I would like to encourage you-- and we'll probably do it 1477 01:17:26,460 --> 01:17:27,550 on Thursday just to start. 1478 01:17:27,550 --> 01:17:30,480 But I encourage you to do it in one dimension, 1479 01:17:30,480 --> 01:17:35,070 so that you know how to use the method of moments, 1480 01:17:35,070 --> 01:17:37,140 you know how to do a bunch of things. 1481 01:17:37,140 --> 01:17:40,470 Do it in one dimension and see how you can check those things.
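One possible one-dimensional version of that exercise (the exponential model is my choice, not the lecture's): for $X_1, \dots, X_n$ i.i.d. $\mathrm{Exp}(\lambda)$, the first moment is $m_1(\lambda) = 1/\lambda$, so $\psi^{-1}(m) = 1/m$ and $\hat\lambda^{MM} = 1/\bar X_n$. The CLT gives $\sqrt{n}\,(\bar X_n - 1/\lambda) \to \mathcal N(0, 1/\lambda^2)$, and the delta method with $g(m) = 1/m$, $g'(1/\lambda) = -\lambda^2$ yields

$$\sqrt{n}\,\big(\hat\lambda^{MM} - \lambda\big) \xrightarrow{(d)} \mathcal N\!\big(0,\ \lambda^4 \cdot \tfrac{1}{\lambda^2}\big) = \mathcal N(0, \lambda^2).$$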
1482 01:17:40,470 --> 01:17:43,860 So just as a quick comparison, in terms of the quadratic risk, 1483 01:17:43,860 --> 01:17:46,050 the maximum likelihood estimator is typically 1484 01:17:46,050 --> 01:17:50,024 more accurate than the method of moments. 1485 01:17:50,024 --> 01:17:51,440 What is pretty good to do, when 1486 01:17:51,440 --> 01:17:54,530 you have a non-concave likelihood 1487 01:17:54,530 --> 01:17:56,060 function, is 1488 01:17:56,060 --> 01:17:58,980 to start with the method of moments as an initialization 1489 01:17:58,980 --> 01:18:01,680 and then run some algorithm that optimizes locally 1490 01:18:01,680 --> 01:18:03,710 the likelihood starting from this point, 1491 01:18:03,710 --> 01:18:05,985 because it's actually likely to be closer. 1492 01:18:05,985 --> 01:18:07,610 And then the MLE is going to improve it 1493 01:18:07,610 --> 01:18:12,010 a little bit by pushing the likelihood a little better. 1494 01:18:12,010 --> 01:18:13,840 So of course, the maximum likelihood 1495 01:18:13,840 --> 01:18:14,890 is sometimes intractable. 1496 01:18:14,890 --> 01:18:18,440 Whereas computing moments is fairly doable. 1497 01:18:18,440 --> 01:18:20,262 If the likelihood is concave, as I said, 1498 01:18:20,262 --> 01:18:21,720 we can use optimization algorithms, 1499 01:18:21,720 --> 01:18:24,020 such as interior-point methods or gradient descent, 1500 01:18:24,020 --> 01:18:25,154 I guess, to maximize it. 1501 01:18:25,154 --> 01:18:26,695 And if the likelihood is non-concave, 1502 01:18:26,695 --> 01:18:28,240 we only have local heuristics. 1503 01:18:28,240 --> 01:18:29,920 And that's what I meant-- 1504 01:18:29,920 --> 01:18:31,440 you have only local maxima. 1505 01:18:31,440 --> 01:18:32,860 And one trick you can do-- 1506 01:18:32,860 --> 01:18:37,880 so your likelihood looks like this, 1507 01:18:37,880 --> 01:18:42,140 and it might be the case that if you have a lot of those peaks, 1508 01:18:42,140 --> 01:18:44,810 you basically have to start your algorithm in each 1509 01:18:44,810 --> 01:18:46,270 of those peaks. 1510 01:18:46,270 --> 01:18:48,530 But the method of moments can actually 1511 01:18:48,530 --> 01:18:50,510 start you in the right peak, and then you 1512 01:18:50,510 --> 01:18:53,300 just move up by doing some local algorithm 1513 01:18:53,300 --> 01:18:55,040 for maximum likelihood. 1514 01:18:55,040 --> 01:18:56,180 So that's not key. 1515 01:18:56,180 --> 01:18:59,330 But that's just if you want to think about algorithmically 1516 01:18:59,330 --> 01:19:03,470 how I would end up doing this and how I can combine the two. 1517 01:19:03,470 --> 01:19:04,970 So I'll see you on Thursday. 1518 01:19:04,970 --> 01:19:06,820 Thank you.
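A rough sketch of that initialize-with-moments strategy, for reference (everything here is my own illustration, not from the lecture; the Gamma model and the scipy calls are just one way to set it up):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Gamma with unknown shape alpha and scale fixed to 1: E[X] = alpha,
# so the method-of-moments estimate is simply the sample mean.
rng = np.random.default_rng(1)
x = gamma.rvs(a=4.0, size=500, random_state=rng)

alpha_mm = x.mean()                       # method-of-moments initializer

def neg_log_lik(a):
    # negative log-likelihood of the sample under Gamma(shape=a, scale=1)
    return -gamma.logpdf(x, a=a).sum()

# Local optimization of the likelihood, started at the moment estimate.
res = minimize(neg_log_lik, x0=alpha_mm, bounds=[(1e-6, None)])
print(alpha_mm, res.x[0])                 # moment start, locally refined MLE
```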