The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: So I apologize, my voice is not at 100%. If you don't understand what I'm saying, please ask me.

So we're going to be analyzing--actually, not really analyzing. Last time we described a second-order method to optimize the log-likelihood in a generalized linear model, where the parameter of interest is beta. So I'm going to rewrite the whole thing in terms of beta. That's the equation you see, but we really have this beta: at iteration k plus 1, beta is given by

beta^(k+1) = beta^(k) + I(beta^(k))^{-1} grad l_n(beta^(k)).

And that's a plus sign: if you think of the Fisher information at beta^(k) as being some number, and you had to say whether it's a positive or a negative number, it's going to be a positive number, because it's a positive semi-definite matrix. So since we're doing gradient ascent, we have a plus sign here, and the direction is basically the gradient of l_n at beta^(k).

So these are the iterations that we're trying to implement. And we could just do this: at each iteration, we compute the Fisher information, and then we do it again and again. That's called the Fisher-scoring algorithm, and I told you that this converges. What we're going to try to do in this lecture is show how we can re-implement this using iteratively reweighted least squares, so that each step of the algorithm consists simply of solving a weighted least squares problem.
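To make this concrete, here is a minimal numpy sketch of the iteration just described. The helper names `grad_loglik` and `fisher_info` are placeholders for routines computing grad l_n and I(beta); nothing here is specific to a particular GLM yet.

```python
import numpy as np

def fisher_scoring(beta0, grad_loglik, fisher_info, n_iter=25):
    """Fisher-scoring iterations: beta <- beta + I(beta)^{-1} grad l_n(beta)."""
    beta = beta0.copy()
    for _ in range(n_iter):
        # Solve I(beta) u = grad rather than inverting the matrix explicitly.
        u = np.linalg.solve(fisher_info(beta), grad_loglik(beta))
        beta = beta + u  # plus sign: gradient *ascent* on the log-likelihood
    return beta
```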
So let's go back quickly and remind ourselves that we are in the Gaussian--sorry, in the exponential family. So if I look at the log-likelihood l_n for the n observations, this is the sum from i equal 1 to n of y_i times theta_i, minus b of theta_i, divided by the dispersion parameter phi, plus c of (y_i, phi):

l_n = sum_{i=1}^n [ (y_i theta_i - b(theta_i)) / phi + c(y_i, phi) ].

The exponential just went away when I took the log of the likelihood, and I have n observations, so I'm summing over all n observations.

Then we had a bunch of formulas that we came up with. If I look at the expectation of y_i--so that's really the conditional expectation of y_i given x_i, but here it really doesn't matter, it's just going to be different for each i--this is denoted by mu_i, and we showed that it equals b'(theta_i). Then the other equation we found is what we want to model: we want g(mu_i) to be equal to x_i transpose beta. So that's our model.

And then we had that the variance was also given by the second derivative, b''(theta_i); I'm not going to go into it. What's actually interesting is, if we want to express theta_i as a function of x_i, what we get: going from x_i transpose beta to mu_i by g inverse, and then to theta_i by (b') inverse, we get that

theta_i = h(x_i transpose beta), where h = (b')^{-1} composed with g^{-1}

--so which order is this? First the inverse of g, and then composed with (b') inverse. So remember, those are all computations that we made last time, but they're going to be useful in our derivation.

And the first thing we did last time was to show that, if I look at the derivative of the log-likelihood with respect to one coordinate of beta--which is going to give me the gradient if I do that for all the coordinates--we can rewrite it as a sum of terms y_i tilde minus mu_i tilde. So let's remind ourselves what those are: y tilde_i is y_i--is it times or divided?--times g'(mu_i).
And mu tilde_i is mu_i times g'(mu_i). That was just an artificial step, so that we could divide the weights by g'. But the real thing that builds the weights is this h', and there's this normalization factor phi. So if I also write

w_i = h'(x_i transpose beta) / (g'(mu_i) phi),

then I can rewrite my gradient, which is a vector, in the following matrix form:

grad l_n(beta) = X transpose W (y tilde - mu tilde),

where W is the diagonal matrix with w_1, w_2, all the way to w_n on the diagonal and 0 off the diagonal.

So that was just taking the derivative and doing a slight manipulation that said: well, let's just divide whatever is here by g' and multiply whatever is there by g'. Today we'll see why we make this division and multiplication by g', which seems to make no sense--it actually comes from the Hessian computation.

So the Hessian computation is going to be a little more annoying. Actually, let me start directly with the coordinate-wise derivative. To build this gradient, what we used, in the end, was that the partial derivative of l_n with respect to the j-th coordinate of beta is

partial l_n / partial beta_j = sum over i of (y_i tilde - mu_i tilde) w_i x_ij.

So now, let's just take another derivative, and that's going to give us the entries of the Hessian. So what I want to compute is the second derivative with respect to beta_j and beta_k. Here, I already took the derivative with respect to beta_j, so this is just the derivative with respect to beta_k of the derivative with respect to beta_j.
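Here is what the gradient in this matrix form might look like in code--a sketch, assuming vectorized callables `g_inv`, `g_prime`, and `h_prime` standing in for g^{-1}, g', and h' (hypothetical names, to be supplied per family and link):

```python
import numpy as np

def glm_gradient(X, y, beta, g_inv, g_prime, h_prime, phi=1.0):
    """Gradient of the GLM log-likelihood, written as X^T W (y~ - mu~)."""
    eta = X @ beta                            # linear predictor x_i^T beta
    mu = g_inv(eta)                           # mu_i = g^{-1}(x_i^T beta)
    y_tilde = y * g_prime(mu)                 # y_i g'(mu_i)
    mu_tilde = mu * g_prime(mu)               # mu_i g'(mu_i)
    w = h_prime(eta) / (g_prime(mu) * phi)    # diagonal of W
    return X.T @ (w * (y_tilde - mu_tilde))   # X^T W (y~ - mu~)
```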
So what I need to do is take the derivative of this guy with respect to beta_k. Where does beta_k show up here?

AUDIENCE: In the y's?

PHILIPPE RIGOLLET: No, it's not in the y's. The y's are my data, right? But I mean, it is in the y tildes, because it's in mu: mu depends on beta. Mu_i is g inverse of x_i transpose beta. And it's also in the w_i's. Actually, everything that you see--well, OK, w depends on mu and on beta explicitly, but the rest depends only on mu.

And so we might want to be a little careful--well, we can actually use the chain rule. Did I use the chain rule already? Yeah, it's here. But OK, let's go for it.

Oh, sorry, I should not write it like that. Right, I make my life miserable by multiplying and dividing by this g'(mu). I should not do this here. So I'm actually going to remove the g'(mu), because it makes something that depends on beta appear when it really should not. So let's just look at the last-but-one equality, the one over there:

sum over i of (y_i - mu_i)/phi times h'(x_i transpose beta) x_ij.

OK, so here it makes my life much simpler, because y_i does not depend on beta, but this factor depends on beta, and this factor depends on beta. So when I take the derivative, I'm going to have to be a little more careful now, but I just have the derivative of a product, nothing more complicated.

So this is what? Well, the sum is linear, so it's going to come out. Then I'm going to have to take the derivative of the first factor: that's just going to be 1 over phi, times the derivative of mu_i with respect to beta_k--which I will just write like this for now--times h'(x_i transpose beta) x_ij.
And then I'm going to have the other term, which is (y_i - mu_i) over phi times the second derivative, h''(x_i transpose beta), times the derivative of x_i transpose beta with respect to beta_k, which is just x_ik. So I have x_ij times x_ik.

OK. So I still need to compute this guy: the partial derivative of mu_i with respect to beta_k. So mu is g of--sorry, it's g inverse of x_i transpose beta. So what do I get? Well, I'm going to get, definitely, the second derivative of g... Well, OK, that's actually not a bad idea. What makes my life easier, actually? Give me one second. Well, there's no choice that makes my life so much easier, so let's just write it. Let's go with this guy: it's going to be g''(x_i transpose beta) times x_ik.

So now, what do I have if I collect my terms? I have that this whole second derivative is: well, a sum from i equal 1 to n; then terms that I can factor out, right? Both of these guys have x_ij, and this guy pulls out an x_ik--and it's also here, x_ij times x_ik--so everybody here has x_ij x_ik. The 1 over phi I can actually pull out in front. And I'm left with the second derivative of g times the first derivative of h, both taken at x_i transpose beta; and then I have this (y_i - mu_i) times the second derivative of h, taken at x_i transpose beta.

OK. But here, I'm looking at Fisher scoring. I'm not looking at Newton's method, which means that I can actually take the expectation of the second derivative.
So when I start taking the expectation, what's going to happen? When I say expectation, it's always conditionally on the x_i's, so let's write it: conditional on x_1, ..., x_n. This first factor is just deterministic. But what is the conditional expectation of (y_i - mu_i) times this guy, conditionally on x_i? Zero, right? Because this is just the conditional expectation of y_i minus mu_i, and everything else depends on x_i only, so I can pull it out of the conditional expectation. So I'm left only with the other term--sorry, and I have the x_ij x_ik factor.

OK. So now, I want to go to something that's slightly more convenient for me. So maybe we can skip that part on the slide, because this is not going to be convenient for me anyway. I just want to go back to something that eventually looks like the weights we had. So I need to have my x_i's show up with some weight somehow, and the weight should involve h' divided by g'. Again, the reason I want to see g' coming back is that I had g' coming in the original w_i--this is actually the same definition as the w_i that I used when I was computing the gradient; those are exactly these w's. So I need to have a g' that shows up, and that's where I'm going to have to make a little bit of computation here. And it's coming from this kind of consideration.

So this thing here--well, actually, I'm missing the phi over there, right? There should be a phi here. OK. So we have exactly this thing, and this tells me the form of the Hessian: this was entry-wise, and x_ij x_ik is exactly the (j,k)-th entry of x_i x_i transpose, right? We've used that before.
So if I want to write this in matrix form, this is just going to be a sum of something that depends on i times x_i x_i transpose:

(1/phi) sum from i equal 1 to n of g''(x_i transpose beta) h'(x_i transpose beta) x_i x_i transpose.

OK? And that's for the entire matrix; before, that was just the (j,k)-th entry of this matrix. And you can just check that, if I take this matrix, the (j,k)-th entry is just the product of the j-th coordinate and the k-th coordinate of x_i.

All right. So now I need to do my rewriting. Can I write this? So I'm missing something here, right? ... Oh, I know where it's coming from: mu is not g' of x beta, mu is g inverse of x beta, right? So the derivative of mu is not g''. It's--no, 1 over this, right? Yeah. The derivative of g inverse is 1 over g' of g inverse. I need you guys, OK?

All right. So now, I'm going to have to rewrite this. This phi is still going to be there, it doesn't matter, but now this factor is becoming h' over g'(g^{-1}(x_i transpose beta))--which is the same here, which is the same here. Everybody approves?

All right. Well, now it's actually much nicer. What is g inverse of x_i transpose beta? Well, that was exactly the mistake that I just made, right? It's mu_i itself. So this denominator is really g'(mu_i). So now, I have something which looks like

sum from i equal 1 to n of h'(x_i transpose beta) / (g'(mu_i) phi) times x_i x_i transpose,

which I can certainly write in matrix form as X transpose W X, where W is exactly the same as before: the diagonal matrix of w_1, ..., w_n, with w_i = h'(x_i transpose beta) divided by g'(mu_i) times phi, which is the same as what we had here.
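In code, this expected Hessian is one line once the weights are in hand--again a sketch with the same hypothetical callables as above (and note the sign: up to the minus sign sorted out just below, this is the Fisher information X^T W X):

```python
import numpy as np

def glm_fisher_info(X, beta, g_inv, g_prime, h_prime, phi=1.0):
    """Fisher information X^T W X, with the same weights w_i as the gradient."""
    eta = X @ beta
    w = h_prime(eta) / (g_prime(g_inv(eta)) * phi)
    return X.T @ (w[:, None] * X)  # X^T W X without materializing diag(w)
```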
And it's supposed to be the same as what we have on the slide, except the phi there is in white--that's why it's not there. OK. So it's actually simpler than what's on the slides, I guess.

All right. So now, if you pay attention: I actually never forced this g'(mu_i) to be here. Actually, I even tried to make a mistake so as not to have it. And so this g'(mu_i) shows up completely naturally. If I had started with this computation, you would have never questioned why I actually multiplied by g' and divided by g' completely artificially over there; it just shows up naturally in the weights. But it's just more natural for me to compute the first derivative first and the second derivative second, OK? And so we just did it the other way around. But now, let's assume we forgot about everything. We have this, and X transpose W X is the natural way of writing it. If I want the gradient to involve the same weights, I have to force them in by dividing by g'(mu_i), and therefore multiplying y_i and mu_i by g'(mu_i). OK?

So now, if we recap what we've actually found--let me write it here--we also have that the expectation of the Hessian, E[H l_n(beta)], is X transpose W X. So if I go back to my iterations over there, I should update beta^(k+1) to be equal to beta^(k) plus the inverse--so that's actually equal to negative I(beta^(k))--well, yeah, that's negative I(beta), I guess.

Oh, and beta here shows up in W, right? W depends on beta, so that's going to be evaluated at beta^(k). So let me call it W^(k): that's the diagonal matrix of h'(x_i transpose beta^(k)) divided by g'(mu_i^(k)) times phi. So this beta^(k) induces a mu^(k), by mu_i^(k) = g inverse of x_i transpose beta^(k)--that's an iteration. And so now, if I actually write these things together, I get minus X transpose W^(k) X, inverse--
--so that's W^(k). And then I have my gradient here, which I have to evaluate at iteration k, which is X transpose W^(k) times (y tilde^(k) minus mu tilde^(k)), where, again, the superscripts k are pretty natural: y tilde_i^(k) is just y_i times g'(mu_i^(k)), and mu tilde_i^(k) is mu_i^(k) times g'(mu_i^(k)). So I just add superscripts k to everything, so I know that those things get updated in real time, right? Every time I make one iteration, I get a new value for beta, I get a new value for mu, and therefore I get a new value for W. Yes?

AUDIENCE: [INAUDIBLE] the Fisher equation [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, that's a good point. So that's definitely a plus, because this is a positive semi-definite matrix. So this is a plus. And, well, that's probably where I erased it. Let's see where I made my mistake. So there should be a minus here. There should be a minus here. There should be a minus even at the beginning, I believe.

So you see, when we go back to the first derivative, what I erased was basically this y_i minus mu_i. The derivative of the second term was actually killed, because we took the expectation of this guy--the conditional expectation of y_i minus mu_i given x_i is 0. But when we took the derivative of the first term, which is the only one that stayed, this y_i minus mu_i went away, and there was a negative sign from it, because it's the minus mu_i we took the derivative of. So really, when I take my second derivative, I should carry the minus signs everywhere.
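With the sign put right, one Fisher-scoring step for the GLM can be sketched by combining the two helpers above (hypothetical names carried over from the earlier sketches):

```python
import numpy as np

def scoring_step(X, y, beta, g_inv, g_prime, h_prime, phi=1.0):
    """One step: beta^{k+1} = beta^k + (X^T W^k X)^{-1} X^T W^k (y~^k - mu~^k)."""
    grad = glm_gradient(X, y, beta, g_inv, g_prime, h_prime, phi)
    info = glm_fisher_info(X, beta, g_inv, g_prime, h_prime, phi)
    return beta + np.linalg.solve(info, grad)  # plus sign: ascent direction
```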
OK? So it's just that I forgot this minus throughout. The term with y_i minus mu_i went away, because the conditional expectation of y_i given x_i is mu_i, and then there was this minus sign in front of everything, and I forgot it.

All right. Any other mistake that I made? We're good? All right.

So now, this is what we have: beta^(k+1) is equal to beta^(k) plus this thing. OK? And if you look at this thing, it sort of reminds us of something. Remember the least squares estimator. So here, I'm going to actually deviate slightly from the slides, and I will tell you how. The slides take the beta^(k) and put it in here, which is one way to go--you just think of the whole update as one big least squares solution. Or you can keep the beta^(k), solve another least squares for the increment, and then add it to the beta^(k) that you have. It's the same thing. So I will take the second route, so you have the two options, all right?

OK. So when we did least squares--parenthesis: least squares--we had y = X beta + epsilon. And our estimator beta hat was (X transpose X)^{-1} X transpose y, right? And that was just solving the first-order condition, and that's what we found. Now look at this: X transpose, something, X, inverse; X transpose, something, something. OK? So this looks like the left board if W^(k) is equal to the identity matrix, meaning we don't see it, and y is equal to y tilde^(k) minus mu tilde^(k).

So the fact that the response variable is different is really not a problem. We just have to pretend that the response is equal to y tilde minus mu tilde. I mean, that's just least squares. When you call software that does least squares for you, you just tell it what y is, you tell it what X is, and it makes the computation. So you would just lie to it and say: the actual y I want is this thing.
And then we need to somehow incorporate those weights. And so the question is: is that easy to do? And the answer is yes, because there is a setup where this would actually arise. So one of the things that's very specific to what we did with least squares: when we did the inference, at least, we assumed that epsilon was normal with mean 0 and covariance matrix the identity, right? What if the covariance matrix is not the identity? If the covariance matrix is not the identity, then your maximum likelihood estimator is not exactly least squares. If the covariance matrix is any matrix, you have another solution, which involves the inverse of the covariance matrix that you have. But if your covariance matrix, in particular, is diagonal--which would mean that each observation that you get in this system of equations is still independent, but the variances can change from one observation to another--then it's called heteroscedastic. "Hetero" means "not the same"; "scedastic" means "scale." In the heteroscedastic case, you would have something slightly different. And it makes sense that, if you know that some observations have much less variance than others, you might want to give them more weight. OK?

So think about your usual drawing: maybe you have a cloud of points like this, and then you have this other group of points as well, just a few over here. If you start fitting this thing with least squares, you're going to see a line that goes through all of those points. But now, if I tell you that, on this side, the variance is equal to 100--meaning that those points are actually really far from the true line--and here on this side, the variance is equal to 1--meaning that those points are actually close to the line you're looking for--then the line you should be fitting is probably the one through the low-variance points. Meaning: do not trust the guys that have a lot of variance.
And so you need somehow to incorporate that. If you know that those points have much more variance than these guys, you want to weight them accordingly. And the way you do it is by using weighted least squares. OK. So we're going to open a parenthesis on weighted least squares. It's not a fundamental statistical question, but it's useful for us, because this is exactly what's going to spit out something that looks like what we have, with this matrix W in there.

OK. So let's go back in time for a second and assume we're still covering least squares regression. So now, I'm going to assume that y = X beta + epsilon, but this time, epsilon is a multivariate Gaussian in, say, p dimensions, with mean 0. And the covariance matrix I will write as W inverse, because W is going to be the one that shows up. OK? So this is the so-called heteroscedastic model. That's how it's spelled--and yet another name that you can pick for your soccer team or a cappella group.

All right. So let's actually compute the maximum likelihood estimator for this problem. So the log-likelihood is what? Well, OK: what is the density of a multivariate Gaussian in p dimensions with mean X beta and covariance matrix W inverse? Well, it's of the form

1 / ( det(W^{-1}) (2 pi)^p )^{1/2} times exp( -(x - X beta) transpose W (x - X beta) / 2 ),

where W is the inverse of W inverse. OK? So this is the usual (x - mu) transpose Sigma inverse (x - mu), divided by 2. And if you want a sanity check, just assume that Sigma--yeah?

AUDIENCE: Is it x minus X beta or y?

PHILIPPE RIGOLLET: Well, you know, if you want this to be y, then this is y, right? Sure. Yeah, maybe it's less confusing.
So do the sanity check with p equal to 1. What does it mean? It means that you have this mean here--let's forget about what it is--but this W inverse is going to be just sigma squared, right? So what you see in the exponent is the inverse of sigma squared: that's the over-2-sigma-squared, like we usually see it. The determinant of W inverse is just the product of the entries of the 1-by-1 matrix, which is just sigma squared. So taking the square root of this, because p is equal to 1, I get sigma square root 2 pi, which is the normalization that I expect.

This normalization is not going to matter, because, when I look at the log-likelihood as a function of beta--so I'm assuming that W is known--what I get is something which is a constant: minus one half times log of det(2 pi W^{-1}). OK? So this is just going to be a constant; it won't matter when I do the maximum likelihood. And then I'm going to have what? I'm going to have minus 1/2 of (y - X beta) transpose W (y - X beta).

So if I want to take the maximum of this guy, I'm going to have to take the minimum of this thing: we need to compute the minimum over beta of (y - X beta) transpose W (y - X beta). And the solution that you get--I mean, you can actually check this for yourself. The way you can see it, if you're lazy and you don't want to redo the entire thing, is by doing the following--maybe I should keep that guy.

W is diagonal, right?
So I'm going to assume that W inverse is diagonal, and I'm going to assume that no variance is equal to 0 and no variance is equal to infinity, so that both W inverse and W have only positive entries on the diagonal. All right? So in particular, I can talk about the square root of W, which is just the diagonal matrix with the square roots on the diagonal. OK?

And so I want to minimize over beta: (y - X beta) transpose W (y - X beta). So I'm going to write W as square root of W times square root of W, which I can, because W--and it's just the simplest thing, right? If W is diag(w_1, ..., w_n), then the square root of W is just diag(square root of w_1, ..., square root of w_n), with 0 elsewhere. OK? So the product of those two matrices definitely gives me back what I want, and that's the usual matrix product.

Now, what I'm going to do is push one onto one side and push the other one onto the other side. So that gives me that this is really the minimum over beta of--well, here I have this transposed, so I have to put it on the other side; W is clearly symmetric and so is square root of W, so the transpose doesn't matter. And so what I'm left with is

( sqrt(W) y - sqrt(W) X beta ) transpose ( sqrt(W) y - sqrt(W) X beta ).

OK, and that stops here. But this is the same thing that we've been doing before. This is a new y--let's call it y prime. This is a new X--let's call it X prime. And now, this is just the least squares estimator associated to a response y prime and a design matrix X prime. So I know that the solution is

(X prime transpose X prime)^{-1} X prime transpose y prime.

And now, I'm just going to substitute again what my X prime is in terms of X and what my y prime is in terms of y.
And that gives me exactly (X transpose sqrt(W) sqrt(W) X) inverse. And then I have X transpose sqrt(W) for this guy, and then I have sqrt(W) y for that guy. And that's exactly what I wanted: I'm left with

(X transpose W X)^{-1} X transpose W y.

OK? So that's a simple way to take into account the W that we had before. And you could actually do it with any covariance structure that's positive semi-definite, because you can actually talk about the square root of those matrices too: the square root of a matrix is just a matrix such that, when you multiply it by itself, it gives you back the original matrix. OK?

So here, that was just a shortcut that consisted in saying: OK, maybe I don't want to recompute the gradient of this quantity, set it equal to 0, and see what beta hat should be. Instead, I am going to assume that I already know that, if I did not have the W, I would know how to solve it. And that's exactly what I did. I said: well, I know that this is the minimum of something that looks like least squares when I have the primes, and then I just substitute my W back in there. All right, so that's just the lazy computation. But again, if you don't like it, you can always take the gradient of this guy.
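In code, the same sqrt(W) trick reduces weighted least squares to a single call to an ordinary least squares routine--a minimal sketch, assuming the diagonal weights are given as a vector w:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve min_beta (y - X beta)^T W (y - X beta) via the sqrt(W) reduction."""
    sw = np.sqrt(w)                  # diagonal of W^{1/2}
    Xp = sw[:, None] * X             # X' = W^{1/2} X
    yp = sw * y                      # y' = W^{1/2} y
    # Ordinary least squares on (X', y') returns (X^T W X)^{-1} X^T W y.
    coef, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    return coef
```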
Yes?

AUDIENCE: Why is the solution written in the slides different?

PHILIPPE RIGOLLET: Because there's a mistake. Yeah, there's a mistake on the slides. How did I make that one? I'm actually trying to parse it back. I mean, it's clearly wrong, right? Oh, no, it's not. No, it is. So it's not clearly wrong... actually, it is clearly wrong: because even if I put the identity here, the products are still associative, right? So this product is actually not compatible. So it's wrong--there's just this extra thing that I probably copy-pasted from some place. Since this is one of my latest slides, I'll just color it in white. But yeah, sorry--this parenthesis should not be here. Thank you.

AUDIENCE: [INAUDIBLE].

PHILIPPE RIGOLLET: Yeah. OK?

AUDIENCE: So why not square root [INAUDIBLE]?

PHILIPPE RIGOLLET: Because I have two of them. I have one that comes from the X prime that's here, this guy. And then I have one that comes from this guy here. OK, so the solution--let's write it in some place that's actually legible--the correction for this thing is

(X transpose W X)^{-1} X transpose W y.

OK? So you just squeeze this W in there. And that's exactly what we had before: X transpose W X inverse, X transpose W, times some y. OK? And what I claim is that this is routinely implemented. As you can imagine, heteroscedastic linear regression is something that's very common. So every time you have a least squares routine, you also have a way to put in some weights. You don't have to allow general weights; but here, diagonal weights are all we need.

So here on the slides, again, I took the beta^(k) and put it in there, so that I have only one least squares solution to formulate. But let's do it slightly differently. What I'm going to do here is say: OK, let's feed it to some least squares solver. So let's do weighted least squares with response y being y tilde^(k) minus mu tilde^(k); design matrix being, well, just X itself--so that doesn't change; and the weights--so what are the weights? The weights are the w^(k) that I had here:

w_i^(k) = h'(x_i transpose beta^(k)) / (g'(mu_i^(k)) phi).

OK, and so this, if I solve it, will spit out something that I will call a solution: I will call it u hat^(k+1). And to get beta hat^(k+1), all I need to do is beta^(k) plus u hat^(k+1).
And that's because--so here, that's maybe not clear, but I started from there, remember? I started from this guy here. So I'm just solving a weighted least squares that's going to give me this increment--that's what I called u hat^(k+1)--and then I add it to beta^(k), and that gives me beta^(k+1). So I just have this intermediate step, which is removed in the slides. OK?

So then you can repeat until convergence. What does it mean to repeat until convergence?

AUDIENCE: [INAUDIBLE]?

PHILIPPE RIGOLLET: Yeah, exactly. So you just set some threshold, and you say: I promise you that this will converge, right? So you know that, at some point, you're going to go toward the solution, but you're never going to be exactly there. And so you just say: OK, I want this accuracy. Actually, machine precision is a little strong. Especially if you have 10 observations to start with, you know you're going to have something that has some statistical error anyway. So that should actually guide you into what kind of numerical error you want to be making. So for example, a good rule of thumb is: if you have n observations, and you want the L2 distance between two consecutive betas to be less than 1/n, you should be good enough. It doesn't have to be machine precision.

And so it's clear how we do this, right? So here, I just have to maintain a bunch of things. At every step, I have to recompute a bunch of things. So I have to recompute the weights. But if I want to recompute the weights, not only do I need the previous iterate, but I need to know how the previous iterate impacts my means. So at each step, I have to recalculate mu_i^(k)--remember, mu_i^(k) was just g inverse of x_i transpose beta^(k), right? So I have to recompute that. And then I use this to compute my weights. I also use it to compute my response, which also depends on g'(mu_i^(k)). I feed all of that to my weighted least squares engine; it spits out the u hat^(k+1) that I add to my previous beta^(k), and that gives me my new beta^(k+1).
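Here is what that loop might look like end to end--a sketch reusing the hypothetical `weighted_least_squares` helper above, with the 1/n rule of thumb as the stopping criterion:

```python
import numpy as np

def irls(X, y, beta0, g_inv, g_prime, h_prime, phi=1.0, max_iter=100):
    """Iteratively reweighted least squares for a GLM (sketch)."""
    n = X.shape[0]
    beta = beta0.copy()
    for _ in range(max_iter):
        eta = X @ beta
        mu = g_inv(eta)                           # mu_i^k = g^{-1}(x_i^T beta^k)
        w = h_prime(eta) / (g_prime(mu) * phi)    # weights w_i^k
        resp = (y - mu) * g_prime(mu)             # working response y~^k - mu~^k
        u_hat = weighted_least_squares(X, resp, w)
        beta = beta + u_hat                       # beta^{k+1} = beta^k + u_hat
        if np.linalg.norm(u_hat) < 1.0 / n:       # rule-of-thumb threshold
            break
    return beta
```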
826 00:47:15,790 --> 00:47:20,030 So my y depends also on g prime of mu i k. 827 00:47:20,030 --> 00:47:24,950 I feed that to my weighted least squares engine. 828 00:47:24,950 --> 00:47:28,520 It spits out the u hat k plus 1, which I add to my previous beta k. 829 00:47:28,520 --> 00:47:30,605 And that gives me my new beta k plus 1. 830 00:47:33,170 --> 00:47:33,670 OK. 831 00:47:33,670 --> 00:47:35,980 So here's the pseudocode, if you want 832 00:47:35,980 --> 00:47:40,781 to take some time to parse it. 833 00:47:40,781 --> 00:47:41,280 All right. 834 00:47:41,280 --> 00:47:43,970 So here again, the trick is not much. 835 00:47:43,970 --> 00:47:49,400 It's just saying, if you don't feel like implementing Fisher 836 00:47:49,400 --> 00:47:52,662 scoring or inverting your Hessian at every step, 837 00:47:52,662 --> 00:47:54,620 then a weighted least squares is actually going 838 00:47:54,620 --> 00:47:56,360 to do it for you automatically. 839 00:47:56,360 --> 00:47:56,860 All right. 840 00:47:56,860 --> 00:47:58,610 Then that's just a numerical trick. 841 00:47:58,610 --> 00:48:00,950 There's nothing really statistical about this, 842 00:48:00,950 --> 00:48:04,730 except for the fact that the solution called for 843 00:48:04,730 --> 00:48:09,682 at each step reminded us of least squares, 844 00:48:09,682 --> 00:48:11,390 just with some extra weights. 845 00:48:14,180 --> 00:48:14,680 OK. 846 00:48:14,680 --> 00:48:18,670 So to conclude, we'll need to know, of course, 847 00:48:18,670 --> 00:48:19,945 x, y, and the link function. 848 00:48:22,629 --> 00:48:24,170 Why do we need the variance function? 849 00:48:29,530 --> 00:48:33,250 I'm not sure we actually need the variance function. 850 00:48:33,250 --> 00:48:36,220 No, I don't know why I say that. 851 00:48:36,220 --> 00:48:39,750 You need phi, not the variance function. 852 00:48:39,750 --> 00:48:41,370 So where do you start actually, right? 853 00:48:41,370 --> 00:48:44,400 So clearly, if you start very close to your solution, 854 00:48:44,400 --> 00:48:46,810 you're actually going to do much better. 855 00:48:46,810 --> 00:48:48,760 And one good way to start-- 856 00:48:48,760 --> 00:48:51,710 so for the beta itself, it's not clear what it's going to be. 857 00:48:51,710 --> 00:48:53,490 But you can actually get a good idea 858 00:48:53,490 --> 00:48:57,960 of what beta is by just having a good idea of what mu is. 859 00:48:57,960 --> 00:49:01,830 Because mu is g inverse of xi transpose beta. 860 00:49:01,830 --> 00:49:04,020 And so what you could do is to try 861 00:49:04,020 --> 00:49:07,560 to set mu to be the actual observations that you have, 862 00:49:07,560 --> 00:49:09,150 because that's the best guess that you 863 00:49:09,150 --> 00:49:11,540 have for their expected value. 864 00:49:11,540 --> 00:49:14,740 And then you just say, OK, once I have my mu, 865 00:49:14,740 --> 00:49:17,630 I know that my mu is a function of this thing. 866 00:49:17,630 --> 00:49:21,380 So I can write g of mu and solve it, using your least squares 867 00:49:21,380 --> 00:49:22,340 estimator, right? 868 00:49:22,340 --> 00:49:28,970 So g of mu is of the form x beta. 869 00:49:28,970 --> 00:49:33,710 So you just solve for-- once you have your mu, 870 00:49:33,710 --> 00:49:36,350 you pass it through g, and then you solve for the beta 871 00:49:36,350 --> 00:49:37,937 that you want. 872 00:49:37,937 --> 00:49:40,020 And then that's the beta that you initialize with. 873 00:49:42,954 --> 00:49:44,910 OK?
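That warm start fits in a few lines. A sketch, assuming the link g is given as a function; the clipping constant eps is an arbitrary illustrative choice to keep g from blowing up on boundary observations (zero counts under a log link, say):

    import numpy as np

    def initial_beta(X, y, g, eps=1e-3):
        mu0 = np.clip(y, eps, None)   # best guess for the means: the observations themselves
        eta0 = g(mu0)                 # pass mu through g, so that g(mu) ~ X beta
        beta0, *_ = np.linalg.lstsq(X, eta0, rcond=None)
        return beta0                  # solve for beta by ordinary least squares
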
874 00:49:44,910 --> 00:49:47,850 And actually, this was your question from last time. 875 00:49:47,850 --> 00:49:50,320 As soon as I use the canonical link, 876 00:49:50,320 --> 00:49:53,880 Fisher scoring and Newton-Raphson 877 00:49:53,880 --> 00:49:57,720 are the same thing, because the Hessian is actually 878 00:49:57,720 --> 00:50:05,870 deterministic in that case, just because when 879 00:50:05,870 --> 00:50:09,290 you use the canonical link, h is the identity, which 880 00:50:09,290 --> 00:50:12,050 means that its second derivative is equal to 0. 881 00:50:12,050 --> 00:50:15,650 So this term goes away even without taking the expectation. 882 00:50:15,650 --> 00:50:17,840 So remember, the term that went away 883 00:50:17,840 --> 00:50:23,420 was of the form yi minus mu i divided 884 00:50:23,420 --> 00:50:29,609 by phi times h prime prime of xi transpose beta, right? 885 00:50:29,609 --> 00:50:32,150 That's the term that we said, oh, the conditional expectation 886 00:50:32,150 --> 00:50:34,170 of this guy is 0. 887 00:50:34,170 --> 00:50:36,384 But if h prime prime is already equal to 0, 888 00:50:36,384 --> 00:50:37,800 then there's nothing that changes. 889 00:50:37,800 --> 00:50:39,120 There's nothing that goes away. 890 00:50:39,120 --> 00:50:40,530 It was already equal to 0. 891 00:50:40,530 --> 00:50:43,710 And that always happens when you have the canonical link, 892 00:50:43,710 --> 00:50:54,450 because h is the inverse of g composed with b prime. 893 00:50:54,450 --> 00:50:57,690 And the canonical link is b prime inverse, 894 00:50:57,690 --> 00:51:00,176 so this composition is the identity. 895 00:51:00,176 --> 00:51:06,780 So the second derivative of the identity function f of x equals x is 0. 896 00:51:06,780 --> 00:51:08,630 OK. 897 00:51:08,630 --> 00:51:11,620 My screen says end of show. 898 00:51:11,620 --> 00:51:13,862 So we can start with some questions. 899 00:51:13,862 --> 00:51:15,320 AUDIENCE: I just wanted to clarify. 900 00:51:15,320 --> 00:51:19,127 So iterative-- what does it say, for iterative-- 901 00:51:19,127 --> 00:51:20,960 PHILIPPE RIGOLLET: Reweighted least squares. 902 00:51:20,960 --> 00:51:21,386 AUDIENCE: Reweighted least squares 903 00:51:21,386 --> 00:51:23,840 is an implementation of the Fisher scoring [INAUDIBLE]? 904 00:51:23,840 --> 00:51:25,631 PHILIPPE RIGOLLET: That's an implementation 905 00:51:25,631 --> 00:51:29,000 that's just making calls to weighted least squares oracles. 906 00:51:29,000 --> 00:51:30,730 It's called an oracle sometimes. 907 00:51:30,730 --> 00:51:33,849 An oracle is what you assume the machine can do easily for you. 908 00:51:33,849 --> 00:51:35,390 So if you assume that your machine is 909 00:51:35,390 --> 00:51:38,150 very good at multiplying by the inverse of a matrix, 910 00:51:38,150 --> 00:51:40,580 you might as well just do Fisher scoring yourself, right? 911 00:51:40,580 --> 00:51:43,130 It's just a way so that you don't have to actually do it. 912 00:51:43,130 --> 00:51:46,460 And usually, those things are implemented-- 913 00:51:46,460 --> 00:51:49,320 and I just said routinely-- in statistical software. 914 00:51:49,320 --> 00:51:51,440 But they're implemented very efficiently 915 00:51:51,440 --> 00:51:52,440 in statistical software. 916 00:51:52,440 --> 00:51:54,770 So this is going to be one of the fastest ways you're 917 00:51:54,770 --> 00:51:59,165 going to have to solve, to do this step, 918 00:51:59,165 --> 00:52:01,145 especially for large-scale problems.
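A quick numerical check of this claim, sketched for logistic regression, whose logit link is canonical: the Hessian of the log-likelihood contains no y terms, so it equals minus the Fisher information and the Newton step coincides with the Fisher-scoring step. The data here is random, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = rng.integers(0, 2, size=50).astype(float)
    beta = np.zeros(3)

    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # mu_i = b'(x_i^T beta), logistic case
    grad = X.T @ (y - mu)                    # gradient of the log-likelihood
    w = mu * (1.0 - mu)                      # b''(theta_i): note that no y_i appears
    fisher = (X.T * w) @ X                   # Fisher information
    hessian = -fisher                        # exact Hessian: deterministic, same matrix

    newton_step = np.linalg.solve(-hessian, grad)
    scoring_step = np.linalg.solve(fisher, grad)
    assert np.allclose(newton_step, scoring_step)   # identical steps, as claimed
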
919 00:52:01,145 --> 00:52:03,186 AUDIENCE: So the thing that computers can do well 920 00:52:03,186 --> 00:52:05,105 is the multiplier [INAUDIBLE]. 921 00:52:05,105 --> 00:52:07,580 What's the thing that the computers can do fast 922 00:52:07,580 --> 00:52:09,525 and what's the thing that [INAUDIBLE]? 923 00:52:09,525 --> 00:52:10,900 PHILIPPE RIGOLLET: So if you were 924 00:52:10,900 --> 00:52:13,210 to do this in the simplest possible way, 925 00:52:13,210 --> 00:52:18,070 your iteration for, say, Fisher scoring 926 00:52:18,070 --> 00:52:21,500 is just to multiply by the inverse of the Fisher information, 927 00:52:21,500 --> 00:52:22,000 right? 928 00:52:22,000 --> 00:52:24,160 AUDIENCE: So finding that inverse is slow? 929 00:52:24,160 --> 00:52:26,530 PHILIPPE RIGOLLET: Yeah, so it takes a bit of time. 930 00:52:26,530 --> 00:52:30,330 Whereas, since you know you're going to multiply directly 931 00:52:30,330 --> 00:52:33,177 by something, if you just say-- 932 00:52:33,177 --> 00:52:35,010 those things are not as optimized as solving 933 00:52:35,010 --> 00:52:35,580 least squares. 934 00:52:35,580 --> 00:52:36,990 Actually, the way it's typically done 935 00:52:36,990 --> 00:52:38,340 is by doing some least squares. 936 00:52:38,340 --> 00:52:41,190 So you might as well just do the least squares that you like. 937 00:52:41,190 --> 00:52:42,180 And there's also less-- 938 00:52:45,870 --> 00:52:47,770 well, no, there's no-- 939 00:52:47,770 --> 00:52:51,035 well, there is less recalculation, right? 940 00:52:51,035 --> 00:52:52,410 Here, your Fisher, you would have 941 00:52:52,410 --> 00:52:54,720 to recompute the entire matrix of Fisher information. 942 00:52:54,720 --> 00:52:56,170 Whereas here, you don't have to. 943 00:52:56,170 --> 00:52:56,670 Right? 944 00:52:56,670 --> 00:52:59,850 You really just have to compute some vectors and the vector 945 00:52:59,850 --> 00:53:00,600 of weights, right? 946 00:53:00,600 --> 00:53:03,230 So the Fisher information matrix has, say, 947 00:53:03,230 --> 00:53:05,910 n choose two entries that you need to compute, right? 948 00:53:05,910 --> 00:53:08,910 It's symmetric, so it's order n squared entries. 949 00:53:08,910 --> 00:53:11,460 But here, the only things you update, if you think about it, 950 00:53:11,460 --> 00:53:13,987 are this weight matrix. 951 00:53:13,987 --> 00:53:15,570 So there is only the diagonal elements 952 00:53:15,570 --> 00:53:19,330 that you need to update, and these vectors in there also. 953 00:53:19,330 --> 00:53:21,660 That's two n's, versus n squared. 954 00:53:21,660 --> 00:53:23,960 So that's much less to actually put in there. 955 00:53:23,960 --> 00:53:25,100 It does it for you somehow. 956 00:53:29,680 --> 00:53:30,810 Any other question? 957 00:53:34,440 --> 00:53:35,070 Yeah? 958 00:53:35,070 --> 00:53:37,950 AUDIENCE: So if I have a data set [INAUDIBLE], 959 00:53:37,950 --> 00:53:40,451 then I can always try to model it with least squares, right? 960 00:53:40,451 --> 00:53:41,825 PHILIPPE RIGOLLET: Yeah, you can. 961 00:53:41,825 --> 00:53:44,670 AUDIENCE: And so this is like setting my weight equal to 1-- 962 00:53:44,670 --> 00:53:46,159 the identity, essentially, right? 963 00:53:46,159 --> 00:53:47,700 PHILIPPE RIGOLLET: Well, not exactly, 964 00:53:47,700 --> 00:53:50,640 because the g also shows up in this correction 965 00:53:50,640 --> 00:53:51,982 that you have here, right? 966 00:53:51,982 --> 00:53:52,934 AUDIENCE: Yeah.
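Concretely, doing it by least squares means never forming the inverse: rescale the rows by the square roots of the weights and hand the problem to an ordinary least squares routine, which factorizes rather than inverts. A sketch, with illustrative names:

    import numpy as np

    def wls_via_lstsq(X, w, z):
        # argmin_u sum_i w_i (z_i - x_i^T u)^2 is OLS on sqrt(w)-rescaled rows,
        # so no (X^T W X)^{-1} is ever formed explicitly.
        s = np.sqrt(w)
        u, *_ = np.linalg.lstsq(X * s[:, None], z * s, rcond=None)
        return u
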
967 00:53:52,934 --> 00:53:55,350 PHILIPPE RIGOLLET: I mean, I don't know what you mean by-- 968 00:53:55,350 --> 00:53:56,725 AUDIENCE: I'm just trying to say, 969 00:53:56,725 --> 00:53:59,652 are there ever situations where I'm trying to model a data set 970 00:53:59,652 --> 00:54:03,910 and I would want to pick my weights in a particular way? 971 00:54:03,910 --> 00:54:04,910 PHILIPPE RIGOLLET: Yeah. 972 00:54:04,910 --> 00:54:05,400 AUDIENCE: OK. 973 00:54:05,400 --> 00:54:06,216 PHILIPPE RIGOLLET: I mean-- 974 00:54:06,216 --> 00:54:07,920 AUDIENCE: [INAUDIBLE] example [INAUDIBLE]. 975 00:54:07,920 --> 00:54:09,420 PHILIPPE RIGOLLET: Well, OK, there's 976 00:54:09,420 --> 00:54:10,960 the heteroscedastic case for sure. 977 00:54:10,960 --> 00:54:13,632 So if you're going to actually compute those things-- and more 978 00:54:13,632 --> 00:54:15,340 generally, I don't think you should think 979 00:54:15,340 --> 00:54:16,390 of those as being weights. 980 00:54:16,390 --> 00:54:18,473 You should really think of those as being matrices 981 00:54:18,473 --> 00:54:19,510 that you invert. 982 00:54:19,510 --> 00:54:21,370 And don't think of it as being diagonal, 983 00:54:21,370 --> 00:54:23,890 but really think of them as being full matrices. 984 00:54:23,890 --> 00:54:25,390 So if you have-- 985 00:54:25,390 --> 00:54:30,280 when we wrote weighted least squares here, this was really-- 986 00:54:30,280 --> 00:54:31,776 the w, I said, is diagonal. 987 00:54:31,776 --> 00:54:34,150 But all the computations never really use the fact 988 00:54:34,150 --> 00:54:35,140 that it's diagonal. 989 00:54:35,140 --> 00:54:38,500 So what shows up here is just the inverse 990 00:54:38,500 --> 00:54:40,180 of your covariance matrix. 991 00:54:40,180 --> 00:54:42,580 And so if you have data that's correlated, 992 00:54:42,580 --> 00:54:45,330 this is where it's going to show up.
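For that correlated case, the same formula with a full covariance matrix Sigma in place of the diagonal weights is generalized least squares, beta = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y. A sketch that whitens with a Cholesky factor instead of inverting Sigma directly (names are illustrative):

    import numpy as np

    def gls(X, y, Sigma):
        L = np.linalg.cholesky(Sigma)   # Sigma = L L^T
        Xw = np.linalg.solve(L, X)      # whiten the design: L^{-1} X
        yw = np.linalg.solve(L, y)      # whiten the response: L^{-1} y
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        return beta                     # equals (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y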