1 00:00:00,120 --> 00:00:02,460 The following content is provided under a Creative 2 00:00:02,460 --> 00:00:03,880 Commons license. 3 00:00:03,880 --> 00:00:06,090 Your support will help MIT OpenCourseWare 4 00:00:06,090 --> 00:00:10,180 continue to offer high quality educational resources for free. 5 00:00:10,180 --> 00:00:12,720 To make a donation or to view additional materials 6 00:00:12,720 --> 00:00:16,680 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,680 --> 00:00:17,610 at ocw.mit.edu. 8 00:00:20,340 --> 00:00:23,280 PHILIPPE RIGOLLET: We're talking about 9 00:00:23,280 --> 00:00:24,390 generalized linear models. 10 00:00:24,390 --> 00:00:25,764 And in generalized linear models, 11 00:00:25,764 --> 00:00:28,780 we generalize linear models in two ways. 12 00:00:28,780 --> 00:00:31,170 The first one is to allow for a different distribution 13 00:00:31,170 --> 00:00:32,910 for the response variables. 14 00:00:32,910 --> 00:00:34,560 And the distributions that we wanted 15 00:00:34,560 --> 00:00:37,230 were the exponential family. 16 00:00:44,240 --> 00:00:46,910 And this is a family that can be generalized 17 00:00:46,910 --> 00:00:49,340 over random variables that are defined 18 00:00:49,340 --> 00:00:52,640 on R^q in general, with parameters in R^k. 19 00:00:52,640 --> 00:00:58,310 But we're going to focus on a very specific case when 20 00:00:58,310 --> 00:01:00,710 y is a real valued response variable, which 21 00:01:00,710 --> 00:01:04,040 is the one you're used to when you're doing linear regression. 22 00:01:04,040 --> 00:01:09,500 And the parameter theta also lives in R. 23 00:01:09,500 --> 00:01:12,250 And so we're going to talk about the canonical case. 24 00:01:12,250 --> 00:01:15,920 So that's the canonical exponential family, 25 00:01:15,920 --> 00:01:19,760 where you have a density, f theta of x, which is 26 00:01:19,760 --> 00:01:25,280 of the form, exponential plus. 
27 00:01:25,280 --> 00:01:27,800 And then, we have y, which interacts with theta 28 00:01:27,800 --> 00:01:29,580 only by taking a product. 29 00:01:29,580 --> 00:01:32,990 Then, there's a term that depends only on theta, 30 00:01:32,990 --> 00:01:35,180 some dispersion parameter phi. 31 00:01:35,180 --> 00:01:37,340 And then, we have some normalization factor. 32 00:01:37,340 --> 00:01:44,900 Let's call it c of y phi. 33 00:01:44,900 --> 00:01:48,340 So it really should not matter too much, so it's c of y phi, 34 00:01:48,340 --> 00:01:50,520 and that's really just the normalization factor. 35 00:01:50,520 --> 00:01:54,010 And here, we're going to assume that phi is known. 36 00:01:57,480 --> 00:01:58,694 I have no idea what I write. 37 00:01:58,694 --> 00:02:00,110 I don't know if you guys can read. 38 00:02:00,110 --> 00:02:01,943 I don't know what chalk has been used today, 39 00:02:01,943 --> 00:02:05,480 but I just can't see it. 40 00:02:05,480 --> 00:02:08,694 That's not my fault. All right, so we're going 41 00:02:08,694 --> 00:02:09,860 to assume that phi is known. 42 00:02:09,860 --> 00:02:12,021 And so we saw that several distributions 43 00:02:12,021 --> 00:02:14,270 that we know well, including the Gaussian for example, 44 00:02:14,270 --> 00:02:15,612 belong to this family. 45 00:02:15,612 --> 00:02:17,320 And there's other ones, such as Poisson-- 46 00:02:21,084 --> 00:02:22,000 Poisson and Bernoulli. 47 00:02:22,000 --> 00:02:24,040 So if the PMF has this form, if you 48 00:02:24,040 --> 00:02:27,610 have a discrete random variable, this is also valid. 49 00:02:27,610 --> 00:02:29,779 And the reason why we introduced this family 50 00:02:29,779 --> 00:02:32,320 is because there are going to be some properties that we know 51 00:02:32,320 --> 00:02:34,960 that this thing here, this function, b of theta, 52 00:02:34,960 --> 00:02:37,540 is essentially what completely characterizes 53 00:02:37,540 --> 00:02:38,950 your distribution. 
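[Board note, reconstructed from the spoken description rather than taken from the recording.] The density being described appears to be the canonical exponential family form. A sketch in LaTeX, assuming the argument is the response y, that b is the function of theta alone, and that c(y, phi) is the normalization term just named:

```latex
f_\theta(y) \;=\; \exp\!\left( \frac{y\,\theta - b(\theta)}{\phi} + c(y,\phi) \right),
\qquad \theta \in \mathbb{R}, \quad \phi > 0 \text{ known}.
```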
54 00:02:38,950 --> 00:02:42,504 So if phi is fixed, we know that the interaction is the form. 55 00:02:42,504 --> 00:02:44,170 And this really just comes from the fact 56 00:02:44,170 --> 00:02:46,490 that we want the function to integrate to one. 57 00:02:46,490 --> 00:02:49,180 So this b here in the canonical form 58 00:02:49,180 --> 00:02:50,860 encodes everything we want to know. 59 00:02:50,860 --> 00:02:53,164 If I tell you what b of theta is-- 60 00:02:53,164 --> 00:02:54,580 and of course, I tell you what phi 61 00:02:54,580 --> 00:02:56,914 is, but let's say for a second that phi is equal to one. 62 00:02:56,914 --> 00:02:58,330 If I tell you this b of theta, you 63 00:02:58,330 --> 00:03:00,420 know exactly what distribution I'm talking about. 64 00:03:00,420 --> 00:03:02,520 So it should encode everything that's 65 00:03:02,520 --> 00:03:05,920 specific to this distribution, such as mean, variance, 66 00:03:05,920 --> 00:03:07,780 all the moments that you would want. 67 00:03:07,780 --> 00:03:12,310 And we'll see how we can compute from this thing 68 00:03:12,310 --> 00:03:14,590 the mean and the variance, for example. 69 00:03:14,590 --> 00:03:16,567 So today, we're going to talk about likelihood, 70 00:03:16,567 --> 00:03:18,400 and we're going to start with the likelihood 71 00:03:18,400 --> 00:03:21,100 function or the log likelihood for one observation. 72 00:03:21,100 --> 00:03:23,440 From this, we're going to do some computations, 73 00:03:23,440 --> 00:03:26,590 and then, we'll move on to the actual log likelihood based 74 00:03:26,590 --> 00:03:28,750 on n independent observations. 
75 00:03:28,750 --> 00:03:30,730 And here, as we will see, the observations 76 00:03:30,730 --> 00:03:32,950 are not going to be identically distributed, 77 00:03:32,950 --> 00:03:35,080 because we're going to want each of them, 78 00:03:35,080 --> 00:03:39,530 conditionally on x to be a different function of x, where 79 00:03:39,530 --> 00:03:41,200 theta is just a different function of x 80 00:03:41,200 --> 00:03:43,230 for each of the observation. 81 00:03:43,230 --> 00:03:45,649 So remember, the log likelihood-- 82 00:03:50,050 --> 00:03:52,630 and this is for one observation-- 83 00:03:52,630 --> 00:03:54,400 is just the log of the density, right? 84 00:03:59,090 --> 00:04:02,840 And we have this identity that I mentioned 85 00:04:02,840 --> 00:04:04,530 at the end of the class on Tuesday. 86 00:04:04,530 --> 00:04:06,800 And this identity is just that the expectation 87 00:04:06,800 --> 00:04:08,960 of the derivative of this guy with respect to theta 88 00:04:08,960 --> 00:04:10,430 is equal to 0. 89 00:04:10,430 --> 00:04:11,330 So let's see why. 90 00:04:11,330 --> 00:04:15,610 So if I take the derivative with respect to theta, of log f, 91 00:04:15,610 --> 00:04:18,860 theta of x, what I get is the derivative 92 00:04:18,860 --> 00:04:21,930 with respect to theta of f, theta 93 00:04:21,930 --> 00:04:26,880 of x, divided by f theta of x. 94 00:04:26,880 --> 00:04:30,820 Now, if I take the expectation of this guy, 95 00:04:30,820 --> 00:04:37,810 with respect to this theta as well, what I get 96 00:04:37,810 --> 00:04:40,210 is that this thing-- what is the expectation? 97 00:04:40,210 --> 00:04:42,970 Well, it's just the integral against f theta. 98 00:04:42,970 --> 00:04:45,040 Or if I'm in a discrete case, I just 99 00:04:45,040 --> 00:04:48,590 have the sum against f theta, if it's a pmf. 
100 00:04:48,590 --> 00:04:53,320 Just the definition, the expectation of x, 101 00:04:53,320 --> 00:04:56,790 is either the integral-- well, let's say of h of x-- 102 00:04:56,790 --> 00:04:59,770 is the integral of h of x, 103 00:04:59,770 --> 00:05:01,390 f theta of x-- 104 00:05:01,390 --> 00:05:04,090 if this is continuous, or is just the sum 105 00:05:04,090 --> 00:05:07,960 of h of x, f theta of x, 106 00:05:07,960 --> 00:05:09,880 if x is discrete-- 107 00:05:09,880 --> 00:05:15,115 so if it's continuous, you put this soft sum. 108 00:05:15,115 --> 00:05:17,110 This guy is the same thing, right? 109 00:05:17,110 --> 00:05:20,450 So I'm just going to illustrate the case when it's continuous. 110 00:05:20,450 --> 00:05:21,300 So this is what? 111 00:05:21,300 --> 00:05:24,790 Well, this is the integral of the partial derivative with respect 112 00:05:24,790 --> 00:05:29,740 to theta, of f theta of x, divided by f theta 113 00:05:29,740 --> 00:05:35,060 of x, all times f theta of x-- 114 00:05:35,060 --> 00:05:36,526 dx. 115 00:05:36,526 --> 00:05:38,627 And now, these f thetas cancel, 116 00:05:38,627 --> 00:05:40,210 so I'm actually left with the integral 117 00:05:40,210 --> 00:05:41,290 of the derivative, which I'm going 118 00:05:41,290 --> 00:05:43,081 to write as the derivative of the integral. 119 00:05:50,690 --> 00:06:01,640 But f theta being a density for any value of theta 120 00:06:01,640 --> 00:06:04,150 that I can take, this is a function. 121 00:06:04,150 --> 00:06:07,720 As a function of theta, this function 122 00:06:07,720 --> 00:06:10,670 is constantly equal to 1. 123 00:06:10,670 --> 00:06:13,710 For any theta that I take, it takes the value 1. 124 00:06:13,710 --> 00:06:16,344 So this is constantly equal to 1. 125 00:06:16,344 --> 00:06:18,510 I put three bars to say that for any value of theta, 126 00:06:18,510 --> 00:06:21,830 this is 1, which actually tells me that the derivative is 127 00:06:21,830 --> 00:06:24,200 equal to 0. 
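[Board note, reconstructed.] The chain of equalities just spoken can be written out as follows. This is a sketch for the continuous case only, and it assumes enough regularity to swap the derivative and the integral, which the lecture takes for granted:

```latex
\mathbb{E}_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(X) \right]
= \int \frac{\partial_\theta f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\, dx
= \int \partial_\theta f_\theta(x)\, dx
= \frac{\partial}{\partial\theta} \underbrace{\int f_\theta(x)\, dx}_{\equiv\, 1}
= 0 .
```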
128 00:06:24,200 --> 00:06:25,010 OK, yes? 129 00:06:30,455 --> 00:06:32,930 AUDIENCE: What is the first [INAUDIBLE] 130 00:06:32,930 --> 00:06:34,415 that you wrote on the board? 131 00:06:38,666 --> 00:06:40,540 PHILIPPE RIGOLLET: That's just the definition 132 00:06:40,540 --> 00:06:44,396 of the derivative of the log of a function? 133 00:06:44,396 --> 00:06:45,364 AUDIENCE: OK. 134 00:06:49,720 --> 00:06:53,660 PHILIPPE RIGOLLET: Log of f prime is f prime over f. 135 00:06:53,660 --> 00:06:56,060 That's a log, yeah. 136 00:06:56,060 --> 00:06:59,735 Just by elimination. 137 00:06:59,735 --> 00:07:01,652 AUDIENCE: [INAUDIBLE] 138 00:07:01,652 --> 00:07:02,860 PHILIPPE RIGOLLET: I'm sorry. 139 00:07:02,860 --> 00:07:05,276 AUDIENCE: When you write a squiggle that starts with an l, 140 00:07:05,276 --> 00:07:06,680 I assume it's lambda. 141 00:07:06,680 --> 00:07:09,420 PHILIPPE RIGOLLET: And you do good, because that's probably 142 00:07:09,420 --> 00:07:11,370 how my mind processes. 143 00:07:11,370 --> 00:07:13,310 And so I'm like, yeah, l. 144 00:07:13,310 --> 00:07:16,820 Here is enough information. 145 00:07:16,820 --> 00:07:19,290 OK, everybody is good with this? 146 00:07:19,290 --> 00:07:21,260 So that was convenient. 147 00:07:21,260 --> 00:07:22,860 So it just said that the expectation 148 00:07:22,860 --> 00:07:26,970 of the derivative of the log likelihood is equal to 0. 149 00:07:26,970 --> 00:07:29,340 That's going to be our first identity. 150 00:07:29,340 --> 00:07:30,900 Let's move onto the second identity, 151 00:07:30,900 --> 00:07:34,140 using exactly the same trick, which is let's hope 152 00:07:34,140 --> 00:07:35,850 that at some point, we have the integral 153 00:07:35,850 --> 00:07:37,266 of this function that's constantly 154 00:07:37,266 --> 00:07:41,550 equal to 1 as a function of theta, and then use the fact 155 00:07:41,550 --> 00:07:43,650 that its derivative is equal to 0. 
156 00:07:43,650 --> 00:07:54,850 So if I start taking the second derivative of the log 157 00:07:54,850 --> 00:07:57,470 of f theta, so what is this? 158 00:07:57,470 --> 00:08:00,600 Well, it's the derivative of this guy 159 00:08:00,600 --> 00:08:02,720 here, so I'm going to go straight to it. 160 00:08:02,720 --> 00:08:08,830 So it's second derivative, f theta of x, times 161 00:08:08,830 --> 00:08:19,810 f theta of x, minus first derivative of f theta of x, 162 00:08:19,810 --> 00:08:22,160 times first derivative of f theta of x. 163 00:08:26,360 --> 00:08:29,670 Here is some super important stuff-- 164 00:08:29,670 --> 00:08:31,740 no, I'm kidding. 165 00:08:31,740 --> 00:08:34,080 So you can still see that guy over there? 166 00:08:34,080 --> 00:08:35,870 So it's just the square. 167 00:08:35,870 --> 00:08:38,100 And then, I divide by f theta of x squared. 168 00:08:43,890 --> 00:08:48,340 So here I have the second derivative, times f itself. 169 00:08:48,340 --> 00:08:51,630 And here, I have the product of the first derivative 170 00:08:51,630 --> 00:08:52,160 with itself. 171 00:08:52,160 --> 00:08:53,544 So that's the square. 172 00:08:53,544 --> 00:08:55,210 So now, I'm going to integrate this guy. 173 00:08:55,210 --> 00:09:01,550 So if I take the expectation of this thing here, what I get 174 00:09:01,550 --> 00:09:03,741 is the integral. 175 00:09:03,741 --> 00:09:05,240 So here, the only thing that's going 176 00:09:05,240 --> 00:09:07,073 to happen when I'm going to take my integral 177 00:09:07,073 --> 00:09:09,380 is that one of those squares is going to cancel 178 00:09:09,380 --> 00:09:10,940 against f theta, right? 179 00:09:10,940 --> 00:09:22,430 So I'm going to get the second derivative 180 00:09:22,430 --> 00:09:24,830 minus the first derivative squared. 181 00:09:32,950 --> 00:09:34,560 And then, I'm dividing by f theta. 182 00:09:38,120 --> 00:09:39,850 And I know that this thing is equal to 0. 
183 00:09:44,460 --> 00:09:46,660 Now, one of these guys here-- 184 00:09:46,660 --> 00:09:48,180 sorry, why do I have-- 185 00:09:48,180 --> 00:09:49,350 so I have this guy here. 186 00:09:49,350 --> 00:09:50,865 So this guy here is going to cancel. 187 00:09:53,440 --> 00:09:57,320 So this is what this is equal to-- 188 00:09:59,970 --> 00:10:05,595 the integral of the partial, so the second derivative of f 189 00:10:05,595 --> 00:10:09,620 theta of x, because those two guys cancel-- 190 00:10:09,620 --> 00:10:14,156 minus the integral of the second derivative. 191 00:10:24,380 --> 00:10:28,360 And this is telling me what? 192 00:10:55,480 --> 00:10:58,040 Yeah, I'm losing one, because I have some weird sequences. 193 00:10:58,040 --> 00:10:58,539 Thank you. 194 00:11:03,210 --> 00:11:12,282 OK, this is still positive. 195 00:11:12,282 --> 00:11:14,490 I want to say that this thing is actually equal to 0. 196 00:11:17,040 --> 00:11:19,410 But then, it gives me some weird things, 197 00:11:19,410 --> 00:11:24,150 which are that I have an integral 198 00:11:24,150 --> 00:11:26,080 of a positive function, which is equal to 0. 199 00:11:32,814 --> 00:11:34,480 Yeah, that's what I'm thinking of doing. 200 00:11:34,480 --> 00:11:37,230 But I'm going to get 0 for this entire integral, which 201 00:11:37,230 --> 00:11:39,729 means that I have the integral of a positive function, which 202 00:11:39,729 --> 00:11:42,810 is equal to 0, which means that this function is equal to 0, 203 00:11:42,810 --> 00:11:44,940 which sounds a little bad-- 204 00:11:44,940 --> 00:11:48,310 basically, tells me that this function, f theta, is linear. 205 00:11:52,710 --> 00:11:55,259 So I went a little too far, I believe, 206 00:11:55,259 --> 00:11:57,300 because I only want to prove that the expectation 207 00:11:57,300 --> 00:11:58,937 of the second derivative-- 208 00:12:24,190 --> 00:12:25,970 Yes, so I want to pull this out. 
209 00:12:31,020 --> 00:12:35,229 So let's see, if I keep rolling with this, I'm going to get-- 210 00:12:35,229 --> 00:12:37,520 well, no because the fact that it's divided by f theta, 211 00:12:37,520 --> 00:12:40,670 means that, indeed, the second derivative is equal to 0. 212 00:12:40,670 --> 00:12:41,960 So it cannot do this here. 213 00:12:49,446 --> 00:12:51,438 AUDIENCE: [INAUDIBLE] 214 00:12:59,920 --> 00:13:03,120 PHILIPPE RIGOLLET: OK, but let's write it like this. 215 00:13:03,120 --> 00:13:05,020 You're right, so this is what? 216 00:13:05,020 --> 00:13:12,200 This is the expectation of the partial with respect 217 00:13:12,200 --> 00:13:15,740 to theta of f theta of x, divided 218 00:13:15,740 --> 00:13:21,360 by f theta of x squared. 219 00:13:21,360 --> 00:13:25,530 And this is exactly the derivative of the log, right? 220 00:13:25,530 --> 00:13:28,830 So indeed, this thing is equal to the expectation with respect 221 00:13:28,830 --> 00:13:34,980 to theta of the partial of l-- 222 00:13:34,980 --> 00:13:41,160 of log f theta, divided by partial theta. 223 00:13:41,160 --> 00:13:44,660 All right, so this is one of the guys that I want squared. 224 00:13:44,660 --> 00:13:47,510 This is one of the guys that I want. 225 00:13:47,510 --> 00:13:50,810 And this is actually equal, so this will 226 00:13:50,810 --> 00:13:53,270 be equal to the expectation-- 227 00:13:56,186 --> 00:13:58,130 AUDIENCE: [INAUDIBLE] 228 00:13:59,110 --> 00:14:02,860 PHILIPPE RIGOLLET: Oh, right, so this term should be equal to 0. 229 00:14:02,860 --> 00:14:03,940 This was not 0. 230 00:14:03,940 --> 00:14:04,990 You're absolutely right. 231 00:14:04,990 --> 00:14:06,672 So at some point, I got confused, 232 00:14:06,672 --> 00:14:08,380 because I thought putting this equal to 0 233 00:14:08,380 --> 00:14:09,463 would mean that this is 0. 234 00:14:09,463 --> 00:14:10,840 But this thing is not equal to 0. 235 00:14:10,840 --> 00:14:11,650 So this thing, you're right. 
236 00:14:11,650 --> 00:14:13,858 I use the same trick as before, and this is actually 237 00:14:13,858 --> 00:14:16,900 equal to 0, which means that now I have 238 00:14:16,900 --> 00:14:19,690 what's on the left-hand side, which is equal to what's 239 00:14:19,690 --> 00:14:20,720 on the right-hand side. 240 00:14:20,720 --> 00:14:23,220 And if I recap, I get that e theta 241 00:14:23,220 --> 00:14:30,750 of the second derivative of the log of f theta 242 00:14:30,750 --> 00:14:32,670 is equal to minus-- 243 00:14:32,670 --> 00:14:34,360 because I had a minus sign here-- 244 00:14:34,360 --> 00:14:39,300 the expectation with respect to theta, of the partial of log of f theta, 245 00:14:39,300 --> 00:14:44,100 divided by partial theta, squared. 246 00:14:44,100 --> 00:14:48,720 Thank you for being on watch when I'm falling apart. 247 00:14:48,720 --> 00:14:50,390 All right, so this is exactly what 248 00:14:50,390 --> 00:14:52,140 you have here, except that both terms have 249 00:14:52,140 --> 00:14:54,180 been put on the same side. 250 00:14:54,180 --> 00:14:57,550 All right, so those things are going to be useful to us, 251 00:14:57,550 --> 00:14:59,880 so maybe, we should write them somewhere here. 252 00:15:13,820 --> 00:15:16,210 And then, we have that the expectation 253 00:15:16,210 --> 00:15:26,090 of the second derivative of the log 254 00:15:26,090 --> 00:15:33,170 is equal to minus the expectation of the square 255 00:15:33,170 --> 00:15:34,941 of the first derivative. 256 00:15:40,020 --> 00:15:42,750 And this is, indeed, my Fisher information. 257 00:15:42,750 --> 00:15:48,030 This is just telling me what is the second derivative of my log 258 00:15:48,030 --> 00:15:49,217 likelihood at theta, right? 259 00:15:49,217 --> 00:15:50,800 So everything is with respect to theta 260 00:15:50,800 --> 00:15:52,440 when I take these expectations. 
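[Board note, reconstructed.] Since the board work for the second identity went through a few false starts, here is one way to write the corrected argument. It is a sketch under the same assumption as before, that derivatives and integrals can be exchanged; the primes denote derivatives in theta:

```latex
\frac{\partial^2}{\partial\theta^2} \log f_\theta(x)
= \frac{f_\theta''(x)\, f_\theta(x) - \big(f_\theta'(x)\big)^2}{f_\theta(x)^2}
= \frac{f_\theta''(x)}{f_\theta(x)} - \left( \frac{\partial}{\partial\theta} \log f_\theta(x) \right)^{2},
\\[6pt]
\mathbb{E}_\theta\!\left[ \frac{f_\theta''(X)}{f_\theta(X)} \right]
= \int f_\theta''(x)\, dx
= \frac{\partial^2}{\partial\theta^2} \int f_\theta(x)\, dx = 0
\;\;\Longrightarrow\;\;
\mathbb{E}_\theta\!\left[ \frac{\partial^2}{\partial\theta^2} \log f_\theta(X) \right]
= -\,\mathbb{E}_\theta\!\left[ \left( \frac{\partial}{\partial\theta} \log f_\theta(X) \right)^{2} \right].
```

Only the first term of the expectation vanishes; the squared term is what survives.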
261 00:15:52,440 --> 00:15:55,140 And so it tells me that the expectation 262 00:15:55,140 --> 00:15:58,530 of the second derivative-- at least first of all, what 263 00:15:58,530 --> 00:16:00,150 it's telling me is that it's concave, 264 00:16:00,150 --> 00:16:02,970 because the second derivative of this thing, 265 00:16:02,970 --> 00:16:05,910 which is the second derivative of KL divergence, 266 00:16:05,910 --> 00:16:09,436 is actually minus something which must be non-negative. 267 00:16:09,436 --> 00:16:11,310 And so it's telling me that it's concave here 268 00:16:11,310 --> 00:16:14,646 at this [INAUDIBLE]. 269 00:16:14,646 --> 00:16:16,270 And in particular, it's also telling me 270 00:16:16,270 --> 00:16:19,240 that it has to be strictly positive, unless the derivative 271 00:16:19,240 --> 00:16:21,040 of f is equal to 0. 272 00:16:21,040 --> 00:16:27,700 Unless f is constant, then it's not going to change. 273 00:16:27,700 --> 00:16:32,920 All right, do you have a question? 274 00:16:32,920 --> 00:16:35,660 So now, let's use this. 275 00:16:35,660 --> 00:16:37,660 So what does my log likelihood look 276 00:16:37,660 --> 00:16:41,390 like when I actually compute it for this canonical exponential 277 00:16:41,390 --> 00:16:42,760 family? 278 00:16:42,760 --> 00:16:45,310 We have this exponential function, so taking the log 279 00:16:45,310 --> 00:16:48,260 should make my life much easier, and indeed, it does. 280 00:16:48,260 --> 00:16:56,250 So if I look at the canonical, what I have 281 00:16:56,250 --> 00:17:04,339 is that the log of f theta of x, it's equal 282 00:17:04,339 --> 00:17:10,849 simply to y theta minus b of theta, divided by phi, 283 00:17:10,849 --> 00:17:14,880 plus this function that does not depend on theta. 284 00:17:18,790 --> 00:17:20,440 So let's see what this tells me. 285 00:17:20,440 --> 00:17:23,329 Let's just plug in those equalities in there. 
286 00:17:23,329 --> 00:17:25,329 I can take the derivative of the right-hand side 287 00:17:25,329 --> 00:17:28,060 and just say that in expectation, it's equal to 0. 288 00:17:28,060 --> 00:17:32,300 So if I start looking at the derivative, 289 00:17:32,300 --> 00:17:33,780 this is equal to what? 290 00:17:33,780 --> 00:17:37,820 Well, here I'm going to pick up only y. 291 00:17:37,820 --> 00:17:39,160 Sorry, this is a function of y. 292 00:17:46,585 --> 00:17:48,460 I was talking about likelihood, so I actually 293 00:17:48,460 --> 00:17:50,630 need to put the random variable here. 294 00:17:50,630 --> 00:17:54,750 So I get y minus the derivative of b of theta. 295 00:17:54,750 --> 00:17:56,250 Since it's only a function of theta, 296 00:17:56,250 --> 00:17:58,180 I'm just going to write b prime, is that OK?-- 297 00:17:58,180 --> 00:18:01,450 rather than having the partial with respect to theta. 298 00:18:01,450 --> 00:18:02,932 Then, this is a constant. 299 00:18:02,932 --> 00:18:04,890 This does not depend on theta, so it goes away. 300 00:18:10,200 --> 00:18:15,970 So if I start taking the expectation of this guy, 301 00:18:15,970 --> 00:18:20,200 I get the expectation of this guy, 302 00:18:20,200 --> 00:18:24,960 which is the expectation of y, minus-- well, 303 00:18:24,960 --> 00:18:27,100 this does not depend on y, so it's just itself-- 304 00:18:27,100 --> 00:18:28,390 b prime of theta. 305 00:18:28,390 --> 00:18:30,460 And the whole thing is divided by phi. 306 00:18:30,460 --> 00:18:33,100 But from my first equality over there, 307 00:18:33,100 --> 00:18:35,020 I know that this thing is actually equal to 0. 308 00:18:38,680 --> 00:18:40,740 We just proved that. 309 00:18:40,740 --> 00:18:43,420 So in particular, it means that since phi is non-zero, 310 00:18:43,420 --> 00:18:45,550 it means that this guy must be equal to this guy-- 311 00:18:45,550 --> 00:18:47,500 or phi is not infinity. 
312 00:18:47,500 --> 00:18:50,395 And so that implies that the expectation 313 00:18:50,395 --> 00:18:56,310 with respect to theta of y is equal to b prime of theta. 314 00:19:02,322 --> 00:19:04,280 I'm sorry, you're not registered in this class. 315 00:19:04,280 --> 00:19:07,230 I'm going to have to ask you to leave. 316 00:19:07,230 --> 00:19:09,150 I'm not kidding. 317 00:19:09,150 --> 00:19:10,570 AUDIENCE: [INAUDIBLE] 318 00:19:11,070 --> 00:19:12,520 PHILIPPE RIGOLLET: You are? 319 00:19:12,520 --> 00:19:13,970 I've never seen you here. 320 00:19:13,970 --> 00:19:15,861 I saw you for the first lecture. 321 00:19:15,861 --> 00:19:16,360 OK. 322 00:19:19,120 --> 00:19:23,105 All right, so e theta of y is equal to b prime of theta. 323 00:19:23,105 --> 00:19:24,230 Everybody agrees with that? 324 00:19:27,320 --> 00:19:31,130 So this is actually nice, because if I give you 325 00:19:31,130 --> 00:19:34,190 an exponential family, the only thing I really need to tell 326 00:19:34,190 --> 00:19:36,210 you is what b theta is. 327 00:19:36,210 --> 00:19:38,780 And if I give you b of theta, then computing a derivative 328 00:19:38,780 --> 00:19:42,470 is actually much easier than having to integrate y 329 00:19:42,470 --> 00:19:44,000 against the density itself. 330 00:19:44,000 --> 00:19:46,010 You could really have fun and try 331 00:19:46,010 --> 00:19:48,310 to compute this, which you would be able to do, right? 332 00:19:54,080 --> 00:19:58,840 And then, there's the plus c of y phi, blah, blah, blah-- 333 00:19:58,840 --> 00:19:59,385 dy. 334 00:19:59,385 --> 00:20:01,760 And that's the way you would actually compute this thing. 335 00:20:05,040 --> 00:20:06,680 Sorry, this guy is here. 336 00:20:06,680 --> 00:20:07,940 That would be painful. 
337 00:20:07,940 --> 00:20:10,310 I don't know what this normalization looks like, 338 00:20:10,310 --> 00:20:12,230 so it would have to also explicit that, 339 00:20:12,230 --> 00:20:13,790 so I can actually compute this thing. 340 00:20:13,790 --> 00:20:15,415 And you know, just the same way, if you 341 00:20:15,415 --> 00:20:17,852 want to compute the expectation of a Gaussian-- 342 00:20:17,852 --> 00:20:19,310 well, the expectation of a Gaussian 343 00:20:19,310 --> 00:20:20,750 is not the most difficult one. 344 00:20:20,750 --> 00:20:23,380 But even if you compute the expectation of a Poisson, 345 00:20:23,380 --> 00:20:25,130 you start to have to work in a little bit. 346 00:20:25,130 --> 00:20:27,200 There's a few things that you have to work through. 347 00:20:27,200 --> 00:20:29,199 Here, I'm just telling you, all you have to know 348 00:20:29,199 --> 00:20:30,740 is what b of theta is, and then, you 349 00:20:30,740 --> 00:20:33,190 can just take the derivative. 350 00:20:33,190 --> 00:20:35,750 Let's see what the second equality is going to give us. 351 00:20:56,490 --> 00:21:00,830 OK, so what is the second equality? 352 00:21:00,830 --> 00:21:03,850 It's telling me that if I look at the second derivative, 353 00:21:03,850 --> 00:21:07,410 and then I take its expectation, I'm 354 00:21:07,410 --> 00:21:11,190 going to have something which is equal to negative this guy 355 00:21:11,190 --> 00:21:13,059 squared. 356 00:21:13,059 --> 00:21:14,350 Sorry, that was the log, right? 357 00:21:19,970 --> 00:21:22,700 We've already computed this first derivative 358 00:21:22,700 --> 00:21:24,380 of the likelihood. 359 00:21:24,380 --> 00:21:29,390 It's just the expectation of the square of this thing here. 360 00:21:29,390 --> 00:21:34,070 So expectation of the derivative, 361 00:21:34,070 --> 00:21:38,900 with respect to theta of log, f theta of x, divided 362 00:21:38,900 --> 00:21:44,130 by partial theta squared. 
363 00:21:44,130 --> 00:21:50,040 This is equal to the expectation of the square of y, 364 00:21:50,040 --> 00:21:58,580 minus b of theta, divided by phi squared-- 365 00:21:58,580 --> 00:21:59,720 b prime of theta, squared. 366 00:22:04,375 --> 00:22:06,500 OK, sorry, I'm actually going to move on with the-- 367 00:22:11,100 --> 00:22:13,320 so if I start computing, what is this thing? 368 00:22:13,320 --> 00:22:16,350 Well, we just agreed that this was what? 369 00:22:19,580 --> 00:22:22,920 The expectation of theta, right? 370 00:22:22,920 --> 00:22:27,120 So that's just the expectation of y. 371 00:22:27,120 --> 00:22:28,230 We just computed it here. 372 00:22:31,548 --> 00:22:32,970 AUDIENCE: [INAUDIBLE] 373 00:22:35,744 --> 00:22:37,410 PHILIPPE RIGOLLET: Yeah, that's b prime. 374 00:22:37,410 --> 00:22:39,050 There's a derivative here. 375 00:22:44,760 --> 00:22:47,900 So now, this is what? 376 00:22:47,900 --> 00:22:57,680 This is simply-- anyone? 377 00:23:01,580 --> 00:23:02,810 I'm sorry? 378 00:23:02,810 --> 00:23:05,660 Variance of y, but you're scaling by phi squared. 379 00:23:11,040 --> 00:23:15,390 OK, so this is negative of the right-hand side 380 00:23:15,390 --> 00:23:17,250 of our equality. 381 00:23:17,250 --> 00:23:21,810 And now, I just have to take one more derivative of this guy. 382 00:23:21,810 --> 00:23:27,420 So now, if I look at the left-hand side now, 383 00:23:27,420 --> 00:23:29,920 I have the second derivative 384 00:23:29,920 --> 00:23:38,890 of log, of f theta of y, divided by partial of theta squared. 385 00:23:38,890 --> 00:23:40,680 So this thing is equal to-- 386 00:23:40,680 --> 00:23:42,710 well, no, I'm not left with much. 387 00:23:42,710 --> 00:23:44,630 The y part is going to go away, 388 00:23:44,630 --> 00:23:47,590 and I'm left only with the second derivative of b of theta-- 389 00:23:47,590 --> 00:23:49,930 minus the second derivative of b of theta, divided by phi. 
390 00:24:00,540 --> 00:24:03,790 So if I take expectation-- 391 00:24:03,790 --> 00:24:05,360 well, it just doesn't change. 392 00:24:08,010 --> 00:24:09,320 This is deterministic. 393 00:24:09,320 --> 00:24:11,240 So now, what I've established is that this guy 394 00:24:11,240 --> 00:24:14,010 is equal to negative this guy. 395 00:24:14,010 --> 00:24:17,050 So those two things, the signs are going to go away. 396 00:24:17,050 --> 00:24:20,910 And so this implies that the variance of y 397 00:24:20,910 --> 00:24:25,800 is equal to b prime prime of theta. 398 00:24:25,800 --> 00:24:30,240 And then, I have a phi squared in the denominator 399 00:24:30,240 --> 00:24:33,180 that cancels only one of the phi squares, so it's times phi. 400 00:24:37,480 --> 00:24:41,140 So now, I have that my second derivative-- since I know phi-- 401 00:24:41,140 --> 00:24:46,280 is completely determining the variance. 402 00:24:46,280 --> 00:24:52,470 So basically, that's why b is called the cumulant generating 403 00:24:52,470 --> 00:24:52,970 function. 404 00:24:52,970 --> 00:24:55,430 It's not generating moments, but cumulants. 405 00:24:55,430 --> 00:24:59,180 But cumulants, in this case, correspond, basically, 406 00:24:59,180 --> 00:25:01,060 to the moments, at least for the first two. 407 00:25:01,060 --> 00:25:03,890 If I start going farther, I'm going 408 00:25:03,890 --> 00:25:08,090 to have more combinations of the expectation of y cubed, y squared, 409 00:25:08,090 --> 00:25:09,530 and y itself. 410 00:25:13,150 --> 00:25:14,770 But as we know, those are the ones 411 00:25:14,770 --> 00:25:17,170 that are usually the most useful, at least 412 00:25:17,170 --> 00:25:19,384 if we're interested in asymptotic performance. 
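[Board note, reconstructed.] Putting the two identities together for the canonical family gives the mean and variance formulas just stated. A compact sketch, writing \ell for the log likelihood of one observation Y (the symbol \ell is a label chosen here for convenience):

```latex
\ell(\theta) = \frac{Y\theta - b(\theta)}{\phi} + c(Y,\phi),
\qquad
\ell'(\theta) = \frac{Y - b'(\theta)}{\phi},
\qquad
\ell''(\theta) = -\frac{b''(\theta)}{\phi},
\\[6pt]
\mathbb{E}_\theta[\ell'(\theta)] = 0
\;\Longrightarrow\;
\mathbb{E}_\theta[Y] = b'(\theta),
\\[6pt]
\mathbb{E}_\theta[\ell''(\theta)] = -\,\mathbb{E}_\theta\big[\ell'(\theta)^2\big]
\;\Longrightarrow\;
-\frac{b''(\theta)}{\phi} = -\frac{\operatorname{Var}_\theta(Y)}{\phi^2}
\;\Longrightarrow\;
\operatorname{Var}_\theta(Y) = \phi\, b''(\theta).
```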
413 00:25:19,384 --> 00:25:20,800 The central limit theorem tells us 414 00:25:20,800 --> 00:25:23,380 that all that matters are the first two moments, 415 00:25:23,380 --> 00:25:25,580 and then, the rest is just going to go and say well, 416 00:25:25,580 --> 00:25:26,330 it doesn't matter. 417 00:25:26,330 --> 00:25:29,300 It's all going to [INAUDIBLE] anyway. 418 00:25:29,300 --> 00:25:31,290 So let's go to a Poisson for example. 419 00:25:31,290 --> 00:25:33,518 So if I had a Poisson distribution-- 420 00:25:39,910 --> 00:25:42,430 so this is a discrete distribution. 421 00:25:42,430 --> 00:25:46,390 And what I know is that f-- 422 00:25:46,390 --> 00:25:51,580 let me call mu the parameter of y. 423 00:25:56,290 --> 00:26:01,870 So it's mu to the y, divided by y factorial, exponential 424 00:26:01,870 --> 00:26:02,650 minus mu. 425 00:26:02,650 --> 00:26:04,570 OK so mu is usually called lambda, 426 00:26:04,570 --> 00:26:06,294 and y is usually called x, that's 427 00:26:06,294 --> 00:26:07,960 why it takes me to a little bit of time. 428 00:26:07,960 --> 00:26:09,626 But it usually it's lambda to the x over 429 00:26:09,626 --> 00:26:13,810 factorial x, exponential minus lambda. 430 00:26:13,810 --> 00:26:16,490 Since this is just the series expansion of the exponential 431 00:26:16,490 --> 00:26:19,230 when I sum those things from 0 to infinity, 432 00:26:19,230 --> 00:26:20,949 this thing sums to 1. 
433 00:26:20,949 --> 00:26:22,990 But then, if I wanted to start understanding what 434 00:26:22,990 --> 00:26:25,900 the expectation of this thing is-- 435 00:26:25,900 --> 00:26:28,340 so if I want to understand the expectation with respect 436 00:26:28,340 --> 00:26:33,820 to mu of y, then, I would have to compute the sum 437 00:26:33,820 --> 00:26:48,280 from k equals 0 to infinity of k, times mu to the k, 438 00:26:48,280 --> 00:26:51,587 over factorial of k, exponential minus mu-- 439 00:26:51,587 --> 00:26:53,170 which means that I would, essentially, 440 00:26:53,170 --> 00:27:06,090 have to take the derivative of my series in the end. 441 00:27:06,090 --> 00:27:07,174 So I can do this. 442 00:27:07,174 --> 00:27:08,340 This is a standard exercise. 443 00:27:08,340 --> 00:27:10,630 You've probably done it when you took probability. 444 00:27:10,630 --> 00:27:12,900 But let's see if we can actually just read it off 445 00:27:12,900 --> 00:27:14,760 from the first derivative of b. 446 00:27:14,760 --> 00:27:16,380 So to do that, we need to write this 447 00:27:16,380 --> 00:27:18,850 in the form of an exponential, where there 448 00:27:18,850 --> 00:27:23,220 is one parameter that captures mu, that interacts with y, just 449 00:27:23,220 --> 00:27:25,860 doing this parameter times y, and then something that 450 00:27:25,860 --> 00:27:29,430 depends only on y, and then something that depends only 451 00:27:29,430 --> 00:27:32,979 on mu. 452 00:27:32,979 --> 00:27:34,020 That's the important one. 453 00:27:34,020 --> 00:27:35,550 That's going to be our B. And then, 454 00:27:35,550 --> 00:27:39,150 there's going to be something that depends only on y. 455 00:27:39,150 --> 00:27:42,990 So let's write this and check that this f mu, indeed, 456 00:27:42,990 --> 00:27:46,510 belongs to this canonical exponential family. 457 00:27:46,510 --> 00:27:48,600 So I definitely have an exponential 458 00:27:48,600 --> 00:27:50,075 that comes from this guy. 
459 00:27:50,075 --> 00:27:51,454 So I have minus mu. 460 00:27:51,454 --> 00:27:53,370 And then, this thing is going to give me what? 461 00:27:53,370 --> 00:27:58,260 It's going to give me plus y log mu. 462 00:27:58,260 --> 00:28:02,166 And then, I'm going to have minus log of y factorial. 463 00:28:06,480 --> 00:28:08,790 So clearly, I have a term that depends only 464 00:28:08,790 --> 00:28:11,340 on mu, terms that depend only on y, 465 00:28:11,340 --> 00:28:15,300 and I have a product of y and something that depends on mu. 466 00:28:15,300 --> 00:28:17,820 If I want to be canonical, I must 467 00:28:17,820 --> 00:28:23,650 have this to be exactly the parameter theta itself. 468 00:28:23,650 --> 00:28:27,150 So I'm going to call this guy theta. 469 00:28:27,150 --> 00:28:30,750 So theta is log mu, which means that mu 470 00:28:30,750 --> 00:28:32,592 is equal to e to the theta. 471 00:28:32,592 --> 00:28:34,050 And so wherever I see mu, I'm going 472 00:28:34,050 --> 00:28:36,716 to replace it by e to the theta, because my new parameter 473 00:28:36,716 --> 00:28:38,070 now, is theta. 474 00:28:38,070 --> 00:28:38,880 So this is what? 475 00:28:38,880 --> 00:28:43,490 This is equal to exponential y times theta. 476 00:28:43,490 --> 00:28:47,860 And then, I'm going to have minus e to the theta. 477 00:28:47,860 --> 00:28:51,630 And then, who cares, something that depends only on y. 478 00:28:51,630 --> 00:28:58,330 So this is my c of y, and phi is equal to 1 in this case. 479 00:28:58,330 --> 00:29:00,930 So that's all I care about. 480 00:29:00,930 --> 00:29:01,830 So let's use it. 481 00:29:05,000 --> 00:29:08,770 So this is my canonical exponential family. 482 00:29:08,770 --> 00:29:11,680 Y interacts with theta exactly like this. 483 00:29:11,680 --> 00:29:13,040 And then, I have this function. 484 00:29:13,040 --> 00:29:17,410 So this function here must be b of theta.
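[Editor's note: the board work described above can be written out as follows. This is a reconstruction from the spoken steps, assuming the canonical form exp((y theta - b(theta))/phi + c(y, phi)) introduced at the start of the lecture; the exact notation on the board may have differed.]

```latex
\begin{align*}
f_\mu(y) &= \frac{\mu^{y}}{y!}\,e^{-\mu}
          = \exp\bigl(y\log\mu - \mu - \log(y!)\bigr) \\
         &= \exp\bigl(y\theta - e^{\theta} - \log(y!)\bigr),
            \qquad \theta = \log\mu, \\
b(\theta) &= e^{\theta}, \qquad \phi = 1, \qquad c(y,\phi) = -\log(y!).
\end{align*}
```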
485 00:29:20,080 --> 00:29:22,780 So from this function, exponential theta, 486 00:29:22,780 --> 00:29:25,000 I'm supposed to be able to read what the mean is. 487 00:29:39,820 --> 00:29:41,796 AUDIENCE: [INAUDIBLE] 488 00:29:43,772 --> 00:29:46,990 PHILIPPE RIGOLLET: Because since in this course 489 00:29:46,990 --> 00:29:48,610 I always know what the dispersion is, 490 00:29:48,610 --> 00:29:52,450 I can actually always absorb it into theta from one. 491 00:29:52,450 --> 00:29:54,670 But here, it's really of the form y times something 492 00:29:54,670 --> 00:29:55,720 divided by 1, right? 493 00:30:01,030 --> 00:30:04,805 If it was like log of mu divided by phi, 494 00:30:04,805 --> 00:30:06,430 that would be the question of whether I 495 00:30:06,430 --> 00:30:10,300 want to call phi my dispersion, or if I 496 00:30:10,300 --> 00:30:12,070 want to just have it in there. 497 00:30:16,610 --> 00:30:18,740 This makes no difference in practice. 498 00:30:18,740 --> 00:30:20,860 But the real thing is it's never going 499 00:30:20,860 --> 00:30:22,610 to happen that this thing, this version, 500 00:30:22,610 --> 00:30:23,960 is going to be an exact number. 501 00:30:23,960 --> 00:30:26,240 If it's an actual numerical number, 502 00:30:26,240 --> 00:30:28,580 this just means that this number should be absorbed 503 00:30:28,580 --> 00:30:32,120 in the definition of theta. 504 00:30:32,120 --> 00:30:34,700 But if it's something that is called sigma, say, 505 00:30:34,700 --> 00:30:36,470 and I will assume that sigma is known, 506 00:30:36,470 --> 00:30:39,162 then it's probably preferable to keep it in the dispersion, 507 00:30:39,162 --> 00:30:41,120 so you can see that there's this parameter here 508 00:30:41,120 --> 00:30:44,450 that you can, essentially, play with. 509 00:30:44,450 --> 00:30:48,810 It doesn't make any difference when you know phi. 
510 00:30:48,810 --> 00:30:55,050 So now, if I look at the expectation of some y-- so now, 511 00:30:55,050 --> 00:31:00,419 I'm going to have y, which follows my Poisson mu. 512 00:31:00,419 --> 00:31:01,960 I'm going to look at the expectation, 513 00:31:01,960 --> 00:31:09,210 and I know that the expectation is b prime of theta. 514 00:31:09,210 --> 00:31:09,950 Agreed? 515 00:31:09,950 --> 00:31:14,780 That's what I just erased, I think. 516 00:31:14,780 --> 00:31:17,020 Agreed with this, the derivative? 517 00:31:17,020 --> 00:31:18,550 So what is this? 518 00:31:18,550 --> 00:31:23,050 Well, it's the derivative of e to the theta, which 519 00:31:23,050 --> 00:31:27,270 is e to the theta, which is mu. 520 00:31:27,270 --> 00:31:30,030 So my Poisson is parametrized by its mean. 521 00:31:30,030 --> 00:31:34,850 I can also compute the variance, which 522 00:31:34,850 --> 00:31:40,580 is equal to minus the second derivative of-- 523 00:31:40,580 --> 00:31:42,460 no, it's equal to the second derivative of b. 524 00:31:47,170 --> 00:31:49,090 Dispersion is equal to 1. 525 00:31:49,090 --> 00:31:55,000 Again, if I took phi elsewhere, I would see it here as well. 526 00:31:55,000 --> 00:31:57,760 So if I just absorbed phi here, I would see it divided here, 527 00:31:57,760 --> 00:32:00,040 so it would not make any difference. 528 00:32:00,040 --> 00:32:02,925 And what is the second derivative of the exponential? 529 00:32:06,570 --> 00:32:09,820 Still the exponential-- so it's still equal to mu. 530 00:32:14,760 --> 00:32:17,950 So that certainly makes our life easier. 531 00:32:17,950 --> 00:32:19,620 Just one quick remark-- 532 00:32:31,130 --> 00:32:32,360 here's the function. 533 00:32:32,360 --> 00:32:35,150 I am giving you a problem-- 534 00:32:35,150 --> 00:32:36,710 can the b function-- 535 00:32:39,320 --> 00:32:46,550 can it ever be equal to log of theta? 536 00:32:55,840 --> 00:32:58,310 Who says yes? 537 00:32:58,310 --> 00:33:00,428 Who says no?
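[Editor's note: a minimal numerical sketch of the Poisson computation above, checking that the first and second derivatives of b(theta) = e^theta match the mean and variance obtained by summing the PMF directly. The value mu = 3.5, the truncation at k = 200, the finite-difference step, and all function names are illustrative assumptions, not part of the lecture.]

```python
import math

def poisson_pmf(k, mu):
    # mu^k / k! * exp(-mu), evaluated in log space to avoid overflow in k!
    return math.exp(k * math.log(mu) - math.lgamma(k + 1) - mu)

def b(theta):
    # Function read off from the canonical form of the Poisson: b(theta) = e^theta
    return math.exp(theta)

def derivative(f, x, h=1e-4):
    # Central finite difference approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Central finite difference approximating f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

mu = 3.5               # assumed example value
theta = math.log(mu)   # canonical parameter

# "Standard exercise" route: sum the series, truncated at k = 200
ks = range(200)
mean_direct = sum(k * poisson_pmf(k, mu) for k in ks)
var_direct = sum((k - mean_direct) ** 2 * poisson_pmf(k, mu) for k in ks)

# Exponential-family route: mean = b'(theta), variance = phi * b''(theta), phi = 1
mean_from_b = derivative(b, theta)
var_from_b = second_derivative(b, theta)

print(mean_direct, mean_from_b)
print(var_direct, var_from_b)
```

Both routes should agree with mu up to numerical error, which is the point of reading the moments off b rather than differentiating the series.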
538 00:33:00,428 --> 00:33:02,858 Why? 539 00:33:02,858 --> 00:33:04,802 AUDIENCE: [INAUDIBLE] 540 00:33:09,680 --> 00:33:13,670 PHILIPPE RIGOLLET: Yeah, so what I've learned from this-- 541 00:33:13,670 --> 00:33:16,610 it's sort of completely analytic, right? 542 00:33:16,610 --> 00:33:19,640 So we just took derivatives, and this thing just happened. 543 00:33:19,640 --> 00:33:22,490 This thing actually allowed us to relate the second derivative 544 00:33:22,490 --> 00:33:24,299 of b to the variance. 545 00:33:24,299 --> 00:33:26,090 And one thing that we know about a variance 546 00:33:26,090 --> 00:33:27,920 is that this is non-negative. 547 00:33:27,920 --> 00:33:30,830 And in fact, it's always strictly positive. 548 00:33:30,830 --> 00:33:35,330 If they give you a canonical exponential family that 549 00:33:35,330 --> 00:33:39,260 has zero variance, trust me, you will see it. 550 00:33:39,260 --> 00:33:40,919 That means that this thing is not 551 00:33:40,919 --> 00:33:42,710 going to look like something that's finite, 552 00:33:42,710 --> 00:33:44,280 and it's going to have a point mass. 553 00:33:44,280 --> 00:33:46,280 It's going to take value infinity at one point. 554 00:33:46,280 --> 00:33:48,080 So this will, basically, never happen. 555 00:33:48,080 --> 00:33:50,220 This thing is, actually, strictly positive, 556 00:33:50,220 --> 00:33:52,600 which means that this thing is always strictly convex. 557 00:33:52,600 --> 00:33:55,220 It means that the second derivative of this function, b, 558 00:33:55,220 --> 00:34:00,440 has to be strictly positive, and so that the function is convex. 559 00:34:00,440 --> 00:34:03,005 So this is concave, so this is definitely not working. 560 00:34:03,005 --> 00:34:04,880 I need to have something that looks like this 561 00:34:04,880 --> 00:34:07,920 when I talk about my b. 562 00:34:07,920 --> 00:34:10,500 So theta squared, exponential theta-- 563 00:34:10,500 --> 00:34:13,190 those are the kinds of b we'll see.
564 00:34:13,190 --> 00:34:14,389 And there's a bunch of them. 565 00:34:14,389 --> 00:34:18,280 But if you started writing something, and you find b-- 566 00:34:18,280 --> 00:34:20,230 try to think of the plot of b in your mind, 567 00:34:20,230 --> 00:34:23,980 and you find that b looks like it's going to become concave, 568 00:34:23,980 --> 00:34:25,844 you've made a sign mistake somewhere. 569 00:34:30,110 --> 00:34:33,320 All right, so we've done a pretty big parenthesis 570 00:34:33,320 --> 00:34:37,040 to try to characterize what the distribution of y 571 00:34:37,040 --> 00:34:37,820 was going to be. 572 00:34:37,820 --> 00:34:41,679 We wanted to extend from, say, Gaussian to something else. 573 00:34:41,679 --> 00:34:43,909 But when we're doing regression, which 574 00:34:43,909 --> 00:34:46,520 means generalized linear models, we 575 00:34:46,520 --> 00:34:48,500 are not interested in the distribution of y 576 00:34:48,500 --> 00:34:51,650 but really the conditional distribution of y given x. 577 00:34:51,650 --> 00:34:55,760 So I need now to couple those back together. 578 00:34:55,760 --> 00:34:59,702 So what I know is that this same mu, in this case, 579 00:34:59,702 --> 00:35:01,910 which is the expectation-- what I want to say is that 580 00:35:01,910 --> 00:35:09,740 the conditional expectation of y given x-- 581 00:35:12,710 --> 00:35:15,870 this is some mu of x. 582 00:35:15,870 --> 00:35:17,700 When we did linear models, we said well, 583 00:35:17,700 --> 00:35:21,870 this thing was some x transpose beta for linear models. 584 00:35:26,139 --> 00:35:27,680 And the whole premise of this chapter 585 00:35:27,680 --> 00:35:29,900 is to say well, this might make no sense, 586 00:35:29,900 --> 00:35:32,930 because x transpose beta can take the entire range 587 00:35:32,930 --> 00:35:34,320 of real values. 588 00:35:34,320 --> 00:35:36,870 Whereas, this mu can take only a partial range. 
589 00:35:36,870 --> 00:35:40,550 So even if you actually focus on the Poisson, for example, 590 00:35:40,550 --> 00:35:43,970 we know that the expectation of a Poisson has to be 591 00:35:43,970 --> 00:35:45,910 a non-negative number-- 592 00:35:45,910 --> 00:35:47,660 actually, a positive number as soon as you 593 00:35:47,660 --> 00:35:49,970 have a little bit of variance. 594 00:35:49,970 --> 00:35:52,590 It's mu itself-- mu is a positive number. 595 00:35:52,590 --> 00:35:54,800 And so it's not going to make any sense 596 00:35:54,800 --> 00:35:57,710 to assume that mu of x is equal to x transpose beta, 597 00:35:57,710 --> 00:36:00,710 because you might find some x's for which this value ends up 598 00:36:00,710 --> 00:36:02,302 being negative. 599 00:36:02,302 --> 00:36:03,760 And so we're going to need, what we 600 00:36:03,760 --> 00:36:05,860 call, the link function that relates, 601 00:36:05,860 --> 00:36:08,560 that transforms mu, maps it onto the real line, 602 00:36:08,560 --> 00:36:13,210 so that you can now express it in the form x transpose beta. 603 00:36:13,210 --> 00:36:17,560 So we're going to take not this, but we're 604 00:36:17,560 --> 00:36:21,250 going to assume that g of mu of x 605 00:36:21,250 --> 00:36:24,430 is now equal to x transpose beta, 606 00:36:24,430 --> 00:36:26,140 and that's the generalized linear models. 607 00:36:33,220 --> 00:36:40,650 So as I said, it's weird to transform mu-- 608 00:36:40,650 --> 00:36:43,420 to transform mu to make it cover the real line. 609 00:36:43,420 --> 00:36:44,920 At least to me, it feels a bit more 610 00:36:44,920 --> 00:36:47,530 natural to take x transpose beta and make 611 00:36:47,530 --> 00:36:51,000 it fit to the particular distribution that I want. 612 00:36:51,000 --> 00:36:53,890 And so I'm going to want to talk about g and g inverse 613 00:36:53,890 --> 00:36:55,550 at the same time. 614 00:36:55,550 --> 00:36:59,070 So I'm going to actually take always g.
615 00:36:59,070 --> 00:37:04,920 So g is my link function, and I'm 616 00:37:04,920 --> 00:37:10,036 going to want g to be continuously differentiable. 617 00:37:16,980 --> 00:37:19,020 OK, let's say that it has a derivative, 618 00:37:19,020 --> 00:37:22,630 and its derivative is continuous. 619 00:37:22,630 --> 00:37:28,398 And I'm going to want g to be strictly increasing. 620 00:37:34,770 --> 00:37:39,930 And that actually implies that g inverse exists. 621 00:37:39,930 --> 00:37:43,410 Actually, that's not true. 622 00:37:43,410 --> 00:37:54,505 What I'm also going to want is that g of mu spans-- 623 00:37:57,420 --> 00:37:58,380 how do I do this? 624 00:38:06,090 --> 00:38:09,720 So I want g, as I range over all possible values of mu, 625 00:38:09,720 --> 00:38:11,220 whether they're all positive values, 626 00:38:11,220 --> 00:38:12,969 or whether they're values that are limited 627 00:38:12,969 --> 00:38:15,240 to the interval 0, 1, I want those to span 628 00:38:15,240 --> 00:38:18,340 the entire real line, so that when I want to talk about g 629 00:38:18,340 --> 00:38:20,282 inverse defined over the entire real line, 630 00:38:20,282 --> 00:38:21,240 I know where I started. 631 00:38:24,396 --> 00:38:26,660 So this implies that g inverse exists. 632 00:38:30,200 --> 00:38:32,388 What else does it imply about g inverse? 633 00:38:39,500 --> 00:38:41,270 So for a function to be invertible, 634 00:38:41,270 --> 00:38:43,855 I only need for it to be strictly monotone. 635 00:38:43,855 --> 00:38:45,605 I don't need it to be strictly increasing. 636 00:38:45,605 --> 00:38:47,729 So in particular, the fact that I picked increasing 637 00:38:47,729 --> 00:38:53,360 implies that this guy is actually increasing. 638 00:38:53,360 --> 00:38:54,820 AUDIENCE: [INAUDIBLE] 639 00:38:54,820 --> 00:38:56,320 PHILIPPE RIGOLLET: That's the image.
640 00:39:03,470 --> 00:39:06,830 So this is my link function, and this slide is just telling me 641 00:39:06,830 --> 00:39:08,330 I want my function to be invertible, 642 00:39:08,330 --> 00:39:09,890 so I can talk about g inverse. 643 00:39:09,890 --> 00:39:12,510 I'm going to switch between the two. 644 00:39:12,510 --> 00:39:15,450 So what link functions am I going to get? 645 00:39:15,450 --> 00:39:17,214 So for linear models, we just said 646 00:39:17,214 --> 00:39:18,630 there's no link function, which is 647 00:39:18,630 --> 00:39:20,962 the same as saying that the link function is identity, 648 00:39:20,962 --> 00:39:22,920 which certainly satisfies all these conditions. 649 00:39:22,920 --> 00:39:23,735 It's invertible. 650 00:39:23,735 --> 00:39:25,110 It has all these nice properties, 651 00:39:25,110 --> 00:39:27,540 but might as well not talk about it. 652 00:39:27,540 --> 00:39:29,220 For Poisson data, when we assume that 653 00:39:29,220 --> 00:39:32,250 the conditional distribution of y given x is Poisson, 654 00:39:32,250 --> 00:39:37,200 the mu, as I just said, is required to be positive. 655 00:39:37,200 --> 00:39:43,650 So I need a g that goes from the interval 0 infinity 656 00:39:43,650 --> 00:39:45,540 to the entire real line. 657 00:39:45,540 --> 00:39:47,910 I need a function that starts from one end 658 00:39:47,910 --> 00:39:51,720 and just takes-- not only positive values, 659 00:39:51,720 --> 00:39:54,580 but values split between positive and negative. 660 00:39:54,580 --> 00:39:56,820 And here, for example, I could take the log link. 661 00:39:56,820 --> 00:40:01,260 So the log is defined on this entire interval. 662 00:40:01,260 --> 00:40:04,050 And as I range from 0 to plus infinity, 663 00:40:04,050 --> 00:40:07,440 the log is ranging from negative infinity to plus infinity. 664 00:40:10,382 --> 00:40:12,090 You can probably think of other functions 665 00:40:12,090 --> 00:40:15,510 that do that, like 2 times log.
666 00:40:15,510 --> 00:40:16,860 That's another one. 667 00:40:16,860 --> 00:40:20,170 But there's many others you can think of. 668 00:40:20,170 --> 00:40:21,752 But let's say the log is one of them 669 00:40:21,752 --> 00:40:23,210 that you might want to think about. 670 00:40:32,680 --> 00:40:34,410 It is natural in the sense that it's 671 00:40:34,410 --> 00:40:36,160 one of the first functions we can think of. 672 00:40:36,160 --> 00:40:39,840 We will see, also, that it has another canonical property that 673 00:40:39,840 --> 00:40:42,090 makes it a natural choice. 674 00:40:42,090 --> 00:40:44,130 The other one is the other example, 675 00:40:44,130 --> 00:40:47,520 where we had an even stronger condition on what mu could be. 676 00:40:47,520 --> 00:40:49,620 Mu could only be a number between 0 and 1, 677 00:40:49,620 --> 00:40:52,780 that was the probability of success of a coin flip-- 678 00:40:52,780 --> 00:40:55,290 probability of success of a Bernoulli random variable. 679 00:40:55,290 --> 00:40:59,310 And now, I need g to map 0, 1 to the entire real line. 680 00:40:59,310 --> 00:41:02,490 And so here are a bunch of things 681 00:41:02,490 --> 00:41:04,980 that you can come up with, because now you 682 00:41:04,980 --> 00:41:08,220 start to have maybe-- 683 00:41:08,220 --> 00:41:11,340 I will soon claim that this one, log of mu, 684 00:41:11,340 --> 00:41:14,220 divided by 1 minus mu is the most natural one. 685 00:41:14,220 --> 00:41:16,770 But maybe, if you had never thought of this, 686 00:41:16,770 --> 00:41:18,780 that might not be the first function 687 00:41:18,780 --> 00:41:20,490 you would come up with, right? 688 00:41:20,490 --> 00:41:23,670 You mentioned trigonometric functions, for example, 689 00:41:23,670 --> 00:41:25,980 so maybe, you can come up with something 690 00:41:25,980 --> 00:41:30,960 that comes from hyperbolic trigonometry or something. 691 00:41:30,960 --> 00:41:32,329 So what does this function do?
692 00:41:32,329 --> 00:41:34,370 Well, we'll see a picture, but this function does 693 00:41:34,370 --> 00:41:36,990 map the interval 0, 1 to the entire real line. 694 00:41:36,990 --> 00:41:40,770 We also discussed the fact that if we think reciprocally-- 695 00:41:43,740 --> 00:41:46,110 what I want if I want to think about g inverse, 696 00:41:46,110 --> 00:41:49,140 I want a function that maps the entire real line into the unit 697 00:41:49,140 --> 00:41:49,920 interval. 698 00:41:49,920 --> 00:41:52,650 And as we said, if I'm not a very creative statistician 699 00:41:52,650 --> 00:41:55,960 or probabilist, I can just pick my favorite continuous, 700 00:41:55,960 --> 00:41:59,710 strictly increasing cumulative distribution function, 701 00:41:59,710 --> 00:42:01,350 which as we know, will arise as soon 702 00:42:01,350 --> 00:42:03,060 as I have a density that has support 703 00:42:03,060 --> 00:42:04,830 on the entire real line. 704 00:42:04,830 --> 00:42:07,350 If I have support everywhere, then it means that my-- 705 00:42:12,100 --> 00:42:14,140 it is strictly positive everywhere, then, 706 00:42:14,140 --> 00:42:17,070 it means that my cumulative distribution function 707 00:42:17,070 --> 00:42:18,690 has to be strictly increasing. 708 00:42:18,690 --> 00:42:21,450 And of course, it has to go from 0 to 1, because that's just 709 00:42:21,450 --> 00:42:22,717 the nature of those things. 710 00:42:22,717 --> 00:42:24,550 And so for example, I can take the Gaussian, 711 00:42:24,550 --> 00:42:25,980 that's one such function. 712 00:42:25,980 --> 00:42:28,591 But I could also take the double exponential 713 00:42:28,591 --> 00:42:30,340 that looks like an exponential on one end, 714 00:42:30,340 --> 00:42:32,460 and then an exponential on the other end.
715 00:42:32,460 --> 00:42:39,930 And basically, if you take capital phi, which 716 00:42:39,930 --> 00:42:43,560 is the standard Gaussian cumulative distribution 717 00:42:43,560 --> 00:42:47,460 function, it does work for you, and you can take its inverse. 718 00:42:47,460 --> 00:42:49,160 And in this case, we don't talk about, 719 00:42:49,160 --> 00:42:51,810 so this guy is called logit or logit. 720 00:42:51,810 --> 00:42:53,172 And this guy is called probit. 721 00:42:53,172 --> 00:42:54,630 And you see it, usually, every time 722 00:42:54,630 --> 00:42:58,534 you have a package on generalized linear models. 723 00:42:58,534 --> 00:42:59,700 You are trying to implement. 724 00:42:59,700 --> 00:43:00,970 You have this choice. 725 00:43:00,970 --> 00:43:04,009 And for what's called logistic regression-- so it's funny 726 00:43:04,009 --> 00:43:05,550 that it's called logistic regression, 727 00:43:05,550 --> 00:43:07,410 but you can actually use the probit link, 728 00:43:07,410 --> 00:43:10,620 which in this case, is called probit regression. 729 00:43:10,620 --> 00:43:12,480 But those things are essentially equivalent, 730 00:43:12,480 --> 00:43:14,816 and it's really a matter of taste. 731 00:43:14,816 --> 00:43:16,440 Maybe of communities-- some communities 732 00:43:16,440 --> 00:43:18,140 might prefer one over the other. 733 00:43:18,140 --> 00:43:20,490 We'll see that again, as I claimed 734 00:43:20,490 --> 00:43:24,810 before, the logistic, the logit one 735 00:43:24,810 --> 00:43:29,130 has a slightly more compelling argument for its reason 736 00:43:29,130 --> 00:43:30,152 to exist. 737 00:43:30,152 --> 00:43:31,860 I guess this one, the compelling argument 738 00:43:31,860 --> 00:43:34,770 is that it involved the standard Gaussian, which 739 00:43:34,770 --> 00:43:37,470 of course, is something that should show up everywhere. 740 00:43:37,470 --> 00:43:41,340 And then, you can think about crazy stuff. 
741 00:43:41,340 --> 00:43:43,770 Even crazy stuff gets a name-- 742 00:43:43,770 --> 00:43:48,670 complementary log-log, which is the log of minus log of 1 minus mu. 743 00:43:48,670 --> 00:43:49,170 Why not? 744 00:43:52,890 --> 00:43:56,450 So I guess you can iterate that thing. 745 00:43:56,450 --> 00:43:59,510 You can just put a log 1 minus in front of this thing, 746 00:43:59,510 --> 00:44:01,160 and it's still going to go. 747 00:44:01,160 --> 00:44:07,810 So that's not true. 748 00:44:07,810 --> 00:44:10,377 I have to put a minus and take-- 749 00:44:10,377 --> 00:44:11,210 no, that's not true. 750 00:44:13,707 --> 00:44:15,290 So you can think of whatever you want. 751 00:44:19,320 --> 00:44:21,770 So I claimed that the logit link is the natural choice, so 752 00:44:21,770 --> 00:44:22,970 here's a picture. 753 00:44:22,970 --> 00:44:25,590 I should have actually plotted the other one, 754 00:44:25,590 --> 00:44:27,399 so we can actually compare it. 755 00:44:27,399 --> 00:44:29,690 To be fair, I don't even remember how it would actually 756 00:44:29,690 --> 00:44:32,010 fit into those two functions. 757 00:44:32,010 --> 00:44:35,712 So the blue one, which is this one, for those of you who 758 00:44:35,712 --> 00:44:37,670 don't see the difference between blue and red-- 759 00:44:37,670 --> 00:44:39,300 sorry about that. 760 00:44:39,300 --> 00:44:45,320 So the blue one is the logistic one. 761 00:44:45,320 --> 00:44:49,980 So this guy is the function that does e to the x, over 1 plus 762 00:44:49,980 --> 00:44:50,480 e to the x. 763 00:44:50,480 --> 00:44:51,560 As you can see, this is a function 764 00:44:51,560 --> 00:44:53,600 that's supposed to map the entire real line 765 00:44:53,600 --> 00:44:55,970 into the interval 0, 1. 766 00:44:55,970 --> 00:44:58,220 So that's supposed to be the inverse of your function, 767 00:44:58,220 --> 00:45:00,580 and I claimed that this is the inverse of the logistic-- 768 00:45:00,580 --> 00:45:02,090 of the logit function.
769 00:45:02,090 --> 00:45:04,340 And the other one, well, this is the Gaussian CDF, 770 00:45:04,340 --> 00:45:06,470 so you know it's clearly the inverse of the inverse 771 00:45:06,470 --> 00:45:07,732 of the Gaussian CDF. 772 00:45:07,732 --> 00:45:08,690 And that's the red one. 773 00:45:08,690 --> 00:45:09,939 That's the one that goes here. 774 00:45:12,290 --> 00:45:15,320 I would guess that the complementary log-log is 775 00:45:15,320 --> 00:45:17,600 something that's probably going above here, 776 00:45:17,600 --> 00:45:19,790 and for which the slope is, actually, 777 00:45:19,790 --> 00:45:22,840 even a little flatter as you cross 0. 778 00:45:26,250 --> 00:45:29,119 So of course, these are not our link functions. 779 00:45:29,119 --> 00:45:30,910 These are the inverses of our link functions. 780 00:45:30,910 --> 00:45:32,730 So what do they look like when I actually, 781 00:45:32,730 --> 00:45:36,670 basically, flip my thing like this? 782 00:45:36,670 --> 00:45:38,810 So this is what I see. 783 00:45:38,810 --> 00:45:42,600 And so I can see that in blue, this is my logistic link. 784 00:45:42,600 --> 00:45:46,140 So it crosses 0 with a slightly faster rate. 785 00:45:46,140 --> 00:45:49,830 Remember, if we could use the identity, that 786 00:45:49,830 --> 00:45:51,134 would be very nice to us. 787 00:45:51,134 --> 00:45:52,800 We would just want to take the identity. 788 00:45:52,800 --> 00:45:55,145 The problem is that if I start having 789 00:45:55,145 --> 00:45:56,520 the identity that goes here, it's 790 00:45:56,520 --> 00:45:58,810 going to start being a problem. 791 00:45:58,810 --> 00:46:06,419 And this is the probit link, the phi inverse that you see here. 792 00:46:06,419 --> 00:46:07,335 It's a little flatter. 793 00:46:10,290 --> 00:46:16,599 You can compute the derivative at zero of those guys.
794 00:46:16,599 --> 00:46:17,890 What is the derivative of the-- 795 00:46:21,180 --> 00:46:24,380 So I'm taking the derivative of log of x over 1 minus x. 796 00:46:24,380 --> 00:46:32,010 So it's 1 over x, minus 1 over 1 minus x. 797 00:46:35,850 --> 00:46:39,120 So if I look at 0.5-- 798 00:46:39,120 --> 00:46:40,770 sorry, this is the interval 0, 1. 799 00:46:40,770 --> 00:46:48,070 So I'm interested in the slope at 0.5. 800 00:46:48,070 --> 00:46:49,660 Yes, it's plus, thank you. 801 00:46:49,660 --> 00:46:53,230 So at 0.5, what I get is 2 plus 2. 802 00:46:57,090 --> 00:47:02,650 Yeah, so that's the slope that we get, 803 00:47:02,650 --> 00:47:07,434 and if you compute for the derivative-- 804 00:47:07,434 --> 00:47:09,100 what is the derivative of a phi inverse? 805 00:47:09,100 --> 00:47:13,180 Well, it's 1 divided 806 00:47:13,180 --> 00:47:20,640 by little phi of capital phi inverse of x. 807 00:47:20,640 --> 00:47:24,319 So little phi at 1/2-- 808 00:47:24,319 --> 00:47:24,860 I don't know. 809 00:47:29,450 --> 00:47:30,950 Yeah, I guess I can probably compute 810 00:47:30,950 --> 00:47:32,590 the derivative of the capital phi 811 00:47:32,590 --> 00:47:34,460 at 0, which is going to be just that. 812 00:47:34,460 --> 00:47:37,070 1 over square root of 2 pi, and then just say well, 813 00:47:37,070 --> 00:47:38,870 the slope has to be 1 over that. 814 00:47:42,972 --> 00:47:43,680 Square root 2 pi. 815 00:47:47,130 --> 00:47:50,310 So that's just a comparison, but again, so far, we 816 00:47:50,310 --> 00:47:54,151 do not have any reason to prefer one to the other. 817 00:47:54,151 --> 00:47:56,400 And so now, I'm going to start giving you some reasons 818 00:47:56,400 --> 00:47:58,110 to prefer one to the other. 819 00:47:58,110 --> 00:48:01,260 And one of those two-- 820 00:48:01,260 --> 00:48:03,570 and actually for each canonical family, 821 00:48:03,570 --> 00:48:05,820 there is something which is called the canonical link.
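[Editor's note: a small sketch checking the two slopes computed above, the logit link and the probit link at mu = 0.5. It uses finite differences and the standard library's `statistics.NormalDist().inv_cdf` for the inverse Gaussian CDF; the step size and function names are illustrative assumptions.]

```python
import math
from statistics import NormalDist

def logit(p):
    # Logit link: log(p / (1 - p)), mapping (0, 1) onto the real line
    return math.log(p / (1 - p))

def probit(p):
    # Probit link: inverse of the standard Gaussian CDF
    return NormalDist().inv_cdf(p)

def slope(f, x, h=1e-5):
    # Central finite difference approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

logit_slope = slope(logit, 0.5)    # analytically 1/0.5 + 1/0.5
probit_slope = slope(probit, 0.5)  # analytically 1/phi(0) = sqrt(2*pi)

print(logit_slope)
print(probit_slope, math.sqrt(2 * math.pi))
```

The logit slope is larger than the probit slope at 0.5, which matches the remark that the probit link looks a little flatter on the plot.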
822 00:48:05,820 --> 00:48:07,486 And when you don't have any other reason 823 00:48:07,486 --> 00:48:10,386 to choose anything else, why not choose the canonical one? 824 00:48:10,386 --> 00:48:11,760 And the canonical link is the one 825 00:48:11,760 --> 00:48:19,580 that says OK, what I want is g to map mu onto the real line. 826 00:48:22,980 --> 00:48:28,550 But mu is not the parameter of my canonical family. 827 00:48:28,550 --> 00:48:31,470 Here for example, mu is e to the theta, 828 00:48:31,470 --> 00:48:33,290 but the canonical parameter is theta. 829 00:48:36,050 --> 00:48:39,480 But the parameter of a canonical exponential family 830 00:48:39,480 --> 00:48:42,650 is something that lives in the entire real line. 831 00:48:42,650 --> 00:48:45,510 It was defined for all thetas. 832 00:48:45,510 --> 00:48:50,250 And so in particular, I can just take theta 833 00:48:50,250 --> 00:48:52,950 to be the one that's x transpose beta. 834 00:48:52,950 --> 00:48:54,480 And so in particular, I'm just going 835 00:48:54,480 --> 00:48:57,180 to try to find the link that just says OK, when 836 00:48:57,180 --> 00:49:00,470 I take g of mu, I'm going to map, 837 00:49:00,470 --> 00:49:01,700 so that's what's going to be. 838 00:49:01,700 --> 00:49:05,499 So I know that g of mu is going to be equal to x transpose beta. 839 00:49:05,499 --> 00:49:07,040 And now, what I'm going to say is OK, 840 00:49:07,040 --> 00:49:09,850 let's just take the g that makes this guy equal to theta, 841 00:49:09,850 --> 00:49:11,600 so that it is theta that I actually model 842 00:49:11,600 --> 00:49:14,880 as x transpose beta. 843 00:49:14,880 --> 00:49:17,960 Feels pretty canonical, right? 844 00:49:17,960 --> 00:49:19,880 What else? 845 00:49:19,880 --> 00:49:22,280 What other central, easy choice would you take? 846 00:49:22,280 --> 00:49:23,650 This was pretty easy.
847 00:49:23,650 --> 00:49:27,560 There is a natural parameter for this canonical family, 848 00:49:27,560 --> 00:49:29,780 and it takes value on the entire real line. 849 00:49:29,780 --> 00:49:32,500 I have a function that maps mu onto the entire real line, 850 00:49:32,500 --> 00:49:36,260 so let's just map it to the actual parameter. 851 00:49:36,260 --> 00:49:40,419 So now, OK, why do I have this? 852 00:49:40,419 --> 00:49:41,960 Well, we've already figured that out. 853 00:49:41,960 --> 00:49:44,840 The canonical link function is strictly increasing. 854 00:49:44,840 --> 00:49:49,670 Sorry, so I said that now I want this guy-- 855 00:49:49,670 --> 00:49:57,470 so I want g of mu to be equal to theta, 856 00:49:57,470 --> 00:50:00,560 which is equivalent to saying that I want mu to be 857 00:50:00,560 --> 00:50:03,860 equal to g inverse of theta. 858 00:50:03,860 --> 00:50:08,140 But we know that mu is what-- 859 00:50:08,140 --> 00:50:09,160 b prime of theta. 860 00:50:15,640 --> 00:50:21,090 So that means that b prime is the same function as g inverse. 861 00:50:21,090 --> 00:50:24,570 And I claimed that this is actually giving me, indeed, 862 00:50:24,570 --> 00:50:27,930 a function that has the properties that I want, 863 00:50:27,930 --> 00:50:30,060 because before I said, just pick any function that 864 00:50:30,060 --> 00:50:31,080 has these properties. 865 00:50:31,080 --> 00:50:33,102 And now, I'm giving you a very hard rule 866 00:50:33,102 --> 00:50:34,560 to pick this, though you need still 867 00:50:34,560 --> 00:50:37,050 to check that it satisfies those conditions and particular, 868 00:50:37,050 --> 00:50:39,050 that it's increasing and invertible. 
869 00:50:39,050 --> 00:50:41,050 And so for this to be increasing and invertible, 870 00:50:41,050 --> 00:50:42,630 strictly increasing and invertible, 871 00:50:42,630 --> 00:50:44,880 really what I need is that the inverse is strictly 872 00:50:44,880 --> 00:50:48,070 increasing and invertible, which is the case here, 873 00:50:48,070 --> 00:50:51,220 because b prime as we said-- 874 00:50:51,220 --> 00:50:56,610 well, b prime is the derivative of a strictly convex function. 875 00:50:56,610 --> 00:50:58,749 A strictly convex function has a second derivative 876 00:50:58,749 --> 00:50:59,790 that's strictly positive. 877 00:50:59,790 --> 00:51:01,770 We just figured that out using the fact 878 00:51:01,770 --> 00:51:03,790 that the variance was strictly positive. 879 00:51:03,790 --> 00:51:06,330 And if phi is strictly positive, then this thing 880 00:51:06,330 --> 00:51:08,530 has to be strictly positive. 881 00:51:08,530 --> 00:51:10,650 So if b prime, prime is strictly positive-- 882 00:51:10,650 --> 00:51:13,604 this is the derivative of a function called b prime. 883 00:51:13,604 --> 00:51:15,270 If your derivative is strictly positive, 884 00:51:15,270 --> 00:51:17,670 you are strictly increasing. 885 00:51:17,670 --> 00:51:22,810 And so we know that b prime is, indeed, strictly increasing. 886 00:51:22,810 --> 00:51:26,060 And what I need also to check-- well, 887 00:51:26,060 --> 00:51:28,010 I guess this is already checked on its own, 888 00:51:28,010 --> 00:51:33,560 because b prime is actually mapping all of R 889 00:51:33,560 --> 00:51:37,100 into the possible values. 890 00:51:37,100 --> 00:51:38,870 When theta ranges on the entire real line, 891 00:51:38,870 --> 00:51:41,120 then b prime ranges in the entire interval 892 00:51:41,120 --> 00:51:45,440 of the mean values that it can take. 893 00:51:45,440 --> 00:51:48,030 And so now, I have this thing that's completely defined. 894 00:51:48,030 --> 00:51:50,490 B prime inverse is a valid link.
895 00:51:56,030 --> 00:51:57,450 And it's called a canonical link. 896 00:52:02,470 --> 00:52:05,770 OK, so again, if I give you an exponential family, which 897 00:52:05,770 --> 00:52:09,350 is another way of saying I'll give you a convex function, b, 898 00:52:09,350 --> 00:52:12,400 which gives you some exponential family, 899 00:52:12,400 --> 00:52:15,160 then if you just take b prime inverse, 900 00:52:15,160 --> 00:52:17,770 this gives you the associated canonical link 901 00:52:17,770 --> 00:52:21,590 for this canonical exponential family. 902 00:52:21,590 --> 00:52:26,196 So clearly there's an advantage of doing 903 00:52:26,196 --> 00:52:28,070 this, which is I don't have to actually think 904 00:52:28,070 --> 00:52:30,800 about which one to pick if I don't want to think about it. 905 00:52:30,800 --> 00:52:34,220 But there's other advantages that come to it, 906 00:52:34,220 --> 00:52:36,170 and we'll see that in the representations. 907 00:52:36,170 --> 00:52:38,711 There's, basically, going to be some light cancellations that 908 00:52:38,711 --> 00:52:39,290 show up. 909 00:52:39,290 --> 00:52:40,665 So before we go there, let's just 910 00:52:40,665 --> 00:52:43,790 compute the canonical link for the Bernoulli distribution. 911 00:52:43,790 --> 00:52:46,360 So remember, the Bernoulli distribution 912 00:52:46,360 --> 00:52:55,510 has a PMF, which is part of the canonical exponential family. 913 00:52:55,510 --> 00:53:00,180 So the PMF of the Bernoulli is f theta of x. 914 00:53:06,529 --> 00:53:07,820 Let me just write it like this. 915 00:53:07,820 --> 00:53:12,470 So it's p to the y, let's say-- one minus p 916 00:53:12,470 --> 00:53:16,700 to the 1 minus y, which I will write 917 00:53:16,700 --> 00:53:28,910 as exponential y log p, plus 1 minus y, log 1 minus p. 918 00:53:28,910 --> 00:53:30,750 OK, we've done that last time. 
919 00:53:30,750 --> 00:53:32,670 Now, I'm going to group my terms in y 920 00:53:32,670 --> 00:53:37,530 to see how y interacts with this parameter p. 921 00:53:37,530 --> 00:53:40,710 And what I'm getting is y times log of p 922 00:53:40,710 --> 00:53:42,540 divided by 1 minus p. 923 00:53:42,540 --> 00:53:47,040 And then, the only term that remains is log of 1 minus p. 924 00:53:50,370 --> 00:53:53,710 Now, I want this to be a canonical exponential family, 925 00:53:53,710 --> 00:53:56,880 which means that I just need to call this guy, 926 00:53:56,880 --> 00:53:58,710 so it is part of the exponential family. 927 00:53:58,710 --> 00:53:59,460 You can read that. 928 00:53:59,460 --> 00:54:04,520 If I want it to be canonical, this guy must be theta itself. 929 00:54:04,520 --> 00:54:11,180 So I have that theta is equal to log of p over 1 minus p. 930 00:54:11,180 --> 00:54:12,800 If I invert this thing, it tells me 931 00:54:12,800 --> 00:54:16,880 that p is e to the theta, divided by 1 plus e 932 00:54:16,880 --> 00:54:18,434 to the theta. 933 00:54:18,434 --> 00:54:19,850 It's just inverting this function. 934 00:54:23,550 --> 00:54:28,140 In particular, it means that log of 1 minus p 935 00:54:28,140 --> 00:54:31,900 is equal to log of 1 minus this thing. 936 00:54:31,900 --> 00:54:33,520 So the exponential thetas go away. 937 00:54:33,520 --> 00:54:39,350 So in the numerator, this is what I get. 938 00:54:39,350 --> 00:54:44,870 That's the log of 1 minus this guy, which is equal to minus log of 1 939 00:54:44,870 --> 00:54:46,010 plus e to the theta. 940 00:54:50,790 --> 00:54:52,540 So I'm going a bit too fast, but these are 941 00:54:52,540 --> 00:54:56,230 very elementary manipulations-- 942 00:54:56,230 --> 00:54:59,220 maybe, it requires one more line to convince yourself. 943 00:54:59,220 --> 00:55:05,940 But just do it in the comfort of your room.
944 00:55:05,940 --> 00:55:11,210 And then, what you have is the exponential of y times theta, 945 00:55:11,210 --> 00:55:16,850 and then, I have minus log of 1 plus e to the theta. 946 00:55:16,850 --> 00:55:20,960 So this is the representation of the 947 00:55:20,960 --> 00:55:23,990 PMF of a Bernoulli distribution, 948 00:55:23,990 --> 00:55:27,680 as a member of the canonical exponential family. 949 00:55:27,680 --> 00:55:33,530 And it tells me that b of theta is equal to log of 1 950 00:55:33,530 --> 00:55:36,510 plus e to the theta. 951 00:55:36,510 --> 00:55:38,140 That's what I have there. 952 00:55:38,140 --> 00:55:41,790 From there, I can compute the expectation, which hopefully, 953 00:55:41,790 --> 00:55:46,170 I'm going to get p as the mean and p times 1 954 00:55:46,170 --> 00:55:47,759 minus p as the variance. 955 00:55:47,759 --> 00:55:49,050 Otherwise, that would be weird. 956 00:55:51,960 --> 00:55:55,840 So let's just do this. 957 00:55:55,840 --> 00:56:00,950 B prime of theta should give me the mean. 958 00:56:00,950 --> 00:56:04,010 And indeed, b prime of theta is e to the theta, 959 00:56:04,010 --> 00:56:08,060 divided by 1 plus e to the theta, which is exactly 960 00:56:08,060 --> 00:56:09,290 this p that I had there. 961 00:56:14,850 --> 00:56:18,350 OK just for fun-- 962 00:56:18,350 --> 00:56:19,200 well, I don't know. 963 00:56:19,200 --> 00:56:20,520 Maybe, that's not part of it. 964 00:56:20,520 --> 00:56:22,820 Yeah, let's not compute the second derivative. 965 00:56:22,820 --> 00:56:25,800 That's probably going to be on your homework at some point-- 966 00:56:25,800 --> 00:56:29,440 if not, on the final. 967 00:56:29,440 --> 00:56:32,890 So b prime now-- 968 00:56:32,890 --> 00:56:34,120 oh, I erased it, of course. 969 00:56:37,300 --> 00:56:39,380 G, the canonical link is b prime inverse.
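[Editor's note: the board work for the Bernoulli case, spoken aloud above, can be written out as follows. This is a reconstruction from the spoken steps, taking the dispersion phi equal to 1 and no extra normalization term; the second-derivative line is not computed in the lecture and is added here only as a check against the known variance p(1 - p).]

```latex
\begin{align*}
f_\theta(y) &= p^{y}(1-p)^{1-y}
  = \exp\!\Big( y \log\frac{p}{1-p} + \log(1-p) \Big), \qquad y \in \{0,1\},\\
\theta &= \log\frac{p}{1-p}
  \;\Longleftrightarrow\; p = \frac{e^{\theta}}{1+e^{\theta}},
  \qquad \log(1-p) = -\log\!\big(1+e^{\theta}\big),\\
f_\theta(y) &= \exp\!\big( y\,\theta - b(\theta) \big),
  \qquad b(\theta) = \log\!\big(1+e^{\theta}\big),\\
b'(\theta) &= \frac{e^{\theta}}{1+e^{\theta}} = p,
  \qquad b''(\theta) = \frac{e^{\theta}}{(1+e^{\theta})^{2}} = p\,(1-p).
\end{align*}
```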
970 00:56:42,520 --> 00:56:44,770 And I claim that this is going to give me 971 00:56:44,770 --> 00:56:48,910 the logit function, log of mu, over 1 minus mu. 972 00:56:48,910 --> 00:56:50,480 So let's check that. 973 00:56:50,480 --> 00:56:54,236 So b prime is this thing, so now, 974 00:56:54,236 --> 00:56:55,360 I want to find the inverse. 975 00:57:02,180 --> 00:57:05,360 Well, I should really call my inverse a function of p. 976 00:57:05,360 --> 00:57:06,750 And I've done it before-- 977 00:57:06,750 --> 00:57:08,930 all I have to do is to solve this equation, which 978 00:57:08,930 --> 00:57:10,400 I've actually just done, that's 979 00:57:10,400 --> 00:57:11,890 where I'm actually coming from. 980 00:57:11,890 --> 00:57:14,510 So it's actually telling me that the solution of this thing 981 00:57:14,510 --> 00:57:17,941 is equal to log of p over 1 minus p. 982 00:57:25,810 --> 00:57:28,540 We just solve this thing both ways. 983 00:57:28,540 --> 00:57:38,520 And this is, indeed, logit of p by definition of logit. 984 00:57:38,520 --> 00:57:40,470 So b prime inverse, this function that 985 00:57:40,470 --> 00:57:42,440 seemed to come out of nowhere, is really 986 00:57:42,440 --> 00:57:45,030 just the inverse of b prime, which we know 987 00:57:45,030 --> 00:57:46,200 is the canonical link. 988 00:57:46,200 --> 00:57:49,200 And canonical is some sort of ad hoc choice 989 00:57:49,200 --> 00:57:53,040 that we've made by saying let's just take the link, such that g 990 00:57:53,040 --> 00:57:57,819 of mu is giving me the actual canonical parameter theta. 991 00:57:57,819 --> 00:57:58,785 Yeah? 992 00:57:58,785 --> 00:58:00,717 AUDIENCE: [INAUDIBLE] 993 00:58:02,197 --> 00:58:03,530 PHILIPPE RIGOLLET: You're right. 994 00:58:08,520 --> 00:58:11,550 Now, of course, I'm going through all this trouble, 995 00:58:11,550 --> 00:58:13,470 but you could see it immediately. 996 00:58:13,470 --> 00:58:16,550 I know this is going to be theta.
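[Editor's note: a small numerical sanity check of the claim that the logit is the inverse of b prime for the Bernoulli family. This is only an illustrative sketch; the function names and the finite-difference helper are choices made here, not part of the lecture.]

```python
import math

def b(theta):
    # Bernoulli cumulant function: b(theta) = log(1 + e^theta)
    return math.log1p(math.exp(theta))

def b_prime(theta):
    # Derivative of b, which should equal the mean p
    return math.exp(theta) / (1.0 + math.exp(theta))

def logit(mu):
    # Candidate canonical link: log(mu / (1 - mu))
    return math.log(mu / (1.0 - mu))

def numerical_derivative(f, x, eps=1e-6):
    # Central finite difference, used only to double-check b_prime
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for theta in (-2.0, -0.5, 0.0, 1.0, 3.0):
    mu = b_prime(theta)
    # b_prime really is the derivative of b
    assert abs(numerical_derivative(b, theta) - mu) < 1e-6
    # logit undoes b_prime, so g = (b')^{-1} is the logit
    assert abs(logit(mu) - theta) < 1e-9
```

The loop runs silently when both identities hold at the sampled values of theta.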
997 00:58:16,550 --> 00:58:19,380 We also have prior knowledge, hopefully, 998 00:58:19,380 --> 00:58:23,520 that the expectation of a Bernoulli is p itself. 999 00:58:23,520 --> 00:58:25,760 So right at this step, when I say 1000 00:58:25,760 --> 00:58:28,070 that I'm going to take theta to be this guy, 1001 00:58:28,070 --> 00:58:32,959 I already knew that the canonical link was the logit-- 1002 00:58:32,959 --> 00:58:34,500 because I just said oh, here's theta. 1003 00:58:34,500 --> 00:58:37,356 And it's just this function of mu [INAUDIBLE]. 1004 00:58:41,100 --> 00:58:43,539 OK, so you can do that for a bunch of examples, 1005 00:58:43,539 --> 00:58:45,330 and this is what they're going to give you. 1006 00:58:45,330 --> 00:58:47,820 So the Gaussian case, b of theta-- 1007 00:58:47,820 --> 00:58:49,760 we've actually computed it, actually, once. 1008 00:58:49,760 --> 00:58:51,290 This is theta squared over 2. 1009 00:58:51,290 --> 00:58:53,130 So the derivative of this thing is really 1010 00:58:53,130 --> 00:58:56,970 just theta, which means that g or g inverse is actually 1011 00:58:56,970 --> 00:58:59,280 equal to the identity. 1012 00:58:59,280 --> 00:59:02,760 And again, sanity check-- 1013 00:59:02,760 --> 00:59:04,410 when I'm in the Gaussian case, there's 1014 00:59:04,410 --> 00:59:06,780 nothing general about generalized linear models 1015 00:59:06,780 --> 00:59:09,040 if you don't have a link. 1016 00:59:09,040 --> 00:59:12,390 The Poisson case-- you can actually check. 1017 00:59:12,390 --> 00:59:13,480 Did we do this, actually? 1018 00:59:13,480 --> 00:59:14,350 Yes we did. 1019 00:59:14,350 --> 00:59:16,570 So that's when we had this e to the theta. 1020 00:59:16,570 --> 00:59:19,960 And so b is e to the theta, which means that the natural link is 1021 00:59:19,960 --> 00:59:24,790 the inverse, which is log, which is the inverse of exponential.
1022 00:59:24,790 --> 00:59:29,680 And so that's the logarithm link, which as I said, 1023 00:59:29,680 --> 00:59:31,560 I used the word natural. 1024 00:59:31,560 --> 00:59:33,610 You can also use the word canonical 1025 00:59:33,610 --> 00:59:35,740 if you want to describe this function as being 1026 00:59:35,740 --> 00:59:38,620 the right function to map the positive real line 1027 00:59:38,620 --> 00:59:40,959 to the entire real line. 1028 00:59:40,959 --> 00:59:42,250 The Bernoulli-- we just did it. 1029 00:59:42,250 --> 00:59:46,930 So b-- the cumulant generating function is log of 1 1030 00:59:46,930 --> 00:59:52,990 plus e to the theta, and the link is log of mu over 1 minus mu. 1031 00:59:52,990 --> 00:59:57,520 And the gamma case-- where 1032 00:59:57,520 --> 01:00:00,700 the thing you're going to see is minus log of minus [INAUDIBLE]. 1033 01:00:00,700 --> 01:00:04,030 You see the reciprocal link is the link that actually 1034 01:00:04,030 --> 01:00:08,045 shows up, so minus 1 over mu. 1035 01:00:08,045 --> 01:00:08,545 That maps. 1036 01:00:35,690 --> 01:00:40,400 So are there any questions about the canonical links, 1037 01:00:40,400 --> 01:00:42,532 canonical families? 1038 01:00:42,532 --> 01:00:45,020 I use the word canonical a lot. 1039 01:00:45,020 --> 01:00:48,929 But is everything fitting together right now? 1040 01:00:48,929 --> 01:00:49,970 So we have this function. 1041 01:00:49,970 --> 01:00:53,060 We have canonical exponential family, by assumption. 1042 01:00:53,060 --> 01:00:54,800 It has a function, b, which contains 1043 01:00:54,800 --> 01:00:56,552 every information we want. 1044 01:00:56,552 --> 01:00:58,010 At the beginning of the lecture, we 1045 01:00:58,010 --> 01:00:59,468 established that it has information 1046 01:00:59,468 --> 01:01:01,310 about the mean in the first derivative, 1047 01:01:01,310 --> 01:01:03,143 about the variance in the second derivative.
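[Editor's note: the four canonical links listed above can be checked numerically by composing each link g with the corresponding b prime and verifying that the result is the identity on theta. This is an illustrative sketch; the dictionary layout and the sample values of theta are assumptions made here, and the gamma entry assumes the standard canonical form b(theta) = -log(-theta) with theta negative.]

```python
import math

# Each entry: (b_prime, canonical link g, sample thetas in the natural parameter range)
FAMILIES = {
    # b(theta) = theta^2 / 2, so b'(theta) = theta and g is the identity
    "gaussian": (lambda t: t, lambda mu: mu, [-2.0, 0.0, 1.5]),
    # b(theta) = e^theta, so b'(theta) = e^theta and g is log
    "poisson": (lambda t: math.exp(t), lambda mu: math.log(mu), [-2.0, 0.0, 1.5]),
    # b(theta) = log(1 + e^theta), so b' is the sigmoid and g is the logit
    "bernoulli": (
        lambda t: 1.0 / (1.0 + math.exp(-t)),
        lambda mu: math.log(mu / (1.0 - mu)),
        [-2.0, 0.0, 1.5],
    ),
    # b(theta) = -log(-theta) for theta < 0, so b'(theta) = -1/theta and g(mu) = -1/mu
    "gamma": (lambda t: -1.0 / t, lambda mu: -1.0 / mu, [-3.0, -1.0, -0.2]),
}

def max_roundtrip_error(name):
    # Largest |g(b'(theta)) - theta| over the sample thetas for one family
    b_prime, g, thetas = FAMILIES[name]
    return max(abs(g(b_prime(t)) - t) for t in thetas)

for family in FAMILIES:
    assert max_roundtrip_error(family) < 1e-9
```

A round-trip error near zero for every family is what "g is b prime inverse" means in practice.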
1048 01:01:03,143 --> 01:01:05,210 And it's also giving us a canonical link. 1049 01:01:05,210 --> 01:01:08,035 So just cherish this b once you've found it, 1050 01:01:08,035 --> 01:01:09,410 because it's everything you need. 1051 01:01:09,410 --> 01:01:09,909 Yeah? 1052 01:01:09,909 --> 01:01:11,750 AUDIENCE: [INAUDIBLE] 1053 01:01:15,962 --> 01:01:19,342 PHILIPPE RIGOLLET: I don't know, a political preference? 1054 01:01:24,710 --> 01:01:26,730 I don't know, honestly. 1055 01:01:26,730 --> 01:01:29,570 If I were a serious practitioner, 1056 01:01:29,570 --> 01:01:31,700 I probably would have a better answer for you. 1057 01:01:31,700 --> 01:01:32,870 At this point, I just don't. 1058 01:01:32,870 --> 01:01:34,244 I think it's a matter of practice 1059 01:01:34,244 --> 01:01:36,860 and actual preferences. 1060 01:01:36,860 --> 01:01:38,426 You can also try both. 1061 01:01:38,426 --> 01:01:39,800 We didn't mention it, but there's 1062 01:01:39,800 --> 01:01:41,360 this idea of cross-validation-- well, 1063 01:01:41,360 --> 01:01:43,610 we mentioned it without going too much into detail. 1064 01:01:43,610 --> 01:01:46,460 But you could try both and see which one performs 1065 01:01:46,460 --> 01:01:48,617 best on a yet unseen data set. 1066 01:01:48,617 --> 01:01:51,200 In terms of prediction, just say I prefer this one of the two, 1067 01:01:51,200 --> 01:01:53,450 because this actually comes as part of your modeling 1068 01:01:53,450 --> 01:01:56,090 assumption, right? 1069 01:01:56,090 --> 01:01:59,630 Not only did you decide to model the image of mu 1070 01:01:59,630 --> 01:02:03,057 through the link function as a linear model, but really 1071 01:02:03,057 --> 01:02:03,890 what you're saying-- 1072 01:02:03,890 --> 01:02:05,750 your model is saying well, you have 1073 01:02:05,750 --> 01:02:07,860 two pieces of [INAUDIBLE], the distribution of y.
1074 01:02:07,860 --> 01:02:10,340 But you also have the fact that mu 1075 01:02:10,340 --> 01:02:14,870 is modeled as g inverse of x transpose beta. 1076 01:02:14,870 --> 01:02:17,120 And for different g's, this is just different modeling 1077 01:02:17,120 --> 01:02:18,380 assumptions, right? 1078 01:02:18,380 --> 01:02:25,930 So why should this be linear-- 1079 01:02:25,930 --> 01:02:26,610 I don't know. 1080 01:02:29,470 --> 01:02:32,740 My authority as a person who has not 1081 01:02:32,740 --> 01:02:34,780 examined the [INAUDIBLE] data sets 1082 01:02:34,780 --> 01:02:38,050 for both things would be that the changes are fairly minor. 1083 01:02:42,270 --> 01:02:45,420 OK, so this was all for one observation. 1084 01:02:45,420 --> 01:02:49,350 We just, basically, did probability. 1085 01:02:49,350 --> 01:02:52,620 We described some density, some properties of the densities, 1086 01:02:52,620 --> 01:02:53,940 how to compute expectations. 1087 01:02:53,940 --> 01:02:55,314 That was really just probability. 1088 01:02:55,314 --> 01:02:57,240 There was no data involved at any point. 1089 01:02:57,240 --> 01:03:00,330 We did a bit of modeling, but it was all for one observation. 1090 01:03:00,330 --> 01:03:01,710 What we're going to try to do now 1091 01:03:01,710 --> 01:03:06,360 is given the reverse engineering to probability 1092 01:03:06,360 --> 01:03:08,310 that is statistics, given data, what 1093 01:03:08,310 --> 01:03:10,780 can I infer about my model? 1094 01:03:10,780 --> 01:03:12,370 Now remember, there's three parameters 1095 01:03:12,370 --> 01:03:15,040 that are floating around in this model. 1096 01:03:15,040 --> 01:03:18,190 There is one that was theta. 1097 01:03:18,190 --> 01:03:21,689 There is one that was mu, and there is one that is beta. 1098 01:03:21,689 --> 01:03:23,230 OK, so those are the three parameters 1099 01:03:23,230 --> 01:03:25,110 that are floating around. 
1100 01:03:25,110 --> 01:03:32,550 What we said is that the expectation of y, given x, 1101 01:03:32,550 --> 01:03:34,980 is mu of x. 1102 01:03:34,980 --> 01:03:37,950 So if I estimate mu, I know the conditional expectation of y, 1103 01:03:37,950 --> 01:03:44,960 given x, which definitely gives me theta of x. 1104 01:03:44,960 --> 01:03:46,830 How do I go from mu of x to theta of x? 1105 01:03:55,080 --> 01:03:58,010 The inverse of what-- 1106 01:03:58,010 --> 01:03:59,890 of the arrow? 1107 01:03:59,890 --> 01:04:07,290 Yeah, sure, but how do I go from this guy to this guy? 1108 01:04:07,290 --> 01:04:08,860 So theta as a function of mu is? 1109 01:04:12,556 --> 01:04:13,792 AUDIENCE: [INAUDIBLE] 1110 01:04:13,792 --> 01:04:15,250 PHILIPPE RIGOLLET: Yeah, so we just 1111 01:04:15,250 --> 01:04:18,760 computed that mu was b prime of theta. 1112 01:04:18,760 --> 01:04:23,260 So it means that theta is just b prime inverse of mu. 1113 01:04:23,260 --> 01:04:24,910 So those two things are the same as far 1114 01:04:24,910 --> 01:04:27,580 as we're concerned, because we know that b prime is strictly 1115 01:04:27,580 --> 01:04:29,020 increasing, so it's invertible. 1116 01:04:29,020 --> 01:04:31,560 So it's just a matter of re-parametrization, 1117 01:04:31,560 --> 01:04:34,420 and we just can switch from one to the other whenever we want. 1118 01:04:34,420 --> 01:04:36,754 But why do we go through mu? Because so far 1119 01:04:36,754 --> 01:04:38,170 for the entire semester I told you 1120 01:04:38,170 --> 01:04:39,150 there was one parameter that's theta. 1121 01:04:39,150 --> 01:04:41,420 It does not have to be the mean, and that's the parameter 1122 01:04:41,420 --> 01:04:42,130 that we care about. 1123 01:04:42,130 --> 01:04:43,700 It's the one on which we want to do inference. 1124 01:04:43,700 --> 01:04:45,580 That's the one for which we're going to compute the Fisher 1125 01:04:45,580 --> 01:04:46,360 information.
1126 01:04:46,360 --> 01:04:49,572 This was the parameter that was our object of worship. 1127 01:04:49,572 --> 01:04:51,280 And now, I'm saying oh, I'm going to have 1128 01:04:51,280 --> 01:04:53,200 mu that's coming around. 1129 01:04:53,200 --> 01:04:55,270 And why do we have mu? Because this is the mu 1130 01:04:55,270 --> 01:04:58,390 that we use to go to beta. 1131 01:04:58,390 --> 01:05:06,360 So I can go freely from theta to mu using b prime or b 1132 01:05:06,360 --> 01:05:07,600 prime inverse. 1133 01:05:07,600 --> 01:05:11,080 And now, I can go from mu to beta, 1134 01:05:11,080 --> 01:05:19,120 because I have that g of mu of x is beta transpose x. 1135 01:05:19,120 --> 01:05:21,130 So in the end, now, this is going 1136 01:05:21,130 --> 01:05:22,360 to be my object of worship. 1137 01:05:22,360 --> 01:05:24,318 This is going to be the parameter that matters. 1138 01:05:24,318 --> 01:05:27,910 Because once I set beta, I set everything else 1139 01:05:27,910 --> 01:05:30,290 through this chain. 1140 01:05:30,290 --> 01:05:33,010 So the question is if I start stacking up this pile 1141 01:05:33,010 --> 01:05:36,260 of parameters-- so I start with my beta, 1142 01:05:36,260 --> 01:05:38,520 which in turn gives me a mu, which in turn, 1143 01:05:38,520 --> 01:05:39,580 gives me a theta-- 1144 01:05:39,580 --> 01:05:43,720 can I just have a long, streamlined-- 1145 01:05:43,720 --> 01:05:45,640 what is the outcome when I actually 1146 01:05:45,640 --> 01:05:48,016 start writing my likelihood, not as a function of theta, 1147 01:05:48,016 --> 01:05:50,140 not as a function of mu, but as a function of beta, 1148 01:05:50,140 --> 01:05:52,720 which is the one at the end of the chain? 1149 01:05:52,720 --> 01:05:55,540 And hopefully, things are going to happen nicely, 1150 01:05:55,540 --> 01:05:56,292 and they might not. 1151 01:05:56,292 --> 01:05:56,792 Yeah?
1152 01:05:56,792 --> 01:05:58,702 AUDIENCE: [INAUDIBLE] 1153 01:06:02,076 --> 01:06:03,680 PHILIPPE RIGOLLET: Is G-- 1154 01:06:03,680 --> 01:06:04,800 that's my link. 1155 01:06:04,800 --> 01:06:06,710 G of mu of x-- 1156 01:06:06,710 --> 01:06:09,320 now, mu is a function of x, because it's conditional on x. 1157 01:06:12,200 --> 01:06:17,000 So this is really theta of x, mu of x, 1158 01:06:17,000 --> 01:06:21,100 but b is not a function of x, because it's just something 1159 01:06:21,100 --> 01:06:22,965 that tells me what the function of x is. 1160 01:06:22,965 --> 01:06:24,865 AUDIENCE: [INAUDIBLE] 1161 01:06:26,074 --> 01:06:28,240 PHILIPPE RIGOLLET: Mu is the conditional expectation 1162 01:06:28,240 --> 01:06:29,770 of y, given x. 1163 01:06:29,770 --> 01:06:33,010 It has, actually, a fancy name in the statistics literature. 1164 01:06:33,010 --> 01:06:36,989 It's called-- anybody knows the name of the function, mu 1165 01:06:36,989 --> 01:06:39,280 of x, which is a conditional expectation of y, given x? 1166 01:06:42,116 --> 01:06:43,960 AUDIENCE: [INAUDIBLE] 1167 01:06:43,960 --> 01:06:46,120 PHILIPPE RIGOLLET: That's the regression function. 1168 01:06:46,120 --> 01:06:47,230 That's the actual definition. 1169 01:06:47,230 --> 01:06:48,970 If I tell you what is the definition of the regression 1170 01:06:48,970 --> 01:06:51,011 function, that's just the conditional expectation 1171 01:06:51,011 --> 01:06:52,970 of y, given x. 1172 01:06:52,970 --> 01:06:58,720 And I could look at any property of the conditional distribution 1173 01:06:58,720 --> 01:07:00,020 of y given x. 1174 01:07:00,020 --> 01:07:02,639 I could look at the conditional 95th percentile. 1175 01:07:02,639 --> 01:07:04,180 I can look at the conditional median. 1176 01:07:04,180 --> 01:07:06,450 I can look at the conditional [INAUDIBLE] range. 1177 01:07:06,450 --> 01:07:08,470 I can look at the conditional variance.
1178 01:07:08,470 --> 01:07:12,300 But I decide to look at the conditional expectation, which 1179 01:07:12,300 --> 01:07:15,429 is called the regression function. 1180 01:07:18,363 --> 01:07:19,341 Yes? 1181 01:07:19,341 --> 01:07:21,297 AUDIENCE: [INAUDIBLE] 1182 01:07:24,231 --> 01:07:26,290 PHILIPPE RIGOLLET: Oh, there's no transpose here. 1183 01:07:26,290 --> 01:07:28,700 Actually, only Victor-Emmanuel used this prime for transpose, 1184 01:07:28,700 --> 01:07:30,710 and I found it confusing with the derivatives. 1185 01:07:30,710 --> 01:07:33,306 So a prime here is only a derivative. 1186 01:07:33,306 --> 01:07:34,623 AUDIENCE: [INAUDIBLE] 1187 01:07:35,122 --> 01:07:38,640 PHILIPPE RIGOLLET: Oh, yeah, sorry, beta transpose x. 1188 01:07:38,640 --> 01:07:40,350 So you said what? 1189 01:07:40,350 --> 01:07:43,245 I said that g of mu of x is beta transpose x? 1190 01:07:43,245 --> 01:07:45,145 AUDIENCE: [INAUDIBLE] 1191 01:07:48,035 --> 01:07:49,910 PHILIPPE RIGOLLET: Isn't that the same thing? 1192 01:07:52,510 --> 01:07:53,970 X is a vector here, right? 1193 01:07:53,970 --> 01:07:54,930 AUDIENCE: Yeah. 1194 01:07:54,930 --> 01:07:56,555 PHILIPPE RIGOLLET: So x transpose beta, 1195 01:07:56,555 --> 01:08:00,348 and beta transpose x are the same thing. 1196 01:08:00,348 --> 01:08:02,280 AUDIENCE: [INAUDIBLE] 1197 01:08:03,979 --> 01:08:05,770 PHILIPPE RIGOLLET: So beta looks like this. 1198 01:08:05,770 --> 01:08:08,706 X looks like this. 1199 01:08:08,706 --> 01:08:12,420 It's just a simple number. 1200 01:08:12,420 --> 01:08:13,386 Yeah, you're right. 1201 01:08:13,386 --> 01:08:15,010 I'm going to start to look at matrices. 1202 01:08:15,010 --> 01:08:18,189 I'm going to have to be slightly more careful when I do this. 1203 01:08:18,189 --> 01:08:20,740 OK so let's do the reverse engineering. 1204 01:08:20,740 --> 01:08:22,199 I'm giving you data.
1205 01:08:22,199 --> 01:08:23,740 From this data, hopefully, you should 1206 01:08:23,740 --> 01:08:26,979 be able to get what the conditional-- if I give you 1207 01:08:26,979 --> 01:08:29,630 an infinite amount of data, you would know exactly, 1208 01:08:29,630 --> 01:08:33,819 of pairs xy, what the conditional distribution of y 1209 01:08:33,819 --> 01:08:36,130 given x is. 1210 01:08:36,130 --> 01:08:37,770 And in particular, you would know 1211 01:08:37,770 --> 01:08:40,560 what the conditional expectation of y given x 1212 01:08:40,560 --> 01:08:42,359 is, which means that you would know mu, 1213 01:08:42,359 --> 01:08:44,192 which means that you would know theta, which 1214 01:08:44,192 --> 01:08:45,920 means that you would know beta. 1215 01:08:45,920 --> 01:08:48,600 Now, when I have a finite number of observations, 1216 01:08:48,600 --> 01:08:50,910 I'm going to try to estimate mu of x. 1217 01:08:50,910 --> 01:08:53,250 But really, I'm going to go the other way around. 1218 01:08:53,250 --> 01:08:56,279 Because the fact that I assume, specifically, that mu of x 1219 01:08:56,279 --> 01:09:00,510 is of the form g of mu of x is x transpose beta, then that 1220 01:09:00,510 --> 01:09:02,850 means that I only have to estimate beta, which 1221 01:09:02,850 --> 01:09:06,432 is a much simpler object than the entire regression function. 1222 01:09:06,432 --> 01:09:07,890 So that's what I'm going to go for. 1223 01:09:07,890 --> 01:09:10,330 I'm going to try to represent the likelihood, the log 1224 01:09:10,330 --> 01:09:12,890 likelihood, of my data as a function, not of theta, 1225 01:09:12,890 --> 01:09:15,390 not of mu, but of beta-- 1226 01:09:15,390 --> 01:09:18,120 and then, maximize that guy. 1227 01:09:18,120 --> 01:09:21,870 So now, rather than thinking of just one observation, 1228 01:09:21,870 --> 01:09:23,940 I'm going to have a bunch of observations. 
1229 01:09:27,100 --> 01:09:29,069 So this might actually look a little confusing, 1230 01:09:29,069 --> 01:09:32,189 but let's just make sure that we understand each other 1231 01:09:32,189 --> 01:09:33,700 before we go any further. 1232 01:09:33,700 --> 01:09:38,510 So I'm going to have observations, 1233 01:09:38,510 --> 01:09:43,359 x1, y1, all the way to xn, yn, just 1234 01:09:43,359 --> 01:09:45,310 like in a natural regression problem, 1235 01:09:45,310 --> 01:09:49,180 except that here my y's might be 0 one valued. 1236 01:09:49,180 --> 01:09:50,649 They might be positive valued. 1237 01:09:50,649 --> 01:09:51,732 They might be exponential. 1238 01:09:51,732 --> 01:09:54,600 They might be anything in the canonical exponential family. 1239 01:09:57,840 --> 01:09:59,950 OK so I have this thing, and now, 1240 01:09:59,950 --> 01:10:01,900 what I have is that my observations are x1, 1241 01:10:01,900 --> 01:10:03,310 y1, xn, yn. 1242 01:10:03,310 --> 01:10:06,460 And what I want is that I'm going 1243 01:10:06,460 --> 01:10:11,640 to assume that the conditional expectation of yi, given-- 1244 01:10:14,980 --> 01:10:18,710 the conditional distribution of yi, given xi, 1245 01:10:18,710 --> 01:10:20,390 is something that has density. 1246 01:10:30,070 --> 01:10:31,473 Did I put an i on y-- yeah. 1247 01:10:42,820 --> 01:10:45,920 I'm not going to deal with the phi and the c now. 1248 01:10:45,920 --> 01:10:48,610 And why do I have theta i and not theta 1249 01:10:48,610 --> 01:11:01,350 is because theta i is really a function of xi. 1250 01:11:01,350 --> 01:11:05,270 So it's really theta i of xi. 1251 01:11:05,270 --> 01:11:07,240 But what do I know about theta i of xi, 1252 01:11:07,240 --> 01:11:11,890 it's actually equal to b-- 1253 01:11:11,890 --> 01:11:13,920 I did this error twice-- 1254 01:11:13,920 --> 01:11:16,450 b prime inverse of mu of xi. 
1255 01:11:30,620 --> 01:11:34,160 And I'm going to assume that this is of the form beta 1256 01:11:34,160 --> 01:11:36,190 transpose xi. 1257 01:11:36,190 --> 01:11:37,810 And this is why I have theta i-- 1258 01:11:37,810 --> 01:11:40,414 is because this theta i is a function of xi, 1259 01:11:40,414 --> 01:11:42,830 and I'm going to assume a very simple form for this thing. 1260 01:11:46,030 --> 01:11:48,747 Sorry, sorry, sorry, sorry-- 1261 01:11:48,747 --> 01:11:50,080 I should not write it like this. 1262 01:11:50,080 --> 01:11:51,980 This is only when I have the canonical link. 1263 01:11:51,980 --> 01:11:57,310 So this is actually equal to b prime inverse of g, 1264 01:11:57,310 --> 01:11:59,650 of xi transpose beta. 1265 01:12:05,010 --> 01:12:07,754 Sorry, g inverse-- those two things 1266 01:12:07,754 --> 01:12:09,170 are actually canceling each other. 1267 01:12:13,760 --> 01:12:17,735 So as before, I'm going to stack everything into some-- 1268 01:12:17,735 --> 01:12:20,360 well, actually, I'm not going to stack anything for the moment. 1269 01:12:20,360 --> 01:12:22,151 I'm just going to give you a peek at what's 1270 01:12:22,151 --> 01:12:28,010 happening next week, rather than just manipulating the data. 1271 01:12:28,010 --> 01:12:33,810 So here is how we're going to proceed at this point. 1272 01:12:33,810 --> 01:12:36,540 Well now, I want to write my likelihood function, 1273 01:12:36,540 --> 01:12:39,270 not as a function of theta, but as a function of beta, 1274 01:12:39,270 --> 01:12:44,270 because that's the parameter I'm actually trying to maximize. 1275 01:12:44,270 --> 01:12:47,050 So if I have a link-- 1276 01:12:47,050 --> 01:12:50,455 so this thing that matters here, I'm going to call h. 1277 01:12:53,600 --> 01:12:58,190 By definition, this is going to be h of xi transpose beta. 1278 01:12:58,190 --> 01:13:00,080 Helena, you have a question? 
1279 01:13:00,080 --> 01:13:02,069 AUDIENCE: Uh, no [INAUDIBLE] 1280 01:13:02,069 --> 01:13:04,110 PHILIPPE RIGOLLET: So this is just all the things 1281 01:13:04,110 --> 01:13:04,930 that we know. 1282 01:13:04,930 --> 01:13:09,150 Theta is just-- by definition of the fact that mu 1283 01:13:09,150 --> 01:13:11,505 is b prime of theta, the mean is b prime of theta-- 1284 01:13:11,505 --> 01:13:14,250 it means that theta is b prime inverse of mu. 1285 01:13:14,250 --> 01:13:19,190 And then, mu is modeled from the systematic component. 1286 01:13:19,190 --> 01:13:21,940 G of mu is xi transpose beta, so this is 1287 01:13:21,940 --> 01:13:23,590 g inverse of xi transpose beta. 1288 01:13:23,590 --> 01:13:27,810 So I want to have b prime inverse of g inverse. 1289 01:13:27,810 --> 01:13:30,030 This function is a bit annoying to say, 1290 01:13:30,030 --> 01:13:32,750 so I'm just going to call it h. 1291 01:13:32,750 --> 01:13:34,837 And when I do the composition of two inverses, 1292 01:13:34,837 --> 01:13:36,920 it's the inverse of the composition of those two things 1293 01:13:36,920 --> 01:13:38,280 in the reverse order-- 1294 01:13:38,280 --> 01:13:42,140 so h is really the inverse of g composed with b 1295 01:13:42,140 --> 01:13:46,677 prime, g of b prime, inverse. 1296 01:13:46,677 --> 01:13:48,260 And now, if I have the canonical link, 1297 01:13:48,260 --> 01:13:51,200 since I know that g is b prime inverse, 1298 01:13:51,200 --> 01:13:54,180 this is really just the identity. 1299 01:13:54,180 --> 01:13:58,109 As you can imagine, this entire thing, 1300 01:13:58,109 --> 01:13:59,650 which is actually quite complicated-- 1301 01:13:59,650 --> 01:14:01,750 would just say oh, this thing, actually, does not show up 1302 01:14:01,750 --> 01:14:03,041 when I have the canonical link. 1303 01:14:03,041 --> 01:14:06,370 I really just have that theta can be replaced by xi transpose beta. 1304 01:14:06,370 --> 01:14:09,280 So think about going back to this guy here.
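[Editor's note: the chain of maps just described can be summarized as follows. This is a reconstruction from the spoken derivation, using the same symbols (b, g, h, theta, mu, beta) as the lecture.]

```latex
\begin{align*}
\theta_i &= (b')^{-1}\big(\mu(x_i)\big)
          = (b')^{-1}\Big(g^{-1}\big(x_i^\top \beta\big)\Big)
          = h\big(x_i^\top \beta\big),
\qquad h := (b')^{-1}\circ g^{-1} = (g \circ b')^{-1},\\[4pt]
\text{canonical link: } g &= (b')^{-1}
  \;\Longrightarrow\; g \circ b' = \mathrm{id}
  \;\Longrightarrow\; h = \mathrm{id}
  \;\Longrightarrow\; \theta_i = x_i^\top \beta .
\end{align*}
```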
1305 01:14:09,280 --> 01:14:15,160 Now, theta becomes only xi transpose beta. 1306 01:14:15,160 --> 01:14:18,425 That's going to be much more simple to optimize, 1307 01:14:18,425 --> 01:14:20,550 because remember, when I'm going to log likelihood, 1308 01:14:20,550 --> 01:14:21,841 this thing is going to go away. 1309 01:14:21,841 --> 01:14:23,020 I'm going to sum those guys. 1310 01:14:23,020 --> 01:14:24,310 And so what I'm going to have is something which 1311 01:14:24,310 --> 01:14:26,140 is essentially linear in beta. 1312 01:14:26,140 --> 01:14:28,340 And then, I'm going to have this minus b, 1313 01:14:28,340 --> 01:14:31,760 which is just minus the sum of convex functions of beta. 1314 01:14:31,760 --> 01:14:34,220 And so I'm going to have to bring in the tools of convex 1315 01:14:34,220 --> 01:14:34,860 optimization. 1316 01:14:34,860 --> 01:14:37,566 Now, it's not just going to be take the gradient, set it to 0. 1317 01:14:37,566 --> 01:14:39,440 It's going to be more complicated to do that. 1318 01:14:39,440 --> 01:14:42,320 I'm going to have to do that in an iterative fashion. 1319 01:14:42,320 --> 01:14:43,800 And so that's what I'm telling you, 1320 01:14:43,800 --> 01:14:46,400 when you look at your log likelihood for all 1321 01:14:46,400 --> 01:14:47,330 those functions. 1322 01:14:47,330 --> 01:14:50,062 You sum, the exponential goes away because you had the log, 1323 01:14:50,062 --> 01:14:51,770 and then, you have all these things here. 1324 01:14:51,770 --> 01:14:52,660 I kept the b. 1325 01:14:52,660 --> 01:14:53,990 I kept the h. 1326 01:14:53,990 --> 01:14:56,690 But if h is the identity, this is the linear function, 1327 01:14:56,690 --> 01:14:59,210 the linear part, yi times xi transpose 1328 01:14:59,210 --> 01:15:03,776 beta, minus b of my theta, which is now only xi transpose beta. 1329 01:15:03,776 --> 01:15:05,900 And that's the function I want to maximize in beta. 1330 01:15:10,370 --> 01:15:11,390 It's a convex function. 
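[Editor's note: a minimal sketch of the log-likelihood as a function of beta, with the phi and c(y, phi) terms dropped as in the lecture. The helper names, the toy data, and the default identity h (the canonical-link case) are assumptions made for illustration; the Bernoulli b is used as the example family.]

```python
import math

def dot(u, v):
    # Inner product x_i^T beta for plain Python lists
    return sum(u_k * v_k for u_k, v_k in zip(u, v))

def log_likelihood(beta, X, y, b, h=lambda t: t):
    # Sum over i of y_i * theta_i - b(theta_i), with theta_i = h(x_i^T beta).
    # The default h is the identity, which is the canonical-link case.
    total = 0.0
    for x_i, y_i in zip(X, y):
        theta_i = h(dot(x_i, beta))
        total += y_i * theta_i - b(theta_i)
    return total

def b_bernoulli(theta):
    # Bernoulli cumulant function: log(1 + e^theta)
    return math.log1p(math.exp(theta))

# Toy data (hypothetical): first column is an intercept
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 2.0]]
y = [0, 1, 1]

# At beta = 0 every theta_i is 0, so each term is -log(2)
ll_zero = log_likelihood([0.0, 0.0], X, y, b_bernoulli)
assert abs(ll_zero + 3 * math.log(2.0)) < 1e-12
```

With a non-canonical link, one would pass the corresponding h explicitly; the linear-minus-convex structure in beta is then lost in general.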
1331 01:15:11,390 --> 01:15:15,130 When I know what b is, I have an explicit formula for this, 1332 01:15:15,130 --> 01:15:18,230 and I want to just bring in some optimization. 1333 01:15:18,230 --> 01:15:19,682 And that's what we're going to do, 1334 01:15:19,682 --> 01:15:21,890 and we're going to see three different methods, which 1335 01:15:21,890 --> 01:15:24,110 are really, basically, the same method. 1336 01:15:24,110 --> 01:15:28,760 It's just an adaptation or specialization 1337 01:15:28,760 --> 01:15:31,550 of the so-called Newton-Raphson method, which is essentially 1338 01:15:31,550 --> 01:15:34,735 telling you do iterative local quadratic approximations 1339 01:15:34,735 --> 01:15:36,360 to your function-- so second order 1340 01:15:36,360 --> 01:15:38,480 [INAUDIBLE] expansion, minimize this guy, 1341 01:15:38,480 --> 01:15:41,060 and then do it again from where you were. 1342 01:15:41,060 --> 01:15:43,460 And we'll see that this can be, actually, 1343 01:15:43,460 --> 01:15:47,210 implemented using what's called iteratively re-weighted least 1344 01:15:47,210 --> 01:15:49,640 squares, which means that every step-- 1345 01:15:49,640 --> 01:15:51,200 since it's just a quadratic, it's 1346 01:15:51,200 --> 01:15:53,090 going to be just squares in there-- 1347 01:15:53,090 --> 01:15:56,190 can actually be solved by using a weighted least 1348 01:15:56,190 --> 01:15:59,420 squares version of the problem. 1349 01:15:59,420 --> 01:16:02,270 So I'm going to stop here for today. 1350 01:16:02,270 --> 01:16:05,930 So we'll continue and probably not finish this chapter, 1351 01:16:05,930 --> 01:16:07,440 but finish next week. 1352 01:16:07,440 --> 01:16:10,670 And then, I think there's only one lecture. 1353 01:16:10,670 --> 01:16:13,310 Actually, for the last lecture, what do you guys want to do? 1354 01:16:16,320 --> 01:16:18,460 Do you want to have doughnuts and cider?
1355 01:16:18,460 --> 01:16:25,620 Do you want to just have some more outlooking lecture 1356 01:16:25,620 --> 01:16:31,390 on what's happening post 1975 in statistics? 1357 01:16:31,390 --> 01:16:36,130 Do you want to have a review for the final exam-- 1358 01:16:36,130 --> 01:16:38,970 pragmatic people. 1359 01:16:38,970 --> 01:16:43,300 AUDIENCE: [INAUDIBLE] interesting, advanced topics. 1360 01:16:43,300 --> 01:16:46,100 PHILIPPE RIGOLLET: You want to do interesting, advanced-- 1361 01:16:46,100 --> 01:16:48,200 for the last lecture? 1362 01:16:48,200 --> 01:16:50,470 AUDIENCE: Something that we haven't thought of yet. 1363 01:16:50,470 --> 01:16:53,920 PHILIPPE RIGOLLET: Yeah, that's, basically, what I'm asking, 1364 01:16:53,920 --> 01:16:55,420 right-- interesting advanced topics, 1365 01:16:55,420 --> 01:17:00,694 versus ask me any question you want. 1366 01:17:00,694 --> 01:17:03,110 Those questions can be about interesting, advanced topics, 1367 01:17:03,110 --> 01:17:03,850 though. 1368 01:17:03,850 --> 01:17:06,020 Like, what are interesting, advanced topics? 1369 01:17:06,020 --> 01:17:06,547 I'm sorry? 1370 01:17:06,547 --> 01:17:08,630 AUDIENCE: Interesting with doughnuts-- is that OK? 1371 01:17:08,630 --> 01:17:10,963 PHILIPPE RIGOLLET: Yeah, we can always do the doughnuts. 1372 01:17:10,963 --> 01:17:11,838 [LAUGHTER] 1373 01:17:11,838 --> 01:17:14,792 AUDIENCE: As long as there are doughnuts. 1374 01:17:14,792 --> 01:17:16,750 PHILIPPE RIGOLLET: All right, so we'll do that. 1375 01:17:16,750 --> 01:17:19,500 So you guys have a good weekend.