1 00:00:00,120 --> 00:00:02,460 The following content is provided under a Creative 2 00:00:02,460 --> 00:00:03,850 Commons license. 3 00:00:03,850 --> 00:00:06,090 Your support will help MIT OpenCourseWare 4 00:00:06,090 --> 00:00:10,180 continue to offer high quality educational resources for free. 5 00:00:10,180 --> 00:00:12,720 To make a donation or to view additional materials 6 00:00:12,720 --> 00:00:16,680 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,680 --> 00:00:19,914 and ocw.mit.edu. 8 00:00:19,914 --> 00:00:22,080 PHILIPPE RIGOLLET: The chapter is a natural capstone 9 00:00:22,080 --> 00:00:24,840 chapter for this entire course. 10 00:00:24,840 --> 00:00:26,760 We'll see some of the things we've 11 00:00:26,760 --> 00:00:29,910 seen during maximum likelihood and some of the things 12 00:00:29,910 --> 00:00:34,080 we've seen during linear regression, some of the things 13 00:00:34,080 --> 00:00:37,020 we've seen in terms of the basic modeling that we've had before. 14 00:00:37,020 --> 00:00:39,655 We're not going to go back to much inference questions. 15 00:00:39,655 --> 00:00:41,280 It's really going to be about modeling. 16 00:00:41,280 --> 00:00:44,355 And in a way, generalized linear models, as the word says, 17 00:00:44,355 --> 00:00:47,010 are just a generalization of linear models. 18 00:00:47,010 --> 00:00:49,300 And they're actually extremely useful. 19 00:00:49,300 --> 00:00:51,900 They're often forgotten about and people just 20 00:00:51,900 --> 00:00:54,720 jump onto machine learning and sophisticated techniques. 21 00:00:54,720 --> 00:00:57,384 But those things do the job quite well. 22 00:00:57,384 --> 00:00:59,550 So let's see in what sense they are a generalization 23 00:00:59,550 --> 00:01:02,250 of the linear models. 24 00:01:02,250 --> 00:01:05,400 So remember, the linear model looked like this. 
25 00:01:05,400 --> 00:01:13,030 We said that y was equal to x transpose beta plus epsilon, 26 00:01:13,030 --> 00:01:13,530 right? 27 00:01:13,530 --> 00:01:15,960 That was our linear regression model. 28 00:01:15,960 --> 00:01:19,330 And it's-- another way to say this is that if-- 29 00:01:19,330 --> 00:01:20,970 and let's assume that those were, say, 30 00:01:20,970 --> 00:01:25,230 Gaussian with mean 0 and identity covariance matrix. 31 00:01:25,230 --> 00:01:26,730 Then another way to say this is that 32 00:01:26,730 --> 00:01:32,700 the conditional distribution of y given x is equal to-- 33 00:01:32,700 --> 00:01:39,690 sorry, is a Gaussian with mean x transpose beta and variance-- 34 00:01:39,690 --> 00:01:43,440 well, we had a sigma squared, which I will forget as usual-- 35 00:01:43,440 --> 00:01:46,080 x transpose beta and then sigma squared. 36 00:01:46,080 --> 00:01:50,550 OK, so here, we just assumed that-- so what regression is 37 00:01:50,550 --> 00:01:54,630 is just saying I'm trying to explain y as a function of x. 38 00:01:54,630 --> 00:01:57,660 Given x, I'm assuming a distribution for the y. 39 00:01:57,660 --> 00:01:59,460 And this x is just going to be here 40 00:01:59,460 --> 00:02:05,430 to help me model what the mean of this Gaussian is, right? 41 00:02:05,430 --> 00:02:07,720 I mean, I could have something crazy. 42 00:02:07,720 --> 00:02:13,570 I could have something that looks like y given 43 00:02:13,570 --> 00:02:17,560 x is N of x transpose beta. 44 00:02:17,560 --> 00:02:19,660 And then this could be some other thing 45 00:02:19,660 --> 00:02:22,780 which looks like, I don't know, some x transpose 46 00:02:22,780 --> 00:02:26,950 gamma squared times, I don't know, 47 00:02:26,950 --> 00:02:30,350 x, x transpose plus identity-- 48 00:02:30,350 --> 00:02:33,250 some crazy thing that depends on x here, right?
49 00:02:33,250 --> 00:02:37,570 And we deliberately assumed that everything that depends on x 50 00:02:37,570 --> 00:02:39,820 shows up in the mean, OK? 51 00:02:39,820 --> 00:02:42,520 And so what I have here is that y 52 00:02:42,520 --> 00:02:45,640 given x is a Gaussian with a mean that 53 00:02:45,640 --> 00:02:51,240 depends on x and covariance matrix sigma squared identity. 54 00:02:51,240 --> 00:02:54,699 Now the linear model assumed a very specific form 55 00:02:54,699 --> 00:02:55,240 for the mean. 56 00:02:55,240 --> 00:02:59,190 It said I want the mean to be equal to x 57 00:02:59,190 --> 00:03:01,050 transpose beta which, remember, was 58 00:03:01,050 --> 00:03:10,270 the sum from, say, j equals 1 to p of beta j xj, right? 59 00:03:10,270 --> 00:03:13,240 It's where the xj's are the coordinates of x. 60 00:03:13,240 --> 00:03:16,050 But I could do something also more complicated, right? 61 00:03:16,050 --> 00:03:19,170 I could have something that looks like instead, 62 00:03:19,170 --> 00:03:28,990 replace this by, I don't know, sum of beta j log of x to the j 63 00:03:28,990 --> 00:03:34,450 divided by x to the j squared or something like this, right? 64 00:03:34,450 --> 00:03:37,360 I could do this as well. 65 00:03:37,360 --> 00:03:39,630 So there's two things that we have assumed. 66 00:03:39,630 --> 00:03:41,520 The first one is that when I look 67 00:03:41,520 --> 00:03:43,440 at the conditional distribution of y given x, 68 00:03:43,440 --> 00:03:45,570 x affects only the mean. 69 00:03:45,570 --> 00:03:47,394 I also assume that it was Gaussian 70 00:03:47,394 --> 00:03:48,810 and that it affects only the mean. 71 00:03:48,810 --> 00:03:51,130 And the mean is affected in a very specific way, 72 00:03:51,130 --> 00:03:53,670 which is linear in x, right? 73 00:03:53,670 --> 00:03:56,270 So these are essentially the things 74 00:03:56,270 --> 00:03:58,230 we're going to try to relax.
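[Editor's sketch] The two assumptions just listed can be written compactly for a single observation. The notation below (capital Y and X, scalar variance sigma squared per observation, p coordinates) is a standard choice assumed here for illustration, not taken from the board.

```latex
Y \mid X = x \;\sim\; \mathcal{N}\bigl(\mu(x),\, \sigma^2\bigr),
\qquad
\mu(x) \;=\; x^\top \beta \;=\; \sum_{j=1}^{p} \beta_j x_j .
```

The first statement is the Gaussian assumption, in which x enters only through the mean. The second is the linear form of that mean.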
75 00:03:58,230 --> 00:03:59,670 So the first thing that we assume, 76 00:03:59,670 --> 00:04:03,300 the fact that y was Gaussian and had only its mean [INAUDIBLE] 77 00:04:03,300 --> 00:04:07,140 dependent on x is what's called the random component. 78 00:04:07,140 --> 00:04:09,435 It just says that the response variables, you know, 79 00:04:09,435 --> 00:04:13,990 it sort of makes sense to assume that they're Gaussian. 80 00:04:13,990 --> 00:04:17,220 And everything was essentially captured, right? 81 00:04:17,220 --> 00:04:18,779 So there's this property of Gaussians 82 00:04:18,779 --> 00:04:22,069 that if you tell me-- if the variance is known, 83 00:04:22,069 --> 00:04:23,610 all you need to tell me to understand 84 00:04:23,610 --> 00:04:25,950 exactly what the distribution of a Gaussian is, 85 00:04:25,950 --> 00:04:29,110 all you need to tell me is its expected value. 86 00:04:29,110 --> 00:04:31,730 All right, so that's this mu of x. 87 00:04:31,730 --> 00:04:35,570 And the second thing is that we have this link that says, 88 00:04:35,570 --> 00:04:38,600 well, I need to find a way to use my x's to explain 89 00:04:38,600 --> 00:04:40,370 this mu, and the link was exactly 90 00:04:40,370 --> 00:04:42,390 mu of x was equal to x transpose beta. 91 00:04:45,770 --> 00:04:51,140 Now we are talking about generalized linear models. 92 00:04:51,140 --> 00:04:56,150 So this part here where mu of x is of the form-- the way 93 00:04:56,150 --> 00:05:00,620 I want my beta, my x, to show up is linear, 94 00:05:00,620 --> 00:05:03,380 this will never be a question. 95 00:05:03,380 --> 00:05:06,030 In principle, I could add a third point, 96 00:05:06,030 --> 00:05:10,250 which is just question this part, the fact that mu of x 97 00:05:10,250 --> 00:05:11,310 is x transpose beta. 98 00:05:11,310 --> 00:05:13,640 I could have some more complicated, nonlinear function 99 00:05:13,640 --> 00:05:14,300 of x.
100 00:05:14,300 --> 00:05:15,740 And then we'll never do that because we're talking 101 00:05:15,740 --> 00:05:17,100 about generalized linear models. 102 00:05:17,100 --> 00:05:20,640 The only things we generalize are the random component, 103 00:05:20,640 --> 00:05:23,330 the conditional distribution of y given x, 104 00:05:23,330 --> 00:05:26,870 and the link that just says, well, once you actually tell me 105 00:05:26,870 --> 00:05:29,540 that the only thing I need to figure out is the mean, 106 00:05:29,540 --> 00:05:32,180 I'm just going to slap in exactly this x transpose beta 107 00:05:32,180 --> 00:05:36,520 thing without any transformation of x transpose beta. 108 00:05:36,520 --> 00:05:37,750 So those are the two things. 109 00:05:40,300 --> 00:05:42,260 It will become clear what I mean. 110 00:05:42,260 --> 00:05:44,450 This sounds like a tautology, but let's just 111 00:05:44,450 --> 00:05:46,730 see how we could extend that. 112 00:05:46,730 --> 00:05:50,140 So what we're going to do in generalized linear models-- 113 00:05:50,140 --> 00:05:55,482 right, so when I talk about GLMs, 114 00:05:55,482 --> 00:05:57,190 the first thing I'm going to do with my x 115 00:05:57,190 --> 00:05:59,330 is turn it into some x transpose beta. 116 00:05:59,330 --> 00:06:02,372 And that's just the L part, right? 117 00:06:02,372 --> 00:06:03,830 I'm not going to be able to change. 118 00:06:03,830 --> 00:06:05,030 That's the way it works. 119 00:06:05,030 --> 00:06:07,530 I'm not going to do anything non-linear. 120 00:06:07,530 --> 00:06:09,780 But the two things I'm going to change 121 00:06:09,780 --> 00:06:16,430 is this random component, which is 122 00:06:16,430 --> 00:06:21,410 that y, which used to be some Gaussian with mean mu of x 123 00:06:21,410 --> 00:06:24,200 here and sigma squared-- 124 00:06:24,200 --> 00:06:26,990 so y given x, sorry-- 125 00:06:26,990 --> 00:06:35,770 this is going to become y given x follows some distribution.
126 00:06:35,770 --> 00:06:37,690 And I'm not going to allow any distribution. 127 00:06:37,690 --> 00:06:40,900 I want something that comes from the exponential family. 128 00:06:49,910 --> 00:06:52,400 Who knows what the exponential family of distributions is? 129 00:06:52,400 --> 00:06:55,970 This is not the same thing as the exponential distribution. 130 00:06:55,970 --> 00:06:58,970 It's a family of distributions. 131 00:06:58,970 --> 00:07:00,495 All right, so we'll see that. 132 00:07:00,495 --> 00:07:01,770 It's-- wow. 133 00:07:04,560 --> 00:07:06,420 What can that be? 134 00:07:06,420 --> 00:07:08,194 Oh yeah, that's actually [INAUDIBLE]. 135 00:07:11,638 --> 00:07:17,050 So-- I'm sorry? 136 00:07:17,050 --> 00:07:19,527 AUDIENCE: [INAUDIBLE] 137 00:07:19,527 --> 00:07:21,360 PHILIPPE RIGOLLET: I'm in presentation mode. 138 00:07:21,360 --> 00:07:23,650 That should not happen. 139 00:07:23,650 --> 00:07:25,130 OK, so hopefully, this is muted. 140 00:07:29,390 --> 00:07:32,300 So essentially, this is going to be a family of distributions. 141 00:07:32,300 --> 00:07:34,442 And what makes them exponential typically 142 00:07:34,442 --> 00:07:35,900 is that there's an exponential that 143 00:07:35,900 --> 00:07:39,020 shows up in the definition of the density, all right? 144 00:07:39,020 --> 00:07:41,000 We'll see that the Gaussian belongs 145 00:07:41,000 --> 00:07:42,560 to the exponential family. 146 00:07:42,560 --> 00:07:44,210 But there are slightly less expected ones 147 00:07:44,210 --> 00:07:48,570 because there's this crazy thing that a to the x 148 00:07:48,570 --> 00:07:52,327 is exponential of x log a, which makes the exponential show up 149 00:07:52,327 --> 00:07:53,160 without being there. 150 00:07:53,160 --> 00:07:54,910 So if there's an exponential of some power, 151 00:07:54,910 --> 00:07:55,830 it's going to show up. 152 00:07:55,830 --> 00:07:56,640 But it's more than that.
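[Editor's sketch] The identity a^x = exp(x log a) is what lets a density with no visible exponential be rewritten with one. As an illustration, the Poisson probability mass function is used below; that choice of distribution and the function names are assumptions made for this example, not something stated at this point in the lecture.

```python
import math


def poisson_pmf_direct(k, lam):
    """Poisson pmf in its textbook form: lam^k * exp(-lam) / k!."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def poisson_pmf_exponential_form(k, lam):
    """Same pmf with everything moved inside one exponential.

    lam^k is rewritten as exp(k * log(lam)) and 1/k! as exp(-log(k!)),
    where log(k!) is computed as lgamma(k + 1).
    """
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


if __name__ == "__main__":
    for k in range(5):
        print(k, poisson_pmf_direct(k, 2.5), poisson_pmf_exponential_form(k, 2.5))
```

The two columns printed agree up to floating-point rounding.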
153 00:07:56,640 --> 00:07:58,639 So we'll actually come to this particular family 154 00:07:58,639 --> 00:07:59,866 of distributions. 155 00:07:59,866 --> 00:08:00,990 Why this particular family? 156 00:08:00,990 --> 00:08:02,406 Because in a way, everything we've 157 00:08:02,406 --> 00:08:04,710 done for the linear model with Gaussian 158 00:08:04,710 --> 00:08:08,610 is going to extend fairly naturally to this family. 159 00:08:08,610 --> 00:08:11,460 All right, and it actually also, because it encompasses 160 00:08:11,460 --> 00:08:13,470 pretty much everything, all the distributions 161 00:08:13,470 --> 00:08:15,950 we've discussed before. 162 00:08:15,950 --> 00:08:19,890 All right, so the second thing that I want to question-- 163 00:08:19,890 --> 00:08:22,260 right, so before, we just said, well, 164 00:08:22,260 --> 00:08:28,560 mu of x was directly equal to this thing. 165 00:08:31,880 --> 00:08:34,260 Mu of x was directly x transpose beta. 166 00:08:34,260 --> 00:08:36,530 So I knew I was going to have an x transpose beta 167 00:08:36,530 --> 00:08:39,030 and I said, well, I could do something with this x transpose 168 00:08:39,030 --> 00:08:42,750 beta before I used it to explain the expected value. 169 00:08:42,750 --> 00:08:44,490 But I'm actually taking it like that. 170 00:08:44,490 --> 00:08:52,200 Here, we're going to say, let's extend this to some function 171 00:08:52,200 --> 00:08:54,000 is equal to this thing. 172 00:08:54,000 --> 00:08:56,790 Now admittedly, this is not the most natural way 173 00:08:56,790 --> 00:08:57,600 to think about it. 174 00:08:57,600 --> 00:08:59,724 What you would probably feel more comfortable doing 175 00:08:59,724 --> 00:09:03,870 is write something like mu of x is a function. 176 00:09:03,870 --> 00:09:08,070 Let's call it f of x transpose beta. 177 00:09:08,070 --> 00:09:12,850 But here, I decide to call f g inverse. 178 00:09:12,850 --> 00:09:14,574 OK, that's just my g inverse.
179 00:09:14,574 --> 00:09:15,074 Yes. 180 00:09:15,074 --> 00:09:18,430 AUDIENCE: Is this different than just [INAUDIBLE] 181 00:09:18,430 --> 00:09:19,430 PHILIPPE RIGOLLET: Yeah. 182 00:09:22,820 --> 00:09:26,855 I mean, what transformation you want to put on your x's? 183 00:09:26,855 --> 00:09:35,120 AUDIENCE: [INAUDIBLE] 184 00:09:35,120 --> 00:09:37,620 PHILIPPE RIGOLLET: Oh no, certainly not, right? 185 00:09:37,620 --> 00:09:40,820 I mean, if I give you-- if I force you to work with x1 plus 186 00:09:40,820 --> 00:09:44,280 x2, you cannot work with any function of x1 plus any 187 00:09:44,280 --> 00:09:46,050 function of x2, right? 188 00:09:46,050 --> 00:09:48,435 So this is different. 189 00:09:51,900 --> 00:09:55,230 All right, so-- yeah. 190 00:09:55,230 --> 00:09:57,900 The transformation would be just the simple part 191 00:09:57,900 --> 00:09:59,400 of your linear regression problem 192 00:09:59,400 --> 00:10:01,920 where you would take your x's, transform them, 193 00:10:01,920 --> 00:10:03,960 and then just apply another linear regression. 194 00:10:03,960 --> 00:10:04,950 This is genuinely new. 195 00:10:07,419 --> 00:10:08,210 Any other question? 196 00:10:11,040 --> 00:10:13,740 All right, so this function g and the reason 197 00:10:13,740 --> 00:10:16,830 why I sort of have to, like, stick to this slightly less 198 00:10:16,830 --> 00:10:18,690 natural way of defining it is because it's 199 00:10:18,690 --> 00:10:21,660 g that gets a name, not g inverse that gets a name. 200 00:10:21,660 --> 00:10:23,330 And the name of g is the link function. 201 00:10:29,950 --> 00:10:33,530 So if I want to give you a generalized linear model, 202 00:10:33,530 --> 00:10:35,250 I need to give you two ingredients. 203 00:10:35,250 --> 00:10:37,630 The first one is the random component, 204 00:10:37,630 --> 00:10:40,110 which is the distribution of y given x.
205 00:10:40,110 --> 00:10:44,520 And it can be anything in what's called the exponential family 206 00:10:44,520 --> 00:10:45,630 of distributions. 207 00:10:45,630 --> 00:10:47,670 So for example, I could say, y given 208 00:10:47,670 --> 00:10:50,910 x is Gaussian with mean mu of x and covariance sigma squared identity. 209 00:10:50,910 --> 00:10:53,070 But I can also tell you y given x 210 00:10:53,070 --> 00:10:57,580 is gamma with shape parameter equal to alpha of x, OK? 211 00:10:57,580 --> 00:11:00,480 I could do some weird things like this. 212 00:11:00,480 --> 00:11:03,930 And the second thing is I need to give you a link function. 213 00:11:03,930 --> 00:11:08,300 And the link function is going to become very clear 214 00:11:08,300 --> 00:11:09,860 how you pick a link function. 215 00:11:09,860 --> 00:11:12,350 And the only reason that you actually pick a link function 216 00:11:12,350 --> 00:11:15,010 is because of compatibility. 217 00:11:15,010 --> 00:11:18,730 This mu of x, I call it mu because mu of x 218 00:11:18,730 --> 00:11:21,950 is always the conditional expectation of y given x, 219 00:11:21,950 --> 00:11:25,450 always, which means that let's think 220 00:11:25,450 --> 00:11:27,660 of y as being a Bernoulli random variable. 221 00:11:31,176 --> 00:11:32,530 Where does mu of x live? 222 00:11:37,430 --> 00:11:38,410 AUDIENCE: [INAUDIBLE] 223 00:11:38,410 --> 00:11:39,100 PHILIPPE RIGOLLET: 0, 1, right? 224 00:11:39,100 --> 00:11:40,683 That's the expectation of a Bernoulli. 225 00:11:40,683 --> 00:11:43,630 It's just the probability that my coin flip gives me 1. 226 00:11:43,630 --> 00:11:45,630 So it's a number between 0 and 1. 227 00:11:45,630 --> 00:11:49,960 But this guy right here, if my x's are anything, right-- 228 00:11:49,960 --> 00:11:52,540 think of any body measurements plus [INAUDIBLE] 229 00:11:52,540 --> 00:11:55,520 linear combinations with arbitrarily large coefficients. 230 00:11:55,520 --> 00:11:57,860 This thing can be any real number.
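[Editor's sketch] The mismatch described here, a mean confined to (0, 1) against a linear predictor that can be any real number, can be illustrated with one candidate link. The logit function used below is an assumption chosen for illustration; the lecture has not named a specific link for the Bernoulli case at this point, and the coefficient values are made up.

```python
import math


def linear_predictor(x, beta):
    """x transpose beta for plain Python lists of equal length."""
    return sum(b * xj for b, xj in zip(beta, x))


def logit(mu):
    """A candidate link g: maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))


def inverse_logit(eta):
    """g inverse: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    beta = [0.5, -2.0]  # hypothetical coefficients (intercept, slope)
    for x in ([1.0, 0.1], [1.0, 3.0], [1.0, -4.0]):
        eta = linear_predictor(x, beta)  # unconstrained real number
        mu = inverse_logit(eta)          # always a valid probability
        print(x, eta, mu)
```

Whatever value x transpose beta takes, applying g inverse returns a number strictly between 0 and 1, which is what a Bernoulli mean requires.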
231 00:11:57,860 --> 00:12:01,180 So the link function, what it's effectively going to do 232 00:12:01,180 --> 00:12:03,060 is make those two things compatible. 233 00:12:03,060 --> 00:12:04,570 It's going to take my number which, 234 00:12:04,570 --> 00:12:07,270 for example, is constrained to be between 0 and 1 235 00:12:07,270 --> 00:12:11,006 and map it into the entire real line. 236 00:12:11,006 --> 00:12:13,380 If I have mu which is forced to be positive, for example, 237 00:12:13,380 --> 00:12:16,830 in an exponential distribution, the mean is positive, right? 238 00:12:16,830 --> 00:12:20,850 That's the, say, I don't know, inter-arrival time 239 00:12:20,850 --> 00:12:22,500 for a Poisson process. 240 00:12:22,500 --> 00:12:25,310 This thing is known to be positive for an exponential. 241 00:12:25,310 --> 00:12:27,060 I need to map something that's exponential 242 00:12:27,060 --> 00:12:28,072 to the entire real line. 243 00:12:28,072 --> 00:12:30,030 I need a function that takes something positive 244 00:12:30,030 --> 00:12:31,260 and [INAUDIBLE] everywhere. 245 00:12:31,260 --> 00:12:32,520 So we'll see. 246 00:12:32,520 --> 00:12:34,020 By the end of this chapter, you will 247 00:12:34,020 --> 00:12:36,900 have 100 ways of doing this, but there are some more traditional 248 00:12:36,900 --> 00:12:38,560 ones [INAUDIBLE]. 249 00:12:38,560 --> 00:12:41,110 So before we go any further, I gave you the example 250 00:12:41,110 --> 00:12:46,809 of a Bernoulli random variable. 251 00:12:46,809 --> 00:12:48,850 Let's see a few examples that actually fit there. 252 00:12:48,850 --> 00:12:49,349 Yes. 253 00:12:51,104 --> 00:12:53,509 AUDIENCE: Will it come up later [INAUDIBLE] already know 254 00:12:53,509 --> 00:12:56,154 why do we need the transformation [INAUDIBLE] why 255 00:12:56,154 --> 00:12:59,300 don't [INAUDIBLE] 256 00:12:59,300 --> 00:13:01,810 PHILIPPE RIGOLLET: Well actually, this 257 00:13:01,810 --> 00:13:02,830 will not come up later.
258 00:13:02,830 --> 00:13:04,510 It should be very clear from here 259 00:13:04,510 --> 00:13:06,070 because if I actually have a model, 260 00:13:06,070 --> 00:13:08,290 I just want it to be plausible, right? 261 00:13:08,290 --> 00:13:11,040 I mean, what happens if I suddenly decide that my-- 262 00:13:11,040 --> 00:13:12,669 so this is what's going to happen. 263 00:13:12,669 --> 00:13:14,710 You're going to have only data to fit this model. 264 00:13:14,710 --> 00:13:17,530 Let's say you actually forget about this thing here. 265 00:13:17,530 --> 00:13:19,060 You can always do this, right? 266 00:13:19,060 --> 00:13:23,974 You can always say I'm going to pretend my y's just 267 00:13:23,974 --> 00:13:26,140 happen to be the realizations of some Gaussians that 268 00:13:26,140 --> 00:13:28,270 happen to be 0 or 1 only. 269 00:13:28,270 --> 00:13:32,020 You can always, like, stuff that in some linear model, right? 270 00:13:32,020 --> 00:13:35,760 You will have some least squares estimator for beta. 271 00:13:35,760 --> 00:13:36,880 And it's going to be fine. 272 00:13:36,880 --> 00:13:38,630 For all the points that you see, it 273 00:13:38,630 --> 00:13:40,270 will definitely put some number that's 274 00:13:40,270 --> 00:13:42,016 actually between 0 and 1. 275 00:13:42,016 --> 00:13:44,140 So this is what your picture is going to look like. 276 00:13:44,140 --> 00:13:48,795 You're going to have a bunch of values for x. 277 00:13:48,795 --> 00:13:50,169 This is your y. 278 00:13:50,169 --> 00:13:51,960 And for different-- so these are the values 279 00:13:51,960 --> 00:13:53,430 of x that you will get. 280 00:13:53,430 --> 00:13:55,920 And for a y, you will see either a 0 or a 1, right? 281 00:13:59,180 --> 00:14:02,990 Right, that's what your Bernoulli dataset would look 282 00:14:02,990 --> 00:14:05,210 like with a one dimensional x. 283 00:14:05,210 --> 00:14:09,680 Now if you do least squares on this, you will find this.
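[Editor's sketch] The picture being drawn, 0/1 responses against a one-dimensional x with a least-squares line through them, can be reproduced with a few lines of code. The six data points are made up for illustration, and the closed-form simple-regression formulas are used in place of any library call.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one covariate)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical Bernoulli-style data: every response is 0 or 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
intercept, slope = fit_line(xs, ys)


def predict(x):
    """Fitted line, read as an estimate of P(y = 1 | x)."""
    return intercept + slope * x


if __name__ == "__main__":
    print(predict(2.5))  # inside the observed range of x
    print(predict(8.0))  # well outside the observed range of x
```

Nothing in the fitted line keeps its output inside the interval from 0 to 1, so a new x far enough from the data produces a "probability" above 1.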
284 00:14:09,680 --> 00:14:11,360 And for this guy, this line certainly 285 00:14:11,360 --> 00:14:14,290 takes values between 0 and 1. 286 00:14:14,290 --> 00:14:16,242 But let's say now you get an x here. 287 00:14:16,242 --> 00:14:17,950 You're going to actually start pretending 288 00:14:17,950 --> 00:14:20,930 that the probability it spits out one conditionally on x 289 00:14:20,930 --> 00:14:22,910 is like 1.2, and that's going to be weird. 290 00:14:28,310 --> 00:14:31,240 Any other questions? 291 00:14:31,240 --> 00:14:34,700 All right, so let's start with some examples. 292 00:14:34,700 --> 00:14:38,790 Right, I mean, you get so used to them through this course. 293 00:14:38,790 --> 00:14:41,250 So the first one is-- 294 00:14:41,250 --> 00:14:42,500 so all these things are taken. 295 00:14:42,500 --> 00:14:44,124 So there's a few books on generalized 296 00:14:44,124 --> 00:14:45,950 linear models, generalized [INAUDIBLE] models. 297 00:14:45,950 --> 00:14:48,920 And there's tons of applications that you can see. 298 00:14:48,920 --> 00:14:50,990 Those are extremely versatile, and as soon 299 00:14:50,990 --> 00:14:53,775 as you want to do modeling to explain some y given x, 300 00:14:53,775 --> 00:14:55,400 you sort of need to do that if you want 301 00:14:55,400 --> 00:14:58,040 to go beyond linear models. 302 00:14:58,040 --> 00:15:00,610 So this was in the disease occurring rate. 303 00:15:00,610 --> 00:15:04,340 So you have a disease epidemic and you 304 00:15:04,340 --> 00:15:08,390 want to basically model the expected number 305 00:15:08,390 --> 00:15:11,390 of new cases given-- 306 00:15:11,390 --> 00:15:13,100 at a certain time, OK? 307 00:15:13,100 --> 00:15:16,190 So you have time that progresses for each of your observations. 308 00:15:16,190 --> 00:15:18,500 Each of your observations is a time stamp-- 309 00:15:18,500 --> 00:15:21,410 say, I don't know, 20th day. 310 00:15:21,410 --> 00:15:26,354 And your response is the number of new cases.
311 00:15:26,354 --> 00:15:28,520 And you're going to actually put your model directly 312 00:15:28,520 --> 00:15:29,480 on mu, right? 313 00:15:29,480 --> 00:15:31,970 When I looked at this, everything here 314 00:15:31,970 --> 00:15:34,460 was on mu itself, on the expected, right? 315 00:15:34,460 --> 00:15:36,410 Mu of x is always the expected-- 316 00:15:39,609 --> 00:15:42,230 the conditional expectation of y given x. 317 00:15:45,280 --> 00:15:45,890 right? 318 00:15:45,890 --> 00:15:51,750 So all I need to model is this expected value. 319 00:15:51,750 --> 00:15:54,860 So this mu I'm going to actually say-- 320 00:15:54,860 --> 00:15:57,620 so I look at some parameters, and it says, well, 321 00:15:57,620 --> 00:16:00,489 it increases exponentially. 322 00:16:00,489 --> 00:16:02,780 So I want to say I have some sort of exponential trend. 323 00:16:02,780 --> 00:16:04,820 I can parametrize that in several ways. 324 00:16:04,820 --> 00:16:06,500 And the two parameters I want to slap in 325 00:16:06,500 --> 00:16:10,190 is, like, some sort of gamma, which is just the coefficient. 326 00:16:10,190 --> 00:16:13,920 And then there's some rate delta that's in the exponential. 327 00:16:13,920 --> 00:16:15,650 So if I tell you it's exponential, 328 00:16:15,650 --> 00:16:17,330 that's a nice family of functions you 329 00:16:17,330 --> 00:16:18,710 might want to think about, OK? 330 00:16:18,710 --> 00:16:24,520 So here, mu of x, if I want to keep the notation, x 331 00:16:24,520 --> 00:16:30,650 is gamma exponential delta x, right? 332 00:16:30,650 --> 00:16:34,600 Except that here, my x are t1, t2, t3, et cetera. 
333 00:16:34,600 --> 00:16:37,340 And I want to find what the parameters gamma and delta are 334 00:16:37,340 --> 00:16:40,040 because I want to be able to maybe compare 335 00:16:40,040 --> 00:16:42,980 different epidemics and see if they have the same parameter 336 00:16:42,980 --> 00:16:46,670 or maybe just do some prediction based on the data 337 00:16:46,670 --> 00:16:49,070 that I have without-- to extrapolate in the future. 338 00:16:52,020 --> 00:16:58,280 So here, clearly mu of x is not of the form 339 00:16:58,280 --> 00:17:01,970 x transpose beta, right? 340 00:17:01,970 --> 00:17:04,410 That's not x transpose beta at all. 341 00:17:04,410 --> 00:17:07,579 And it's actually not even a function of x transpose beta, 342 00:17:07,579 --> 00:17:08,210 right? 343 00:17:08,210 --> 00:17:09,900 There's two parameters, gamma and delta, 344 00:17:09,900 --> 00:17:11,849 and it's not of the form. 345 00:17:11,849 --> 00:17:14,359 So here we have x, which is 1 and x, right? 346 00:17:14,359 --> 00:17:16,200 I have two parameters. 347 00:17:16,200 --> 00:17:17,970 So what I do here is that I say, well, 348 00:17:17,970 --> 00:17:20,640 first, let me transform mu in such a way 349 00:17:20,640 --> 00:17:23,119 that I can hope to see something that's linear. 350 00:17:23,119 --> 00:17:26,819 So if I transform mu, I'm going to have log of mu, which 351 00:17:26,819 --> 00:17:28,099 is log of this thing, right? 352 00:17:28,099 --> 00:17:33,770 So log of mu of x is equal, well, 353 00:17:33,770 --> 00:17:36,950 to log of gamma plus log of exponential delta 354 00:17:36,950 --> 00:17:39,350 x, which is delta x. 355 00:17:42,050 --> 00:17:46,190 And now this thing is actually linear in x. 356 00:17:46,190 --> 00:17:49,440 So I have that this guy is my first beta 1. 357 00:17:49,440 --> 00:17:50,990 And so that's beta 1 times 1.
358 00:17:50,990 --> 00:17:53,320 And this guy is beta 2-- 359 00:17:53,320 --> 00:17:55,950 times, sorry that said beta 0-- times 1, and this guy 360 00:17:55,950 --> 00:17:58,040 is beta 1 times x. 361 00:17:58,040 --> 00:18:00,200 OK, so that looks like a linear model. 362 00:18:00,200 --> 00:18:02,330 I just have to change my parameters-- 363 00:18:02,330 --> 00:18:05,840 my parameters beta 1 becomes the log of gamma and beta 2 364 00:18:05,840 --> 00:18:08,180 becomes delta itself. 365 00:18:08,180 --> 00:18:11,210 And the reason why we do this is because, well, the way 366 00:18:11,210 --> 00:18:13,737 we put those gamma and those delta was just so that we 367 00:18:13,737 --> 00:18:14,820 have some parametrization. 368 00:18:14,820 --> 00:18:17,300 It just so happens that if we want this to be linear, 369 00:18:17,300 --> 00:18:20,052 we need to just change the parametrization itself. 370 00:18:20,052 --> 00:18:21,510 This is going to have some effects. 371 00:18:21,510 --> 00:18:23,301 We know that it's going to have some effect 372 00:18:23,301 --> 00:18:24,531 in the Fisher information. 373 00:18:24,531 --> 00:18:27,030 It's going to have a bunch of effects to change those things. 374 00:18:27,030 --> 00:18:29,510 But that's what needs to be done to have 375 00:18:29,510 --> 00:18:32,240 a generalized linear model. 376 00:18:32,240 --> 00:18:35,460 Now here, the function that I took 377 00:18:35,460 --> 00:18:37,690 to turn it into something that's linear is simple. 378 00:18:37,690 --> 00:18:41,000 It came directly from some natural thing I would do here, 379 00:18:41,000 --> 00:18:42,430 which is taking the log. 380 00:18:42,430 --> 00:18:44,500 And so the function g, the link that I take, 381 00:18:44,500 --> 00:18:47,530 is called the log link very creatively.
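[Editor's sketch] The reparametrization described here can be checked numerically: with mu(t) = gamma * exp(delta * t), taking the log gives an intercept log(gamma) and a slope delta. The numeric values of gamma and delta below are hypothetical, and the names beta_intercept and beta_slope are chosen here to sidestep the indexing used on the board.

```python
import math

# Hypothetical epidemic parameters, chosen only for illustration.
gamma = 3.0   # multiplicative coefficient
delta = 0.2   # exponential growth rate


def mu(t):
    """Expected number of new cases at time t: gamma * exp(delta * t)."""
    return gamma * math.exp(delta * t)


# After applying the log link, the model is linear in (1, t).
beta_intercept = math.log(gamma)  # coefficient multiplying 1
beta_slope = delta                # coefficient multiplying t

if __name__ == "__main__":
    for t in (1, 2, 5, 10, 20):
        print(t, math.log(mu(t)), beta_intercept + beta_slope * t)
```

The last two columns coincide, which is the statement that g = log turns this mean into a linear function of the covariates.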
382 00:18:47,530 --> 00:18:49,750 And it's just the function that I 383 00:18:49,750 --> 00:18:52,200 apply to mu so that I see something that's linear 384 00:18:52,200 --> 00:18:53,260 and that looks like this. 385 00:18:59,580 --> 00:19:03,890 So now this only tells me how to deal with the link function. 386 00:19:03,890 --> 00:19:06,380 But I still have to deal with point 1. 387 00:19:06,380 --> 00:19:08,960 And this, again, is just some modeling. 388 00:19:08,960 --> 00:19:11,090 Given some data, some random data, 389 00:19:11,090 --> 00:19:14,630 what distribution do you choose to explain the randomness? 390 00:19:14,630 --> 00:19:17,600 And this-- I mean, unless there's no choice, 391 00:19:17,600 --> 00:19:19,820 you know, it's just a matter of practice, right? 392 00:19:19,820 --> 00:19:22,100 I mean, why would it be Gaussian and not, you know, 393 00:19:22,100 --> 00:19:23,540 doubly exponential? 394 00:19:23,540 --> 00:19:25,472 This is-- there's matters of convenience that 395 00:19:25,472 --> 00:19:27,680 come into this, and there's just matters of experience 396 00:19:27,680 --> 00:19:29,780 that come into this. 397 00:19:29,780 --> 00:19:32,660 You know, I remember when you chat with engineers, 398 00:19:32,660 --> 00:19:34,416 they have a very good notion of what 399 00:19:34,416 --> 00:19:35,540 the distribution should be. 400 00:19:35,540 --> 00:19:37,970 They have Weibull distributions. 401 00:19:37,970 --> 00:19:39,909 You know, they do optics and things like this. 402 00:19:39,909 --> 00:19:42,450 So there's some distributions that just come up but sometimes 403 00:19:42,450 --> 00:19:43,640 just have to work. 404 00:19:43,640 --> 00:19:45,380 Now here what do we have? 405 00:19:45,380 --> 00:19:47,720 The thing we're trying to measure, y-- 406 00:19:47,720 --> 00:19:49,790 as we said, so mu is the expectation, 407 00:19:49,790 --> 00:19:52,070 the conditional expectation, of y given x.
408 00:19:52,070 --> 00:19:56,090 But y is the number of new cases, right? 409 00:19:56,090 --> 00:19:57,560 Well it's a number of. 410 00:19:57,560 --> 00:19:59,060 And the first thing you should think 411 00:19:59,060 --> 00:20:00,980 of when you think about number of, 412 00:20:00,980 --> 00:20:03,620 if it were bounded above, you would think binomial, maybe. 413 00:20:03,620 --> 00:20:05,220 But here, it's just a number. 414 00:20:05,220 --> 00:20:06,640 So you think Poisson. 415 00:20:06,640 --> 00:20:08,750 That's how insurers think. 416 00:20:08,750 --> 00:20:13,030 I have a number of, you know, claims per year. 417 00:20:13,030 --> 00:20:15,570 This is a Poisson distribution. 418 00:20:15,570 --> 00:20:18,062 And hopefully they can model the conditional distribution 419 00:20:18,062 --> 00:20:20,520 of the number of claims given everything that they actually 420 00:20:20,520 --> 00:20:24,940 ask you in the surveys that I hear 421 00:20:24,940 --> 00:20:26,980 you now fill in 15 minutes. 422 00:20:26,980 --> 00:20:31,924 All right, so now you have this Poisson distribution. 423 00:20:31,924 --> 00:20:33,590 And that's just the modeling assumption. 424 00:20:33,590 --> 00:20:34,840 There's no particular reason why you 425 00:20:34,840 --> 00:20:36,450 should do this except that, you know, 426 00:20:36,450 --> 00:20:38,050 that might be a good idea. 427 00:20:38,050 --> 00:20:39,700 And the expected value of your Poisson 428 00:20:39,700 --> 00:20:42,915 has to be this mu i, OK? 429 00:20:42,915 --> 00:20:46,330 At time i. 430 00:20:46,330 --> 00:20:48,760 Any question about this slide? 431 00:20:48,760 --> 00:20:51,660 OK, so let's switch to another example. 432 00:20:51,660 --> 00:20:54,870 Another example is the so-called prey capture rate. 433 00:20:54,870 --> 00:20:58,010 So here, what you're interested in 434 00:20:58,010 --> 00:21:05,330 is the rate of capture of preys yi for a given prey.
435 00:21:05,330 --> 00:21:10,730 And you have xi, which is your explanatory variable. 436 00:21:10,730 --> 00:21:12,275 And this is just the density of prey. 437 00:21:12,275 --> 00:21:17,030 So you're trying to explain the rate of captures of preys given 438 00:21:17,030 --> 00:21:20,060 the density of the prey, OK? 439 00:21:20,060 --> 00:21:22,964 And so you need to find some sort of relationship 440 00:21:22,964 --> 00:21:23,630 between the two. 441 00:21:23,630 --> 00:21:25,250 And here again, you talk to experts 442 00:21:25,250 --> 00:21:27,570 and what they tell you is that, well, it's 443 00:21:27,570 --> 00:21:28,820 going to be increasing, right? 444 00:21:28,820 --> 00:21:32,450 I mean, animals like predators are going to just eat more 445 00:21:32,450 --> 00:21:34,239 if there's more preys. 446 00:21:34,239 --> 00:21:35,780 But at some point, they're just going 447 00:21:35,780 --> 00:21:38,450 to level off because they're going to be [INAUDIBLE] full 448 00:21:38,450 --> 00:21:42,380 and they're going to stop capturing those preys. 449 00:21:42,380 --> 00:21:44,804 And you're just going to have some phenomenon that 450 00:21:44,804 --> 00:21:45,470 looks like this. 451 00:21:45,470 --> 00:21:47,870 So here is a curve that sort of makes sense, right? 452 00:21:47,870 --> 00:21:52,130 As your capture rate goes from 0 to 1, you're increasing, 453 00:21:52,130 --> 00:21:54,530 and then you see you have this like [INAUDIBLE] function 454 00:21:54,530 --> 00:21:57,630 that says, you know, at some point it levels off. 455 00:21:57,630 --> 00:21:59,490 OK, so here, one way I could-- 456 00:21:59,490 --> 00:22:01,590 I mean, there's again many ways I could just 457 00:22:01,590 --> 00:22:03,300 model a function that looks like this.
458 00:22:03,300 --> 00:22:05,910 But a simple one that has only two parameters 459 00:22:05,910 --> 00:22:09,930 is this one, where mu i is this function of xi where 460 00:22:09,930 --> 00:22:13,230 I have some parameter alpha here and some parameter h here. 461 00:22:13,230 --> 00:22:15,820 OK, so there's clearly-- 462 00:22:15,820 --> 00:22:21,240 so this function, there's one that essentially tells you-- 463 00:22:21,240 --> 00:22:23,880 so this thing starts at 0 for sure. 464 00:22:23,880 --> 00:22:25,770 And essentially, alpha tells you how 465 00:22:25,770 --> 00:22:28,170 sharp this thing is, and h tells you 466 00:22:28,170 --> 00:22:30,180 at which points you end here. 467 00:22:30,180 --> 00:22:32,460 Well, it's not exactly what those values are equal to, 468 00:22:32,460 --> 00:22:35,380 but that tells you this. 469 00:22:35,380 --> 00:22:41,329 OK, so, you know-- simple, and-- 470 00:22:41,329 --> 00:22:41,870 well, no, OK. 471 00:22:41,870 --> 00:22:44,360 Sorry, that's actually alpha, which is the maximum capture 472 00:22:44,360 --> 00:22:46,450 rate, and h represents the prey density 473 00:22:46,450 --> 00:22:47,830 at which the capture rate is-- 474 00:22:47,830 --> 00:22:49,730 so that's the half time. 475 00:22:49,730 --> 00:22:52,600 OK, so there's actual value [INAUDIBLE]. 476 00:22:52,600 --> 00:22:54,500 All right, so now I have this function. 477 00:22:54,500 --> 00:22:56,930 It's certainly not a linear function. 478 00:22:56,930 --> 00:22:59,330 There's no-- I don't see it as a linear function of x. 479 00:22:59,330 --> 00:23:06,390 So I need to find something that looks like a linear function of x, OK? 480 00:23:06,390 --> 00:23:08,340 So then here, there's no log. 481 00:23:08,340 --> 00:23:13,570 There's no-- well, I could actually take a log here. 482 00:23:13,570 --> 00:23:15,890 But I would have log of x and log of x plus h. 483 00:23:15,890 --> 00:23:17,600 So that would be weird.
484 00:23:17,600 --> 00:23:19,990 So what we propose to do here is, 485 00:23:19,990 --> 00:23:23,350 rather than looking at mu i, we look at 1 over mu i. 486 00:23:23,350 --> 00:23:24,890 Right, and so since your function 487 00:23:24,890 --> 00:23:37,450 was mu i, when you take 1 over mu i, 488 00:23:37,450 --> 00:23:42,580 you get h plus xi divided by alpha xi, which 489 00:23:42,580 --> 00:23:49,690 is h over alpha times one over xi plus 1 over alpha. 490 00:23:49,690 --> 00:23:52,320 And now if I'm willing to make this transformation 491 00:23:52,320 --> 00:23:54,330 of variables and say, actually, I don't-- 492 00:23:54,330 --> 00:23:57,900 my x, whether it's the density of prey 493 00:23:57,900 --> 00:24:00,759 or the inverse density of prey, it really doesn't matter. 494 00:24:00,759 --> 00:24:02,300 I can always make this transformation 495 00:24:02,300 --> 00:24:03,750 when the data comes. 496 00:24:03,750 --> 00:24:06,210 Then I'm actually just going to think of this 497 00:24:06,210 --> 00:24:11,400 as being some linear function beta 0 plus beta 1, 498 00:24:11,400 --> 00:24:17,345 which is this guy, times 1 over xi. 499 00:24:17,345 --> 00:24:20,080 And now my new variable becomes 1 over xi. 500 00:24:20,080 --> 00:24:21,260 And now it's linear. 501 00:24:21,260 --> 00:24:23,350 And the transformation I had to take 502 00:24:23,350 --> 00:24:34,240 was this 1 over x, which is called the reciprocal link, OK? 503 00:24:34,240 --> 00:24:37,120 You can probably guess what the exponential link is going to be 504 00:24:37,120 --> 00:24:38,453 and things like this, all right? 505 00:24:38,453 --> 00:24:41,380 So we'll talk about other links that have slightly less 506 00:24:41,380 --> 00:24:43,180 obvious names. 507 00:24:43,180 --> 00:24:45,206 Now again, modeling, right? 508 00:24:45,206 --> 00:24:46,580 So this was the random component. 509 00:24:46,580 --> 00:24:47,690 This was the easy part.
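The reciprocal-link computation on the board can be checked numerically. The following is a minimal sketch, not something shown in the lecture: the function names and the values of alpha and h are assumptions chosen only for illustration. It confirms that 1 over mu is linear in 1 over x, with intercept 1 over alpha and slope h over alpha.

```python
def capture_rate(x, alpha, h):
    """Mean capture rate mu = alpha * x / (h + x) for prey density x > 0."""
    return alpha * x / (h + x)


def reciprocal_link(mu):
    """Reciprocal link g(mu) = 1 / mu."""
    return 1.0 / mu


# Hypothetical parameter values, chosen only for illustration.
alpha, h = 2.0, 5.0

# Coefficients of the linear predictor in the new covariate 1 / x.
beta0 = 1.0 / alpha   # intercept
beta1 = h / alpha     # slope on 1 / x

densities = [0.5, 1.0, 4.0, 10.0, 50.0]

# Largest gap between g(mu(x)) and beta0 + beta1 * (1 / x) over the grid.
max_gap = max(
    abs(reciprocal_link(capture_rate(x, alpha, h)) - (beta0 + beta1 / x))
    for x in densities
)
print(max_gap)
```

The same sketch also shows what h means: at x equal to h, the capture rate is half of its maximum alpha.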
510 00:24:47,690 --> 00:24:50,920 Now I need to just pour in some domain knowledge 511 00:24:50,920 --> 00:24:55,900 about how do I think this function, this y, which 512 00:24:55,900 --> 00:25:01,810 is the rate of capture of prey, 513 00:25:01,810 --> 00:25:05,162 I want to understand how this thing is actually 514 00:25:05,162 --> 00:25:09,430 changing, what is the randomness of the thing around its mean. 515 00:25:09,430 --> 00:25:11,550 And you know, something that-- so that 516 00:25:11,550 --> 00:25:12,647 comes from this textbook. 517 00:25:12,647 --> 00:25:14,230 The standard deviation of capture rate 518 00:25:14,230 --> 00:25:16,750 might be approximately proportional to the mean rate. 519 00:25:16,750 --> 00:25:18,250 You need to find a distribution that 520 00:25:18,250 --> 00:25:19,390 actually has this property. 521 00:25:19,390 --> 00:25:21,160 And it turns out that this happens 522 00:25:21,160 --> 00:25:23,950 for gamma distributions, right? 523 00:25:23,950 --> 00:25:26,050 In gamma distributions, just like say, 524 00:25:26,050 --> 00:25:29,740 for Poisson distribution, the-- 525 00:25:29,740 --> 00:25:32,579 well, for Poisson, the variance and mean are of the same order. 526 00:25:32,579 --> 00:25:34,120 Here it's the standard deviation that's 527 00:25:34,120 --> 00:25:39,540 of the same order as the [INAUDIBLE] for gammas. 528 00:25:39,540 --> 00:25:42,300 And it's a positive distribution as well. 529 00:25:42,300 --> 00:25:43,790 So here is a candidate. 530 00:25:43,790 --> 00:25:45,260 Now since we're sort of constrained 531 00:25:45,260 --> 00:25:48,777 to work under the exponential family of distributions, 532 00:25:48,777 --> 00:25:50,360 then you can just go through your list 533 00:25:50,360 --> 00:25:52,430 and just decide which one works best for you. 534 00:25:55,250 --> 00:25:56,940 All right, third example-- 535 00:25:56,940 --> 00:25:59,270 so here we have binary response.
536 00:25:59,270 --> 00:26:01,370 Here, essentially the binary response variable 537 00:26:01,370 --> 00:26:02,960 indicates the presence or absence 538 00:26:02,960 --> 00:26:07,460 of postoperative deformity, kyphosis, in children. 539 00:26:07,460 --> 00:26:10,310 And here, rather than having one covariate, which before, 540 00:26:10,310 --> 00:26:12,950 in the first example, was time, in the second example 541 00:26:12,950 --> 00:26:15,230 was the density, here there's three variables 542 00:26:15,230 --> 00:26:17,030 that you measure on children. 543 00:26:17,030 --> 00:26:19,510 The first one is age of the child 544 00:26:19,510 --> 00:26:21,440 and the second one is the number of vertebrae 545 00:26:21,440 --> 00:26:23,040 involved in the operation. 546 00:26:23,040 --> 00:26:25,260 And the third one is the start of the range, 547 00:26:25,260 --> 00:26:29,660 right-- so where it is on the spine. 548 00:26:29,660 --> 00:26:35,105 OK, so the response variable here is, you know, 549 00:26:35,105 --> 00:26:36,440 did it work or not, right? 550 00:26:36,440 --> 00:26:37,970 I mean, that's very simple. 551 00:26:37,970 --> 00:26:41,859 And so here, it's nice because the random component 552 00:26:41,859 --> 00:26:42,650 is the easiest one. 553 00:26:42,650 --> 00:26:45,680 As I said, any random variable that takes only two outcomes 554 00:26:45,680 --> 00:26:49,020 must be a Bernoulli, right? 555 00:26:49,020 --> 00:26:52,004 So that's nice; there's no modeling going on here. 556 00:26:52,004 --> 00:26:54,170 So you know that y given x is going to be Bernoulli, 557 00:26:54,170 --> 00:26:55,628 but of course, all your efforts are 558 00:26:55,628 --> 00:26:58,190 going to try to understand what the conditional mean 559 00:26:58,190 --> 00:27:00,315 of your Bernoulli, what the conditional probability 560 00:27:00,315 --> 00:27:02,090 of being 1 is going to be, OK?
561 00:27:02,090 --> 00:27:05,960 And so in particular-- so I'm just-- here, 562 00:27:05,960 --> 00:27:08,990 I'm spelling it out before we close those examples. 563 00:27:08,990 --> 00:27:12,560 I cannot say that mu of x is x transpose beta for exactly this 564 00:27:12,560 --> 00:27:15,520 picture that I drew for you here, right? 565 00:27:15,520 --> 00:27:17,500 There's just no way here-- the goal 566 00:27:17,500 --> 00:27:20,050 of doing this is certainly to be able to extrapolate 567 00:27:20,050 --> 00:27:23,650 for yet unseen children whether this is something 568 00:27:23,650 --> 00:27:24,850 that we should be doing. 569 00:27:24,850 --> 00:27:27,340 And maybe the range of x is actually 570 00:27:27,340 --> 00:27:28,480 going to be slightly out. 571 00:27:28,480 --> 00:27:30,550 And so, OK, I don't want to see that I have 572 00:27:30,550 --> 00:27:34,770 a negative probability of outcome or a positive one-- 573 00:27:34,770 --> 00:27:38,590 sorry, or one that's larger than one. 574 00:27:38,590 --> 00:27:40,970 So I need to make this transformation. 575 00:27:40,970 --> 00:27:43,700 So what I need to do is to transform mu, which 576 00:27:43,700 --> 00:27:44,930 is, we know, only a number. 577 00:27:44,930 --> 00:27:46,880 All we know is a number between 0 and 1. 578 00:27:46,880 --> 00:27:48,590 And we need to transform it in such a way 579 00:27:48,590 --> 00:27:50,510 that it maps to the entire real line 580 00:27:50,510 --> 00:27:57,270 or reciprocally to say that-- 581 00:27:57,270 --> 00:27:58,850 or inversely, I should say-- 582 00:27:58,850 --> 00:28:00,650 that f of x transpose beta should 583 00:28:00,650 --> 00:28:02,410 be a number between 0 and 1. 584 00:28:02,410 --> 00:28:05,000 I need to find a function that takes any real number 585 00:28:05,000 --> 00:28:06,950 and maps it into 0, 1. 586 00:28:06,950 --> 00:28:10,460 And we'll see that again, but you 587 00:28:10,460 --> 00:28:12,480 have an army of functions that do that for you.
588 00:28:12,480 --> 00:28:13,761 What are those functions? 589 00:28:16,707 --> 00:28:17,689 AUDIENCE: [INAUDIBLE] 590 00:28:17,689 --> 00:28:19,162 PHILIPPE RIGOLLET: I'm sorry? 591 00:28:19,162 --> 00:28:20,085 AUDIENCE: [INAUDIBLE] 592 00:28:20,085 --> 00:28:21,126 PHILIPPE RIGOLLET: Trait? 593 00:28:21,126 --> 00:28:22,665 AUDIENCE: [INAUDIBLE] 594 00:28:22,665 --> 00:28:23,581 PHILIPPE RIGOLLET: Oh. 595 00:28:23,581 --> 00:28:25,518 AUDIENCE: [INAUDIBLE] 596 00:28:25,518 --> 00:28:28,059 PHILIPPE RIGOLLET: Yeah, I want them to be invertible, right? 597 00:28:28,059 --> 00:28:34,074 AUDIENCE: [INAUDIBLE] 598 00:28:34,074 --> 00:28:35,990 PHILIPPE RIGOLLET: I have an army of functions. 599 00:28:35,990 --> 00:28:39,100 I'm not asking for one soldier in this army. 600 00:28:39,100 --> 00:28:41,860 I want the name of this army. 601 00:28:41,860 --> 00:28:44,057 AUDIENCE: [INAUDIBLE] 602 00:28:44,057 --> 00:28:46,640 PHILIPPE RIGOLLET: Well, they're not really invertible either, 603 00:28:46,640 --> 00:28:48,860 right? 604 00:28:48,860 --> 00:28:53,730 So they're actually in [INAUDIBLE] textbook. 605 00:28:53,730 --> 00:28:55,272 Because remember, statisticians don't 606 00:28:55,272 --> 00:28:56,980 know how to integrate functions, but they 607 00:28:56,980 --> 00:28:59,250 know how to turn a function into a Gaussian integral. 608 00:28:59,250 --> 00:29:01,722 So we know it integrates to 1 and things like this. 609 00:29:01,722 --> 00:29:03,180 Same thing here-- we don't know how 610 00:29:03,180 --> 00:29:06,692 to build functions that are invertible and map 611 00:29:06,692 --> 00:29:08,400 the entire real line to 0, 1, but there's 612 00:29:08,400 --> 00:29:11,350 all the cumulative distribution functions that do that for us. 613 00:29:11,350 --> 00:29:13,190 So I can use any of those guys, and that's 614 00:29:13,190 --> 00:29:16,330 what I'm going to be doing, actually.
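To make the "army of functions" concrete, here is a minimal sketch of two cumulative distribution functions used as maps from the real line into (0, 1). The lecture does not name specific CDFs at this point, so the choice of the logistic and standard normal CDFs, and all function names, are assumptions for illustration only.

```python
import math


def logistic_cdf(t):
    """CDF of the standard logistic distribution: maps any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_cdf(t):
    """CDF of the standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def logit(p):
    """Inverse of logistic_cdf: maps a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))


# Hypothetical values of the linear predictor x transpose beta.
scores = [-6.0, -1.0, 0.0, 2.5, 6.0]

probs_logistic = [logistic_cdf(t) for t in scores]
probs_normal = [normal_cdf(t) for t in scores]

for t, p, q in zip(scores, probs_logistic, probs_normal):
    print(t, p, q)
```

Both functions are strictly increasing, hence invertible, which is the property asked for in the exchange above: the inverse (here `logit` for the logistic case) is what gets applied to mu so that the result can be modeled as x transpose beta.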
615 00:29:16,330 --> 00:29:19,730 All right, so just to recap what I just 616 00:29:19,730 --> 00:29:23,870 said as we were speaking, so normal linear model is not 617 00:29:23,870 --> 00:29:30,470 appropriate for these examples if only because the response 618 00:29:30,470 --> 00:29:34,340 variable is not necessarily Gaussian 619 00:29:34,340 --> 00:29:37,430 and also because the linear model has to be-- 620 00:29:37,430 --> 00:29:39,600 the mean has to be transformed before I can actually 621 00:29:39,600 --> 00:29:42,210 apply a linear model for all these plausible nonlinear 622 00:29:42,210 --> 00:29:44,890 models that I actually came up with. 623 00:29:44,890 --> 00:29:48,080 OK, so the family we're going to go for 624 00:29:48,080 --> 00:29:50,780 is the exponential family of distributions. 625 00:29:50,780 --> 00:29:54,130 And we're going to be able to show-- 626 00:29:54,130 --> 00:29:56,120 so one of the nice parts of this is 627 00:29:56,120 --> 00:29:58,300 to actually compute maximum likelihood 628 00:29:58,300 --> 00:29:59,570 estimators for those, right? 629 00:29:59,570 --> 00:30:02,390 In the linear model, maximum-- like, in the Gaussian 630 00:30:02,390 --> 00:30:05,360 linear model, maximum likelihood was as nice as it gets, right? 631 00:30:05,360 --> 00:30:08,810 This actually was the least squares estimator. 632 00:30:08,810 --> 00:30:10,220 We had a closed form. 633 00:30:10,220 --> 00:30:12,920 x transpose x inverse x transpose y, 634 00:30:12,920 --> 00:30:14,120 and that was it, OK? 635 00:30:14,120 --> 00:30:15,780 We had to just take one derivative. 636 00:30:15,780 --> 00:30:19,580 Here, we're going to have a generally concave likelihood.
637 00:30:19,580 --> 00:30:21,170 We're not going to be able to actually 638 00:30:21,170 --> 00:30:23,750 solve this thing directly in closed form 639 00:30:23,750 --> 00:30:26,610 unless it's Gaussian, but we will have-- 640 00:30:26,610 --> 00:30:30,070 we'll see actually how this is not just 641 00:30:30,070 --> 00:30:32,770 a black box optimization of a concave function. 642 00:30:32,770 --> 00:30:35,830 We have a lot of properties of this concave function, 643 00:30:35,830 --> 00:30:38,500 and we will be able to show some iterative algorithms. 644 00:30:38,500 --> 00:30:42,880 We'll basically see how, when you open the box of convex 645 00:30:42,880 --> 00:30:46,270 optimization, you will actually be able to see how things work 646 00:30:46,270 --> 00:30:49,070 and actually implement it using least squares. 647 00:30:49,070 --> 00:30:51,260 So each iteration of this iterative algorithm 648 00:30:51,260 --> 00:30:52,760 will essentially be a least squares, 649 00:30:52,760 --> 00:30:54,700 and that's actually quite [INAUDIBLE]. 650 00:30:54,700 --> 00:30:56,830 So, very demonstrative of statisticians 651 00:30:56,830 --> 00:30:59,770 being pretty ingenious so that they 652 00:30:59,770 --> 00:31:01,900 don't have to call in some statistical software 653 00:31:01,900 --> 00:31:06,040 but just can repeatedly call their least squares 654 00:31:06,040 --> 00:31:09,730 oracle within a statistical software. 655 00:31:09,730 --> 00:31:12,170 OK, so what is the exponential family, right? 656 00:31:12,170 --> 00:31:14,390 I promised to do the exponential family. 657 00:31:14,390 --> 00:31:17,540 Before we go into this, let me just 658 00:31:17,540 --> 00:31:19,910 tell you something about exponential families, 659 00:31:19,910 --> 00:31:22,040 and what's the only thing to differentiate 660 00:31:22,040 --> 00:31:25,870 an exponential family from all possible distributions? 661 00:31:25,870 --> 00:31:28,640 An exponential family has two parameters, right?
662 00:31:28,640 --> 00:31:30,140 And those are not really parameters, 663 00:31:30,140 --> 00:31:33,530 but there's this theta parameter of my distribution, OK? 664 00:31:33,530 --> 00:31:35,450 So it's going to be indexed by some parameter. 665 00:31:35,450 --> 00:31:37,324 Here, I'm only talking about the distribution 666 00:31:37,324 --> 00:31:40,550 of, say, some random variable or some random vector, OK? 667 00:31:40,550 --> 00:31:44,360 So here in this slide, you see that the parameter theta that 668 00:31:44,360 --> 00:31:48,760 indexes those distributions is k dimensional 669 00:31:48,760 --> 00:31:53,840 and the space of the x's that I'm looking at-- so 670 00:31:53,840 --> 00:31:55,735 that should really be y, right? 671 00:31:55,735 --> 00:31:57,110 What I'm going to plug in here is 672 00:31:57,110 --> 00:31:59,570 the conditional distribution of y given x, and theta is 673 00:31:59,570 --> 00:32:00,620 going to depend on x. 674 00:32:00,620 --> 00:32:02,110 But this really is the y. 675 00:32:02,110 --> 00:32:04,770 That's the distribution of the response variable. 676 00:32:04,770 --> 00:32:06,620 And so this is in R q, right? 677 00:32:06,620 --> 00:32:09,250 So I'm going to assume that y takes-- 678 00:32:09,250 --> 00:32:12,200 q dimensional-- is q dimensional. 679 00:32:12,200 --> 00:32:14,270 Clearly soon, q is going to be equal to 1, 680 00:32:14,270 --> 00:32:16,340 but I can define those things generally. 681 00:32:16,340 --> 00:32:17,750 OK, so I have this. 682 00:32:17,750 --> 00:32:19,710 I have to tell you what this looks like. 683 00:32:19,710 --> 00:32:23,310 And let's assume that this is a probability density function.
684 00:32:23,310 --> 00:32:26,360 So this, right, this notation, the fact that I just 685 00:32:26,360 --> 00:32:28,490 put my theta in subscript, is just 686 00:32:28,490 --> 00:32:31,400 for me to remember that this is the variable that 687 00:32:31,400 --> 00:32:34,160 indicates the random variable, and this is just the parameter. 688 00:32:34,160 --> 00:32:37,400 But I could just write it as a function of theta and x, right? 689 00:32:37,400 --> 00:32:39,650 This is just going to be-- right, if you were in calc, 690 00:32:39,650 --> 00:32:41,360 in multivariable calc, you would have 691 00:32:41,360 --> 00:32:43,110 two parameters, theta and x, and you would 692 00:32:43,110 --> 00:32:45,320 need to give me a function. 693 00:32:45,320 --> 00:32:46,580 Now think of all-- 694 00:32:46,580 --> 00:32:50,360 think of x and theta as being one dimensional at this point. 695 00:32:50,360 --> 00:32:51,890 Think of all the functions that can 696 00:32:51,890 --> 00:32:54,530 be depending on theta and x. 697 00:32:54,530 --> 00:32:56,660 There's many of them. 698 00:32:56,660 --> 00:33:01,810 And in particular, there's many ways theta and x can interact. 699 00:33:01,810 --> 00:33:03,580 What the exponential family does for you 700 00:33:03,580 --> 00:33:05,860 is that it restricts the way these things 701 00:33:05,860 --> 00:33:07,877 can actually interact with each other. 702 00:33:07,877 --> 00:33:09,460 It's essentially saying the following. 703 00:33:09,460 --> 00:33:15,700 It's saying this is going to be of the form exponential-- 704 00:33:15,700 --> 00:33:18,100 so this exponential is really not much because I 705 00:33:18,100 --> 00:33:20,020 could put a log next to it. 706 00:33:20,020 --> 00:33:24,940 But what I want is that the way theta and x 707 00:33:24,940 --> 00:33:30,310 interact has to be of the form theta times x 708 00:33:30,310 --> 00:33:32,530 in an exponential, OK?
709 00:33:32,530 --> 00:33:34,210 So that's the simplest-- that's one 710 00:33:34,210 --> 00:33:36,585 of the ways you can think of them interacting is you just take 711 00:33:36,585 --> 00:33:37,900 the product of the two. 712 00:33:37,900 --> 00:33:40,450 Now clearly, this is not a very rich family. 713 00:33:40,450 --> 00:33:43,090 So what I'm allowing myself is to just slap 714 00:33:43,090 --> 00:33:46,000 on some terms that depend only on theta and depend only on x. 715 00:33:46,000 --> 00:33:52,630 So let's just call this thing, I don't know, f of x, g of theta. 716 00:33:52,630 --> 00:33:56,649 OK, so here, I've restricted the way theta and x can interact. 717 00:33:56,649 --> 00:33:58,190 So I have something that depends only 718 00:33:58,190 --> 00:33:59,981 on x, something that depends only on theta. 719 00:33:59,981 --> 00:34:02,560 And here, I have this very specific interaction. 720 00:34:02,560 --> 00:34:06,310 And that's all that exponential families are doing for you, OK? 721 00:34:06,310 --> 00:34:09,840 So if we go back to this slide, this is much more general, 722 00:34:09,840 --> 00:34:14,770 right? If I want to go from theta and x in r to theta 723 00:34:14,770 --> 00:34:16,270 and x theta in r-- 724 00:34:19,449 --> 00:34:26,659 to theta in r k and x in r q, I cannot take the product 725 00:34:26,659 --> 00:34:27,386 of theta and x. 726 00:34:27,386 --> 00:34:29,719 I cannot even take the inner product between theta and x 727 00:34:29,719 --> 00:34:32,030 because they're not even of compatible dimensions. 728 00:34:32,030 --> 00:34:37,460 But what I can do is to first map my theta into something 729 00:34:37,460 --> 00:34:40,940 and map my x into something so that I actually end up 730 00:34:40,940 --> 00:34:42,080 having the same dimensions. 731 00:34:42,080 --> 00:34:43,550 And then I can take the inner product. 732 00:34:43,550 --> 00:34:44,900 That's the natural generalization 733 00:34:44,900 --> 00:34:45,858 of this simple product.
734 00:34:59,800 --> 00:35:03,340 OK, so what I have is-- 735 00:35:03,340 --> 00:35:05,230 right, so if I want to go from theta 736 00:35:05,230 --> 00:35:10,510 to x, what I'm going to first do is I'm going to take theta, 737 00:35:10,510 --> 00:35:11,710 eta of theta-- 738 00:35:11,710 --> 00:35:16,590 so let's say eta1 of theta to eta k of theta. 739 00:35:20,100 --> 00:35:22,220 And then I'm going to actually take 740 00:35:22,220 --> 00:35:29,994 x becomes t1 of x all the way to tk of x. 741 00:35:29,994 --> 00:35:32,160 And what I'm going to do is take the inner product-- 742 00:35:32,160 --> 00:35:35,540 so let's call this eta and let's call this t. 743 00:35:35,540 --> 00:35:39,710 And I'm going to take the inner product of eta and t, which 744 00:35:39,710 --> 00:35:49,550 is just the sum from j equal 1 to k of eta j of theta times 745 00:35:49,550 --> 00:35:52,770 tj of x. 746 00:35:52,770 --> 00:35:57,690 OK, so that's just a way to say I want this simple interaction 747 00:35:57,690 --> 00:35:58,690 but in higher dimension. 748 00:35:58,690 --> 00:36:00,900 The simplest way I can actually make those things happen 749 00:36:00,900 --> 00:36:02,233 is just by taking inner product. 750 00:36:05,490 --> 00:36:07,010 OK, and so now what it's telling me 751 00:36:07,010 --> 00:36:09,630 is that the distribution-- so I want the exponential times 752 00:36:09,630 --> 00:36:11,921 something that depends only on theta and something that 753 00:36:11,921 --> 00:36:12,990 depends only on x. 754 00:36:12,990 --> 00:36:14,490 And so what it tells me is that when 755 00:36:14,490 --> 00:36:16,200 I'm going to take p of theta x, it's 756 00:36:16,200 --> 00:36:19,640 just going to be something which is exponential 757 00:36:19,640 --> 00:36:30,225 of the sum from j equal 1 to k of eta j theta tj of x.
758 00:36:30,225 --> 00:36:32,600 And then I'm going to have a function that depends only-- 759 00:36:32,600 --> 00:36:36,040 so let me write it for now like c of theta and then 760 00:36:36,040 --> 00:36:37,610 a function that depends only on x. 761 00:36:37,610 --> 00:36:39,490 Let me call it h of x. 762 00:36:39,490 --> 00:36:42,340 And for convenience, there's no particular reason 763 00:36:42,340 --> 00:36:43,300 why I do that. 764 00:36:43,300 --> 00:36:45,220 I'm taking this function c of theta 765 00:36:45,220 --> 00:36:47,300 and I'm just actually pushing it in there. 766 00:36:47,300 --> 00:36:57,182 So I can write c of theta as exponential minus log of 1 767 00:36:57,182 --> 00:36:58,612 over c of theta, right? 768 00:37:01,330 --> 00:37:03,324 And now I have exponential times exponential. 769 00:37:03,324 --> 00:37:04,990 So I push it in, and this thing actually 770 00:37:04,990 --> 00:37:10,320 looks like exponential sum from j equal 1 to k of eta 771 00:37:10,320 --> 00:37:22,120 j theta tj of x minus log 1 over c of theta times h of x. 772 00:37:22,120 --> 00:37:26,030 And this thing here, log 1 over c of theta, I call actually 773 00:37:26,030 --> 00:37:32,060 b of theta. Because c, I called it c. 774 00:37:32,060 --> 00:37:35,140 But I can actually directly call this guy b, 775 00:37:35,140 --> 00:37:38,160 and I don't actually care about c itself. 776 00:37:38,160 --> 00:37:43,900 Now why don't I put back also h of x in there? 777 00:37:43,900 --> 00:37:48,949 Because h of x is really here to just-- 778 00:37:48,949 --> 00:37:50,398 how to put it-- 779 00:37:54,262 --> 00:38:00,160 OK, h of x and b of theta don't play the same role. 780 00:38:00,160 --> 00:38:03,490 B of theta in many ways is a normalizing constant, right? 781 00:38:03,490 --> 00:38:06,820 I want this density to integrate to 1. 782 00:38:06,820 --> 00:38:09,520 If I did not have this guy, I'm not 783 00:38:09,520 --> 00:38:11,950 guaranteed that this thing integrates to 1.
784 00:38:11,950 --> 00:38:14,860 But by tweaking this function b of theta or c of theta-- 785 00:38:14,860 --> 00:38:16,080 they're equivalent-- 786 00:38:16,080 --> 00:38:18,350 I can actually ensure that this thing integrates to 1. 787 00:38:18,350 --> 00:38:22,834 So b of theta is just a normalizing constant. 788 00:38:22,834 --> 00:38:25,000 H of x is something that's going to be funny for us. 789 00:38:25,000 --> 00:38:26,583 It's going to be something that allows 790 00:38:26,583 --> 00:38:29,740 us to be able to treat both discrete and continuous 791 00:38:29,740 --> 00:38:38,140 variables within the framework of exponential families. 792 00:38:38,140 --> 00:38:40,060 So for those that are familiar with this, 793 00:38:40,060 --> 00:38:41,890 this is essentially saying that h of x 794 00:38:41,890 --> 00:38:44,120 is really just a change of measure. 795 00:38:44,120 --> 00:38:48,370 When I actually look at the density of p of theta-- 796 00:38:48,370 --> 00:38:50,320 this is with respect to some measure-- 797 00:38:50,320 --> 00:38:52,810 the fact that I just multiplied by a function of x just 798 00:38:52,810 --> 00:38:53,990 means that I'm not looking-- 799 00:38:53,990 --> 00:38:56,420 that this guy here without h of x 800 00:38:56,420 --> 00:38:59,390 is not the density with respect to the original measure, 801 00:38:59,390 --> 00:39:01,660 but it's the density with respect to the distribution 802 00:39:01,660 --> 00:39:04,272 that has h as a density. 803 00:39:04,272 --> 00:39:05,480 That's all I'm saying, right? 804 00:39:05,480 --> 00:39:08,650 So I can first transform my x's and then take the density 805 00:39:08,650 --> 00:39:10,300 with respect to that. 806 00:39:10,300 --> 00:39:12,851 If you don't want to think about densities or measures, 807 00:39:12,851 --> 00:39:13,600 you don't have to. 808 00:39:13,600 --> 00:39:14,790 This is just the way-- 809 00:39:14,790 --> 00:39:16,930 this is just the definition.
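For reference, the definition built up on the board can be written out as follows. This is a transcription of the spoken construction using the lecturer's symbols (eta, t, c, b, h); the typesetting and the displayed normalization condition are a sketch added here, assuming the density is taken with respect to the underlying measure on the sample space, and are not copied from the slide.

```latex
% k-parameter exponential family, theta in R^k, x in R^q
\[
  p_\theta(x)
  \;=\; c(\theta)\, h(x)\, \exp\!\Big( \sum_{j=1}^{k} \eta_j(\theta)\, t_j(x) \Big)
  \;=\; \exp\!\Big( \sum_{j=1}^{k} \eta_j(\theta)\, t_j(x) \;-\; b(\theta) \Big)\, h(x),
  \qquad b(\theta) = \log\frac{1}{c(\theta)} .
\]
% b(theta) is pinned down by requiring the density to integrate to 1:
\[
  b(\theta) \;=\; \log \int \exp\!\Big( \sum_{j=1}^{k} \eta_j(\theta)\, t_j(x) \Big)\, h(x)\, dx .
\]
```

In the one-dimensional case with eta of theta equal to theta and t of x equal to x, this reduces to the simple "theta times x in an exponential" interaction described earlier.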
810 00:39:16,930 --> 00:39:19,610 Is there any question about this definition? 811 00:39:19,610 --> 00:39:21,290 All right, so it looks complicated, 812 00:39:21,290 --> 00:39:23,560 but it's actually essentially the simplest 813 00:39:23,560 --> 00:39:25,360 way you could think about it. 814 00:39:25,360 --> 00:39:29,004 You want to be able to have x and theta interact 815 00:39:29,004 --> 00:39:30,670 and you just say, I want the interaction 816 00:39:30,670 --> 00:39:34,126 to be of the form exponential x times theta. 817 00:39:34,126 --> 00:39:35,500 And if they're higher dimensions, 818 00:39:35,500 --> 00:39:36,530 I'm going to take the exponential 819 00:39:36,530 --> 00:39:38,203 of the function of x inner product 820 00:39:38,203 --> 00:39:39,244 with a function of theta. 821 00:39:43,749 --> 00:39:45,540 All right, so I claimed since the beginning 822 00:39:45,540 --> 00:39:47,167 that the Gaussian was such an example. 823 00:39:47,167 --> 00:39:48,000 So let's just do it. 824 00:39:48,000 --> 00:39:51,330 So is the Gaussian of the-- is the interaction between theta 825 00:39:51,330 --> 00:39:55,350 and x in a Gaussian of the form of an inner product? 826 00:39:55,350 --> 00:39:58,680 And the answer is yes. 827 00:39:58,680 --> 00:40:03,480 Actually, whether I know or not what the variance is, OK? 828 00:40:03,480 --> 00:40:06,747 So let's start for the case where I actually do not 829 00:40:06,747 --> 00:40:07,830 know what the variance is. 830 00:40:07,830 --> 00:40:13,500 So here, I have x is n mu sigma squared. 831 00:40:13,500 --> 00:40:14,804 This is all one dimensional. 832 00:40:14,804 --> 00:40:17,220 And here, I'm going to assume that my parameter is both mu 833 00:40:17,220 --> 00:40:19,440 and sigma square.
834 00:40:19,440 --> 00:40:22,135 OK, so what I need to do is to have some function of mu, 835 00:40:22,135 --> 00:40:24,510 some function of sigma square, and take an inner product 836 00:40:24,510 --> 00:40:26,635 of some function of x and some other function of x. 837 00:40:26,635 --> 00:40:29,060 So I want to show that-- 838 00:40:29,060 --> 00:40:32,350 so p theta of x is what? 839 00:40:32,350 --> 00:40:36,390 Well, it's one over sigma square root 2 pi 840 00:40:36,390 --> 00:40:42,280 exponential minus x minus mu squared over 2 sigma squared, 841 00:40:42,280 --> 00:40:44,370 right? 842 00:40:44,370 --> 00:40:45,840 So that's just my Gaussian density. 843 00:40:45,840 --> 00:40:49,410 And I want to say that this thing here-- so 844 00:40:49,410 --> 00:40:51,660 clearly, the exponential shows up already. 845 00:40:51,660 --> 00:40:53,970 I want to show that this is something that looks 846 00:40:53,970 --> 00:41:01,620 like, you know, eta 1 of-- 847 00:41:01,620 --> 00:41:08,395 sorry, so that was-- yeah, eta 1 of, say, mu sigma squared. 848 00:41:08,395 --> 00:41:09,770 So I have only two of those guys, 849 00:41:09,770 --> 00:41:11,780 so I'm going to need only two etas, right? 850 00:41:11,780 --> 00:41:16,030 So I want it to be eta 1 of mu and sigma times t1 851 00:41:16,030 --> 00:41:22,940 of x plus eta 2 of mu, sigma squared times t2 of x, right? 852 00:41:22,940 --> 00:41:26,070 So I want to have something like that that shows up, 853 00:41:26,070 --> 00:41:27,830 and the only things that are left, 854 00:41:27,830 --> 00:41:32,250 I want them to depend either only on theta or only on x. 855 00:41:32,250 --> 00:41:37,500 So to find that out, we just need to expand. 856 00:41:37,500 --> 00:41:42,480 OK, so I'm going to first put everything into my exponential 857 00:41:42,480 --> 00:41:43,650 and expand this guy.
858 00:41:43,650 --> 00:41:46,110 So the first term here is going to be minus x 859 00:41:46,110 --> 00:41:47,876 squared over 2 sigma squared. 860 00:41:47,876 --> 00:41:49,500 The second term is going to be minus mu 861 00:41:49,500 --> 00:41:51,150 squared over 2 sigma squared. 862 00:41:51,150 --> 00:41:55,650 And then the cross term is going to be plus x mu divided 863 00:41:55,650 --> 00:41:57,014 by sigma squared. 864 00:41:57,014 --> 00:41:58,680 And then I'm going to put this guy here. 865 00:41:58,680 --> 00:42:05,037 So I have a minus log sigma square root 2 pi, OK? 866 00:42:09,020 --> 00:42:13,740 OK, is this-- so this term here contains an interaction 867 00:42:13,740 --> 00:42:15,510 between x and the parameters. 868 00:42:15,510 --> 00:42:17,250 This term here contains an interaction 869 00:42:17,250 --> 00:42:18,480 between x and the parameters. 870 00:42:18,480 --> 00:42:21,240 So let me try to write them in a way that I want. 871 00:42:21,240 --> 00:42:22,950 This guy only depends on the parameters, 872 00:42:22,950 --> 00:42:25,870 this guy only depends on the parameters. 873 00:42:25,870 --> 00:42:28,390 So I'm going to rearrange things. 874 00:42:28,390 --> 00:42:34,080 And so I claim that this is of the form x squared. 875 00:42:34,080 --> 00:42:36,369 Well, let's say-- do-- 876 00:42:43,770 --> 00:42:44,800 who's getting the minus? 877 00:42:44,800 --> 00:42:46,180 Eta, OK. 878 00:42:46,180 --> 00:42:52,960 So it's x squared times minus 1 over 2 sigma 879 00:42:52,960 --> 00:42:58,450 squared plus x times mu over sigma squared, right? 880 00:42:58,450 --> 00:42:59,630 So that's this term here. 881 00:42:59,630 --> 00:43:01,060 That's this term here. 882 00:43:01,060 --> 00:43:04,129 Now I need to get this guy here, and that's minus.
883 00:43:04,129 --> 00:43:05,920 So I'm going to write it like this-- minus, 884 00:43:05,920 --> 00:43:09,950 and now I have mu squared over 2 sigma 885 00:43:09,950 --> 00:43:15,648 squared plus log sigma square root 2 pi. 886 00:43:22,210 --> 00:43:31,430 And now this thing is definitely of the form t of x times-- 887 00:43:31,430 --> 00:43:34,020 did I call them the right way or not? 888 00:43:34,020 --> 00:43:36,490 Of course not. 889 00:43:36,490 --> 00:43:39,450 OK, so that's going to be t2 of x times eta 890 00:43:39,450 --> 00:43:41,820 2 of-- eta 2 of theta. 891 00:43:41,820 --> 00:43:48,230 This guy is going to be t1 of x times eta 1 of theta. 892 00:43:48,230 --> 00:43:50,992 All right, so just a function of theta times a function of x-- 893 00:43:50,992 --> 00:43:52,950 just a function of theta times a function of x. 894 00:43:52,950 --> 00:43:55,680 And the way they're combined is just by summing them. 895 00:43:55,680 --> 00:43:58,690 And this is going to be my b of theta. 896 00:44:01,710 --> 00:44:04,700 What is h of x? 897 00:44:04,700 --> 00:44:06,145 AUDIENCE: 1. 898 00:44:06,145 --> 00:44:07,020 PHILIPPE RIGOLLET: 1. 899 00:44:07,020 --> 00:44:09,800 There's one thing I can actually play with, 900 00:44:09,800 --> 00:44:13,040 and this is something where you're going to have some free 901 00:44:13,040 --> 00:44:14,090 choices, right? 902 00:44:14,090 --> 00:44:19,850 This is not actually completely determined here is that-- 903 00:44:19,850 --> 00:44:27,220 for example, so when I write the log sigma square root 2 pi, 904 00:44:27,220 --> 00:44:32,660 this is just log of sigma plus log square root 2 pi. 905 00:44:32,660 --> 00:44:34,270 So I have two choices here.
906 00:44:34,270 --> 00:44:37,670 Either my b becomes this guy, or-- 907 00:44:37,670 --> 00:44:41,150 so either I have b of theta, which 908 00:44:41,150 --> 00:44:45,320 is mu squared over 2 sigma squared plus log sigma 909 00:44:45,320 --> 00:44:51,920 square root 2 pi and h of x is equal to 1, or I have 910 00:44:51,920 --> 00:44:56,120 that b of theta is mu squared over 2 sigma squared 911 00:44:56,120 --> 00:44:58,120 plus log sigma. 912 00:44:58,120 --> 00:44:59,750 And h of x is equal to what? 913 00:45:08,400 --> 00:45:10,300 Well, I can just push this guy out, right? 914 00:45:10,300 --> 00:45:12,160 I can push it out of the exponential. 915 00:45:12,160 --> 00:45:15,370 And so it's just 1 over square root of 2 pi, which is 916 00:45:15,370 --> 00:45:16,590 a function of x, technically. 917 00:45:16,590 --> 00:45:19,760 I mean, it's a constant function of x, but it's a function. 918 00:45:19,760 --> 00:45:22,420 So you can see that it's not completely clear 919 00:45:22,420 --> 00:45:25,090 how you're going to do the trade off, right? 920 00:45:25,090 --> 00:45:28,840 So the constant terms can go either in b or in h. 921 00:45:28,840 --> 00:45:33,660 But you know, why bother with tracking down b and h when 922 00:45:33,660 --> 00:45:35,410 you can actually stuff everything into one 923 00:45:35,410 --> 00:45:38,200 and just call h 1 and call it a day? 924 00:45:38,200 --> 00:45:40,770 Right, so you can just forget about h. 925 00:45:40,770 --> 00:45:43,790 You know it's 1 and think about the rest. 926 00:45:43,790 --> 00:45:46,410 h won't matter actually for estimation purposes or anything 927 00:45:46,410 --> 00:45:48,386 like this. 928 00:45:48,386 --> 00:45:50,760 All right, so that's basically everything that's written. 929 00:45:50,760 --> 00:45:55,040 When sigma square is known, what's 930 00:45:55,040 --> 00:46:00,020 happening is that this guy here is no longer 931 00:46:00,020 --> 00:46:03,640 a function of theta, right?
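The two-parameter Gaussian decomposition worked out on the board can be checked numerically. The sketch below is only an illustration, not part of the lecture: the function names and the `constant_in_h` flag are made up for this example, and it assumes t1(x) = x, t2(x) = x squared, eta 1 = mu over sigma squared, eta 2 = minus 1 over 2 sigma squared. It compares both ways of splitting the constant between b and h against the usual Gaussian density.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Textbook N(mu, sigma2) density, used as the reference value."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2, constant_in_h=False):
    """Same density written as h(x) * exp(eta1*t1(x) + eta2*t2(x) - b(theta)).

    Hypothetical helper for illustration; theta = (mu, sigma2).
    """
    t1, eta1 = x, mu / sigma2                # x times mu / sigma^2
    t2, eta2 = x ** 2, -1.0 / (2 * sigma2)   # x^2 times -1 / (2 sigma^2)
    sigma = math.sqrt(sigma2)
    if constant_in_h:
        # Second choice: the constant 1 / sqrt(2 pi) is pushed out into h.
        b = mu ** 2 / (2 * sigma2) + math.log(sigma)
        h = 1.0 / math.sqrt(2 * math.pi)
    else:
        # First choice: everything goes into b, and h is identically 1.
        b = mu ** 2 / (2 * sigma2) + math.log(sigma * math.sqrt(2 * math.pi))
        h = 1.0
    return h * math.exp(eta1 * t1 + eta2 * t2 - b)
```

Both settings of `constant_in_h` should agree with `gaussian_pdf` up to floating-point error, which is the sense in which the split between b and h is a free choice.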
932 00:46:03,640 --> 00:46:05,401 Agreed? 933 00:46:05,401 --> 00:46:06,650 This is no longer a parameter. 934 00:46:06,650 --> 00:46:14,990 When sigma square is known, then theta is equal to mu only. 935 00:46:14,990 --> 00:46:17,660 There's no sigma square going on. 936 00:46:17,660 --> 00:46:19,657 So this-- everything that depends on sigma square 937 00:46:19,657 --> 00:46:20,990 can be thought of as a constant. 938 00:46:20,990 --> 00:46:23,010 Think one. 939 00:46:23,010 --> 00:46:26,910 So in particular, this term here does not 940 00:46:26,910 --> 00:46:30,270 belong in the interaction between x and theta. 941 00:46:30,270 --> 00:46:37,150 It belongs to h, right? 942 00:46:37,150 --> 00:46:49,120 So if sigma is known, then this guy is only a function of h-- 943 00:46:49,120 --> 00:46:50,770 of x. 944 00:46:50,770 --> 00:47:01,420 So h of x becomes exponential of minus x squared 945 00:47:01,420 --> 00:47:05,674 over 2 sigma squared, right? 946 00:47:05,674 --> 00:47:06,840 That's just a function of x. 947 00:47:11,010 --> 00:47:11,681 Is that clear? 948 00:47:16,100 --> 00:47:18,650 So if you complete this computation, what you're 949 00:47:18,650 --> 00:47:28,402 going to get is that your new one parameter thing is that p 950 00:47:28,402 --> 00:47:35,760 theta of x is now equal to exponential x times mu 951 00:47:35,760 --> 00:47:39,227 over sigma squared minus-- 952 00:47:39,227 --> 00:47:40,560 well, it's still the same thing. 953 00:47:49,300 --> 00:47:51,390 And then you have your h of x that comes out-- 954 00:47:54,210 --> 00:47:58,370 exponential of minus x squared over 2 sigma squared. 955 00:47:58,370 --> 00:48:02,350 OK, so that's my h of x. 956 00:48:02,350 --> 00:48:05,960 That's still my b of theta. 957 00:48:05,960 --> 00:48:11,260 And this is my t1 of x. 958 00:48:11,260 --> 00:48:15,190 And this is my eta 1 of theta. 959 00:48:15,190 --> 00:48:18,060 And remember, theta is just equal to mu in this case.
960 00:48:22,680 --> 00:48:26,610 So if I ask you to prove that this distribution belongs 961 00:48:26,610 --> 00:48:29,390 to an exponential family, you just have to work it out. 962 00:48:29,390 --> 00:48:32,480 Typically, it's expanding what's in the exponential and see 963 00:48:32,480 --> 00:48:33,180 what's-- 964 00:48:33,180 --> 00:48:35,690 and just write it in this form and identify 965 00:48:35,690 --> 00:48:36,890 all the components, right? 966 00:48:36,890 --> 00:48:39,576 So here, notice those guys don't even get an index anymore 967 00:48:39,576 --> 00:48:40,950 because there's just one of them. 968 00:48:40,950 --> 00:48:45,629 So I wrote eta 1 and t1, but it's really just eta and t. 969 00:48:50,270 --> 00:48:54,410 Oh sorry, this guy also goes. 970 00:48:54,410 --> 00:48:56,240 This is also a constant, right? 971 00:48:56,240 --> 00:49:01,240 So I can actually just put 1 divided 972 00:49:01,240 --> 00:49:03,390 by sigma square root 2 pi. 973 00:49:03,390 --> 00:49:04,790 So h of x is what, actually? 974 00:49:08,718 --> 00:49:12,155 Is it the density of-- 975 00:49:12,155 --> 00:49:13,630 AUDIENCE: Standard [INAUDIBLE]. 976 00:49:13,630 --> 00:49:14,350 PHILIPPE RIGOLLET: It's not standard. 977 00:49:14,350 --> 00:49:15,340 It's centered. 978 00:49:15,340 --> 00:49:16,330 It has mean 0. 979 00:49:16,330 --> 00:49:18,810 But it has variance sigma squared, right? 980 00:49:18,810 --> 00:49:21,060 But it's the density of a Gaussian. 981 00:49:21,060 --> 00:49:23,620 And this is what I meant when I said 982 00:49:23,620 --> 00:49:27,280 h of x is really just telling you with respect to which 983 00:49:27,280 --> 00:49:30,920 distribution, which measure you're taking the density.
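For reference, the known-variance case described above can be written out as one line. This is a reconstruction from the spoken derivation, with theta = mu and sigma squared treated as a known constant; the uppercase B is used here only to keep the normalizing term distinct from other uses of b.

```latex
p_\mu(x)
  = \underbrace{\frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\Big(-\frac{x^2}{2\sigma^2}\Big)}_{h(x)}
    \,\exp\!\Big(
      \underbrace{x}_{T(x)}\cdot\underbrace{\frac{\mu}{\sigma^2}}_{\eta(\mu)}
      \;-\; \underbrace{\frac{\mu^2}{2\sigma^2}}_{B(\mu)}
    \Big)
```

Here h(x) is exactly the N(0, sigma squared) density, so the exponential factor can be read as the density of N(mu, sigma squared) with respect to the centered Gaussian.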
984 00:49:30,920 --> 00:49:33,310 And so this thing here is really telling you 985 00:49:33,310 --> 00:49:37,690 the density of my Gaussian with mean mu 986 00:49:37,690 --> 00:49:41,710 is equal to-- is this with respect to a centered Gaussian 987 00:49:41,710 --> 00:49:43,636 is this guy, right? 988 00:49:43,636 --> 00:49:44,510 That's what it means. 989 00:49:44,510 --> 00:49:46,310 If this thing ends up being a density, 990 00:49:46,310 --> 00:49:49,370 it just means that now you just have a new measure, which 991 00:49:49,370 --> 00:49:51,270 is this density. 992 00:49:51,270 --> 00:49:53,270 So it's just saying that the density 993 00:49:53,270 --> 00:49:57,560 of the Gaussian with mean mu with respect 994 00:49:57,560 --> 00:50:00,928 to the Gaussian with mean 0 is just this [INAUDIBLE] here. 995 00:50:05,140 --> 00:50:07,960 All right, so let's move on. 996 00:50:07,960 --> 00:50:11,050 So here, as I said, you could actually 997 00:50:11,050 --> 00:50:13,840 do all these computations and forget about the fact 998 00:50:13,840 --> 00:50:16,430 that x is continuous. 999 00:50:16,430 --> 00:50:20,690 You can actually do it with PMFs and do it for x is discrete. 1000 00:50:20,690 --> 00:50:23,540 This actually also tells you if you can actually 1001 00:50:23,540 --> 00:50:26,540 get the same form for your density, which 1002 00:50:26,540 --> 00:50:29,000 is of the form exponential times the product 1003 00:50:29,000 --> 00:50:32,060 of the the interaction between theta 1004 00:50:32,060 --> 00:50:34,010 and x is just taking this product, 1005 00:50:34,010 --> 00:50:36,950 then a function only of theta and of function only of x, 1006 00:50:36,950 --> 00:50:40,130 for the PMF, it also works. 1007 00:50:40,130 --> 00:50:42,050 OK, so I claim that the Bernoulli 1008 00:50:42,050 --> 00:50:44,960 belongs to this family. 
1009 00:50:44,960 --> 00:50:49,380 So the PMF of a Bernoulli-- 1010 00:50:49,380 --> 00:50:54,590 with, say, parameter p is p to the x times 1 minus p to the 1 minus x, 1011 00:50:54,590 --> 00:50:55,540 right? 1012 00:50:55,540 --> 00:51:00,440 Because we know-- so that's only for x equals 0 or 1. 1013 00:51:00,440 --> 00:51:03,160 And the reason is because when x is equal to 0, 1014 00:51:03,160 --> 00:51:04,060 this is 1 minus p. 1015 00:51:04,060 --> 00:51:06,627 When x is equal to 1, this is p. 1016 00:51:06,627 --> 00:51:08,210 OK, we've seen that when we're looking 1017 00:51:08,210 --> 00:51:11,730 at likelihoods for Bernoullis. 1018 00:51:11,730 --> 00:51:16,700 OK, it is not clear this is going to look like this at all. 1019 00:51:16,700 --> 00:51:19,610 But let's do it. 1020 00:51:19,610 --> 00:51:21,660 OK, so what does this thing look like? 1021 00:51:21,660 --> 00:51:23,150 Well, the first thing I want to do 1022 00:51:23,150 --> 00:51:24,710 is to make an exponential show up. 1023 00:51:24,710 --> 00:51:26,084 So what I'm going to write is I'm 1024 00:51:26,084 --> 00:51:31,190 going to write p to the x as exponential x log p, right? 1025 00:51:33,714 --> 00:51:35,630 And so I'm going to do that for the other one. 1026 00:51:35,630 --> 00:51:37,690 So this thing here-- 1027 00:51:37,690 --> 00:51:43,090 so I'm going to get exponential x log 1028 00:51:43,090 --> 00:51:47,038 p plus 1 minus x times log 1 minus p. 1029 00:51:51,250 --> 00:51:54,050 So what I need to do is to collect my terms in x 1030 00:51:54,050 --> 00:51:56,750 and my terms in whatever parameters I have, 1031 00:51:56,750 --> 00:51:59,080 so here theta is equal to p. 1032 00:52:03,180 --> 00:52:05,170 So if I do this, what I end up having 1033 00:52:05,170 --> 00:52:08,440 is equal to exponential-- 1034 00:52:08,440 --> 00:52:12,650 so the term in x is log p minus log 1 minus p. 1035 00:52:12,650 --> 00:52:18,140 So that's x times log p over 1 minus p.
1036 00:52:18,140 --> 00:52:20,050 And then the term that rests is just-- 1037 00:52:20,050 --> 00:52:23,276 that stays is just 1 times log 1 minus p. 1038 00:52:23,276 --> 00:52:25,400 But I want to see this as a minus something, right? 1039 00:52:25,400 --> 00:52:27,067 It was minus b of theta. 1040 00:52:27,067 --> 00:52:28,525 So I'm going to write it as minus-- 1041 00:52:32,890 --> 00:52:35,150 well, I can just keep the plus, and I'm going to do-- 1042 00:52:41,770 --> 00:52:44,362 and that's all [INAUDIBLE]. 1043 00:52:44,362 --> 00:52:46,340 A-ha! 1044 00:52:46,340 --> 00:52:48,060 Well, this is of the form exponential-- 1045 00:52:48,060 --> 00:52:50,940 something that depends only on x times something that depends 1046 00:52:50,940 --> 00:52:52,720 only on theta-- 1047 00:52:52,720 --> 00:52:56,000 minus a function that depends only on theta. 1048 00:52:56,000 --> 00:52:59,230 And then h of x is equal to 1 again. 1049 00:52:59,230 --> 00:53:00,630 OK, so let's see. 1050 00:53:00,630 --> 00:53:03,410 So I have t1 of x is equal to x. 1051 00:53:03,410 --> 00:53:04,880 That's this guy. 1052 00:53:04,880 --> 00:53:11,000 Eta 1 of theta is equal to log p over 1 minus p. 1053 00:53:11,000 --> 00:53:20,930 And b of theta is equal to log 1 over 1 minus p, OK? 1054 00:53:20,930 --> 00:53:26,470 And h of x is equal to 1, all right? 1055 00:53:29,310 --> 00:53:31,230 You guys want to do Poisson, or do you 1056 00:53:31,230 --> 00:53:32,313 want to have any homework? 1057 00:53:35,120 --> 00:53:37,670 It's a dilemma because that's an easy homework versus 1058 00:53:37,670 --> 00:53:41,410 no homework at all but maybe something more difficult. OK, 1059 00:53:41,410 --> 00:53:43,680 who wants to do it now? 1060 00:53:43,680 --> 00:53:46,331 Who does not want to raise their hand now? 1061 00:53:46,331 --> 00:53:47,747 Who wants to raise their hand now? 1062 00:53:47,747 --> 00:53:57,116 All right, so let's move on.
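The Bernoulli identification just read off the board (t of x equal to x, eta equal to log of p over 1 minus p, b equal to log of 1 over 1 minus p, h equal to 1) is easy to sanity-check. The sketch below is illustrative only; the function names are invented for this example.

```python
import math

def bernoulli_pmf(x, p):
    """Usual Bernoulli PMF p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, p):
    """Same PMF written as h(x) * exp(eta(p) * t(x) - b(p)) with h(x) = 1."""
    t = x                              # t(x) = x
    eta = math.log(p / (1 - p))        # eta(p) = log(p / (1 - p))
    b = math.log(1 / (1 - p))          # b(p) = log(1 / (1 - p)) = -log(1 - p)
    return math.exp(eta * t - b)
```

For x = 0 the exponent is just minus b, which gives 1 minus p; for x = 1 the two logarithms combine to log p, which gives p.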
1063 00:53:57,116 --> 00:53:59,240 I'll just do-- do you want to do the gammas instead 1064 00:53:59,240 --> 00:54:00,080 in the homework? 1065 00:54:00,080 --> 00:54:02,150 That's going to be fun. 1066 00:54:02,150 --> 00:54:04,450 I'm not even going to propose to do the gammas. 1067 00:54:04,450 --> 00:54:08,570 And so this is the gamma distribution. 1068 00:54:08,570 --> 00:54:10,820 It's brilliantly called gamma because it 1069 00:54:10,820 --> 00:54:14,480 has the gamma function just like the beta distribution had 1070 00:54:14,480 --> 00:54:16,400 the beta function in there. 1071 00:54:16,400 --> 00:54:17,690 They look very similar. 1072 00:54:17,690 --> 00:54:20,960 One is defined over r plus, the positive real line. 1073 00:54:20,960 --> 00:54:24,650 And remember, the beta was defined over the interval 0, 1. 1074 00:54:24,650 --> 00:54:28,640 And it's of the form x to some power times exponential 1075 00:54:28,640 --> 00:54:30,980 of minus x to some-- 1076 00:54:30,980 --> 00:54:32,340 times something, right? 1077 00:54:32,340 --> 00:54:34,298 So there's a function of polynomial [INAUDIBLE] 1078 00:54:34,298 --> 00:54:38,004 x where the exponent depends on the parameter. 1079 00:54:38,004 --> 00:54:40,670 And then there's the exponential minus x times something depends 1080 00:54:40,670 --> 00:54:41,930 on the parameters. 1081 00:54:41,930 --> 00:54:47,921 So this is going to also look like some function of x-- 1082 00:54:47,921 --> 00:54:49,670 sorry, like some exponential distribution. 1083 00:54:49,670 --> 00:54:52,280 Can somebody guess what is going to be t2 of x? 1084 00:54:58,338 --> 00:55:01,251 Oh, those are the functions of x that show up in this product, 1085 00:55:01,251 --> 00:55:01,750 right? 1086 00:55:01,750 --> 00:55:03,462 Remember when we have this-- 1087 00:55:03,462 --> 00:55:05,170 we just need to take some transformations 1088 00:55:05,170 --> 00:55:08,870 of x so it looks linear in those things and not in x itself. 
1089 00:55:08,870 --> 00:55:11,570 Remember, we had x squared and x, for example, 1090 00:55:11,570 --> 00:55:12,560 in the Gaussian case. 1091 00:55:12,560 --> 00:55:14,471 I don't know if it's still there. 1092 00:55:14,471 --> 00:55:15,720 Yeah, it's still there, right? 1093 00:55:15,720 --> 00:55:17,330 t2 was x squared. 1094 00:55:17,330 --> 00:55:20,540 What do you think x is going-- t2 of x here. 1095 00:55:20,540 --> 00:55:23,536 So here's a hint. t1 is going to be x. 1096 00:55:23,536 --> 00:55:24,410 AUDIENCE: [INAUDIBLE] 1097 00:55:24,410 --> 00:55:25,480 PHILIPPE RIGOLLET: Yeah, [INAUDIBLE],, 1098 00:55:25,480 --> 00:55:26,438 what is going to be t1? 1099 00:55:26,438 --> 00:55:27,890 Yeah, you can-- this one is taken. 1100 00:55:27,890 --> 00:55:28,690 This one is taken. 1101 00:55:31,313 --> 00:55:32,580 What? 1102 00:55:32,580 --> 00:55:33,700 Log x, right? 1103 00:55:33,700 --> 00:55:35,680 Because this x to the a minus 1, I'm 1104 00:55:35,680 --> 00:55:39,380 going to write that as exponential a minus 1 log x. 1105 00:55:39,380 --> 00:55:43,375 So basically, eta 1 is going to be a minus 1. 1106 00:55:43,375 --> 00:55:47,560 Eta 2 is going to be minus 1 over b-- 1107 00:55:47,560 --> 00:55:48,980 well, actually the opposite. 1108 00:55:48,980 --> 00:55:50,271 And then you're going to have-- 1109 00:55:50,271 --> 00:55:52,630 but this is actually not too complicated. 1110 00:55:52,630 --> 00:55:55,090 All right, then those parameters get names. 1111 00:55:55,090 --> 00:55:58,480 a is the shape parameter, b is the scale parameter. 1112 00:55:58,480 --> 00:56:00,280 It doesn't really matter. 1113 00:56:00,280 --> 00:56:02,710 You have other things that are called the inverse gamma 1114 00:56:02,710 --> 00:56:05,850 distribution, which has this form. 
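The gamma identification sketched verbally above can also be checked numerically. The density itself is on the slide rather than in the transcript, so this sketch assumes the common shape/scale form x^(a-1) exp(-x/b) / (Gamma(a) b^a) on the positive reals; the function names and the normalizing term `log_norm` are made up for the example. Following the hint that t1 is x and t2 is log x, it takes eta 1 = minus 1 over b and eta 2 = a minus 1.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and scale b (assumed parametrization)."""
    return x ** (a - 1) * math.exp(-x / b) / (math.gamma(a) * b ** a)

def gamma_expfam(x, a, b):
    """Same density as exp(eta1*t1(x) + eta2*t2(x) - log_norm) with h(x) = 1."""
    t1, eta1 = x, -1.0 / b             # the exp(-x / b) factor
    t2, eta2 = math.log(x), a - 1.0    # x^(a-1) = exp((a - 1) log x)
    # Everything that depends only on the parameters (a, b):
    log_norm = math.lgamma(a) + a * math.log(b)
    return math.exp(eta1 * t1 + eta2 * t2 - log_norm)
```

The only step that is not pure bookkeeping is rewriting the power of x as an exponential of log x, which is why log x has to appear as one of the t functions.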
1115 00:56:05,850 --> 00:56:09,360 The difference is that the parameter alpha 1116 00:56:09,360 --> 00:56:14,700 shows negatively there and then the inverse Gaussian 1117 00:56:14,700 --> 00:56:15,480 distribution. 1118 00:56:18,150 --> 00:56:20,220 You know, just densities you can come up with 1119 00:56:20,220 --> 00:56:23,305 and they just happened to fall in this family. 1120 00:56:23,305 --> 00:56:25,680 And there's other ones that you can actually put in there 1121 00:56:25,680 --> 00:56:26,640 that we've seen before. 1122 00:56:26,640 --> 00:56:28,695 The chi-square is actually part of this family. 1123 00:56:28,695 --> 00:56:30,570 The beta distribution is part of this family. 1124 00:56:30,570 --> 00:56:32,611 The binomial distribution is part of this family. 1125 00:56:32,611 --> 00:56:35,030 Well, that's easy because the Bernoulli was. 1126 00:56:35,030 --> 00:56:39,390 The negative binomial, which is some stopping time-- 1127 00:56:39,390 --> 00:56:42,600 the first time you hit a certain number of successes 1128 00:56:42,600 --> 00:56:46,120 when you flip some Bernoulli coins. 1129 00:56:46,120 --> 00:56:47,665 So you can check for all of those, 1130 00:56:47,665 --> 00:56:50,040 and you will see that you can actually write them as part 1131 00:56:50,040 --> 00:56:51,510 of the exponential family. 1132 00:56:51,510 --> 00:56:53,040 So the main goal of this slide is 1133 00:56:53,040 --> 00:56:54,581 to convince you that this is actually 1134 00:56:54,581 --> 00:56:56,400 a pretty broad range of distributions 1135 00:56:56,400 --> 00:57:00,360 because it basically includes everything we've seen 1136 00:57:00,360 --> 00:57:03,540 but not anything there-- 1137 00:57:03,540 --> 00:57:06,540 sorry, plus more, OK? 1138 00:57:06,540 --> 00:57:07,040 Yeah. 
1139 00:57:07,040 --> 00:57:09,040 AUDIENCE: Is there any example of a distribution 1140 00:57:09,040 --> 00:57:10,456 that comes up pretty often that's 1141 00:57:10,456 --> 00:57:11,801 not in the exponential family? 1142 00:57:11,801 --> 00:57:13,384 PHILIPPE RIGOLLET: Yeah, like uniform. 1143 00:57:13,384 --> 00:57:16,312 AUDIENCE: Oh, OK, so maybe a bit more complicated than 1144 00:57:16,312 --> 00:57:17,702 [INAUDIBLE]. 1145 00:57:17,702 --> 00:57:19,410 PHILIPPE RIGOLLET: Anything that has a support that 1146 00:57:19,410 --> 00:57:21,740 depends on the parameter is not going to fall-- 1147 00:57:21,740 --> 00:57:24,410 is not going to fit in there. 1148 00:57:24,410 --> 00:57:26,570 Right, and you can actually convince yourself 1149 00:57:26,570 --> 00:57:31,910 why anything that has the support that 1150 00:57:31,910 --> 00:57:33,680 does not-- that depends on the parameter 1151 00:57:33,680 --> 00:57:35,310 is not going to be part of this guy. 1152 00:57:35,310 --> 00:57:37,460 It's kind of a hard thing to-- 1153 00:57:37,460 --> 00:57:42,330 in fact, to prove that it's not, you'd have to prove this rule. 1154 00:57:42,330 --> 00:57:43,850 That's kind of a little difficult, 1155 00:57:43,850 --> 00:57:46,190 but the way you can convince yourself is that remember, 1156 00:57:46,190 --> 00:57:49,910 the only interaction between x and theta that I allowed 1157 00:57:49,910 --> 00:57:51,470 was taking the product of those guys 1158 00:57:51,470 --> 00:57:54,160 and then the exponential, right? 1159 00:57:54,160 --> 00:57:56,660 If you have something that depends on some parameter-- 1160 00:57:56,660 --> 00:57:59,740 let's say you're going to see something that looks like this. 1161 00:57:59,740 --> 00:58:01,510 Right, for uniform, it looks like this. 1162 00:58:04,720 --> 00:58:08,210 Well, this is not of the form exponential x times theta.
1163 00:58:08,210 --> 00:58:10,990 There's an interaction between x and theta here, 1164 00:58:10,990 --> 00:58:12,840 but it's actually certainly not of the form 1165 00:58:12,840 --> 00:58:14,580 exponential x times theta. 1166 00:58:14,580 --> 00:58:16,680 So this is definitely not going to be 1167 00:58:16,680 --> 00:58:18,210 part of the exponential family. 1168 00:58:18,210 --> 00:58:20,680 And every time you start doing things like that, 1169 00:58:20,680 --> 00:58:21,930 it's just not going to happen. 1170 00:58:25,790 --> 00:58:28,370 Actually, to be fair, I'm not even sure 1171 00:58:28,370 --> 00:58:30,680 that all these guys, when you allow 1172 00:58:30,680 --> 00:58:32,600 them to have all their parameters free, 1173 00:58:32,600 --> 00:58:34,810 are actually going to be part of this. 1174 00:58:34,810 --> 00:58:36,500 For example-- the beta probably is, 1175 00:58:36,500 --> 00:58:38,730 but I'm not actually entirely convinced. 1176 00:58:43,140 --> 00:58:47,320 There's books on exponential families. 1177 00:58:47,320 --> 00:58:48,970 All right, so let's go back. 1178 00:58:48,970 --> 00:58:52,060 So here, we've put a lot of effort into understanding 1179 00:58:52,060 --> 00:58:57,160 how big, how much wider than the Gaussian distribution 1180 00:58:57,160 --> 00:59:01,630 can we think of for the conditional distribution 1181 00:59:01,630 --> 00:59:04,030 of our response y given x. 1182 00:59:04,030 --> 00:59:06,620 So let's go back to the generalized linear models, 1183 00:59:06,620 --> 00:59:07,120 right? 1184 00:59:07,120 --> 00:59:09,870 So [INAUDIBLE] said, OK, the random component? 1185 00:59:09,870 --> 00:59:11,800 y has to be part of some exponential family 1186 00:59:11,800 --> 00:59:13,090 distribution-- check. 1187 00:59:13,090 --> 00:59:14,726 We know what this means. 1188 00:59:14,726 --> 00:59:16,350 So now I have to understand two things. 1189 00:59:16,350 --> 00:59:20,127 I have to understand what is the expectation, right?
1190 00:59:20,127 --> 00:59:21,960 Because that's actually what I model, right? 1191 00:59:21,960 --> 00:59:24,160 I take the expectation, the conditional expectation, 1192 00:59:24,160 --> 00:59:24,850 of y given x. 1193 00:59:24,850 --> 00:59:27,100 So I need to understand given this guy, 1194 00:59:27,100 --> 00:59:30,250 it would be nice if you had some simple rules that would tell me 1195 00:59:30,250 --> 00:59:32,950 exactly what the expectation is rather than having to do it 1196 00:59:32,950 --> 00:59:34,360 over and over again, right? 1197 00:59:34,360 --> 00:59:36,100 If I told you, here's a Gaussian, 1198 00:59:36,100 --> 00:59:37,600 compute the expectation, every time 1199 00:59:37,600 --> 00:59:40,750 you had to use that would be slightly painful. 1200 00:59:40,750 --> 00:59:43,510 So hopefully, this thing being simple enough-- 1201 00:59:43,510 --> 00:59:45,870 we've actually selected a class that's 1202 00:59:45,870 --> 00:59:47,590 simple enough so that we can have rules. 1203 00:59:47,590 --> 00:59:52,360 Whereas as soon as they give you those parameters t1, t2, eta 1, 1204 00:59:52,360 --> 00:59:55,810 eta 2, b and h, you can actually have some simple rules 1205 00:59:55,810 --> 01:00:00,370 to compute the mean and variance and all those things. 1206 01:00:00,370 --> 01:00:03,520 And so in particular, I'm interested in the mean, 1207 01:00:03,520 --> 01:00:05,950 and I'm going to have to actually say, well, you know, 1208 01:00:05,950 --> 01:00:09,770 this mean has to be mapped into the whole real line. 1209 01:00:09,770 --> 01:00:12,040 So I can actually talk about modeling this function 1210 01:00:12,040 --> 01:00:14,410 of the mean as x transpose beta. 1211 01:00:14,410 --> 01:00:17,380 And we saw that for the [INAUDIBLE] dataset 1212 01:00:17,380 --> 01:00:21,400 or whatever other data sets. 
1213 01:00:21,400 --> 01:00:24,250 You actually can-- you can actually do this using the log 1214 01:00:24,250 --> 01:00:27,960 of the reciprocal or for the-- 1215 01:00:27,960 --> 01:00:30,050 oh, actually, we didn't do it for the Bernoulli. 1216 01:00:30,050 --> 01:00:30,940 We'll come to this. 1217 01:00:30,940 --> 01:00:32,981 This is the most important one, and that's called 1218 01:00:32,981 --> 01:00:34,510 a logit or a logistic link. 1219 01:00:37,090 --> 01:00:39,230 But before we go there, this was actually 1220 01:00:39,230 --> 01:00:42,320 a very broad family, right? 1221 01:00:42,320 --> 01:00:44,995 When I wrote this thing on the bottom board-- it's gone now, 1222 01:00:44,995 --> 01:00:46,620 but when I wrote it in the first place, 1223 01:00:46,620 --> 01:00:48,870 the only thing that I wrote is I wanted x times theta. 1224 01:00:48,870 --> 01:00:51,119 Wouldn't it be nice if you have some distribution that 1225 01:00:51,119 --> 01:00:53,230 was just x times theta, not some function of x 1226 01:00:53,230 --> 01:00:54,660 times some function of theta? 1227 01:00:54,660 --> 01:00:58,050 The functions seem to be here so that they actually 1228 01:00:58,050 --> 01:01:02,610 make things a little-- 1229 01:01:02,610 --> 01:01:05,160 so the functions were here so that I can actually 1230 01:01:05,160 --> 01:01:06,480 put a lot of functions there. 1231 01:01:06,480 --> 01:01:08,430 But first of all, if I actually decide 1232 01:01:08,430 --> 01:01:10,680 to re-parametrize my problem, I can always 1233 01:01:10,680 --> 01:01:12,180 assume-- if I'm one dimensional, I 1234 01:01:12,180 --> 01:01:14,970 can always assume that eta 1 of theta 1235 01:01:14,970 --> 01:01:17,460 becomes my new theta, right? 1236 01:01:17,460 --> 01:01:20,772 So this thing-- here for example, 1237 01:01:20,772 --> 01:01:22,230 I could say, well, this is actually 1238 01:01:22,230 --> 01:01:23,510 the parameter of my Bernoulli.
1239 01:01:23,510 --> 01:01:25,950 Let me call this guy theta, right? 1240 01:01:25,950 --> 01:01:28,230 I could do that. 1241 01:01:28,230 --> 01:01:31,230 Then I could say, well, here I have x that shows up here. 1242 01:01:31,230 --> 01:01:33,980 And here since I'm talking about the response, 1243 01:01:33,980 --> 01:01:35,690 I cannot really make any transformations. 1244 01:01:35,690 --> 01:01:38,240 So here, I'm going to actually talk about a specific family 1245 01:01:38,240 --> 01:01:41,920 for which this guy is not x square or square root of x 1246 01:01:41,920 --> 01:01:43,350 or log of x or anything I want. 1247 01:01:43,350 --> 01:01:45,349 I'm just going to actually look at distributions 1248 01:01:45,349 --> 01:01:46,430 for which this is x. 1249 01:01:46,430 --> 01:01:48,485 This exponential families are called 1250 01:01:48,485 --> 01:01:51,140 a canonical exponential family. 1251 01:01:51,140 --> 01:01:55,010 So in the canonical exponential family, what I have 1252 01:01:55,010 --> 01:01:57,230 is that I have my x times theta. 1253 01:01:57,230 --> 01:01:59,959 I'm going to allow myself some normalization factor phi, 1254 01:01:59,959 --> 01:02:01,500 and we'll see, for example, that it's 1255 01:02:01,500 --> 01:02:05,330 very convenient when I talk about the Gaussian, right? 1256 01:02:05,330 --> 01:02:07,830 Because even if I know-- 1257 01:02:11,250 --> 01:02:15,134 yeah, even if I know this guy, which I actually pull into my-- 1258 01:02:15,134 --> 01:02:16,300 oh, that's over here, right? 1259 01:02:20,970 --> 01:02:23,155 Right, I know sigma squared. 1260 01:02:23,155 --> 01:02:24,780 But I don't want to change my parameter 1261 01:02:24,780 --> 01:02:26,290 to be mu over sigma squared. 1262 01:02:26,290 --> 01:02:27,490 It's kind of painful. 1263 01:02:27,490 --> 01:02:30,120 So I just take mu, and I'm going to keep this guy 1264 01:02:30,120 --> 01:02:31,980 as being this phi over there. 
1265 01:02:31,980 --> 01:02:34,830 And it's called the dispersion parameter 1266 01:02:34,830 --> 01:02:38,010 from a clear analogy with the Gaussian, right? 1267 01:02:38,010 --> 01:02:41,580 That's the variance and that's measuring dispersion. 1268 01:02:41,580 --> 01:02:45,540 OK, so here, what I want is I'm going 1269 01:02:45,540 --> 01:02:49,450 to think throughout this class-- so phi may be known or not. 1270 01:02:49,450 --> 01:02:51,390 And depending-- when it's not known, 1271 01:02:51,390 --> 01:02:54,360 this actually might turn into some exponential family 1272 01:02:54,360 --> 01:02:55,470 or it might not. 1273 01:02:55,470 --> 01:03:01,380 And the main reason is because this b of theta over phi 1274 01:03:01,380 --> 01:03:04,950 is not necessarily a function of theta over phi, right? 1275 01:03:04,950 --> 01:03:09,660 If I actually have phi unknown, then y theta over phi 1276 01:03:09,660 --> 01:03:10,740 has to be-- 1277 01:03:10,740 --> 01:03:13,390 this guy has to be my new parameter. 1278 01:03:13,390 --> 01:03:17,930 And b might not be a function of this new parameter. 1279 01:03:17,930 --> 01:03:21,860 OK, so in a way, it may or may not, 1280 01:03:21,860 --> 01:03:24,710 but this is not really a concern that we're going to have 1281 01:03:24,710 --> 01:03:26,810 because throughout this class, we're 1282 01:03:26,810 --> 01:03:29,195 going to assume that phi is known, OK? 1283 01:03:29,195 --> 01:03:31,820 Phi is going to be known all the time, which means that this is 1284 01:03:31,820 --> 01:03:34,334 always an exponential family. 1285 01:03:34,334 --> 01:03:35,750 And it's just the simplest one you 1286 01:03:35,750 --> 01:03:38,270 could think of-- one dimensional parameter, one 1287 01:03:38,270 --> 01:03:42,800 dimensional response, and I just have-- the product is just y 1288 01:03:42,800 --> 01:03:45,050 times or, we used to call it x. 1289 01:03:45,050 --> 01:03:49,670 Now I've switched to y, but y times theta divided by phi, OK? 
1290 01:03:52,550 --> 01:03:56,120 Should I write this, or is it clear to everyone what this is? 1291 01:03:56,120 --> 01:03:58,665 Let me write it somewhere so we actually keep track of it 1292 01:03:58,665 --> 01:04:00,565 toward the [INAUDIBLE]. 1293 01:04:05,800 --> 01:04:07,990 OK, so this is-- 1294 01:04:07,990 --> 01:04:11,620 remember, we had all the distributions. 1295 01:04:11,620 --> 01:04:15,670 And then here we had the exponential family. 1296 01:04:15,670 --> 01:04:18,610 And now we have the canonical exponential family. 1297 01:04:21,280 --> 01:04:24,200 It's actually much, much smaller. 1298 01:04:24,200 --> 01:04:26,950 Well, actually, it's probably sort of a good picture. 1299 01:04:26,950 --> 01:04:32,620 And what I have is that my density or my PMF 1300 01:04:32,620 --> 01:04:37,120 is just exponential of y times theta minus b 1301 01:04:37,120 --> 01:04:41,020 of theta divided by phi. 1302 01:04:41,020 --> 01:04:46,480 And I have plus c of-- 1303 01:04:46,480 --> 01:04:53,820 oh, yeah, plus c of y, phi, which 1304 01:04:53,820 --> 01:04:58,680 means that this is really-- if phi is known, h of y 1305 01:04:58,680 --> 01:05:05,742 is just exponential of c of y, phi, agreed? 1306 01:05:05,742 --> 01:05:07,950 Actually, this is the reason why it's not necessarily 1307 01:05:07,950 --> 01:05:10,410 a canonical family. 1308 01:05:10,410 --> 01:05:12,990 It might not be that this depends only on y. 1309 01:05:12,990 --> 01:05:15,400 It could depend on y and phi in some annoying way 1310 01:05:15,400 --> 01:05:18,950 and I may not be able to break it. 1311 01:05:18,950 --> 01:05:21,220 OK, but if phi is known, this is just a function 1312 01:05:21,220 --> 01:05:23,580 that depends on y, agreed? 1313 01:05:28,290 --> 01:05:29,670 In particular, I think you need-- 1314 01:05:29,670 --> 01:05:31,753 I hope you can convince yourself that this is just 1315 01:05:31,753 --> 01:05:33,750 a subcase of everything we've seen before.
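Written out, the canonical exponential family just described has the following density or PMF. This is a transcription of the spoken formula, with c denoting the extra term that depends only on y and phi:

```latex
f_\theta(y) = \exp\!\left( \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right)
```

When phi is known, h(y) = exp(c(y, phi)) depends on y alone, so this is a one-parameter exponential family in theta.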
1316 01:05:41,990 --> 01:05:44,800 So for example, the Gaussian when the variance is known 1317 01:05:44,800 --> 01:05:47,010 is indeed of this form, right? 1318 01:05:47,010 --> 01:05:49,220 So we still have it on the board. 1319 01:05:49,220 --> 01:05:51,040 So here is my y, right? 1320 01:05:51,040 --> 01:05:53,950 So then let me write this as f theta of y. 1321 01:05:53,950 --> 01:05:59,030 So every x is replaceable with y, blah, blah, blah. 1322 01:05:59,030 --> 01:06:01,330 This is this guy. 1323 01:06:01,330 --> 01:06:07,120 And now what I have is that this is going to be my phi. 1324 01:06:07,120 --> 01:06:10,630 This is my parameter theta. 1325 01:06:10,630 --> 01:06:14,320 So I'm definitely of the form y times theta divided by phi. 1326 01:06:14,320 --> 01:06:16,440 And then here I have a function b 1327 01:06:16,440 --> 01:06:20,890 that depends only on theta, over phi again. 1328 01:06:20,890 --> 01:06:27,040 So b of theta is mu squared divided by 2. 1329 01:06:31,000 --> 01:06:33,890 OK, then it's divided by sigma square. 1330 01:06:33,890 --> 01:06:35,519 And then I have this extra stuff. 1331 01:06:35,519 --> 01:06:37,310 But I really don't care what it is for now. 1332 01:06:37,310 --> 01:06:42,140 It's just something that depends only on y and known stuff. 1333 01:06:42,140 --> 01:06:44,150 So it was just a function of y just like my h. 1334 01:06:44,150 --> 01:06:47,180 I stuff everything in there. 1335 01:06:47,180 --> 01:06:50,090 The b, though, this thing here, this 1336 01:06:50,090 --> 01:06:52,229 is actually what's important because 1337 01:06:52,229 --> 01:06:53,770 in the canonical family, if you think 1338 01:06:53,770 --> 01:06:57,060 about it, when you know phi-- 1339 01:06:57,060 --> 01:07:03,270 sorry-- right, this is just y times theta 1340 01:07:03,270 --> 01:07:05,790 scaled by a known constant-- sorry, y times 1341 01:07:05,790 --> 01:07:08,160 theta scaled by a known constant is the first term.
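The Gaussian identification above (theta equal to mu, phi equal to sigma squared, b of theta equal to theta squared over 2) can be checked numerically. The following is a minimal sketch; the function names and the explicit form of c are choices made here for illustration, not something written on the board in the lecture.

```python
import math

def gaussian_pdf(y, mu, sigma2):
    """Usual N(mu, sigma2) density."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def canonical_pdf(y, theta, phi, b, c):
    """Canonical form: exp((y*theta - b(theta)) / phi + c(y, phi))."""
    return math.exp((y * theta - b(theta)) / phi + c(y, phi))

def b_gauss(theta):
    # b(theta) = theta^2 / 2, with theta = mu
    return theta ** 2 / 2

def c_gauss(y, phi):
    # Everything that depends only on y and the known phi = sigma^2
    return -y ** 2 / (2 * phi) - 0.5 * math.log(2 * math.pi * phi)

def max_gap(mu=1.3, sigma2=0.7):
    """Largest absolute difference between the two forms on a few test points."""
    ys = [-2.0, -0.5, 0.0, 0.8, 3.1]
    return max(
        abs(gaussian_pdf(y, mu, sigma2) - canonical_pdf(y, mu, sigma2, b_gauss, c_gauss))
        for y in ys
    )

if __name__ == "__main__":
    print(max_gap())
```

The gap should be at the level of floating-point rounding, which is just the algebra of expanding the square in the Gaussian exponent.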
1342 01:07:08,160 --> 01:07:12,000 The second term is b of theta scaled by some known constant. 1343 01:07:12,000 --> 01:07:13,860 But b of theta is what's going to make 1344 01:07:13,860 --> 01:07:17,580 the difference between the Gaussian and Bernoullis 1345 01:07:17,580 --> 01:07:19,680 and gammas and betas-- 1346 01:07:19,680 --> 01:07:21,750 this is all in this b of theta. b of theta 1347 01:07:21,750 --> 01:07:25,050 contains everything that's idiosyncratic to 1348 01:07:25,050 --> 01:07:27,270 this particular distribution. 1349 01:07:27,270 --> 01:07:29,000 And so this is going to be important. 1350 01:07:29,000 --> 01:07:32,120 And we will see that b of theta is going to capture information 1351 01:07:32,120 --> 01:07:34,230 about the mean, about the variance, 1352 01:07:34,230 --> 01:07:37,133 about likelihood, about everything. 1353 01:07:44,710 --> 01:07:46,420 Should I go through this computation? 1354 01:07:46,420 --> 01:07:47,647 I mean, it's the same. 1355 01:07:47,647 --> 01:07:48,730 We've just done it, right? 1356 01:07:48,730 --> 01:07:53,750 So maybe it's probably better if you can redo it on your own. 1357 01:07:53,750 --> 01:07:56,680 All right, so the canonical exponential family also 1358 01:07:56,680 --> 01:07:58,210 has other distributions, right? 1359 01:07:58,210 --> 01:08:00,890 So there's the Gaussian and there's the Poisson 1360 01:08:00,890 --> 01:08:02,410 and there's the Bernoulli. 1361 01:08:02,410 --> 01:08:05,250 But the other ones may not be part of this, right? 1362 01:08:05,250 --> 01:08:07,810 In particular, think about the gamma distribution. 1363 01:08:07,810 --> 01:08:13,600 We had this-- log x was one of the things that showed up. 1364 01:08:13,600 --> 01:08:15,670 I mean, I cannot get rid of this log x. 1365 01:08:15,670 --> 01:08:18,729 I mean, that's part of it except if a is equal to 1 1366 01:08:18,729 --> 01:08:20,380 and I know it for sure, right? 
1367 01:08:20,380 --> 01:08:23,979 So if a is equal to 1, then I'm going to have a minus 1, 1368 01:08:23,979 --> 01:08:25,000 which is equal to 0. 1369 01:08:25,000 --> 01:08:27,160 So I'm going to have a minus 1 times log 1370 01:08:27,160 --> 01:08:28,630 x, which is going to be just 0. 1371 01:08:28,630 --> 01:08:30,560 So log x is going to vanish from here. 1372 01:08:30,560 --> 01:08:33,550 But if a is equal to 1, then this distribution 1373 01:08:33,550 --> 01:08:36,250 is actually much nicer, and it actually does not even 1374 01:08:36,250 --> 01:08:37,450 deserve the name gamma. 1375 01:08:37,450 --> 01:08:38,890 What is it if a is equal to 1? 1376 01:08:42,444 --> 01:08:43,569 It's an exponential, right? 1377 01:08:43,569 --> 01:08:47,779 Gamma of 1 is equal to 1. x to the a minus 1 is equal to 1. 1378 01:08:47,779 --> 01:08:51,529 b-- so I have exponential of minus x over b divided by b. 1379 01:08:51,529 --> 01:08:53,520 So 1 over b-- call it lambda. 1380 01:08:53,520 --> 01:08:56,759 And this is just an exponential distribution. 1381 01:08:56,759 --> 01:08:58,800 And so every time you're going to see something-- 1382 01:08:58,800 --> 01:09:02,590 so all these guys that don't make it to this table, 1383 01:09:02,590 --> 01:09:06,094 they could be part of those guys, but they're just more-- 1384 01:09:06,094 --> 01:09:09,050 they're just to-- 1385 01:09:09,050 --> 01:09:10,939 they just have another name in this thing. 1386 01:09:10,939 --> 01:09:13,970 All right, so you could compute the value of theta 1387 01:09:13,970 --> 01:09:15,520 for different values, right? 1388 01:09:15,520 --> 01:09:18,714 So again, you still have some continuous or discrete ones. 1389 01:09:18,714 --> 01:09:19,630 This is my b of theta. 1390 01:09:19,630 --> 01:09:22,520 And I said this is actually really what captures my theta. 1391 01:09:22,520 --> 01:09:26,450 This b is actually called the cumulant generating function, 1392 01:09:26,450 --> 01:09:27,160 OK?
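Going back to the gamma remark a moment ago: with a equal to 1 the log x term vanishes and what is left is an exponential distribution with rate lambda equal to 1 over b. Here is a small sketch of that check. It assumes the shape and scale parametrization of the gamma density (a is the shape, b is the scale), which is what the spoken formula suggests but is an assumption here.

```python
import math

def gamma_pdf(x, a, b):
    """Gamma density with shape a and scale b (assumed parametrization)."""
    return x ** (a - 1) * math.exp(-x / b) / (math.gamma(a) * b ** a)

def exponential_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * math.exp(-lam * x)

def gamma_one_matches_exponential(b=2.5):
    """True if Gamma(a=1, scale=b) agrees with Exponential(rate=1/b) on test points."""
    xs = [0.1, 0.5, 1.0, 3.0, 10.0]
    return all(abs(gamma_pdf(x, 1.0, b) - exponential_pdf(x, 1.0 / b)) < 1e-12 for x in xs)

if __name__ == "__main__":
    print(gamma_one_matches_exponential())
```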
1393 01:09:27,160 --> 01:09:28,300 I don't have time. 1394 01:09:28,300 --> 01:09:30,370 I could write five slides to explain to you, 1395 01:09:30,370 --> 01:09:32,729 but it would just only tell you why it's called 1396 01:09:32,729 --> 01:09:34,390 cumulant generating function. 1397 01:09:34,390 --> 01:09:38,090 It's also known as the log of the moment generating function. 1398 01:09:38,090 --> 01:09:42,195 And the reason it's called cumulant generating function 1399 01:09:42,195 --> 01:09:44,320 is because if I start taking successive derivatives 1400 01:09:44,320 --> 01:09:47,584 and evaluating them at 0, I get the successive cumulants 1401 01:09:47,584 --> 01:09:50,859 of this distribution, which are some transformation 1402 01:09:50,859 --> 01:09:51,815 of the moments. 1403 01:09:51,815 --> 01:09:53,654 AUDIENCE: What are you talking about again? 1404 01:09:53,654 --> 01:09:55,070 PHILIPPE RIGOLLET: The function b. 1405 01:09:55,070 --> 01:09:55,945 AUDIENCE: [INAUDIBLE] 1406 01:09:55,945 --> 01:09:57,986 PHILIPPE RIGOLLET: So this is just normalization. 1407 01:09:57,986 --> 01:10:00,170 So this is just to tell you I can compute this, 1408 01:10:00,170 --> 01:10:01,640 but I really don't care. 1409 01:10:01,640 --> 01:10:04,460 And obviously I don't care about stuff that's complicated. 1410 01:10:04,460 --> 01:10:07,316 This is actually cute, and this is what completes everything. 1411 01:10:07,316 --> 01:10:09,440 And the rest is just like some general description. 1412 01:10:09,440 --> 01:10:11,930 You only need to tell you that the range of y 1413 01:10:11,930 --> 01:10:14,090 is 0 to infinity, right? 1414 01:10:14,090 --> 01:10:16,469 And that is essentially telling me 1415 01:10:16,469 --> 01:10:19,010 this is going to give me some hints as to which link function 1416 01:10:19,010 --> 01:10:20,180 I should be using, right?
1417 01:10:20,180 --> 01:10:21,710 Because the range of y tells me what 1418 01:10:21,710 --> 01:10:23,846 the range of the expectation of y is going to be. 1419 01:10:23,846 --> 01:10:25,970 All right, so here, it tells me that the range of y 1420 01:10:25,970 --> 01:10:28,850 is between 0 and 1. 1421 01:10:28,850 --> 01:10:30,500 OK, so what I want to show you is 1422 01:10:30,500 --> 01:10:33,134 that this captures a variety of different ranges 1423 01:10:33,134 --> 01:10:34,088 that you can have. 1424 01:10:40,300 --> 01:10:46,570 OK, so I'm going to want to go into the likelihood. 1425 01:10:46,570 --> 01:10:48,460 And the likelihood I'm actually going 1426 01:10:48,460 --> 01:10:50,780 to use to compute the expectations. 1427 01:10:50,780 --> 01:10:52,840 But since I actually don't have time 1428 01:10:52,840 --> 01:10:55,690 to do this now, let's just go quickly through this 1429 01:10:55,690 --> 01:10:59,770 and give you a spoiler alert to make sure that you all wake up 1430 01:10:59,770 --> 01:11:01,270 on Thursday and really, really want 1431 01:11:01,270 --> 01:11:03,160 to think about coming here immediately. 1432 01:11:03,160 --> 01:11:05,470 All right, so the thing I'm going to want to do, 1433 01:11:05,470 --> 01:11:07,570 as I said, is it would be nice if, at least 1434 01:11:07,570 --> 01:11:11,434 for this canonical family, when I give you b, 1435 01:11:11,434 --> 01:11:12,850 you would be able to say, oh, here 1436 01:11:12,850 --> 01:11:16,340 is a simple computation of b that would actually give me 1437 01:11:16,340 --> 01:11:17,530 the mean and the variance. 1438 01:11:17,530 --> 01:11:20,590 The mean and the variance are also known as moments. 1439 01:11:20,590 --> 01:11:22,970 b is called cumulant generating function. 1440 01:11:22,970 --> 01:11:24,940 So it sounds like moments being related 1441 01:11:24,940 --> 01:11:28,060 to cumulants, I might have a path to finding those, right?
1442 01:11:28,060 --> 01:11:31,660 And it might involve taking derivatives of b, as we'll see. 1443 01:11:31,660 --> 01:11:33,330 The way we're going to prove this 1444 01:11:33,330 --> 01:11:36,820 is by using this thing that we've used several times. 1445 01:11:36,820 --> 01:11:39,354 So this property we used when we were computing, 1446 01:11:39,354 --> 01:11:41,020 remember, the Fisher information, right? 1447 01:11:41,020 --> 01:11:43,420 We had two formulas for the Fisher information. 1448 01:11:43,420 --> 01:11:49,210 One was the expectation of the second derivative of the log 1449 01:11:49,210 --> 01:11:53,026 likelihood, and one was negative expectation of the square-- 1450 01:11:53,026 --> 01:11:55,150 sorry, expectation of the square, and the other one 1451 01:11:55,150 --> 01:11:57,970 was negative the expectation of the second derivative, right? 1452 01:11:57,970 --> 01:12:00,850 The log likelihood is concave, so this number is negative, 1453 01:12:00,850 --> 01:12:02,470 this number is positive. 1454 01:12:02,470 --> 01:12:04,990 And the way we did this is by just permuting some derivative 1455 01:12:04,990 --> 01:12:06,004 and integral here. 1456 01:12:06,004 --> 01:12:08,170 And there was just-- we used the fact that something 1457 01:12:08,170 --> 01:12:09,378 that looked like this, right? 1458 01:12:09,378 --> 01:12:13,780 The log likelihood is log of f theta. 1459 01:12:13,780 --> 01:12:20,500 And when I take the derivative of this guy with respect 1460 01:12:20,500 --> 01:12:24,690 to theta, then I have something that 1461 01:12:24,690 --> 01:12:30,460 looks like the derivative divided by f theta. 1462 01:12:30,460 --> 01:12:34,020 And if I start taking the integral against f theta 1463 01:12:34,020 --> 01:12:39,270 of this thing, so the expectation of this thing, 1464 01:12:39,270 --> 01:12:42,420 those things would cancel.
1465 01:12:42,420 --> 01:12:45,739 And then I had just the integral of a derivative, which 1466 01:12:45,739 --> 01:12:48,030 I would make a leap of faith and say that it's actually 1467 01:12:48,030 --> 01:12:49,321 the derivative of the integral. 1468 01:12:53,770 --> 01:12:56,000 But this was equal to 1. 1469 01:12:56,000 --> 01:12:58,404 So this derivative was actually equal to 0. 1470 01:12:58,404 --> 01:13:00,320 And so that's how you got that the expectation 1471 01:13:00,320 --> 01:13:02,930 of the derivative of the log likelihood is equal to 0. 1472 01:13:02,930 --> 01:13:04,940 And you do it once again and you get this guy. 1473 01:13:04,940 --> 01:13:06,350 It's just some nice things that happen 1474 01:13:06,350 --> 01:13:08,433 with the [INAUDIBLE] taking derivative of the log. 1475 01:13:08,433 --> 01:13:10,430 We've done that, we'll do that again. 1476 01:13:10,430 --> 01:13:13,660 But once you do this, you can actually apply it. 1477 01:13:13,660 --> 01:13:17,580 And-- missing a parenthesis over there. 1478 01:13:17,580 --> 01:13:19,610 So when you write the log likelihood, 1479 01:13:19,610 --> 01:13:21,266 it's just log of an exponential. 1480 01:13:21,266 --> 01:13:22,640 Huh, that's actually pretty nice. 1481 01:13:22,640 --> 01:13:25,020 Just like the least squares came naturally, the least 1482 01:13:25,020 --> 01:13:26,436 squares [INAUDIBLE] came naturally 1483 01:13:26,436 --> 01:13:29,204 when we took the log likelihood of the Gaussians, 1484 01:13:29,204 --> 01:13:31,370 we're going to have the same thing that happens when 1485 01:13:31,370 --> 01:13:33,080 I take the log of the density. 1486 01:13:33,080 --> 01:13:35,300 The exponential is going to go away, 1487 01:13:35,300 --> 01:13:36,990 and then I'm going to use this formula. 1488 01:13:36,990 --> 01:13:39,770 But this formula is going to actually give me 1489 01:13:39,770 --> 01:13:43,026 an equation directly-- oh, that's where it was. 
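The chain of equalities described here can be written in one line. This is a sketch of the spoken argument, and it relies on the same leap of faith the lecture mentions, namely that the derivative and the integral can be exchanged:

```latex
\mathbb{E}_\theta\!\left[ \frac{\partial}{\partial\theta} \log f_\theta(Y) \right]
= \int \frac{\partial_\theta f_\theta(y)}{f_\theta(y)} \, f_\theta(y) \, dy
= \frac{\partial}{\partial\theta} \int f_\theta(y) \, dy
= \frac{\partial}{\partial\theta} \, 1
= 0
```

For a discrete response the integral is replaced by a sum over the values of y.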
1490 01:13:43,026 --> 01:13:44,840 So that's the one that's missing up there. 1491 01:13:44,840 --> 01:13:49,010 And so the expectation minus this thing 1492 01:13:49,010 --> 01:13:50,780 is going to be equal to 0, which tells me 1493 01:13:50,780 --> 01:13:53,122 that the expectation is just the derivative. 1494 01:13:53,122 --> 01:13:55,190 Right, so it's still a function of theta, 1495 01:13:55,190 --> 01:13:57,410 but it's just a derivative of b. 1496 01:13:57,410 --> 01:13:59,660 And the variance is just going to be 1497 01:13:59,660 --> 01:14:01,280 the second derivative of b. 1498 01:14:01,280 --> 01:14:03,910 But remember, this was some sort of a scaling, right? 1499 01:14:03,910 --> 01:14:05,990 It's called the dispersion parameter. 1500 01:14:05,990 --> 01:14:09,410 So if I had a Gaussian and the variance of the Gaussian 1501 01:14:09,410 --> 01:14:12,020 did not depend on the sigma squared 1502 01:14:12,020 --> 01:14:15,260 which I stuffed in this phi, that would be certainly weird. 1503 01:14:15,260 --> 01:14:17,601 And it cannot depend only on mu, and so this will-- 1504 01:14:17,601 --> 01:14:19,850 for the Gaussian, this is definitely going to be equal 1505 01:14:19,850 --> 01:14:20,960 to 1. 1506 01:14:20,960 --> 01:14:24,350 And this is just going to be equal to my variance. 1507 01:14:24,350 --> 01:14:28,460 So this is just by taking the second derivative. 1508 01:14:28,460 --> 01:14:33,260 So basically, the take-home message is that this function b 1509 01:14:33,260 --> 01:14:35,170 captures-- 1510 01:14:35,170 --> 01:14:37,090 by taking one derivative captures the expectation 1511 01:14:37,090 --> 01:14:39,565 and by taking two derivatives captures the variance.
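To see the take-home message in numbers: the mean is the first derivative of b at theta, and the variance is phi times the second derivative of b at theta. The sketch below checks this with finite differences for three families. The specific b functions (Poisson, Bernoulli, Gaussian) are the standard ones for these families, and the helper names and step sizes are choices made here for illustration.

```python
import math

def d1(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

def mean_and_variance(b, theta, phi=1.0):
    """Mean b'(theta) and variance phi * b''(theta) of a canonical family."""
    return d1(b, theta), phi * d2(b, theta)

def b_poisson(theta):
    return math.exp(theta)            # theta = log(lambda), phi = 1

def b_bernoulli(theta):
    return math.log1p(math.exp(theta))  # theta = log(p / (1 - p)), phi = 1

def b_gaussian(theta):
    return theta ** 2 / 2             # theta = mu, phi = sigma^2

if __name__ == "__main__":
    lam, p, mu, sigma2 = 3.0, 0.3, 1.3, 0.7
    print(mean_and_variance(b_poisson, math.log(lam)))             # approx (lam, lam)
    print(mean_and_variance(b_bernoulli, math.log(p / (1 - p))))   # approx (p, p * (1 - p))
    print(mean_and_variance(b_gaussian, mu, phi=sigma2))           # approx (mu, sigma2)
```

For the Gaussian the second derivative of b is 1, so the variance comes entirely from phi, which matches the remark that the variance has to depend on the sigma squared stuffed into phi.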
1512 01:14:39,565 --> 01:14:41,200 Another thing that's actually cool 1513 01:14:41,200 --> 01:14:42,580 and we'll come back to this and I 1514 01:14:42,580 --> 01:14:45,640 want you to think about is if this second derivative is 1515 01:14:45,640 --> 01:14:49,190 the variance, what can I say about this thing? 1516 01:14:52,037 --> 01:14:53,370 What do I know about a variance? 1517 01:14:53,370 --> 01:14:54,950 AUDIENCE: [INAUDIBLE] 1518 01:14:54,950 --> 01:14:56,730 PHILIPPE RIGOLLET: Yeah, that's positive. 1519 01:14:56,730 --> 01:14:58,110 So I know that this is positive. 1520 01:14:58,110 --> 01:15:00,600 So what does that tell me? 1521 01:15:00,600 --> 01:15:03,115 Positive? 1522 01:15:03,115 --> 01:15:03,990 That's convex, right? 1523 01:15:03,990 --> 01:15:07,050 A function that has a positive second derivative is convex. 1524 01:15:07,050 --> 01:15:09,700 So we're going to use that as well, all right? 1525 01:15:09,700 --> 01:15:12,530 So yeah, I'll see you on Thursday. 1526 01:15:12,530 --> 01:15:14,380 I have your homework.