The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So I'm using a few things here, right? I'm using the fact that KL is non-negative, but KL is equal to 0 when I take twice the same argument. So I know that this function is always non-negative. So that's theta, and that's KL(P theta star, P theta). And I know that at theta star, it's equal to 0. OK? I could be in the case where I have this happening -- I have two, let's call it theta star prime. I have two minimizers. That could be the case, right? I'm not saying that -- so KL is 0 at the minimum. That doesn't mean that I have a unique minimum, right? But it does, actually. What do I need to use to make sure that I have only one minimum?

So the definiteness is guaranteeing to me that there's a unique P theta star that minimizes it. But then I need to make sure that there's a unique -- from this unique P theta star, I need to make sure there's a unique theta star that defines this P theta star.

Exactly. All right, so I combine definiteness and identifiability to make sure that there is a unique minimizer, so that this picture with two minimizers cannot exist. OK, so basically, let me write what I just said. Definiteness implies that P theta star is the unique minimizer of P theta maps to KL(P theta star, P theta). So definiteness only guarantees that the probability distribution is uniquely identified. And identifiability implies that theta star is the unique minimizer of theta maps to KL(P theta star, P theta), OK? So I'm basically doing the composition of two injective functions. The first one is the one that maps, say, theta to P theta. And the second one is the one that maps P theta to the set of minimizers, OK?
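In symbols, the argument being sketched on the board can be summarized as follows (a hedged reconstruction for the reader, not a verbatim slide):

```latex
% Reconstruction of the board argument, not verbatim from the lecture.
\[
\mathrm{KL}(P_{\theta^*}, P_\theta) \ge 0,
\qquad
\mathrm{KL}(P_{\theta^*}, P_\theta) = 0
\iff P_\theta = P_{\theta^*}
\quad \text{(definiteness)},
\]
\[
P_\theta = P_{\theta^*} \iff \theta = \theta^*
\quad \text{(identifiability)},
\]
\[
\text{so } \theta^* \text{ is the unique minimizer of }
\theta \mapsto \mathrm{KL}(P_{\theta^*}, P_\theta).
\]
```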
So at least morally, you should agree that theta star is the minimizer of this thing. Whether it's unique or not, you should agree that it's a good one. So maybe you can think a little longer about this. So thinking about this being the minimizer, then it says, well, if I actually had a good estimate of this function, I would use the strategy that I described for the total variation, which is: well, I don't know what this function looks like -- it depends on theta star. But maybe I can find an estimator of this function that fluctuates around this function, such that when I minimize this estimator of the function, I'm actually not too far, OK? And this is exactly what drives me to do this, because I can actually construct an estimator -- an estimator of the KL that is actually close to the KL, all right? So I define KL hat. All we did is just replace the expectation with respect to theta star by an average. That's what we did. So if you're a little puzzled by this arrow, that's all it says: replace this guy by this guy. It has no mathematical meaning; it just means "replace it by". And now that actually tells me how to get my estimator. It just says, well, my estimator, KL hat, is equal to some constant which I don't know -- I mean, it certainly depends on theta star, but I won't care about it when I'm trying to minimize -- minus 1/n times the sum for i from 1 to n of log f theta of xi. So here I'm writing it with the density; you have it with the PMF on the slides, so you have the two versions in front of you, OK? Oh sorry, I forgot the xi. Now clearly, this function I know how to compute. If you give me a theta, since I know the form of the density f theta, for each theta that you give me, I can actually compute this quantity, right? This constant I don't know, but I don't care, because I'm just shifting the value of the function I'm trying to minimize. The set of minimizers is not going to change.
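A minimal numerical sketch of this replace-the-expectation-by-an-average step (not from the lecture; the Gaussian N(theta, 1) model and all names here are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star = 2.0
x = rng.normal(theta_star, 1.0, size=1000)   # a sample from P_theta*

def kl_hat_up_to_constant(theta, x):
    # KL hat = (unknown constant) - (1/n) sum_i log f_theta(x_i);
    # the constant depends only on theta*, so we drop it -- it does
    # not move the minimizer.
    return -np.mean(norm.logpdf(x, loc=theta, scale=1.0))

print(kl_hat_up_to_constant(2.0, x))   # small: theta near theta*
print(kl_hat_up_to_constant(5.0, x))   # much larger: theta far from theta*
```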
So now, this is my estimation strategy: minimize in theta KL hat(P theta star, P theta), OK? So now let's just make sure that we all agree -- so what we want is the argument of the minimum, right? arg min means the theta that minimizes this guy, rather than the value of the min. OK, so I'm trying to find the arg min of this thing. Well, this is equivalent to finding the arg min of, say, a constant minus 1/n times the sum for i from 1 to n of log f theta of xi. So that's just -- I don't think it likes me. No. OK, so that's minimizing this average, right? I just plugged in the definition of KL hat. Now, I claim that taking the arg min of a constant plus a function, or the arg min of the function, is the same thing. Is anybody not comfortable with this idea? OK, so this is the same.

By the way, I should probably switch to the next slide, because I'm writing the same thing, but better -- and it's with PMFs rather than PDFs. OK, now, the arg min of the negative of a thing is the same as the arg max without the negative, right? So this is the arg max over theta of 1/n times the sum for i equal 1 to n of log f theta of xi. Taking the arg max of the average or the arg max of the sum, again, is not going to make much difference. Adding constants or multiplying by constants does not change the arg min or the arg max. Now, I have a sum of logs, which is the log of the product. OK? It's the arg max of the log of f theta of x1 times f theta of x2, all the way to f theta of xn. But the log is a function that's increasing, so maximizing the log of a function or maximizing the function itself is the same thing. The value is going to change, but the arg max is not going to change. Everybody agrees with this? So this is equivalent to the arg max over theta of the product from i equal 1 to n of f theta of xi. And that's because x maps to log x is increasing.
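A quick numerical check of these invariances (not from the lecture; same illustrative Gaussian setup as before), showing that the constant shift, the 1/n rescaling, and the log all leave the arg max untouched:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=500)
grid = np.linspace(0, 4, 2001)

log_lik = np.array([norm.logpdf(x, loc=t).sum() for t in grid])
lik = np.exp(log_lik - log_lik.max())   # the product, rescaled by a positive
                                        # constant to avoid numerical underflow

# All three maximizers land on the same grid point:
print(grid[np.argmax(log_lik)])            # arg max of the sum of logs
print(grid[np.argmax(log_lik / len(x))])   # arg max of the average
print(grid[np.argmax(lik)])                # arg max of the (rescaled) product
```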
So now I've gone from minimizing the KL, to minimizing the estimate of the KL, to maximizing this product. Well, this chapter is called maximum likelihood estimation. The maximum comes from the fact that our original idea was to minimize the negative of a function. So that's why it's maximum likelihood. And this function here is called the likelihood. This function is really just telling me -- they call it likelihood because it's some measure of how likely it is that theta was the parameter that generated the data.

OK, so let's go to the -- well, we'll go to the formal definition in a second. But actually, let me just give you intuition as to why this is the distribution of the data -- why this is the likelihood. Sorry, why this makes sense as a measure of likelihood. Let's now think for simplicity of the following model. So I have -- I'm on the real line, and I look at N(theta, 1) for theta in the real line -- do you see that? OK. Probably you don't. Not that you care. OK, so -- OK, let's look at a simple example.

So here's the model. As I said, we're looking at observations on the real line, and they're distributed according to some N(theta, 1). So I don't care about the variance -- I know it's 1. And it's indexed by theta in the real line. OK, so the only thing I need to figure out is: what is the mean of those guys, OK? Now, I have these n observations. And if you remember from your probability class -- are you familiar with the concept of joint density? You have multivariate observations. The joint density of independent random variables is just the product of their individual densities. So really, when I look at the product from i equal 1 to n of f theta of xi, this is really the joint density of the vector -- well, let me not use the word vector -- of x1, ..., xn, OK? So if I take the product of densities, it is still a density, but this time on R^n.
And so now what this thing is telling me -- so think of it in R^2, right? So this is the joint density of two Gaussians. So it's something that looks like some bell-shaped curve in two dimensions, and it's centered at the value (theta, theta). OK, they both have mean theta. So let's assume for one second -- it's going to be hard for me to make pictures in n dimensions. Actually, already in two dimensions, I can promise you that it's not very easy. So I'm actually just going to assume that n is equal to 1 for the sake of illustration.

OK, so now I have this data. And now I have one observation, OK? And I know that f theta looks like this. And what I'm doing is I'm actually looking at the value of f theta at my observation. Let's call it x1. Now, my principle tells me: just find the theta that makes this guy the most likely. What is the likelihood of my x1? Well, it's just the value of the function -- this value here. And if I wanted to find the most likely theta to have generated this x1, what I would need to do is shift this thing and put it here. And so my estimate, my maximum likelihood estimator here, would be theta hat equal to x1, OK? That would be just the observation. Because if I have only one observation, what else am I going to do? OK, and so it sort of makes sense.

And if you have more observations, you can think of it this way. So now I have, say, k observations, or n observations. And what I do is I look at the value for each of these guys -- this value, this value, this value, this value. I take their product, and I make this thing large. OK, why do I take the product? Well, because I'm trying to maximize their values all together, and I need to turn them into one number that I can maximize.
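Before going on, a minimal numerical sketch of exactly this picture (not from the lecture; the data values are hypothetical): slide the center of the bell curve and keep the position where the product of density values at the data points is largest.

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.8, 2.4, 2.1, 1.6, 2.6])   # hypothetical observations

grid = np.linspace(0, 4, 4001)
log_lik = np.array([norm.logpdf(x, loc=t).sum() for t in grid])

# The best center agrees with the sample mean.
print(grid[np.argmax(log_lik)], x.mean())   # 2.1 2.1
```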
And taking the product is the natural way of doing it -- whether you motivate it by the KL principle or by maximizing the joint density -- rather than just maximizing anything. OK, so that's why, visually, this is the maximum likelihood. It just says that if my observations are here, then this guy, this mean theta, is more likely than that guy. Because if I look at the value of the function for this guy -- if I look at theta being this thing -- then this is a very small value. Very small value, very small value, very small value. Everything gets a super small value, right? That's just the value that it gets in the tail here, which is very close to 0. But as soon as I start covering all my points with my bell-shaped curve, then all the values go up.

All right, so I just want to take a short break from statistics, because the maximum likelihood principle involves maximizing a function. So I just want to make sure that we're all on par about how we maximize functions. In most instances, it's going to be a one-dimensional function, because theta is going to be a one-dimensional parameter -- like here, it's the real line. So it's going to be easy. In some cases, it may be a multivariate function, and it might be more complicated. OK, so let's just make this interlude. So the first thing I want you to notice is that if you open any book on what's called optimization -- which basically is the science of optimizing functions -- it will, in I'd say 99.9% of cases, talk about minimizing functions. But it doesn't matter, because you can just flip the function: you put a minus sign, and minimizing h is the same as maximizing minus h, and the opposite, OK? So for this class, since we're only going to talk about maximum likelihood estimation, we will talk about maximizing functions. But don't be lost if you suddenly decide to open a book on optimization and find only material about minimizing functions.
OK, so maximizing an arbitrary function can actually be fairly difficult. If I give you a function that has some weird shape -- let's think of this polynomial, for example -- and I want to find the maximum, how would we do it? So what is the thing you've learned in calculus about how to maximize a function? Set the derivative equal to 0. Maybe you want to check the second derivative to make sure it's a maximum and not a minimum. But the thing is, this only guarantees that you have a local one, right? So if I do it for this function, for example, then this guy satisfies the criterion, this guy satisfies it, this guy, this guy here -- and this guy satisfies the first criterion but not the second-derivative one. So I have a lot of candidates. And if my function can be really anything, it's going to be difficult, whether analytically, by taking derivatives and setting them to 0, or by trying to find some algorithm to do this. Because if my function is very jittery, then my algorithm basically has to check all candidates. And if there are a lot of them, it might take forever, OK? So here I have only one, two, three, four, five candidates to check. But in practice, you might have a million of them to check, and that might take forever.

OK, so what's nice about statistical models -- and one of the things that makes all these models particularly robust, and why we still talk about them 100 years after they were introduced -- is that the functions, the likelihoods, that they lead us to maximize are actually very simple. And they all share a nice property, which is that of being concave. All right, so what is a concave function? Well, by definition, it's just a function for which -- let's think of it as being twice differentiable. You can define functions that are not differentiable as being concave, but let's think about it as having a second derivative.
And so if you look at a function that has a second derivative, the concave ones are the functions whose second derivative is negative everywhere. Not just at the maximum -- everywhere, OK? And if it's strictly concave, the second derivative is actually strictly less than zero. In particular, if I think of a linear function, y equals x, then this function has a second derivative which is equal to zero, OK? So it is concave, but it's not strictly concave, OK? If I look at the function negative x squared, what is its second derivative? Minus 2. So it's strictly negative everywhere, OK? So actually, this is a pretty canonical example of a strictly concave function. If you want to picture a strictly concave function, think of negative x squared -- a parabola pointing downwards.

OK, so we can also talk about strictly convex functions. Convex is just what happens when the negative of the function is concave. That translates into having a second derivative which is either non-negative or positive, depending on whether you're talking about convexity or strict convexity. But again, convex functions are convenient when you're trying to minimize something. And since we're trying to maximize a function, we're looking for concave.

So here are some examples. Let's just go through them quickly. OK, so the first one is -- so here I made my life a little uneasy by talking about the functions in theta, right? I'm talking about likelihoods, so I'm thinking of functions where the parameter is theta. So I have h of theta. And if I start with negative theta squared, then as we said, h prime prime of theta, the second derivative, is minus 2, which is strictly negative, so this function is strictly concave. OK, another function is h of theta equal to -- what did we pick -- square root of theta. What is the first derivative?
1 over 2 square root of theta. What is the second derivative? So that's theta to the negative 1/2, so I'm just picking up another negative 1/2, so I get negative 1/4, and then I get theta to the 3/4 downstairs -- sorry, 3/2. And that's strictly negative for theta, say, larger than 0. And I really need to have this thing larger than 0 so that it's well-defined; but strictly larger than 0 is so that this thing does not blow up to infinity. And it's true: if you think about this function, it looks like this, and already the first derivative goes to infinity at 0. And it's a concave function, OK?

Another one is the log, of course. What is the derivative of the log? That's 1 over theta -- h prime of theta is 1 over theta. And the second derivative is negative 1 over theta squared, which, again, is negative if theta is strictly positive. Here I don't strictly need theta positive for this expression, but I need it for the log to be defined.

And sine. OK, so let's just do one more. So h of theta is sine of theta. But here I take it only on an interval, because you want to think of this function as pointing always downwards. And in particular, you don't want this function to have an inflection point. You don't want it to go down and then up and then down and then up, because that is not concave. And sine is certainly going up and down, right? So what we do is we restrict it to an interval where sine is actually -- so what does the sine function look like? At 0 it's 0, and it's going up. Where is the first maximum of the sine?

STUDENT: [INAUDIBLE]

PROFESSOR: I'm sorry?

STUDENT: Pi over 2.

PROFESSOR: Pi over 2, where it takes value 1. And then it goes down again, and that's until pi. And then it keeps going down, but here you see I actually start changing my inflection. So what we do is we stop it at pi.
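A quick numerical check of these second-derivative signs (not from the lecture; a finite-difference sketch covering all four examples, sine included, on their stated domains):

```python
import numpy as np

def second_derivative(h, t, eps=1e-4):
    # central finite-difference approximation of h''(t)
    return (h(t + eps) - 2 * h(t) + h(t - eps)) / eps**2

grid = np.linspace(0.1, np.pi - 0.1, 50)   # inside (0, pi), away from 0
for name, h in [("-theta^2", lambda t: -t**2),
                ("sqrt",     np.sqrt),
                ("log",      np.log),
                ("sin",      np.sin)]:
    print(name, np.all(second_derivative(h, grid) < 0))   # all True
```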
And if we look at this function, it certainly looks like a parabola pointing downwards. And you can check that it actually works with the derivatives. So the derivative of sine is cosine, and the derivative of cosine is negative sine. OK, and sine between 0 and pi is positive, so this entire thing is going to be negative. OK? And you know, I can come up with a lot of examples, but let's just stick to those. There's the linear function, of course. An affine function is going to be concave, but it's actually going to be convex as well, which means it's certainly not going to be strictly concave or strictly convex, OK?

So here's your standard picture. And here, if you look at the dotted line, what it tells me -- and this is the property we're going to be using -- is that if a strictly concave function has a maximum, which is not always the case, but if it has a local maximum, then it must be a global maximum. OK, so just the fact that it goes up and down, and not up again, means that only a global maximum can exist. Now if you looked, for example, at the square root function on the entire positive real line, then this thing is never going to attain a maximum; it just goes to infinity as x goes to infinity. So if I wanted to find the maximum, I would have to stop somewhere and say that the maximum is attained at the right-hand side. OK, so that's the beauty of convex functions, or concave functions: essentially, these functions are easy to maximize. And if I tell you a function is concave, you take the first derivative, set it equal to 0, and if you find a point that satisfies this, then it must be a global maximum, OK?

STUDENT: What if your set theta was [INAUDIBLE] then couldn't you have a function that, by the definition, is concave, with two upside-down parabolas on two disjoint intervals, but yet it has two global maximums?
PROFESSOR: So you won't get them -- so you want the function to be concave on what? On the convex hull of the intervals? Or you want it to be --

STUDENT: [INAUDIBLE] just said that any subset.

PROFESSOR: OK, OK, you're right. So maybe the definition -- so you're pointing to a weakness in the definition. Let's just say that theta is a convex set, and then you're good, OK? So you're right. Since I actually just said that this is true only for theta, I could just take pieces of concave functions, right? I can do this, and then on the next one I can do this, and on the next one I can do this, and then I would have a bunch of them. But what I want is to think of it as a global function on some convex set. You're right. So think of theta as being convex -- for this guy, an interval, since it's the real line.

OK, so as I said, more generally -- we can actually define concave functions more generally, in higher dimensions. And that will be useful if theta is not just one parameter but several parameters. And for that, you need to remind yourself of Calculus II. You have a generalization of the notion of derivative, which is called the gradient, which is basically a vector where each coordinate is the partial derivative with respect to each coordinate of theta. And the Hessian is a matrix, which is essentially a generalization of the second derivative. I denote it by nabla squared, but you can write it the way you want. And this matrix has as its ij-th entry the second partial derivative of h with respect to theta i and theta j. Who has never seen that? OK. So now, being concave here generalizes in essentially the same way. Saying that a vector is equal to zero -- well, that's just setting the vector -- sorry. The first-order condition to say that it's a maximum is going to be the same.
Saying that a function has a gradient equal to zero is the same as saying that each of its coordinates is equal to zero. And that's actually going to be the condition for a global maximum here. So to check convexity, we need to see that a matrix itself is negative -- sorry, to check concavity, we need to check that a matrix is negative. And there is a notion among matrices for comparing a matrix to zero, and that's exactly this notion: you pre- and post-multiply by the same x. That works for symmetric matrices, which is the case here. And so you pre-multiply by x, post-multiply by the same x. So you have your matrix, your Hessian here -- it's a d by d matrix if you have a d-dimensional parameter. So let's call it -- OK. And then here I pre-multiply by x transpose, and I post-multiply by x. And this has to be non-positive if I want the function to be concave, and strictly negative if I want it to be strictly concave.

OK, that's just the generalization. You can check for yourself that this is the same thing: if I were in dimension 1, this would be the same thing. Why? Because in dimension 1, pre- and post-multiplying by x is the same as multiplying by x squared -- in dimension 1, I can just move my x's around, right? And so the first condition would mean, in dimension 1, that the second derivative times x squared has to be less than or equal to zero. And here I need this for all x's that are not zero, because I could take x to be zero and make this equal to zero, right? So this is for x's that are not equal to zero, OK?

And so, some examples. Just look at this function. So now I have functions that depend on two parameters, theta1 and theta2. So the first one is -- so if I take theta to be in -- now I need two parameters -- R squared, and I look at the function h of theta. Can somebody tell me what h of theta is?

STUDENT: [INAUDIBLE]

PROFESSOR: Minus 2 theta2 squared?
OK, so let's compute the gradient of h of theta. So it's going to be something that has two coordinates. To get the first coordinate, what do I do? Well, I take the derivative with respect to theta1, thinking of theta2 as a constant. So this term is going to go away, and I get negative 2 theta1. And when I take the derivative with respect to the second coordinate, thinking of the first part as a constant, I get minus 4 theta2. Is that clear for everyone? That's just the definition of partial derivatives.

And then if I want to do the Hessian, now I'm going to get a 2 by 2 matrix. The first entry here I get by taking the derivative of this guy with respect to theta1. So that's easy -- that's just minus 2. This entry I get by taking the derivative of this guy with respect to theta2. So I get what? Zero -- I treat this guy as a constant. This entry is also going to be zero, because I take the derivative of this guy with respect to theta1. And then I take the derivative of this guy with respect to theta2, so I get minus 4.

OK, so now I want to check that this matrix satisfies: x transpose times this matrix times x is negative. So what I do is -- so what is this quadratic form? If I do x transpose nabla squared h of theta x, what I get is minus 2 x1 squared minus 4 x2 squared. Because this matrix is diagonal, all it does is weight the squares of the x's. So this term is definitely negative; this term is negative. And actually, if one of the two is non-zero, which means that x is non-zero, then this thing is actually strictly negative. So this function is actually strictly concave. And it looks like a parabola that's slightly distorted in one direction.

So, well, I know this might have been some time ago. Maybe for some of you it might have been since high school.
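The same check in code (not from the lecture): a symmetric matrix H satisfies x transpose H x < 0 for all non-zero x exactly when all its eigenvalues are strictly negative, which gives a mechanical test.

```python
import numpy as np

H = np.array([[-2.0,  0.0],
              [ 0.0, -4.0]])   # the Hessian just computed (constant in theta)

eigenvalues = np.linalg.eigvalsh(H)   # eigvalsh is for symmetric matrices
print(eigenvalues)                     # [-4. -2.]
print(np.all(eigenvalues < 0))         # True, so strictly concave

# Spot check of the quadratic form at a particular x
x = np.array([1.0, 2.0])
print(x @ H @ x)                       # -2*1^2 - 4*2^2 = -18.0
```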
So just remind yourself about second derivatives and Hessians and things like this. Here's another one as an exercise: h is minus (theta1 minus theta2) squared. This one is actually not going to be diagonal -- the Hessian is not going to be diagonal. Who would like to do this now in class? OK, thank you. This is not a calculus class, so you can just do it as a calculus exercise. And you can do it for the log as well.

Now, there is a nice recipe for concavity that works for the second one and the third one. The thing is, if you look at those particular functions, what I'm doing is, first of all, taking a linear combination of my arguments, and then taking a concave function of this guy. And this is always going to work -- this is always going to give me a concave function. So the computations that I just made, I actually never made them when I prepared those slides, because I don't have to. I know that if I take a linear combination of those things and then take a concave function of this guy, I'm always going to get a concave function. OK, so that's an easy way to check this, or at least a sanity check.

All right, and so as I said, finding maximizers of a concave or strictly concave function is the same as it was in the one-dimensional case. In the one-dimensional case, we just agreed that we take the derivative and set it to zero; in the higher-dimensional case, we take the gradient and set it equal to zero. Again, that's calculus, all right? So this is going to give me equations, right? The first one is an equation in theta. The second one is an equation in theta1, theta2, theta3, all the way to theta d. And it doesn't mean that because I can write this equation, I can actually solve it. This equation might be super nasty -- it might be some polynomial with exponentials and logs equal to zero, or some crazy thing.
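For the record, a symbolic sketch of the exercise just posed (not worked in the lecture; assumes SymPy is available). It also illustrates the recipe: theta1 minus theta2 is a linear combination, and u maps to minus u squared is concave, so h is concave -- but only concave, not strictly, since the form vanishes along theta1 = theta2.

```python
import sympy as sp

t1, t2 = sp.symbols("theta1 theta2", real=True)
h = -(t1 - t2) ** 2

H = sp.hessian(h, (t1, t2))
print(H)               # Matrix([[-2, 2], [2, -2]]): not diagonal this time
print(H.eigenvals())   # {-4: 1, 0: 1}: concave, but not strictly concave
```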
And so, for a concave function, since we know there's a unique maximizer, there's this theory of convex optimization -- which really, since those books talk about minimizing, requires flipping directions here and there, but you can think of it as the theory of concave maximization. And it gives you algorithms to solve this numerically and fairly efficiently. OK, that means fast: even if d is of size 10,000, you're going to wait for one second and it's going to tell you what the maximum is. And that's what machine learning is about. If you've taken any class on machine learning, there's a lot of optimization, because they have really, really big problems to solve. Often in this class, since this is more introductory statistics, we will have a closed form: the maximum likelihood estimator will be "theta hat equals", say, x bar, and that will be the maximum likelihood estimator.

So just quickly -- has anybody seen convex optimization before? So let me just give you an intuition for why those functions are easy to maximize or to minimize. In one dimension, it's actually very easy for you to see. And the reason is this: if I want to maximize a concave function, what I need is to be able to query a point and get as an answer the derivative of the function at that point, OK? So now, say this is the function I want to optimize, and I've been running my algorithm for 5/10 of a second, and it's at this point here. OK, that's the candidate. Now, what I can ask is: what is the derivative of my function here? Well, it's going to give me a value, and this value is going to be either negative, positive, or zero. Well, if it's zero, that's great -- that means I'm here, and I can just go home; I've solved my problem. I know there's a unique maximum, and that's what I wanted to find. If it's positive, it actually tells me that I'm to the left of the optimizer.
To the left of the optimal value. And if it's negative, it means that I'm to the right of the value I'm looking for. And so most convex optimization methods basically tell you: well, if you query the derivative and it's positive, move to the right; and if it's negative, move to the left. Now, by how much you move is, basically, well, why people write books. And in higher dimensions, it's a little more complicated. Because in higher dimensions -- think about two dimensions -- I only get back a vector, and the vector is only telling me, well, here is the half of the space in which you can move. In one dimension, if you tell me "move to the right", I know exactly which direction I'm going to have to move. But in two dimensions, you're basically telling me, well, move in this general direction. And so, of course, I know there's a line on the floor I cannot move behind. But even if you tell me, "draw a line on the floor and move only to that side of the line", there are still many directions on that side of the line that I can go. And that's also why there are lots of things you can do in optimization. OK, but still, putting this line on the floor is telling me: do not go backwards. And that's very important -- it's telling you which direction you should be going, always, OK? All right, so that's what's behind this notion of gradient descent algorithms -- steepest descent. Or steepest ascent, actually, if we're trying to maximize.

OK, so let's move on -- this course is not about optimization, all right? So as I said, the likelihood was this guy, the product of the f theta of the xi's. And one way you can think of this is just basically as the joint distribution of my data at the point theta. So now the likelihood, formally -- so here I am giving myself the model (E, (P theta)). And here I'm going to assume that E is discrete, so that I can talk about PMFs. But everything we're doing, just redo it for yourself replacing PMFs by PDFs, and everything is going to be fine.
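Looping back to the one-dimensional intuition for a moment, here is a minimal sketch of the "query the derivative and move toward it" rule (not from the lecture; the step size and the target function are illustrative):

```python
def gradient_ascent(grad, theta0=0.0, step=0.1, n_steps=200):
    # Repeatedly query the derivative and move in its direction:
    # positive slope -> move right, negative slope -> move left.
    theta = theta0
    for _ in range(n_steps):
        theta += step * grad(theta)
    return theta

# Maximize the strictly concave h(theta) = -(theta - 3)^2, h'(theta) = -2(theta - 3)
print(gradient_ascent(lambda t: -2 * (t - 3)))   # converges to 3.0
```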
We'll do it in a second. All right, so the likelihood of the model. So here I'm not looking at the likelihood of a parameter; I'm looking at the likelihood of the model. So it's actually a function of the parameter. And actually, I'm going to make it even a function of the points x1 to xn. All right, so I have a function, and what it takes as input is all the points x1 to xn and a candidate parameter theta. Not the true one -- a candidate. And what I'm going to do is look at the probability that my random variables, under this distribution P theta, take these exact values x1, x2, ..., xn. Now remember, if my data is independent, then I can actually just say that the probability of this intersection is the product of the probabilities, and it would look something like this. But I can define the likelihood even if I don't have independent random variables. Think of them as being independent, though, because that's all we're going to encounter in this class, OK? I just want you to be aware that if I had dependent variables, I could still define the likelihood; I would just have to understand how to compute these probabilities to be able to compute it.

OK, so think of Bernoullis, for example. So here is my example of a Bernoulli. So my model is ({0, 1}, Bernoulli(p)), for p in the interval (0, 1). Just as a side remark, I'm going to use the fact that I can write the PMF of a Bernoulli in a very concise form, right? If I ask you what the PMF of a Bernoulli is, you could tell me: well, under p, the probability that X is equal to 0 is 1 minus p, and the probability under p that X is equal to 1 is p. But I can be a bit smart and say that for any x that's either 0 or 1, the probability under p that X is equal to little x can be written in the compact form p to the x times 1 minus p to the 1 minus x.
794 00:41:14,150 --> 00:41:17,570 And you can check that this is the right form because, well, 795 00:41:17,570 --> 00:41:20,910 you have to check it only for two values of x, 0 and 1. 796 00:41:20,910 --> 00:41:23,350 And if you plug in 1, you only keep the p. 797 00:41:23,350 --> 00:41:27,840 If you plug in 0, you only keep the 1 minus p. 798 00:41:27,840 --> 00:41:31,190 And that's just a trick, OK? 799 00:41:31,190 --> 00:41:34,350 I could have gone with many other ways. 800 00:41:34,350 --> 00:41:35,940 Agreed? 801 00:41:35,940 --> 00:41:39,342 I could have said, actually, something like-- 802 00:41:39,342 --> 00:41:41,550 another one would be-- which we are not going to use, 803 00:41:41,550 --> 00:41:47,340 but we could say, well, it's x times p plus 1 minus x times 1 minus 804 00:41:47,340 --> 00:41:47,850 p, right? 805 00:41:50,680 --> 00:41:53,160 That's another one. 806 00:41:53,160 --> 00:41:56,057 But this one is going to be convenient. 807 00:41:56,057 --> 00:41:57,640 So forget about this guy for a second. 808 00:42:02,750 --> 00:42:05,450 So now, I said that the likelihood is just 809 00:42:05,450 --> 00:42:12,380 this function that's computing the probability that X1 810 00:42:12,380 --> 00:42:15,050 is equal to little x1, and so on. 811 00:42:15,050 --> 00:42:27,950 So the likelihood is L of x1, ..., xn. 812 00:42:27,950 --> 00:42:30,140 So let me try to make those calligraphic so you 813 00:42:30,140 --> 00:42:33,140 know that I'm talking about the small values, right? 814 00:42:33,140 --> 00:42:35,010 Small x's. 815 00:42:35,010 --> 00:42:38,840 x1, ..., xn, and then of course p. 816 00:42:38,840 --> 00:42:40,284 Sometimes we even put-- 817 00:42:40,284 --> 00:42:42,200 I didn't do it, but sometimes you can actually 818 00:42:42,200 --> 00:42:46,640 put a semicolon here, so you know that those two 819 00:42:46,640 --> 00:42:48,860 things are treated differently. 820 00:42:48,860 --> 00:42:51,570 And so now this thing is equal to what? 821 00:42:51,570 --> 00:42:54,440 Well, it's just the probability under p 822 00:42:54,440 --> 00:42:59,990 that X1 is little x1 all the way to Xn is little xn. 823 00:42:59,990 --> 00:43:02,064 OK, that's just the definition. 824 00:43:06,910 --> 00:43:11,590 All right, so now let's start working. 825 00:43:11,590 --> 00:43:13,240 So we write the definition, and then we 826 00:43:13,240 --> 00:43:16,030 want to make it look like something we would potentially 827 00:43:16,030 --> 00:43:17,902 be able to maximize-- 828 00:43:17,902 --> 00:43:20,235 like if I take the derivative of this with respect to p, 829 00:43:20,235 --> 00:43:22,570 it's not very helpful as it stands. 830 00:43:22,570 --> 00:43:26,770 I just want an algebraic function of p. 831 00:43:26,770 --> 00:43:28,580 So this thing is going to be equal to what? 832 00:43:28,580 --> 00:43:30,413 Well, what is the first thing I want to use? 833 00:43:32,740 --> 00:43:35,350 I have a probability of an intersection of events, 834 00:43:35,350 --> 00:43:39,630 so it's just the product of the probabilities. 835 00:43:39,630 --> 00:43:44,396 So this is the product from i equal 1 to n of the probability under 836 00:43:44,396 --> 00:43:47,970 p that Xi is equal to little xi. 837 00:43:47,970 --> 00:43:49,858 That's independence. 838 00:43:54,070 --> 00:43:58,690 OK, now I'm starting to mean business, because for each of these, 839 00:43:58,690 --> 00:44:00,370 we have a closed form, right? 840 00:44:00,370 --> 00:44:03,910 I wrote this as this supposedly convenient form.
841 00:44:03,910 --> 00:44:06,470 I still have to reveal to you why it's convenient. 842 00:44:06,470 --> 00:44:09,640 So this thing is equal to-- 843 00:44:09,640 --> 00:44:15,090 well, we said that that was p to the little xi times 844 00:44:15,090 --> 00:44:20,240 1 minus p to the 1 minus xi, OK? 845 00:44:22,960 --> 00:44:26,650 So that was just what I wrote over there as the probability 846 00:44:26,650 --> 00:44:29,540 that Xi is equal to little xi. 847 00:44:29,540 --> 00:44:32,780 And since they all have the same parameter p, I just 848 00:44:32,780 --> 00:44:34,280 have this one p that shows up here. 849 00:44:38,140 --> 00:44:41,230 And so now I'm just taking the product of something 850 00:44:41,230 --> 00:44:45,160 to the xi, so it's this thing to the sum of the xi's. 851 00:44:45,160 --> 00:44:48,090 Everybody agrees with this? 852 00:44:48,090 --> 00:44:56,360 So this is equal to p to the sum of the xi's times 1 minus p 853 00:44:56,360 --> 00:44:58,180 to the n minus sum of the xi's. 854 00:45:10,180 --> 00:45:13,300 If you don't feel comfortable with writing it directly, 855 00:45:13,300 --> 00:45:15,520 you can observe that this thing here 856 00:45:15,520 --> 00:45:22,170 is actually equal to p over 1 minus p to the xi, times 1 857 00:45:22,170 --> 00:45:26,022 minus p, OK? 858 00:45:26,022 --> 00:45:27,480 So now when I take the product, I'm 859 00:45:27,480 --> 00:45:28,938 getting the product of those guys. 860 00:45:28,938 --> 00:45:31,380 So it's just this guy to the power of the sum 861 00:45:31,380 --> 00:45:33,570 and this guy to the power n. 862 00:45:33,570 --> 00:45:39,670 And then I can rewrite it like this if I want to. 863 00:45:39,670 --> 00:45:42,720 And so now-- well, that's what we have here. 864 00:45:42,720 --> 00:45:45,750 And now I am in business because I can still 865 00:45:45,750 --> 00:45:48,750 hope to maximize this function. 866 00:45:48,750 --> 00:45:50,679 And how do I maximize this function? 867 00:45:50,679 --> 00:45:52,470 All I have to do is to take the derivative. 868 00:45:52,470 --> 00:45:54,710 Do you want to do it? 869 00:45:54,710 --> 00:45:56,502 Let's just take the derivative, OK? 870 00:45:56,502 --> 00:45:58,960 Sorry, I didn't tell you that, well, the maximum likelihood 871 00:45:58,960 --> 00:46:01,700 principle-- the idea is to maximize this thing, 872 00:46:01,700 --> 00:46:02,200 OK? 873 00:46:02,200 --> 00:46:04,310 But I'm not going to get there right now. 874 00:46:04,310 --> 00:46:08,810 OK, so let's do it maybe for the Poisson model for a second. 875 00:46:08,810 --> 00:46:16,910 So if you want to do it for the Poisson model, 876 00:46:16,910 --> 00:46:18,380 let's write the likelihood. 877 00:46:18,380 --> 00:46:20,020 So right now I'm not doing anything. 878 00:46:20,020 --> 00:46:21,010 I'm not maximizing. 879 00:46:21,010 --> 00:46:24,040 I'm just computing the likelihood function. 880 00:46:29,640 --> 00:46:32,470 OK, so the likelihood function for Poisson. 881 00:46:32,470 --> 00:46:36,710 So now I know-- what is my sample space for Poisson? 882 00:46:36,710 --> 00:46:38,140 STUDENT: Positives. 883 00:46:38,140 --> 00:46:41,170 PROFESSOR: The non-negative integers. 884 00:46:41,170 --> 00:46:45,220 And well, let me write it like this. 885 00:46:45,220 --> 00:46:51,170 Poisson lambda, and I'm going to take lambda to be positive.
886 00:46:51,170 --> 00:46:53,560 And so that means that the probability under lambda 887 00:46:53,560 --> 00:46:57,920 that X is equal to little x in the sample space 888 00:46:57,920 --> 00:47:01,030 is lambda to the x over x factorial, 889 00:47:01,030 --> 00:47:03,130 e to the minus lambda. 890 00:47:03,130 --> 00:47:05,530 So that's basically the same as the compact form 891 00:47:05,530 --> 00:47:06,740 that I wrote over there. 892 00:47:06,740 --> 00:47:08,860 It's just now a different one. 893 00:47:08,860 --> 00:47:12,340 And so when I want to write my likelihood, again, 894 00:47:12,340 --> 00:47:13,500 we said little x's. 895 00:47:17,050 --> 00:47:18,390 This is equal to what? 896 00:47:18,390 --> 00:47:23,690 Well, it's equal to the probability under lambda 897 00:47:23,690 --> 00:47:31,796 that X1 is little x1, ..., Xn is little xn, 898 00:47:31,796 --> 00:47:33,045 which is equal to the product. 899 00:47:40,950 --> 00:47:42,720 OK? 900 00:47:42,720 --> 00:47:45,270 Just by independence. 901 00:47:45,270 --> 00:47:47,640 And now I can write those guys as a product 902 00:47:47,640 --> 00:47:52,080 over i equal 1 to n. 903 00:47:52,080 --> 00:47:56,100 So this guy is just this thing where I plug in xi. 904 00:47:56,100 --> 00:48:05,540 So I get lambda to the xi divided by xi factorial times e 905 00:48:05,540 --> 00:48:10,660 to the minus lambda, OK? 906 00:48:10,660 --> 00:48:13,709 And now, I mean, this guy is going to be nice. 907 00:48:13,709 --> 00:48:15,250 This guy is not going to be too nice. 908 00:48:15,250 --> 00:48:16,570 But let's write it. 909 00:48:16,570 --> 00:48:18,820 When I take the product of those guys here, 910 00:48:18,820 --> 00:48:21,910 I'm going to pick up lambda to the sum of the xi's. 911 00:48:21,910 --> 00:48:23,470 Here I'm going to pick up exponential 912 00:48:23,470 --> 00:48:25,334 minus n times lambda. 913 00:48:25,334 --> 00:48:27,250 And here I'm going to pick up just the product 914 00:48:27,250 --> 00:48:29,200 of the factorials. 915 00:48:29,200 --> 00:48:35,900 So x1 factorial all the way to xn factorial. 916 00:48:35,900 --> 00:48:41,130 Then I get lambda to the sum of the xi's. 917 00:48:41,130 --> 00:48:43,480 Those are little xi's. 918 00:48:43,480 --> 00:48:46,581 e to the minus n lambda. 919 00:48:46,581 --> 00:48:47,080 OK? 920 00:48:51,900 --> 00:48:55,510 So that might look freaky at this point, but remember, 921 00:48:55,510 --> 00:48:58,100 this is a function we will be maximizing. 922 00:48:58,100 --> 00:49:01,480 And the denominator here does not depend on lambda. 923 00:49:01,480 --> 00:49:04,860 So we know that maximizing this function with this denominator, 924 00:49:04,860 --> 00:49:07,590 or any other denominator, including 1, 925 00:49:07,590 --> 00:49:09,930 will give me the same arg max. 926 00:49:09,930 --> 00:49:12,180 So it won't be a problem for me. 927 00:49:12,180 --> 00:49:14,349 As long as it does not depend on lambda, 928 00:49:14,349 --> 00:49:15,640 this thing is going to go away. 929 00:49:19,130 --> 00:49:24,720 OK, so in the continuous case, I cannot define the likelihood this way-- 930 00:49:24,720 --> 00:49:25,220 right? 931 00:49:25,220 --> 00:49:26,720 Because if I wrote the likelihood 932 00:49:26,720 --> 00:49:29,600 like this in the continuous case, 933 00:49:29,600 --> 00:49:32,240 this one would be equal to what? 934 00:49:32,240 --> 00:49:33,160 Zero, right? 935 00:49:33,160 --> 00:49:34,565 So it's not very helpful.
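Before the continuous-case fix, here is the discrete Poisson likelihood just computed, as a sketch in Python. The sample and the candidate lambda are arbitrary illustrative choices, and math.prod needs Python 3.8 or later.

    import math

    # L(x1, ..., xn, lam) = lam**(sum xi) * exp(-n*lam) / (x1! * ... * xn!)
    # The data and the candidate lam are arbitrary illustrative choices.

    def poisson_likelihood(xs, lam):
        n = len(xs)
        denom = math.prod(math.factorial(x) for x in xs)  # constant in lam
        return lam ** sum(xs) * math.exp(-n * lam) / denom

    xs = [3, 0, 2, 4, 1]
    print(poisson_likelihood(xs, 2.0))
    # The denominator does not depend on lam, so dropping it (or using any
    # other constant) leaves the arg max over lam unchanged.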
936 00:49:34,565 --> 00:49:36,440 And so what we do is we define the likelihood 937 00:49:36,440 --> 00:49:39,860 as the product of the f theta of the xi's. 938 00:49:39,860 --> 00:49:43,340 Now that would be a jump if I told you, 939 00:49:43,340 --> 00:49:45,230 well, just define it like that and go home 940 00:49:45,230 --> 00:49:46,700 and don't discuss it. 941 00:49:46,700 --> 00:49:52,011 But we know that this is exactly what's coming from the-- 942 00:49:52,011 --> 00:49:53,510 well, actually, I think I erased it. 943 00:49:53,510 --> 00:49:55,370 It was just behind. 944 00:49:55,370 --> 00:49:58,280 So this was exactly what was coming from the estimated KL 945 00:49:58,280 --> 00:50:00,200 divergence, right? 946 00:50:00,200 --> 00:50:01,700 The thing that I showed you-- if we 947 00:50:01,700 --> 00:50:03,200 want to follow this strategy, which 948 00:50:03,200 --> 00:50:06,830 consists in estimating the KL divergence and minimizing it, 949 00:50:06,830 --> 00:50:08,210 it's exactly doing this. 950 00:50:12,190 --> 00:50:16,730 So in the Gaussian case-- 951 00:50:16,730 --> 00:50:17,835 well, let's write it. 952 00:50:17,835 --> 00:50:19,610 So in the Gaussian case, let's see 953 00:50:19,610 --> 00:50:20,940 what the likelihood looks like. 954 00:50:27,600 --> 00:50:32,000 OK, so if I have a Gaussian experiment here-- 955 00:50:32,000 --> 00:50:33,430 did I actually write it? 956 00:50:36,440 --> 00:50:40,590 OK, so I'm going to take mu and sigma as being two parameters. 957 00:50:40,590 --> 00:50:43,756 So that means that my sample space is going to be what? 958 00:50:47,330 --> 00:50:49,700 Well, my sample space is still R. 959 00:50:49,700 --> 00:50:51,750 Those are just my observations. 960 00:50:51,750 --> 00:50:56,840 But then I'm going to have N(mu, sigma squared). 961 00:50:56,840 --> 00:50:58,400 And the parameters of interest are mu 962 00:50:58,400 --> 00:51:04,291 in R and sigma squared in (0, infinity). 963 00:51:04,291 --> 00:51:06,450 OK, so that's my Gaussian model. 964 00:51:06,450 --> 00:51:07,736 Yes. 965 00:51:07,736 --> 00:51:17,455 STUDENT: [INAUDIBLE] 966 00:51:17,455 --> 00:51:18,580 PROFESSOR: No, there's no-- 967 00:51:18,580 --> 00:51:20,080 I mean, there's no difference. 968 00:51:20,080 --> 00:51:21,514 STUDENT: [INAUDIBLE] 969 00:51:21,514 --> 00:51:22,180 PROFESSOR: Yeah. 970 00:51:22,180 --> 00:51:24,880 I think on all the slides I put the curly brackets, 971 00:51:24,880 --> 00:51:26,520 and here I'm just being lazy. 972 00:51:26,520 --> 00:51:31,540 I just like those curved parentheses. 973 00:51:31,540 --> 00:51:33,850 All right, so let's write it. 974 00:51:33,850 --> 00:51:39,670 So the definition, L of x1, ..., xn. 975 00:51:39,670 --> 00:51:43,810 And now I have two parameters, mu and sigma squared. 976 00:51:43,810 --> 00:51:48,035 We said, by definition, this is the product from i 977 00:51:48,035 --> 00:51:55,540 equal 1 to n of f theta of little xi. 978 00:51:55,540 --> 00:51:57,550 Now, think about it. 979 00:51:57,550 --> 00:52:00,790 Here we always had an extra line, right? 980 00:52:00,790 --> 00:52:03,460 The line was to say that the definition was the probability 981 00:52:03,460 --> 00:52:05,470 that the Xi's were all equal to the little xi's. 982 00:52:05,470 --> 00:52:08,230 That was the joint probability. 983 00:52:08,230 --> 00:52:12,430 And here I could actually have a line that says it's the joint 984 00:52:12,430 --> 00:52:14,146 probability density of the xi's.
985 00:52:14,146 --> 00:52:15,520 And if it's not independent, it's 986 00:52:15,520 --> 00:52:16,732 not going to be the product. 987 00:52:16,732 --> 00:52:18,190 But again, since we're only dealing 988 00:52:18,190 --> 00:52:21,020 with independent observations in the scope of this class, 989 00:52:21,020 --> 00:52:23,890 this is the only definition we're going to be using. 990 00:52:23,890 --> 00:52:26,710 OK, and actually, from here on, I 991 00:52:26,710 --> 00:52:30,910 will literally skip this step when I talk about discrete ones 992 00:52:30,910 --> 00:52:33,270 as well, because they are also independent. 993 00:52:33,270 --> 00:52:35,530 Agreed? 994 00:52:35,530 --> 00:52:37,570 So we start with this, which we agreed 995 00:52:37,570 --> 00:52:39,590 was the definition for this particular case. 996 00:52:39,590 --> 00:52:44,545 And now all of you know by heart the density of a Gaussian-- 997 00:52:44,545 --> 00:52:45,600 sorry, that's not theta. 998 00:52:45,600 --> 00:52:47,540 I should write it mu sigma squared. 999 00:52:47,540 --> 00:52:50,650 And so you need to know what this density is. 1000 00:52:50,650 --> 00:53:01,070 And it's the product of 1 over sigma square root 2 pi times 1001 00:53:01,070 --> 00:53:07,350 exponential of minus xi minus mu squared 1002 00:53:07,350 --> 00:53:10,210 divided by 2 sigma squared. 1003 00:53:10,210 --> 00:53:13,750 OK, that's the Gaussian density with parameters mu and sigma 1004 00:53:13,750 --> 00:53:15,810 squared. 1005 00:53:15,810 --> 00:53:18,360 I just plugged in this thing which I don't derive for you, 1006 00:53:18,360 --> 00:53:20,630 so you just have to trust me. 1007 00:53:20,630 --> 00:53:22,500 It's in every book. 1008 00:53:22,500 --> 00:53:25,334 Certainly, I mean, you can find it. 1009 00:53:25,334 --> 00:53:26,250 I will give it to you. 1010 00:53:26,250 --> 00:53:29,310 And again, you're not expected to know it by heart. 1011 00:53:29,310 --> 00:53:34,290 Though, if you do your homework every week, then without wanting to, 1012 00:53:34,290 --> 00:53:36,180 you will definitely use some of your brain 1013 00:53:36,180 --> 00:53:38,140 to remember that thing. 1014 00:53:38,140 --> 00:53:42,680 OK, and so now, well, I have this constant in front, 1015 00:53:42,680 --> 00:53:45,000 1 over sigma square root 2 pi, that I can pull out. 1016 00:53:45,000 --> 00:53:50,474 So I get 1 over sigma square root 2 pi to the power n. 1017 00:53:50,474 --> 00:53:52,890 And then I have the product of exponentials, which we know 1018 00:53:52,890 --> 00:53:55,420 is the exponential of the sum. 1019 00:53:55,420 --> 00:53:58,710 So this is equal to exponential of minus-- 1020 00:53:58,710 --> 00:54:01,260 and here I'm going to put the 1 over 2 sigma squared 1021 00:54:01,260 --> 00:54:02,210 outside the sum. 1022 00:54:15,740 --> 00:54:19,850 And so that's how this guy shows up. 1023 00:54:19,850 --> 00:54:23,550 Just the product of the densities evaluated at, respectively, 1024 00:54:23,550 --> 00:54:24,676 x1 to xn. 1025 00:54:28,850 --> 00:54:33,240 OK, any questions about computing those likelihoods? 1026 00:54:33,240 --> 00:54:34,556 Yes. 1027 00:54:34,556 --> 00:54:41,460 STUDENT: Why [INAUDIBLE] 1028 00:54:41,460 --> 00:54:42,890 PROFESSOR: Oh, that's a typo. 1029 00:54:42,890 --> 00:54:43,740 Thank you. 1030 00:54:43,740 --> 00:54:47,040 Because I just took it from probably the previous slide. 1031 00:54:47,040 --> 00:54:48,840 So those are actually-- should be-- 1032 00:54:48,840 --> 00:54:50,850 OK, thank you for noting that one.
1033 00:54:50,850 --> 00:55:00,180 So this line should say for any x1 to xn in R to the n. 1034 00:55:00,180 --> 00:55:01,470 Thank you, good catch. 1035 00:55:06,940 --> 00:55:10,840 All right, so that's really E to the n, right? 1036 00:55:10,840 --> 00:55:12,490 My sample space, always. 1037 00:55:16,090 --> 00:55:19,800 OK, so what is maximum likelihood estimation? 1038 00:55:19,800 --> 00:55:24,770 Well again, if you go back to the estimation 1039 00:55:24,770 --> 00:55:27,770 strategy that we got, which consisted 1040 00:55:27,770 --> 00:55:31,160 in replacing expectation with respect to theta star 1041 00:55:31,160 --> 00:55:35,540 by averages of the data in the KL divergence, 1042 00:55:35,540 --> 00:55:41,810 we would try to maximize not this guy, but this guy. 1043 00:55:45,770 --> 00:55:48,260 The things that we actually plugged in were not just any small 1044 00:55:48,260 --> 00:55:48,760 xi's. 1045 00:55:48,760 --> 00:55:52,040 They were actually the random variables, capital Xi. 1046 00:55:52,040 --> 00:55:54,190 So the maximum likelihood estimator 1047 00:55:54,190 --> 00:55:57,090 is actually taking the likelihood, 1048 00:55:57,090 --> 00:55:59,570 which is a function of little x's, and now 1049 00:55:59,570 --> 00:56:02,210 the values at which it evaluates it, if you look at it, 1050 00:56:02,210 --> 00:56:03,620 are actually-- 1051 00:56:03,620 --> 00:56:05,870 the capital X's are my data. 1052 00:56:05,870 --> 00:56:09,800 So it looks at the function, at the data, 1053 00:56:09,800 --> 00:56:11,900 and at the parameter theta. 1054 00:56:11,900 --> 00:56:14,932 So that's the first thing. 1055 00:56:14,932 --> 00:56:16,640 And then the maximum likelihood estimator 1056 00:56:16,640 --> 00:56:19,930 is maximizing this, OK? 1057 00:56:19,930 --> 00:56:24,090 So in a way, what it does is it's a function that couples 1058 00:56:24,090 --> 00:56:27,810 together the data, capital X1 to capital Xn, 1059 00:56:27,810 --> 00:56:32,310 with the parameter theta, and now just tries to maximize it. 1060 00:56:32,310 --> 00:56:40,120 So if this is just a little hard for you to get, 1061 00:56:40,120 --> 00:56:42,340 the likelihood is formally defined 1062 00:56:42,340 --> 00:56:43,750 as a function of x, right? 1063 00:56:43,750 --> 00:56:46,105 Like when I write f of x-- 1064 00:56:46,105 --> 00:56:48,580 f of little x, I define it like that. 1065 00:56:48,580 --> 00:56:52,990 But really, the only x arguments we're 1066 00:56:52,990 --> 00:56:54,680 going to evaluate this function at 1067 00:56:54,680 --> 00:56:57,920 are always the random variables, which are the data. 1068 00:56:57,920 --> 00:56:59,440 So if you want, you can think of it 1069 00:56:59,440 --> 00:57:02,230 as those guys being not parameters of this function, 1070 00:57:02,230 --> 00:57:04,810 but really, random variables themselves directly. 1071 00:57:09,390 --> 00:57:10,683 Is there any question? 1072 00:57:10,683 --> 00:57:15,516 STUDENT: [INAUDIBLE] those random variables [INAUDIBLE]? 1073 00:57:15,516 --> 00:57:17,890 PROFESSOR: So those are going to be known once you have-- 1074 00:57:17,890 --> 00:57:20,500 so it's always the same thing in stats. 1075 00:57:20,500 --> 00:57:24,040 You first design your estimator as a function 1076 00:57:24,040 --> 00:57:25,270 of random variables. 1077 00:57:25,270 --> 00:57:27,490 And then once you get data, you just plug it in.
1078 00:57:27,490 --> 00:57:29,920 But we want to think of them as being random variables 1079 00:57:29,920 --> 00:57:32,262 because we want to understand what the fluctuations are. 1080 00:57:32,262 --> 00:57:34,720 So we're going to keep them as random variables for as long 1081 00:57:34,720 --> 00:57:35,685 as we can. 1082 00:57:35,685 --> 00:57:37,810 We're going to spit out the estimator as a function 1083 00:57:37,810 --> 00:57:38,690 of the random variables. 1084 00:57:38,690 --> 00:57:40,060 And then when we want to compute it from data, 1085 00:57:40,060 --> 00:57:41,351 we're just going to plug it in. 1086 00:57:44,170 --> 00:57:46,630 So keep the random variables for as long as you can. 1087 00:57:46,630 --> 00:57:48,430 Unless I give you numbers, actual numbers, 1088 00:57:48,430 --> 00:57:51,130 those are random variables. 1089 00:57:51,130 --> 00:57:53,549 OK, so there might be some confusion 1090 00:57:53,549 --> 00:57:55,590 if you've seen any stats class; sometimes there's 1091 00:57:55,590 --> 00:57:58,420 a notation which says, oh, the realizations 1092 00:57:58,420 --> 00:58:01,240 of the random variables are lowercase versions 1093 00:58:01,240 --> 00:58:02,730 of the original random variables. 1094 00:58:02,730 --> 00:58:05,920 So lowercase x should be thought of as the realization 1095 00:58:05,920 --> 00:58:09,610 of the uppercase X. This is not the case here. 1096 00:58:09,610 --> 00:58:12,010 When I write this, it's the same way 1097 00:58:12,010 --> 00:58:16,630 as I write f of x is equal to x squared, right? 1098 00:58:16,630 --> 00:58:20,260 It's just an argument of a function that I want to define. 1099 00:58:20,260 --> 00:58:22,150 So those are just generic x's. 1100 00:58:22,150 --> 00:58:24,580 So if you correct the typo that I have, 1101 00:58:24,580 --> 00:58:27,150 this should say for any x1 to xn. 1102 00:58:27,150 --> 00:58:28,990 I'm just describing a function. 1103 00:58:28,990 --> 00:58:30,816 And now the only place at which I'm 1104 00:58:30,816 --> 00:58:32,440 interested in evaluating that function, 1105 00:58:32,440 --> 00:58:35,477 at least for those first n arguments, is at the capital 1106 00:58:35,477 --> 00:58:37,310 X's, the n observed random variables that I have. 1107 00:58:41,110 --> 00:58:45,040 So there are actually texts, there are actually 1108 00:58:45,040 --> 00:58:48,070 people doing research on when the maximum likelihood 1109 00:58:48,070 --> 00:58:49,720 estimator exists. 1110 00:58:49,720 --> 00:58:56,890 And that question arises when you have an infinite parameter set Theta. 1111 00:58:56,890 --> 00:58:58,770 And this thing can diverge. 1112 00:58:58,770 --> 00:59:00,160 There is no global maximum. 1113 00:59:00,160 --> 00:59:01,990 There are crazy things that might happen. 1114 00:59:01,990 --> 00:59:04,630 And so we're actually always going to be in a case 1115 00:59:04,630 --> 00:59:07,450 where this maximum likelihood estimator exists. 1116 00:59:07,450 --> 00:59:09,580 And if it doesn't, then it means that you actually 1117 00:59:09,580 --> 00:59:13,840 need to restrict your parameter space, capital Theta, 1118 00:59:13,840 --> 00:59:15,430 to something smaller. 1119 00:59:15,430 --> 00:59:17,500 Otherwise it won't exist. 1120 00:59:17,500 --> 00:59:21,910 OK, so another thing is the log likelihood. 1121 00:59:21,910 --> 00:59:23,800 Maximizing it still gives the maximum likelihood estimator.
1122 00:59:23,800 --> 00:59:26,380 We saw before that maximizing a function 1123 00:59:26,380 --> 00:59:27,820 or maximizing the log of this function 1124 00:59:27,820 --> 00:59:30,350 is the same thing, because the log function is increasing. 1125 00:59:30,350 --> 00:59:32,100 The same goes for maximizing a function 1126 00:59:32,100 --> 00:59:35,352 or maximizing, I don't know, the exponential of this function. 1127 00:59:35,352 --> 00:59:37,060 Every time I take an increasing function, 1128 00:59:37,060 --> 00:59:38,410 it's actually the same thing. 1129 00:59:38,410 --> 00:59:40,360 Maximizing a function or maximizing 10 times 1130 00:59:40,360 --> 00:59:41,693 this function is the same thing. 1131 00:59:41,693 --> 00:59:45,730 The function x maps to 10 times x is increasing. 1132 00:59:45,730 --> 00:59:49,480 And so why do we talk about log likelihood rather than 1133 00:59:49,480 --> 00:59:50,620 likelihood? 1134 00:59:50,620 --> 00:59:52,590 So the log likelihood is really just-- 1135 00:59:52,590 --> 00:59:55,810 I mean, the log likelihood is the log of the likelihood. 1136 00:59:55,810 --> 00:59:59,420 And the reason is exactly this kind of reason. 1137 00:59:59,420 --> 01:00:02,240 Remember, that was my likelihood, right? 1138 01:00:02,240 --> 01:00:04,170 And I want to maximize it. 1139 01:00:04,170 --> 01:00:05,940 And it turns out that in stats, there are 1140 01:00:05,940 --> 01:00:10,410 a lot of distributions that look like exponential of something. 1141 01:00:10,410 --> 01:00:12,930 So I might as well just remove the exponential 1142 01:00:12,930 --> 01:00:14,730 by taking the log. 1143 01:00:14,730 --> 01:00:17,230 So once I have this guy, I can take the log. 1144 01:00:17,230 --> 01:00:19,080 This is something to a power of something. 1145 01:00:19,080 --> 01:00:21,720 If I take the log, it's going to look better for me. 1146 01:00:21,720 --> 01:00:23,400 I have this thing-- 1147 01:00:23,400 --> 01:00:25,650 well, I had another one somewhere, I think, 1148 01:00:25,650 --> 01:00:27,910 where I had the Poisson. 1149 01:00:27,910 --> 01:00:29,070 Where was the Poisson? 1150 01:00:29,070 --> 01:00:31,890 The Poisson's gone. 1151 01:00:31,890 --> 01:00:33,610 So the Poisson was the same thing. 1152 01:00:33,610 --> 01:00:35,670 If I took the log, because it had a power, 1153 01:00:35,670 --> 01:00:37,210 that would make my life easier. 1154 01:00:37,210 --> 01:00:43,800 So the log doesn't have any particular intrinsic meaning, 1155 01:00:43,800 --> 01:00:47,550 except that it's just more convenient. 1156 01:00:47,550 --> 01:00:49,500 Now, that being said, if you think 1157 01:00:49,500 --> 01:00:53,370 about minimizing the KL, the original formulation, 1158 01:00:53,370 --> 01:00:55,590 we actually removed the log. 1159 01:00:55,590 --> 01:00:57,040 If we come back to the KL thing-- 1160 01:01:00,700 --> 01:01:01,610 where is my KL? 1161 01:01:01,610 --> 01:01:03,770 Sorry. 1162 01:01:03,770 --> 01:01:08,630 That was maximizing the sum of the logs of the p theta of the xi's. 1163 01:01:08,630 --> 01:01:11,870 And so then we worked at it by saying that 1164 01:01:11,870 --> 01:01:12,539 maximizing 1165 01:01:12,539 --> 01:01:14,330 the sum of the logs was the same 1166 01:01:14,330 --> 01:01:16,220 as maximizing the product. 1167 01:01:16,220 --> 01:01:18,140 But here, with the log likelihood, 1168 01:01:18,140 --> 01:01:21,571 we're basically just going backwards in this chain of equivalences.
1169 01:01:21,571 --> 01:01:23,570 And that's just because the original formulation 1170 01:01:23,570 --> 01:01:27,180 was already convenient. 1171 01:01:27,180 --> 01:01:28,940 So we went to the likelihood, 1172 01:01:28,940 --> 01:01:32,620 and now we're coming back to our original estimation strategy. 1173 01:01:32,620 --> 01:01:34,250 So look at the Poisson. 1174 01:01:34,250 --> 01:01:39,210 I want to take the log here to bring my sum of the xi's down from the exponent. 1175 01:01:39,210 --> 01:01:47,510 OK, so this is my estimator. 1176 01:01:47,510 --> 01:01:50,090 So the log of L-- 1177 01:01:50,090 --> 01:01:51,590 so one thing that you want to notice 1178 01:01:51,590 --> 01:01:59,960 is that the log of L of x1, ..., xn, theta, as we said, 1179 01:01:59,960 --> 01:02:02,860 is equal to the sum from i equal 1 1180 01:02:02,860 --> 01:02:09,950 to n of the log of p theta of xi-- 1181 01:02:09,950 --> 01:02:11,270 so that's in the discrete case. 1182 01:02:11,270 --> 01:02:14,480 And in the continuous case, it is the sum 1183 01:02:14,480 --> 01:02:16,627 of the log of f theta of xi. 1184 01:02:19,277 --> 01:02:21,860 The beauty of this is that you don't have to really understand 1185 01:02:21,860 --> 01:02:23,360 the difference between a probability mass 1186 01:02:23,360 --> 01:02:25,310 function and a probability density function 1187 01:02:25,310 --> 01:02:26,690 to implement this. 1188 01:02:26,690 --> 01:02:29,518 Whatever you get, that's what you plug in. 1189 01:02:32,930 --> 01:02:33,810 Any questions so far? 1190 01:02:36,550 --> 01:02:39,940 All right, so shall we do some computations 1191 01:02:39,940 --> 01:02:44,720 and check that we've introduced all this stuff-- 1192 01:02:44,720 --> 01:02:47,380 complicated functions, maximizing, KL divergence, 1193 01:02:47,380 --> 01:02:50,590 lots of things-- so that we can spit out, again, averages? 1194 01:02:50,590 --> 01:02:51,160 All right? 1195 01:02:51,160 --> 01:02:51,580 That's great. 1196 01:02:51,580 --> 01:02:52,810 We're going to be able to sleep at night 1197 01:02:52,810 --> 01:02:55,150 knowing that there's a really powerful mechanism called 1198 01:02:55,150 --> 01:02:57,370 the maximum likelihood estimator that was actually 1199 01:02:57,370 --> 01:03:00,370 driving our intuition without us knowing. 1200 01:03:00,370 --> 01:03:04,730 OK, so let's do this. 1201 01:03:04,730 --> 01:03:06,240 Bernoulli trials. 1202 01:03:06,240 --> 01:03:07,400 I still have it over there. 1203 01:03:15,920 --> 01:03:19,120 OK, so actually, I don't know what-- 1204 01:03:19,120 --> 01:03:21,260 well, let me write it like that. 1205 01:03:21,260 --> 01:03:25,730 So it's p over 1 minus p to the 1206 01:03:25,730 --> 01:03:32,650 sum of the xi's, times 1 minus p to the n. 1207 01:03:32,650 --> 01:03:37,960 So now I want to maximize this as a function of p. 1208 01:03:37,960 --> 01:03:39,880 Well, the first thing we would want to do 1209 01:03:39,880 --> 01:03:41,860 is to check that this function is concave. 1210 01:03:41,860 --> 01:03:45,220 And I'm just going to ask you to trust me on this. 1211 01:03:45,220 --> 01:03:47,800 I only 1212 01:03:47,800 --> 01:03:52,520 want to take the derivative and just go home. 1213 01:03:52,520 --> 01:03:55,150 So let's just take the derivative of this with respect 1214 01:03:55,150 --> 01:03:56,332 to p. Actually, no. 1215 01:03:56,332 --> 01:03:57,540 This one was more convenient. 1216 01:03:57,540 --> 01:03:58,040 I'm sorry.
1217 01:04:00,820 --> 01:04:03,100 This one was slightly more convenient, OK? 1218 01:04:03,100 --> 01:04:05,980 So now we have-- 1219 01:04:05,980 --> 01:04:09,130 so now let me take the log. 1220 01:04:09,130 --> 01:04:16,960 So if I take the log, what I get is sum of the xi's times log p, 1221 01:04:16,960 --> 01:04:24,704 plus n minus sum of the xi's, times log of 1 minus p. 1222 01:04:27,970 --> 01:04:29,590 Now I take the derivative with respect 1223 01:04:29,590 --> 01:04:35,837 to p and set it equal to zero. 1224 01:04:35,837 --> 01:04:36,920 So what does that give me? 1225 01:04:36,920 --> 01:04:43,710 It tells me that sum of the xi's divided by p, minus n 1226 01:04:43,710 --> 01:04:50,130 minus sum of the xi's divided by 1 minus p, is equal to 0. 1227 01:04:56,360 --> 01:04:58,980 So now I need to solve for p. 1228 01:04:58,980 --> 01:04:59,920 So let's just do it. 1229 01:04:59,920 --> 01:05:06,500 So what we get is that 1 minus p times sum of the xi's is equal to p times n 1230 01:05:06,500 --> 01:05:10,530 minus sum of the xi's. 1231 01:05:10,530 --> 01:05:17,300 So that's sum of the xi's is equal to p times n minus sum of the xi's, plus p times sum of the xi's. 1232 01:05:17,300 --> 01:05:18,550 So let me put it together on the right. 1233 01:05:18,550 --> 01:05:24,410 So that's p times n is equal to sum of the xi's. 1234 01:05:24,410 --> 01:05:27,170 And that's equivalent to p-- 1235 01:05:27,170 --> 01:05:30,020 actually, I should start by putting p hat from here 1236 01:05:30,020 --> 01:05:33,720 on, because I'm already solving an equation, right? 1237 01:05:33,720 --> 01:05:36,880 And so p hat is equal to sum of the xi's 1238 01:05:36,880 --> 01:05:38,510 divided by n, which is my Xn bar. 1239 01:05:44,050 --> 01:05:50,280 Poisson model-- as I said, the Poisson is gone. 1240 01:05:50,280 --> 01:05:51,874 So let me rewrite it quickly. 1241 01:06:00,850 --> 01:06:07,975 So Poisson, the likelihood in x1, ..., xn and lambda 1242 01:06:07,975 --> 01:06:13,270 was equal to lambda to the sum of the xi's, e 1243 01:06:13,270 --> 01:06:17,650 to the minus n lambda, divided by x1 factorial 1244 01:06:17,650 --> 01:06:20,920 all the way to xn factorial. 1245 01:06:20,920 --> 01:06:25,110 So let me take the log likelihood. 1246 01:06:25,110 --> 01:06:26,490 That's going to be equal to what? 1247 01:06:26,490 --> 01:06:27,406 Well, 1248 01:06:27,406 --> 01:06:29,096 let me 1249 01:06:29,096 --> 01:06:30,720 get rid of this guy first. 1250 01:06:30,720 --> 01:06:36,780 Minus log of x1 factorial all the way to xn factorial. 1251 01:06:36,780 --> 01:06:39,520 That's a constant with respect to lambda. 1252 01:06:39,520 --> 01:06:43,180 So when I take the derivative, it's going to go. 1253 01:06:43,180 --> 01:06:49,410 Then I'm going to have plus sum of the xi's times log lambda. 1254 01:06:49,410 --> 01:06:51,410 And then I'm going to have minus n lambda. 1255 01:06:54,390 --> 01:06:55,890 So now you take the derivative 1256 01:06:55,890 --> 01:06:57,660 and set it equal to zero. 1257 01:06:57,660 --> 01:07:04,860 So the partial derivative with respect to lambda of log L, 1258 01:07:04,860 --> 01:07:08,820 set equal to zero. 1259 01:07:08,820 --> 01:07:11,160 This is equivalent to-- so this guy goes. 1260 01:07:11,160 --> 01:07:16,440 This guy gives me sum of the xi's divided by lambda hat 1261 01:07:16,440 --> 01:07:17,070 equals n. 1262 01:07:22,470 --> 01:07:25,690 And so that's equivalent to lambda hat 1263 01:07:25,690 --> 01:07:31,092 being equal to sum of the xi's divided by n, which is Xn bar.
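A sketch that double-checks these two closed forms by brute force, maximizing each log likelihood over a fine grid and comparing with the sample mean. The data sets and the grids are arbitrary choices, not from the lecture.

    import math

    # Brute-force check of the two MLEs above: maximize the log likelihood
    # over a grid and compare with the sample mean. Data are arbitrary.

    def bernoulli_loglik(xs, p):
        s = sum(xs)
        return s * math.log(p) + (len(xs) - s) * math.log(1 - p)

    def poisson_loglik(xs, lam):
        # drops the -log(x1! ... xn!) term, which does not move the arg max
        return sum(xs) * math.log(lam) - len(xs) * lam

    xs = [1, 0, 1, 1, 0, 1]                      # Bernoulli sample
    grid = [i / 1000 for i in range(1, 1000)]    # p in (0, 1)
    p_hat = max(grid, key=lambda p: bernoulli_loglik(xs, p))
    print(p_hat, sum(xs) / len(xs))              # both about 0.667

    xs = [3, 0, 2, 4, 1]                         # Poisson sample
    grid = [i / 100 for i in range(1, 1001)]     # lambda in (0, 10]
    lam_hat = max(grid, key=lambda lam: poisson_loglik(xs, lam))
    print(lam_hat, sum(xs) / len(xs))            # both 2.0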
1264 01:07:34,044 --> 01:07:38,785 Take the derivative, set it equal to zero, and just solve. 1265 01:07:38,785 --> 01:07:42,930 It's a very satisfying exercise, especially when 1266 01:07:42,930 --> 01:07:45,150 you get the average in the end. 1267 01:07:45,150 --> 01:07:49,060 You don't have to think about it forever. 1268 01:07:49,060 --> 01:07:54,360 OK, the Gaussian model I'm going to leave to you as an exercise. 1269 01:07:54,360 --> 01:07:57,600 Take the log to get rid of the pesky exponential, 1270 01:07:57,600 --> 01:08:00,690 and then take the derivative and you should be fine. 1271 01:08:00,690 --> 01:08:02,940 It's a bit more-- 1272 01:08:02,940 --> 01:08:05,960 it might be one more line than those guys. 1273 01:08:05,960 --> 01:08:12,760 OK, so-- well actually, you need to take 1274 01:08:12,760 --> 01:08:14,040 the gradient in this case. 1275 01:08:14,040 --> 01:08:15,930 Don't check the second derivative right now. 1276 01:08:15,930 --> 01:08:17,596 You don't have to really think about it. 1277 01:08:21,430 --> 01:08:23,537 What did I want to add? 1278 01:08:23,537 --> 01:08:25,370 I think there was something I wanted to say. 1279 01:08:25,370 --> 01:08:27,319 Yes. 1280 01:08:27,319 --> 01:08:31,040 When I have a function that's concave and I'm on, like, 1281 01:08:31,040 --> 01:08:33,671 some infinite interval, then it's 1282 01:08:33,671 --> 01:08:36,170 true that taking the derivative and setting it equal to zero 1283 01:08:36,170 --> 01:08:38,029 will give me the maximum. 1284 01:08:38,029 --> 01:08:42,330 But again, I might have a function that looks like this. 1285 01:08:42,330 --> 01:08:46,260 Now, if I'm on some finite interval-- let me go elsewhere. 1286 01:08:46,260 --> 01:08:55,550 So if I'm on some finite interval 1287 01:08:55,550 --> 01:09:00,979 and my function looks like this as a function of theta-- 1288 01:09:00,979 --> 01:09:03,220 let's say this is my log likelihood 1289 01:09:03,220 --> 01:09:06,410 as a function of theta-- 1290 01:09:06,410 --> 01:09:13,200 then, OK, there's no place in this interval-- 1291 01:09:13,200 --> 01:09:15,040 let's say this is between 0 and 1-- there's 1292 01:09:15,040 --> 01:09:19,870 no place in this interval where the derivative is equal to 0. 1293 01:09:19,870 --> 01:09:22,569 And if you actually try to solve this, 1294 01:09:22,569 --> 01:09:26,187 you won't find a solution in the interval 0, 1. 1295 01:09:26,187 --> 01:09:28,270 And that's actually how you know that you probably 1296 01:09:28,270 --> 01:09:30,144 should not be setting the derivative equal to zero. 1297 01:09:30,144 --> 01:09:32,720 So don't panic if you get something that says, 1298 01:09:32,720 --> 01:09:34,720 well, the solution is at infinity, right? 1299 01:09:34,720 --> 01:09:36,285 If this function keeps going, you 1300 01:09:36,285 --> 01:09:37,660 will find that-- 1301 01:09:37,660 --> 01:09:40,490 you won't be able to find a solution other than at infinity. 1302 01:09:40,490 --> 01:09:43,720 You are going to see something like 1 over theta hat 1303 01:09:43,720 --> 01:09:46,359 is equal to 0, or something like this. 1304 01:09:46,359 --> 01:09:48,939 So you know that when you've found this kind of solution, 1305 01:09:48,939 --> 01:09:51,370 you've probably made a mistake at some point. 1306 01:09:51,370 --> 01:09:54,820 And the reason is that for functions like this, 1307 01:09:54,820 --> 01:09:58,150 you don't find the maximum by setting the derivative equal 1308 01:09:58,150 --> 01:09:59,230 to zero.
1309 01:09:59,230 --> 01:10:01,159 You actually just find the maximum by saying, 1310 01:10:01,159 --> 01:10:03,450 well, it's an increasing function on the interval 0, 1, 1311 01:10:03,450 --> 01:10:05,000 so the maximum must be attained at 1. 1312 01:10:07,209 --> 01:10:08,750 So here in this case, that would mean 1313 01:10:08,750 --> 01:10:12,560 that my maximum would be 1. 1314 01:10:12,560 --> 01:10:14,540 My estimator would be 1, which would be weird. 1315 01:10:14,540 --> 01:10:17,316 So typically, the end of the interval here is a function of the xi's. 1316 01:10:17,316 --> 01:10:19,940 So one example that you will see many times is when this guy is 1317 01:10:19,940 --> 01:10:24,870 the maximum of the xi's. 1318 01:10:24,870 --> 01:10:27,210 In which case, the maximum is attained here, 1319 01:10:27,210 --> 01:10:29,190 at the maximum of the xi's. 1320 01:10:29,190 --> 01:10:31,620 OK, so just keep in mind-- 1321 01:10:31,620 --> 01:10:33,840 what I would recommend is every time 1322 01:10:33,840 --> 01:10:36,450 you're trying to take the maximum of a function, 1323 01:10:36,450 --> 01:10:39,320 just try to plot the function in your head. 1324 01:10:39,320 --> 01:10:40,380 It's not too complicated. 1325 01:10:40,380 --> 01:10:44,790 Those things are usually squares, or square roots, 1326 01:10:44,790 --> 01:10:45,630 or logs. 1327 01:10:45,630 --> 01:10:47,430 You know what those functions look like. 1328 01:10:47,430 --> 01:10:50,040 Just plot them in your mind and make sure 1329 01:10:50,040 --> 01:10:52,230 that you find a maximum where the function really 1330 01:10:52,230 --> 01:10:54,210 goes up and then down again. 1331 01:10:54,210 --> 01:10:56,400 If you don't, then that means your maximum 1332 01:10:56,400 --> 01:10:59,370 is achieved at the boundary and you have 1333 01:10:59,370 --> 01:11:01,950 to think differently to get it. 1334 01:11:01,950 --> 01:11:04,590 So the machinery that consists in setting the derivative equal 1335 01:11:04,590 --> 01:11:06,870 to zero works 80% of the time. 1336 01:11:06,870 --> 01:11:08,880 But you do have to be careful. 1337 01:11:08,880 --> 01:11:11,880 And from the context, it will be clear 1338 01:11:11,880 --> 01:11:14,460 that you had to be careful, because you will find 1339 01:11:14,460 --> 01:11:17,190 some crazy stuff, such as having to solve 1 over theta hat 1340 01:11:17,190 --> 01:11:18,090 equals zero. 1341 01:11:23,140 --> 01:11:25,280 All right, so before we conclude, 1342 01:11:25,280 --> 01:11:28,090 I just wanted to give you some intuition about how 1343 01:11:28,090 --> 01:11:30,620 the maximum likelihood performs. 1344 01:11:30,620 --> 01:11:33,070 So there's something called the Fisher information 1345 01:11:33,070 --> 01:11:35,980 that essentially controls how this thing performs. 1346 01:11:35,980 --> 01:11:38,710 And the Fisher information is, essentially, 1347 01:11:38,710 --> 01:11:40,420 a second derivative or a Hessian. 1348 01:11:40,420 --> 01:11:44,980 So if I'm in a one-dimensional parameter case, it's a number, 1349 01:11:44,980 --> 01:11:46,300 it's a second derivative. 1350 01:11:46,300 --> 01:11:51,000 If I'm in a multidimensional case, it's actually a Hessian, 1351 01:11:51,000 --> 01:11:52,780 it's a matrix. 1352 01:11:52,780 --> 01:11:57,800 So I'm going to actually introduce the notation little curly l 1353 01:11:57,800 --> 01:12:00,670 of theta for the log likelihood, OK? 1354 01:12:00,670 --> 01:12:02,910 And that's the log likelihood for one observation.
1355 01:12:02,910 --> 01:12:05,560 So let's call it x generically, but think of it as being x1, 1356 01:12:05,560 --> 01:12:07,480 for example. 1357 01:12:07,480 --> 01:12:09,250 And I don't care about, like, summing, 1358 01:12:09,250 --> 01:12:11,260 because I'm actually going to take the expectation of this thing. 1359 01:12:11,260 --> 01:12:13,176 So it's not going to be a data-driven quantity 1360 01:12:13,176 --> 01:12:14,390 I'm going to play with. 1361 01:12:14,390 --> 01:12:15,806 So now I'm going to assume that it 1362 01:12:15,806 --> 01:12:19,390 is twice differentiable, almost surely, because it's 1363 01:12:19,390 --> 01:12:21,350 a random function. 1364 01:12:21,350 --> 01:12:23,890 And so now I'm going to just sweep under the rug 1365 01:12:23,890 --> 01:12:27,700 some technical conditions for when these things hold. 1366 01:12:27,700 --> 01:12:32,130 So typically, when can I permute integrals and derivatives 1367 01:12:32,130 --> 01:12:35,160 and this kind of stuff that you don't want to think about? 1368 01:12:35,160 --> 01:12:36,730 OK, the rule of thumb is it always 1369 01:12:36,730 --> 01:12:39,589 works until it doesn't, in which case 1370 01:12:39,589 --> 01:12:41,380 that probably means you're actually solving 1371 01:12:41,380 --> 01:12:44,080 some sort of calculus problem. 1372 01:12:44,080 --> 01:12:47,390 Because in practice, it just doesn't happen. 1373 01:12:47,390 --> 01:12:56,010 So the Fisher information is the expectation of the-- 1374 01:12:56,010 --> 01:12:57,790 that's called the outer product. 1375 01:12:57,790 --> 01:13:01,240 So that's the product of this gradient 1376 01:13:01,240 --> 01:13:02,390 and the gradient transpose. 1377 01:13:02,390 --> 01:13:04,540 So that forms a matrix, right? 1378 01:13:04,540 --> 01:13:09,830 That's a matrix, minus the outer product of the expectations. 1379 01:13:09,830 --> 01:13:12,910 So that's really what's called the covariance matrix 1380 01:13:12,910 --> 01:13:16,285 of this vector, nabla of little l of theta, which 1381 01:13:16,285 --> 01:13:18,090 is a random vector. 1382 01:13:18,090 --> 01:13:21,042 So I'm forming the covariance matrix of this thing. 1383 01:13:21,042 --> 01:13:23,250 And the technical conditions tell me that, actually, 1384 01:13:23,250 --> 01:13:26,600 this guy, which depends only on the gradient, 1385 01:13:26,600 --> 01:13:31,115 is actually equal to 1386 01:13:31,115 --> 01:13:32,240 the 1387 01:13:32,240 --> 01:13:36,140 negative expectation of the Hessian. 1388 01:13:36,140 --> 01:13:38,300 So I can actually get a quantity that 1389 01:13:38,300 --> 01:13:40,400 depends on the second derivatives using only 1390 01:13:40,400 --> 01:13:41,740 first derivatives. 1391 01:13:41,740 --> 01:13:44,202 But the expectation is going to play a role here. 1392 01:13:44,202 --> 01:13:45,410 And the fact that it's a log. 1393 01:13:45,410 --> 01:13:48,180 And lots of things actually show up here. 1394 01:13:48,180 --> 01:13:51,220 And so in this case, what I get is that-- 1395 01:13:51,220 --> 01:13:53,510 so in the one-dimensional case, this 1396 01:13:53,510 --> 01:13:56,480 is just the covariance matrix of a one-dimensional thing, which 1397 01:13:56,480 --> 01:13:58,200 is just the variance. 1398 01:13:58,200 --> 01:14:00,050 So the variance of the derivative 1399 01:14:00,050 --> 01:14:04,190 is actually equal to the negative of the expectation 1400 01:14:04,190 --> 01:14:07,080 of the second derivative.
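In the one-dimensional case, this identity, Var(l'(theta)) = -E[l''(theta)], can be checked exactly for a single Bernoulli(p) observation, where l(p) = x log p + (1 - x) log(1 - p) and both sides come out to 1/(p(1 - p)). A sketch; p = 0.3 is an arbitrary choice.

    # Exact check of Var(l'(p)) = -E[l''(p)] for one Bernoulli(p) draw,
    # with l(p) = x*log(p) + (1 - x)*log(1 - p). p = 0.3 is arbitrary.

    p = 0.3

    def score(x, p):          # l'(p)
        return x / p - (1 - x) / (1 - p)

    def l_second(x, p):       # l''(p)
        return -x / p ** 2 - (1 - x) / (1 - p) ** 2

    # Expectations over x in {0, 1} with P(X = 1) = p
    mean_score = p * score(1, p) + (1 - p) * score(0, p)      # equals 0
    var_score = (p * score(1, p) ** 2
                 + (1 - p) * score(0, p) ** 2) - mean_score ** 2
    neg_exp_hess = -(p * l_second(1, p) + (1 - p) * l_second(0, p))

    print(var_score, neg_exp_hess, 1 / (p * (1 - p)))  # all about 4.76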
1401 01:14:07,080 --> 01:14:09,280 OK, so we'll see that next time. 1402 01:14:09,280 --> 01:14:12,600 But what I wanted to emphasize with this is why do 1403 01:14:12,600 --> 01:14:15,109 we care about this quantity? 1404 01:14:15,109 --> 01:14:16,650 That's called the Fisher information. 1405 01:14:16,650 --> 01:14:19,770 Fisher is the founding father of modern statistics. 1406 01:14:19,770 --> 01:14:23,070 Why do we give this quantity his name? 1407 01:14:23,070 --> 01:14:25,546 Well, it's because this quantity is actually very critical. 1408 01:14:25,546 --> 01:14:27,420 What does the second derivative of a function 1409 01:14:27,420 --> 01:14:29,560 tell me at the maximum? 1410 01:14:29,560 --> 01:14:34,350 Well, it's telling me how curved it is, right? 1411 01:14:34,350 --> 01:14:37,780 If I have a zero second derivative, I'm basically flat. 1412 01:14:37,780 --> 01:14:41,137 And if I have a very high second derivative, I'm very curvy. 1413 01:14:41,137 --> 01:14:42,720 And when I'm very curvy, what it means 1414 01:14:42,720 --> 01:14:45,760 is that I'm very robust to the estimation error. 1415 01:14:45,760 --> 01:14:47,160 Remember our estimation strategy, 1416 01:14:47,160 --> 01:14:50,130 which consisted in replacing expectations by averages? 1417 01:14:50,130 --> 01:14:52,830 If I'm extremely curvy, the function can move a little bit, 1418 01:14:52,830 --> 01:14:55,410 and the maximum is not going to move much. 1419 01:14:55,410 --> 01:14:57,280 And this formula here-- 1420 01:14:57,280 --> 01:15:00,090 so forget about the matrix version for a second-- 1421 01:15:00,090 --> 01:15:01,780 is actually telling me exactly-- 1422 01:15:01,780 --> 01:15:06,000 it's telling me the curvature is basically the variance 1423 01:15:06,000 --> 01:15:08,290 of the first derivative. 1424 01:15:08,290 --> 01:15:10,840 And so the more the first derivative fluctuates, 1425 01:15:10,840 --> 01:15:12,930 the more your maximum-- your arg max 1426 01:15:12,930 --> 01:15:14,710 is going to move all over the place. 1427 01:15:14,710 --> 01:15:16,950 So this is really controlling how flat 1428 01:15:16,950 --> 01:15:20,280 your likelihood, your log likelihood, is at its maximum. 1429 01:15:20,280 --> 01:15:23,340 The flatter it is, the more sensitive to fluctuations 1430 01:15:23,340 --> 01:15:24,630 the arg max is going to be. 1431 01:15:24,630 --> 01:15:27,060 The curvier it is, the less sensitive it is. 1432 01:15:27,060 --> 01:15:28,740 And so what we're hoping-- is a good model 1433 01:15:28,740 --> 01:15:31,710 going to be one that has a large or a small value 1434 01:15:31,710 --> 01:15:34,350 for the Fisher information? 1435 01:15:34,350 --> 01:15:36,938 I want this to be-- 1436 01:15:36,938 --> 01:15:38,300 small? 1437 01:15:38,300 --> 01:15:40,070 I want it to be large. 1438 01:15:40,070 --> 01:15:42,290 Because this is the curvature, right? 1439 01:15:42,290 --> 01:15:44,414 This number is negative; it's concave. 1440 01:15:44,414 --> 01:15:45,830 So if I take a negative sign, it's 1441 01:15:45,830 --> 01:15:47,810 going to be something that's positive. 1442 01:15:47,810 --> 01:15:51,230 And the larger this thing, the more curvy it is. 1443 01:15:51,230 --> 01:15:52,730 Oh, yeah, because it's the variance. 1444 01:15:52,730 --> 01:15:53,271 Again, sorry. 1445 01:15:53,271 --> 01:15:55,430 This is what-- 1446 01:15:55,430 --> 01:15:55,930 OK.
1447 01:15:59,480 --> 01:16:02,156 Yeah, maybe I should not go into those details 1448 01:16:02,156 --> 01:16:03,530 because I'm actually out of time. 1449 01:16:03,530 --> 01:16:06,890 But just a spoiler alert: the asymptotic variance-- 1450 01:16:06,890 --> 01:16:09,020 the variance, basically, as n 1451 01:16:09,020 --> 01:16:11,370 goes to infinity, of the maximum likelihood estimator 1452 01:16:11,370 --> 01:16:12,830 is going to be 1 over this guy. 1453 01:16:12,830 --> 01:16:15,260 So we want it to be large, because then the asymptotic variance 1454 01:16:15,260 --> 01:16:16,910 is going to be very small. 1455 01:16:16,910 --> 01:16:18,650 All right, so we're out of time. 1456 01:16:18,650 --> 01:16:20,630 We'll see that next week. 1457 01:16:20,630 --> 01:16:22,730 And I have your homework with me. 1458 01:16:22,730 --> 01:16:25,052 And I will actually hand it back. 1459 01:16:25,052 --> 01:16:26,510 I will give it to you outside so we 1460 01:16:26,510 --> 01:16:28,580 can let the other room come in. 1461 01:16:28,580 --> 01:16:31,630 OK, I'll just leave you the--