The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PHILIPPE RIGOLLET: --bunch of x's and a bunch of y's. The y's were univariate, just one real-valued random variable. And the x's were vectors that described a bunch of attributes for each of our individuals or each of our observations. Let's assume now that we're given essentially only the x's. This is sometimes referred to as unsupervised learning. There are just the x's; usually, the supervision is done by the y's. And so what you're trying to do is to make sense of this data. You're going to try to understand this data, represent this data, visualize this data, try to understand something, right?

So I give you a d-dimensional random vector, and you're going to have n independent copies of this random vector, OK? You will see that I'm going to very quickly run into some limitations about what I can actually draw on the board, because I'm using boldface here, and I'm also going to use the blackboard boldface, so it's going to be a bit difficult. So tell me if you're actually a little confused about what is a vector, what is a number, and what is a matrix. But we'll get there.

So I have X in R^d, and that's a random vector. And I have X_1 to X_n that are i.i.d. -- they're independent copies of X. So you can think of the realizations of those guys as a cloud of n points in R^d. And we're going to think of d as being fairly large. For this to start to make sense, we're going to think of d as being at least 4, OK? Meaning that you're going to have a hard time visualizing those things. If it was 3 or 2, you would be able to draw these points.
And that's pretty much as much sense as you're going to be making of those guys, just looking at the picture.

All right, so I'm going to write each of those X's out. This vector X has d coordinates, and I'm going to write them as X^1 to X^d. And I'm going to stack the observations into a matrix, OK? So once I have those guys, I'm going to have a matrix -- but here, I'm going to use the double bar -- whose rows are X_1 transpose down to X_n transpose. So what it means is that the coordinates of this guy are X_1^1 through X_1^d in the first row, down to X_n^1 through X_n^d in the last row. And the entry in the i-th row and j-th column of the matrix is X_i^j. OK, so the rows are the observations, and the columns are the covariates or attributes. So this is an n by d matrix.

All right, this is really just some bookkeeping -- how do we store this data? And the fact that we use a matrix, just like for regression, is going to be convenient, because we're going to be able to talk about projections -- going to be able to talk about things like this.

All right, so everything I'm going to say now is about variances or covariances of those things, which means that I need two moments, OK? If the variance does not exist, there's nothing I can say about this problem. So I'm going to assume that the variance exists. And one way to put it is to say that the expected squared two-norm of those guys is finite, which is another way to say that each coordinate has a finite second moment. You can think of it the way you want.

All right, so now, the mean of X: I have a random vector, so I can talk about the expectation of X. That's a vector in R^d, and it's just obtained by taking the expectation entrywise: E[X] = (E[X^1], ..., E[X^d]).
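Here is a minimal numpy sketch of this bookkeeping. The data, the sizes n and d, and the Gaussian distribution are all made up for illustration; nothing in the lecture pins them down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                 # illustrative sizes: n observations in dimension d

# The data matrix (the double-bar X): one observation per row, so it is n x d.
X = rng.normal(size=(n, d))   # a made-up cloud of n points in R^d

# Row i is the observation X_i; the entry X[i, j] is X_i^j, the j-th
# coordinate (covariate) of the i-th observation.
x_1 = X[0]                    # X_1, a vector in R^d

# The expectation E[X] is taken entrywise; its empirical counterpart:
x_bar = X.mean(axis=0)        # one average per coordinate, a vector in R^d
```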
One thing I should say out loud: for the purposes of this class, I will denote by subscripts the indices that correspond to observations, and by superscripts the indices that correspond to coordinates of a variable. And I think that's the same convention that we took for the regression case. Of course, you could use whatever you want -- if you want to put commas, et cetera, it just becomes a bit more complicated.

All right, and so now, once I have this, this tells me where my cloud of points is centered, right? So now I have a distribution on R^d -- I'll talk about this more when we talk about the empirical version. But if you think that you have, say, a two-dimensional Gaussian random variable, then you have a center in two dimensions, which is where it peaks, basically. And that's what we're talking about here. But the other thing we want to know is how much it spreads in every direction, right? In every direction of the two-dimensional thing, I can try to understand how much spread I'm getting. And the way you measure this is by using covariance.

So the covariance matrix, sigma -- that's a matrix which is d by d. And in its (j, k)-th entry, it records the covariance between the j-th coordinate of X and the k-th coordinate of X, OK? So sigma is the matrix with entries sigma_11 through sigma_dd down the diagonal, sigma_1d and sigma_d1 in the corners, and in general sigma_jk in position (j, k), where sigma_jk is just the covariance between X^j, the j-th coordinate, and X^k, the k-th coordinate. OK? So in particular, it's symmetric, because the covariance between X^j and X^k is the same as the covariance between X^k and X^j. It's just the covariance matrix -- something that records everything.
And so what's nice about the covariance matrix is that if I actually give you X as a vector, you can build the matrix just by looking at vectors times vectors transpose, rather than building it coordinate by coordinate. So for example, if you're used to using MATLAB, that's the way you want to build a covariance matrix, because MATLAB is good at manipulating vectors and matrices, rather than having you enter it entry by entry.

OK, so what is the covariance between X^j and X^k? Well, by definition, it's

sigma_jk = E[X^j X^k] - E[X^j] E[X^k],

right? That's the definition of the covariance; I hope everybody's seen that. And so, in particular, I can see that sigma can be written as

sigma = E[X X^T] - E[X] E[X]^T.

Why? Well, let's look at the (j, k)-th coefficient of this guy. If I look at the (j, k)-th coefficient of the first term, I see the (j, k)-th entry of E[X X^T], which is equal to E[(X X^T)_jk]. And what are the entries of X X^T? Well, they're of the form X^j times X^k exactly. So this is equal to E[X^j X^k].

Is that clear -- that when I have a rank-1 matrix of this form, X X^T, the entries are of this form? Because if I take, for example, the vector (x_1, x_2, x_3) and I multiply it by (x_1, x_2, x_3) transpose, the entries I'm getting are

x_1 x_1, x_1 x_2, x_1 x_3; x_2 x_1, x_2 x_2, x_2 x_3; x_3 x_1, x_3 x_2, x_3 x_3, OK?

So indeed, this is exactly of that form: if you look at entry (j, k), you get exactly x_j times x_k, OK? So that's the beauty of those matrices.
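A quick numpy illustration of this outer-product view; the vector and the sample data below are made up. np.outer builds exactly the rank-1 matrix x x^T described above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])       # think (x_1, x_2, x_3)
xxT = np.outer(x, x)                # the rank-1 matrix x x^T
print(xxT[0, 2], x[0] * x[2])       # entry (j, k) is x_j * x_k: 3.0 3.0

# The vectorized, MATLAB-style way to build a covariance estimate:
# average outer products instead of looping over the entries sigma_jk.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # made-up data, one observation per row
x_bar = X.mean(axis=0)
Sigma_hat = X.T @ X / len(X) - np.outer(x_bar, x_bar)
```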
So now, once I have this, I can do exactly the same thing for the second term, except that if I take the (j, k)-th entry, I will get exactly the same kind of thing, except that it's not going to be the expectation of the product, but the product of the expectations, right? So I get that the (j, k)-th entry of E[X] E[X]^T is just the j-th entry of E[X] times the k-th entry of E[X].

So if I put those two together, it's telling me that if I look at the (j, k)-th entry of sigma, which I called little sigma_jk, then this is equal to the first term minus the second term:

sigma_jk = E[X^j X^k] - E[X^j] E[X^k].

Oh, by the way, I forgot to say: the j-th entry of E[X] is equal to E[X^j], because that's just the definition of the expectation of a random vector -- my j and my k are now inside the expectation. And that's by definition the covariance between X^j and X^k, OK?

So, if you've seen those manipulations with vectors, hopefully you're bored out of your mind. And if you have not, then that's something you just need to get comfortable with, right? One thing that's going to be useful is to know very quickly what the entries of the outer product of a vector with itself are -- the vector times the vector transpose. And that's what we've been using on this second set of boards.

OK, so everybody agrees now that we've sort of shown that the covariance matrix can be written in this vector form: the expectation of X X^T, minus the expectation of X times the expectation of X transpose.

OK, and just like the covariance can be written in two ways -- right, we know that the covariance can also be written as

sigma_jk = E[(X^j - E[X^j])(X^k - E[X^k])].

Sometimes this is taken as the original definition of the covariance, and the other one is the second definition.
Just like you have the variance, which is either the expectation of the square of X minus E[X], or the expectation of X squared minus the square of the expectation of X -- it's the same thing for the covariance. And you can actually see this in terms of vectors: this implies that you can also rewrite sigma as

sigma = E[(X - E[X])(X - E[X])^T].

And the reason is that if you just distribute those guys, this is

E[X X^T] - E[X E[X]^T] - E[E[X] X^T] + E[X] E[X]^T.

Now, things could go wrong, because the main difference between matrices slash vectors and numbers is that multiplication does not commute, right? So in particular, X E[X]^T and E[X] X^T are not the same thing. That's the main difference with what we had before, but it actually does not matter for our problem, because when I take the expectation of the one guy, it's actually the same as the expectation of the other guy. And so, just because the expectation is linear, sigma becomes

E[X X^T] - E[X] E[X]^T - E[X] E[X]^T + E[X] E[X]^T.

And those last three terms are all the same matrix, E[X] E[X]^T -- two with a minus sign and one with a plus -- just because the expectation of X transpose is the same as the transpose of the expectation of X. So two of them cancel, and what I'm left with is just

sigma = E[X X^T] - E[X] E[X]^T, OK?

So the same thing that happens when you want to prove that you can write the variance either this way or that way happens for matrices, or for vectors -- for the covariance matrix. They go together.
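Here is a small numeric check of these two expressions for sigma, using a made-up discrete distribution on three points, so the expectations below are exact weighted sums rather than sample averages.

```python
import numpy as np

# A made-up distribution: X takes one of three values in R^2 with these probabilities.
pts = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 3.0]])
p = np.array([0.2, 0.5, 0.3])

mu = p @ pts                                       # E[X], a vector in R^2
# Form 1: E[X X^T] - E[X] E[X]^T
sigma1 = sum(pi * np.outer(x, x) for pi, x in zip(p, pts)) - np.outer(mu, mu)
# Form 2: E[(X - E[X])(X - E[X])^T]
sigma2 = sum(pi * np.outer(x - mu, x - mu) for pi, x in zip(p, pts))
assert np.allclose(sigma1, sigma2)                 # the two forms agree
```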
Are there any questions so far? And if you have some, please tell me, because I don't know to what extent you guys are comfortable with this at all or not.

OK, so let's move on. All right, so of course, what I'm describing here is in terms of the distribution. I took expectations, and covariances are also expectations, so those depend on the distribution of X, right? If I wanted to compute them, I would basically need to know what the distribution of X is. Now, we're doing statistics, so my question is going to be: how well can I estimate the covariance matrix itself, or some properties of this covariance matrix, based on data?

All right, so if I want to understand what my covariance matrix looks like based on data, I'm going to have to form its empirical counterpart, which I can do by the age-old statistical trick: replace your expectation by an average. Everything that's on the board -- wherever you see an expectation, just replace it by an average.

OK, so now I'm given X_1, ..., X_n, and I'm going to define the empirical mean. Really, the idea is: take your expectation and replace it by 1 over n times a sum, right? And so the empirical mean is just 1 over n times the sum of the X_i's -- I'm guessing everybody knows how to average vectors; it's just the average coordinate by coordinate. I will write this as X bar.

And the empirical covariance matrix, often called the sample covariance matrix -- hence the notation S -- well, this is my covariance matrix; let's just replace the expectations by averages:

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T,

where I replaced the expectation of X by the average, which I just called X bar, OK? And that's using the first form of the covariance, but I could actually do exactly the same thing using the other definition.
Using that other definition, this is

S = (1/n) sum_{i=1}^n (X_i - X bar)(X_i - X bar)^T.

And those are actually -- I mean, in a way, it looks like I could define two different estimators, but you can check, and I do encourage you to do this if you're not comfortable making those manipulations, that those two things are actually exactly the same, OK?

So now, I'm going to want to talk about matrices, OK? And remember, we defined this big matrix X, with the double bar. And the question is: can I express both X bar and the sample covariance matrix in terms of this big matrix X? Because right now, they're still expressed in terms of the vectors -- I'm summing vectors times vectors transpose. The question is, can I do that in a very compact way, in a way that I can actually remove this sum, all right? That's going to be the goal. And it's not just a notational goal -- it's really something that's going to be convenient for us, just like it was convenient to talk about matrices when we did linear regression.

OK, X bar. We just said it's (1/n) sum_{i=1}^n X_i, right? Now remember, what does this matrix look like? If I look at X transpose, its columns become X_1, my first observation, X_2, my second observation, all the way to X_n, my last observation, right? Agreed? That's what X transpose is. So if I want to sum those guys, I can multiply by the all-ones vector -- that's the definition of the all-ones vector; it's just a bunch of 1's, in R^n in this case. And so when I do X transpose times 1, what I get is just sum_{i=1}^n X_i. So if I divide by n, I get my average:

X bar = (1/n) X^T 1, OK?

So here, I definitely removed the sum.
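A short numpy check of both claims -- the matrix form of the mean, and the equality of the two candidate estimators -- on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))              # made-up data, one observation per row

ones = np.ones(n)
x_bar = X.T @ ones / n                   # X bar = (1/n) X^T 1, no explicit sum
assert np.allclose(x_bar, X.mean(axis=0))

# The two expressions for the sample covariance matrix agree:
S1 = sum(np.outer(xi, xi) for xi in X) / n - np.outer(x_bar, x_bar)
S2 = sum(np.outer(xi - x_bar, xi - x_bar) for xi in X) / n
assert np.allclose(S1, S2)
```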
Let's see if we can do the same with the covariance matrix. Well, that's actually a little more difficult to see, I guess. But let's use the first definition for S, OK? So, let's think for one second: it's going to be something that involves X multiplying itself, OK? And the question is: is it going to be X times X transpose, or X transpose times X? To answer this question, you can go the easy route, which says: well, my covariance matrix is of what size? What is the size of S?

AUDIENCE: d by d.

PHILIPPE RIGOLLET: d by d, OK? X is of size n by d. So if I do X times X transpose, I'm going to have something which is of size n by n. If I do X transpose X, I'm going to have something which is d by d. That's the easy route, and it picks out one of the two guys. But you can actually open the box a little bit and see what's going on in there. If you do X transpose X, which we know gives you a d by d matrix, you'll see that X transpose has columns that are of the form X_i, and X has rows that are of the form X_i transpose, right? And so this is actually probably the right way to go.

So let's look at what X transpose X gives us. I claim that it's actually going to give us what we want. But rather than going there directly -- I mean, we could check it entry by entry, but there's actually a nice thing we can do. Before we go there, let's write X transpose as the following sum of matrices: first, X_1 in the first column and just a bunch of 0's everywhere else -- so it's still d by n; n minus 1 of the columns are equal to 0 -- then, a matrix with a 0 column, then X_2, and then just a bunch of 0's, right? And so on, all the way to a last matrix of 0's with X_n in the last column, OK? Everybody agrees with this? See what I'm doing here? I'm just splitting it into a sum of matrices that each have only one nonzero column.
But clearly, that's true. Now let's look at the product of this guy with itself. So, let's call these matrices M_1, M_2, ..., M_n. So when I do X transpose X, what I do is the sum of the M_i's, for i equal 1 to n, times the transpose of that sum, right? Now, the transpose of the sum of the M_i's is just the sum of each of the M_i's transposed, OK? So now I just have a product of two sums, so I'm just going to re-index the second one by j:

X^T X = sum_{i=1}^n sum_{j=1}^n M_i M_j^T. OK?

And now what we want to notice is: if i is different from j, what's happening? Well, if i is different from j, let's look at, say, M_1 times M_2 transpose. So what is the product of those two matrices?

AUDIENCE: It's like a dot product with the transpose.

PHILIPPE RIGOLLET: A dot product is just giving you a number, right? This is going to be a matrix -- it's the product of two matrices, a matrix times a matrix, so it should be a matrix of size d by d.

Yeah, I should see a lot of hands that look like a zero, right? Because look at this. The only nonzero column of M_1 is its first column, so in the product it only ever multiplies the first row of M_2 transpose -- and that row is all 0's, because X_2 sits in the second row. So every time, this is going to give you 0, and it's going to be the same for every single entry. So this matrix is just full of 0's, right? They never hit each other when I do the matrix-matrix multiplication -- every nonzero hits a 0. And this, of course, you can check for every i different from j. So this means that

M_i M_j^T = 0 when i is different from j, right?

Everybody is OK with this?
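A tiny numpy illustration of these one-nonzero-column matrices; the sizes and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))

def M(i):
    """M_i: d x n, with the observation X_i in column i and zeros elsewhere."""
    out = np.zeros((d, n))
    out[:, i] = X[i]
    return out

assert np.allclose(sum(M(i) for i in range(n)), X.T)      # the decomposition of X^T
assert np.allclose(M(0) @ M(1).T, 0)                      # cross terms vanish
assert np.allclose(M(0) @ M(0).T, np.outer(X[0], X[0]))   # diagonal terms survive
```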
So what that means is that when I do this double sum, really, it's a simple sum. It's just the sum from i equal 1 to n of M_i M_i^T, because those are the only terms in this double sum that are not going to be 0 -- M_1 only survives against M_1 itself. Now, let's see what's going on when I do M_i times M_i transpose. Well, now the X_i in M_i lines up exactly with the X_i transpose in M_i transpose, and so I really have X_i times X_i transpose. So this is really just

X^T X = sum_{i=1}^n X_i X_i^T,

just because M_i M_i^T is X_i X_i^T. There's nothing else there.

So that's the good news, right? The first term of S is really just X transpose X divided by n.

OK, so let me rewrite S. That's the definition we have:

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T.

And we know that the first guy is equal to (1/n) X^T X. For the second, we just proved that little X bar is equal to (1/n) X^T times the all-ones vector, so I'm just going to substitute that. I'm going to pull out my two 1 over n's -- one from this guy, one from that guy -- so I'm going to get 1 over n squared, and then X^T 1 times (X^T 1) transpose. And (X^T 1) transpose -- right, the rule: if I have (AB) transpose, it's B transpose times A transpose; that's just the rule of transposition -- is 1^T times X transpose transpose. And so when I put all these guys together, this is

S = (1/n) X^T X - (1/n^2) X^T 1 1^T X,

because X transpose, transposed, is X, OK?
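The same identities, checked numerically on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))          # made-up data
ones = np.ones(n)

# X^T X collapses the sum of outer products into one matrix product:
assert np.allclose(X.T @ X, sum(np.outer(xi, xi) for xi in X))

# The sum-free form of the sample covariance matrix:
x_bar = X.T @ ones / n
S_sum = sum(np.outer(xi, xi) for xi in X) / n - np.outer(x_bar, x_bar)
S_mat = X.T @ X / n - np.outer(X.T @ ones, X.T @ ones) / n**2
assert np.allclose(S_sum, S_mat)
```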
So now I have something which is of the form X transpose something X -- in both terms, to the left, an X transpose; to the right, an X. So I can factor out whatever's in there:

S = X^T ( (1/n) I_n - (1/n^2) 1 1^T ) X.

Because if you distribute it back, right -- here, I'm going to get X transpose times the identity times X, the whole thing divided by n; that's the first term. And then the second one: I'm going to get 1 over n squared times X transpose 1 1 transpose X, which is the other term, OK? So, the way it's written on the slides, I factored out one of the 1 over n's:

S = (1/n) X^T ( I_n - (1/n) 1 1^T ) X.

So that's just what's on the slides.

What does the matrix 1 1^T look like?

AUDIENCE: All 1's.

PHILIPPE RIGOLLET: It's just all 1's, right? Because its entries are the products of the coordinates of the all-ones vector with the coordinates of the all-ones vector, so I only get 1's. So it's an n by n matrix with only 1's.

So this matrix, I can actually write exactly, right? H -- this matrix that I call H, which is what's sandwiched in between the X transpose and the X:

H = I_n - (1/n) 1 1^T.

By definition -- this is the definition of H. And this thing, I can write its coordinates exactly. It's the identity -- the matrix with 1's on the diagonal and 0's elsewhere -- minus a matrix that has 1 over n everywhere.
OK, so the whole thing has 1 minus 1 over n on the diagonal, and minus 1 over n everywhere off the diagonal, OK? And now I claim that this matrix is an orthogonal projector. Now, writing the coordinates out like this is completely useless -- it's just a way for you to see that it's actually very convenient to think about this as a matrix problem, because things are much nicer when you work with the structured form of your matrices, right? I mean, imagine you're sitting at a midterm, and I say: here's the matrix that has 1 minus 1 over n on the diagonal and minus 1 over n off the diagonal; prove to me that it's a projection matrix. If you basically take this guy times itself entry by entry, it's going to be really complicated, right? So -- we know it's symmetric; that's for sure. But the fact that it can be written as I_n minus (1/n) 1 1^T is going to make my life super easy to check this.

That's the definition of a projector: it has to be symmetric, and it has to square to itself, because we just said in the chapter on linear regression that once you project, if you apply the projection again, you're not moving -- you're already there.

OK, so why is H squared equal to H? Well, let's just write H squared:

H^2 = ( I - (1/n) 1 1^T )( I - (1/n) 1 1^T ).

Let's just expand this. The identity times (1/n) 1 1^T is just (1/n) 1 1^T, so I get the identity, minus (1/n) 1 1^T, minus (1/n) 1 1^T, and then -- here's what makes the deal -- a plus term with a 1 over n squared this time, and the product 1 1^T times 1 1^T. But this thing in the middle, 1^T 1 -- what is this? It's n, right? It's the inner product of the all-ones vector with the all-ones vector: I'm just summing n times 1 squared, which is n. So this is equal to n; I pull it out, it cancels one of the n's, and I'm back to what I had before:

H^2 = I - (2/n) 1 1^T + (1/n) 1 1^T = I - (1/n) 1 1^T = H,

because one of the 1 over n's cancels, OK?
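A two-line check of the projector property (n here is an arbitrary made-up size):

```python
import numpy as np

n = 6
H = np.eye(n) - np.ones((n, n)) / n   # H = I_n - (1/n) 1 1^T

assert np.allclose(H, H.T)            # symmetric
assert np.allclose(H @ H, H)          # squares to itself, so H is a projector
```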
So it's a projection matrix. It's projecting onto some linear space, right? It's taking a vector and projecting it onto a certain space of vectors. What is this space? Right -- so I'm only asking for the answer to this question in words: how would you describe the vectors onto which this matrix projects? Well, if you want to answer this question, the way you would tackle it is first by asking: what does a vector of the form H times something look like? What can I say about this vector that definitely tells me something about the space onto which it projects? I would need to know a little more to conclude that it projects exactly onto that space, but one way to start is just to see how H acts on a vector -- what applying H does to a vector.

So I take a v, and let's see what taking v and applying H to it looks like. Well, H is the identity minus something, so it takes v and removes something from v. What does it remove? Well, it's

Hv = v - (1/n) (v^T 1) 1,

right? Agreed? I just wrote v transpose 1 instead of 1 transpose v, which are the same thing. What is this quantity, (1/n) v^T 1? What should I call it in mathematical notation? v bar, right? I should call it v bar, because this is exactly the average of the entries of v, agreed? This sums the entries of v, and this divides by the number of entries.
Sorry -- v is in which space here? Let me check what my dimensions are.

AUDIENCE: v has to be in R^n.

PHILIPPE RIGOLLET: Oh, yeah. OK, thank you. So everywhere I wrote d, that was actually n. The thing that I can sandwich between X transpose and X has to be n by n -- what matters is the inner dimension of the product, not the outer one. So H is n by n, the all-ones vector is in R^n, and v is in R^n. And actually, I already used the fact that the all-ones vector is of size n when I said 1^T 1 equals n. Sorry about that. OK, and so that's indeed v bar, the average of the n entries of v.

So what is this projection doing to a vector? It's removing its average from each coordinate, right? And the effect of this is -- v is a vector; what is the average of the entries of Hv?

AUDIENCE: 0.

PHILIPPE RIGOLLET: Right, so it's 0. It's the average of v, which is v bar, minus the average of a vector whose every entry is v bar, which is v bar. So this thing is actually 0.

So let me repeat my question: onto what subspace does H project? Onto the subspace of vectors that have mean 0.
And if you want to talk more linear algebra: for a vector, having mean 0 means that v is orthogonal to the span of the all-ones vector. That's it -- H projects onto this space. So in words, it projects onto the space of vectors that have 0 mean. In linear algebra, it projects onto the hyperplane which is orthogonal to the all-ones vector, OK? So that's all.

Can you guys still see the screen? Are you good over there? OK.

All right, so now, what this means is that -- well, I'm doing this weird thing, right? S is taking X and then removing the mean from each of the columns of X: when I take H times X, I'm basically applying this projection, which consists of removing the mean of each of the columns. And then I multiply by X transpose on the left. But what's actually nice is that, remember, H is a projector, which means that when I look at X^T H X, it's the same as looking at X^T H^2 X. But since H is equal to its transpose, this is actually the same as looking at X^T H^T H X, which is the same as looking at (HX)^T (HX), OK? So what it's doing is: it first applies this projection matrix H, which removes the mean of each of your columns, and then looks at the inner products between those centered columns, right? Each entry of this guy is just the covariance between those centered things. That's all it's doing.

All right, so those are actually going to be the key statements. So everything we've done so far is really mainly linear algebra, right? I mean, looking at expectations and covariances, we just used the fact that the expectation is linear. We didn't do much.
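Here is a quick numpy sanity check of this centering picture, again on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d)) + 3.0      # made-up data with a nonzero mean

H = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(H @ np.ones(n), 0)  # H kills the all-ones vector

Xc = H @ X                             # HX: every column now has mean 0
assert np.allclose(Xc.mean(axis=0), 0)

# S = (1/n) X^T H X = (1/n) (HX)^T (HX), matching numpy's 1/n-normalized covariance:
S = X.T @ H @ X / n
assert np.allclose(S, Xc.T @ Xc / n)
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```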
But now there's a nice thing that's happening, and that's why we're going to switch from the language of linear algebra to something more statistical. What's happening is: let's look at this quadratic form. So I take sigma, and I take a vector, u, let's say in R^d, and I'm going to look at u^T sigma u, OK? What is this doing? Well, we know that sigma is the expectation of X X^T minus the expectation of X times the expectation of X transpose, right? So I just substitute in there. Now, u is deterministic, so in particular, I can push it inside the expectation here, agreed? And I can do the same from the right. So when I push u transpose in on the left and u in on the right, what I'm left with is the expectation of (u^T X)(X^T u). And I can do the same thing for the second guy, and this tells me that it's the expectation of u^T X times the expectation of X^T u. Of course, u^T X is equal to X^T u. So what it means is that this is actually equal to

u^T sigma u = E[(u^T X)^2] - (E[u^T X])^2.

But this is something that should look familiar. This is really just the variance of this particular random variable, u^T X, right? u transpose X is a number; it involves a random vector, so it's a random variable. And so it has a variance, and this variance is exactly given by this formula. So what we've proved is that

u^T sigma u = Var(u^T X), OK?

I can do the same thing for the sample variance. So let's do this. And, as you can see -- spoiler alert -- this is going to be the sample variance. OK, so remember,

S = (1/n) sum_{i=1}^n X_i X_i^T - X bar (X bar)^T.
800 00:48:12,100 --> 00:48:16,060 So when I do u transpose, Su, what 801 00:48:16,060 --> 00:48:19,400 it gives me is 1 over n sum from i equal 1 802 00:48:19,400 --> 00:48:25,780 to n of u transpose Xi times Xi transpose u, all right? 803 00:48:25,780 --> 00:48:27,880 So those are two numbers that multiply each other 804 00:48:27,880 --> 00:48:30,370 and that happen to be equal to each other, 805 00:48:30,370 --> 00:48:36,430 minus u transpose X bar X bar transpose u, 806 00:48:36,430 --> 00:48:38,770 which is also the product of two numbers that happen 807 00:48:38,770 --> 00:48:39,997 to be equal to each other. 808 00:48:39,997 --> 00:48:41,455 So I can rewrite this with squares. 809 00:48:55,120 --> 00:48:57,390 So we're almost there. 810 00:48:57,390 --> 00:49:00,360 All I need to check is that this thing is actually 811 00:49:00,360 --> 00:49:02,010 the average of those guys, right? 812 00:49:02,010 --> 00:49:04,530 So u transpose X bar. 813 00:49:04,530 --> 00:49:05,030 What is it? 814 00:49:05,030 --> 00:49:10,980 It's 1 over n sum from i equal 1 to n of u transpose Xi. 815 00:49:10,980 --> 00:49:17,050 So it's really something that I can write as u transpose X bar, 816 00:49:17,050 --> 00:49:17,550 right? 817 00:49:17,550 --> 00:49:19,383 That's the average of those random variables 818 00:49:19,383 --> 00:49:21,240 of the form, u transpose Xi. 819 00:49:23,880 --> 00:49:29,910 So what it means is that u transpose Su, I can write as 1 820 00:49:29,910 --> 00:49:38,060 over n sum from i equal 1 to n of u transpose Xi squared 821 00:49:38,060 --> 00:49:46,720 minus u transpose X bar squared, which 822 00:49:46,720 --> 00:49:51,660 is the empirical variance that we denoted by small 823 00:49:51,660 --> 00:49:54,600 s squared, right? 824 00:49:54,600 --> 00:50:06,850 So that's the empirical variance of u transpose X1 all the way 825 00:50:06,850 --> 00:50:08,209 to u transpose Xn. 826 00:50:12,430 --> 00:50:13,910 OK, and here, same thing. 827 00:50:13,910 --> 00:50:15,210 I use exactly the same thing. 828 00:50:15,210 --> 00:50:17,990 The only thing I use here is really 829 00:50:17,990 --> 00:50:20,790 the linearity of this guy, of the 1 over n sum, 830 00:50:20,790 --> 00:50:24,020 or the linearity of expectation-- that I can push things 831 00:50:24,020 --> 00:50:26,740 in there, OK? 832 00:50:30,224 --> 00:50:31,640 AUDIENCE: So what have you written 833 00:50:31,640 --> 00:50:33,844 at the end of that sum for uT Su? 834 00:50:33,844 --> 00:50:35,010 PHILIPPE RIGOLLET: This one? 835 00:50:35,010 --> 00:50:35,380 AUDIENCE: Yeah. 836 00:50:35,380 --> 00:50:37,290 PHILIPPE RIGOLLET: Yeah, I said it's equal to small s, 837 00:50:37,290 --> 00:50:39,430 and I want to make a distinction with the big S 838 00:50:39,430 --> 00:50:40,660 that I'm using here. 839 00:50:40,660 --> 00:50:42,650 So this is equal to small-- 840 00:50:42,650 --> 00:50:45,190 I don't know, I'm trying to make it look 841 00:50:45,190 --> 00:50:47,550 like a calligraphic s squared.
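The sample version can be checked exactly, not just by simulation. A small sketch on made-up data: it verifies that u transpose Su equals the empirical variance of the numbers u transpose X1, ..., u transpose Xn.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 500, 4
    X = rng.normal(size=(n, d))
    u = rng.normal(size=d)

    S = X.T @ X / n - np.outer(X.mean(0), X.mean(0))   # S = (1/n) sum Xi Xi^T - Xbar Xbar^T
    proj = X @ u                                       # the numbers u^T X_1, ..., u^T X_n
    assert np.allclose(u @ S @ u, proj.var())          # u^T S u is their empirical variance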
And the coordinates, well, 849 00:51:19,140 --> 00:51:22,530 by definition, are attached to a coordinate system. 850 00:51:22,530 --> 00:51:25,830 So I only know what the covariance 851 00:51:25,830 --> 00:51:30,570 of X along those two things is, or the covariance of those two 852 00:51:30,570 --> 00:51:31,320 coordinates is. 853 00:51:31,320 --> 00:51:33,570 But what if I want to find coordinates between linear 854 00:51:33,570 --> 00:51:35,076 combinations of the X's? 855 00:51:35,076 --> 00:51:37,200 Sorry, if I want to find covariances between linear 856 00:51:37,200 --> 00:51:38,566 combinations of those X's. 857 00:51:38,566 --> 00:51:40,440 And that's exactly what this allows me to do. 858 00:51:40,440 --> 00:51:44,640 It says, well, if I pre- and post-multiply by u, 859 00:51:44,640 --> 00:51:47,010 this is actually telling me what the variance 860 00:51:47,010 --> 00:51:51,950 of X along direction u is, OK? 861 00:51:51,950 --> 00:51:53,944 So there's a lot of information in there, 862 00:51:53,944 --> 00:51:55,610 and it's just really exploiting the fact 863 00:51:55,610 --> 00:52:00,600 that there is some linearity going on in the covariance. 864 00:52:00,600 --> 00:52:02,060 So, why variance? 865 00:52:02,060 --> 00:52:03,870 Why is variance interesting for us, right? 866 00:52:03,870 --> 00:52:04,370 Why? 867 00:52:04,370 --> 00:52:05,760 I started by saying, here, we're going 868 00:52:05,760 --> 00:52:07,050 to be interested in having some way 869 00:52:07,050 --> 00:52:08,151 to do dimension reduction. 870 00:52:08,151 --> 00:52:10,650 We have-- think of your points as being in a dimension 871 00:52:10,650 --> 00:52:13,990 larger than 4, and we're going to try to reduce the dimension. 872 00:52:13,990 --> 00:52:15,480 So let's just think for one second, 873 00:52:15,480 --> 00:52:19,320 what do we want about a dimension reduction procedure? 874 00:52:19,320 --> 00:52:23,427 If I have all my points that live in, say, three dimensions, 875 00:52:23,427 --> 00:52:25,260 and I have one point here and one point here 876 00:52:25,260 --> 00:52:28,020 and one point here and one point here and one point here, 877 00:52:28,020 --> 00:52:30,090 and I decide to project them onto some plane-- 878 00:52:30,090 --> 00:52:32,132 say I take a plane that's just like this-- what's 879 00:52:32,132 --> 00:52:34,673 going to happen is that those points are all going to project 880 00:52:34,673 --> 00:52:36,030 to the same point, right? 881 00:52:36,030 --> 00:52:38,070 I'm just going to not see anything. 882 00:52:38,070 --> 00:52:40,410 However, if I take a plane which is like this, 883 00:52:40,410 --> 00:52:42,932 they're all going to project into some nice line. 884 00:52:42,932 --> 00:52:44,640 Maybe I can even project them onto a line 885 00:52:44,640 --> 00:52:47,160 and they will still be far apart from each other. 886 00:52:47,160 --> 00:52:48,160 So that's what you want. 887 00:52:48,160 --> 00:52:51,930 You want to be able to say, when I take my points 888 00:52:51,930 --> 00:52:54,610 and I say I project them onto lower dimensions, 889 00:52:54,610 --> 00:52:57,270 I do not want them to collapse into one single point. 890 00:52:57,270 --> 00:53:00,540 I want them to be as spread out as possible in the direction 891 00:53:00,540 --> 00:53:02,251 on which I project. 892 00:53:02,251 --> 00:53:04,000 And this is what we're going to try to do. 893 00:53:04,000 --> 00:53:06,510 And of course, measuring spread between points 894 00:53:06,510 --> 00:53:08,160 can be done in many ways, right?
895 00:53:08,160 --> 00:53:09,960 I mean, you could look at, I don't know, 896 00:53:09,960 --> 00:53:12,900 sum of pairwise distances between those guys. 897 00:53:12,900 --> 00:53:14,790 You could look at some sort of energy. 898 00:53:14,790 --> 00:53:16,380 You can look at many ways to measure 899 00:53:16,380 --> 00:53:18,199 spread in a direction. 900 00:53:18,199 --> 00:53:19,740 But variance is a good way to measure 901 00:53:19,740 --> 00:53:21,150 spread between points. 902 00:53:21,150 --> 00:53:23,727 If you have a lot of variance between your points, 903 00:53:23,727 --> 00:53:25,560 then chances are they're going to be spread. 904 00:53:25,560 --> 00:53:27,720 Now, this is not always the case, right? 905 00:53:27,720 --> 00:53:30,480 If I have a direction in which all my points are clumped 906 00:53:30,480 --> 00:53:33,234 onto one big point and one other big point, 907 00:53:33,234 --> 00:53:34,900 it's going to choose this because that's 908 00:53:34,900 --> 00:53:37,180 the direction that has a lot of variance. 909 00:53:37,180 --> 00:53:39,030 But hopefully, the variance is going 910 00:53:39,030 --> 00:53:41,560 to spread things out nicely. 911 00:53:41,560 --> 00:53:47,730 So the idea of principal component analysis 912 00:53:47,730 --> 00:53:51,330 is going to try to identify those variances-- 913 00:53:51,330 --> 00:53:55,740 those directions along which we have a lot of variance. 914 00:53:55,740 --> 00:53:57,870 Reciprocally, we're going to try to eliminate 915 00:53:57,870 --> 00:54:01,890 the directions along which we do not have a lot of variance, OK? 916 00:54:01,890 --> 00:54:02,640 And let's see why. 917 00:54:02,640 --> 00:54:08,130 Well, if-- so here's the first claim. 918 00:54:08,130 --> 00:54:14,000 If u transpose Su is equal to 0, what's happening? 919 00:54:14,000 --> 00:54:17,159 Well, I know that an empirical variance is equal to 0. 920 00:54:17,159 --> 00:54:18,950 What does it mean for an empirical variance 921 00:54:18,950 --> 00:54:22,056 to be equal to 0? 922 00:54:22,056 --> 00:54:23,680 So I give you a bunch of points, right? 923 00:54:23,680 --> 00:54:26,420 So those points are those points-- u transpose 924 00:54:26,420 --> 00:54:29,090 X1, u transpose-- those are a bunch of numbers. 925 00:54:29,090 --> 00:54:31,090 What does it mean to have the empirical variance 926 00:54:31,090 --> 00:54:33,279 of those points being equal to 0? 927 00:54:33,279 --> 00:54:34,570 AUDIENCE: They're all the same. 928 00:54:34,570 --> 00:54:36,590 PHILIPPE RIGOLLET: They're all the same. 929 00:54:36,590 --> 00:54:43,680 So what it means is that when I have my points, right? 930 00:54:43,680 --> 00:54:46,470 So, can you find a direction for those points in which they 931 00:54:46,470 --> 00:54:48,850 project to all the same point? 932 00:54:51,400 --> 00:54:52,360 No, right? 933 00:54:52,360 --> 00:54:53,590 There's no such thing. 934 00:54:53,590 --> 00:54:55,870 For this to happen, you have to have your points which 935 00:54:55,870 --> 00:54:57,849 are perfectly aligned. 936 00:54:57,849 --> 00:54:59,390 And then when you're going to project 937 00:54:59,390 --> 00:55:01,830 onto the orthogonal of this guy, they're 938 00:55:01,830 --> 00:55:03,690 going to all project to the same point 939 00:55:03,690 --> 00:55:06,450 here, which means that the empirical variance is 940 00:55:06,450 --> 00:55:08,790 going to be 0. 941 00:55:08,790 --> 00:55:10,270 Now, this is an extreme case.
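That extreme case is easy to reproduce numerically. In this made-up example, nine points sit exactly on a line through the origin, and their projections onto a direction orthogonal to that line have empirical variance 0 (up to floating-point round-off).

    import numpy as np

    t = np.linspace(-2.0, 2.0, 9)
    pts = np.outer(t, [3.0, 1.0])        # nine points, perfectly aligned along (3, 1)
    u = np.array([1.0, -3.0])
    u /= np.linalg.norm(u)               # a unit vector orthogonal to (3, 1)
    assert np.allclose((pts @ u).var(), 0.0)   # every point projects to the same spot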
942 00:55:10,270 --> 00:55:11,760 This will never happen in practice, 943 00:55:11,760 --> 00:55:13,840 because if that happens, well, I mean, 944 00:55:13,840 --> 00:55:16,850 you can basically figure that out very quickly. 945 00:55:16,850 --> 00:55:21,520 So in the same way, it's very unlikely 946 00:55:21,520 --> 00:55:23,710 that you're going to have u transpose sigma u, which 947 00:55:23,710 --> 00:55:26,230 is equal to 0, which means that, essentially, all 948 00:55:26,230 --> 00:55:28,510 your points are [INAUDIBLE] or let's say all of them 949 00:55:28,510 --> 00:55:30,069 are orthogonal to u, right? 950 00:55:30,069 --> 00:55:31,360 So it's exactly the same thing. 951 00:55:31,360 --> 00:55:33,330 It just says that in the population case, 952 00:55:33,330 --> 00:55:36,960 there's no probability that your points deviate from this guy 953 00:55:36,960 --> 00:55:37,510 here. 954 00:55:37,510 --> 00:55:41,142 This happens with zero probability, OK? 955 00:55:41,142 --> 00:55:42,600 And that's just because if you look 956 00:55:42,600 --> 00:55:46,690 at the variance of this guy, it's going to be 0. 957 00:55:46,690 --> 00:55:48,910 And then that means that there's no deviation. 958 00:55:48,910 --> 00:55:51,430 By the way, I'm using the name projection 959 00:55:51,430 --> 00:55:55,510 when I talk about u transpose X, right? 960 00:55:55,510 --> 00:55:59,170 So let's just be clear about this. 961 00:55:59,170 --> 00:56:04,090 If you-- so let's say I have a bunch of points, 962 00:56:04,090 --> 00:56:06,050 and u is a vector in this direction. 963 00:56:06,050 --> 00:56:08,650 And let's say that u has the-- 964 00:56:08,650 --> 00:56:10,120 so this is 0. 965 00:56:10,120 --> 00:56:10,720 This is u. 966 00:56:10,720 --> 00:56:17,560 And let's say that u has norm, 1, OK? 967 00:56:17,560 --> 00:56:21,140 When I look, what is the coordinate of the projection? 968 00:56:21,140 --> 00:56:23,860 So what is the length of this guy here? 969 00:56:23,860 --> 00:56:25,569 Let's call this guy X1. 970 00:56:25,569 --> 00:56:26,860 What is the length of this guy? 971 00:56:31,150 --> 00:56:32,330 In terms of inner products? 972 00:56:35,990 --> 00:56:39,678 This is exactly u transpose X1. 973 00:56:39,678 --> 00:56:42,730 This length here, if this is X2, this 974 00:56:42,730 --> 00:56:46,580 is exactly u transpose X2, OK? 975 00:56:46,580 --> 00:56:52,430 So those-- u transpose X measure exactly the distance 976 00:56:52,430 --> 00:56:55,700 to the origin of those-- 977 00:56:55,700 --> 00:56:58,310 I mean, it's really-- 978 00:56:58,310 --> 00:57:00,887 think of it as being just an x-axis thing. 979 00:57:00,887 --> 00:57:02,220 You just have a bunch of points. 980 00:57:02,220 --> 00:57:02,960 You have an origin. 981 00:57:02,960 --> 00:57:04,520 And it's really just telling you what 982 00:57:04,520 --> 00:57:07,670 the coordinate on this axis is going to be, right? 983 00:57:07,670 --> 00:57:10,820 So in particular, if the empirical variance is 0, 984 00:57:10,820 --> 00:57:12,470 it means that all these points project 985 00:57:12,470 --> 00:57:14,840 to the same point, which means that they have 986 00:57:14,840 --> 00:57:16,912 to be orthogonal to this guy. 987 00:57:16,912 --> 00:57:19,370 And you can think of it as being also maybe an entire plane 988 00:57:19,370 --> 00:57:23,990 that's orthogonal to this line, OK? 
989 00:57:23,990 --> 00:57:26,590 So that's why I talk about projection, 990 00:57:26,590 --> 00:57:29,560 because the inner product, u transpose X, 991 00:57:29,560 --> 00:57:36,220 is really measuring the coordinates of X 992 00:57:36,220 --> 00:57:39,410 when u becomes the x-axis. 993 00:57:39,410 --> 00:57:42,820 Now, if u does not have norm 1, then you just 994 00:57:42,820 --> 00:57:44,365 have a change of scale here. 995 00:57:44,365 --> 00:57:46,790 You just have a change of unit, right? 996 00:57:46,790 --> 00:57:51,560 So this is really u transpose X1. 997 00:57:51,560 --> 00:57:54,044 The coordinates should really be divided by the norm of u. 998 00:57:59,150 --> 00:58:04,970 OK, so now, just in the same way-- so 999 00:58:04,970 --> 00:58:07,160 we're never going to have exactly 0. 1000 00:58:07,160 --> 00:58:08,810 But if we look at the other end, 1001 00:58:08,810 --> 00:58:12,050 if u transpose Su is large, what does it mean? 1002 00:58:14,990 --> 00:58:17,740 It means that when I look at my points 1003 00:58:17,740 --> 00:58:22,194 as projected onto the axis generated by u, 1004 00:58:22,194 --> 00:58:23,860 they're going to have a lot of variance. 1005 00:58:23,860 --> 00:58:25,930 They're going to be far away from each other on average, 1006 00:58:25,930 --> 00:58:26,430 right? 1007 00:58:26,430 --> 00:58:28,900 That's what large variance means, or at least 1008 00:58:28,900 --> 00:58:31,310 large empirical variance means. 1009 00:58:31,310 --> 00:58:34,690 And same thing for u. 1010 00:58:34,690 --> 00:58:36,130 So what we're going to try to find 1011 00:58:36,130 --> 00:58:39,870 is a u that maximizes this. 1012 00:58:39,870 --> 00:58:42,230 If I can find a u that maximizes this 1013 00:58:42,230 --> 00:58:44,790 so I can look in every direction, 1014 00:58:44,790 --> 00:58:48,320 and suddenly I find a direction in which the spread is massive, 1015 00:58:48,320 --> 00:58:50,070 then that's a direction along which I'm basically 1016 00:58:50,070 --> 00:58:52,260 the least likely to have my points 1017 00:58:52,260 --> 00:58:54,824 project onto each other and collide, right? 1018 00:58:54,824 --> 00:58:56,490 At least I know they're going to project 1019 00:58:56,490 --> 00:58:59,710 at least onto two points. 1020 00:58:59,710 --> 00:59:02,290 So the idea now is to say, OK, let's try 1021 00:59:02,290 --> 00:59:04,630 to maximize this spread, right? 1022 00:59:04,630 --> 00:59:09,130 So we're going to try to find the maximum over all u's 1023 00:59:09,130 --> 00:59:12,886 of u transpose Su. 1024 00:59:12,886 --> 00:59:15,010 And that's going to be the direction that maximizes 1025 00:59:15,010 --> 00:59:15,968 the empirical variance. 1026 00:59:15,968 --> 00:59:22,075 Now of course, if I read it like that for all u's in Rd, 1027 00:59:22,075 --> 00:59:23,666 what is the value of this maximum? 1028 00:59:28,060 --> 00:59:29,220 It's infinity, right? 1029 00:59:29,220 --> 00:59:32,160 Because I can always multiply u by 10, 1030 00:59:32,160 --> 00:59:34,662 and this entire thing is going to be multiplied by 100. 1031 00:59:34,662 --> 00:59:36,620 So I'm just going to take u as large as I want, 1032 00:59:36,620 --> 00:59:38,661 and this thing is going to be as large as I want, 1033 00:59:38,661 --> 00:59:40,050 and so I need to constrain u. 1034 00:59:40,050 --> 00:59:42,840 And as I said, I need to have u of size 1 1035 00:59:42,840 --> 00:59:45,990 to talk about coordinates in the system generated 1036 00:59:45,990 --> 00:59:47,340 by u like this.
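A small numerical illustration of this projection picture, with made-up vectors: for a unit-norm u, the inner product u transpose x is the signed coordinate of x along u, and for a non-unit vector you divide by its norm to recover the coordinate.

    import numpy as np

    u = np.array([1.0, 1.0])
    u /= np.linalg.norm(u)               # a direction: norm exactly 1
    x = np.array([2.0, 0.0])

    coord = u @ x                        # u^T x: the coordinate of x along u
    print(coord)                         # sqrt(2), the signed length of the projection
    print(coord * u)                     # the projected point itself, on the line through u

    v = 10 * u                           # if the vector does not have norm 1 ...
    print((v @ x) / np.linalg.norm(v))   # ... divide by its norm to get the coordinate back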
1037 00:59:47,340 --> 00:59:50,730 So I'm just going to constrain u to have 1038 00:59:50,730 --> 00:59:55,467 Euclidean norm equal to 1, OK? 1039 00:59:55,467 --> 00:59:57,050 So that's going to be my goal-- trying 1040 00:59:57,050 --> 01:00:01,100 to find the largest possible u transpose Su, 1041 01:00:01,100 --> 01:00:03,680 or in other words, empirical variance of the points 1042 01:00:03,680 --> 01:00:07,520 projected onto the direction u when u is of norm 1, 1043 01:00:07,520 --> 01:00:11,039 which justifies using the word, "direction," 1044 01:00:11,039 --> 01:00:12,830 because there's no magnitude to this u. 1045 01:00:17,770 --> 01:00:22,410 OK, so how am I going to do this? 1046 01:00:22,410 --> 01:00:25,230 I could just fold and say, let's just optimize 1047 01:00:25,230 --> 01:00:26,700 this thing, right? 1048 01:00:26,700 --> 01:00:28,540 Let's just take this problem. 1049 01:00:28,540 --> 01:00:32,250 It says maximize a function under some constraints. 1050 01:00:32,250 --> 01:00:34,125 Immediately, the constraint is sort of nasty. 1051 01:00:34,125 --> 01:00:37,212 I'm on a sphere, and I'm trying to move points on the sphere. 1052 01:00:37,212 --> 01:00:38,670 And I'm maximizing this thing which 1053 01:00:38,670 --> 01:00:40,182 actually happens to be convex. 1054 01:00:40,182 --> 01:00:42,390 And we know how to minimize convex functions, 1055 01:00:42,390 --> 01:00:45,280 but maximizing them is a different question. 1056 01:00:45,280 --> 01:00:47,340 And so this problem might be super hard. 1057 01:00:47,340 --> 01:00:49,020 So I can just say, OK, here's what 1058 01:00:49,020 --> 01:00:52,950 I want to do, and let me give that to an optimizer 1059 01:00:52,950 --> 01:00:56,010 and just hope that the optimizer can solve this problem for me. 1060 01:00:56,010 --> 01:00:57,630 That's one thing we can do. 1061 01:00:57,630 --> 01:01:00,092 Now as you can imagine, PCA is so widespread, right? 1062 01:01:00,092 --> 01:01:01,800 Principal component analysis is something 1063 01:01:01,800 --> 01:01:03,700 that people do constantly. 1064 01:01:03,700 --> 01:01:06,190 And so that means that we know how to do this fast. 1065 01:01:06,190 --> 01:01:07,600 So that's one thing. 1066 01:01:07,600 --> 01:01:10,740 The other thing that you should probably question is why-- 1067 01:01:10,740 --> 01:01:13,110 if this thing is actually difficult, why in the world 1068 01:01:13,110 --> 01:01:16,200 would you even choose the variance as a measure of spread 1069 01:01:16,200 --> 01:01:19,020 if there are so many measures of spread, right? 1070 01:01:19,020 --> 01:01:21,222 The variance is one measure of spread. 1071 01:01:21,222 --> 01:01:22,680 It's not guaranteed that everything 1072 01:01:22,680 --> 01:01:26,366 is going to project nicely far apart from each other. 1073 01:01:26,366 --> 01:01:27,990 So we could choose the variance, but we 1074 01:01:27,990 --> 01:01:28,800 could choose something else. 1075 01:01:28,800 --> 01:01:30,990 If the variance does not help, why choose it? 1076 01:01:30,990 --> 01:01:32,520 Turns out the variance helps. 1077 01:01:32,520 --> 01:01:35,555 So this is indeed a non-convex problem. 1078 01:01:35,555 --> 01:01:38,340 I'm maximizing, so it's actually the same. 1079 01:01:38,340 --> 01:01:41,850 I can make this constraint convex 1080 01:01:41,850 --> 01:01:43,920 because I'm maximizing a convex function, 1081 01:01:43,920 --> 01:01:45,720 so it's clear that the maximum is going 1082 01:01:45,720 --> 01:01:47,220 to be attained at the boundary.
1083 01:01:47,220 --> 01:01:51,540 So I can actually just relax this sphere into some convex ball. 1084 01:01:51,540 --> 01:01:53,430 However, I'm still maximizing, so this 1085 01:01:53,430 --> 01:01:55,170 is a non-convex problem. 1086 01:01:55,170 --> 01:01:57,550 And this turns out to be the fanciest non-convex problem 1087 01:01:57,550 --> 01:01:59,001 we know how to solve. 1088 01:01:59,001 --> 01:02:00,750 And the reason why we know how to solve it 1089 01:02:00,750 --> 01:02:04,410 is not because of optimization or using gradient-type things 1090 01:02:04,410 --> 01:02:06,690 or any of the algorithms that I mentioned 1091 01:02:06,690 --> 01:02:09,350 during the maximum likelihood. 1092 01:02:09,350 --> 01:02:11,000 It's because of linear algebra. 1093 01:02:11,000 --> 01:02:13,980 Linear algebra guarantees that we know how to solve this. 1094 01:02:13,980 --> 01:02:17,885 And to understand this, we need to go a little deeper 1095 01:02:17,885 --> 01:02:22,360 into linear algebra, and we need to understand the concept 1096 01:02:22,360 --> 01:02:24,590 of diagonalization of a matrix. 1097 01:02:24,590 --> 01:02:29,850 So who has ever seen the concept of an eigenvalue? 1098 01:02:29,850 --> 01:02:30,790 Oh, that's beautiful. 1099 01:02:30,790 --> 01:02:31,880 And if you're not raising your hand, 1100 01:02:31,880 --> 01:02:33,588 you're just playing "Candy Crush," right? 1101 01:02:33,588 --> 01:02:35,930 All right, so, OK. 1102 01:02:44,930 --> 01:02:46,640 This is great. 1103 01:02:46,640 --> 01:02:48,160 Everybody's seen it. 1104 01:02:48,160 --> 01:02:51,230 For my live audience of millions, maybe you have not, 1105 01:02:51,230 --> 01:02:53,600 so I will still go through it. 1106 01:02:53,600 --> 01:02:58,840 All right, so one of the basic facts-- 1107 01:02:58,840 --> 01:03:02,490 and I remember when I learned this in-- 1108 01:03:02,490 --> 01:03:04,090 I mean, when I was an undergrad, I 1109 01:03:04,090 --> 01:03:05,860 learned about the spectral decomposition 1110 01:03:05,860 --> 01:03:07,450 and this diagonalization of matrices. 1111 01:03:07,450 --> 01:03:09,070 And for me, it was just a structural property 1112 01:03:09,070 --> 01:03:11,445 of matrices, but it turns out that it's extremely useful, 1113 01:03:11,445 --> 01:03:13,294 and it's useful for algorithmic purposes. 1114 01:03:13,294 --> 01:03:14,710 And so what this theorem tells you 1115 01:03:14,710 --> 01:03:16,765 is that if you take a symmetric matrix-- 1116 01:03:22,860 --> 01:03:24,340 well, with real entries, but that 1117 01:03:24,340 --> 01:03:28,220 really does not matter so much. 1118 01:03:28,220 --> 01:03:30,730 And here, I'm going to actually-- 1119 01:03:30,730 --> 01:03:33,200 so I take a symmetric matrix, and actually S and sigma 1120 01:03:33,200 --> 01:03:36,190 are two such symmetric matrices, right? 1121 01:03:36,190 --> 01:03:44,500 Then there exist P and D, which are both-- 1122 01:03:44,500 --> 01:03:47,000 so let's say d by d-- 1123 01:03:47,000 --> 01:03:55,960 which are both d by d, such that P is orthogonal. 1124 01:03:58,960 --> 01:04:02,420 That means that P transpose P is equal to PP transpose 1125 01:04:02,420 --> 01:04:06,360 is equal to the identity. 1126 01:04:06,360 --> 01:04:07,630 And D is diagonal. 1127 01:04:11,840 --> 01:04:20,130 And sigma, let's say, is equal to PDP transpose, OK? 1128 01:04:20,130 --> 01:04:22,080 So it's a diagonalization because it's 1129 01:04:22,080 --> 01:04:23,970 finding a nice transformation.
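The theorem is easy to see numerically. The sketch below, on an arbitrary made-up symmetric matrix, uses numpy's eigh (one standard routine for this decomposition) to produce an orthogonal P and a diagonal D with sigma = PDP transpose.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(5, 5))
    Sigma = (A + A.T) / 2                     # any real symmetric matrix

    lam, P = np.linalg.eigh(Sigma)            # eigenvalues (ascending) and orthogonal P
    D = np.diag(lam)
    assert np.allclose(P.T @ P, np.eye(5))    # P^T P = I: P is orthogonal
    assert np.allclose(P @ P.T, np.eye(5))    # P P^T = I as well
    assert np.allclose(Sigma, P @ D @ P.T)    # Sigma = P D P^T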
1130 01:04:23,970 --> 01:04:25,260 P has some nice properties. 1131 01:04:25,260 --> 01:04:28,050 It's really just a change of coordinates in which 1132 01:04:28,050 --> 01:04:31,044 your matrix is diagonal, right? 1133 01:04:31,044 --> 01:04:32,460 And the way you want to see this-- 1134 01:04:32,460 --> 01:04:35,610 and I think it sort of helps to think about this problem 1135 01:04:35,610 --> 01:04:36,720 as being-- 1136 01:04:36,720 --> 01:04:38,276 sigma being a covariance matrix. 1137 01:04:38,276 --> 01:04:39,900 What does a covariance matrix tell you? 1138 01:04:39,900 --> 01:04:41,490 Think of a multivariate Gaussian. 1139 01:04:41,490 --> 01:04:43,660 Can everybody visualize a three-dimensional Gaussian 1140 01:04:43,660 --> 01:04:45,150 density? 1141 01:04:45,150 --> 01:04:48,200 Right, so it's going to be some sort of a bell-shaped curve, 1142 01:04:48,200 --> 01:04:51,870 but it might be more elongated in one direction than another. 1143 01:04:51,870 --> 01:04:54,310 And then I'm going to chop it like that, all right? 1144 01:04:54,310 --> 01:04:56,120 So I'm going to chop it off. 1145 01:04:56,120 --> 01:05:00,070 And I'm going to look at how it bleeds, all right? 1146 01:05:00,070 --> 01:05:02,287 So I'm just going to look at where the blood is. 1147 01:05:02,287 --> 01:05:03,620 And what it's going to look like-- 1148 01:05:03,620 --> 01:05:08,720 it's going to look like some sort of ellipsoid, right? 1149 01:05:08,720 --> 01:05:11,652 In high dimension, it's just going to be an olive. 1150 01:05:11,652 --> 01:05:13,610 And that is just going to be bigger and bigger. 1151 01:05:13,610 --> 01:05:16,460 And then I chop it off a little lower, 1152 01:05:16,460 --> 01:05:20,150 and I get something a little bigger like this. 1153 01:05:20,150 --> 01:05:23,070 And so it turns out that sigma is capturing exactly this, 1154 01:05:23,070 --> 01:05:23,570 right? 1155 01:05:23,570 --> 01:05:27,320 The matrix sigma-- so the mean of your Gaussian 1156 01:05:27,320 --> 01:05:29,240 is going to be the center of this thing. 1157 01:05:29,240 --> 01:05:33,690 And sigma is going to tell you which direction it's elongated. 1158 01:05:33,690 --> 01:05:36,140 And so in particular, if you know an ellipse, 1159 01:05:36,140 --> 01:05:38,160 you know there's something called a principal axis, right? 1160 01:05:38,160 --> 01:05:39,743 So you could actually define something 1161 01:05:39,743 --> 01:05:43,190 that looks like this, which is this axis, the one along which 1162 01:05:43,190 --> 01:05:44,390 it's the most elongated. 1163 01:05:44,390 --> 01:05:47,345 Then the axis which is orthogonal to it, 1164 01:05:47,345 --> 01:05:49,370 along which it's slightly less elongated, 1165 01:05:49,370 --> 01:05:52,880 and you go again and again along the orthogonal ones. 1166 01:05:52,880 --> 01:05:56,500 It turns out that those things here 1167 01:05:56,500 --> 01:05:59,620 are the new coordinate system that this transformation, P 1168 01:05:59,620 --> 01:06:03,190 and P transpose, is putting you into. 1169 01:06:03,190 --> 01:06:06,390 And D has entries on the diagonal 1170 01:06:06,390 --> 01:06:09,979 which are exactly this length and this length, right? 1171 01:06:09,979 --> 01:06:11,270 So that's just what it's doing. 1172 01:06:11,270 --> 01:06:12,920 It's just telling you, well, if you 1173 01:06:12,920 --> 01:06:16,760 think of having this Gaussian or this high-dimensional 1174 01:06:16,760 --> 01:06:19,990 ellipsoid, it's elongated along certain directions.
1175 01:06:19,990 --> 01:06:23,020 And these directions are actually maybe not well aligned 1176 01:06:23,020 --> 01:06:25,270 with your original coordinate system, which might just 1177 01:06:25,270 --> 01:06:27,430 be the usual one, right-- 1178 01:06:27,430 --> 01:06:29,740 north-south and east-west. 1179 01:06:29,740 --> 01:06:30,800 Maybe I need to turn it. 1180 01:06:30,800 --> 01:06:33,174 And that's exactly what this orthogonal transformation is 1181 01:06:33,174 --> 01:06:36,820 doing for you, all right? 1182 01:06:36,820 --> 01:06:39,627 So, in a way, this is actually telling you even more. 1183 01:06:39,627 --> 01:06:41,710 It's telling you that for any matrix that's symmetric, 1184 01:06:41,710 --> 01:06:45,190 you can actually turn it somewhere, 1185 01:06:45,190 --> 01:06:47,530 dilate things in the directions 1186 01:06:47,530 --> 01:06:49,060 that you have, and then turn it back 1187 01:06:49,060 --> 01:06:50,800 to what you originally had. 1188 01:06:50,800 --> 01:06:53,110 And that's actually exactly the effect 1189 01:06:53,110 --> 01:06:57,180 of applying a symmetric matrix to a vector, right? 1190 01:06:57,180 --> 01:06:58,920 And it's pretty impressive. 1191 01:06:58,920 --> 01:07:04,650 It says if I take sigma times v-- any sigma that's 1192 01:07:04,650 --> 01:07:07,560 of this form, which is symmetric. 1193 01:07:07,560 --> 01:07:09,360 What I'm really doing to v is I'm 1194 01:07:09,360 --> 01:07:12,150 changing its coordinate system, so I'm rotating it. 1195 01:07:12,150 --> 01:07:14,970 Then I'm changing-- I'm multiplying its coordinates, 1196 01:07:14,970 --> 01:07:16,956 and then I'm rotating it back. 1197 01:07:16,956 --> 01:07:18,330 That's all it's doing, and that's 1198 01:07:18,330 --> 01:07:21,550 what all symmetric matrices do, which 1199 01:07:21,550 --> 01:07:24,070 means that this is doing a lot. 1200 01:07:24,070 --> 01:07:27,130 All right, so OK. 1201 01:07:27,130 --> 01:07:29,237 So, what do I know? 1202 01:07:29,237 --> 01:07:30,820 So I'm not going to prove this-- 1203 01:07:30,820 --> 01:07:32,140 it's the so-called spectral theorem. 1204 01:07:39,270 --> 01:07:45,850 And the diagonal entries of D are of the form lambda 1, 1205 01:07:45,850 --> 01:07:49,980 lambda 2, up to lambda d, with 0's off the diagonal. 1206 01:07:49,980 --> 01:08:01,800 And the lambda j's are called eigenvalues of sigma. 1207 01:08:01,800 --> 01:08:05,170 Now in general, those numbers can be positive, negative, 1208 01:08:05,170 --> 01:08:06,660 or equal to 0. 1209 01:08:06,660 --> 01:08:12,000 But here, I know that sigma and S are-- 1210 01:08:12,000 --> 01:08:15,290 well, they're symmetric for sure, 1211 01:08:15,290 --> 01:08:17,467 but they are positive semidefinite. 1212 01:08:23,939 --> 01:08:25,840 What does it mean? 1213 01:08:25,840 --> 01:08:30,930 It means that when I take u transpose sigma u for example, 1214 01:08:30,930 --> 01:08:33,192 this number is always non-negative. 1215 01:08:35,910 --> 01:08:36,720 Why is this true? 1216 01:08:42,770 --> 01:08:43,609 What is this number? 1217 01:08:47,670 --> 01:08:49,850 It's the variance of-- and actually, I don't even 1218 01:08:49,850 --> 01:08:51,229 need to finish this sentence. 1219 01:08:51,229 --> 01:08:53,957 As soon as I say that this is a variance, well, 1220 01:08:53,957 --> 01:08:55,040 it has to be non-negative. 1221 01:08:55,040 --> 01:08:57,990 We know that a variance is non-negative. 1222 01:08:57,990 --> 01:09:00,532 And so, that's also a nice way you can use that.
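The rotate-stretch-rotate description can be replayed step by step in code. In this made-up example, applying P transpose, then the diagonal scaling, then P gives exactly the same result as applying sigma directly.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(4, 4))
    Sigma = (A + A.T) / 2                 # a symmetric matrix
    lam, P = np.linalg.eigh(Sigma)

    v = rng.normal(size=4)                # any vector
    w = P.T @ v                           # step 1: rotate into the eigen-coordinate system
    w = lam * w                           # step 2: stretch each coordinate by lambda_j
    out = P @ w                           # step 3: rotate back
    assert np.allclose(out, Sigma @ v)    # same as applying Sigma directly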
1223 01:09:00,532 --> 01:09:02,240 So it's just to say, well, OK, this thing 1224 01:09:02,240 --> 01:09:04,680 is positive semidefinite because it's a covariance matrix. 1225 01:09:04,680 --> 01:09:06,920 So I know it's a variance, OK? 1226 01:09:06,920 --> 01:09:08,779 So I get this. 1227 01:09:08,779 --> 01:09:10,560 Now, if I had some negative numbers-- 1228 01:09:10,560 --> 01:09:15,350 so the effect of that is that when I draw this picture, 1229 01:09:15,350 --> 01:09:19,040 those axes are always positive, which is kind of a weird thing 1230 01:09:19,040 --> 01:09:19,950 to say. 1231 01:09:19,950 --> 01:09:23,840 But what it means is that when I take a vector, v, I rotate it, 1232 01:09:23,840 --> 01:09:28,250 and then I stretch it in the directions of the coordinates, 1233 01:09:28,250 --> 01:09:30,260 I cannot flip it. 1234 01:09:30,260 --> 01:09:34,260 I can only stretch or shrink, but I cannot flip its sign, 1235 01:09:34,260 --> 01:09:34,760 all right? 1236 01:09:34,760 --> 01:09:37,370 But in general, for any symmetric matrices, 1237 01:09:37,370 --> 01:09:38,840 I could do this. 1238 01:09:38,840 --> 01:09:40,910 But when it's positive semidefinite, 1239 01:09:40,910 --> 01:09:43,020 actually what turns out is that all the lambda 1240 01:09:43,020 --> 01:09:48,350 j's are non-negative. 1241 01:09:48,350 --> 01:09:51,370 I cannot flip it, OK? 1242 01:09:51,370 --> 01:09:53,778 So all the eigenvalues are non-negative. 1243 01:09:56,590 --> 01:09:58,469 That's a property of positive semidefinite matrices. 1244 01:09:58,469 --> 01:10:00,510 So when it's symmetric, you have the eigenvalues. 1245 01:10:00,510 --> 01:10:01,670 They can be any number. 1246 01:10:01,670 --> 01:10:03,780 And when it's positive semidefinite, in particular 1247 01:10:03,780 --> 01:10:05,220 that's the case of the covariance matrix 1248 01:10:05,220 --> 01:10:07,110 and the empirical covariance matrix, right? 1249 01:10:07,110 --> 01:10:08,940 Because u transpose times the empirical covariance matrix 1250 01:10:08,940 --> 01:10:12,150 times u is an empirical variance, which itself is non-negative. 1251 01:10:12,150 --> 01:10:17,900 And so I get that the eigenvalues are non-negative. 1252 01:10:17,900 --> 01:10:23,030 All right, so principal component analysis is saying, 1253 01:10:23,030 --> 01:10:32,370 OK, I want to find the direction, u, 1254 01:10:32,370 --> 01:10:38,830 that maximizes u transpose Su, all right? 1255 01:10:38,830 --> 01:10:40,420 I've just introduced in one slide 1256 01:10:40,420 --> 01:10:41,690 something about eigenvalues. 1257 01:10:41,690 --> 01:10:44,740 So hopefully, they should help. 1258 01:10:44,740 --> 01:10:47,560 So what is it that I'm going to be getting? 1259 01:10:47,560 --> 01:10:51,446 Well, let's just see what happens. 1260 01:10:51,446 --> 01:10:53,570 Oh, I forgot to mention that-- and I will use this. 1261 01:10:53,570 --> 01:10:56,020 So the lambda j's come with eigenvectors. 1262 01:10:56,020 --> 01:11:08,690 And then the matrix, P, has columns v1 to vd, OK? 1263 01:11:08,690 --> 01:11:13,370 The fact that it's orthogonal-- that P transpose P is equal 1264 01:11:13,370 --> 01:11:15,470 to the identity-- 1265 01:11:15,470 --> 01:11:20,810 means that those guys satisfy that vi transpose 1266 01:11:20,810 --> 01:11:27,485 vj is equal to 0 if i is different from j.
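A short check on made-up data that an empirical covariance matrix has non-negative eigenvalues and orthonormal eigenvectors, as just claimed.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 4))
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)                     # an empirical covariance matrix, hence PSD

    lam, P = np.linalg.eigh(S)
    assert (lam >= -1e-12).all()               # all eigenvalues >= 0, up to round-off
    assert np.allclose(P.T @ P, np.eye(4))     # the columns v1, ..., vd are orthonormal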
1267 01:11:27,485 --> 01:11:31,040 And vi transpose vi is actually equal to 1, 1268 01:11:31,040 --> 01:11:33,920 right, because the entries of P transpose P 1269 01:11:33,920 --> 01:11:38,990 are exactly going to be of the form, vi transpose vj, OK? 1270 01:11:38,990 --> 01:11:40,890 So those v's are called eigenvectors. 1271 01:11:46,000 --> 01:11:52,020 And v1 is attached to lambda 1, and v2 is attached to lambda 2, 1272 01:11:52,020 --> 01:11:53,180 OK? 1273 01:11:53,180 --> 01:11:56,280 So let's see what's happening with those things. 1274 01:11:56,280 --> 01:11:58,045 What happens if I take sigma-- 1275 01:11:58,045 --> 01:12:00,170 so if you know eigenvalues, you know exactly what's 1276 01:12:00,170 --> 01:12:01,580 going to happen. 1277 01:12:01,580 --> 01:12:06,920 If I look at, say, sigma times v1, well, what is sigma? 1278 01:12:06,920 --> 01:12:15,440 We know that sigma is PDP transpose, so this is PDP transpose v1. 1279 01:12:15,440 --> 01:12:17,420 What is P transpose times v1? 1280 01:12:17,420 --> 01:12:21,560 Well, P transpose has rows v1 transpose, 1281 01:12:21,560 --> 01:12:26,850 v2 transpose, all the way to vd transpose. 1282 01:12:26,850 --> 01:12:30,910 So when I multiply this by v1, what 1283 01:12:30,910 --> 01:12:32,820 I'm left with is the first coordinate 1284 01:12:32,820 --> 01:12:38,010 is going to be equal to 1 and the second coordinate is 1285 01:12:38,010 --> 01:12:40,980 going to be equal to 0, right? 1286 01:12:40,980 --> 01:12:42,910 Because they're orthogonal to each other-- 1287 01:12:42,910 --> 01:12:45,810 0 all the way to the end. 1288 01:12:45,810 --> 01:12:48,890 So that's when I do P transpose v1. 1289 01:12:48,890 --> 01:12:55,250 Now I multiply by D. Well, I'm just 1290 01:12:55,250 --> 01:12:58,950 multiplying this guy by lambda 1, this guy by lambda 2, 1291 01:12:58,950 --> 01:13:02,150 and this guy by lambda d, so this is really just lambda 1. 1292 01:13:04,720 --> 01:13:12,080 And now I need to multiply by P. 1293 01:13:12,080 --> 01:13:14,190 So what is P times this guy? 1294 01:13:14,190 --> 01:13:19,730 Well, P is v1 all the way to vd. 1295 01:13:19,730 --> 01:13:21,290 And now I multiply by a vector that 1296 01:13:21,290 --> 01:13:24,620 only has 0's except lambda 1 on the first guy. 1297 01:13:24,620 --> 01:13:26,510 So this is just lambda 1 times v1. 1298 01:13:29,470 --> 01:13:34,630 So what we've proved is that sigma times v1 is lambda 1 v1, 1299 01:13:34,630 --> 01:13:37,330 and that's probably the notion of eigenvalue you're 1300 01:13:37,330 --> 01:13:39,010 most comfortable with, right? 1301 01:13:39,010 --> 01:13:41,620 So just when I multiply by v1, I get 1302 01:13:41,620 --> 01:13:45,440 v1 back multiplied by something, which is the eigenvalue. 1303 01:13:45,440 --> 01:13:54,450 So in particular, if I look at v1 transpose sigma v1, 1304 01:13:54,450 --> 01:13:55,180 what do I get? 1305 01:13:55,180 --> 01:13:58,800 Well, I get lambda 1 times v1 transpose v1, 1306 01:13:58,800 --> 01:14:00,180 and v1 transpose v1 is 1, right? 1307 01:14:00,180 --> 01:14:04,050 So this is actually lambda 1 times 1, 1308 01:14:04,050 --> 01:14:08,360 which is lambda 1, OK? 1309 01:14:08,360 --> 01:14:10,940 And if I do the same with v2, clearly I'm 1310 01:14:10,940 --> 01:14:13,450 going to get v2 transpose sigma 1311 01:14:13,450 --> 01:14:16,910 v2 is equal to lambda 2.
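The same computation in code, on an arbitrary made-up positive semidefinite sigma: the top eigenvector v1 returned by eigh satisfies sigma v1 = lambda 1 v1, and v1 transpose sigma v1 = lambda 1.

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.normal(size=(4, 4))
    Sigma = A @ A.T                              # symmetric positive semidefinite

    lam, P = np.linalg.eigh(Sigma)               # eigenvalues in ascending order
    v1, lam1 = P[:, -1], lam[-1]                 # the top eigenpair
    assert np.allclose(Sigma @ v1, lam1 * v1)    # Sigma v1 = lambda_1 v1
    assert np.allclose(v1 @ Sigma @ v1, lam1)    # v1^T Sigma v1 = lambda_1, since v1^T v1 = 1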
1312 01:14:16,910 --> 01:14:19,910 So for each of the vj's, I know that if I 1313 01:14:19,910 --> 01:14:21,650 look at the variance along the vj, 1314 01:14:21,650 --> 01:14:27,760 it's actually exactly given by those eigenvalues, all right? 1315 01:14:27,760 --> 01:14:38,490 Which proves this, because the variance along the eigenvectors 1316 01:14:38,490 --> 01:14:40,270 is actually equal to the eigenvalues. 1317 01:14:40,270 --> 01:14:43,760 So since they're variances, they have to be non-negative. 1318 01:14:43,760 --> 01:14:47,960 So now, I'm looking for the one direction that 1319 01:14:47,960 --> 01:14:50,450 has the most variance, right? 1320 01:14:50,450 --> 01:14:53,040 But that's not only among the eigenvectors. 1321 01:14:53,040 --> 01:14:55,520 That's also among the other directions 1322 01:14:55,520 --> 01:14:57,200 that are in-between the eigenvectors. 1323 01:14:57,200 --> 01:14:59,390 If I were to look only at the eigenvectors, 1324 01:14:59,390 --> 01:15:02,420 it would just tell me, well, just pick the eigenvector, vj, 1325 01:15:02,420 --> 01:15:05,990 that's associated to the largest of the lambda j's. 1326 01:15:05,990 --> 01:15:09,080 But it turns out that that's also true over all vectors-- 1327 01:15:09,080 --> 01:15:11,810 the maximizing direction is actually one direction which 1328 01:15:11,810 --> 01:15:13,809 is among the eigenvectors. 1329 01:15:13,809 --> 01:15:16,100 And among the eigenvectors, we know that the one that's 1330 01:15:16,100 --> 01:15:17,080 the largest-- 1331 01:15:17,080 --> 01:15:18,740 that carries the largest variance is 1332 01:15:18,740 --> 01:15:23,780 the one that's associated to the largest eigenvalue, all right? 1333 01:15:23,780 --> 01:15:26,990 And so this is what PCA is going to try to do for me. 1334 01:15:26,990 --> 01:15:29,420 So in practice, that's what I mentioned already, right? 1335 01:15:29,420 --> 01:15:31,970 We're trying to project the point cloud 1336 01:15:31,970 --> 01:15:34,730 onto a low-dimensional space of dimension d prime, 1337 01:15:34,730 --> 01:15:36,800 by keeping as much information as possible. 1338 01:15:36,800 --> 01:15:39,230 And by "as much information," I mean we do not 1339 01:15:39,230 --> 01:15:41,540 want points to collide. 1340 01:15:41,540 --> 01:15:45,530 And so what PCA is going to do is just 1341 01:15:45,530 --> 01:15:48,231 going to try to project onto directions. 1342 01:15:48,231 --> 01:15:49,730 So there's going to be a u, and then 1343 01:15:49,730 --> 01:15:52,021 there's going to be something orthogonal to u, and then 1344 01:15:52,021 --> 01:15:55,550 the third one, et cetera, so that once we project on those, 1345 01:15:55,550 --> 01:15:59,600 we're keeping as much of the covariance as possible, OK? 1346 01:15:59,600 --> 01:16:02,859 And in particular, those directions 1347 01:16:02,859 --> 01:16:04,400 that we're going to pick are actually 1348 01:16:04,400 --> 01:16:06,920 a subset of the vj's that are associated to the largest 1349 01:16:06,920 --> 01:16:08,580 eigenvalues. 1350 01:16:08,580 --> 01:16:11,300 So I'm going to stop here for today. 1351 01:16:11,300 --> 01:16:15,020 We'll finish this on Tuesday. 1352 01:16:15,020 --> 01:16:18,260 But basically, the idea is just the following. 1353 01:16:18,260 --> 01:16:22,590 You're just going to-- well, let me skip one more. 1354 01:16:22,590 --> 01:16:24,812 Yeah, this is the idea.
1355 01:16:24,812 --> 01:16:27,020 You're first going to pick the eigenvector associated 1356 01:16:27,020 --> 01:16:30,290 to the largest eigenvalue. 1357 01:16:30,290 --> 01:16:33,890 Then you're going to pick the direction that's orthogonal 1358 01:16:33,890 --> 01:16:37,130 to the vector that you've picked, 1359 01:16:37,130 --> 01:16:38,984 and that's carrying the most variance. 1360 01:16:38,984 --> 01:16:40,650 And that's actually the second largest-- 1361 01:16:40,650 --> 01:16:44,030 the eigenvector associated to the second largest eigenvalue. 1362 01:16:44,030 --> 01:16:46,520 And you're going to go all the way to the number of them 1363 01:16:46,520 --> 01:16:50,120 that you actually want to pick, which is in this case, d, OK? 1364 01:16:50,120 --> 01:16:53,180 And wherever you choose to chop this process, 1365 01:16:53,180 --> 01:16:56,390 not going all the way to d, is going to actually give you 1366 01:16:56,390 --> 01:16:57,890 a lower-dimensional representation 1367 01:16:57,890 --> 01:17:01,238 in the coordinate system that's given by v1, v2, v3, et 1368 01:17:01,238 --> 01:17:02,420 cetera, OK? 1369 01:17:02,420 --> 01:17:04,591 So we'll see that in more detail on Tuesday. 1370 01:17:04,591 --> 01:17:06,090 But I don't want to get into it now. 1371 01:17:06,090 --> 01:17:07,500 We don't have enough time. 1372 01:17:07,500 --> 01:17:10,000 Are there any questions?
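As a preview of Tuesday's material, here is a minimal numpy sketch of the whole procedure just outlined: center the data, diagonalize the empirical covariance, and keep the coordinates along the top eigenvectors. The function name pca_project and all sizes are made up for illustration; a real analysis would typically rely on a library implementation.

    import numpy as np

    def pca_project(X, k):
        # Minimal PCA sketch: project the rows of X onto the k directions
        # of largest empirical variance (the top k eigenvectors of S).
        Xc = X - X.mean(axis=0)              # center each column (the H step)
        S = Xc.T @ Xc / len(X)               # empirical covariance, (1/n) X^T H X
        lam, P = np.linalg.eigh(S)           # eigenvalues in ascending order
        V = P[:, ::-1][:, :k]                # v1, ..., vk for the k largest eigenvalues
        return Xc @ V                        # new coordinates: u^T Xi along each vj

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 10))           # made-up data: 200 points in dimension 10
    Y = pca_project(X, 2)                    # a 2-dimensional representation
    print(Y.shape)                           # (200, 2)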