The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PHILIPPE RIGOLLET: --bunch of x's and a bunch of y's. The y's were univariate, just one real valued random variable. And the x's were vectors that described a bunch of attributes for each of our individuals or each of our observations. Let's assume now that we're given essentially only the x's. This is sometimes referred to as unsupervised learning. There is just the x's. Usually, supervision is done by the y's.
And so what you're trying to do is to make sense of this data. You're going to try to understand this data, represent this data, visualize this data, try to understand something, right? So, if I give you a d-dimensional random vectors, and you're going to have n independent copies of this individual-- of this random vector, OK?
So you will see that I'm going to have-- I'm going to very quickly run into some limitations about what I can actually draw on the board because I'm using [? boldface ?] here. I'm also going to use the blackboard [? boldface. ?] So it's going to be a bit difficult. So tell me if you're actually a little confused by what is a vector, what is a number, and what is a matrix. But we'll get there.
So I have X in Rd, and that's a random vector. And I have X1 to Xn that are IID. They're independent copies of X. OK, so you can think of those as being-- the realization of these guys are going to be a cloud of n points in R to the d. And we're going to think of d as being fairly large. And for this to start to make sense, we're going to think of d as being at least 4, OK?
And meaning that you're going to have a hard time visualizing those things. If it was 3 or 2, you would be able to draw these points. And that's pretty much as much sense you're going to be making about those guys, just looking at the [INAUDIBLE]
All right, so I'm going to write each of those X's, right? So this vector, X, has d coordinate. And I'm going to write them as X1, to Xd. And I'm going to stack them into a matrix, OK? So once I have those guys, I'm going to have a matrix. But here, I'm going to use the double bar. And it's X1 transpose, Xn transpose.
So what it means is that the coordinates of this guy, of course, are X1,1. Here, I have-- I'm of size d, so I have X1d. And here, I have Xn1. Xnd. And so the i-th, j-th-- i-th row and j-th column is the matrix, Xij, right-- is the entry, Xi to-- sorry. OK, so each-- so the rows here are the observations. And the columns are the covariance over attributes. OK? So this is an n by d matrix.
All right, this is really just some bookkeeping. How do we store this data somehow? And the fact that we use a matrix just like for regression is going to be convenient because we're going to able to talk about projections-- going to be able to talk about things like this.
All right, so everything I'm going to say now is about variances or covariances of those things, which means that I need two moments, OK? If the variance does not exist, there's nothing I can say about this problem. So I'm going to assume that the variance exists. And one way to just put it to say that the two norm of those guys is finite, which is another way to say that each of them is finite. I mean, you can think of it the way you want.
All right, so now, the mean of X, right? So I have a random vector. So I can talk about the expectation of X. That's a vector that's in Rd. And that's just taking the expectation entrywise. Sorry. X1, Xd.
OK, so I should say it out loud. For this, the purpose of this class, I will denote by subscripts the indices that corresponds to observations. And superscripts, the indices that correspond to coordinates of a variable. And I think that's the same convention that we took for the regression case. Of course, you could use whatever you want. If you want to put commas, et cetera, it becomes just a bit more complicated.
All right, and so now, once I have this, so this tells me where my cloud of point is centered, right? So if I have a bunch of points-- OK, so now I have a distribution on Rd, so maybe I should talk about this-- I'll talk about this when we talk about the empirical version. But if you think that you have, say, a two-dimensional Gaussian random variable, then you have a center in two dimension, which is where it peaks, basically. And that's what we're talking about here.
But the other thing we want to know is how much does it spread in every direction, right? So in every direction of the two dimensional thing, I can then try to understand how much spread I'm getting. And the way you measure this is by using covariance, right? So the covariance matrix, sigma-- that's a matrix which is d by d. And it records-- in the j, k-th entry, it records the covariance between the j-th coordinate of X and the k-th coordinate of X, OK?
So with entries-- OK, so I have sigma, which is sigma 1,1, sigma dd, sigma 1d, sigma d1. OK, and here I have sigma jk And sigma jk is just the covariance between Xj, the j-th coordinate and the k-th coordinate. OK? So in particular, it's symmetric because the covariance between Xj and Xk is the same as the covariance between Xk and Xj. I should not put those parentheses here. I do not use them in this, OK?
Just the covariance matrix. So that's just something that records everything. And so what's nice about the covariance matrix is that if I actually give you X as a vector, you actually can build the matrix just by looking at vectors times vectors transpose, rather than actually thinking about building it coordinate by coordinate. So for example, if you're used to using MATLAB, that's the way you want to build a covariance matrix because MATLAB is good at manipulating vectors and matrices rather than just entering it entry by entry.
OK, so, right? So, what is the covariance between Xj and Xk? Well by definition, it's the expectation of Xj and Xk minus the expectation of Xj times the expectation of Xk, right? That's the definition of the covariance. I hope everybody's seeing that.
And so, in particular, I can actually see that this thing can be written as-- sigma can now be written as the expectation of XX transpose minus the expectation of X times the expectation of X transpose.
Why? Well, let's look at the jk-th coefficient of this guy, right? So here, if I look at the jk-th coefficient, I see what? Well, I see that it's the expectation of XX transpose jk, which is equal to the expectation of XX transpose jk. And what are the entries of XX transpose? Well, they're of the form, Xj times Xk exactly. So this is actually equal to the expectation of Xj times Xk.
And this is actually not the way I want to write it. I want to write it-- OK? Is that clear? That when I have a rank 1 matrix of this form, XX transpose, the entries are of this form, right? Because if I take-- for example, think about x, y, z, and then I multiply by x, y, z. What I'm getting here is x-- maybe I should actually use indices here.
x1, x2, x3. x1, x2, x3. The entries are x1x1, x1x2, x1x3; x2x1, x2x2, x2x3; x3x1, x3x2, x3x3, OK? So indeed, this is exactly of the form if you look at jk, you get exactly Xj times Xk, OK? So that's the beauty of those matrices.
So now, once I have this, I can do exactly the same thing, except that here, if I take the jk-th entry, I will get exactly the same thing, except that it's not going to be the expectation of the product, but the product of the expectation, right? So I get that the jk-th entry of E of X, E of X transpose, is just the j-th entry of E of X times the k-th entry of E of X.
So if I put those two together, it's actually telling me that if I look at the j, k-th entry of sigma, which I called little sigma jk, then this is actually equal to what? It's equal to the first term minus the second term. The first term is the expectation of Xj, Xk minus the expectation of Xj, expectation of Xk, which-- oh, by the way, I forgot to say this is actually equal to the expectation of Xj times the expectation of Xk because that's just the definition of the expectation of random vectors. So my j and my k are now inside. And that's by definition the covariance between Xj and Xk, OK?
So just if you've seen those manipulations between vectors, hopefully you're bored out of your mind. And if you have not, then that's something you just need to get comfortable with, right? So one thing that's going to be useful is to know very quickly what's called the outer product of a vector with itself, which is the vector of times the vector transpose, what the entries of these things are. And that's what we've been using on this second set of boards.
OK, so everybody agrees now that we've sort of showed that the covariance matrix can be written in this vector form. So expectation of XX transpose minus expectation of X, expectation of X transpose.
OK, just like the covariance can be written in two ways, right we know that the covariance can also be written as the expectation of Xj minus expectation of Xj times Xk minus expectation of Xk, right?
That's the-- sometimes, this is the original definition of covariance. This is the second definition of covariance. Just like you have the variance which is the expectation of the square of X minus c of X, or the expectation X squared minus the expectation of X squared. It's the same thing for covariance.
And you can actually see this in terms of vectors, right? So this actually implies that you can also rewrite sigma as the expectation of X minus expectation of X times the same thing transpose.
Right? And the reason is because if you just distribute those guys, this is just the expectation of XX transpose minus X, expectation of X transpose minus expectation of XX transpose. And then I have plus expectation of X, expectation of X transpose.
Now, things could go wrong because the main difference between matrices slash vectors and numbers is that multiplication does not commute, right? So in particular, those two things are not the same thing. And so that's the main difference that we have before, but it actually does not matter for our problem. It's because what's happening is that if when I take the expectation of this guy, then it's actually the same as the expectation of this guy, OK?
And so just because the expectation is linear-- so what we have is that sigma now becomes equal to the expectation of XX transpose minus the expectation of X, expectation of X transpose minus expectation of X, expectation of X transpose. And then I have-- well, really, what I have is this guy. And then I have plus the expectation of X, expectation of X transpose.
And now, those three things are actually equal to each other just because the expectation of X transpose is the same as the expectation of X transpose. And so what I'm left with is just the expectation of XX transpose minus the expectation of X, expectation of X transpose, OK?
So same thing that's happening when you want to prove that you can write the covariance either this way or that way. The same thing happens for matrices, or for vectors, right, or a covariance matrix. They go together. Is there any questions so far? And if you have some, please tell me, because I want to-- I don't know to which extent you guys are comfortable with this at all or not.
OK, so let's move on. All right, so of course, this is what I'm describing in terms of the distribution right here. I took expectations. Covariances are also expectations. So those depend on some distribution of X, right? If I wanted to compute that, I would basically need to know what the distribution of X is.
Now, we're doing statistics, so I need to [INAUDIBLE] my question is going to be to say, well, how well can I estimate the covariance matrix itself, or some properties of this covariance matrix based on data?
All right, so if I want to understand what my covariance matrix looks like based on data, I'm going to have to basically form its empirical counterparts, which I can do by doing the age-old statistical trick, which is replace your expectation by an average, all right? So let's just-- everything that's on the board, you see expectation, just replace it by an average.
OK, so, now I'm going to be given X1, Xn. So, I'm going to define the empirical mean. OK so, really, the idea is take your expectation and replace it by 1 over n sum, right? And so the empirical mean is just 1 over n. Some of the Xi's-- I'm guessing everybody knows how to average vectors. It's just the average of the coordinates. So I will write this as X bar. And the empirical covariance matrix, often called sample covariance matrix, hence the notation, S.
Well, this is my covariance matrix, right? Let's just replace the expectations by averages. 1 over n, sum from i equal 1 to n, of Xi, Xi transpose, minus-- this is the expectation of X. I will replace it by the average, which I just called X bar, X bar transpose, OK?
And that's when I want to use the-- that's when I want to use the notation-- the second definition, but I could actually do exactly the same thing using this definition here. Sorry, using this definition right here. So this is actually 1 over n, sum from i equal 1 to n, of Xi minus X bar, Xi minus X bar transpose.
And those are actually-- I mean, in a way, it looks like I could define two different estimators, but you can actually check. And I do encourage you to do this. If you're not comfortable making those manipulations, you can actually check that those two things are actually exactly the same, OK?
So now, I'm going to want to talk about matrices, OK? And remember, we defined this big matrix, X, with the double bar. And the question is, can I express both X bar and the sample covariance matrix in terms of this big matrix, X? Because right now, it's still expressed in terms of the vectors. I'm summing those vectors, vectors transpose. The question is, can I just do that in a very compact way, in a way that I can actually remove this sum term, all right?
That's going to be the goal. I mean, that's not a notational goal. That's really something that we want-- that's going to be convenient for us just like it was convenient to talk about matrices when we did linear regression.
OK, X bar. We just said it's 1 over n, sum from I equal 1 to n of Xi, right? Now remember, what does this matrix look like? We said that X bar-- X is this guy. So if I look at X transpose, the columns of this guy becomes X1, my first observation, X2, my second observation, all the way to Xn, my last observation, right? Agreed? That's what X transpose is.
So if I want to sum those guys, I can multiply by the all-ones vector. All right, so that's what the definition of the all-ones 1 vector is. Well, it's just a bunch of 1's in Rn, in this case. And so when I do X transpose 1, what I get is just the sum from i equal 1 to n of the Xi's. So if I divide by n, I get my average, OK? So here, I definitely removed the sum term.
Let's see if with the covariance matrix, we can do the same. Well, and that's actually a little more difficult to see, I guess. But let's use this definition for S, OK? And one thing that's actually going to be-- so, let's see for one second, what-- so it's going to be something that involves X, multiplying X with itself, OK? And the question is, is it going to be multiplying X with X transpose, or X tranpose with X?
To answer this question, you can go the easy route, which says, well, my covariance matrix is of size, what? What is the size of S?
AUDIENCE: d by d.
PHILIPPE RIGOLLET: d by d, OK? X is of size n by d. So if I do X times X transpose, I'm going to have something which is of size n by n. If I do X transpose X, I'm going to have something which is d by d. That's the easy route. And there's basically one of the two guys.
You can actually open the box a little bit and see what's going on in there. If you do X transpose X, which we know gives you a d by d, you'll see that X is going to have vectors that are of the form, Xi, and X transpose is going to have vectors that are of the form, Xi transpose, right? And so, this is actually probably the right way to go. So let's look at what's X transpose X is giving us.
So I claim that it's actually going to give us what we want, but rather than actually going there, let's-- to actually-- I mean, we could check it entry by entry, but there's actually a nice thing we can do. Before we go there, let's write X transpose as the following sum of variables, X1 and then just a bunch of 0's everywhere else. So it's still d by n. So n minus 1 of the columns are equal to 0 here.
Then I'm going to put a 0 and then put X2. And then just a bunch of 0's, right? So that's just 0, 0 plus 0, 0, all the way to Xn, OK? Everybody agrees with it? See what I'm doing here? I'm just splitting it into a sum of matrices that only have one nonzero columns. But clearly, that's true.
Now let's look at the product of this guy with itself. So, let's call these matrices M1, M2, Mn. So when I do X transpose X, what I do is the sum of the Mi's for i equal 1 to n, times the sum of the Mi transpose, right? Now, the sum of the Mi's transpose is just the sum of each of the Mi's transpose, OK? So now I just have this product of two sums, so I'm just going to re-index the second one by j. So this is sum for i equal 1 to n, j equal 1 to n of Mi Mj transpose. OK?
And now what we want to notice is that if i is different from j, what's happening? Well if i is different from j, let's look at say, M1 times XM2 transpose. So what is the product between those two matrices?
AUDIENCE: It's a new entry and [INAUDIBLE]
PHILIPPE RIGOLLET: There's an entry?
AUDIENCE: Well, it's an entry. It's like a dot product in that form next to [? transpose. ?]
PHILIPPE RIGOLLET: You mean a dot product is just getting [INAUDIBLE] number, right? So I want-- this is going to be a matrix. It's the product of two matrices, right? This is a matrix times a matrix. So this should be a matrix, right, of size d by d. Yeah, I should see a lot of hands that look like this, right?
Because look at this. So let's multiply the first-- let's look at what's going on in the first column here. I'm multiplying this column with each of those rows. The only nonzero coefficient is here, and it only hits this column of 0's. So every time, this is going to give you 0, 0, 0, 0. And it's going to be the same for every single one of them. So this matrix is just full of 0's, right? They never hit each other when I do the matrix-matrix multiplication. There's no-- every non-zero hits a 0.
So what it means is-- and this, of course, you can check for every i different from j. So this means that Mi times Mj transpose is actually equal to 0 when i is different from j, Right? Everybody is OK with this? So what that means is that when I do this double sum, really, it's a simple sum. There's only just the sum from i equal 1 to n of Mi Mi transpose. Because this is the only terms in this double sum that are not going to be 0 when [INAUDIBLE] [? M1 ?] with M1 itself.
Now, let's see what's going on when I do M1 times M1 transpose. Well, now, if I do Mi times and Mi transpose, now this guy becomes [? X1 ?] [INAUDIBLE] it's here. And so now, I really have X1 times X1 transpose. So this is really just the sum from i equal 1 to n of Xi Xi transpose, just because Mi Mi transpose is Xi Xi transpose. There's nothing else there.
So that's the good news, right? This term here is really just X transpose X divided by n. OK, I can use that guy again, I guess. Well, no. Let's just-- OK, so let me rewrite S. All right, that's the definition we have.
And we know that this guy already is equal to 1 over n X transpose X. x bar x bar transpose-- we know that x bar-- we just proved that x bar-- sorry, little x bar was equal to 1 over n X bar transpose times the all-ones vector. So I'm just going to do that.
So that's just going to be minus. I'm going to pull my two 1 over n's-- one from this guy, one from this guy. So I'm going to get 1 over n squared. And then I'm going to get X bar-- sorry, there's no X bar here. It's just X. Yeah. X transpose all ones times X transpose all ones transpose, right?
And X transpose all ones transpose-- right, the rule-- if I have A times B transpose, it's B transpose times A transpose, right? That's just the rule of transposition. So this is 1 transpose X transpose. And so when I put all these guys together, this is actually equal to 1 over n X transpose X minus one over n squared X transpose 1, 1 transpose X. Because X transpose transposes X, OK?
So now, I can actually-- I have something which is of the form, X transpose X-- [INAUDIBLE] to the left, X transpose; to the right, X. Here, I have X transpose to the left, X to the right. So it can factor out whatever's in there. So I can write S as 1 over n-- sorry, X transpose times 1 over n times the identity of Rd. And then I have minus 1 over n, 1, 1 transpose X.
OK, because if you-- I mean, you can distribute it back, right? So here, I'm going to get what? X transpose identity times X, the whole thing divided by n. That's this term. And then the second one is going to be-- sorry, 1 over n squared. And then I'm going to get 1 over n squared times X transpose 1, 1 transpose which is this guy, times X, and that's the [? right ?] [? thing, ?] OK?
So, the way it's written, I factored out one of the 1 over n's. So I'm just going to do the same thing as on this slide. So I'm just factoring out this 1 over n here. So it's 1 over n times X transpose identity of our d divided by n divided by 1 this time, minus 1 over n 1, 1 transpose times X, OK? So that's just what's on the slides.
What does the matrix, 1, 1 transpose, look like?
AUDIENCE: All 1's.
PHILIPPE RIGOLLET: It's just all 1's, right? Because the entries are the products of the all-ones-- of the coordinates of the all-ones vectors with the coordinates of the all-ones vectors, so I only get 1's. So it's a d by d matrix with only 1's. So this matrix, I can actually write exactly, right? H, this matrix that I called H which is what's sandwiched in-between this X transpose and X. By definition, I said this is the definition of H. Then this thing, I can write its coordinates exactly.
We know it's identity divided by n minus-- sorry, I don't know why I keep [INAUDIBLE]. Minus 1 over n 1, 1 transpose-- so it's this matrix with the only 1's on the diagonals and 0's and elsewhere-- minus a matrix that only has 1 over n everywhere. OK, so the whole thing is 1 minus 1 over n on the diagonals and then minus 1 over n here, OK?
And now I claim that this matrix is an orthogonal projector. Now, I'm writing this, but it's completely useless. This is just a way for you to see that it's actually very convenient now to think about this problem as being a matrix problem, because things are much nicer when you think about the actual form of your matrices, right? They could tell you, here is the matrix.
I mean, imagine you're sitting at a midterm, and I say, here's the matrix that has 1 minus 1 over n on the diagonals and minus 1 over n on the [INAUDIBLE] diagonal. Prove to me that it's a projector matrix. You're going to have to basically take this guy times itself. It's going to be really complicated, right? So we know it's symmetric. That's for sure.
But the fact that it has this particular way of writing it is going to make my life super easy to check this. That's the definition of a projector. It has to be symmetric and it has to square to itself because we just said in the chapter on linear regression that once you project, if you apply the projection again, you're not moving because you're already there.
OK, so why is H squared equal to H? Well let's just write H square. It's the identity minus 1 over n 1, 1 transpose times the identity minus 1 over n 1, 1 transpose, right? Let's just expand this now. This is equal to the identity minus-- well, the identity times 1, 1 transpose is just the identity. So it's 1, 1 transpose, sorry. So 1 over n 1, 1 transpose minus 1 over n 1, 1 transpose.
And then there's going to be what makes the deal is that I get this 1 over n squared this time. And then I get the product of 1 over n trans-- oh, let's write it completely. I get 1, 1 transpose times 1, 1 transpose, OK?
But this thing here-- what is this? n, right, is the end product of the all-ones vector with the all-ones vector. So I'm just summing n times 1 squared, which is n. So this is equal to n. So I pull it out, cancel one of the ends, and I'm back to what I had before. So I had identity minus 2 over n 1, 1 transpose plus 1 over n 1, 1 transpose which is equal to H. Because one of the 1 over n's cancel, OK?
So it's a projection matrix. It's projecting onto some linear space, right? It's taking a matrix. Sorry, it's taking a vector and it's projecting onto a certain space of vectors. What is this space? Right, so, how do you-- so I'm only asking the answer to this question in words, right? So how would you describe the vectors onto which this matrix is projecting?
Well, if you want to answer this question, the way you would tackle it is first by saying, OK, what does a vector which is of the form, H times something, look like, right? What can I say about this vector that's going to be definitely giving me something about the space on which it projects? I need to know a little more to know that it projects exactly onto this.
But one way we can do this is just see how it acts on a vector. What does it do to a vector to apply H, right? So I take v. And let's see what taking v and applying H to it looks like.
Well, it's the identity minus something. So it takes v and it removes something from v. What does it remove? Well, it's 1 over n times v transpose 1 times the all-ones vector, right? Agreed? I just wrote v transpose 1 instead of 1 transpose v, which are the same thing.
What is this thing? What should I call it in mathematical notation? v bar, right? I should all it v bar because this is exactly the average of the entries of v, agreed? This is summing the entries of v's, and this is dividing by the number of those v's. Sorry, now v is in our-- sorry, why do I divide by-- I'm just-- OK, I need to check what my dimensions are now. No, it's in Rd, right? So why do I divide by n?
So it's not really v bar. It's the sum of the v's divided by-- right, so it's v bar.
AUDIENCE: Yeah, v has to be [INAUDIBLE]
PHILIPPE RIGOLLET: Oh, yeah. OK, thank you. So everywhere I wrote Hd, that was actually Hn. Oh, man. I wish I had a computer now. All right. So-- yeah, because the-- yeah, right? So why it's not-- well, why I thought it was this is because I was thinking about the outer dimension of X, really of X transpose, which is really the inner dimension, didn't matter to me, right?
So the thing that I can sandwich between X transpose and X has to be n by n. So this was actually n by n. And so that's actually n by n. Everything is n by n. Sorry about that. So this is n. This is n. This is-- well, I didn't really tell you what the all-ones vector was, but it's also in our n. Yeah, OK. Thank you. And n-- actually, I used the fact that this was of size n here already.
OK, and so that's indeed v bar. So what is this projection doing to a vector? It's removing its average on each coordinate, right? And the effect of this is that v is a vector. What is the average of Hv?
PHILIPPE RIGOLLET: Right, so it's 0. It's the average of v, which is v bar, minus the average of something that only has v bar's entry, which is v bar. So this thing is actually 0. So let me repeat my question. Onto what subspace does H project? Onto the subspace of vectors that have mean 0.
A vector that has mean 0 is a vector. So if you want to talk more linear algebra, v bar-- for a vector you have mean 0, it means that v is orthogonal to the span of the all-ones vector. That's it. It projects to this space. So in words, it projects onto the space of vectors that have 0 mean. In linear algebra, it says it projects onto the hyperplane which is orthogonal to the all-ones vector, OK? So that's all.
Can you guys still see the screen? Are you good over there? OK.
All right, so now, what it means is that, well, I'm doing this weird thing, right? I'm taking the inner product-- so S is taking X. And then it's removing its mean of each of the columns of X, right? When I take H times X, I'm basically applying this projection which consists in removing the mean of all the X's. And then I multiply by H transpose.
But what's actually nice is that, remember, H is a projector. Sorry, I don't want to keep that. Which means that when I look at X transpose HX, it's the same as looking at X transpose H squared X. But since H is equal to its transpose, this is actually the same as looking at X transpose H transpose HX, which is the same as looking at HX transpose HX, OK?
So what it's doing, it's first applying this projection matrix, H, which removes the mean of each of your columns, and then looks at the inner products between those guys, right? Each entry of this guy is just the covariance between those centered things. That's all it's doing.
All right, so those are actually going to be the key statements. So everything we've done so far is really mainly linear algebra, right? I mean, looking at expectations and covariances was just-- we just used the fact that the expectation was linear. We didn't do much. But now there's a nice thing that's happening.
And that's why we're going to switch from the language of linear algebra to more statistical, because what's happening is that if I look at this quadratic form, right? So I take sigma. So I take a vector, u. And I'm going to look at u-- so let's say, in Rd. And I'm going to look at u transpose sigma u. OK?
What is this doing? Well, we know that u transpose sigma u is equal to what? Well, sigma is the expectation of XX transpose minus the expectation of X expectation of X transpose, right? So I just substitute in there.
Now, u is deterministic. So in particular, I can push it inside the expectation here, agreed? And I can do the same from the right. So here, when I push u transpose here, and u here, what I'm left with is the expectation of u transpose X times X transpose u. OK? And now, I can do the same thing for this guy. And this tells me that this is the expectation of u transpose X times the expectation of X transpose u.
Of course, u transpose X is equal to X transpose u. And u-- yeah. So what it means is that this is actually equal to the expectation of u transpose X squared minus the expectation of u transpose X, the whole thing squared.
But this is something that should look familiar. This is really just the variance of this particular random variable which is of the form, u transpose X, right? u transpose X is a number. It involves a random vector, so it's a random variable. And so it has a variance. And this variance is exactly given by this formula. So this is just the variance of u transpose X. So what we've proved is that if I look at this guy, this is really just the variance of u transpose X, OK?
I can do the same thing for the sample variance. So let's do this. And as you can see, spoiler alert, this is going to be the sample variance. OK, so remember, S is 1 over n, sum of Xi Xi transpose minus X bar X bar transpose. So when I do u transpose, Su, what it gives me is 1 over n sum from i equal 1 to n of u transpose Xi times Xi transpose u, all right?
So those are two numbers that multiply each other and that happen to be equal to each other, minus u transpose X bar X bar transpose u, which is also the product of two numbers that happen to be equal to each other. So I can rewrite this with squares.
So we're almost there. All I need to know to check is that this thing is actually the average of those guys, right? So u transpose X bar. What is it? It's 1 over n sum from i equal 1 to n of u transpose Xi. So it's really something that I can write as u transpose X bar, right? That's the average of those random variables of the form, u transpose Xi.
So what it means is that u transpose Su, I can write as 1 over n sum from i equal 1 to n of u transpose Xi squared minus u transpose X bar squared, which is the empirical variance that we need noted by small s squared, right? So that's the empirical variance of u transpose X1 all the way to u transpose Xn.
OK, and here, same thing. I use exactly the same thing. I just use the fact that here, the only thing I use is really the linearity of this guy, of 1 over n sum or the linearity of expectation, that I can push things in there, OK?
AUDIENCE: So what you have written at the end of that sum for uT Su?
PHILIPPE RIGOLLET: This one?
PHILIPPE RIGOLLET: Yeah, I said it's equal to small s, and I want to make a difference between the big S that I'm using here. So this is equal to small-- I don't know, I'm trying to make it look like a calligraphic s squared.
OK, so this is nice, right? This covariance matrix-- so let's look at capital sigma itself right now. This covariance matrix, we know that if we read its entries, what we get is the covariance between the coordinates of the X's, right, of the random vector, X. And the coordinates, well, by definition, are attached to a coordinate system. So I only know what the covariance of X in of those two things are, or the covariance of those two things are.
But what if I want to find coordinates between linear combination of the X's? Sorry, if I want to find covariances between linear combination of those X's. And that's exactly what this allows me to do. It says, well, if I pre- and post-multiply by u, this is actually telling me what the variance of X along direction u is, OK? So there's a lot of information in there, and it's just really exploiting the fact that there is some linearity going on in the covariance.
So, why variance? Why is variance interesting for us, right? Why? I started by saying, here, we're going to be interested in having something to do dimension reduction. We have-- think of your points as [? being in a ?] dimension larger than 4, and we're going to try to reduce the dimension. So let's just think for one second, what do we want about a dimension reduction procedure?
If I have all my points that live in, say, three dimensions, and I have one point here and one point here and one point here and one point here and one point here, and I decide to project them onto some plane-- that I take a plane that's just like this, what's going to happen is that those points are all going to project to the same point, right? I'm just going to not see anything.
However, if I take a plane which is like this, they're all going to project into some nice line. Maybe I can even project them onto a line and they will still be far apart from each other. So that's what you want. You want to be able to say, when I take my points and I say I project them onto lower dimensions, I do not want them to collapse into one single point. I want them to be spread as possible in the direction on which I project.
And this is what we're going to try to do. And of course, measuring spread between points can be done in many ways, right? I mean, you could look at, I don't know, sum of pairwise distances between those guys. You could look at some sort of energy. You can look at many ways to measure of spread in a direction.
But variance is a good way to measure of spread between points. If you have a lot of variance between your points, then chances are they're going to be spread. Now, this is not always the case, right? If I have a direction in which all my points are clumped onto one big point and one other big point, it's going to choose this because that's the direction that has a lot of variance. But hopefully, the variance is going to spread things out nicely.
So the idea of principal component analysis is going to try to identify those variances-- those directions along which we have a lot of variance. Reciprocally, we're going to try to eliminate the directions along which we do not have a lot of variance, OK? And let's see why.
Well, if-- so here's the first claim. If you transpose Su is equal to 0, what's happening? Well, I know that an empirical variance is equal to 0. What does it mean for an empirical variance to be equal to 0? So I give you a bunch of points, right? So those points are those points-- u transpose X1, u transpose-- those are a bunch of numbers. What does it mean to have the empirical variance of those points being equal to 0?
AUDIENCE: They're all the same.
PHILIPPE RIGOLLET: They're all the same. So what it means is that when I have my points, right? So, can you find a direction for those points in which they project to all the same point? No, right? There's no such thing. For this to happen, you have to have your points which are perfectly aligned. And then when you're going to project onto the orthogonal of this guy, they're going to all project to the same point here, which means that the empirical variance is going to be 0.
Now, this is an extreme case. This will never happen in practice, because if that happens, well, I mean, you can basically figure that out very quickly. So in the same way, it's very unlikely that you're going to have u transpose sigma u, which is equal to 0, which means that, essentially, all your points are [INAUDIBLE] or let's say all of them are orthogonal to u, right? So it's exactly the same thing. It just says that in the population case, there's no probability that your points deviate from this guy here. This happens with zero probability, OK?
And that's just because if you look at the variance of this guy, it's going to be 0. And then that means that there's no deviation. By the way, I'm using the name projection when I talk about u transpose X, right? So let's just be clear about this. If you-- so let's say I have a bunch of points, and u is a vector in this direction. And let's say that u has the-- so this is 0. This is u. And let's say that u has norm, 1, OK?
When I look, what is the coordinate of the projection? So what is the length of this guy here? Let's call this guy X1. What is the length of this guy? In terms of inner products? This is exactly u transpose X1. This length here, if this is X2, this is exactly u transpose X2, OK? So those-- u transpose X measure exactly the distance to the origin of those-- I mean, it's really-- think of it as being just an x-axis thing. You just have a bunch of points. You have an origin. And it's really just telling you what the coordinate on this axis is going to be, right?
So in particular, if the empirical variance is 0, it means that all these points project to the same point, which means that they have to be orthogonal to this guy. And you can think of it as being also maybe an entire plane that's orthogonal to this line, OK? So that's why I talk about projection, because the inner products, u transpose X, is really measuring the coordinates of X when u becomes the x-axis.
Now, if u does not have norm 1, then you just have a change of scale here. You just have a change of unit, right? So this is really u times X1. The coordinates should really be divided by the norm of u.
OK, so now, just in the same way-- so we're never going to have exactly 0. But if we [INAUDIBLE] the other end, if u transpose Su is large, what does it mean? It means that when I look at my points as projected onto the axis generated by u, they're going to have a lot of variance. They're going to be far away from each other in average, right? That's what large variance means, or at least large empirical variance means. And same thing for u.
So what we're going to try to find is a u that maximizes this. If I can find a u that maximizes this so I can look in every direction, and suddenly I find a direction in which the spread is massive, then that's a point on which I'm basically the less likely to have my points project onto each other and collide, right? At least I know they're going to project at least onto two points.
So the idea now is to say, OK, let's try to maximize this spread, right? So we're going to try to find the maximum over all u's of u transpose Su. And that's going to be the direction that maximizes the empirical variance. Now of course, if I read it like that for all u's in Rd, what is the value of this maximum?
It's infinity, right? Because I can always multiply u by 10, and this entire thing is going to multiplied by 100. So I'm just going to take u as large as I want, and this thing is going to be as large as I want, and so I need to constrain u. And as I said, I need to have u of size 1 to talk about coordinates in the system generated by u like this. So I'm just going to constrain u to have Euclidean norm equal to 1, OK?
So that's going to be my goal-- trying to find the largest possible u transpose Su, or in other words, empirical variance of the points projected onto the direction u when u is of norm 1, which justifies to use the word, "direction," and because there's no magnitude to this u.
OK, so how am I going to do this? I could just fold and say, let's just optimize this thing, right? Let's just take this problem. It says maximize a function onto some constraints. Immediately, the constraint is sort of nasty. I'm on a sphere, and I'm trying to move points on the sphere. And I'm maximizing this thing which actually happens to be convex. And we know we know how to minimize convex functions, but maximize them is a different question.
And so this problem might be super hard. So I can just say, OK, here's what I want to do, and let me give that to an optimizer and just hope that the optimizer can solve this problem for me. That's one thing we can do. Now as you can imagine, PCA is so well spread, right? Principal component analysis is something that people do constantly. And so that means that we know how to do this fast. So that's one thing.
The other thing that you should probably question about why-- if this thing is actually difficult, why in the world would you even choose the variance as a measure of spread if there's so many measures of spread, right? The variance is one measure of spread. It's not guaranteed that everything is going to project nicely far apart from each other. So we could choose the variance, but we could choose something else. If the variance does not help, why choose it? Turns out the variance helps.
So this is indeed a non-convex problem. I'm maximizing, so it's actually the same. I can make this constraint convex because I'm maximizing a convex function, so it's clear that the maximum is going to be attained at the boundary. So I can actually just fill this ball into some convex ball.
However, I'm still maximizing, so this is a non-convex problem. And this turns out to be the fanciest non-convex problem we know how to solve. And the reason why we know how to solve it is not because of optimization or using gradient-type things or anything of the algorithms that I mentioned during the maximum likelihood. It's because of linear algebra. Linear algebra guarantees that we know how to solve this.
And to understand this, we need to go a little deeper in linear algebra, and we need to understand the concept of diagonalization of a matrix. So who has ever seen the concept of an eigenvalue? Oh, that's beautiful. And if you're not raising your hand, you're just playing "Candy Crush," right? All right, so, OK.
This is great. Everybody's seen it. For my live audience of millions, maybe you have not, so I will still go through it. All right, so one of the basic facts-- and I remember when I learned this in-- I mean, when I was an undergrad, I learned about the spectral decomposition and this diagonalization of matrices. And for me, it was just a structural property of matrices, but it turns out that it's extremely useful, and it's useful for algorithmic purposes.
And so what this theorem tells you is that if you take a symmetric matrix-- well, with real entries, but that really does not matter so much. And here, I'm going to actually-- so I take a symmetric matrix, and actually S and sigma are two such symmetric matrices, right? Then there exists P and D, which are both-- so let's say d by d. Which are both d by d such that P is orthogonal. That means that P transpose P is equal to PP transpose is equal to the identity. And D is diagonal. And sigma, let's say, is equal to PDP transpose, OK?
So it's a diagonalization because it's finding a nice transformation. P has some nice properties. It's really just the change of coordinates in which your matrix is diagonal, right? And the way you want to see this-- and I think it sort of helps to think about this problem as being-- sigma being a covariance matrix. What does a covariance matrix tell you?
Think of a multivariate Gaussian. Can everybody visualize a three-dimensional Gaussian density? Right, so it's going to be some sort of a bell-shaped curve, but it might be more elongated in one direction than another. And then going to chop it like that, all right? So I'm going to chop it off. And I'm going to look at how it bleeds, all right? So I'm just going to look at where the blood is.
And what it's going to look at-- it's going to look like some sort of ellipsoid, right? In high dimension, it's just going to be an olive. And that is just going to be bigger and bigger. And then I chop it off a little lower, and I get something a little bigger like this. And so it turns out that sigma is capturing exactly this, right? The matrix sigma-- so the center of your covariance matrix of your Gaussian is going to be this thing. And sigma is going to tell you which direction it's elongated.
And so in particular, if you look, if you knew an ellipse, you know there's something called principal axis, right? So you could actually define something that looks like this, which is this axis, the one along which it's the most elongated. Then the axis along which is orthogonal to it, along which it's slightly less elongated, and you go again and again along the orthogonal ones.
It turns out that those things here is the new coordinate system in which this transformation, P and P transpose, is putting you into. And D has entries on the diagonal which are exactly this length and this length, right?
So that's just what it's doing. It's just telling you, well, if you think of having this Gaussian or this high-dimensional ellipsoid, it's elongated along certain directions. And these directions are actually maybe not well aligned with your original coordinate system, which might just be the usual one, right-- north, south, and east, west. Maybe I need to turn it. And that's exactly what this orthogonal transformation is doing for you, all right?
So, in a way, this is actually telling you even more. It's telling you that any matrix that's symmetric, you can actually turn it somewhere. And that'll start to dilate things in the directions that you have, and then turn it back to what you originally had. And that's actually exactly the effect of applying a symmetric matrix through a vector, right?
And it's pretty impressive. It says if I take sigma times v. Any sigma that's of this form, what I'm doing is-- that's symmetric. What I'm really doing to v is I'm changing its coordinate system, so I'm rotating it. Then I'm changing-- I'm multiplying its coordinates, and then I'm rotating it back. That's all it's doing, and that's what all symmetric matrices do, which means that this is doing a lot.
All right, so OK. So, what do I know? So I'm not going to prove that this is the so-called spectral theorem. And the diagonal entries of D is of the form, lambda 1, lambda 2, lambda d, 0, 0. And the lambda j's are called eigenvalues of D.
Now in general, those numbers can be positive, negative, or equal to 0. But here, I know that sigma and S are-- well, they're symmetric for sure, but they are positive semidefinite. What does it mean? It means that when I take u transpose sigma u for example, this number is always non-negative. Why is this true? What is this number?
It's the variance of-- and actually, I don't even need to finish this sentence. As soon as I say that this is a variance, well, it has to be non-negative. We know that a variance is not negative. And so, that's also a nice way you can use that. So it's just to say, well, OK, this thing is positive semidefinite because it's a covariance matrix. So I know it's a variance, OK? So I get this.
Now, if I had some negative numbers-- so the effect of that is that when I draw this picture, those axes are always positive, which is kind of a weird thing to say. But what it means is that when I take a vector, v, I rotate it, and then I stretch it in the directions of the coordinate, I cannot flip it. I can only stretch or shrink, but I cannot flip its sign, all right? But in general, for any symmetric matrices, I could do this.
But when it's positive symmetric definite, actually what turns out is that all the lambda j's are non-negative. I cannot flip it, OK? So all the eigenvalues are non-negative. That's a property of positive semidef. So when it's symmetric, you have the eigenvalues. They can be any number. And when it's positive semidefinite, in particular that's the case of the covariance matrix and the empirical covariance matrix, right? Because the empirical covariance matrix is an empirical variance, which itself is non-negative. And so I get that the eigenvalues are non-negative.
All right, so principal component analysis is saying, OK, I want to find the direction, u, that maximizes u transpose Su, all right? I've just introduced in one slide something about eigenvalues. So hopefully, they should help. So what is it that I'm going to be getting? Well, let's just see what happens.
Oh, I forgot to mention that-- and I will use this. So the lambda j's are called eigenvectors. And then the matrix, P, has columns v1 to vd, OK? The fact that it's orthogonal-- that P transpose P is equal to the identity-- means that those guys satisfied that vi transpose vj is equal to 0 if i is different from j. And vi transpose vi is actually equal to 1, right, because the entries of PP transpose are exactly going to be of the form, vi transpose vj, OK?
So those v's are called eigenvectors. And v1 is attached to lambda 1, and v2 is attached to lambda 2, OK? So let's see what's happening with those things. What happens if I take sigma-- so if you know eigenvalues, you know exactly what's going to happen. If I look at, say, sigma times v1, well, what is sigma? We know that sigma is PDP transpose v1.
What is P transpose times v1? Well, P transpose has rows v1 transpose, v2 transpose, all the way to vd transpose. So when I multiply this by v1, what I'm left with is the first coordinate is going to be equal to 1 and the second coordinate is going to be equal to 0, right? Because they're orthogonal to each other-- 0 all the way to the end. So that's when I do P transpose v1.
Now I multiply by D. Well, I'm just multiplying this guy by lambda 1, this guy by lambda 2, and this guy by lambda d, so this is really just lambda 1. And now I need to post-multiply by P. So what is P times this guy? Well, P is v1 all the way to vd. And now I multiply by a vector that only has 0's except lambda 1 on the first guy. So this is just lambda 1 times v1.
So what we've proved is that sigma times v1 is lambda 1 v1, and that's probably the notion of eigenvalue you're most comfortable with, right? So just when I multiply by v1, I get v1 back multiplied by something, which is the eigenvalue. So in particular, if I look at v1, transpose sigma v1, what do I get? Well, I get lambda 1 v1 transpose v1, which is 1, right? So this is actually lambda 1 v1 transpose v1, which is lambda 1, OK?
And if I do the same with v2, clearly I'm going to get v2 transpose sigma. v2 is equal to lambda 2. So for each of the vj's, I know that if I look at the variance along the vj, it's actually exactly given by those eigenvalues, all right? Which proves this, because the variance along the eigenvectors is actually equal to the eigenvalues. So since they're variances, they have to be non-negative.
So now, I'm looking for the one direction that has the most variance, right? But that's not only among the eigenvectors. That's also among the other directions that are in-between the eigenvectors. If I were to look only at the eigenvectors, it would just tell me, well, just pick the eigenvector, vj, that's associated to the largest of the lambda j's.
But it turns out that that's also true for any vector-- that the maximum direction is actually one direction which is among the eigenvectors. And among the eigenvectors, we know that the one that's the largest-- that carries the largest variance is the one that's associated to the largest eigenvalue, all right?
And so this is what PCA is going to try to do for me. So in practice, that's what I mentioned already, right? We're trying to project the point cloud onto a low-dimensional space, D prime, by keeping as much information as possible. And by "as much information," I mean we do not want points to collide.
And so what PCA is going to do is just going to try to project [? on two ?] directions. So there's going to be a u, and then there's going to be something orthogonal to u, and then the third one, et cetera, so that once we project on those, we're keeping as much of the covariance as possible, OK? And in particular, those directions that we're going to pick are actually a subset of the vj's that are associated to the largest eigenvalues.
So I'm going to stop here for today. We'll finish this on Tuesday. But basically, the idea is it's just the following. You're just going to-- well, let me skip one more.
Yeah, this is the idea. You're first going to pick the eigenvector associated to the largest eigenvalue. Then you're going to pick the direction that orthogonal to the vector that you've picked, and that's carrying the most variance. And that's actually the second largest-- the eigenvector associated to the second largest eigenvalue.
And you're going to go all the way to the number of them that you actually want to pick, which is in this case, d, OK? And wherever you choose to chop this process, not going all the way to d, is going to actually give you a lower-dimensional representation in the coordinate system that's given by v1, v2, v3, et cetera, OK? So we'll see that in more details on Tuesday. But I don't want to get into it now. We don't have enough time. Are there any questions?