This lecture continues the review of distance matrices. Professor Strang then introduces the Procrustes problem, which looks for the orthogonal matrix that swings one set of vectors as nearly as possible onto a second set.
Distance problem: Find positions \(x\) from distances between them.
Necessary and sufficient: Distances satisfy triangle inequality.
Procrustes: Given \(n\) vectors \(x\) and \(n\) vectors \(y\).
Find the orthogonal matrix \(Q\) so that \(Qx\)'s are closest to \(y\)'s.
Related sections in textbook: IV.9 and IV.10
Instructor: Prof. Gilbert Strang
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
GILBERT STRANG: OK. Just as we're getting started, I thought I'd add a few words about a question that came up after class. Suppose, in that discussion last time, where you were given three-- you were given a distance matrix-- you were given the distance between x1 and x2, between x2 and x3, and between x1 and x3, and you wanted to find points that satisfied that.
Well, we're going to fail on this example, because if the distance here is 1, the distance here is 1, then by the triangle inequality, the distance from x1 to x3 could not be more than 2. And when we square it, it could not be more than 4. And here it's 6. So what's going to happen? What goes wrong in that case? Yeah. I hadn't commented on that, and I'm not sure that the paper that I referenced does so.
So I had to do a little search back in the literature, because people couldn't overlook this problem. So this is the triangle inequality fails. And it's not going to help to go into 10 dimensions, because the triangle inequalities doesn't change. And it's still there in 10 dimensions. And we're still failing. So what happens? Well, what could happen? Do you remember?
And you'll have to remind me, the key equation. You remember, we had an equation connecting the-- so what is the matrix D for this problem? So D is-- this is a 3 by 3 matrix with these distances squared. And it was convenient to use distances squared, because that's what comes into the next steps. So of course, the distance from each x to itself is zero.
The distance from x distance squared was that. This one was that. But this one is 6. OK. So that's the distance matrix. And we would like to find-- the job was to find-- and I'm just going to write down, we cannot find x1, x2, and x3 to match those distances. So what goes wrong? Well, there's only one thing that could go wrong. When you connect this distance matrix D to the matrix X transpose X-- you remember the position matrix-- maybe I called it G? This is giving-- so Gij is the dot product of xi with xj.
Make that into a j. Thank you. So Gij is the matrix of dot product. And the great thing was that we can discover what that matrix-- that matrix G comes directly from D-- comes directly from D. And of course, what do we know about this matrix of cross products? We know that is positive semidefinite. So what goes wrong? Well, just in a word, when we write out that equation and discover what G is, if the triangle inequality fails, we learn that G doesn't come out positive definite.
That's really all I want to say. And I could push through the example. G will not come out positive definite if D-- if that's D because it can't. If it came out positive definite, then we could find an X. So if we had the G, then the final step, you remember, is to find an X. Well we know that if G is positive semidefinite, there are multiple ways to find an X.
This is positive semidefinite matrices is what you get out of X transpose X's. And we can find an x given a G. We can find G given an x. So it has to be that this won't be true-- that the matrix G that comes out of that equation will turn out not to be positive definite. So it's really quite nice. It's a beautiful little bit of mathematics, that if, and only if, the triangle inequality is satisfied by these numbers-- if and only if-- then the matrix in the D matrix-- then the G matrix that comes out of this equation-- which I haven't written-- is positive semidefinite.
If the triangle inequality is OK, we can find the points. If the triangle inequality is violated-- like here-- then the matrix G is not positive semidefinite, has negative eigenvalues, and we cannot find the point. Yeah. I could recall the G matrix but-- the G equation, but it's coming to you in the two page section that does distance matrices. OK. That's just-- I should have made a point. It's nice to have specific numbers.
And I could get the specific numbers for G, and we would see, no way. It's not positive definite. OK. So that's just tidying up last time. I have another small problem to talk about, and then a big question of whether deep learning actually works. I had an email from an expert last night, which changed my view of the world about that question, as you can imagine. The change in my world was, I had thought the answer was yes, and I now think the answer is no. So that's like rather a big issue for 18.065.
But we'll-- let's see about that later. OK. Now Procrustes' problem. So Procrustes-- and it's included in the notes-- that name comes from a Greek myth. Are you guys into Greek myths? So what was the story of Procrustes? Was it Procrustes who adjusted the length of his-- so he had a special bed. Procrustes' bed-- certain length. And then, he had visitors coming. And instead of adjusting the length of the bed to fit the visitor, Procrustes adjusted the length of the visitor to fit the bed. So either stretched the visitor or chopped off part of the visitor.
So anyway-- the Greeks like this sort of thing. OK. So anyway, that's a Greek myth for 18.065. OK. So the whole idea, the Procrustes problem, is to make something fit something else. So the two things are-- so suppose I'm just in three dimensions and I have two vectors here. So I have a basis for a two dimensional space. And over here I have-- people-- space scientists might have one computation of the positions of satellites.
Then, of course, they wouldn't be off by as much as this figure shows. But then they have another computation using different coordinates. So it partly rotated from this picture, but also it's partly got round off errors and error in it between the two. So the question is, what's the best orthogonal transformation? So this is a bunch of vectors, x1, x2, to xn, let's say. And I want to modify them by an orthogonal matrix-- maybe I'd do it on the other side. I think I do. Yeah.
Q, to be as close as possible to this other set, y1, y2 up to yn. So let me just say it again. I have two sets of vectors. And I'm looking, and they're different-- like those two sets. And I'm looking for the orthogonality matrix that, as well as possible, takes this set into this one. Of course, if this was an orthogonal basis, and this was an orthogonal basis, then we would be home free. Q-- we could get equality. We could take an orthogonal basis directly into an orthogonal basis with a orthogonal matrix Q.
In other words, if x was an orthogonal matrix, and y was an orthogonal matrix, we would get the exact correct Q. But that's not the case. So we're looking for the best possible. So that's the problem there-- minimize over orthogonal matrix-- matrices Q. And I just want to get my notation to be consistent here. OK. So I've-- I see that starting with the y's and mapping them to x's-- so let me ask the question. What orthogonal matrix Q multiplies the y's to come as close as possible to the x's?
So over all orthogonal Q's I want to minimize YQ minus X in the Frobenius norm. And I might as well square it. So Frobenius-- we're into the Frobenius norm. Remember the-- of a matrix? This is a very convenient norm in data science, to measure the size of a matrix. And we have several possible formulas for it. So let me call the matrix A. And the Frobenius norm squared-- so what's one expression, in terms of the entries of the matrix-- the numbers Aij in the matrix?
The Frobenius norm just treats it like a long vector. So it's a11 squared, plus a12 squared, of all the way along the first plus second row, just-- I'll say nn squared. OK. Sum of all the squares-- just treating it like a long vector. OK. This-- but that's a awkward expression to write down. So what other ways do we have to find the Frobenius norm of a matrix?
Let's see. I could look at this as A transpose A. Is that right? A transpose A. So what what's happening there? Remind me what-- yeah. I would get all that. I would get all these by taking the matrix A transpose times A. But what-- sorry. I'm not-- I haven't-- I've lost my thread of talk here. So here's-- oh, and then I take the trace, of course. So that first row-- first column of A times that one will give me the-- one set of squares. And then, that one times the other, and the next one, will give me the next set of squares, right?
So this is going to-- if I look at the trace-- so now, let me. So I just want to look at the diagonal here. So it's the trace. You remember, the trace of a matrix-- of a matrix M is the sum down the diagonal M11, M22, down to Mnn. It's the diagonal sum. And-- everybody with me here now? So that term on the diagonal-- A transpose A-- gives me all of that. Then-- or maybe I should be doing AA transpose. The point is, it doesn't matter.
Or the trace of AA transpose. That would be-- those would both give the correct Frobenius norm squared. So traces are going to come into this little problem. Now there's another formula for the Frobenius norm-- even shorter-- well, certainly shorter than this one-- involving a sum of squares. And what's that one? What's the other way to get the same answer? If I look at the SVD-- look at singular values. I think that this is also equal to the sum square of all the singular values.
So it's three nice expressions for the Frobenius norm. The nice ones involve A transpose A, or AA transpose. And of course, that connects to the singular values, because what are-- what's the connection between singular values and those-- and these guys-- A transpose A, or AA transpose? The singular values are the-- or the singular values squared are the--
GILBERT STRANG: Eigenvalues of A transpose A. And then when I add up the trace, I'm adding up the eigenvalues and that's the-- that gives me the Frobenius norm squared. So this is a-- that tells us something important, which we can see in different ways, that the-- so to solve this problem, we're going to need various facts, like the QA in the Frobenius norm is the same as A in the Frobenius norm. Why is that? Why? So here I'm multiplying every column by the matrix Q. What happens to the length of the column when I multiply it by q?
AUDIENCE: It doesn't change.
GILBERT STRANG: Doesn't change. So I could add up the length of the columns all squared. Here I wrote it in terms of rows. But I could have reordered that, and got it in terms of columns. That's because the length of Q times any vector squared is the same as the vector squared. And these-- take these to be the columns of A. So for column by column, the multiplication by Q doesn't change the length. And then when I add up all the columns squared, I get the Frobenius norm squared.
And another way to say it-- let's make that connection between this fact-- that Q didn't change the Frobenius norm-- and this fact, that the Frobenius norm is expressed in terms of the sigmas. So what does Q do to the sigmas? I want to see in another way the answer to why. So if I have a matrix A with singular values, I multiply by Q-- what happens to the singular values?
AUDIENCE: Don't change.
GILBERT STRANG: Don't change. Don't change. That's the key point about singular values. If I multiply-- so A has a SVD, U sigma V transpose. And QA will have the SVD QU sigma V transpose. So all I've changed when I multiply by Q-- all I changed was the first factor-- the first orthogonal factor in the SVD. I didn't change the sigmas. They're still sitting there. So-- and of course, I could do also A on the other side-- different Q. Same Q or a different Q on the other side would show up here, and would not change the sigmas, and therefore would not change the Frobenius norm.
So these are important properties of this Frobenius norm. It's a-- it looks messy to write down in that form, but it's much nicer in these forms and in that form. OK. So now, if I can just-- then we saw that it involves traces. So let me make a few observations about traces. So I'll just-- we want to be able to play with traces, and that's something we really haven't done. Here's a fact-- that the trace of A transpose B is equal to the trace of B transpose A.
Of course, if B is A, it's clear, and it's equal to the trace of BA transpose. So even do little changes in your matrix without changing the trace. Let's see why one of these is true. Why is that first statement true? How is that matrix related to this matrix?
AUDIENCE: [INAUDIBLE] transpose.
GILBERT STRANG: It's just a transpose. If I take the transpose of that matrix, I get that. So what happens to the trace? I'm adding down the diagonal. The transpose has no effect. Clearly, this is just a fact that the trace doesn't change-- is not changed when you transpose a matrix, because the diagonal is not changed. Now what about this guy? I guess we're getting back to old fashioned 18.065, remembering facts about linear algebra, because this is a pure linear algebra.
So what's this one about? This says that I can reverse the order of two matrices. So I'm now looking at the connection between those two. And so let me just-- to use different letters-- CD equals the trace of DC. I can flip the order. That's all I've done here is. I've reversed B with A transpose. I reversed C with D. So why is that true? Why is that true? Well, how shall we see the truth of that fact?
So these are really convenient facts, that make a lot of people use the trace more often than we have in 18.065. I'm not a big user of arguments based on trace, but these are identities that go a long way with many problems. So let's see why that's true. Any time you think about trace, you've got two languages to use. You can use the eigenvalues. It's the sum of the eigenvalues. Or you can use the diagonal entries, because it's the sum of the diagonal entries.
Let's use eigenvalues. How are the eigenvalues of CD related to the eigenvalues of DC? They're the same. If these matrices are rectangular, then there might be some extra zero eigenvalues, because they would have different shapes. But zeros are not going to affect the trace. So this is the same nonzero eigenvalues. OK. And so on. Yeah. OK. Let me just-- let me try to tell you the steps now to get the correct Q. And let me tell you the answer first.
And I'm realizing that all important question four-- does deep learning actually work? We're going to run out of time today, because we only have a few minutes left. I suggest we bring that question back up, because it's pretty important to a lot of people. There's-- I had lunch with Professor Edelman, and he said, you know, deep learning and neural nets have had a record amount of publicity and hype for sort of computational algorithm.
And-- but I had-- I've had people now tell me that typical first-- if you create a network-- using Alex's design, for example-- the chances are that it won't be successful-- that the successful networks have been worked on, and experimented with. And a good structure has emerged, but didn't-- wasn't there at the start. So I think that's a topic for Monday. And I'm really just realizing, from talking to people in the field, that it's by no means automatic.
That structure-- even if you put in a whole bunch of layers-- it may not be what you want. OK. So I'm-- let me finish this argument today. Let me give you the answer. So what's the good Q? I have matrices Y and X. And the idea is that I take it-- I look at Y transpose X. So that'll be all the dot products of one set of vectors or the other set of vectors. That's a matrix. And I do its SVD-- U sigma V transpose. So multiply this. Multiply Y-- the two bases that you're given.
Of course, if Y was the same as X-- if it was an orthogonal basis-- you'd have the identity, no questions. But generally, we have-- it has an SVD. And we're looking for a orthogonal matrix of the best Q is-- Da dun da duh. I mean, it's the right time for expressions of amazement. It is UV transpose. OK. So that gives us the answer. We're given X and Y. We're looking for the best Q. And the answer comes in the simplest possible way. Compute Y transpose X. Compute its SVD, and use the orthogonal matrices from the SVD. Yeah.
And I'm out of time so proof-- it's [INAUDIBLE] line later-- either to just send you the section online, or to discuss it in class Monday. But I'm really planning Monday to start with question 4. And meanwhile to ask a whole lot of people-- everybody I can find-- about that important question, is-- does deep learning usually work? How-- what can you do to make sure it works, or give yourself a better chance to have it work? So let's-- that's up for Monday then. Good.