The topic of this lecture is clustering for graphs, meaning finding sets of “related” vertices in graphs. The challenge is finding good algorithms to optimize cluster quality. Professor Strang reviews some possibilities.
Two ways to separate graph nodes into clusters
- k-means: Choose clusters, choose centroids, choose clusters, ...
- Fiedler vector: Eigenvector of graph Laplacian: \(\pm\) signs give 2 clusters
Related sections in textbook: IV.6–IV.7
Instructor: Prof. Gilbert Strang
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
GILBERT STRANG: OK. Now, clustering for graphs. So this is a topic-- this is one of the important things you can try to do with a graph. So you have a large graph. Let me kind of divide it into two clusters. So you've got a giant graph. And then the job is to make some sense out of it. And one possible step is to be able to subdivide it, if, as I see here, there's a cut between two reasonably equal parts of the graph-- reasonable-- reasonably same size.
And therefore, that graph could be studied in two pieces. So the question is, how do you find such a cut by an algorithm? What's an algorithm that would find that cut? So that's a problem. Let's say we're looking for two clusters. We could look for more clusters, but let's say we want to look for two clusters. So what are we trying to do? We're trying to minimize. So this is the problem, then. So we look for-- find positions x and y, let's say, two points which will be the centers, so to speak, of the-- and really, it's just these points that-- so the data is the points and the edges, as always-- the nodes and the edges.
So the problem is to find x and y so that-- to minimize. So it's the sum of the distances squared of the points ai from x-- maybe should emphasize we're in high dimensions-- plus the same for the other points. So the ai will be these points-- these nodes. And the bi will be these nodes, so plus the sum of bi minus y squared. And you understand the rule here-- that together the a's union the b's give me all nodes. And I guess to be complete, the a's intersect the b's is empty. Just what you expect. I'm dividing the nodes into two groups, the a's and the b's.
And I'm picking an x and a y sort of at the center of those groups, so that is a minimum. So I want to minimize. And also, I probably want to impose some condition that the number of a's is reasonably close to the number of b's. In other words, I don't want just that to be the a, and all the rest to be the b's. That would be not a satisfactory clustering. I'm looking for clusters that are good sized clusters. So minimize that. OK.
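As a minimal sketch of the quantity being minimized, the cost for a given split is \(\sum_i \|a_i - x\|^2 + \sum_i \|b_i - y\|^2\). The function name `two_cluster_cost` and the sample points are assumptions made up for illustration; they are not from the lecture.

```python
import numpy as np

def two_cluster_cost(a_points, b_points, x, y):
    """Return sum ||a_i - x||^2 + sum ||b_i - y||^2 for one fixed split."""
    a_points = np.asarray(a_points, dtype=float)
    b_points = np.asarray(b_points, dtype=float)
    cost_a = np.sum((a_points - x) ** 2)   # squared distances of the a's from x
    cost_b = np.sum((b_points - y) ** 2)   # squared distances of the b's from y
    return cost_a + cost_b

# Hypothetical data: two small groups of points in the plane.
a_points = np.array([[0.0, 0.0], [0.0, 2.0]])
b_points = np.array([[10.0, 0.0], [10.0, 2.0]])
x = np.array([0.0, 1.0])     # a candidate center for the a's
y = np.array([10.0, 1.0])    # a candidate center for the b's

cost = two_cluster_cost(a_points, b_points, x, y)
print(cost)
```

The balance condition on the sizes of the two groups is not part of this cost; it would have to be imposed separately.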
So there are a lot of different algorithms to do it. Some are more directly attacking this problem. Others use matrices that we associate with the graph. So let me tell you about two or three of those algorithms. And if you've seen-- studied-- had a course in graph theory, this-- you may already have seen this problem. First question would be, suppose I decide these are the a's, and those are the b's-- or some other decision. Yeah, probably some other decision. I don't want to solve the problem before I even start. So some a's and some b's.
What would be the best choice of the x once you've decided on the a's? And what would be the best choice of the y once you've decided on the b's? So we can answer that question if we knew the two groups. We could see where they should be centered, with the first group centered at x, the second group centered at y, and what does centering mean? So let's just say-- so I think what I'm saying here is-- let me bring that down a little.
So given the a's-- the a's-- this is a1 up to, say, ak. What is the best x just to make that part right? And the answer is-- do you know, geometrically, what x should be here? X is the-- so if I have a bunch of points, and I'm looking for the middle of those points-- the point x-- a good point x to say, OK, that's the middle. It'll make the sum of the distances, I think, squared-- I hope I'm right about that-- a minimum. What is x? It is the--
GILBERT STRANG: Centroid. Centroid is the word. X is the centroid of the a's. And what is the centroid? Let's see. Oh, maybe I don't know if x and y were a good choice, but let me see what-- I guess, it's the average a. It's the sum of the a's-- of these a's. Those are vectors, of course, divided by the number of a's. I think. Actually, I was just quickly reviewing this morning, so I'm not totally on top of this centroid.
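The claim that the average of the a's minimizes the sum of squared distances can be checked numerically. This is only a sketch: the random points and the helper name `sum_sq_dist` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a_points = rng.normal(size=(6, 3))     # six hypothetical points in 3 dimensions

def sum_sq_dist(points, x):
    """Sum over all points of the squared distance to x."""
    return np.sum((points - x) ** 2)

# The centroid is the sum of the a's divided by the number of a's.
centroid = a_points.sum(axis=0) / len(a_points)
best = sum_sq_dist(a_points, centroid)

# Moving the center away from the centroid never lowers the cost.
worse = [
    sum_sq_dist(a_points, centroid + rng.normal(scale=0.5, size=3))
    for _ in range(100)
]
print(best, min(worse))
```

The reason is the identity \(\sum_i \|a_i - x\|^2 = \sum_i \|a_i - c\|^2 + k\,\|x - c\|^2\), where \(c\) is the centroid and \(k\) is the number of points.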
What I'm going to talk-- the algorithm that I'm going to talk about is the k-- well, the k-means, it's always called. And here it will be the-- k will be 2. I just have two-- partitioning into two sets, a's and b's, so I just-- k is just 2. OK. What's the algorithm? Well, if I've chosen a partition-- the a's and b's have separated them-- then that tells me what the x and the y should be. But, now what do I do next? So is this going to be a sort of an alternating partition?
Now I take those two centroids. So step one is for given a's and b's, find the centroids x and y. And that's elementary. Then the second step is, given the centroids, x and y-- given those positions-- given x and y-- they won't be centroids when you see what happened. Given x and y, redo-- form the best partition-- best clusters.
So step one, we had a guess at what the best clusters were. And we found their centroids. Now, we start with the centroids, and we form new clusters again. And if these clusters are the same as the ones we started with, then the algorithm has converged. But probably they won't be-- these clusters. So you'll have to tell me what I mean by the best clusters. If I've got the two points, x and y, I want the points-- I want to separate all the points that cluster around x from the ones that cluster around y.
And then, they're probably different from my original start. So now I've got new-- now I repeat step one. But let's just think, how do I form the best clusters? Well, I take a point and I have to decide, does it go with x-- does it go in the x cluster-- or does it go in the cluster around y? So how do I decide that? Just whichever one it's closer to. So each point goes with-- I could say, each node goes with the closer of x and y.
So points that should have been-- that are closer to x-- now we're going to put them in the cluster around x. And does that solve the problem? No, because-- well, it might, but it might not. We'd have to come back to step one. We've now changed the clusters. They'll have different centroids. So we repeat step one-- find the centroids for the two new clusters. Then we come to step two.
Find the ones that should go with the two centroids, and back and forth. I don't know. I don't think there's a nice theory of convergence, or rate of convergence-- all the questions that this course is always asking. But it's a very popular algorithm, k-means. k would be to have k clusters. OK. So that's a-- I'm not going to discuss the-- I'd rather discuss some other ways to do this, to solve this problem. But that's one sort of hack that works quite well. OK.
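The two alternating steps can be sketched for k = 2 as follows. The function name `two_means`, the stopping rule for an empty cluster, and the sample points are assumptions for illustration, not something stated in the lecture.

```python
import numpy as np

def two_means(points, labels, max_iter=100):
    """Alternate step 1 (centroids) and step 2 (reassignment) for two clusters."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        # Step 1: given the a's (label 0) and b's (label 1), find centroids x and y.
        x = points[labels == 0].mean(axis=0)
        y = points[labels == 1].mean(axis=0)
        # Step 2: given x and y, each point goes with the closer of the two.
        dist_x = np.sum((points - x) ** 2, axis=1)
        dist_y = np.sum((points - y) ** 2, axis=1)
        new_labels = (dist_y < dist_x).astype(int)
        if np.array_equal(new_labels, labels):
            break                        # same clusters as before: converged
        if new_labels.min() == new_labels.max():
            break                        # assumed guard: never leave a cluster empty
        labels = new_labels
    return labels, x, y

# Hypothetical data: two obvious groups, with a deliberately bad starting partition.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
start = np.array([0, 1, 0, 1, 0, 1])
labels, x, y = two_means(points, start)
print(labels, x, y)
```

Each pass cannot increase the cost, since step 1 is optimal for fixed clusters and step 2 is optimal for fixed centers, but the final partition can depend on the starting guess.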
So second approach is what is coming next. Second solution method-- and it's called the spectral clustering. That's the name of the method. And before I write down what you do, what does the word spectral mean? You see spectral graph theory, spectral clustering. And in other parts of mathematics, you see that-- you see spectral theorem. I gave you the most-- and I described it as the most important-- perhaps-- theorem in linear algebra-- at least one of the top three. So I'll write it over here, because it's not-- it doesn't-- this is-- it's the same word, spectral.
Well, let me ask that question again? What's that word spectral about? What does that mean? That means that if I have a matrix, and I want to talk about its spectrum, what is the spectrum of the matrix? It is the eigenvalues. So spectral theory, spectral clustering is using the eigenvalues of some matrix. That's what that spectral is telling me. Yeah. So the spectral theorem, of course, is that for a symmetric matrix S, the eigenvalues are real, and the eigenvectors are orthogonal.
And don't forget what the real, full statement is here, because there could be repeated real eigenvalues. And what does the spectral theorem tell me for symmetric matrices, if lambda equals 5 is repeated four times-- if it's a four times solution of the equation that gives eigenvalues, then what's the conclusion? Then there are four independent, orthogonal eigenvectors to go with it. We can't say that about matrices-- about all matrices. But we can say it about all symmetric matrices.
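A small numerical check of that statement, as a sketch: build a symmetric matrix in which the eigenvalue 5 is repeated four times and ask NumPy's `numpy.linalg.eigh` for its eigenvectors. The matrix itself is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))        # a random orthogonal matrix
S = Q @ np.diag([5.0, 5.0, 5.0, 5.0, 1.0]) @ Q.T    # symmetric, lambda = 5 four times
S = (S + S.T) / 2                                    # remove rounding asymmetry

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(S)

print(np.round(eigenvalues, 10))
# The columns are orthonormal, including the four that share lambda = 5.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(5)))
```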
And in fact, those eigenvectors are orthogonal. So we're even saying more. We can find four orthogonal eigenvectors that go with a multiplicity-four eigenvalue. OK. That's spectral theorem. Spectral clustering starts with the graph Laplacian matrix. May I remember what that matrix is? Because that's the key connection of linear algebra to graph theory, is the properties of this graph Laplacian matrix. OK. So let me say L, for Laplacian.
So that matrix-- one way to describe it is as A transpose A, where A is the incidence matrix of the graph. Or another way we'll see is the D-- the degree matrix. That's diagonal. And I'll do an example just to remind you. Minus the-- well, I don't know what I'd call this one. Shall I call it B for the moment. And what matrix is B? That's the adjacency matrix. Really, you should know these four matrices. They're the key four matrices associated with any graph.
The incidence matrix, that's m by n-- edges and nodes-- edges and nodes. So it's rectangular, but I'm forming A transpose A here. So I'm forming a symmetric, positive semi-definite matrix. So this Laplacian is symmetric, positive semi-definite. Yeah. Let me just recall what all these matrices are for a simple graph. OK. So I'll just draw a graph. All right. OK. So the incidence matrix-- there are 1, 2, 3, 4, 5 edges-- so five rows. There are four nodes-- 1, 2, 3, and 4. So four columns. And a typical row would be edge 1 going from node 1 to node 2, so it would have a minus 1 and a 1 there.
And let me take edge 2, going from node 1 to node 3, so it would have a minus 1 and a 1 there, and so on. So that's the incidence matrix A. OK. What's the degree matrix? That's simple. The degree matrix-- well, A transpose A. This is m by n. This is n by m. So it's n by n. OK. In this case, 4 by 4. So the degree matrix will be 4 by 4, n by n. And it will tell us the degree of each node, which means we just count the edges. So node 1, three edges going in. Node 2, three edges going in. Node 3 has just two edges. And node 4 has just two edges.
So that's the degree matrix. And then the adjacency matrix that I've called B is also 4 by 4. And what is it? It tells us which node is connected to which node. So I don't allow nodes-- edges that connect a node to itself, so 0's on the diagonal. How many-- so which nodes are connected to node 1? Well, all of 2 and 4 and 3 are connected to 1. So I have 1's there. Node 2-- all three nodes are connected to node 2. So I'll have-- the second column and row will have all three 1's.
How about node 3? OK. Only edges-- only two edges are connected. Only two nodes are connected to 3, 1 and 2, but not 4. So 1 and 2 I have, but not 4. OK. So that's the adjacency matrix. Is that right? Think so. This is the degree matrix. This is the incidence matrix. And that formula gives me the Laplacian. OK. Let's just write down the Laplacian. So if I use the degree minus B-- that's easy. The degrees were 3, 3, 2, and 2. Now I have these minuses. And those were 0. OK. So that's a positive, semi-definite matrix.
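The four matrices for this small graph can be written out and the formula checked. This is a sketch: nodes are numbered from 0 in the code, and the order and direction of edges 3 to 5 are assumptions, since the lecture only spells out the first two rows of A. The Laplacian does not depend on those choices.

```python
import numpy as np

# Edges as (start, end) with 0-based nodes: 1-2, 1-3, 1-4, 2-3, 2-4 in lecture numbering.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
n = 4

# Incidence matrix A: one row per edge, -1 at the start node and +1 at the end node.
A = np.zeros((len(edges), n))
for row, (start, end) in enumerate(edges):
    A[row, start] = -1
    A[row, end] = 1

# Adjacency matrix B: 1 where two nodes are connected, 0 on the diagonal.
B = np.zeros((n, n))
for i, j in edges:
    B[i, j] = B[j, i] = 1

# Degree matrix D: diagonal, counting the edges at each node.
D = np.diag(B.sum(axis=1))

# Laplacian, computed both ways.
L = A.T @ A
print(L)
print(np.array_equal(L, D - B))
```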
Is it a positive definite matrix? So let me ask, is it singular or is it not singular? Is there a vector in its null space, or is there not a vector in its null space? Can you solve Dx equals all 0's? And of course, you can. Everybody sees that vector of all 1's will be a solution to L-- sorry. I should be saying L here. Lx equals 0.
Lx equals 0 has a whole line of solutions-- the null space of L has dimension 1. It's got 1 basis vector, 1, 1, 1, 1. And that will always happen with the graph setup that I've created. OK. So that's a first fact, that this positive semi-definite matrix, L, has lambda 1 equals 0. And the eigenvector is constant-- C, C, C, C-- the one dimensional eigenspace. Or 1, 1, 1, 1 is the typical eigenvector. OK.
Now back to graph clustering. The idea of graph clustering is to look at the Fiedler eigenvector. This is called x2-- it's the next eigenvector-- the eigenvector for the smallest positive eigenvalue, for lambda min excluding 0-- so the smallest nonzero eigenvalue of L and its eigenvector-- this is called the Fiedler vector, named after the Czech mathematician. A great man in linear algebra, and he studied this vector-- this situation. So everybody who knows about the graph Laplacian is aware that its first eigenvalue is 0, and that the next eigenvalue is important. Yeah.
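For the 4-node example, these facts can be confirmed directly. A brief sketch, with L typed in from the degrees 3, 3, 2, 2 and the adjacency pattern described above; the one-dimensional null space relies on the graph being connected, as this one is.

```python
import numpy as np

L = np.array([
    [ 3, -1, -1, -1],
    [-1,  3, -1, -1],
    [-1, -1,  2,  0],
    [-1, -1,  0,  2],
], dtype=float)

ones = np.ones(4)
print(L @ ones)                       # every row of L sums to zero

eigenvalues = np.linalg.eigvalsh(L)   # ascending order, real because L is symmetric
print(np.round(eigenvalues, 10))      # none negative, and exactly one is zero

rank = np.linalg.matrix_rank(L)       # rank 3 means a one-dimensional null space
print(rank)
```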
AUDIENCE: Is the graph Laplacian named the Laplacian because it has connections to--
GILBERT STRANG: To Laplace's equation, yes. Yeah, that's a good question. So why the word-- the name, Laplacian? So yeah, that's a good question. So the familiar thing-- so it connects to Laplace's finite difference equation, because we're talking about matrices here, and not derivatives-- not functions. So why the word Laplacian? Well, so if I have a regular-- if my graph is composed of-- so there is a graph with 25 nodes, and 4 times 5-- 20, about 40. This probably has about 40 edges and 25 nodes.
And of course, I can construct all those four matrices for it-- its incidence matrix, its degree matrix. So the degree will be four at all these inside points. The degree will be three at these boundary points. The degree will be two at these corner points. But the-- what will the matrix L look like? So what is L? And that will tell you why it has this name Laplacian. So the matrix L will have-- the degree, 4, will be on the diagonal. That's coming from D. The other-- the minus 1's that come from B, the adjacency matrix, will be associated with those nodes, and otherwise, all 0's.
So this is a typical row of L. This is a typical row of L centered at that node. So maybe that's node number 5, 10, 13. That's 13 out of 25 that would show you this. And the-- sorry. Those are minus 1's. Minus 1's. So a 4 on the diagonal, and four minus 1's. That's the model problem for when the graph is a grid-- square grid. And do you associate that with Laplace's equation? So this is the reason that Laplace-- why Laplace gets in it. Because Laplace's equation-- the differential equation-- is the second derivative with respect to x, plus the second derivative with respect to y, is 0.
And what we have here is Lu equals 0. It's the discrete Laplacian, the vector Laplacian, the graph Laplacian-- where the second x derivative is replaced by -1, 2, -1. And the second y derivative is replaced by -1, 2, -1. Second differences in the x and the y directions. So that's-- yeah. So that's the explanation for Laplace. It's the discrete Laplace-- discrete, or the finite difference Laplace. OK.
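One way to see the second differences in both directions is to assemble the 5 by 5 grid Laplacian from the Laplacian of a 5-node path. This is a sketch that assumes row-by-row node numbering, so that node 13 of 25 is the center; the Kronecker-sum construction is a standard fact about grid graphs rather than something done in the lecture.

```python
import numpy as np

def path_laplacian(n):
    """Laplacian of a path with n nodes: rows of -1, 2, -1, with 1 at the two ends."""
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    T[0, 0] = T[-1, -1] = 1          # the end nodes have only one neighbor
    return T

n = 5
T = path_laplacian(n)
I = np.eye(n)

# One Kronecker term gives second differences in x, the other in y.
L = np.kron(T, I) + np.kron(I, T)

center = 12                           # node 13 in the lecture's 1-based counting
print(L[center, center])              # the 4 on the diagonal
print(np.flatnonzero(L[center] == -1))    # the four neighbors carrying -1

num_edges = int(np.trace(L) / 2)      # sum of the degrees is twice the edge count
print(num_edges)
```

The diagonal of `L` shows the degrees mentioned above: 4 at interior nodes, 3 along the boundary, 2 at the corners.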
Now to just finish, I have to tell you what the-- what clusters-- how do you decide the clusters from L? How does L propose two clusters, the a's and b's? And here's the answer. They come from this eigenvector-- the Fiedler eigenvector. You look at that eigenvector. It's got some positive and some negative components. The components with positive numbers of this eigenvector-- so the positive components of x-- of-- this eigenvector.
And there are negative components of this eigenvector. And those are the two clusters. So it's-- the cluster is-- the two clusters are decided by the eigenvector-- by the signs-- plus or minus signs of the components. The plus signs go in one and the minus signs go in another. And you have to experiment to see that that would succeed. I don't know what it would do on this, actually, because that's hardly split up into two. I suppose maybe the split is along a line like that or something, to get-- I don't know what clustering.
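On a graph that does have a natural cut, the sign rule is easy to try out. The sketch below uses a made-up example, two triangles joined by a single edge, and a hypothetical helper `laplacian`. Which triangle gets the plus signs is arbitrary, because an eigenvector is only determined up to sign.

```python
import numpy as np

def laplacian(n, edges):
    """Graph Laplacian L = D - B built edge by edge."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return L

# Hypothetical graph: triangle 0-1-2, triangle 3-4-5, and one bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
L = laplacian(6, edges)

# eigh sorts eigenvalues in ascending order: column 0 is the constant vector,
# column 1 is the Fiedler vector.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]

cluster_a = np.flatnonzero(fiedler > 0)   # nodes with plus signs
cluster_b = np.flatnonzero(fiedler < 0)   # nodes with minus signs
print(np.round(fiedler, 3))
print(cluster_a, cluster_b)
```

If a component of the Fiedler vector is exactly zero, or if the second eigenvalue is repeated as it is for the square grid, the signs alone do not settle the split and some further rule is needed.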
This is not a graph that is naturally clustered, but you could still do k-means on it. You could still do spectral clustering. And you would find this eigenvector. Now what's the point about this eigenvector? I'll finish in one moment. What do we know about that eigenvector as compared to that one? So here was an eigenvector all 1's. Let me just make it all 1's, 1, 1, 1, 1. In that picture, it's 25 1's. Here's the next eigenvector up. And what's the relation between those two eigenvectors of L? They are--
GILBERT STRANG: Orthogonal. These are eigenvectors of a symmetric matrix. So they're orthogonal. So that means-- to be orthogonal to this guy means that your components add to 0, right? A vector is orthogonal to all 1's-- that dot product is just, add up the components. So we have a bunch of positive components and a bunch of negative components. They have the same sum, apart from the sign, because the dot product with that is 0.
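A quick numerical check of that balance, sketched on a made-up example, the path graph with 6 nodes:

```python
import numpy as np

n = 6
# Laplacian of a path: -1, 2, -1 rows, with degree 1 at the two end nodes.
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1

eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]          # eigenvector for the smallest nonzero eigenvalue

dot_with_ones = np.ones(n) @ fiedler  # the same as adding up the components
positive_sum = fiedler[fiedler > 0].sum()
negative_sum = fiedler[fiedler < 0].sum()
print(dot_with_ones, positive_sum, negative_sum)
```

The positive and negative parts cancel, but that does not force the two clusters to contain the same number of nodes.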
And those two components-- those two sets of components are your-- to tell you the two clusters in spectral clustering. So it's a pretty nifty algorithm. It does ask you to compute an eigenvector. And that, of course, takes time. And then there's a third, more direct algorithm to do this optimization problem. Well, actually, there are many. This is an important problem, so there are many proposed algorithms. Good. OK. I'm closing up. Final question. Yeah?
AUDIENCE: Is it possible to do more than two clusters?
GILBERT STRANG: Well, certainly for k-means. Now, if I had to do three clusters with Fiedler, I would look at the first three eigenvectors. And, well, the first one would be nothing. And I would look at the next two. And that would be pretty successful. If I wanted six clusters, it probably would fall off in the quality of the clustering. Yeah. But that certainly-- I would look at the lowest six eigenvectors, and get somewhere. Yeah. Right. So OK. So that's a topic-- an important topic-- a sort of standard topic in applied graph theory. OK.
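For three clusters, one possible reading of "look at the next two" is to use eigenvectors 2 and 3 as coordinates for each node and then group the nodes in that plane. The sketch below does the grouping with k-means, which is a common choice but an assumption here, since the lecture does not say how to turn the two eigenvectors into clusters. The graph (three 5-node cliques joined in a chain), the helper names, and the farthest-point initialization are all made up for illustration.

```python
from itertools import combinations
import numpy as np

def laplacian(n, edges):
    """Graph Laplacian L = D - B built edge by edge."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return L

def k_means(points, k, max_iter=100):
    """Plain k-means with an assumed deterministic farthest-point start."""
    chosen = [0]
    for _ in range(k - 1):
        gaps = np.min([np.sum((points - points[c]) ** 2, axis=1) for c in chosen], axis=0)
        chosen.append(int(np.argmax(gaps)))
    centroids = points[chosen]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute centroids; keep the old one if a cluster happens to be empty.
        centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
    return labels

# Hypothetical graph: cliques on nodes 0-4, 5-9, 10-14, bridged by edges 4-5 and 9-10.
cliques = [range(0, 5), range(5, 10), range(10, 15)]
edges = [e for clique in cliques for e in combinations(clique, 2)]
edges += [(4, 5), (9, 10)]
L = laplacian(15, edges)

_, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, 1:3]      # skip the constant vector, keep the next two
labels = k_means(embedding, 3)
print(labels)
```

The label numbers themselves carry no meaning; only which nodes share a label matters.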
So see you Wednesday. I'm hoping, on Wednesday-- so Professor Edelman has told me a new and optimal way to look at the problem of backpropagation. Do you remember backpropagation? If you remember the lecture on it-- you don't want to remember the lecture on it. It's a tricky, messy thing to explain. But he says, if I explain it using Julia in linear algebra, it's clear. So we'll give him a chance on Wednesday to show that revolutionary approach to the explanation of backpropagation.
And I hope for-- I told him he could have half an hour, and projects would take some time. I hope-- now we've had two with wild applause. So I hope we get a couple more in our last class. OK. See you Wednesday. And if you bring-- well, if you have projects ready, send them to me online, and maybe a print out as well. That would be terrific. If you don't have them ready by the hour, they can go-- the envelope outside my office would receive them. Good. So I'll see you Wednesday for the final class.