1 00:00:00,090 --> 00:00:02,430 The following content is provided under a Creative 2 00:00:02,430 --> 00:00:03,820 Commons license. 3 00:00:03,820 --> 00:00:06,050 Your support will help MIT OpenCourseWare 4 00:00:06,050 --> 00:00:10,160 continue to offer high quality educational resources for free. 5 00:00:10,160 --> 00:00:12,690 To make a donation or to view additional materials 6 00:00:12,690 --> 00:00:16,610 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,610 --> 00:00:17,261 at ocw.mit.edu. 8 00:00:21,085 --> 00:00:21,960 PROFESSOR: All right. 9 00:00:21,960 --> 00:00:28,120 So we're going to now show some of the examples. 10 00:00:28,120 --> 00:00:34,230 And so again, they're in our directory here, Examples. 11 00:00:34,230 --> 00:00:38,010 We're now in section 2, so we're going to go to Applications. 12 00:00:38,010 --> 00:00:42,849 And we have this section called Perfect Power Law. 13 00:00:42,849 --> 00:00:44,640 The actual example codes are all these ones 14 00:00:44,640 --> 00:00:46,210 called PPL 1, 2, 3, and 4. 15 00:00:46,210 --> 00:00:49,179 We have actually a lot of other functions in here, 16 00:00:49,179 --> 00:00:50,970 which are actually really useful functions. 17 00:00:50,970 --> 00:00:53,480 They're the functions that do all this stuff. 18 00:00:53,480 --> 00:00:56,560 And maybe they should actually get folded into the main D4M 19 00:00:56,560 --> 00:00:57,130 distribution. 20 00:00:57,130 --> 00:00:59,230 They're here right now. 21 00:00:59,230 --> 00:01:01,510 They probably could be cleaned up a little bit 22 00:01:01,510 --> 00:01:02,680 as part of the homework. 23 00:01:02,680 --> 00:01:05,330 But these are, again, very useful functions 24 00:01:05,330 --> 00:01:09,510 for allowing you to generate and fit these types of things. 25 00:01:09,510 --> 00:01:14,786 So I have already started my MATLAB session. 26 00:01:14,786 --> 00:01:15,570 There we are. 27 00:01:15,570 --> 00:01:18,050 So we will do the first one.
28 00:01:35,360 --> 00:01:38,463 Going along here. 29 00:01:38,463 --> 00:01:38,963 Figures. 30 00:02:10,370 --> 00:02:12,230 So if we go and look at what we did here-- 31 00:02:12,230 --> 00:02:18,110 so the first thing we did is we set our parameters 32 00:02:18,110 --> 00:02:20,900 for our Perfect Power Law fit. 33 00:02:20,900 --> 00:02:23,420 So we set alpha at 1.3, a D max of 1,000, 34 00:02:23,420 --> 00:02:26,660 and then approximately 30 bins. 35 00:02:26,660 --> 00:02:28,340 We called our function here, which 36 00:02:28,340 --> 00:02:31,150 is in this directory, which is Power Law Distribution, which 37 00:02:31,150 --> 00:02:33,680 will create a power law distribution from these three 38 00:02:33,680 --> 00:02:34,970 parameters. 39 00:02:34,970 --> 00:02:39,000 And then the first thing we did is just plot a distribution. 40 00:02:39,000 --> 00:02:43,280 So we go here, then we go to Figure 1. 41 00:02:43,280 --> 00:02:45,260 You can see there is that distribution. 42 00:02:45,260 --> 00:02:48,810 So our Perfect Power Law distribution 43 00:02:48,810 --> 00:02:52,170 with these parameters, very nicely done. 44 00:02:55,130 --> 00:02:58,690 Then moving along, we can compute. 45 00:02:58,690 --> 00:03:02,104 This is how you compute the total number of vertices 46 00:03:02,104 --> 00:03:03,020 from the distribution. 47 00:03:03,020 --> 00:03:04,087 We just sum. 48 00:03:04,087 --> 00:03:05,670 It could be the total number of edges. 49 00:03:05,670 --> 00:03:07,030 We can sum there. 50 00:03:07,030 --> 00:03:09,960 And then we see that the total vertices 51 00:03:09,960 --> 00:03:18,960 was 18,187, 84,000 edges with an edge to vertex ratio of 4.6. 52 00:03:18,960 --> 00:03:23,300 We now have a function called edges from distribution. 
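As a rough illustration of the steps just described (build a power law degree distribution from alpha, a maximum degree, and a bin count, then sum it to get the vertex and edge totals), here is a minimal Python sketch. It is not the D4M MATLAB function used in the demo: the name `power_law_distribution`, the log-spaced binning, and the normalization (one vertex at the maximum degree) are all assumptions, so it should not be expected to reproduce the 18,187-vertex total quoted above.

```python
def power_law_distribution(alpha, d_max, n_bins):
    """Return (degrees, counts) with counts roughly proportional to d**-alpha.

    Hypothetical stand-in for the demo's generator, not the D4M code.
    """
    # Log-spaced candidate degrees from 1 to d_max, rounded to integers and
    # de-duplicated, so the low end collapses onto integer bins.
    degrees = sorted({round(d_max ** (i / (n_bins - 1))) for i in range(n_bins)})
    # Assumed normalization: exactly one vertex has the maximum degree.
    counts = [max(1, round((d_max / d) ** alpha)) for d in degrees]
    return degrees, counts


degrees, counts = power_law_distribution(alpha=1.3, d_max=1000, n_bins=30)

# Total vertices: sum of the counts. Total edges: sum of degree * count.
n_vertices = sum(counts)
n_edges = sum(d * n for d, n in zip(degrees, counts))
edge_to_vertex_ratio = n_edges / n_vertices
```

The two sums are the only part taken directly from the lecture; everything about how the bins are chosen is a guess made for illustration.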
53 00:03:23,300 --> 00:03:25,010 So if I pass in the distribution-- 54 00:03:25,010 --> 00:03:27,240 essentially the degrees and the counts-- 55 00:03:27,240 --> 00:03:29,816 it will then generate a set of vertices 56 00:03:29,816 --> 00:03:31,565 that is consistent with that distribution. 57 00:03:34,440 --> 00:03:38,820 And now I'm going to create some permutations, basically, 58 00:03:38,820 --> 00:03:39,950 of these vertices. 59 00:03:39,950 --> 00:03:44,620 I'm going to create a-- randomly permute the edge 60 00:03:44,620 --> 00:03:46,670 order with these two sets here. 61 00:03:46,670 --> 00:03:49,210 So if I make a random permutation 62 00:03:49,210 --> 00:03:53,960 of the number of edges, that allows me to permute the edges. 63 00:03:53,960 --> 00:03:57,600 And I can also randomly permute the vertex labels themselves, 64 00:03:57,600 --> 00:04:01,400 and then I can look at these different permutations 65 00:04:01,400 --> 00:04:04,250 to see what kind of graphs they've created. 66 00:04:04,250 --> 00:04:08,930 So if I don't permute the data, I just have a0 here. 67 00:04:08,930 --> 00:04:11,530 I just pass in the vertices that were created. 68 00:04:11,530 --> 00:04:15,720 I create a non-permuted adjacency matrix. 69 00:04:15,720 --> 00:04:17,670 I can then look at that. 70 00:04:21,269 --> 00:04:23,560 And that just shows you-- this is the adjacency matrix. 71 00:04:23,560 --> 00:04:24,320 This is in the plot. 72 00:04:24,320 --> 00:04:26,570 It just shows you that I just created a vertex that's 73 00:04:26,570 --> 00:04:28,640 entirely self loops. 74 00:04:28,640 --> 00:04:30,557 Not very interesting, but nevertheless, 75 00:04:30,557 --> 00:04:32,390 completely consistent with the perfect power 76 00:04:32,390 --> 00:04:34,230 law degree distribution.
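To make the self-loop result concrete, here is a small Python sketch of turning a degree distribution into per-edge vertex labels. The function name `edges_from_distribution`, the zero-based labels, and the tiny example distribution are assumptions for illustration; the actual D4M routine may be organized differently.

```python
def edges_from_distribution(degrees, counts):
    """Return one vertex label per edge, highest-degree vertices first.

    Hypothetical sketch: a vertex of degree d appears d times in the list.
    """
    vertex_per_edge = []
    label = 0
    # Emit the highest-degree vertices first, as the lecture says the
    # generator does.
    for d, n in sorted(zip(degrees, counts), reverse=True):
        for _ in range(n):
            vertex_per_edge.extend([label] * d)
            label += 1
    return vertex_per_edge


# Tiny made-up distribution: 4 vertices of degree 1, 2 of degree 2, 1 of degree 4.
degrees, counts = [1, 2, 4], [4, 2, 1]
v = edges_from_distribution(degrees, counts)

# Without any permutation, start and end vertex of every edge coincide,
# so every edge is a self loop on the diagonal of the adjacency matrix.
unpermuted_edges = list(zip(v, v))
```

Using the same list for both endpoints is what puts every entry on the diagonal; the degree distribution is still exactly the one that was passed in.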
77 00:04:34,230 --> 00:04:41,600 So likewise, if I just permute the edges-- so basically, 78 00:04:41,600 --> 00:04:43,280 I take my permutation here, and I just 79 00:04:43,280 --> 00:04:49,060 permute that-- this is just not permuting the vertex labels, 80 00:04:49,060 --> 00:04:52,364 but just permuting the edges. 81 00:04:52,364 --> 00:04:55,620 To the next figure here, 3, and we 82 00:04:55,620 --> 00:04:57,430 get this distribution-- basically 83 00:04:57,430 --> 00:04:58,770 start vertex, end vertex. 84 00:04:58,770 --> 00:05:00,470 Looks fairly random. 85 00:05:00,470 --> 00:05:04,190 However, you see the highest degree vertices are up here 86 00:05:04,190 --> 00:05:08,120 at 0, 0, because that's the order in which the generator 87 00:05:08,120 --> 00:05:08,870 spits them out. 88 00:05:08,870 --> 00:05:11,360 It gives you the highest degree vertices first. 89 00:05:11,360 --> 00:05:14,050 And so although we've reconnected them 90 00:05:14,050 --> 00:05:19,260 with different edges, you see that their vertex labels still 91 00:05:19,260 --> 00:05:25,510 have an intrinsic order which is corresponding to vertex degree. 92 00:05:25,510 --> 00:05:27,470 I should say this generates this naturally. 93 00:05:27,470 --> 00:05:32,795 It's a very common way to-- if you have arbitrary vertices 94 00:05:32,795 --> 00:05:35,260 and you want to put them on an adjacency matrix, 95 00:05:35,260 --> 00:05:37,290 one of the first things to do is to reorder them 96 00:05:37,290 --> 00:05:38,580 according to degree. 97 00:05:38,580 --> 00:05:42,280 You'll always get some kind of structure that emerges there, 98 00:05:42,280 --> 00:05:45,500 and it's often an easier way to see what's going on there. 
99 00:05:45,500 --> 00:05:47,800 And as you can see from the previous data, 100 00:05:47,800 --> 00:05:49,770 you saw if you ordered things by degree 101 00:05:49,770 --> 00:05:52,430 and you saw this type of [INAUDIBLE] aha, 102 00:05:52,430 --> 00:05:54,830 that looks a lot like a power law distribution 103 00:05:54,830 --> 00:05:57,690 just from the scatter of the dots. 104 00:06:00,250 --> 00:06:01,200 So moving on here. 105 00:06:01,200 --> 00:06:04,450 So now we're just going to permute the vertex labels. 106 00:06:04,450 --> 00:06:07,800 So we are not permuting the edges, just the vertex labels. 107 00:06:12,030 --> 00:06:17,720 So if you see here-- and then by permuting the vertex labels, 108 00:06:17,720 --> 00:06:19,800 it looks random, but fairly sparse. 109 00:06:19,800 --> 00:06:22,160 That's because we still-- every single time, 110 00:06:22,160 --> 00:06:24,010 we've moved the vertex labels around. 111 00:06:24,010 --> 00:06:27,160 But all the edges associated with that vertex pair 112 00:06:27,160 --> 00:06:29,050 still will ship around [INAUDIBLE]. 113 00:06:29,050 --> 00:06:31,790 We haven't done anything to break them up. 114 00:06:31,790 --> 00:06:33,470 Whereas before, we broke up the edges, 115 00:06:33,470 --> 00:06:36,570 but we didn't change the labels. 116 00:06:36,570 --> 00:06:40,200 And so then the final one, which is the one that we typically 117 00:06:40,200 --> 00:06:41,650 do when we want to create random, 118 00:06:41,650 --> 00:06:45,540 is we permute both the vertices and the edges 119 00:06:45,540 --> 00:06:52,240 to create a truly random looking adjacency matrix. 120 00:06:52,240 --> 00:06:54,030 And there you see you have something. 121 00:06:54,030 --> 00:06:56,690 And now what you can also see here 122 00:06:56,690 --> 00:07:00,660 is these high degree vertices do stand out. 123 00:07:00,660 --> 00:07:03,330 This is a very standard power law distribution. 
124 00:07:03,330 --> 00:07:07,320 You'll see these dense rows and columns in there 125 00:07:07,320 --> 00:07:10,260 that are indicative of these very high degree nodes. 126 00:07:10,260 --> 00:07:16,380 When we just permuted the edges but not the vertices, 127 00:07:16,380 --> 00:07:19,160 these high degree rows and columns 128 00:07:19,160 --> 00:07:23,550 all got shifted up into the corner of the plot. 129 00:07:23,550 --> 00:07:28,280 So again, another way to begin to look at the data in a way 130 00:07:28,280 --> 00:07:29,700 to recognize structure. 131 00:07:29,700 --> 00:07:31,412 And I think this adjacency matrices, 132 00:07:31,412 --> 00:07:32,870 once you start looking at them, you 133 00:07:32,870 --> 00:07:35,174 begin to get a comfort zone, just as we 134 00:07:35,174 --> 00:07:36,340 do with other types of data. 135 00:07:36,340 --> 00:07:39,917 We begin to learn what they look like. 136 00:07:39,917 --> 00:07:41,750 This is a very good way to look at the data, 137 00:07:41,750 --> 00:07:43,410 because you can look at a fair amount. 138 00:07:43,410 --> 00:07:47,240 If I were to actually plot the graph of 18,000 vertices 139 00:07:47,240 --> 00:07:50,810 as a traditional graph of the vertices and lines connecting 140 00:07:50,810 --> 00:07:52,170 them, it would just be blue. 141 00:07:52,170 --> 00:07:53,540 The entire screen would be blue. 142 00:07:53,540 --> 00:07:57,000 There would be no way to properly position 143 00:07:57,000 --> 00:07:58,290 those vertices. 144 00:07:58,290 --> 00:08:00,560 Here, this is 84,000 data points, 145 00:08:00,560 --> 00:08:03,510 and I can still kind of see it a little bit. 146 00:08:03,510 --> 00:08:05,080 This is the limit. 
147 00:08:05,080 --> 00:08:07,710 100,000 edges is the limit of what 148 00:08:07,710 --> 00:08:09,540 you can put on any plot to really see it, 149 00:08:09,540 --> 00:08:11,480 unless there's some hidden structure that 150 00:08:11,480 --> 00:08:15,260 allows you to really, really move it all together. 151 00:08:15,260 --> 00:08:17,860 But the adjacency matrix is a great way 152 00:08:17,860 --> 00:08:22,770 to look at fairly large graphs. 153 00:08:22,770 --> 00:08:24,260 So we generally do that. 154 00:08:24,260 --> 00:08:26,470 So this is the procedure. 155 00:08:26,470 --> 00:08:30,590 Create your perfect power law distribution, 156 00:08:30,590 --> 00:08:34,100 create some edges, create some permutations, 157 00:08:34,100 --> 00:08:36,210 and then permute it, and then off you go. 158 00:08:36,210 --> 00:08:39,230 You've just created a randomized perfect power law graph. 159 00:08:39,230 --> 00:08:42,990 Again, this is an example of a code where you probably 160 00:08:42,990 --> 00:08:44,490 might even just take this program 161 00:08:44,490 --> 00:08:47,450 and adjust it to suit your needs. 162 00:08:47,450 --> 00:08:49,130 And then, I think, we did that. 163 00:08:49,130 --> 00:08:54,000 And then finally, we plotted the degree distribution of a 3 just 164 00:08:54,000 --> 00:08:55,820 to show you that I'm not lying. 165 00:08:55,820 --> 00:08:58,080 And again, the triangle is the original 166 00:08:58,080 --> 00:08:59,220 and the blue was the thing. 167 00:08:59,220 --> 00:09:01,740 And you see throughout all those permutations, 168 00:09:01,740 --> 00:09:04,482 our degree distribution remained the same. 169 00:09:04,482 --> 00:09:05,940 Even though they look-- those would 170 00:09:05,940 --> 00:09:08,356 be completely different graphs, their degree distributions 171 00:09:08,356 --> 00:09:09,640 are identical. 172 00:09:13,020 --> 00:09:14,570 So let's move on here.
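The claim that every permutation leaves the degree distribution unchanged can be checked with a short Python sketch. This is only an illustration under assumptions: the helper names are made up, and whether the original code relabels start and end vertices with one permutation or with two separate ones is not stated in the lecture (this sketch uses two).

```python
import random
from collections import Counter


def degree_histogram(vertex_per_edge):
    """Map degree -> number of vertices with that degree."""
    return Counter(Counter(vertex_per_edge).values())


def permute(v, shuffle_edges, shuffle_labels, seed=0):
    """Return (start, end) vertex lists after the requested permutations."""
    rng = random.Random(seed)
    start, end = list(v), list(v)
    if shuffle_edges:
        # Reorder the edge endpoints independently, which reconnects edges.
        rng.shuffle(start)
        rng.shuffle(end)
    if shuffle_labels:
        # Relabel vertices (assumption: separate relabelings per endpoint).
        n = max(v) + 1
        new_start = rng.sample(range(n), n)
        new_end = rng.sample(range(n), n)
        start = [new_start[s] for s in start]
        end = [new_end[e] for e in end]
    return start, end


# Per-edge vertex labels for a tiny made-up distribution.
v = [0] * 4 + [1] * 2 + [2] * 2 + [3, 4, 5, 6]

variants = {
    "no permutation": permute(v, False, False),
    "edges only": permute(v, True, False),
    "labels only": permute(v, False, True),
    "edges and labels": permute(v, True, True),
}

reference = degree_histogram(v)
unchanged = all(
    degree_histogram(start) == reference and degree_histogram(end) == reference
    for start, end in variants.values()
)
```

Both operations only reorder or rename the entries of the endpoint lists, so the multiset of per-vertex counts cannot change; that is why `unchanged` holds for all four variants even though the adjacency matrices look very different.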
173 00:09:29,460 --> 00:09:33,040 So this is just showing-- so we actually rolled this up all 174 00:09:33,040 --> 00:09:35,180 into one function here for you. 175 00:09:35,180 --> 00:09:38,430 Ran power law matrix, if you give it an alpha, a D max, 176 00:09:38,430 --> 00:09:41,800 and an ND, it will do those three steps for you 177 00:09:41,800 --> 00:09:44,190 that I had in the previous chart all in one thing 178 00:09:44,190 --> 00:09:48,170 and produce an adjacency matrix that's a perfect power 179 00:09:48,170 --> 00:09:50,416 law based on these parameters. 180 00:09:50,416 --> 00:09:52,165 And again, you saw we have the same number 181 00:09:52,165 --> 00:09:54,350 of vertices that we saw and edges and ratio 182 00:09:54,350 --> 00:09:55,260 that we saw before. 183 00:09:55,260 --> 00:09:57,340 So that's all the same. 184 00:09:57,340 --> 00:10:01,020 Now we're going to transform this data, clean up 185 00:10:01,020 --> 00:10:06,410 this data by making it unweighted, undirected. 186 00:10:06,410 --> 00:10:07,910 We're going to eliminate self loops. 187 00:10:07,910 --> 00:10:09,944 We're going take the upper triangular part. 188 00:10:09,944 --> 00:10:11,610 Here's another one that's unweighted, no 189 00:10:11,610 --> 00:10:14,120 self loops, different versions of them. 190 00:10:14,120 --> 00:10:18,950 And then we do-- you can see what those look like. 191 00:10:18,950 --> 00:10:21,950 So if we look at the first one here, 192 00:10:21,950 --> 00:10:25,700 this just shows the unweighted, what 193 00:10:25,700 --> 00:10:28,660 basically making the data unweighted does to the data 194 00:10:28,660 --> 00:10:29,160 set. 195 00:10:29,160 --> 00:10:32,700 So the triangles are the original data, and just making 196 00:10:32,700 --> 00:10:37,860 it unweighted, how it distorts that data set. 197 00:10:42,690 --> 00:10:46,550 This shows you what happens when you make it undirected. 
198 00:10:46,550 --> 00:10:50,315 So unweighted means we took any cases 199 00:10:50,315 --> 00:10:52,440 where we had vertices with more than one connection 200 00:10:52,440 --> 00:10:54,495 to them, if something had five connections, now 201 00:10:54,495 --> 00:10:56,430 it just gets one connection. 202 00:10:56,430 --> 00:10:59,130 And so that was a fairly big distortion. 203 00:10:59,130 --> 00:11:03,100 A perfect example is if you take a person's social network 204 00:11:03,100 --> 00:11:06,650 graph, and if you were to make it unweighted, 205 00:11:06,650 --> 00:11:09,490 what you're saying is that the connection you have 206 00:11:09,490 --> 00:11:11,600 with your spouse is identical with someone 207 00:11:11,600 --> 00:11:14,910 that you emailed once or that you friended once. 208 00:11:14,910 --> 00:11:16,410 And I think we all agree that that's 209 00:11:16,410 --> 00:11:19,410 a fair amount of information that's lost there. 210 00:11:19,410 --> 00:11:25,670 And so again, encouraging folks to be aware of that 211 00:11:25,670 --> 00:11:28,630 and to be careful of when they're doing it. 212 00:11:28,630 --> 00:11:33,310 Again, making it undirected just means we basically-- 213 00:11:33,310 --> 00:11:38,350 if I phone you a lot or I cite you a lot in papers, 214 00:11:38,350 --> 00:11:42,070 that's the same as you citing me a lot, which you 215 00:11:42,070 --> 00:11:44,500 lose some information there. 216 00:11:44,500 --> 00:11:47,312 Again, this shows the kind of distortion 217 00:11:47,312 --> 00:11:48,895 that we get from making it undirected. 218 00:11:53,380 --> 00:11:55,430 And again, this shows what happens 219 00:11:55,430 --> 00:11:57,650 when you do no self loops. 220 00:11:57,650 --> 00:11:59,570 Well, there's not a lot of self loops, 221 00:11:59,570 --> 00:12:02,980 so we only have affected a few vertices here.
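The three clean-up steps (unweighted, no self loops, undirected via the upper triangular part) can be sketched on a toy edge list in Python. The edge list and variable names are made up for illustration, and representing the adjacency matrix as a dictionary keyed by (row, column) is an assumption of this sketch rather than how the D4M demo stores it.

```python
from collections import Counter

# Toy directed multigraph: repeated pairs are parallel edges (weights).
edges = [(0, 0), (0, 1), (0, 1), (1, 0), (2, 1), (2, 1), (2, 1)]

# Weighted adjacency: entry (i, j) counts how many times the edge occurs.
weighted = Counter(edges)

# Unweighted: every nonzero entry becomes 1, however many edges it held.
unweighted = {edge: 1 for edge in weighted}

# No self loops: drop the diagonal entries.
no_self_loops = {(i, j): w for (i, j), w in unweighted.items() if i != j}

# Undirected: fold (i, j) and (j, i) together, keeping the upper triangle.
undirected = {(min(i, j), max(i, j)) for (i, j) in no_self_loops}


def out_degree(adjacency):
    """Sum each row of a {(row, col): weight} adjacency mapping."""
    degree = Counter()
    for (i, _j), w in adjacency.items():
        degree[i] += w
    return dict(degree)
```

Comparing `out_degree(weighted)` with `out_degree(unweighted)` shows the distortion the lecture describes: vertex 2 drops from degree 3 to degree 1 because its three parallel edges collapse into one.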
222 00:12:02,980 --> 00:12:08,430 So in this case, eliminating self loops is not a terribly 223 00:12:08,430 --> 00:12:10,960 distorted-- doesn't really distort the data very much 224 00:12:10,960 --> 00:12:12,430 at all. 225 00:12:12,430 --> 00:12:15,140 And then finally, this shows the upper correlation matrix. 226 00:12:15,140 --> 00:12:16,890 So when we correlated the two, basically 227 00:12:16,890 --> 00:12:19,370 multiplied the adjacency matrix together, 228 00:12:19,370 --> 00:12:21,640 again showing what we saw before. 229 00:12:24,840 --> 00:12:25,340 Moving on. 230 00:12:42,900 --> 00:12:48,950 You can see the plots that got eliminated from my PowerPoint. 231 00:12:48,950 --> 00:12:50,740 MATLAB has defeated PowerPoint's attempts 232 00:12:50,740 --> 00:12:52,640 to deny you your education. 233 00:12:52,640 --> 00:12:56,080 So again, what we're doing here is 234 00:12:56,080 --> 00:13:00,200 we're creating a perfect power law. 235 00:13:00,200 --> 00:13:01,170 This is a bigger one. 236 00:13:01,170 --> 00:13:02,940 I want a lot of vertices. 237 00:13:02,940 --> 00:13:07,090 So this time, we had 50,000 vertices, 329,000 edges 238 00:13:07,090 --> 00:13:09,330 with a ratio of 6 and 1/2. 239 00:13:09,330 --> 00:13:11,240 We create our vertices. 240 00:13:11,240 --> 00:13:14,220 We randomize the edge order, et cetera. 241 00:13:14,220 --> 00:13:17,690 Now we're going to randomly pick a subsample of these. 242 00:13:17,690 --> 00:13:18,960 And what is F samp set at? 243 00:13:18,960 --> 00:13:20,290 It's 1/40. 244 00:13:20,290 --> 00:13:24,230 So I'm going to take 1/40 of all the vertices. 245 00:13:24,230 --> 00:13:26,440 Now I'm going to go and compute that degree, 246 00:13:26,440 --> 00:13:29,660 and I'm going to basically subsample all of these. 247 00:13:29,660 --> 00:13:32,520 And later, what you'll see-- so let's just take a look at that. 248 00:13:37,450 --> 00:13:39,000 So this just shows that chart here.
249 00:13:39,000 --> 00:13:41,460 So this is the original data. 250 00:13:41,460 --> 00:13:42,880 Again, this is the vertex. 251 00:13:42,880 --> 00:13:45,870 We're sorting the vertices by degree here. 252 00:13:45,870 --> 00:13:48,120 So this is the highest degree vertex. 253 00:13:48,120 --> 00:13:52,160 These are the lowest degree vertex. 254 00:13:52,160 --> 00:13:54,550 And each vertex is getting a dot here. 255 00:13:54,550 --> 00:13:56,140 So we have 50,000 vertices. 256 00:13:56,140 --> 00:14:00,570 They all get a dot here, and we've only taken 1/40 of them. 257 00:14:00,570 --> 00:14:05,720 And this just shows you here-- If you only take 1/40 of them 258 00:14:05,720 --> 00:14:09,800 and then compute their sample, this is what you get. 259 00:14:09,800 --> 00:14:13,929 Now, standard sampling theory would say aha, well, 260 00:14:13,929 --> 00:14:15,220 I know how to correct for this. 261 00:14:15,220 --> 00:14:16,740 The way I correct for this is I just 262 00:14:16,740 --> 00:14:19,880 multiply my sample data by 40. 263 00:14:19,880 --> 00:14:23,610 And we took 1/40, so that means whenever I measure it, 264 00:14:23,610 --> 00:14:27,090 the true value should be 40 times higher. 265 00:14:27,090 --> 00:14:29,310 So we can look at that in figure 2. 266 00:14:29,310 --> 00:14:30,630 So this is the true sample. 267 00:14:35,540 --> 00:14:38,670 So again, we see for our high-degree vertices here-- 268 00:14:38,670 --> 00:14:41,300 this is the highest degree vertex here again. 269 00:14:41,300 --> 00:14:42,800 So this is the high-degree vertices. 270 00:14:42,800 --> 00:14:44,030 These are the low-degree vertices. 271 00:14:44,030 --> 00:14:46,405 I don't know if I said that, opposite when I mentioned it 272 00:14:46,405 --> 00:14:46,910 before. 273 00:14:46,910 --> 00:14:48,420 This is the highest degree vertex.
274 00:14:48,420 --> 00:14:50,537 And you see that by sampling the data, 275 00:14:50,537 --> 00:14:52,870 we're doing a very good job on the high-degree vertices. 276 00:14:52,870 --> 00:14:54,510 We're sampling them just fine. 277 00:14:54,510 --> 00:15:00,200 And that's why statistics works. 278 00:15:00,200 --> 00:15:03,710 If something is really not rare and you sample, 279 00:15:03,710 --> 00:15:05,530 you're going to get a good estimate. 280 00:15:05,530 --> 00:15:09,130 However, for these low-degree vertices over here, 281 00:15:09,130 --> 00:15:12,800 what you see-- by multiplying by 40, 282 00:15:12,800 --> 00:15:15,110 we're significantly over-estimating 283 00:15:15,110 --> 00:15:17,555 their probability. 284 00:15:17,555 --> 00:15:19,180 As I say, this is the curve that proves 285 00:15:19,180 --> 00:15:21,610 that optimists and pessimists are both correct. 286 00:15:24,680 --> 00:15:26,880 There are so many rare things. 287 00:15:26,880 --> 00:15:28,970 If the world is a power law distribution, 288 00:15:28,970 --> 00:15:31,990 it means that there are so many rare events in the world, some 289 00:15:31,990 --> 00:15:34,550 of them are going to happen to you. 290 00:15:34,550 --> 00:15:38,865 So it means if you're an optimist, go play the lottery. 291 00:15:38,865 --> 00:15:42,430 If you're a pessimist, it means that lightning could hit you, 292 00:15:42,430 --> 00:15:45,550 and you better just stay inside. 293 00:15:45,550 --> 00:15:47,960 So there's just so many rare things 294 00:15:47,960 --> 00:15:51,630 that some really rare things are going to happen 295 00:15:51,630 --> 00:15:53,060 to you in your lifetime. 296 00:15:53,060 --> 00:15:55,060 Most likely, those rare things are very mundane. 297 00:15:58,710 --> 00:15:59,980 But we can correct for this. 298 00:15:59,980 --> 00:16:05,190 And so we have basically a way here of deriving calculations. 
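The over-estimate at the low-degree end follows directly from the arithmetic of the naive multiply-by-40 rule, and a small Python sketch shows it. This assumes the subsample is taken over edges, which the lecture does not state precisely; the star-shaped toy graph and all names are made up, and the median-based correction used in the demo is not reproduced here.

```python
import random
from collections import Counter

SCALE = 40                 # we keep roughly 1/40 of the edges
rng = random.Random(1)     # fixed seed so the sketch is repeatable

# Toy graph: one hub with 4,000 edges, each going to a leaf of true degree 1.
hub = 0
edges = [(hub, leaf) for leaf in range(1, 4001)]

# Keep each edge independently with probability 1/40.
sample = [edge for edge in edges if rng.random() < 1 / SCALE]

# Degree of each vertex as seen in the sample.
observed = Counter()
for start, end in sample:
    observed[start] += 1
    observed[end] += 1

# Naive correction: multiply every observed degree by 40.
estimate = {vertex: SCALE * k for vertex, k in observed.items()}
hub_estimate = estimate.get(hub, 0)
leaf_estimates = {vtx: est for vtx, est in estimate.items() if vtx != hub}
```

The hub estimate lands near its true degree of 4,000, but any leaf that shows up in the sample has an observed degree of at least 1, so its scaled estimate is at least 40 even though its true degree is 1. Leaves that were not sampled simply do not appear, which is why a single multiplicative factor cannot be right at both ends.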
299 00:16:05,190 --> 00:16:09,300 So we compute the parameters of our distribution, 300 00:16:09,300 --> 00:16:12,340 and through these two functions, compute degree correction, 301 00:16:12,340 --> 00:16:13,890 and apply degree correction. 302 00:16:13,890 --> 00:16:17,730 We can actually go back and say all right, given 303 00:16:17,730 --> 00:16:21,860 that we believe the data is power law and we've sampled it, 304 00:16:21,860 --> 00:16:25,000 can we then come up with a more uniform correction that 305 00:16:25,000 --> 00:16:28,020 basically gives us a better estimate that works at both 306 00:16:28,020 --> 00:16:30,890 the high and the low end? 307 00:16:30,890 --> 00:16:34,560 And that's what you see here. 308 00:16:34,560 --> 00:16:36,300 So basically, we haven't changed. 309 00:16:36,300 --> 00:16:38,170 The correction hasn't changed here. 310 00:16:38,170 --> 00:16:41,366 But we've downgraded these lower ones. 311 00:16:41,366 --> 00:16:42,740 And essentially, what we're doing 312 00:16:42,740 --> 00:16:45,570 is instead of just using the average as the statistic, 313 00:16:45,570 --> 00:16:46,860 we're using the median. 314 00:16:46,860 --> 00:16:51,720 So we're using, essentially, a quantile-based correction 315 00:16:51,720 --> 00:16:55,121 here, a 50th percent quantile-based correction 316 00:16:55,121 --> 00:16:55,620 here. 317 00:16:55,620 --> 00:16:59,800 And that causes us to lower the estimates of these vertices. 318 00:16:59,800 --> 00:17:01,810 And so it would be a better estimate 319 00:17:01,810 --> 00:17:06,300 and allow you to do the sampling of that. 320 00:17:06,300 --> 00:17:06,890 Very good. 321 00:17:11,960 --> 00:17:13,084 And our final demo. 322 00:17:38,050 --> 00:17:41,300 So now what we're doing is power law fitting. 323 00:17:41,300 --> 00:17:43,260 And so we have the routines for doing that. 324 00:17:43,260 --> 00:17:46,115 So again, here's our distribution. 325 00:17:46,115 --> 00:17:48,020 It's a power law of 1.3.
326 00:17:48,020 --> 00:17:51,785 We've set D max to be 2,000 and about 60-ish bins. 327 00:17:51,785 --> 00:17:54,050 We create our power law distribution. 328 00:17:54,050 --> 00:17:57,230 It has 50,000 vertices, 329,000 edges. 329 00:17:57,230 --> 00:17:59,720 Ratio is the same as the one we did before. 330 00:17:59,720 --> 00:18:02,920 We're going to make it undirected-- 331 00:18:02,920 --> 00:18:05,050 undirected and unweighted, undirected, unweighted, 332 00:18:05,050 --> 00:18:08,910 no self loops, so standard corrections that we do. 333 00:18:08,910 --> 00:18:11,510 We're going to compute the degree distribution 334 00:18:11,510 --> 00:18:13,730 of that data and plot it. 335 00:18:13,730 --> 00:18:16,410 Or actually, get it there now. 336 00:18:16,410 --> 00:18:19,910 I'm going to then-- we have this function called power law fit. 337 00:18:19,910 --> 00:18:24,521 So if I compute-- so I compute the degree distribution. 338 00:18:24,521 --> 00:18:26,520 So we have this function called out degree which 339 00:18:26,520 --> 00:18:28,270 gives us the distribution. 340 00:18:28,270 --> 00:18:33,480 And I can find, essentially, the number of values with the one, 341 00:18:33,480 --> 00:18:36,660 and the one that-- our maximum, so this is estimating 342 00:18:36,660 --> 00:18:38,160 our poor man's slope. 343 00:18:38,160 --> 00:18:39,490 So we're computing the slope. 344 00:18:39,490 --> 00:18:43,430 We're counting the total number of edges here, 345 00:18:43,430 --> 00:18:45,980 and then we have this function called power law fit. 346 00:18:45,980 --> 00:18:50,130 Basically, we can plug in what the estimated alpha is, 347 00:18:50,130 --> 00:18:54,090 what the number of vertices is, and the number of edges 348 00:18:54,090 --> 00:18:57,770 is to find our best fit distribution.
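One plausible reading of the "poor man's slope" is a two-point slope on a log-log plot, taken between the lowest-degree bin and the maximum-degree bin of the degree distribution. The Python sketch below follows that reading; the exact formula used in the demo is not given in the lecture, so the function names and the formula itself should be treated as assumptions.

```python
import math
from collections import Counter


def degree_distribution(vertex_per_edge):
    """Return sorted degrees and the number of vertices at each degree."""
    histogram = Counter(Counter(vertex_per_edge).values())
    degrees = sorted(histogram)
    return degrees, [histogram[d] for d in degrees]


def poor_mans_alpha(degrees, counts):
    """Two-point log-log slope between the first and last degree bins.

    Assumed formula: alpha = log(n_first / n_last) / log(d_last / d_first).
    """
    return math.log(counts[0] / counts[-1]) / math.log(degrees[-1] / degrees[0])


# Start vertex of each edge for a tiny made-up graph.
v = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6]
degrees, counts = degree_distribution(v)
alpha = poor_mans_alpha(degrees, counts)

# Totals that would be passed to the fit alongside alpha.
n_vertices = sum(counts)
n_edges = sum(d * n for d, n in zip(degrees, counts))
```

Because it uses only two points, this estimate is cheap but sensitive to noise in the single highest-degree bin, which is presumably why the demo follows it with a proper fit.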
349 00:18:57,770 --> 00:19:00,940 So this basically inverts those formulas I showed you, 350 00:19:00,940 --> 00:19:03,130 which is given a degree distribution 351 00:19:03,130 --> 00:19:06,880 that sums to a particular number of vertices and sums 352 00:19:06,880 --> 00:19:09,360 to a particular number of edges, can you 353 00:19:09,360 --> 00:19:13,746 give me a new D max and a new ND, 354 00:19:13,746 --> 00:19:17,530 these parameters that don't really have as much meaning, 355 00:19:17,530 --> 00:19:19,840 to do that? 356 00:19:19,840 --> 00:19:22,920 And so basically, we use, essentially, 357 00:19:22,920 --> 00:19:26,600 a combination of three different techniques here. 358 00:19:26,600 --> 00:19:31,410 Because this is so nonlinear, and there's this-- basically, 359 00:19:31,410 --> 00:19:34,250 remember I talked about integer bins and logarithmic bins? 360 00:19:34,250 --> 00:19:37,090 Well, if you look at that plot, it showed there's that bending. 361 00:19:37,090 --> 00:19:42,540 It's a very nasty manifold, the surface of this function. 362 00:19:42,540 --> 00:19:45,220 And it has a continuous part and a discrete part. 363 00:19:45,220 --> 00:19:49,080 So what we do here is we do essentially a sampled search 364 00:19:49,080 --> 00:19:51,960 where we randomly sample, looking for a location. 365 00:19:51,960 --> 00:19:55,470 We do a heuristic search, which is a simulated [INAUDIBLE] 366 00:19:55,470 --> 00:19:56,020 search. 367 00:19:56,020 --> 00:20:00,850 And we also use Broyden's nonlinear-- essentially a 368 00:20:00,850 --> 00:20:05,420 variation of Newton's method to all try and find the best set 369 00:20:05,420 --> 00:20:07,980 of parameters that will fit this data. 370 00:20:07,980 --> 00:20:09,440 We rarely get an exact match. 371 00:20:09,440 --> 00:20:11,810 But you can see here it's choosing different ones. 372 00:20:11,810 --> 00:20:13,854 And this gives you how it's doing. 
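Of the three techniques mentioned, the sampled search is the easiest to sketch. The Python below is a hypothetical illustration only: the generator in `totals`, the parameter ranges, and the error measure are all assumptions, and it does not attempt the heuristic search or the Broyden step described above.

```python
import random


def totals(alpha, d_max, n_bins):
    """Vertex and edge totals of an assumed power law generator (not D4M's)."""
    degrees = sorted({round(d_max ** (i / (n_bins - 1))) for i in range(n_bins)})
    counts = [max(1, round((d_max / d) ** alpha)) for d in degrees]
    return sum(counts), sum(d * n for d, n in zip(degrees, counts))


def sampled_search(alpha, target_n, target_m, n_samples=2000, seed=0):
    """Randomly sample (d_max, n_bins) and keep the closest match to the targets."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_samples):
        d_max = rng.randint(10, 5000)     # assumed search range
        n_bins = rng.randint(5, 100)      # assumed search range
        n, m = totals(alpha, d_max, n_bins)
        # Assumed error measure: summed relative error in vertices and edges.
        err = abs(n - target_n) / target_n + abs(m - target_m) / target_m
        if best is None or err < best[0]:
            best = (err, d_max, n_bins, n, m)
    return best


# Targets produced by known parameters, then searched for "blind".
target_n, target_m = totals(1.3, 1000, 30)
err, d_max, n_bins, n, m = sampled_search(1.3, target_n, target_m)
```

Random sampling does not need derivatives, which suits a surface with a discrete part and a cusp; the trade-off is that it rarely lands on an exact match, consistent with the remark that an exact match is rare.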
373 00:20:13,854 --> 00:20:15,270 From the sample search, this shows 374 00:20:15,270 --> 00:20:17,860 you the number of vertices and the number of edges 375 00:20:17,860 --> 00:20:19,030 it was able to achieve. 376 00:20:19,030 --> 00:20:23,190 The heuristic search didn't do very well at all, 377 00:20:23,190 --> 00:20:27,520 and then the Broyden search did a pretty good job, 378 00:20:27,520 --> 00:20:28,810 and it got us pretty well. 379 00:20:28,810 --> 00:20:34,360 So actually, it ended up comparing all of these, 380 00:20:34,360 --> 00:20:40,010 and it ended up choosing the sample search-- this one, 381 00:20:40,010 --> 00:20:40,910 this first one I did. 382 00:20:40,910 --> 00:20:42,660 It liked that best of all. 383 00:20:42,660 --> 00:20:45,540 So we'll look at that here. 384 00:20:48,490 --> 00:20:53,740 So Figure 2 was the original data. 385 00:20:53,740 --> 00:20:57,210 Figure 1 shows you the manifold of this space. 386 00:20:57,210 --> 00:20:59,990 And so plotted in this coordinate system 387 00:20:59,990 --> 00:21:02,820 of n versus m, the dot is the dot that it found. 388 00:21:02,820 --> 00:21:06,150 It's actually here in this [INAUDIBLE] 389 00:21:06,150 --> 00:21:07,820 these lines show the boundaries. 390 00:21:07,820 --> 00:21:12,120 But you can see this very nonlinear manifold here. 391 00:21:12,120 --> 00:21:14,080 This is the continuous regime. 392 00:21:14,080 --> 00:21:15,560 This is the integer regime. 393 00:21:15,560 --> 00:21:18,240 It cusps right at the transition. 394 00:21:18,240 --> 00:21:21,465 Again, a very nasty function to try and invert, 395 00:21:21,465 --> 00:21:23,590 which is why we used all those different techniques 396 00:21:23,590 --> 00:21:24,240 to invert it. 397 00:21:29,640 --> 00:21:32,290 And then Figure 3 shows the results. 398 00:21:32,290 --> 00:21:37,400 So this black line shows the original model input 399 00:21:37,400 --> 00:21:38,420 that we provided. 
400 00:21:38,420 --> 00:21:41,360 So that was the true model. 401 00:21:41,360 --> 00:21:44,300 The circle shows the data after it was transformed. 402 00:21:44,300 --> 00:21:47,465 We made it undirected, unweighted with no self loops. 403 00:21:47,465 --> 00:21:48,940 That's this. 404 00:21:48,940 --> 00:21:51,970 Alpha is-- this is our poor man's alpha. 405 00:21:51,970 --> 00:21:53,960 What you can see is almost identical 406 00:21:53,960 --> 00:21:55,660 to the original model. 407 00:21:55,660 --> 00:21:58,740 So the poor man's alpha does a very good job 408 00:21:58,740 --> 00:22:01,350 of fitting in this case. 409 00:22:01,350 --> 00:22:03,250 The triangle shows the model fit. 410 00:22:03,250 --> 00:22:07,330 So when we fit the data, we came up with that best fit it shows, 411 00:22:07,330 --> 00:22:09,420 and we then created a new distribution. 412 00:22:09,420 --> 00:22:11,000 This is what it looked like. 413 00:22:11,000 --> 00:22:15,400 And then the plus sign shows us rebinning that data 414 00:22:15,400 --> 00:22:17,610 onto the bins from the model. 415 00:22:17,610 --> 00:22:20,910 And you see that we've done a very nice job of recovering 416 00:22:20,910 --> 00:22:23,690 the original power law distribution even 417 00:22:23,690 --> 00:22:25,560 after we did that distortion. 418 00:22:25,560 --> 00:22:28,830 So again, that is the last example. 419 00:22:28,830 --> 00:22:30,360 So I want to thank you. 420 00:22:30,360 --> 00:22:32,990 And then for the homework, a lot of you 421 00:22:32,990 --> 00:22:36,020 did Homework 2, which is great. 422 00:22:36,020 --> 00:22:39,720 I think Homework 3 wasn't such a great hit. 423 00:22:39,720 --> 00:22:42,490 I'm going to definitely rethink that one. 424 00:22:42,490 --> 00:22:44,410 But the next homework does not require 425 00:22:44,410 --> 00:22:46,130 you to have done homework 3.
426 00:22:46,130 --> 00:22:48,840 If you did homework 2, basically, 427 00:22:48,840 --> 00:22:52,940 it's just saying compute a degree distribution. 428 00:22:52,940 --> 00:22:55,430 And in fact, you don't even need to have done Homework 2. 429 00:22:55,430 --> 00:22:57,460 You can compute a degree distribution 430 00:22:57,460 --> 00:23:00,720 on your Homework 2, or you can compute a degree distribution 431 00:23:00,720 --> 00:23:01,522 on any data set. 432 00:23:01,522 --> 00:23:02,855 For example, today is Halloween. 433 00:23:02,855 --> 00:23:07,450 If you go trick or treating with somebody-- yourself, 434 00:23:07,450 --> 00:23:10,360 your children, somebody else's children, 435 00:23:10,360 --> 00:23:17,110 you can maybe email me the histogram of your candy 436 00:23:17,110 --> 00:23:19,365 and plot it on a degree distribution, 437 00:23:19,365 --> 00:23:22,900 and maybe compute the poor man's alpha coefficient from that. 438 00:23:22,900 --> 00:23:26,210 And the other coefficients from that would be a fun 439 00:23:26,210 --> 00:23:27,070 exercise to do. 440 00:23:27,070 --> 00:23:28,710 So that'll be the next homework. 441 00:23:28,710 --> 00:23:31,440 I'll email that out in the next couple of days. 442 00:23:31,440 --> 00:23:33,210 So look forward. 443 00:23:33,210 --> 00:23:35,000 Again, if you send me the homework 444 00:23:35,000 --> 00:23:43,010 prior to next-- actually, just a reminder, no class next week. 445 00:23:43,010 --> 00:23:44,980 This room has been taken. 446 00:23:44,980 --> 00:23:48,010 But again, if you email me the homework prior to this time 447 00:23:48,010 --> 00:23:50,964 next week, I will give you feedback on it. 448 00:23:50,964 --> 00:23:52,880 You can still send me the homework after that. 449 00:23:52,880 --> 00:23:54,930 I just won't give you any feedback on it. 450 00:23:54,930 --> 00:23:58,210 So thank you again, and look forward 451 00:23:58,210 --> 00:24:00,550 to seeing you in two weeks.