The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Welcome. Happy Halloween. And for those of you watching this video at home, you'll be glad to know the whole audience has joined me, and they're all dressed up in costumes. It's a real fun day here as we do this, so I don't feel alone in my costume. A lot of moral support there, so that's great.

So this is Lecture 05 on Signal Processing on Databases. Just for a recap, for those of you who missed earlier classes or are doing this out of order on the web: signal processing really alludes to detection theory, finding things, which alludes to the underlying mathematical basis of that, which is linear algebra. And databases really refers to working with unstructured data, strings, and other types of things. Those are two things that aren't really talked about together, but we're bringing them together here because we have lots of new data sets that require it. And this talk is probably the one that gets most into something we would say relates to detection theory, because we're going to be dealing a lot with background data models. In particular, power laws and methods of constructing power law data sets, methods for sampling and fitting, and using that as a basis for doing the kind of work that you want to do.

So moving in here, just your outline. We've got a lot of material to go over today. This all uses the data set that we talked about in the last lecture, which is the Reuters data set, so we'll be applying some of these ideas to that data set. I'm going to give an introduction here, and then I'm going to get to sampling theory, subsampling theory, and various types of distributions.
And then we'll end up with the Reuters data set.

The overall goal of this lecture is to develop a background model for these types of data sets that is based on what I'm calling a perfect power law. Then, after we can construct a perfect power law, we're going to sample that power law and look at what happens when we sample it. What are the effects of sampling it? And then we can use the power law to look at things like deviations and such.

Now you might ask, well, why are we so concerned about backgrounds and power laws? It's because here is the basis of detection theory on one slide. In detection theory, you basically have a model which consists of noise and signal, and you have two hypotheses: H0, that there's only noise in your data, and H1, that there's signal plus noise. Essentially, when you do detection theory, given these models you can compute optimal filters for answering the question: is there a signal there, or is it just noise? That's essentially what detection theory boils down to. Now, when we deal with graph theory, it's obviously not so clean in terms of our dimensions here. We'll have some kind of high dimensional space, and our signal will be projected into that high dimensional space. But nevertheless, the concept is still just as important: we have noise and a signal, and that's what we're trying to work with here.

Detection theory works in the traditional domains that we've applied it to because we have a fairly good model for the background, which tends to be Gaussian random noise. It's kind of the fundamental distribution that we use in lots of our data sets, and if we didn't have that model, it would be very difficult for us to proceed with much of detection theory. And the Gaussian random noise model works because, in many respects, if you're collecting sensor data, you really will have Gaussian physics going on.
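As a quick symbolic recap of that one-slide version of detection theory, in standard textbook notation (this is my restatement, not copied from the slide):

\[
H_0: x = n, \qquad H_1: x = s + n, \qquad
\Lambda(x) = \frac{p(x \mid H_1)}{p(x \mid H_0)} \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \eta .
\]

For a Gaussian background this likelihood ratio test leads to the familiar matched filter; the question for the rest of the lecture is what to use as the background term when the data is power law rather than Gaussian.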
There's also the law of large numbers: if you have lots of different distributions that are pulled together, they will end up beginning to look like a Gaussian as well. So that's what we have in many of the traditional fields that we've worked in in signal processing. But now we're in this new area where a lot of our data sets arise from artificial processes, processes that are a result of human actions. Be it data on a network, be it data in a social network, be it other types of data that have a strong artificial element to them, we find that the Gaussian model does not reveal itself in the same way that we've seen in other data sets. But we really need a background model, so we have to do something about it.

There has been a fair amount of research and literature on coming up with first-principles methods for creating power law distributions in data sets. We talked a little bit about these distributions in the previous lectures. And they've met with mixed results. It's been difficult to come up with the underlying physics of the processes that result in certain vertices in graphs having an enormous number of edges and other vertices only having a few. There has been work on that, and I encourage you to look at that literature. Here we're going to go from the reverse direction, which is: let's begin by coming up with some way to construct a perfect power law, with no concept of what the underlying physics motivating it might be. Essentially, a basic linear model for a perfect power law. That we probably can do, and then we'll go from there. And linear models are something that we often use in our business; they're a good first starting point.

So along those lines, here is a way to construct a perfect power law in a matrix. This is basically a slide of definitions, so let me spend a little time going through them.
So we're going to represent our graph, or our data, as a random matrix: basically zero where there's no connection. This is a set of vertices connected with another set of vertices. There are Nout of these vertices and Nin of these vertices. If you have a dot here, the row corresponds to the vertex the edge leaves, and the column corresponds to the vertex that the edge is going into. This adjacency matrix A is just going to be constructed by randomly filling the matrix with entries. And the only real constraint is on the sum of A. We're going to allow multiple edges, so the values aren't just zero and one; they can be more than that. But when you sum the matrix A, all its values up, you get a value M, which is the total number of edges in the graph. So we have essentially a graph, which could be a bipartite graph or not, with Nout out vertices, Nin in vertices, and M total edges.

For the perfect power law, we're going to have essentially two perfect power laws here. One is on the out degree: if you sum the rows and then do a histogram, you want to produce a histogram that looks something like this. You have an out degree for each vertex, and this shows how many vertices have that out degree. And the power law says that these points should fall on a slope with a negative power law coefficient of alpha out. So that's essentially one definition there. And then likewise you have another one for the other direction, so the in degrees have their own power law. These are the definitions; this is what we're saying a perfect power law is. We're saying a perfect power law has these properties, and now we're going to attempt to construct it. We have no physical basis for saying why the data should look this way. We're just saying this is a linear model of the data, and we're going to construct it that way.
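In symbols, the definitions just described might be written roughly like this (my notation, inferred from the description rather than copied from the slide):

\[
\sum_{i,j} \mathbf{A}(i,j) = M, \qquad
d^{out}(i) = \sum_j \mathbf{A}(i,j), \qquad d^{in}(j) = \sum_i \mathbf{A}(i,j),
\]
\[
n^{out}(d) = \bigl|\{\, i : d^{out}(i) = d \,\}\bigr| \propto d^{-\alpha_{out}}, \qquad
n^{in}(d) = \bigl|\{\, j : d^{in}(j) = d \,\}\bigr| \propto d^{-\alpha_{in}} .
\]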
And again, these can be undirected, multi-edge; we can allow self-loops and disconnected vertices, and hyper-edges. Anything you can get by just randomly throwing down values onto a matrix. And again, the only constraint is that the sum in both directions is equal to the number of edges.

So given that, can we construct such a thing? Well, it turns out we can construct such a thing fairly simply. In MATLAB we can construct a perfect power law graph with this four-line function here. It will construct a degree distribution that has this property. The three parameters to this distribution are alpha, which is the slope; dmax, which is the maximum degree vertex; and then this number Nd, which is roughly proportional to the number of bins that we're going to have here, essentially the number of points. It's not exactly that, but roughly proportional to it. And when you do this little equation, the first thing that you will see is that we are creating a logarithmic spacing of bins. We kind of need to do that here. But at a certain point we get below a value where the spacing becomes essentially one bin per integer. And so these are two very separate regimes. You have one which we call the integer regime, where basically each integer has one bin representing it, and then it transitions to a logarithmic regime. You might say this is somewhat artificial, but it's actually very reflective of what's really going on. We really have a dmax, we really have, almost always, a count at 1, and then we have a count at 2 or 3, and then they start spreading out. So this is just an artificial way to create this type of distribution, which is a perfect power law distribution. It's very simple, very efficient code for creating one of these, and it has a smooth transition from what we call the integer bins to the logarithmic bins.
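The actual four-line function lives on the slide rather than in the transcript, so here is only a minimal sketch of the same idea, assuming the convention that the count at dmax is 1 (the function name and details are my own, not the course's code):

    function [d, n] = PerfectPowerLaw(alpha, dmax, Nd)
      % Candidate degree bins: logarithmically spaced from 1 to dmax,
      % rounded to integers. Small values collapse onto one bin per
      % integer (the integer regime); larger values stay log-spaced.
      d = unique(round(logspace(0, log10(dmax), Nd)));
      % Counts fall on a slope of -alpha, normalized so n(dmax) = 1.
      n = round((d ./ dmax) .^ (-alpha));
    end

With this convention, sum(n) gives the number of vertices N and sum(d.*n) the number of edges M, which is the inversion used a little later in the lecture.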
And it also gives a very nice, what we call, poor man's slope estimator. There's a lot of research out there about how you estimate the slope of your power law, and there are all kinds of algorithms for doing this. Well, the simplest way is just to take the two endpoints. Take the first point and the last point, and you know you're perfectly fitting two points; and you could argue you're perfectly fitting the two most important points. You get this nice, simple value for the slope here. In addition, you can make the argument that regardless of how you bin the data, you'll always have these two bins: you will always have a bin at dmax and you will always have a bin at 1. All the other bins are going to be somewhat a matter of choice, or of fitting. So again, that's another reason to rationalize this alpha. I would say, if you plot your data and you have to estimate an alpha, then here's what you do, and it's as good an estimate of alpha as any. And it's very nicely defined. So when we talk about estimating the slope here, this is the formula we're going to use.

So far, this code has just constructed a degree distribution; that is, the degree, and the number of vertices with that degree, are the outputs of this perfect power law function. We still have to assign that degree distribution to an actual set of edges. OK? And here's the code that will do that for us: given a degree distribution, it will create a set of vertices that realize it.

Now, the actual pairing of the vertices into edges is arbitrary. And in fact, these are all different adjacency matrices for the same degree distribution. That is, every single one of these has the same degree distribution in both the rows and the columns. So which vertices are actually connected to which is a second order statistic.
The degree distribution is the first order [INAUDIBLE], but how you want to connect those vertices up is somewhat arbitrary, and so that's a freedom that you have here. For example, if I just take the vertices out of here and I say, all right, every single vertex in your list, I'm just going to pair you up with yourself, I will get an adjacency matrix that's all diagonal: essentially, all self-loops. If I take that list and just randomly reorder the vertex labels themselves, then I get something that looks like this. If I just randomly reorder the edge pairs, I get something like this. And if I randomly relabel the vertices and reconnect the vertices into different edges, I get something like this. For the most part, when we talk about randomly generating our perfect power laws, we're going to talk about this last one, which is probably most like what we really encounter. It's essentially something that's equivalent to randomly labeling your vertices, and then randomly taking those vertices and randomly pairing them together. So that basically covers how we can actually construct a graph from our perfect power law degree distribution; a small sketch of that random pairing follows below.

Now, this is a forward model. Given a set of these three parameters, we can generate a perfect power law. But if we're dealing with data, we often want slightly different parameters. As I said before, our three parameters were alpha, which is greater than 0; dmax, which is the highest degree in the data, which we're saying is greater than 1; and this parameter Nd, which roughly corresponds to the number of bins. We can generate a power law model for any values that satisfy those constraints, so that's a large space. However, what we'll typically see is that we want to use different parameters: an alpha, a number of vertices, and a number of edges are more often the parameters we want to work with.
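Here is that sketch: a hedged, minimal version of turning the (d, n) degree histogram into randomly paired edges (the variable names and the use of repelem are my own assumptions, not the course's code):

    % Expand the histogram into a degree for each individual vertex.
    deg   = repelem(d(:), n(:));                % deg(i) = degree of vertex i
    % Each vertex contributes deg(i) edge "stubs"; shuffle the stubs
    % independently for the out side and the in side of each edge.
    stubs = repelem((1:numel(deg)).', deg);
    out   = stubs(randperm(numel(stubs)));
    in    = stubs(randperm(numel(stubs)));
    % Accumulate into a sparse adjacency matrix; duplicate pairs become
    % multi-edges, and the row and column sums both reproduce deg.
    A = sparse(out, in, 1, numel(deg), numel(deg));

Pairing each stub with itself instead of a shuffled copy would give the all-self-loop diagonal matrix mentioned above.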
And we can compute those by inverting these formulas. That is, if we compute the degree distribution, we can sum it to compute the number of vertices, and likewise, we can sum the distribution times the degree to get the number of edges. So given an alpha and our model, we can invert these. All right? And what you see here is, for a given value of alpha, the allowed values of N and M; that is, the values of N and M that can be a power law. So what you see is that not all combinations of vertex count and edge count can be constructed as a power law. There's a band here. This is a logarithmic graph, so it's a wide band, but there's a band here of allowable data that will produce a power law. And typically, the middle of this band is at a ratio of around 10, which happens to be the magic number that we see in lots of our data sets. When people say, I have power law data, and someone asks, well, what's your edge to vertex ratio? [INAUDIBLE] we say, it's like 8, or 10, or 20, or something like that. And again, you see it's because in order for it to be power law data, at least according to this model, it has to fall into this band here. You'll also see this is a very nonlinear function, and we'll get into fitting it later. It's a nasty, nasty function to invert, because we have integer data and data that's almost continuous. We can do it, but it's kind of a nasty business. But given an alpha, an N, and an M that are consistent, we can actually then generate a dmax and an Nd that will best fit those parameters.

So let's do an example here. I didn't just dress up in this crazy outfit for nothing; we have a whole Halloween theme to our lecture today. When I go trick or treating with my daughter, of course, our favorite thing to do is the distribution of the candy when we're done. And so this shows last year's candy distribution. We'll see how it varies.
As you can see, Hershey's chocolate bars, not surprisingly, are extremely popular. What else is popular here? Swedish Fish, not so popular. Nestle Crunch bars, not so popular. I actually found this somewhat interesting: this list hasn't changed since I went trick or treating. This is a tough list to break into. Getting a new candy to the point where it's Halloween-worthy is pretty hard.

So this shows the distribution of all the candy that we collected this year, and here is some basic information. We had 77 pieces of candy, or distinct edges. We had 19 types of candy. Our edge to vertex ratio was 4. The dmax was 15, so we had 15 Hershey's Kisses. N1 was 8: we had eight types of candy that we only got one of. And then our power law slope was alpha. And our fit parameters, when we actually fit, were 77, 21, and an M/N of 3.7. And this shows you the data. This is the candy degree, and this is the number of candy types with that degree. This shows you what we measured, this is the poor man's slope here, and this is the model. And then one thing we can do, which is very helpful, is to re-bin the measured data using the bins extracted from the model. That gets you these red x's, or plus signs, here. And we'll discover that's very important, because the data you have is often binned in a way that's not proper for seeing the true distribution. We can use this model to come up with a better set of bins, and then bin the data with respect to that. So that's just an example of this in actual practice.

So now that we have a mechanism for generating a perfect power law, let's see what happens when we sample it. Let's see what happens when we do the things to it that we typically do to clean up our data.
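Before moving on to the cleanup experiments, one quick numeric check on the candy example above (my own arithmetic, with the hypothetical assumption that exactly one candy type sits at the maximum degree of 15):

    M = 77;  N = 19;           % edges (pieces) and vertices (candy types)
    M / N                      % edge-to-vertex ratio, about 4
    log(8 / 1) / log(15 / 1)   % poor man's slope from the endpoints
                               % (d, n) = (1, 8) and (15, 1), about 0.77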
I bring this up because in standard graph theory, as I've talked about in previous lectures, we often have what we call random, undirected Erdos-Renyi graphs, which are basically vertices connected by edges without direction. And usually the edges are unweighted, so we just have a 0 or a 1. So, very simplified graphs. I'm actually going to take this off; it's getting a little hot here in the top hat.

A lot of our graph theory is based on these types of graphs. And as we've talked about before, our data tends to not look like that. So one of the things we do, so that we can apply the theory to that data, is that we often make the data look like that. We'll often make the data undirected, we'll often make the data unweighted, and other types of things, so we can apply all the theory that we've developed over the last several decades on these particular, very well studied graphs. So now that we have a perfect power law graph, we can see what happens if we apply those same corrections to the data.

And here's what we see. We generated a perfect power law graph. The alpha is 1.3, the dmax was 1,000, and our Nd was 50. This generated a data set with 18,000 vertices and 84,000 edges. And here's a very simple way to do the cleanup. We're going to make it undirected by taking the matrix, adding its transpose, and then taking the upper triangular part. This is actually the best way to make an adjacency matrix undirected, to take that upper portion, because it saves you from having to deal with a lot of annoying factors of 2 in the statistics. A lot of times we'll just do A plus A transpose, but then you get these annoying factors of 2 lying around, and this is a way to avoid that. So we've made it undirected by doing that. We're going to make it unweighted by converting everything to a 0 or 1, and then back to double. So that makes it unweighted.
And then we're getting rid of the diagonal, so that eliminates self-loops. So we've done all these things; we've cleaned up our data in this way. So what happens? Well, the triangles were the input model. Now we've cleaned up our data, and we see this sort of mess that we've made of our data. In fact, in keeping with our Halloween theme, we'll call this our witch's broom distribution. And if anybody's looked at degree distributions on real data, you'll recognize this shape instantly, because you have this bendiness coming up here, and then sort of fanning out down here. It's a very common thing that we see in the data sets that we plot. And in fact, there's not an insignificant amount of literature devoted to trying to understand these bumps and wiggles, and whether they really mean something underlying about the physical phenomenon that's taking place. And while it may be the case that those bumps and wiggles are actually representative of some physical phenomenon, based on this we also have to concede that they're also consistent with our cleanup procedure. That is, the thing we're doing to make our data better is introducing nonlinear phenomena in the data, which we may confuse with real phenomena.

So this is very much a cautionary tale. Based on that, I certainly encourage people not to clean up their data in that way. Keep the directedness, don't throw away the self-loops, keep the weightedness. Do your degree distributions on the data as it is, and live with the fact that that's what your data really is like and try to understand it that way, rather than cleaning it up in this way. Sometimes you have no choice; the algorithms that you have will only work on data that's been cleaned up in this way. But you have to recognize that you are introducing new phenomena. It's a highly nonlinear process, this cleaning up, and you have to be careful about that.
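For reference, a minimal sketch of that cleanup sequence as I understand it from the description (assuming A is the sparse, weighted, directed adjacency matrix; this is my paraphrase, not the slide's exact code):

    A = triu(A + A.');        % undirected: symmetrize, keep the upper triangle
                              % so edge counts don't pick up factors of 2
    A = double(A > 0);        % unweighted: collapse multi-edge weights to 0/1
    A = A - diag(diag(A));    % no self-loops: remove the diagonal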
However, given that we've done this, is there a way that we can recover the original power law? We can try. So what we have here: the original data that we cleaned up is now these circles. OK? And we're going to take that data set, compute an alpha and an N and M from it using our inversion formulas, and then compute what the power law of that would be. That's the triangles. So here's our poor man's alpha fit, and this is the model: these triangles here are the model, we're saying. And then we can say, aha, let's use the bins that came from this model to re-bin these circles onto these red plus signs here. So that's our new data set. And what you see is that we've done a pretty good job of recovering the original power law. So if we had data that we observed to look like this, we wouldn't be sure it was a power law. Like, we don't know. We'd say, well, what's this bend here? And what's this fanning out here? But if you go through this process and re-bin it, you can say, oh no, that really looks like a power law. So that's a way of recovering a power law that we may have lost through some filtering procedure.

Here's another example. What we're going to do is essentially take our matrix and compute the correlation of it. We've talked a lot about this: if I have an incidence matrix, we multiply it by itself to do correlation. In this case, we're treating our random matrix not as an adjacency matrix, but as an incidence matrix, a randomly generated incidence matrix. And these are, again, the parameters that we use. We're converting it to all unweighted, all 0's and 1's, and then we are correlating it with itself to construct the adjacency matrix: taking the upper triangular part, and then removing the diagonal. And this is the result of what we see. So here's our input model, again, the triangles.
And then this is the measured result, what we get out of that. If you saw this, you might say, wow, that's a really good power law. In fact, I've certainly seen data like this, and most of the time I would say, yep, that is a power law distribution; we absolutely have a power law distribution. However, we then apply our procedure. So again, we have our measured data. OK, we're going to do our parameters here, get our poor man's alpha parameter, and then fit; the triangles are the new fit. OK. And then we use those bins to re-bin. And we see here that when we actually re-bin the data, we get something that looks very much not like a power law distribution. So there's an example of the reverse. Before, we had data that didn't look like a power law, but when we re-binned it, we recovered the power law. Here we have data that, in its original binning, may look like a power law, but when we re-bin it, we see it has this bump. And then, continuing with our Halloween theme, we can call this the witch's nose distribution, because it comes along here as this giant bump and then goes back to a power law. And there's actually meaning to this; we will see it later in the actual data. When you do these correlation matrices, certain types of them, particularly self-correlations, will very likely produce this type of distribution.

But again, even though we have this bump, you would still argue that our linear power law is a very good first order fit. We've still captured most of the dynamic range of the distribution, and this is now a delta from that. And so we're very comfortable with that, right? We start with our linear model; that models most of the data. And then we have a second order. If we wanted to, we could go in and come up with some kind of second order correction: subtract the linear model from here and you would see some kind of hump distribution here.
And you could then model your data as a linear model plus some kind of correction. Again, that's a very classic signal processing way to deal with our data, and it certainly seems as relevant here as anywhere else.

Let's see here. And again, the power law can be preserved, as we talked about there. So, moving on, another phenomenon that's often documented in the literature is called densification; in fact, there are many papers written on what is called densification. This is the observation that if you construct a graph and you compute the ratio of edges to vertices over time, that ratio will go up. And there's a lot of research talking about the physical phenomena that might produce that type of effect. And while those physical phenomena might be there, it's also a byproduct of just sampling the data. So for instance here, what we're going to do is take the perfect power law graph we created and sample it. We're basically going to take subsamples of that data, and we're going to do it in little chunks, about 10% of the data at a time. The triangles and the circles show what happens when we look at each set of data independently, and then we have these lines that show what happens when we do it cumulatively: we basically take 10% of the data, then 20% of the data, then 30%, and so on. And we have two different ways of sampling our data here. Random means I'm just taking that whole matrix and randomly picking edges out of it. And what you see is that each sample has a relatively low edge to vertex ratio, but as you add more and more of them up, it gets denser. And this is simply the fact that, given a finite number of vertices, if you kept on adding edges and edges and edges, this ratio would eventually become infinite. If you add an infinite number of edges to a finite number of vertices, then it will get denser and denser and denser.
And this is sort of a byproduct of treating these as 0's and 1's, and not recognizing the repeated edges. So this just naturally occurs through sampling. The linear sampling here is where, basically, I'm taking whole rows at a time; I could also have taken whole columns at a time. So I'm taking each row and dropping it in, and there you see it's constant. I'm essentially taking a whole vertex and adding it at a time, and here the density is somewhat independent of the sampling: if you sample whole rows, the density of the result stays the same. So this is just good to know about sampling: these phenomena can take place, and sampling can play an important role in the data that we observe.

Another phenomenon that's been studied extensively is what happens to the slope of the degree distribution as you add data. And again, we do this exact same type of sampling. And you see here that this data had a slope, I believe, of 1.3. You can see that if we sample it randomly, just taking random vertices, the slope starts out very, very high, and each independent sample stays high. But when we start accumulating them, they start converging on the true value, converging from above. And likewise, when you do linear sampling, you approach from the opposite direction. And they both end up converging onto the true value here of 1.3. So again, this just shows that the slope of your degree distribution is also very much a function of the sampling. It could also be a function of the underlying phenomenon, but again, it's a cautionary tale that one needs to be very aware of how one is sampling. And again, these perfect power law data sets are a very useful tool for doing that.
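A minimal sketch of the kind of subsampling experiment just described, under my own assumptions about the details (cumulative random edge samples in 10% chunks, tracking the edge-to-vertex ratio; this is not the course's exact script):

    [i, j] = find(A);                        % edge list of the power law graph
                                             % (each nonzero treated as one edge)
    p = randperm(numel(i));                  % random order for edge sampling
    for frac = 0.1:0.1:1.0
      keep = p(1 : round(frac * numel(i)));  % cumulative random sample of edges
      v = unique([i(keep); j(keep)]);        % vertices touched by the sample
      MoverN = numel(keep) / numel(v)        % edge-to-vertex ratio: grows with frac
    end

The slope experiment is the same loop, with the poor man's estimator applied to the degree histogram of each sample instead of the ratio.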
So if you have a real data set and you're sampling it in some way, and you want to know what is maybe real phenomenon versus what is a sampling effect, you can go and generate a perfect power law that's an approximation of that data set. You can then very quickly see which phenomena are just a result of sampling a perfect power law, and which phenomena are maybe indicative of some deeper underlying correlations in the data. So again, a very useful tool here.

Moving on, we're going to talk about subsampling. One of the problems that we have is very large data sets; often we can't compute the degree distribution on the entire data set. This has been the bread and butter of signal processing for years: if we want to compute a background model, we don't simply sum up all the data. We randomly select data from the data set, and we use that as a model of our background. And that's a much more efficient way, from a computational and data handling perspective, than computing the mean or the variance based on the entire data set. So again, we need good background estimation in order to do our anomaly detection, and it's prohibitive to traverse all the data. So the question is, can we accurately estimate the background from a sample?

So let's see what happens. We have a perfect power law; we can look at what happens when we sample it. So we've generated a power law. OK. And note I've changed the plot: this may look like the degree distribution, but it's actually a different plot. This is showing every single vertex in the data set, and this shows the in degree of that vertex. And we've sorted them, so the highest degree vertex is over here and the lowest degree vertex is over here. OK, so this is all the vertices, and this is the true data. And this is what happens when we take a 1/40 sample. I just say, I'm only going to take 1/40 of the edges.
What does it look like? There's some relatively simple math here, which I won't go over, but it's there for you. We can actually come up with a correction that lets us account for the sampling in these degree distributions. I apologize for the slides; we will correct them for the web version.

All right, so moving on. When we talk about sampling, we talk mainly about single distributions, but we can also talk about joint distributions. We can use the degree as a way of labeling the vertices and look at them that way. It's a way of compressing: if we label each vertex by its degree, we compress many vertices into a smaller dimensional space. And we can then count the correlations. We can look at the distribution of how many edges there are from vertices of this degree to vertices of that degree. So it's a tool for projecting our data and understanding what's going on. And we can also then re-bin that data with a power law, which will make it more easily understood.

So if we look here, we see the degree distribution. This shows us, for perfect power law data, for vertices with this degree in and this degree out, how many edges there were between them. And as you see here, obviously, there were a lot of edges here between low degree vertices, and not so much over here. But this is somewhat misleading because of the way the data comes out; we're not really binning it properly. However, if we go and fit a perfect power law to this data, pick a new set of bins based on that, and re-bin the data, we can see here that we get a much smoother distribution. So while here we may have thought that this was an artificially low-density region, and this was artificially high, what you see when you re-bin it is that there's a very smooth distribution relative to what we expect for our perfect power law.
It's a fairly uniform distribution with respect to the model. Essentially, re-binning puts more bins where we actually have data, instead of wasting bins where we don't really have any data.

Using our perfect power law model, we can also compute analytically what this joint distribution should be. Here's an example of what that looks like, and again it's very similar to what we measure. And given the data and a model for the data, we can then compute the ratio of the observed to the model to get a sense of which data is unusual versus what we expect from the power law fit. We see that very clearly here. This plot is just the data divided by the model, or more precisely the log of that ratio, so zero means the ratio is essentially 1 and everything is as expected. And then we see all these fluctuations: things that are higher than we expected and things that are lower than we expected.

This is the classic situation you see whenever someone shows you a map of the United States, by county, of some phenomenon, the classic being a cancer cluster or heart disease. What you see is that certain counties in the western part of the United States look extremely healthy and certain counties look just deadly. And it's simply because they're very sparsely populated, so you're dealing with a small numbers effect. Basically, that plot is just showing you oscillations between counts of 0 and 1, what we call Poisson sampling, and that makes it very difficult to know which deviations are real. However, if we re-bin the data and then divide by the model, we see that the vast majority of our data, as expected, sits in the normal regime.
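A minimal sketch of that comparison, assuming `observed` and `model` are same-shape arrays of bin counts (for instance, the re-binned grids from the sketch above). The `min_expected` threshold of 5 is an assumed rule of thumb for masking bins dominated by Poisson fluctuation, not a value from the lecture.

```python
import numpy as np

def log_ratio(observed, model, eps=1e-12):
    """log10(observed / model) per bin; 0 means the bin matches the background
    model, positive values are surpluses, negative values are deficits."""
    return np.log10((observed + eps) / (model + eps))

def significant_log_ratio(observed, model, min_expected=5.0):
    """Same ratio, but with NaN in bins whose expected count is so small that
    pure Poisson fluctuation (counts of 0, 1, 2) dominates the ratio."""
    r = log_ratio(observed, model)
    return np.where(model >= min_expected, r, np.nan)
```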
Another thing we can do is look at the most unexpected bins, as well as the most typical bins. (These slide elements got moved around; I don't know what's going on with PowerPoint today.) But this shows us the surpluses, the deficits, and the most typical: the most overrepresented bin, the most underrepresented bin, and the most average bin, together with the regions they correspond to in the real data set. So you can use this to find extremes based on a statistical test.

And you see the same thing here. This plot shows the measured over expected versus the measured. This is the original data set and this is the re-binned data set, and you can see that the re-binning removes a lot of these very sparse points and gives you a very narrow distribution around what you expected.

You can also go and find selected edges if you want. This just shows the different types of edges: if you wanted to go look at them, these would be the maximum, these the minimum, and these the other types. So it's a useful thing. You can say, all right, we found an artificially high correlation between vertices of this degree and that degree; we can then backtrack, find out which specific vertices those are, and see whether anything interesting is going on.

We can also use this plot to look at questions of edge order. And hopefully this will work today. So here, if I randomly select vertices and compute their degree, basically what we did before, subsampling and then computing the observed over the expected, and we play the animation, you can see that when we randomly select the vertices, each sample looks very much the same. Again, up in the high degrees we have this Poisson sampling effect, where a bin with no vertices gives a count of 0, which is lower than expected, and a bin with one or two vertices comes out higher than expected.
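As a sketch of that kind of statistical test, the hypothetical helper below picks out the biggest surplus, the biggest deficit, and the most typical bin from an observed/model pair of count grids; the `min_expected` cutoff is again an assumed threshold for ignoring Poisson-dominated bins.

```python
import numpy as np

def extreme_bins(observed, model, min_expected=5.0):
    """Indices of the biggest surplus, biggest deficit, and most typical bin.

    'observed' and 'model' are same-shape arrays of bin counts. Bins whose
    expected count falls below min_expected (an assumed cutoff) are ignored,
    since Poisson fluctuation dominates them. 'Most typical' means the bin
    whose log ratio is closest to zero.
    """
    ratio = np.log10((observed + 1e-12) / (model + 1e-12))
    ratio = np.where(model >= min_expected, ratio, np.nan)
    flat = ratio.ravel()
    valid = np.flatnonzero(~np.isnan(flat))
    surplus = valid[np.argmax(flat[valid])]
    deficit = valid[np.argmin(flat[valid])]
    typical = valid[np.argmin(np.abs(flat[valid]))]
    return tuple(np.unravel_index(i, ratio.shape) for i in (surplus, deficit, typical))
```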
So again, we still have that Poisson effect even when we sample vertices randomly. Interestingly, if we do linear sampling, where we take whole rows at a time, we get a very different type of phenomenon. Whenever you do run into a high degree row, by definition it looks unusual. Which means you have to be careful: you're going to run into that high degree row eventually by sampling, and you don't want to conclude, oh my goodness, this is a very, very unusual thing, when it's really just an artifact of how you sampled. So again, a cautionary tale about sampling.

All right. We've talked a lot about the theory here, so let's get into some real data. This is our Reuters data again; I showed it to you in the last lecture, along with the various document distributions we had. In this case there are 800,000 documents and 47,000 extracted entities, for a total of essentially 6,000,000 edges. So it's a bipartite graph [INAUDIBLE] between documents and entities, with four different entity types. We can now look at the degree distributions of the different classes and see what we have.

The first ones we want to look at are the locations, so we look at the distribution of the documents and of the entities. To be very clear about what I'm doing: I'm taking just this part of the matrix, and the distribution over documents comes from summing along the rows, while the distribution over locations comes from summing down the columns. We can do that for each one of the types, so we have essentially two different degree distributions, one associated with the documents and one associated with the entities.
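Here is a small sketch of those two sums on a sparse document-by-entity incidence matrix. The matrix `A` below is synthetic, uniformly random stand-in data (so it is not power law, and its dimensions are made up); the point is only the mechanics of getting document degrees from row sums and entity degrees from column sums.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for a document x entity incidence matrix: rows are
# documents, columns are entities of one type (say, locations), and
# A[i, j] = 1 if entity j appears in document i.
rng = np.random.default_rng(0)
n_docs, n_entities, n_links = 5000, 800, 40000
A = sparse.coo_matrix(
    (np.ones(n_links),
     (rng.integers(0, n_docs, n_links), rng.integers(0, n_entities, n_links))),
    shape=(n_docs, n_entities)).tocsr()
A.data[:] = 1                                       # collapse duplicate links

doc_degree = np.asarray(A.sum(axis=1)).ravel()      # sum along each row
entity_degree = np.asarray(A.sum(axis=0)).ravel()   # sum down each column

def degree_distribution(degrees):
    """How many vertices have each degree value (the curves on the slides)."""
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    return values, counts

vals, cnts = degree_distribution(doc_degree)
print("document degrees:", vals[:5], "with counts:", cnts[:5])
```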
So this shows our document distribution: we have the measured data, our fit, our model, and then our re-binning. And you could say that this is approximately a power law, and that when you re-bin it, this sort of S-shaped effect is still there, which probably means it's really there: something in the data really is making this bowing effect. Likewise over here, we have the measured data, our model alpha, which is the blue line, the model fit, and then our re-binning. And again, you could say the power law model is a pretty good fit, but we have some additional phenomenon going on here as well.

We can then do this for each type. We can look at the organizations. We don't have as many organizations, and we see similar kinds of behavior, but this is so sparse that it's difficult to really say what's going on.

Then of course we do people, which are always the first thing you talk about when you talk about power law distributions. And again, very nicely, we have our measured data and our fit. We see this sort of bent broom shape, bending and then fanning out. But when we model it and re-bin it, we get something that looks much more like a true power law, and you can see that very nicely with the actual person data: a very good power law. So this tells us that regardless of what the underlying raw plot looks like, this probably really is a power law distribution.

And then we have the times. Again, a similar type of thing. We actually have a little spike here: the Reuters data has a certain set of times associated with the actual filing of events. There are only 35 of them, so we do get a little spike, which is actually what we expect. You wouldn't see that clearly in the raw data, but when you re-bin it, this bump comes out fairly clearly.
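The exponent fits referred to here can be approximated with a very simple log-log least-squares heuristic; the sketch below is that heuristic, not a maximum-likelihood estimator and not necessarily the fitting procedure behind the slides. The commented usage with `doc_degree` refers to the hypothetical degrees computed in the earlier sketch.

```python
import numpy as np

def fit_power_law(degrees):
    """Fit count(d) ~ C * d**(-alpha) by least squares in log-log space.

    A quick heuristic on the raw degree counts -- not a maximum-likelihood
    estimator, and not necessarily the fitting procedure behind the slides.
    Returns (alpha, C).
    """
    values, counts = np.unique(degrees[degrees > 0], return_counts=True)
    slope, intercept = np.polyfit(np.log10(values), np.log10(counts), 1)
    return -slope, 10.0 ** intercept

# Hypothetical usage with the document degrees from the previous sketch:
# alpha, C = fit_power_law(doc_degree)
# model_count_at_degree_d = C * d ** (-alpha)
```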
So again, proper binning is extremely important. We can look at our correlations as well. Let's just look at the person-person correlations. This is the raw data, and it looks very much like a power law. But when we go through our re-binning process, we see this kind of witch's nose effect that really is in the data. So to first order it's a power law, but you actually have this correlation sitting right here; that's something that really seems to be going on. You see the same thing when we do time, and we can look at documents as well.

Let's now look at sampling. This is the same sampling experiment we did before, and we're going to look at the document densification. This is selecting whole rows: we select a whole document, so we're getting a whole row. This shows the four different entity types, and you see that they behave exactly as expected. Each individual sample is reflective of the overall densification, because you're taking whole rows.

Now let's take entities instead, so now we're cutting across the rows. And now you see something that looks much more like random sampling. When you randomly select an entity, that's essentially a random set of documents. The individual samples are sparse, but when you start summing them up, they get denser and denser. So an individual document is a reasonably good sample of the overall distribution, whereas when you pick an entity, say a person, you get more and more as you take a higher fraction of the data. So again, all consistent with what we saw before. It's a little noisier to see here, but trust me, the power law exponent also behaves exactly as we expected. Likewise, this is essentially the linear sampling and here is the random sampling, behaving exactly as we expected.
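A minimal sketch of one way to run that comparison on a sparse document-by-entity matrix: accumulate whole rows (documents, the linear sampling) or whole columns (entities) in a random order and watch how quickly the nonzeros pile up. The function and its arguments are hypothetical; the commented usage assumes the synthetic matrix `A` from the earlier sketch.

```python
import numpy as np

def densification_curve(A, axis, fractions, rng):
    """Nonzeros accumulated while sampling whole rows (axis=0) or whole
    columns (axis=1) of a scipy.sparse document x entity matrix A, taken
    in a random order.

    Sampling whole rows corresponds to the 'linear sampling' of documents in
    the lecture; sampling whole columns picks entities instead.
    """
    n = A.shape[axis]
    order = rng.permutation(n)
    nnz = []
    for frac in fractions:
        keep = order[: max(1, int(frac * n))]
        sub = A[keep, :] if axis == 0 else A[:, keep]
        nnz.append(sub.nnz)
    return np.array(nnz)

# Hypothetical usage with the synthetic matrix A from the earlier sketch:
# rng = np.random.default_rng(1)
# fracs = np.linspace(0.05, 1.0, 20)
# by_documents = densification_curve(A, 0, fracs, rng)   # whole rows
# by_entities = densification_curve(A, 1, fracs, rng)    # whole columns
```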
We can also look at the joint distributions. So here's the location cross-correlation, showing the document degree versus the entity degree. This is the measured data, and it's been re-binned. This is the measured divided by the expected, that is, the expected re-binned. This is the measured divided by the model. And here's the model, and the expected divided by the model. You can compare all these different combinations to create different statistical tests. In particular, from the measured re-binned divided by the expected re-binned, we can get our surpluses and deficits and other such features in the actual data. Here you see something that maybe looks like an artificially high grouping, some artificially low regions, and some that are as expected. And you can then use these as ways to go find anomalous documents.

Or you can find the most typical documents. People ask, well, why would you try to find the most normal ones? A lot of times, for summarization, you want to be able to say, here is a very representative set of documents: their statistical properties are very consistent with everything else. Because people will [INAUDIBLE] like, what does it mean to be typical? So again, a very useful way to look at the data.

We do the same thing with the organizations: the measured re-binned, the measured divided by the expected, the expected re-binned, the model, the refit model, and the various ratios of all of them, which you can use to find outliers and such. Again, very useful. It's interesting that it picked the most representative bin way up here, so that's a rather unusual representative sample. And persons, you can do the same thing.
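To make the backtracking step concrete, here is a hypothetical helper that maps one extreme (document-degree, entity-degree) bin back to the actual document/entity pairs that land in it; the argument names and the bin-edge convention are assumptions, not the lecture's code.

```python
import numpy as np

def pairs_in_bin(A, doc_degree, entity_degree, doc_bin, ent_bin, bin_edges):
    """Backtrack from one (document-degree, entity-degree) bin to the actual
    document/entity pairs whose endpoint degrees fall in that bin.

    A is the sparse document x entity matrix, doc_bin/ent_bin are bin indices
    (e.g. the surplus bin found earlier), and bin_edges are the degree bin
    boundaries used for the re-binning. Returns (document ids, entity ids).
    """
    rows, cols = A.nonzero()
    d_doc = doc_degree[rows]
    d_ent = entity_degree[cols]
    mask = ((d_doc >= bin_edges[doc_bin]) & (d_doc < bin_edges[doc_bin + 1]) &
            (d_ent >= bin_edges[ent_bin]) & (d_ent < bin_edges[ent_bin + 1]))
    return rows[mask], cols[mask]
```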
One of the things I'm also trying to show is that these distributions don't all look the same. Hopefully, from looking at them, you can see that the location distribution is a little similar to the organization distribution, but the person distribution looks very different, and the time distribution looks different again. So you have to be careful: sometimes we just take all these different categories and lump them together into one big distribution, and here's a situation where they really are pretty different things. You really want to treat them as four distinct classes.

As an example of selected edges, this just shows the most typical: the various entities that were selected and a very representative document. And this is a very low degree one, with its entities, and here's the surplus example. We could go in and actually find those edges. Same with the person data: very generic, higher degree people here, and here's a very low degree one. This person, Jeremy Smith, is very unusual in terms of what they connected with. And surplus ones here. You can't really read that from where you're sitting, but it gives you examples of how you can use this to go in and find things.

All right. So that brings us to the end of the lecture part. Again, developing this background model is very important for graphs. Basing it on the perfect power law gives us a very simple heuristic for creating a linear model. We can then really quantify the effects of sampling, which is very important: traditional sampling approaches can easily create nonlinear phenomena that we have to be careful of and aware of. It also lets us develop techniques for comparing real data with power law fits, which we can then use as statistical tests for finding unusual bits of data. This is very classic detection theory: come up with a model for the background, create a linear fit for that background,
and then use that model to quantify the data and see which things are unusual. Again, these are very classic detection theory techniques, with the background model being the linchpin of the whole thing. And I should say this is very recent work. This power law model is something we did in the last year or so; we just published it this summer. I can't guarantee that three or four years from now people will still be using this particular model, because it is very new. But I think it is representative of the kinds of things people will be using in three, four, or five years to characterize this kind of data; it may not be exactly this, but something like it. So I think it's very useful.

All right. So with that, we will take a short break, and then we will show the example code and talk about the assignment. So, very good. Thank you very much.