The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: What we're going to talk about today: in the previous class, we did a lot of examples on some data sets, and they were artificially generated data sets. That's practically a requirement if you want to do anything with big data and really show stuff, because it can be really difficult to pass around terabytes and terabytes of data. So data generators that produce the statistics you want, or approximate the statistics you want, are good to have. We had a previous lecture on power law graphs and talked about the perfect power law method, which is very useful, and I encourage you to use it; it's a useful analytic tool. Today we're going to talk about Kronecker graphs and their generation. Kronecker graphs are used in the largest benchmark in the world for benchmarking graph systems.
And then finally, I'm going to talk a little bit about performance: what you should be aware of when you're building systems, from a performance perspective, the kinds of things that are essentially fundamental limits of the technology.

So moving forward, let's talk about the Graph500 benchmark. The Graph500 benchmark was created to provide a mechanism for timing the world's most powerful computers and seeing how good they are at graph-type operations. I was involved, and actually wrote some of the initial code for this benchmark about 5 or 10 years ago. It has since become a community effort, and the code has changed, among other things. It's mainly meant for parallel in-memory computations, but the benchmark is actually very useful for timing databases or other systems; you can do the exact same operations. It generates data, has you construct a graph from it, and then goes on to do some other operations.
And we've actually found it to be a very useful tool as a very fast, high-performance data generator. If you want to time inserts of power law data, it's very useful for that. These are some older performance numbers, but they show, on a single core of computing capability, the number of entries we're inserting into a table and the inserts per second we're getting. The single-core performance here, for this data going into Accumulo, is about 20,000 inserts per second. That's a one-core database, very, very small. As we showed last week, on modern servers we're getting several thousand inserts per second on standard databases.

I'm comparing this also with just D4M without a database: if I just have triples and construct an associative array, how fast does that go? Obviously, D4M is limited by how much memory you have on your system, but this shows you that constructing an associative array in memory is faster than inserting data into a database by a good amount.
So again, if you can work with associative arrays in memory, that's going to be better for you. Obviously, though, the database allows you to go to very large sizes. And one of the great things about Accumulo is that this line stays pretty flat; you can keep going out, and out, and out. That's what it really prides itself on: as you add data, performance might degrade slightly, but if you're inserting it right, that line stays pretty flat.

In the last class, we showed how to do parallel inserts in parallel D4M. This shows parallel inserts on a single-node database: D4M itself basically takes this number and scales it up, and likewise into the database. So here a single-node, single-core database is getting 100,000 inserts a second. That shows you some of the parallel expectations, and when you should use parallel databases. Again, if you can be in memory, you want to be in memory; it's faster. If you can be in parallel memory, you probably want to be in parallel memory; that's faster.
But if you have really large problems, then obviously you need to go to the database. That's what it's for.

The data in the Graph500 benchmark is what we call a Kronecker graph. It basically takes a little, tiny seed matrix, in this case G, and you can imagine what it looks like here. Imagine this is a two-by-two matrix; you can maybe see a little bit of the structure. It then takes the Kronecker product of that matrix with itself repeatedly to create an adjacency matrix, and that adjacency matrix defines the graph. As a result, it naturally produces something that looks like a power law graph. The dots show the power law degree distribution of that graph; you can see the slope generated from this.

I should say, one of the powerful things about this particular generator is that you don't have to form the adjacency matrix to construct the graph. It doesn't require you to build this gigantic matrix; it generates the edges atomically.
If you have many, many processes all doing this, as long as they set their random number seeds to different locations, they'll all be generating perfectly consistent edges in a larger graph. People have run this on computers with millions of cores. Because the graph generator is completely parallel, it allows you to essentially create a giant graph.

The Kronecker graph does also generate this structure here, these bumps and wiggles, and that actually corresponds to the power k. You can count the number of bumps: 1, 2, 3, 4, 5, 6, 7, 8. So this would be a G-to-the-8th graph. That structure is visible there, and it is an artificial feature of the data.

One thing that's actually occurred now, because this generator is so fast and efficient, is that people have been trying to use it to simulate graphs. And we've basically said to people, it's not really designed to simulate large graphs, because it does create these artificial structures. That's the reason we developed the perfect power law method.
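The seeded, embarrassingly parallel edge generation described above can be sketched in a few lines. This is a minimal illustration, not the actual Graph500 code: each edge is drawn independently by descending the levels of a 2^scale-vertex adjacency matrix and picking a quadrant at each level according to a 2x2 seed of probabilities. The probability values below are the commonly quoted R-MAT parameters, and the function names are my own.

```python
import random

def kronecker_edge(scale, probs, rng):
    """Draw one edge of a 2**scale-vertex Kronecker (R-MAT-style) graph
    by descending `scale` levels, choosing a quadrant at each level
    according to the 2x2 seed probabilities ((a, b), (c, d))."""
    (a, b), (c, d) = probs
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row, col = 2 * row, 2 * col
        if r < a:
            pass                          # top-left quadrant
        elif r < a + b:
            col += 1                      # top-right quadrant
        elif r < a + b + c:
            row += 1                      # bottom-left quadrant
        else:
            row, col = row + 1, col + 1   # bottom-right quadrant
    return row, col

def generate_edges(scale, n_edges, probs, seed):
    # Seeding the stream makes each worker reproducible, so many processes
    # with different seeds generate consistent pieces of one big edge list
    # without ever forming the giant adjacency matrix.
    rng = random.Random(seed)
    return [kronecker_edge(scale, probs, rng) for _ in range(n_edges)]

edges = generate_edges(scale=10, n_edges=1000,
                       probs=((0.57, 0.19), (0.19, 0.05)), seed=1)
```

Because each worker owns its own seeded stream, re-running with the same seed reproduces exactly the same edges, which is what makes million-core runs of this kind of generator consistent.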
And that's a much better method if you're actually trying to simulate graphs. Again, the Kronecker method has its advantages in terms of being able to generate graphs at scale.

Another advantage of these Kronecker graphs: in this case, B is my little matrix that I'm going to take Kronecker products of. In fact, let me remind you what a Kronecker product is. Suppose I have a matrix B that's mb by nb (it doesn't have to be square) and another matrix C that's mc by nc (it doesn't have to be square either). When you take the Kronecker product, what you're basically doing is taking C and multiplying it by the first element of B to form a block here, and likewise for every other element. So it's a way of expanding the matrix in both dimensions.

And Jure Leskovec, who's now a professor at Stanford, essentially developed this method back in 2005 for creating these matrices, which allow you to very naturally create power law matrices, and he even had tools for fitting them to data, which is a useful thing.
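As a quick sketch of that definition, NumPy's `kron` makes the block expansion explicit (the small matrices here are made up purely for illustration):

```python
import numpy as np

B = np.array([[1, 2],
              [3, 4]])
C = np.array([[0, 1],
              [1, 0]])

# kron(B, C) replaces each entry B[i, j] with the block B[i, j] * C,
# expanding the matrix in both dimensions.
K = np.kron(B, C)

# The top-left block is B[0, 0] * C = 1 * C ...
assert (K[:2, :2] == B[0, 0] * C).all()
# ... and the result is (mb*mc) by (nb*nc).
assert K.shape == (4, 4)
```

The shape rule is the point: an mb-by-nb matrix Kronecker a mc-by-nc matrix gives an (mb*mc)-by-(nb*nc) matrix, which is why repeated Kronecker products of a tiny seed grow a graph exponentially fast.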
Again, its biggest impact has been on generating very large power law graphs very quickly.

The Kronecker graph itself comes in several different types, which I call explicit, stochastic, and an instance. If I just set the coefficients of the Kronecker seed to be ones and zeros and Kronecker it out, I get structures that look like this. This is a Kronecker graph that is essentially bipartite, starting with what you could call a star graph plus a diagonal. So that's G1. Then I take the Kronecker product again to get G2 and G3, and I get these differently structured 0-1 matrices. Since they contain only 0's and 1's, the structure and the connections are very obvious. And this is actually useful for doing theory: you can do a fair amount of theory on the structure of graphs using the 0-1 matrices. In fact, there's a famous paper by Van Loan on Kronecker products.
Basically, the paper is full of identities of Kronecker products. For example, the eigenvalues of the Kronecker product of two matrices are the same as the Kronecker product of their eigenvalues, and there's a whole series of identities like that. What's nice is that if you compute something on just the small matrix G, you can very quickly compute the corresponding property of the full Kronecker product matrix, and that makes it very nice for doing theory. So when you want to understand the structure of these things, it's a very useful theoretical tool.

If instead our seed graph contains, rather than 1's and 0's, numbers between 0 and 1 that are probabilities (the probability of creating an edge between two vertices), and we multiply that out, we get essentially a giant probability matrix giving the probability of each edge. The structure is the same, but now we have values between 0 and 1, not just 0's and 1's. Then, using the Kronecker generation technique, we draw instances from this probability matrix.
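That eigenvalue identity is easy to check numerically. A small sketch, using random symmetric matrices so the eigenvalues are real and easy to sort:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_sym(n):
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2            # symmetric => real eigenvalues

B = rand_sym(3)
C = rand_sym(4)

# Van Loan-style identity: the eigenvalues of kron(B, C) are exactly the
# pairwise products of the eigenvalues of B and of C, i.e. the Kronecker
# product of the two eigenvalue lists.
lhs = np.sort(np.linalg.eigvalsh(np.kron(B, C)))
rhs = np.sort(np.kron(np.linalg.eigvalsh(B), np.linalg.eigvalsh(C)))
assert np.allclose(lhs, rhs)
```

This is why computing a property on the tiny seed matrix is enough: the spectrum (and many other properties) of the huge product matrix follows directly from the seed's.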
So this allows us to randomly sample. When you're talking about Kronecker graphs, then, there are three different types. The explicit kind is very easy to do theory on. You can do theory on the stochastic kind, too, but it tends to be a little more work. And when you're doing simulations, you usually end up with instances like this.

As an example of the kind of theory we can do: some of the nicest graphs to work with are bipartite graphs. A bipartite graph has two sets of vertices, and each vertex has a connection to every vertex in the other set, but there are no connections within a set. Here, I'm showing you a bipartite, or star, graph: one set of vertices here and another set of vertices here. All the vertices in one set connect to those in the other, but they have no connections among themselves. And this is the adjacency matrix of that graph. Here is another bipartite graph; it's a 3,1 set. And if I take the Kronecker product of two bipartite graphs, the result is two more bipartite graphs. As we see here, we've created this one here and this one here.
In my notation, I'll say B(5,1) is bipartite: I'm saying I have a matrix that's a 5,1 bipartite graph, and here a 3,1 bipartite graph. And to within some permutation, they produce a matrix that is the union of a 15,1 and a 3,5 bipartite graph. This result was actually first shown by a professor by the name of Weichsel in 1962. There was a little flurry of activity back in the 1960s around this whole algebraic view of graph theory, which was very productive, but then it all died off. After the lecture, I can tell you why that happened. But now it's all coming back, which is nice. So it took us 50 years to rediscover this work. And there's actually a whole book on it with lots of interesting results.

This just shows you the general result: for any two bipartite graphs, their Kronecker product is, under some permutation, equal to the union of two other bipartite graphs. And this naturally shows you how the power law structure gets built up when you Kronecker product these graphs. So basically, you had this and this.
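Weichsel's result can be checked directly on small adjacency matrices: the Kronecker product of the B(5,1) and B(3,1) stars splits into exactly two connected components, a 16-vertex B(15,1) star and an 8-vertex complete bipartite K(3,5). A sketch, with helper names of my own:

```python
import numpy as np
from collections import deque

def star(n):
    """Adjacency matrix of the bipartite star graph B(n,1):
    one hub vertex (index 0) connected to n leaves."""
    A = np.zeros((n + 1, n + 1), dtype=int)
    A[0, 1:] = A[1:, 0] = 1
    return A

def components(A):
    """Connected-component sizes (sorted) via BFS on an adjacency matrix."""
    seen, sizes = set(), []
    for s in range(len(A)):
        if s in seen:
            continue
        q, size = deque([s]), 0
        seen.add(s)
        while q:
            v = q.popleft()
            size += 1
            for w in np.flatnonzero(A[v]):
                if int(w) not in seen:
                    seen.add(int(w))
                    q.append(int(w))
        sizes.append(size)
    return sorted(sizes)

# Weichsel (1962): B(5,1) kron B(3,1) = B(15,1) union K(3,5),
# i.e. components of sizes 16 and 8 among the 24 product vertices.
K = np.kron(star(5), star(3))
assert components(K) == [8, 16]
```

The hub-hub vertex pairs with all 15 leaf-leaf vertices to form the new star, while the 3 hub-leaf and 5 leaf-hub vertices form the complete bipartite piece; that is exactly the "within some permutation" bookkeeping in the slide.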
And here, basically, that's the supernode vertex, this 1,1 vertex; these are the lower-degree pieces; and these other vertices are the singleton vertices. So you can see that as you take products of these bipartite graphs, you naturally create these power law graphs.

As an example (I won't really belabor the details), you can compute the degree distribution. There's a long chapter on this in the book. Basically, you can compute analytically the degree distributions of Kronecker products of bipartite graphs. This shows you the formula for doing that, and here is the actual degree distribution with the binomial coefficients. So it's a very useful thing; you can work it out a priori. This shows the results: if I take a bipartite graph with n equal to 5,1 and Kronecker it out to the 10th power, the result is this blue line.
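The binomial coefficients come from the fact that degrees multiply across Kronecker factors: in B(n,1) to the Kronecker power k, a vertex assembled from j hub choices and k-j leaf choices has degree n^j, and there are C(k,j) * n^(k-j) such vertices. That underlying fact (the book's formula may be written in a different but equivalent form) can be checked against an explicitly built matrix for small n and k:

```python
import numpy as np
from math import comb
from collections import Counter

def star(n):
    """Adjacency matrix of the bipartite star graph B(n,1)."""
    A = np.zeros((n + 1, n + 1), dtype=int)
    A[0, 1:] = A[1:, 0] = 1
    return A

n, k = 3, 3
A = star(n)
K = A
for _ in range(k - 1):
    K = np.kron(K, A)          # B(n,1) to the Kronecker power k

# Degrees in a Kronecker product are products of the factor degrees, so
# the product graph has comb(k, j) * n**(k - j) vertices of degree n**j.
observed = Counter(int(d) for d in K.sum(axis=1))
predicted = Counter()
for j in range(k + 1):
    predicted[n ** j] += comb(k, j) * n ** (k - j)
assert observed == predicted
```

On a log-log plot those (degree, count) pairs fall on a line of slope about -1, which is the power law slope discussed next.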
And what you see is that if you connect the last vertex and this vertex (essentially our poor man's alpha), or connect any one of these other sets, they have a constant slope of negative 1. So the theoretical result is that they have this constant slope. And that's true down here, k to the 5, and so on. We do get this bowing effect, so it's not perfect in that respect. But you can see how this power law is baked into the Kronecker generator, which is a very nice thing.

This shows a sample. We basically created a 4,1 bipartite graph and then sampled it, so this shows the stochastic instance. There are about a million vertices here. And then these curves show what happens when we take the stochastic graph and analytically compute the expected value of every single degree, assuming a Poisson distribution. You can see that we get pretty much a perfect fit all the way out, except at the very end.
And this is where you get some pretty interesting statistics. In theory, given this distribution, no one vertex should actually be attracting this many neighbors. But you have so many vertices that when you sum them all together, one of them gets to be chosen the winner, and that's what pops it up here. So our fit is very good all the way out to the supernode, where we have this difference, and that's where the Poisson statistics begin to fail. Again, lots of opportunities for very interesting theory there.

So that's just pure bipartite graphs. But bipartite graphs have no inter-group connections. So we can create something that looks like a power law graph and see all the community substructure very clearly, but those communities are not connected in any way. To create connections between the communities, we have to add the identity matrix down the diagonal. That will then connect all our sub-bipartite graphs together.
So this just shows that we take our bipartite graph plus the identity, and that's essentially this combinatorial sum of the individual bipartite graphs. Again, where it's in quotes, it means there's a permutation you need to actually make it all work out.

If we actually do the computation, this shows the 4,1 bipartite-plus-identity graph out to the fourth power. So you see here you get 1, 2, and you can compute it on out. And what you see is this recursive, almost fractal-like structure. That's a nice way to view it; that's one way to see it. But you don't really see the bipartite substructure in there; it's hard to see what's going on. Well, since this is analytic, we can actually compute a permutation that recovers all those bipartite substructures. And so in the software, we have a way of computing a permutation that basically regroups all our bipartite groups.
Then you can see all the bipartite cores, or bipartite pieces, and then all the interconnections between those core bipartite pieces. To give you a better sense of what that really means: here, basically, we have B plus I to the third Kronecker power; that's what it looks like. If we subtract out the first- and second-order components, we're left with the interconnecting pieces, and we can see those much better when we permute them. So here is the permutation of just the bipartite piece, and that shows you the core bipartite chunks that are in the graph. We then do the second term, which is B kron B kron I; now you can see that this creates connections between these groups. And likewise, the next one, B kron I kron B, shows you the connections between these groups, and then I kron B kron B shows you the different connections in these groups.
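The decomposition into terms like B kron B kron I works because the Kronecker product distributes over addition: (B + I) kron (B + I) kron (B + I) expands into exactly eight terms, the dense core B kron B kron B plus seven interconnection terms. A small check, using a 4,1 star as B:

```python
import numpy as np
from itertools import product
from functools import reduce

def star(n):
    """Adjacency matrix of the bipartite star graph B(n,1)."""
    A = np.zeros((n + 1, n + 1), dtype=int)
    A[0, 1:] = A[1:, 0] = 1
    return A

B = star(4)
I = np.eye(len(B), dtype=int)

# kron distributes over +, so the third Kronecker power of (B + I) is the
# sum of all 2**3 = 8 terms of the form X1 kron X2 kron X3 with Xi in {B, I}.
# The pure B kron B kron B term is the dense core; the mixed terms are the
# interconnections between the bipartite pieces.
full = reduce(np.kron, [B + I] * 3)
expanded = sum(reduce(np.kron, terms) for terms in product([B, I], repeat=3))
assert (full == expanded).all()
```

Subtracting selected terms from `full`, as in the slides, isolates exactly the interconnection structure each cross term contributes.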
So when we take B plus I and multiply it out and look at all the different combinations of the multiplies, there's the core piece, which is just B kron to the 3, and that creates this core structure here, these dense pieces; all the other terms of the polynomial essentially create interconnections between them. So you can get a really deep understanding of the core structure and how those things connect together, which can be very interesting.

And we can compute these interconnected structures fairly nicely. Here's a higher-order example going out to the fifth power. We can compute the structure, this chi structure here, which is essentially a summary of these, by just looking at a two-by-two. Since we have a bipartite thing, we can collapse all those pieces down and just have a two-by-two. So we have here a little tiny version of this structure; it shows you that. Likewise, this structure here is summarized by that, and this structure here is summarized by that.
So you can get complete knowledge, not just at the low-level scale, but at all scales: if you want to look at the blocks and how the edges are connected in detail, or if you just want to consider things in big blocks and how they connect, you can do that very nicely as well.

And again, we can compute the degree distributions. This shows B kron to the k, then B plus the second-order terms, and then moving on out. What you see is that adding this identity matrix basically slides the degree structure over. It also changes the slope a little bit, so it's steeper than the original.

And you can do other types of things. Here's something we did on isoperimetric ratios. I won't belabor it, but it's essentially a way of computing the surface-to-volume ratio of subgraphs, and we can compute it analytically. We see that the isoperimetric ratio of a bipartite subgraph is constant, but the half bipartite graph has this property here.
440 00:23:29,090 --> 00:23:30,640 And just to summarize here-- and this 441 00:23:30,640 --> 00:23:32,620 is done a great deal in the chapter-- 442 00:23:32,620 --> 00:23:34,460 this just shows different examples 443 00:23:34,460 --> 00:23:37,050 of the theoretical results you can compute 444 00:23:37,050 --> 00:23:41,510 for a bipartite graph or a bipartite plus an identity, 445 00:23:41,510 --> 00:23:43,530 the degree distribution. 446 00:23:43,530 --> 00:23:45,820 Betweenness centrality is a fairly complicated metric. 447 00:23:45,820 --> 00:23:47,360 We actually haven't figured it out for this one, 448 00:23:47,360 --> 00:23:49,770 or I didn't bother to figure it out for this one. 449 00:23:49,770 --> 00:23:52,640 I can compute the diameter very nicely. 450 00:23:52,640 --> 00:23:54,310 You can compute the eigenvalues. 451 00:23:54,310 --> 00:23:57,130 And the iso-parametric ratios are just all examples 452 00:23:57,130 --> 00:23:59,742 of the kind of theoretical work you can do. 453 00:23:59,742 --> 00:24:01,200 And again, I think it's very useful 454 00:24:01,200 --> 00:24:04,160 if you want to understand the substructure 455 00:24:04,160 --> 00:24:06,330 and how the different parts of a power law graph 456 00:24:06,330 --> 00:24:10,390 might interact with each other. 457 00:24:10,390 --> 00:24:13,090 So that just talks about Kronecker graphs 458 00:24:13,090 --> 00:24:16,540 and what the bases of these benchmarks are. 459 00:24:16,540 --> 00:24:20,070 And now I'm going to get into some benchmarks again 460 00:24:20,070 --> 00:24:20,570 themselves. 461 00:24:23,460 --> 00:24:26,130 So this is just to remind folks. 462 00:24:26,130 --> 00:24:28,760 This just shows some examples of when 463 00:24:28,760 --> 00:24:34,382 you're doing benchmarking of inserts into a database. 
464 00:24:34,382 --> 00:24:36,090 Normally, when you do parallel computing, 465 00:24:36,090 --> 00:24:42,830 you have one parameter, which is how many parallel processes are 466 00:24:42,830 --> 00:24:45,510 you running at the same time. 467 00:24:45,510 --> 00:24:47,410 And so most of your scaling curves 468 00:24:47,410 --> 00:24:52,430 will be with respect to, in this case, 469 00:24:52,430 --> 00:24:54,210 the number of concurrent processes 470 00:24:54,210 --> 00:24:56,120 that you're running at one time here. 471 00:24:56,120 --> 00:24:58,770 I think we're going up to 32 here. 472 00:24:58,770 --> 00:25:01,370 That's the standard parameter in parallel computing. 473 00:25:01,370 --> 00:25:04,710 When you have parallel databases involved, 474 00:25:04,710 --> 00:25:08,030 now you have a couple of additional parameters, which 475 00:25:08,030 --> 00:25:10,820 just make it that much more stuff 476 00:25:10,820 --> 00:25:13,330 you have to keep in your head. 477 00:25:13,330 --> 00:25:15,520 Essentially, in this case, Accumulo 478 00:25:15,520 --> 00:25:17,410 calls its individual data servers 479 00:25:17,410 --> 00:25:20,110 tablet servers, so you have to be aware of how many tablet 480 00:25:20,110 --> 00:25:21,670 servers you have. 481 00:25:21,670 --> 00:25:23,089 So that's another parameter. 482 00:25:23,089 --> 00:25:24,880 So you have to create separate curves here. 483 00:25:24,880 --> 00:25:28,260 This is our scaling curve with a number of ingest processes 484 00:25:28,260 --> 00:25:31,870 into one tablet server versus number of ingest processes 485 00:25:31,870 --> 00:25:33,940 into six tablet servers. 486 00:25:33,940 --> 00:25:37,270 And as you can see, not a huge difference here 487 00:25:37,270 --> 00:25:38,030 at the small end. 488 00:25:38,030 --> 00:25:44,480 But eventually, the one tablet server database levels off 489 00:25:44,480 --> 00:25:46,650 while the other one keeps on scaling. 
490 00:25:46,650 --> 00:25:48,360 And so this is very tricky. 491 00:25:48,360 --> 00:25:50,280 If you want to do these kinds of results, 492 00:25:50,280 --> 00:25:53,740 you have to be aware of both of these parameters 493 00:25:53,740 --> 00:25:56,520 in order to really understand what you're doing. 494 00:26:00,050 --> 00:26:01,770 There's also a third parameter which 495 00:26:01,770 --> 00:26:04,412 is generally true of most databases, 496 00:26:04,412 --> 00:26:05,870 and Accumulo is no exception, which 497 00:26:05,870 --> 00:26:10,260 is that when you're inserting a table into a parallel database, 498 00:26:10,260 --> 00:26:15,740 that table is going to be split amongst the different parallel 499 00:26:15,740 --> 00:26:17,740 servers in the database. 500 00:26:17,740 --> 00:26:22,150 And that will have a big impact on performance. 501 00:26:22,150 --> 00:26:25,536 How many splits you have-- because you're not 502 00:26:25,536 --> 00:26:27,660 going to be able to take advantage of-- if you have 503 00:26:27,660 --> 00:26:30,550 fewer splits than the number of database server nodes, 504 00:26:30,550 --> 00:26:32,390 then you're not taking advantage of those. 505 00:26:32,390 --> 00:26:35,250 And then how balanced those splits are. 506 00:26:35,250 --> 00:26:39,490 Is the data going in uniformly into those things? 507 00:26:39,490 --> 00:26:42,720 And so this just shows you an example of an 8 tablet 508 00:26:42,720 --> 00:26:45,000 server Accumulo doing the inserts 509 00:26:45,000 --> 00:26:48,550 with no splits versus doing it, in this case, with 35 splits. 510 00:26:48,550 --> 00:26:51,860 And you see there's a rather large impact on the performance 511 00:26:51,860 --> 00:26:52,390 there. 512 00:26:52,390 --> 00:26:54,420 So D4M actually has tools. 513 00:26:54,420 --> 00:26:55,670 We didn't really go into them. 514 00:26:55,670 --> 00:26:58,780 But if you look at the documentation, 515 00:26:58,780 --> 00:27:00,650 we're allowing you to do splits. 
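As a rough illustration of pre-splitting, here is a hypothetical helper, not the actual D4M or Accumulo API, that generates evenly spaced split points for a table whose row keys are uniformly distributed hex strings; in practice you would hand such points to the database's split mechanism so that ingest spreads across all the tablet servers instead of hammering one tablet.

```python
# Sketch (hypothetical helper, not the actual D4M/Accumulo API):
# generate evenly spaced split points for a hex row-key space, so a
# table starts life split across tablet servers.

def hex_split_points(n_splits, width=4):
    """Return n_splits row keys that cut the hex key space [0, 16**width)
    into n_splits + 1 roughly equal ranges."""
    space = 16 ** width
    return ["%0*x" % (width, (i * space) // (n_splits + 1))
            for i in range(1, n_splits + 1)]

# 35 splits, as in the 8-tablet-server example above, gives 36 ranges.
splits = hex_split_points(35)
print(len(splits), splits[0], splits[-1])
```

Fixed-width hex keys sort lexicographically in numeric order, which is what keeps the resulting ranges balanced for uniformly distributed keys.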
516 00:27:00,650 --> 00:27:02,590 The databases we've given you access to 517 00:27:02,590 --> 00:27:04,480 are all single node databases, so we've 518 00:27:04,480 --> 00:27:05,480 made your life, in a certain sense, 519 00:27:05,480 --> 00:27:06,896 a little bit easier in that you don't 520 00:27:06,896 --> 00:27:08,370 have to worry about splits. 521 00:27:08,370 --> 00:27:10,860 But it is something-- as you work on larger systems, 522 00:27:10,860 --> 00:27:13,705 you have to be very aware of the splits. 523 00:27:17,700 --> 00:27:20,940 Another thing that's often very important 524 00:27:20,940 --> 00:27:23,430 when you're inserting into a database 525 00:27:23,430 --> 00:27:26,480 is the size of the chunks of data 526 00:27:26,480 --> 00:27:28,380 you're handing to the database. 527 00:27:28,380 --> 00:27:30,250 This is sometimes called the block size. 528 00:27:30,250 --> 00:27:31,790 This is a very important parameter. 529 00:27:31,790 --> 00:27:36,820 Blocking of programs is very important. 530 00:27:36,820 --> 00:27:41,000 Probably the key thing we do in optimizing any program 531 00:27:41,000 --> 00:27:43,190 is coming up, finding what the right block 532 00:27:43,190 --> 00:27:44,260 size is for a system. 533 00:27:44,260 --> 00:27:48,020 Because it almost always has a peak around some number here. 534 00:27:48,020 --> 00:27:53,260 And so this just shows the impact 535 00:27:53,260 --> 00:27:59,420 of block size on the performance rate here for Accumulo. 536 00:27:59,420 --> 00:28:01,330 And the nice thing about D4M, there's actually 537 00:28:01,330 --> 00:28:04,420 a parameter you can set for the table which will say how many 538 00:28:04,420 --> 00:28:06,040 bytes I want to set. 
539 00:28:06,040 --> 00:28:08,502 I think the default is 500 kilobytes, which 540 00:28:08,502 --> 00:28:10,210 is a lot smaller than I would've thought, 541 00:28:10,210 --> 00:28:11,751 but it's a number that we found tends 542 00:28:11,751 --> 00:28:14,460 to be fairly optimal across a wide range of things. 543 00:28:14,460 --> 00:28:16,419 Typically, when you do inserts into a database, 544 00:28:16,419 --> 00:28:18,459 you're like, well, I want to give you the biggest 545 00:28:18,459 --> 00:28:19,750 chunk that I can and move on. 546 00:28:19,750 --> 00:28:21,790 And Accumulo and probably a lot of databases 547 00:28:21,790 --> 00:28:24,620 actually like to get data in smaller chunks 548 00:28:24,620 --> 00:28:25,910 than you might think. 549 00:28:25,910 --> 00:28:28,371 Basically, what's happening then is 550 00:28:28,371 --> 00:28:30,120 if you're giving it the right size chunks, 551 00:28:30,120 --> 00:28:33,430 then it all fits in cache, and any sorting and other types 552 00:28:33,430 --> 00:28:35,510 of operations it has to do on that data can 553 00:28:35,510 --> 00:28:37,240 be done much more efficiently. 554 00:28:37,240 --> 00:28:39,070 So this just shows another parameter 555 00:28:39,070 --> 00:28:40,500 you have to be worried about. 556 00:28:40,500 --> 00:28:43,710 If you do everything right-- and I showed these results 557 00:28:43,710 --> 00:28:46,940 before-- these are the kind of results you can get. 558 00:28:46,940 --> 00:28:51,350 This shows essentially on an 8 node system-- fairly powerful, 559 00:28:51,350 --> 00:28:56,710 24 cores per node getting the 4 million inserts or entries 560 00:28:56,710 --> 00:29:00,500 per second that we saw, which is, to our knowledge, again, 561 00:29:00,500 --> 00:29:01,970 the world record holder for this. 562 00:29:04,710 --> 00:29:06,380 That talks about inserts. 563 00:29:06,380 --> 00:29:11,290 Another thing that we also want to look at is queries. 
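Before moving on to queries, the chunking advice above can be sketched as a batching loop. This is an illustrative stand-in, not D4M code; the 500-kilobyte target is the default mentioned above, and the actual database ingest call is left abstract (each yielded chunk would be what you hand to it).

```python
# Sketch: batching records so that each chunk handed to the database
# stays under a target byte count (the ~500 KB default mentioned in
# this lecture).  Each yielded chunk stands in for one ingest call.

TARGET_BYTES = 500 * 1000  # ~500 KB

def batches(records, target=TARGET_BYTES):
    """Yield lists of records whose encoded size is <= target
    (a single oversized record still goes out on its own)."""
    chunk, size = [], 0
    for rec in records:
        rec_size = len(rec.encode("utf-8"))
        if chunk and size + rec_size > target:
            yield chunk
            chunk, size = [], 0
        chunk.append(rec)
        size += rec_size
    if chunk:
        yield chunk

# Example: 10,000 rows of ~99 bytes each splits into two chunks.
rows = ["row%05d," % i + "x" * 90 for i in range(10000)]
sizes = [sum(len(r) for r in b) for b in batches(rows)]
print(len(sizes), max(sizes))
```

The point of the cap is the caching argument above: chunks that fit in cache let the server sort and process them far more efficiently than one giant block.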
564 00:29:11,290 --> 00:29:17,000 So as I said before, Accumulo is a row-based store, which 565 00:29:17,000 --> 00:29:18,950 means if you give it a row key, it 566 00:29:18,950 --> 00:29:22,320 will return the result in constant time. 567 00:29:22,320 --> 00:29:24,220 And so that's true. 568 00:29:24,220 --> 00:29:28,110 So we've done a bunch of queries here, 569 00:29:28,110 --> 00:29:30,820 lots of different concurrent processes all querying 570 00:29:30,820 --> 00:29:32,160 at the same time. 571 00:29:32,160 --> 00:29:37,600 And you can see the response time is generally, let's say, 572 00:29:37,600 --> 00:29:40,130 around 50 milliseconds, which is great. 573 00:29:40,130 --> 00:29:41,834 And this is the full round trip time. 574 00:29:41,834 --> 00:29:43,750 A lot of this could have been network latency, 575 00:29:43,750 --> 00:29:44,827 for all we know. 576 00:29:44,827 --> 00:29:45,910 So that's what you expect. 577 00:29:45,910 --> 00:29:46,993 That's what Accumulo does. 578 00:29:46,993 --> 00:29:50,270 If you do row queries, then you get constant time. 579 00:29:50,270 --> 00:29:51,710 Now, as we've talked about before, 580 00:29:51,710 --> 00:29:54,690 we do our special exploded transpose schema, 581 00:29:54,690 --> 00:29:57,660 which essentially gives this performance for both row 582 00:29:57,660 --> 00:29:58,465 and column queries. 583 00:29:58,465 --> 00:30:00,590 But if you have a table where you haven't done that 584 00:30:00,590 --> 00:30:04,060 and you are going to query a column, 585 00:30:04,060 --> 00:30:06,400 it's a complete scan of the table. 586 00:30:06,400 --> 00:30:12,780 And so you see here-- when you want to do a column query, 587 00:30:12,780 --> 00:30:16,310 the performance is just going to go up and up and up, 588 00:30:16,310 --> 00:30:18,460 because they're all just doing more and more scans, 589 00:30:18,460 --> 00:30:21,760 and the system can only scan at a certain rate. 
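The idea behind the exploded transpose schema can be sketched with plain dictionaries standing in for a row store: every insert goes into both the table and its transpose, so a column query becomes a constant-time row query on the transpose instead of a full scan. Illustrative only; the real schema lives in Accumulo tables, and the class and key names here are made up.

```python
# Sketch: a dictionary stand-in for the row-plus-transpose pattern.
# Keeping both the table and its transpose makes row AND column
# queries constant-time lookups instead of full scans.

from collections import defaultdict

class TransposePair:
    def __init__(self):
        self.rows = defaultdict(dict)   # row key -> {col key: value}
        self.cols = defaultdict(dict)   # col key -> {row key: value} (the "transpose")

    def insert(self, row, col, val):
        self.rows[row][col] = val       # one insert goes to both tables
        self.cols[col][row] = val

    def row_query(self, row):           # fast in any row store
        return dict(self.rows.get(row, {}))

    def col_query(self, col):           # fast only because of the transpose
        return dict(self.cols.get(col, {}))

t = TransposePair()
t.insert("alice", "color|red", 1)
t.insert("bob", "color|red", 1)
t.insert("bob", "color|blue", 1)

print(t.col_query("color|red"))   # {'alice': 1, 'bob': 1}
```

The cost is doubled insert work and storage; the payoff is that neither access pattern ever scans the whole table.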
590 00:30:21,760 --> 00:30:24,230 So that's, again, the main reason 591 00:30:24,230 --> 00:30:27,290 why we do these special schemas is to avoid this performance 592 00:30:27,290 --> 00:30:27,790 penalty. 593 00:30:30,890 --> 00:30:35,050 So finally, in talking about performance, 594 00:30:35,050 --> 00:30:40,810 let's talk a little bit about D4M performance itself. 595 00:30:40,810 --> 00:30:47,610 If you write your D4M programs optimally-- 596 00:30:47,610 --> 00:30:49,570 and this sometimes takes a while to get to, 597 00:30:49,570 --> 00:30:51,320 because usually you don't quite have the-- 598 00:30:51,320 --> 00:30:55,510 or I should say, in the most elegant way possible, again, 599 00:30:55,510 --> 00:30:58,430 this sometimes requires some understanding. 600 00:30:58,430 --> 00:31:01,090 Most applications end up becoming 601 00:31:01,090 --> 00:31:04,830 a combination of matrix multiplies 602 00:31:04,830 --> 00:31:06,941 on these associative arrays. 603 00:31:06,941 --> 00:31:08,940 Now, it's usually not the first program I write, 604 00:31:08,940 --> 00:31:11,230 but usually, it's the last one I write. 605 00:31:11,230 --> 00:31:15,460 It finally dawns on me how to do the entire complicated analytic 606 00:31:15,460 --> 00:31:18,980 and all its queries as a series of special sparse matrix 607 00:31:18,980 --> 00:31:22,570 multiplies, which then allows us to maximally leverage 608 00:31:22,570 --> 00:31:25,060 the underlying libraries which have been 609 00:31:25,060 --> 00:31:29,720 very optimized to do-- it's basically not 610 00:31:29,720 --> 00:31:30,870 calling any of my code. 611 00:31:30,870 --> 00:31:34,190 It's just calling the MATLAB sparse matrix multiply routine, 612 00:31:34,190 --> 00:31:37,350 which is further calling the sparse BLAS, which are heavily 613 00:31:37,350 --> 00:31:40,150 optimized by the computing vendors, 614 00:31:40,150 --> 00:31:42,650 and likewise are calling sort routines. 
615 00:31:42,650 --> 00:31:46,490 So the thing you have to understand, though, 616 00:31:46,490 --> 00:31:49,910 is that sparse matrix multiply has fundamental hardware 617 00:31:49,910 --> 00:31:54,995 limits depending on the kinds of multiplies you're doing here. 618 00:31:54,995 --> 00:31:58,970 And we have a whole program that basically times all these 619 00:31:58,970 --> 00:31:59,770 and generates them. 620 00:31:59,770 --> 00:32:02,492 So if you ever want to get what your fundamental limits are, 621 00:32:02,492 --> 00:32:03,950 we have a set of programs that will 622 00:32:03,950 --> 00:32:06,760 run all these benchmarks for you and show you 623 00:32:06,760 --> 00:32:09,310 what you can expect. 624 00:32:09,310 --> 00:32:11,680 And this shows you, basically, the fraction 625 00:32:11,680 --> 00:32:13,600 of the theoretical peak performance 626 00:32:13,600 --> 00:32:18,190 of the processor as a function of the total memory 627 00:32:18,190 --> 00:32:18,940 on that processor. 628 00:32:18,940 --> 00:32:23,870 So this is all single core, single processor benchmarks. 629 00:32:23,870 --> 00:32:28,330 So as you can see here, this is dense matrix multiply. 630 00:32:28,330 --> 00:32:32,750 And it's 95% efficient. 631 00:32:32,750 --> 00:32:35,760 If I was to build a piece of special purpose 632 00:32:35,760 --> 00:32:40,710 hardware and a memory subsystem for doing dense matrix 633 00:32:40,710 --> 00:32:44,130 multiply, it would be identical to the computers 634 00:32:44,130 --> 00:32:46,340 that you all have. 635 00:32:46,340 --> 00:32:49,420 Your computers have been absolutely designed 636 00:32:49,420 --> 00:32:51,850 to do dense matrix multiply. 637 00:32:51,850 --> 00:32:55,970 And for good or bad, that's just the way it is. 
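That hardware bias can be made concrete with a little arithmetic. Here is a hedged back-of-the-envelope sketch, with the roughly 1,000x dense-to-sparse efficiency gap taken from this lecture, showing why sparse multiply can still win in wall-clock time on hardware built for dense multiply: it skips the multiply-adds on all the zeros.

```python
# Sketch: comparing raw operation counts for dense vs sparse matrix
# multiply.  Even with a ~1,000x per-operation efficiency penalty for
# sparse (the gap discussed in this lecture), sparse wins once the
# matrix is sparse enough.

def dense_ops(n):
    return 2 * n ** 3                 # multiply-adds for n x n dense A*A

def sparse_ops(n, nnz_per_row):
    # roughly nnz(A) * nnz_per_row useful multiply-adds for A * A
    return 2 * n * nnz_per_row ** 2

def sparse_wins(n, nnz_per_row, efficiency_gap=1000.0):
    """True if sparse multiply is faster despite running
    efficiency_gap times slower per operation."""
    return sparse_ops(n, nnz_per_row) * efficiency_gap < dense_ops(n)

# A million-vertex graph with ~8 edges per vertex: sparse wins easily.
print(sparse_wins(10 ** 6, 8))    # True
# A small matrix with 50 nonzeros per row of 100: the penalty dominates.
print(sparse_wins(100, 50))       # False
```

This is the "win if they are sparse, but not otherwise" point: the crossover depends on how the zero savings compare against the efficiency penalty.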
638 00:32:55,970 --> 00:32:59,860 The caching structure, the matrix multiply units, 639 00:32:59,860 --> 00:33:03,580 the vectorization units, they're all designed to-- they're 640 00:33:03,580 --> 00:33:07,800 identical to a custom built matrix multiply unit. 641 00:33:07,800 --> 00:33:11,800 And so we get enormously high performance here, 642 00:33:11,800 --> 00:33:15,110 95% peak on dense matrix multiply. 643 00:33:15,110 --> 00:33:19,080 Unfortunately, that hardware architecture and everything 644 00:33:19,080 --> 00:33:20,640 we do for dense matrix multiply 645 00:33:20,640 --> 00:33:23,200 is the opposite of what we would do 646 00:33:23,200 --> 00:33:27,560 if we built a computer for doing sparse matrix multiply. 647 00:33:27,560 --> 00:33:30,450 And that's why we have, when we go from dense matrix multiply 648 00:33:30,450 --> 00:33:36,700 to sparse matrix multiply, this 1,000x drop in performance. 649 00:33:36,700 --> 00:33:39,490 And this is not getting better with time. 650 00:33:39,490 --> 00:33:43,044 This is getting worse with time. 651 00:33:43,044 --> 00:33:44,710 And there's nothing you can do about it. 652 00:33:44,710 --> 00:33:47,010 That is just what the hardware does. 653 00:33:47,010 --> 00:33:49,470 Now, it's still better to do sparse matrix multiply 654 00:33:49,470 --> 00:33:52,610 than to convert it to dense, because it's still 655 00:33:52,610 --> 00:33:57,170 a net win by not computing all those zeros that you would 656 00:33:57,170 --> 00:33:58,670 have in a very sparse matrix. 657 00:33:58,670 --> 00:34:03,130 So it's still a significant win in execution time 658 00:34:03,130 --> 00:34:06,020 to do your calculation on sparse matrices 659 00:34:06,020 --> 00:34:10,150 if they are sparse, but not otherwise. 660 00:34:10,150 --> 00:34:15,719 And then-- so that's that. 661 00:34:15,719 --> 00:34:22,210 This shows the associative array. 
662 00:34:22,210 --> 00:34:28,870 So from the top: MATLAB dense, MATLAB sparse, D4M associative 663 00:34:28,870 --> 00:34:29,370 array. 664 00:34:29,370 --> 00:34:34,560 And we worked very hard to make the D4M associative 665 00:34:34,560 --> 00:34:38,900 array be a constant factor below the core sparse. 666 00:34:38,900 --> 00:34:41,060 So you're really getting performance that's 667 00:34:41,060 --> 00:34:42,610 pretty darn close to that. 668 00:34:42,610 --> 00:34:46,920 And so generally, doing sparse matrix multiplies 669 00:34:46,920 --> 00:34:50,760 on the associative arrays works out pretty well. 670 00:34:50,760 --> 00:34:53,940 But again, it's still only about 21% efficient on the hardware. 671 00:34:53,940 --> 00:34:56,550 And then the final thing is we have 672 00:34:56,550 --> 00:35:00,020 these special sparse matrix multiplies where we concatenate 673 00:35:00,020 --> 00:35:01,290 the keys together. 674 00:35:01,290 --> 00:35:05,150 If you remember way back when we did the bioinformatics example, 675 00:35:05,150 --> 00:35:07,900 we could do special matrix multiplies 676 00:35:07,900 --> 00:35:12,010 that essentially preserved where the data came from. 677 00:35:12,010 --> 00:35:15,290 If we correlated two DNA sequences, 678 00:35:15,290 --> 00:35:19,080 we would have a matrix that would show 679 00:35:19,080 --> 00:35:21,330 the IDs of the DNA sequence. 680 00:35:21,330 --> 00:35:23,460 And if we did these special matrix multiplies, 681 00:35:23,460 --> 00:35:28,320 the values would then hold the exact DNA sequences themselves 682 00:35:28,320 --> 00:35:29,821 that were the matches. 683 00:35:29,821 --> 00:35:31,320 Well, there's a performance hit that 684 00:35:31,320 --> 00:35:33,920 comes with holding that additional information, all 685 00:35:33,920 --> 00:35:35,370 of this string information. 
686 00:35:35,370 --> 00:35:38,162 It may be that one day we'll be able to optimize this further, 687 00:35:38,162 --> 00:35:40,120 although we've actually optimized it a fair bit 688 00:35:40,120 --> 00:35:40,760 already. 689 00:35:40,760 --> 00:35:43,030 And so there's another factor of 100. 690 00:35:43,030 --> 00:35:45,600 And it looks like this is just, again, 691 00:35:45,600 --> 00:35:48,630 fundamentally limited by the speed 692 00:35:48,630 --> 00:35:51,790 at which the hardware will do sparse matrix multiply 693 00:35:51,790 --> 00:35:55,940 and the speed at which it can do string sorting. 694 00:35:55,940 --> 00:35:58,100 Those are essentially the two routines 695 00:35:58,100 --> 00:36:00,630 that all these functions boil down to: 696 00:36:00,630 --> 00:36:03,220 sparse matrix multiply and string sorting. 697 00:36:03,220 --> 00:36:05,280 And those are essentially hardware limited. 698 00:36:05,280 --> 00:36:08,070 Those are very optimized functions. 699 00:36:08,070 --> 00:36:10,860 So this just gives you a way. 700 00:36:10,860 --> 00:36:12,720 I think it's always-- the first thing 701 00:36:12,720 --> 00:36:20,130 we do when we run a program is if we want to optimize it, 702 00:36:20,130 --> 00:36:23,230 we'll write the program and run it, get a time, 703 00:36:23,230 --> 00:36:27,070 and then we'll try and figure out based on this type of data, 704 00:36:27,070 --> 00:36:30,500 well, what's the theoretical best we could do? 705 00:36:30,500 --> 00:36:33,740 Given what this calculation is dominated by, 706 00:36:33,740 --> 00:36:36,080 what is the best we can do? 707 00:36:36,080 --> 00:36:42,070 Because if the best we could do-- the theoretical peak 708 00:36:42,070 --> 00:36:46,390 for what we were doing was 1,000 times better than what 709 00:36:46,390 --> 00:36:49,400 we're doing, that tells us, you know what? 710 00:36:49,400 --> 00:36:51,530 If we have to invest time in optimization, 711 00:36:51,530 --> 00:36:54,040 it probably is time well spent. 
712 00:36:54,040 --> 00:36:58,580 Going from being 1,000x below whatever the best you can do 713 00:36:58,580 --> 00:37:01,540 getting to 100x usually doesn't take too much trouble. 714 00:37:01,540 --> 00:37:03,500 Usually, you can find what that issue is 715 00:37:03,500 --> 00:37:06,030 and you can correct it. 716 00:37:06,030 --> 00:37:13,240 Likewise, going from 100x to 10x is still usually 717 00:37:13,240 --> 00:37:16,075 a worthwhile thing to do. 718 00:37:16,075 --> 00:37:20,540 If you're at 10% of the best that you can do, 719 00:37:20,540 --> 00:37:22,360 then you begin to think hard about, 720 00:37:22,360 --> 00:37:24,940 well, do I really want to chase? 721 00:37:24,940 --> 00:37:27,430 I'm never going to probably do better than 50% 722 00:37:27,430 --> 00:37:30,280 of what the best is, or maybe even 30% or 40%. 723 00:37:30,280 --> 00:37:33,180 Usually, there's a lot of effort that 724 00:37:33,180 --> 00:37:37,590 goes into going from 10% of peak performance 725 00:37:37,590 --> 00:37:39,470 to 50% of performance. 726 00:37:39,470 --> 00:37:41,890 But going from 0.1% of peak performance 727 00:37:41,890 --> 00:37:44,980 to 1% of peak performance is usually 728 00:37:44,980 --> 00:37:49,510 profile the code, aha, I'm doing something wrong here, fix it, 729 00:37:49,510 --> 00:37:50,010 done. 730 00:37:50,010 --> 00:37:53,000 And so this is something that's a part of our philosophy 731 00:37:53,000 --> 00:37:56,260 in doing optimization, and why we spend so much time running 732 00:37:56,260 --> 00:37:58,910 benchmarks and understanding where-- 733 00:37:58,910 --> 00:38:01,125 because it tells us when to stop. 734 00:38:01,125 --> 00:38:03,390 If someone says, well, will this be faster, 735 00:38:03,390 --> 00:38:05,800 and you can say no, it won't be faster, you're done. 736 00:38:05,800 --> 00:38:09,730 Or, ah, no, we think in a few weeks' time, 737 00:38:09,730 --> 00:38:11,760 we can really make it a lot better. 
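The optimization triage just described can be written down as a rule of thumb. A hedged sketch, with the thresholds taken from this lecture's rough guidance (below 1% of peak, profile and fix; 1% to 10%, usually still worthwhile; above 10%, further gains get expensive):

```python
# Sketch: the measure-against-theoretical-peak triage described above.
# The ratio of measured performance to the hardware's theoretical best
# tells you whether more tuning effort is worth it.

def triage(measured, theoretical_peak):
    frac = measured / theoretical_peak
    if frac < 0.01:
        return "optimize: likely doing something wrong; profile and fix"
    if frac < 0.10:
        return "optimize: usually still worthwhile"
    return "stop: further gains will be expensive"

# E.g. 2 GFLOP/s measured against a 200 GFLOP/s peak is 1% of peak.
print(triage(2.0, 200.0))
```

The useful part is the "when to stop" answer: if the benchmark says you are already near what the hardware can do, the question "will this be faster?" has a firm no.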
738 00:38:11,760 --> 00:38:13,970 So very useful thing, and that's why benchmarking 739 00:38:13,970 --> 00:38:16,910 is a big part of what we do. 740 00:38:16,910 --> 00:38:20,060 So finally, I want to wrap up here. 741 00:38:20,060 --> 00:38:22,530 This was one of the first charts I showed you, 742 00:38:22,530 --> 00:38:25,780 and hopefully now it makes a lot more sense. 743 00:38:25,780 --> 00:38:30,660 This shows you the easy path toward solving your problems 744 00:38:30,660 --> 00:38:33,800 in signal processing on databases. 745 00:38:33,800 --> 00:38:37,290 And if you look at your problem as one of how much data 746 00:38:37,290 --> 00:38:42,420 I want to process and how much-- what the granularity of access 747 00:38:42,420 --> 00:38:46,380 is, well, if I don't have a lot of data and I'm 748 00:38:46,380 --> 00:38:49,880 accessing it at a very small granularity, 749 00:38:49,880 --> 00:38:55,470 well, then I should just do it in the memory of my computer. 750 00:38:55,470 --> 00:38:56,980 Don't mess around with files. 751 00:38:56,980 --> 00:38:58,870 Don't mess around with databases. 752 00:38:58,870 --> 00:39:00,570 Keep your life simple. 753 00:39:00,570 --> 00:39:03,720 Work it there. 754 00:39:03,720 --> 00:39:07,496 If the data request gets high, memory is still the best. 755 00:39:07,496 --> 00:39:09,370 Whether you're doing little bits or big bits, 756 00:39:09,370 --> 00:39:11,730 working out of main memory on a single processor, 757 00:39:11,730 --> 00:39:16,110 generally, for the most part, is where you want to be. 
758 00:39:16,110 --> 00:39:21,030 If you're going to be getting to larger amounts of data 759 00:39:21,030 --> 00:39:24,190 but having a fairly large request size, 760 00:39:24,190 --> 00:39:27,350 then using files, reading those files in 761 00:39:27,350 --> 00:39:29,816 and out-- so basically, if I have 762 00:39:29,816 --> 00:39:31,690 a lot of data I want to process, it won't fit 763 00:39:31,690 --> 00:39:34,800 into serial memory, we'll write it out into a bunch of files, 764 00:39:34,800 --> 00:39:38,140 just read them in, process them, move onto the next file 765 00:39:38,140 --> 00:39:40,100 is the right thing to do. 766 00:39:40,100 --> 00:39:42,710 Or spread it out in parallel. 767 00:39:42,710 --> 00:39:45,200 Run a bunch of nodes and use the RAM in each one of those. 768 00:39:45,200 --> 00:39:46,658 That's still the best way to do it. 769 00:39:50,310 --> 00:39:51,880 You could also do that in parallel. 770 00:39:51,880 --> 00:39:55,800 You can have parallel programs reading parallel files, too. 771 00:39:55,800 --> 00:39:58,020 So if you have really enormous amounts of data 772 00:39:58,020 --> 00:40:02,500 and you're going to be reading the data in large chunks, 773 00:40:02,500 --> 00:40:06,520 then you want to use parallel files, parallel computing. 774 00:40:06,520 --> 00:40:08,520 That's going to give you better performance. 775 00:40:08,520 --> 00:40:10,550 And it's just going to be easy. 776 00:40:10,550 --> 00:40:11,920 Have a lot of files. 777 00:40:11,920 --> 00:40:15,610 Each program reads into sets of files, processes its set. 778 00:40:15,610 --> 00:40:16,720 Very easy to do. 779 00:40:16,720 --> 00:40:18,930 And we have lots of support in our [INAUDIBLE] system 780 00:40:18,930 --> 00:40:20,346 for doing those types of problems. 781 00:40:20,346 --> 00:40:22,020 It's the thing that most people do. 
782 00:40:22,020 --> 00:40:25,980 However, if you have a large amount of data 783 00:40:25,980 --> 00:40:31,920 and you want to be accessing small fractions of it randomly, 784 00:40:31,920 --> 00:40:33,630 that's when you want to use the database. 785 00:40:33,630 --> 00:40:37,507 That is the use case for the database. 786 00:40:37,507 --> 00:40:38,965 And there are definitely times when 787 00:40:38,965 --> 00:40:42,390 we have that case where we want to do that. 788 00:40:42,390 --> 00:40:45,109 But you want to build towards it. 789 00:40:45,109 --> 00:40:47,400 A lot of times when you're doing analytics or algorithm 790 00:40:47,400 --> 00:40:51,320 development, you tend to be in these cases first. 791 00:40:51,320 --> 00:40:56,840 And so rather than bite off the complexity associated 792 00:40:56,840 --> 00:41:00,270 with dealing with a database, use these tools first. 793 00:41:00,270 --> 00:41:02,839 But then eventually, there may be a time where you actually 794 00:41:02,839 --> 00:41:03,880 have to use the database. 795 00:41:03,880 --> 00:41:06,760 And hopefully, one thing you get out of this course 796 00:41:06,760 --> 00:41:09,310 is understanding this nuance type of thing. 797 00:41:09,310 --> 00:41:12,670 Because a lot of people say, we need a parallel database. 798 00:41:12,670 --> 00:41:13,480 Why? 799 00:41:13,480 --> 00:41:14,471 We have a lot of data. 800 00:41:14,471 --> 00:41:14,970 All right. 801 00:41:14,970 --> 00:41:15,810 That's great. 802 00:41:15,810 --> 00:41:17,018 Well, what do you want to do? 803 00:41:17,018 --> 00:41:19,160 We want to scan over the whole thing. 804 00:41:19,160 --> 00:41:23,180 Well, that's just going to be a lot of data really slow. 805 00:41:23,180 --> 00:41:24,210 So people will do that. 806 00:41:24,210 --> 00:41:26,001 They'll load a bunch of data in the system. 807 00:41:26,001 --> 00:41:28,607 They're like, well, but nothing is faster. 
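The decision chart above can be condensed into a function. The thresholds are the rough orders of magnitude from this lecture (millions of entries for memory, hundreds of millions for files, billions before a database pays off), and "random access" means querying small fractions of the data rather than scanning it; treat the exact cutoffs as illustrative.

```python
# Sketch: the memory / files / database decision chart as a function.
# Thresholds are the rough orders of magnitude from this lecture;
# random_access means querying small fractions, not scanning.

def storage_tier(n_entries, random_access):
    if n_entries < 10 ** 6:
        return "memory"                 # keep your life simple
    if n_entries < 10 ** 8:
        return "files"                  # read a file, process, move on
    if random_access:
        return "database"               # the database's real use case
    return "parallel files"             # scans beat the database here

print(storage_tier(30_000, random_access=False))      # the LDAP-sized case
print(storage_tier(5 * 10 ** 9, random_access=True))  # billions, random access
```

The point of the function is the scan case at the bottom: lots of data plus whole-table scans is exactly where a database disappoints, which is the complaint described next.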
808 00:41:28,607 --> 00:41:30,190 It's like, well, yeah, if you're going 809 00:41:30,190 --> 00:41:32,410 to scan over a significant fraction of the database, 810 00:41:32,410 --> 00:41:35,050 it will be slower-- certainly slower than loading it 811 00:41:35,050 --> 00:41:36,970 into the memories of a bunch of computers, 812 00:41:36,970 --> 00:41:39,740 and even slower than just having a bunch of files 813 00:41:39,740 --> 00:41:41,770 and reading them into memory. 814 00:41:41,770 --> 00:41:48,170 And increasingly nowadays, you're talking about-- 815 00:41:48,170 --> 00:41:49,860 this is millions of entries. 816 00:41:52,530 --> 00:41:55,320 There are a lot of things we have in databases, 817 00:41:55,320 --> 00:41:58,880 like the lab's LDAP server. 818 00:41:58,880 --> 00:42:00,980 It's like 30,000 entries. 819 00:42:00,980 --> 00:42:02,520 We have a database for it. 820 00:42:02,520 --> 00:42:04,880 You could send it around an Excel spreadsheet 821 00:42:04,880 --> 00:42:07,395 and pretty much do everything with that. 822 00:42:07,395 --> 00:42:10,530 Or the entire release review query, or the phone book, 823 00:42:10,530 --> 00:42:13,230 or whatever-- these are all now, by today's standards, 824 00:42:13,230 --> 00:42:16,030 microscopic data sets that can trivially 825 00:42:16,030 --> 00:42:19,050 be stored in a single associative array in something 826 00:42:19,050 --> 00:42:21,280 like this and manipulated to your heart's content. 827 00:42:21,280 --> 00:42:24,220 So there's a lot of data sets that are big to the human 828 00:42:24,220 --> 00:42:28,480 but are microscopic to the technology nowadays. 829 00:42:28,480 --> 00:42:32,430 So this is usually millions of entries. 830 00:42:32,430 --> 00:42:35,370 This might be hundreds of millions of entries. 
831 00:42:35,370 --> 00:42:37,340 Definitely, when you really need-- often, 832 00:42:37,340 --> 00:42:39,380 you're talking about billions of entries 833 00:42:39,380 --> 00:42:41,630 when you're getting here to the point where you really 834 00:42:41,630 --> 00:42:42,480 need a database. 835 00:42:42,480 --> 00:42:44,770 The exception to this being if you're 836 00:42:44,770 --> 00:42:48,330 in a program that requires-- many of our customers 837 00:42:48,330 --> 00:42:50,470 will say the database is being used. 838 00:42:50,470 --> 00:42:51,710 This is where the data is. 839 00:42:51,710 --> 00:42:54,010 It's in no other place. 840 00:42:54,010 --> 00:42:55,820 So for integration purposes, then you 841 00:42:55,820 --> 00:42:57,700 throw all this out the window. 842 00:42:57,700 --> 00:42:59,840 If they say, look, this is where the data is 843 00:42:59,840 --> 00:43:03,634 and that's the only way you can get it, then of course. 844 00:43:03,634 --> 00:43:06,050 But that still doesn't mean you might be like, yeah, well, 845 00:43:06,050 --> 00:43:07,758 you're going to store it in the database, 846 00:43:07,758 --> 00:43:10,536 but I'm going to store it in /temp, and work on it, 847 00:43:10,536 --> 00:43:12,410 and then put my results back in the database, 848 00:43:12,410 --> 00:43:15,660 because I know that what I'm doing is going to be better. 849 00:43:15,660 --> 00:43:18,840 There's no-- as they say, we don't 850 00:43:18,840 --> 00:43:20,467 want to be too proud to win. 851 00:43:20,467 --> 00:43:22,800 Just because some people, it's like, we have a database, 852 00:43:22,800 --> 00:43:25,097 and everything must be purely done in the database. 853 00:43:25,097 --> 00:43:27,055 Or some people are like, we have a file system, 854 00:43:27,055 --> 00:43:29,420 and everything must be purely done in the file system. 
855 00:43:29,420 --> 00:43:33,030 Or we have this language and everything-- 856 00:43:33,030 --> 00:43:36,620 the right tool for the job, pulling 857 00:43:36,620 --> 00:43:38,620 them together, totally fine. 858 00:43:38,620 --> 00:43:40,580 It's a quick way to get this work done 859 00:43:40,580 --> 00:43:42,580 without driving yourself crazy. 860 00:43:42,580 --> 00:43:44,080 So with that, I'll come to the end. 861 00:43:44,080 --> 00:43:48,240 I have one code example which is relatively short to show you. 862 00:43:48,240 --> 00:43:50,990 But again, power law graphs are the dominant type 863 00:43:50,990 --> 00:43:52,390 of data we see. 864 00:43:52,390 --> 00:43:55,380 Graph500 relies on these Kronecker graphs. 865 00:43:55,380 --> 00:43:58,340 Kronecker graph theory is a great theoretical framework 866 00:43:58,340 --> 00:43:59,560 for looking at stuff. 867 00:43:59,560 --> 00:44:02,060 You get all the identity benefits 868 00:44:02,060 --> 00:44:05,265 that come with Kronecker product eigenvalues. 869 00:44:05,265 --> 00:44:07,210 If you're into that, I certainly suggest 870 00:44:07,210 --> 00:44:09,380 you read the Van Loan paper, which 871 00:44:09,380 --> 00:44:12,621 is also the seminal work on Kronecker products. 872 00:44:12,621 --> 00:44:14,912 And I don't think anyone has written a follow-up to it, 873 00:44:14,912 --> 00:44:16,912 because everyone is like, yep, covered it there. 874 00:44:16,912 --> 00:44:18,450 It's about 12, 15 pages. 875 00:44:18,450 --> 00:44:22,220 Covered all the properties of Kronecker products. 876 00:44:22,220 --> 00:44:25,400 We can do parallel computing in D4M via pMATLAB. 877 00:44:25,400 --> 00:44:27,507 And again, fundamentally, if you've 878 00:44:27,507 --> 00:44:29,090 implemented your algorithms correctly, 879 00:44:29,090 --> 00:44:32,324 they're limited by hardware, by fundamental aspects 880 00:44:32,324 --> 00:44:32,990 of the hardware. 881 00:44:32,990 --> 00:44:33,906 There are limitations. 
882 00:44:33,906 --> 00:44:35,500 It's important to know what those are 883 00:44:35,500 --> 00:44:38,150 and to be aware of them.