The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, well, I want to thank you all for coming to what I have advertised as the penultimate course in this lecture series. Everything else we've done up to this point has sort of been building up to actually, finally, really using databases. And hopefully you haven't been too disappointed at how long I've led you along here to get to this point. But the point being that if you have certain abstract concepts in your mind, then once we get to the database part, it just feels very straightforward. If we do the database piece and you don't have those concepts, then you can easily get distracted by extraneous information.

So today there are no viewgraphs, and I'm sure you're all thrilled about that. It's going to be all demos showing interaction with the actual technologies that we have here. And everything I'm showing is stuff that you can use. So I'm just going to kind of get into this, and let's start with step one. We're going to be using the Accumulo databases that we've set up. We have a clearinghouse of these databases on our LLGrid system, and you can get to that list by going to this web page here, dbstatusllgrid.ll.mit.edu.

And when you go there, it will prompt you for your password. And then it will show you the databases that you have access to. Now, I have access to all the databases, but you should only see these class databases if you log into that. And so as you see here, we have five databases that are set up. These are five independent instances of Accumulo. And I started a couple already, and we can even take a look at them. So this is what a running Accumulo instance looks like.
This is its main page here. And it shows you how much disk is used, and the number of tables, and all that type of stuff. And it gives you a nice history that shows ingest rate over the last few [INAUDIBLE], and scan rate. This is all in entries per second, ingest in megabytes, all different kinds of really useful information here. And you'll see that this has got the URL of classdb01.cloud.llgrid. When I started it, there was an actual machine allocated to it.

In fact, just for fun here, I could turn one of these on. You are free to start them. I wouldn't encourage you to hit the Stop button, because if someone else is using it and you hit Stop, that may not be something you want to do. But it's the same setup. For instance, if you have a project, everyone in the class can see this, because we've made you all a part of the class group. But you can see here there are other classes. We have a bioinformatics group, and they have a couple of databases. Those are there; they're not running right now. There's a very large graph database group, and it's running now. And just to show that it's running, you see this has about 200 gigabytes of data. And if we look in the tables here, we see we have a few tables. Here are some tables with a few billion entries that have been put in there. And this is really what Accumulo does very, very well. But I'm going to start one just for fun here, if that works. And so it will be starting that, and all that happens, and you can see it's starting, and all that type of stuff.

So we're going to get going here now with the specific examples. And I have these. Just so you know, today's examples are in the Examples directory, in the Scaling directory, in 2ParallelDatabase. So this is the directory we're going to be going through today.
And we have a lot of examples to get through, because we're going to be covering a lot of ground here about how you can take advantage of D4M and Accumulo together.

So the first thing I'm going to do is go here. I'm going to run these. I have, essentially, two versions of the code: one that's going to do fairly small databases on my laptop, and another version sitting in my LLGrid account that I can do some bigger things with.

So to get started, we're going to do this first example, pDB01 DataTest. In order to do database work, and to test data, we need to generate some data. And so I'm using a built-in data generator that we have called the Kronecker graph. It's basically borrowed from a benchmark called the Graph500 benchmark. There's actually a list called Graph 500, and I helped write that benchmark. In fact, the Matlab code on that website is stuff that I originally wrote, and other people have since modified.

And so this is a graph generator. It generates a very large power law graph using a Kronecker product approach. And it has a few parameters here: a scale parameter, which basically sets the number of vertices. So 2 to this scale parameter is approximately the number of vertices; 2 to the 12th gets you about 4,000 vertices. It then creates a certain number of edges per vertex, so 16 edges per vertex. And so this computes n max as 2 to the scale, and then the number of edges is edges per vertex times n max. This is the maximum number of edges. And then it generates this, and it comes back with two vectors: the first vector is a list of starting vertices, and the second vector is a list of ending vertices. And we're not really using any D4M here. We're just creating a sparse adjacency matrix of that data, showing it, and then plotting the degree distribution.
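In sketch form, this generation step looks something like the following, assuming the Graph500-style Kronecker generator is exposed as KronGraph500NoPerm, as in the D4M example code (adjust the name if your copy differs):

    SCALE = 12;                           % 2^SCALE is about 4,000 vertices
    EdgesPerVertex = 16;
    Nmax = 2^SCALE;                       % maximum number of vertices
    M = EdgesPerVertex .* Nmax;           % maximum number of edges
    [StartVertex, EndVertex] = KronGraph500NoPerm(SCALE, EdgesPerVertex);
    A = sparse(StartVertex, EndVertex, 1, Nmax, Nmax);  % sparse adjacency matrix
    spy(A);                               % shows the recursive Kronecker structure
    outDeg = full(sum(A, 2));             % out degree of every vertex
    counts = hist(outDeg, 1:max(outDeg)); % tally vertices at each degree
    loglog(1:max(outDeg), counts, 'o');   % power law degree distribution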
So if we look at that figure, this shows the adjacency matrix of this graph, start vertex to end vertex. These Kronecker graphs have this sort of recursive structure, and if you kept zooming in, you would see that the graph looks like itself in a recursive way. That's what gives us this power law distribution. And this is a relatively small graph. This particular data generator is chosen because you can make enormous graphs in parallel very easily. If we had to pass around large data sets every single time we wanted to test our software, it would be prohibitive, because we'd be passing around gigabytes and terabytes. And I think the largest this has ever been run is at a scale of 2 to the 37, so that's on the order of a hundred billion vertices, getting towards a trillion.

And then we do the degree distribution of this. And you see here, it creates a power law distribution. We have a few vertices with only one connection, and we always have a super node with a lot of connections. And you can actually see here the Kronecker structure in this data, which creates this characteristic sawtooth pattern. There are ways to get rid of that if you want, but for our purposes, having that structure there is no problem. So this is exactly what the degree distribution looks like.

So that's just a small version to show you what the data looks like. Now we're going to create a bigger version. So this program, which I'll now show you, creates essentially the same Kronecker graph, but it's going to do it eight times. And one of the nice things about this generator is that if you just keep calling it, it gives you more independent samples from the same graph. So we're just creating a graph that's got eight times as many edges as the previous one by calling it over and over again, just from the random number generator. So I'm going to do this eight times.
And I'm going to save each one of those to a separate file. So I create a file name. I'm actually setting the random number seed from the file name, so that, if I want, the seventh file will always have essentially the same random sequence regardless of when I run it. And so I create my vertices, and I'm going to convert these to strings, and then write these out to files. And that's all this does here.

And one of the things I do throughout this process, as you will see, is I keep track of how many edges per second I'm generating. So here, I'm generating about 150,000. It varies, between 30,000 and 150,000 or 180,000 edges per second. When you're creating a whole data processing pipeline, that's essentially the kind of metric you're looking at. Some steps might process your edges extremely quickly, and other steps might process your edges more slowly. And the slow ones are, obviously, the ones where you want to put more energy and effort.

So we can actually now go and look. It stuck it in this data directory here, and we just created that. And so, basically, we write it out in three files. Essentially, each one of these holds one part of a triple: a row, a column, and a value. So if we look at the row, you can just see it's a sequence of strings separated by commas; same with the column, just a separate sequence of strings separated by commas. And then in this case, the values we just made all ones, nothing fancy there. So now we have eight files. That's great. We generated those very quickly. And now we want to do a little processing on them.
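Here is a sketch of that generate-and-write loop (StrFileWrite is assumed here as the D4M string-file helper; the file layout follows the description above):

    Nfile = 8;
    for i = 1:Nfile
      fname = ['data/' num2str(i)];
      rand('seed', i);                        % seed follows the file number
      [v1, v2] = KronGraph500NoPerm(SCALE, EdgesPerVertex);
      rowStr = sprintf('%d,', v1);            % start vertices, comma separated
      colStr = sprintf('%d,', v2);            % end vertices, comma separated
      valStr = repmat('1,', 1, numel(v1));    % every value is 1
      StrFileWrite(rowStr, [fname '.row']);
      StrFileWrite(colStr, [fname '.col']);
      StrFileWrite(valStr, [fname '.val']);
    end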
So if we go to pDB03, the first thing we're going to do is read those files back in and construct associative arrays, because the associative array construction takes a little time, and we're going to want to use the result over and over again. So we might as well take those triples, construct them into associative arrays, and save them out as Matlab binary files. And then that will be something we can work with very quickly.

So we're going to do that. So there you go. It read them in, and it shows you the rate at which it reads them in, and then essentially writes them out, and then gives us another example of the edges per second. And now you see we have Matlab files for each one of those. And, not surprisingly, the Matlab file is smaller than the three input triple files that produced it. So this is a 24 kilobyte Matlab file, and it was probably about 80 kilobytes of input data. And that's just because we've compressed all the row keys into single vectors, and we have the sparse adjacency matrix, which stores things compactly. And so that makes it a little bit better there.

If we actually look at that program, we can see we're basically reading the files in, and then what we're doing is creating an associative array. We read in each set of triples, and then the constructor takes the list of row strings and column strings. Since we knew the values were all one, we just let that be one. And then there's this optional fourth argument that tells us what to do if we put in two triples with the same row and column. The default is that it will just do the min. So if I have a collision, it will just do a min. But I can give it this optional fourth argument, @sum. In fact, you can put essentially any binary operation there, but @sum will just add them together. So now, in the associative array, a particular row and column will hold how many times that edge occurred. And so we're summing up as we go here. And then after we create the associative array, we save it out to a file. And so we have that step done.
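In sketch form, the construction-and-save step (StrFileRead assumed as the matching read helper):

    for i = 1:Nfile
      fname = ['data/' num2str(i)];
      rowStr = StrFileRead([fname '.row']);
      colStr = StrFileRead([fname '.col']);
      % Fourth argument is the collision function: the default is min,
      % @sum tallies repeated (row, column) edges instead.
      A = Assoc(rowStr, colStr, 1, @sum);
      save([fname '.A.mat'], 'A');            % Matlab binary file for fast reload
    end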
Now, the whole reason I showed you this process is because now I'm actually in a position where I can start doing computation just on the files. As I said before, I don't have to use the database. If I'm going to do any kind of calculation that's going to involve traversing all the data, it's going to be faster just to read in those Matlab files and do my processing on that. It's also very easy to make parallel. I have a lot of files, so if I launch a parallel job, I can just have different processes reading separate files. It will scale very well, and the file read rates will be very fast. Reading these files in parallel will take much less time than trying to pull all the data out of the database again.

So we're going to do a little analytics here: pDB04. I'm going to take those eight files, read them all in, and accumulate the results as we go. And there we go. We get the in degree distribution and the out degree distribution of this result.

If you look at that program, you can see all we did is loop over all the files, load them in Matlab, and then basically sum the rows and add that to a temp variable, and sum the columns, and then plot them out. So we just sort of accumulated them as we went; a sketch of this loop follows below.

This method of just summing on top of an associative array is something that you can certainly do; it's a very convenient way to do it. I should say, though, and you can kind of see it here a little bit: you notice that the time is beginning to grow. It's not so clear here, because this took so little time. But on a larger example, what you would see is that every single time we did that, because we're building and then adding, we're basically redoing the construction process. And so, eventually, this will take longer, and longer, and longer. And so it's OK to do that for small stuff, or if you're only going to do it a few times.
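A minimal sketch of that accumulate-as-you-go loop, with hypothetical file names:

    AdegOut = Assoc('', '', '');              % empty associative arrays
    AdegIn  = Assoc('', '', '');
    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads A
      AdegOut = AdegOut + sum(A, 2);          % running out degree tally
      AdegIn  = AdegIn  + sum(A, 1);          % running in degree tally
    end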
But if you're going to be accumulating an enormous amount of data, then what we can actually do is use another version of this program, pDB04cat DegreeTest. And you can tell that was a little bit faster; you see here it's all in milliseconds. And this is a little bit longer program.

What we're doing here is basically the exact same thing. We're loading our Matlab file, and we're doing the sum. And then, since I know something about the structure of the result (basically, I'm summing the rows), I can just append that to a longer list, and then at the end do one large sum. And that's, obviously, much faster.

And these are the kinds of tricks you just need to be aware of. For a small amount of data, you just do the simple sum; that will be OK. But if you're accumulating over a large list, the simple approach essentially becomes almost an n squared operation in the loop variable.

And you can make it even faster still, because we are doing this concatenation here, and when you do a concatenation in Matlab, you're doing a malloc. If you want to make it even faster, you can pre-allocate a large buffer, append into that buffer, and then, when you hit the end of the buffer, do a sum at that point. And that's the fastest you can do. So with these tricks, very, very large sums can be done very quickly, and all with files. You don't need a database. And this is the way to go if you're doing an analytic where you really want to traverse most of your data set. If you just need to get pieces of the data, then the database will be a better tool.
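And here is a sketch of that concatenate-then-sum variant: append each file's partial-sum triples to growing lists, and build one associative array with @sum at the end (find on an associative array returns its triples):

    rowCat = ''; valCat = zeros(0, 1);
    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads A
      [r, c, v] = find(sum(A, 2));            % this file's partial out degrees
      rowCat = [rowCat r];                    % append comma-separated row keys
      valCat = [valCat; v(:)];                % append the matching counts
    end
    % One big construction; @sum collapses every repeated row key at once.
    AdegOut = Assoc(rowCat, 'OutDeg,', valCat, @sum);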
We did that. So those show how we work with files, and that's always a good place to start. Even if you are working with the database, if you find that you're doing one query over and over again, and you're going to keep working with that data, a lot of times it's better to just do that query once, save those results to a file, and then work with that file. And, again, this is something that people often do in our business.

Now we're going to get to the actual database part of it. So the first thing we have to do is set up our database. We're going to create some tables in Accumulo, and we want to create those first so they're created properly. And so I'm going to show you the program that does that. The first thing it's going to do is call a DBsetup command, and when you run these examples, you will have to modify this DBsetup program.

The first thing you'll notice is that each group with one of these databases is all using one user account. And you could say, well, that's not the best way to do it. Well, it's very consistent with the group structure, in that you're all users, and the database is there to share data amongst your group. And so it is not an uncommon practice to have a single user account in which you put that data.

So we have a bit of a namespace problem. If you all just ran this example together, you'd all create the exact same table, and all fill it up. So the first thing we're going to do is just prepend the table names. And I would suggest that instead of having my name there, you put your name there. And then we have a special command called DBsetupLLGrid, which basically creates a binding to a database just using the name of the database. So it's a special function, not a generic function. It only works with our LLGrid system, and it only works if you have mounted the LLGrid file system. So let me hide all this stuff here and get rid of that.
So as you see here, I have mounted the LLGrid file system. And you need to do that because the DBsetup command, when it binds to the database, actually goes and gets the keys from the LLGrid file system. And those keys are sitting in the group directory. So, basically, from a password management perspective, all we need to do is add you to the group, and then you have access to the database. Or if we remove you from the group, you no longer have access to the database. Otherwise, we'd have to distribute keys to every single user all the time. And so this is why we do that; it greatly simplifies things. But you will not be able to make a connection to one of these databases unless you are either logged into your LLGrid account, or, if you are on your own computer, you've mounted the file system, so that D4M knows where to look for the keys when you issue that setup command.

So if we look at that again, this is just a shorthand for the full DBserver command. If you were connecting directly to some database other than one of these (or you could even do it with these), you would have to pass in, essentially, a five argument call: the hostname of the computer and the port, the instance name (I guess there are a couple of instance names here), the name of the user, and then an actual password. And so that's the generic way to connect in general. But for those of you connected to LLGrid, we can just use this shorthand, which is very nice.
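In sketch form, the two binding styles (the host, instance, user, and password values here are placeholders, not real credentials, and the exact DBserver argument order may differ in your D4M version):

    % LLGrid shorthand: bind by database name alone; the keys come from the
    % group directory on the mounted LLGrid file system.
    DB = DBsetupLLGrid('classdb01');

    % Generic five-argument form: host and port, server type, instance
    % name, user, and password.
    DB = DBserver('classdb01.cloud.llgrid:2181', 'Accumulo', ...
                  'classdb01', 'AccumuloUser', 'MyPassword');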
Then we're going to build a couple of tables. The first thing we're going to do is create a table that's going to hold the adjacency matrix I just created with the files. And we're going to do that with a database pair. So if we have our database object here and we give it two string names, it will know to create two tables in the database, and return a binding to that table that's a transposed pair. So whenever we do an insert into that table, it will insert the row and the column in one table, and then flip those and insert the column and the row in the other table. And whenever you do a lookup, it will know, if it's doing a row lookup, to look in one table, and if it's doing a column lookup, to do it on the other table. And this allows you to do fast lookups of both rows and columns, and makes it all nice for you.

We're also going to want to take advantage of Accumulo's built-in ability to sum as we insert. And so we're going to create something that's going to hold the degree as we go. It's very useful to have these statistics, because a lot of times you want to look up something, but the first thing you want to do is see, well, how many of them are in there? And so if you create, essentially, a column vector with that information, it's very helpful. Later, we're going to do something where we actually store the raw edges that were created. When we create the adjacency matrix, we actually lose a little bit of information; when we create this edge matrix, we'll be able to preserve that information. And, likewise, we'll be doing the tallies of the edges in that as well. So that's what this does, and that's the setup. You'll need to modify that in this program here.

Basically, after the setup is done, it adds these accumulator things by designating certain columns to be what are called combiners. So in this adjacency degree table, I've said I want to create two new columns, an out degree column and an in degree column, and the operation that I want applied when there are collisions on those values is sum. It will then sum the values. And, likewise, with the edge degree table, since I only have one column there, it just gets a degree column with sum. So that's how we do that.
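A sketch of that setup (addColCombiner is assumed here as the D4M call for designating combiner columns; table names are prefixed as suggested above):

    myName = 'Kepner_';                                  % put YOUR name here
    Tadj     = DB([myName 'Tadj'], [myName 'TadjT']);    % transposed table pair
    TadjDeg  = DB([myName 'TadjDeg']);                   % degree tallies
    Tedge    = DB([myName 'Tedge'], [myName 'TedgeT']);  % raw edges, used later
    TedgeDeg = DB([myName 'TedgeDeg']);
    % Designate combiner columns: Accumulo itself sums colliding values.
    TadjDeg  = addColCombiner(TadjDeg,  'OutDeg,InDeg,', 'sum');
    TedgeDeg = addColCombiner(TedgeDeg, 'Degree,', 'sum');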
So if we now go to our database page here, you can see class DB3 started, and we can actually view the info. And this is what a nice, fresh, never before used Accumulo instance looks like. It has very little data. It has one table, a metadata table, and a trace table. And there are no errors or anything like that; a very clean instance.

So that's what one looks like. We're going to use one that we've already started, though, which is DB1. And if we look at the tables here, there are already some tables. Michelle ran a practice run on this, and [? Chansup ?] ran a practice run. I'm going to now create those tables by running the setup test live. There, it created all those tables. You can now see, did it actually work? So if I refresh that, you can see it's now created these six tables, which are empty.

We do have the ability to set the write and read permissions of these tables. Right now, everyone has the ability to read, write, and delete everyone else's tables in a class database. In a project, that's not such a difficult thing to manage; you all know that. But you could imagine a situation [INAUDIBLE] we had a big ingest, a corpus of data, and we don't want anybody to touch it. We can actually set the permissions: we can make it read only so that no one can delete it, or we can make it so it's still read and write, but it can't be deleted whole cloth. Those permissions exist.

A feature we will add to this database manager is a checkpoint feature. So, for instance, if you did a big ingest and have a bunch of data that you're very happy with, you can checkpoint it. You'll have to stop the database; then you can create a named checkpoint of that stopped database, and you can restart from it if for some reason your database gets corrupted. As I like to say, Accumulo, like all other databases, is stable for production, but it can be unstable for development. For new database users, the database will train you in terms of the things that you should not do to it.
And so over time, you will not do things that destroy data or cause your database to be very unhappy, and then you will have a nice production database, because you will only do things that make it happy. But in that phase where you're learning, or you're experimenting with things, as with any database, it's easy to issue commands that will put the database in a fairly unhappy state, even to the point of corrupting or losing the data. But for the most part, once it's up and running in production, it's solid. We have a database that's been running for almost two years continuously using a very old version of this software. It just continues to hum away. It's got billions of entries in it, and it's running on a single Mac. So we've seen that it has the same stability as just about any other database.

So now we've created these tables, and let's insert some data into them. So I do pDB06, and I'm now going to insert the adjacency matrix. And now it's basically reading in each file and ingesting that data. It's not a lot of data, so it doesn't take very long. And now you can see up here that data is getting ingested. And you see there: we just inserted about 62,000 entries in each of those two tables, and 25,000 more, which for Accumulo is a trivial amount. We just inserted 150,000 entries into a database in, essentially, the blink of an eye, which is pretty impressive. And that's really the real power of Accumulo: on a lot of other databases, 150,000 entries means a few minutes, and here you wouldn't even think twice about doing it.

So we can take a look at that program. So here we go. We had eight files here, and we basically loaded the data. One thing we have to remember is that our adjacency matrix has numeric values in an associative array, and Accumulo can only hold strings. So we have to call a num-to-string function, which will convert those numbers into strings to be stored.

So the first thing we do is load our adjacency matrix A. We convert the numeric values to strings, and we just do a put. So we can insert the associative array directly into the adjacency table. It pulls apart the triples, it knows how to take care of the fact that this is a transposed table pair, and it does that ingest for you. And, likewise, the same thing here for the degrees: we pulled the data out, summed it, and converted it to strings to do the out degree, and the same thing for the in degree.

And this is actually where the adjacency matrix comes in very handy, because when we're accumulating, if we put those raw triples into the insert without summing first, we'd essentially be redoing the complete insert. Pre-summing in our D4M program and then inserting the summed values usually saves an order of magnitude in the number of inserts; it's a nice way to save the database a little bit of trouble. And so we certainly recommend this type of approach.
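In sketch form, the ingest loop (num2str here stands in for the num-to-string conversion mentioned above, and putCol relabels the summed column; both are assumed D4M helpers):

    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads Assoc A, numeric values
      put(Tadj, num2str(A));                  % insert; the pair also writes the transpose
      % Pre-sum in D4M, then insert the tallies; the server-side 'sum'
      % combiner accumulates collisions across the eight files.
      put(TadjDeg, num2str(putCol(sum(A, 2), 'OutDeg,')));
      put(TadjDeg, num2str(putCol(transpose(sum(A, 1)), 'InDeg,')));
    end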
So the next one is pDB07. Now we're going to do a query; we're going to get some data out of that table. And so what we did here is we said, I want to pick 100 random vertices. So in this case, I randomly generate 100 random vertices over the range 1 to 1,000. I then convert those to strings, because these are numeric values and I will look up strings. And then the first thing I'm going to do is look up the degrees of those vertices. So I have my sum table here, called T adjacency degree. I'm going to pass those strings in, and then I'm going to get just the degrees. That's just a big, long, skinny vector. Looking things up from that takes much less time than looking up whole rows or columns, and it stores all those values for me.
So that's a great place to start. And then I want to say, you know what? I only care about degrees that are in a particular range. This is a very common thing to do. There will be vertices that are so common you're like, I don't care about those; I have their tally. These are sometimes called super nodes, and doing anything with them is a waste of my time, and forces me to end up traversing enormous swathes of the database. So I'll set an upper threshold here and a lower threshold. Basically, I take the degree (I'm just going to look at the out degree), and I say, I want things that are greater than the min and less than the max, and I want to get just those rows. So that's this query: a fairly complicated analytic done in just one line. And now I have a new set of vertices, which are just the rows that satisfy this requirement.

And then I'm going to look those up again. Before, I was just looking up their counts; now I'm going to get the whole rows of the ones that satisfied the thresholds, and I know that there are none with too many entries, or too few. And then I'm going to plot it.

And so if we look at the figure, there we see it. We ended up finding four rows that had (these should all be right) one, two, three, four, five, six, seven entries, so between five and 10. These should all be between five and 10. And then this shows what their actual columns were. And so that's a very quick example of that kind of analytic. And, again, this is real bread and butter, and it's basically standard from a signal processing perspective. The max is our clutter threshold: we don't want that, it's too bright. And then we'll have a noise threshold: we don't care about anything that's below a certain value. And this sort of narrows in on the signal. It's the same kind of processing we do in signal processing; we're doing it here on graphs. And Accumulo supports that very, very nicely for us.
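A sketch of that query sequence, with the thresholds from the demo (str2num converts the string values that come back from the table; variable names are hypothetical):

    Nquery = 100;
    v = ceil(1000 .* rand(Nquery, 1));        % 100 random vertices in 1 to 1,000
    vStr = sprintf('%d,', v);                 % comma separated row keys
    Adeg = str2num(TadjDeg(vStr, :));         % degree tallies only: a cheap lookup
    dmin = 5;  dmax = 10;                     % noise and clutter thresholds
    AdegOut = Adeg(:, 'OutDeg,');
    vGood = Row((AdegOut > dmin) & (AdegOut < dmax));  % rows inside the band
    AT = str2num(Tadj(vGood, :));             % now fetch the full rows
    spy(AT);                                  % plot what came back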
So let's move on to the next example. And, actually, if we look here, if we go back to the overview, you see here, that was that ingest I did. So very quick: it's getting about 5,000 inserts a second. That was over a very short period of time; it doesn't even have time to reach its full rise time there. And then you can see the scan rate, and it was very small, a very tiny amount. As we do larger data sets, you'll really see that. And this is a great tool here; it really shows you what's going on.

Before you use the database, always check this page. If you can't get to this page for the database, you're not going to be able to get to the database, and there's no point in doing anything with D4M if this page is not responding. Likewise, if you look at this and you see this thing is just hammering away, and you're seeing hundreds of thousands or millions of inserts a second, it means that, guess what? Someone is probably using your database. And you might want to either pick a different database, or find out who that is and say, hey, when are you going to be done? Likewise on the scan side. And likewise, when you do inserts: you'll get some examples here of the kinds of rates you should be seeing, and you want to make sure you're seeing those rates. If you're not seeing those rates, if you're basically holding the resource but only inserting at a low rate, then you're actually doing yourself and everybody else a disservice. You're using the database, but you're using it inefficiently. It's much better to have your inserts go fast, because then your work gets done quicker, and you're out of everybody else's way quicker too. So I highly recommend people look at this page to see what's going on.
So I think the last one was seven, so we'll move on to eight here. Now we're going to do another, slightly more sophisticated type of query. That last query used the degree tables to prune our query. Think about it: there was probably an edge in there that had something like 100 entries, and we were able to avoid it, never had to touch that edge. But if we had touched it, that could have been a much bigger query. You might say, well, 100 doesn't sound bad. But it's easy to get databases where you'll have some rows with a million entries, or columns with a million entries, or a billion entries. And you really don't want to have to query those rows or columns, because they will just send back so much data.

But you can still have situations where you're doing queries that are going to send back a lot of data, more data than you can really handle in your memory. So we have a thing called an iterator. We've created an iterator object that lets you set up a query and have it work through the results in chunks. Now, this table is so small that you won't really get to see the iteration happening, but I'll show you the setup. So we'll do that. That was very quick.

So, essentially, we're doing a similar thing. We're creating a random set of about 100 vertices here, created over the range 1 to 1,000. And we're creating an iterator here called Tadjacency iterator. It's this function here, Iterator. We pass it the table, and then we have a flag, 'elements', which sets the chunk size: how many entries do we want this iterator to return every single time we call it? And here, we've set max elements to be a pretty small number, 1,000. So what we're saying is, every single time we do this query, we want you to return 1,000 entries at a time.
778 00:39:36,870 --> 00:39:41,940 Now for those of you who are Matlab aficionados, 779 00:39:41,940 --> 00:39:44,860 you should be in awe, because Matlab is supposed 780 00:39:44,860 --> 00:39:46,566 to be a stateless language. 781 00:39:46,566 --> 00:39:48,690 And there's nothing more stateful than an iterator. 782 00:39:48,690 --> 00:39:52,060 And so we have tricked Matlab with a combination of Matlab 783 00:39:52,060 --> 00:39:55,580 on the surface and some hidden Java under the covers 784 00:39:55,580 --> 00:39:59,850 to make it have the feel of Matlab, yet hold state. 785 00:39:59,850 --> 00:40:02,000 So now what we're going to do is we're 786 00:40:02,000 --> 00:40:05,390 going to-- this just creates the iterator. 787 00:40:05,390 --> 00:40:09,460 We then initialize the query by actually passing in a query. 788 00:40:09,460 --> 00:40:14,960 So we now say, OK, the query we want is over this set of rows 789 00:40:14,960 --> 00:40:15,690 here. 790 00:40:15,690 --> 00:40:18,890 And so we're going to run the query the first time. 791 00:40:18,890 --> 00:40:21,600 And since that thing returns string values, 792 00:40:21,600 --> 00:40:23,970 and we want numbers, we have to do a string to num. 793 00:40:23,970 --> 00:40:26,640 And that's our first associative array that's 794 00:40:26,640 --> 00:40:28,770 the result of the first query. 795 00:40:28,770 --> 00:40:31,930 We then initialize our tally here. 796 00:40:31,930 --> 00:40:36,391 And then we just do a while on this result. 797 00:40:36,391 --> 00:40:38,640 So we're going to say, oh, if there's something there, 798 00:40:38,640 --> 00:40:42,320 then we want to sum that, and add it to our tally, 799 00:40:42,320 --> 00:40:43,770 to our in degree tally. 800 00:40:43,770 --> 00:40:45,350 And then we call it again. 801 00:40:45,350 --> 00:40:47,440 So we do the next round of the iteration 802 00:40:47,440 --> 00:40:51,970 by just calling the object with no argument. 803 00:40:51,970 --> 00:40:55,380 And it will just run it until the query is empty. 804 00:40:55,380 --> 00:40:58,270 If we put in an argument, it would re-initialize 805 00:40:58,270 --> 00:41:00,140 the iterator to that new query. 806 00:41:00,140 --> 00:41:02,370 And so you don't have to create a new iterator 807 00:41:02,370 --> 00:41:04,770 every single time you want to put in a different query. 808 00:41:04,770 --> 00:41:07,320 You can reuse the object. 809 00:41:07,320 --> 00:41:09,280 Not that it really matters, but this 810 00:41:09,280 --> 00:41:10,720 is how you get it to do it again. 811 00:41:10,720 --> 00:41:14,160 So it's a pretty elegant syntax for basically doing 812 00:41:14,160 --> 00:41:15,010 an iterator. 813 00:41:15,010 --> 00:41:19,100 And it allows you to deal with things very nicely. 814 00:41:19,100 --> 00:41:23,180 And as we see here, then we did the calculation here, 815 00:41:23,180 --> 00:41:24,440 which is all right. 816 00:41:24,440 --> 00:41:30,170 I want it to then return the value with the largest 817 00:41:30,170 --> 00:41:31,500 maximum degree. 818 00:41:31,500 --> 00:41:33,610 So, basically, I compute the max. 819 00:41:33,610 --> 00:41:36,870 I get the adjacency matrix of the degree. 820 00:41:36,870 --> 00:41:38,580 And I compute its maximum. 821 00:41:38,580 --> 00:41:40,630 And then I set that equal to the in degree. 822 00:41:40,630 --> 00:41:45,025 And it tells me that the first vertex had a degree of 14 823 00:41:45,025 --> 00:41:46,400 in this query, which makes sense.
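For reference, the whole pattern we just walked through looks roughly like this in D4M's Matlab syntax. This is a minimal sketch, not the exact demo code-- the table binding Tadj, the vertex query string v, and the variable names are all assumptions:

    maxElem = 1000;                             % chunk size: entries returned per call
    Tit = Iterator(Tadj, 'elements', maxElem);  % create the iterator on the table
    A = str2num(Tit(v, :));                     % initialize with a row query; values to numeric
    myDeg = Assoc('', '', '');                  % empty associative array for the tally
    while nnz(A)                                % loop until the query is exhausted
        myDeg = myDeg + sum(A, 1);              % add this chunk to the degree tally
        A = str2num(Tit());                     % no argument: fetch the next 1,000 entries
    end

Calling Tit with a new query string instead of no argument is what re-initializes it, so the same object can be reused.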
824 00:41:46,400 --> 00:41:50,190 In the Kronecker matrix, 1, 1 is always 825 00:41:50,190 --> 00:41:55,942 the largest and densest value. 826 00:41:55,942 --> 00:41:57,442 So that's just how we use iterators. 827 00:42:00,220 --> 00:42:01,680 Let us continue on here. 828 00:42:06,010 --> 00:42:08,720 Now I'm going to use iterators in a more complicated way 829 00:42:08,720 --> 00:42:12,180 to do a join. 830 00:42:12,180 --> 00:42:16,690 So a join is where I want to basically-- maybe there's 831 00:42:16,690 --> 00:42:18,150 a row in the table. 832 00:42:18,150 --> 00:42:22,210 And I only want the rows that have 833 00:42:22,210 --> 00:42:24,810 both of certain values in them. 834 00:42:24,810 --> 00:42:28,180 So, for instance, if I had a table of network records, 835 00:42:28,180 --> 00:42:30,640 I might say, look, please only return records that 836 00:42:30,640 --> 00:42:36,500 have this source IP, and are talking to this domain name. 837 00:42:36,500 --> 00:42:40,750 So we have to figure out how we do joins in this technology. 838 00:42:40,750 --> 00:42:42,040 So I'm going to do that. 839 00:42:42,040 --> 00:42:43,408 So let me run that. 840 00:42:48,080 --> 00:42:50,810 So join-- so a little bit more complicated, 841 00:42:50,810 --> 00:42:53,460 obviously, we're building up fairly complicated analytics 842 00:42:53,460 --> 00:42:54,500 here. 843 00:42:54,500 --> 00:43:00,590 So I create my iterator limit here. 844 00:43:00,590 --> 00:43:05,330 I'm going to pick two columns to join, column 1 and column 2. 845 00:43:05,330 --> 00:43:08,260 And this just does a simple join over those columns. 846 00:43:08,260 --> 00:43:11,130 So basically what I'm doing is I'm 847 00:43:11,130 --> 00:43:15,590 saying, please return all the columns that 848 00:43:15,590 --> 00:43:20,860 contain-- this is an or, basically-- either column 1 or 2. 849 00:43:20,860 --> 00:43:24,430 I'm going to then convert those values, which 850 00:43:24,430 --> 00:43:29,000 would have been string values of numbers, to just 0s and 1s. 851 00:43:29,000 --> 00:43:30,920 Then I'm going to sum that. 852 00:43:30,920 --> 00:43:33,550 So, basically, I got the two columns. 853 00:43:33,550 --> 00:43:35,085 I converted all the values to 1. 854 00:43:35,085 --> 00:43:38,840 Now I'm going to sum them together. 855 00:43:38,840 --> 00:43:41,650 And then I'm going to ask the question, 856 00:43:41,650 --> 00:43:44,490 where were those equal to 2? 857 00:43:44,490 --> 00:43:46,210 Those show me the records. 858 00:43:46,210 --> 00:43:47,540 So that's what I'm doing here. 859 00:43:47,540 --> 00:43:49,080 I'm saying, those equal to 2. 860 00:43:49,080 --> 00:43:53,270 So that shows me all the records where that value is equal to 2. 861 00:43:53,270 --> 00:43:57,260 And I can then get the row of those things, 862 00:43:57,260 --> 00:44:02,550 and then pass that back in to the original matrix. 863 00:44:02,550 --> 00:44:08,390 And now I will get back those rows with those things. 864 00:44:08,390 --> 00:44:09,540 And so we can see that. 865 00:44:09,540 --> 00:44:13,150 I think that's the first figure that we did here. 866 00:44:13,150 --> 00:44:15,870 We go to figure one. 867 00:44:15,870 --> 00:44:21,500 And those show, basically-- what did we do? 868 00:44:21,500 --> 00:44:22,000 Right. 869 00:44:35,080 --> 00:44:40,690 This shows us the complete rows of all the records 870 00:44:40,690 --> 00:44:43,100 that had the value 1.
871 00:44:43,100 --> 00:44:45,660 So this is here, 1, and I think it 872 00:44:45,660 --> 00:44:49,960 was 100, which is somewhere right probably in there. 873 00:44:49,960 --> 00:44:55,111 So basically every single one of these records had 1 and 100 874 00:44:55,111 --> 00:44:55,610 in it. 875 00:44:55,610 --> 00:44:57,790 And then it shows us all the rest of the values that 876 00:44:57,790 --> 00:44:59,860 are also in that record. 877 00:44:59,860 --> 00:45:02,980 If we just wanted those columns and those values, 878 00:45:02,980 --> 00:45:05,170 when we just summed them equal to 2, we were done. 879 00:45:05,170 --> 00:45:07,440 But this allowed us to then go back into the record, 880 00:45:07,440 --> 00:45:11,690 and now into the full set, and just get those records. 881 00:45:11,690 --> 00:45:14,260 So this is a way to do a join if you 882 00:45:14,260 --> 00:45:19,260 can hold the complete results of both of those things, 883 00:45:19,260 --> 00:45:23,460 like give me the whole column 1, and the whole column 2. 884 00:45:23,460 --> 00:45:26,700 That's one way to do a join, perfectly reasonable way 885 00:45:26,700 --> 00:45:28,930 to do a join. 886 00:45:28,930 --> 00:45:32,590 I'm going to now do this again, but with a column range. 887 00:45:32,590 --> 00:45:37,600 So I'm going to say, I want to do a join of all columns that 888 00:45:37,600 --> 00:45:42,070 begin with 111, and all columns that begin with 222. 889 00:45:42,070 --> 00:45:43,500 So that would return more. 890 00:45:43,500 --> 00:45:46,710 I'm going to create two iterators now to do that. 891 00:45:46,710 --> 00:45:49,295 So I have initialized my iterator, iterator one 892 00:45:49,295 --> 00:45:51,740 and iterator two. 893 00:45:51,740 --> 00:45:55,120 And so now I'm going to start the first query iterator here 894 00:45:55,120 --> 00:45:57,830 by giving it its column range that initializes it. 895 00:45:57,830 --> 00:45:59,560 And I get an A1. 896 00:45:59,560 --> 00:46:03,460 And then I check to see if A1 is not empty. 897 00:46:03,460 --> 00:46:06,640 If it isn't empty, I'm going to then sum it, and then call it 898 00:46:06,640 --> 00:46:08,810 again until it completes. 899 00:46:08,810 --> 00:46:10,620 And since it's such a small thing, 900 00:46:10,620 --> 00:46:13,729 it only went through that once. 901 00:46:13,729 --> 00:46:15,770 And then now I'm going to do the same thing again 902 00:46:15,770 --> 00:46:19,250 with the other iterator and sum them together again. 903 00:46:26,140 --> 00:46:33,064 And now I'm going to join those two columns. 904 00:46:33,064 --> 00:46:34,480 And that's a way of doing the join 905 00:46:34,480 --> 00:46:36,900 with essentially two nested iterators, 906 00:46:36,900 --> 00:46:38,480 and doing the join that way. 907 00:46:38,480 --> 00:46:41,400 So that's something you can do if you couldn't hold 908 00:46:41,400 --> 00:46:45,580 the whole column in memory and you wanted to build it up 909 00:46:45,580 --> 00:46:46,080 as you went. 910 00:46:46,080 --> 00:46:48,480 That's a way to do it with iterators. 911 00:46:48,480 --> 00:46:50,090 Then let's see here. 912 00:46:53,750 --> 00:46:56,350 There's an example of the results from that. 913 00:47:00,860 --> 00:47:03,240 And just so you know, when you do an SQL query 914 00:47:03,240 --> 00:47:07,470 in an SQL database, this is what it's doing under the covers. 915 00:47:07,470 --> 00:47:10,250 It's trying to use whatever information it can.
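Backing up to that first join for a moment, here is a minimal sketch in the same D4M style-- again, the table binding Tadj and the column names are assumptions, not the exact demo code, and putVal stands in for any values-to-1 step:

    A = str2num(putVal(Tadj(:, 'v1,v2,'), '1,'));  % OR-query on the two columns; values forced to 1
    both = sum(A, 2) == 2;                         % rows where BOTH columns appear: the AND
    Aout = Tadj(Row(both), :);                     % pass the row keys back in for the full records

The two-iterator variant is the same idea, except each column sum is accumulated chunk by chunk before the final comparison.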
916 00:47:10,250 --> 00:47:13,560 It will [INAUDIBLE] hold a lot of internal statistical 917 00:47:13,560 --> 00:47:14,210 information. 918 00:47:14,210 --> 00:47:18,300 It's trying to figure out, can I hold the results in memory? 919 00:47:18,300 --> 00:47:18,884 Or can I not? 920 00:47:18,884 --> 00:47:20,300 Do I have to go through in chunks? 921 00:47:20,300 --> 00:47:21,590 Or do I not? 922 00:47:21,590 --> 00:47:24,910 Do I have information about oh-- you know, this query, 923 00:47:24,910 --> 00:47:29,230 do I have a sum table sitting around that will tell me, 924 00:47:29,230 --> 00:47:32,330 oh, which column should I query first, 925 00:47:32,330 --> 00:47:34,090 because it will return fewer results? 926 00:47:34,090 --> 00:47:36,790 And then I can go from there type of thing. 927 00:47:36,790 --> 00:47:39,220 So this is all going on under the covers. 928 00:47:39,220 --> 00:47:43,750 Here, you get the power to do that directly on the database. 929 00:47:43,750 --> 00:47:46,570 And it's pretty easy to do. 930 00:47:46,570 --> 00:47:49,270 But you do have to understand these concepts. 931 00:47:49,270 --> 00:47:52,760 So now we'll move on. 932 00:47:52,760 --> 00:47:54,830 So that was all with the adjacency matrix. 933 00:47:54,830 --> 00:47:57,390 And as I said before, when we formed the adjacency matrix, 934 00:47:57,390 --> 00:47:59,810 we threw away a little information, 935 00:47:59,810 --> 00:48:04,410 because if we had multiple-- if we had a collision of any kind, 936 00:48:04,410 --> 00:48:09,910 we lost the distinctness of that record, or that edge. 937 00:48:09,910 --> 00:48:13,760 And a lot of times, no, we want to keep these edges, 938 00:48:13,760 --> 00:48:17,190 because we'll have other information about those edges. 939 00:48:17,190 --> 00:48:19,340 And we want to keep things. 940 00:48:19,340 --> 00:48:22,530 So we want to store that. 941 00:48:22,530 --> 00:48:27,130 So we're going to now reinsert the data in the table, 942 00:48:27,130 --> 00:48:30,380 and use, essentially, instead of an adjacency matrix, 943 00:48:30,380 --> 00:48:31,510 an incidence matrix. 944 00:48:31,510 --> 00:48:35,770 In an incidence matrix, every single row is an edge. 945 00:48:35,770 --> 00:48:39,240 And every single column is a vertex. 946 00:48:39,240 --> 00:48:42,330 And so an edge can then connect multiple vertices. 947 00:48:42,330 --> 00:48:44,360 It also allows us to store essentially 948 00:48:44,360 --> 00:48:46,030 what we call hyperedges. 949 00:48:46,030 --> 00:48:48,590 So if you have an edge that connects multiple vertices 950 00:48:48,590 --> 00:48:50,700 at the same time, we can do that. 951 00:48:50,700 --> 00:48:53,700 So let's do that. 952 00:48:53,700 --> 00:48:56,960 This is inserting about twice as much data. 953 00:48:56,960 --> 00:49:02,410 So it naturally takes a little bit longer there. 954 00:49:02,410 --> 00:49:04,470 And you see the edge insertion rates 955 00:49:04,470 --> 00:49:08,960 that we're getting there, 30,000 edges per second. 956 00:49:08,960 --> 00:49:16,060 So let's go and see what it did to our data here. 957 00:49:16,060 --> 00:49:20,310 So if we look at our tables, we can see now 958 00:49:20,310 --> 00:49:24,850 that there's our edge data set. 959 00:49:24,850 --> 00:49:30,970 And you see we've inserted about 270,000 distinct entries 960 00:49:30,970 --> 00:49:31,830 in this data. 961 00:49:31,830 --> 00:49:33,320 So there's the edge table. 962 00:49:33,320 --> 00:49:34,820 And there's its transpose.
963 00:49:34,820 --> 00:49:37,660 And there's the degree count. 964 00:49:37,660 --> 00:49:41,050 And as you saw before, we had 53,000. 965 00:49:41,050 --> 00:49:45,310 So that just shows you the additional information. 966 00:49:45,310 --> 00:49:48,010 Let's look at that program here. 967 00:49:50,650 --> 00:49:55,880 So, again, we're looping over all our files here. 968 00:49:55,880 --> 00:49:56,810 We're reading them in. 969 00:49:56,810 --> 00:49:59,840 I should say, in this case we're reading in the raw text 970 00:49:59,840 --> 00:50:00,530 files again. 971 00:50:00,530 --> 00:50:03,020 We're not reading in the associative array 972 00:50:03,020 --> 00:50:05,430 because we just want to insert those edges. 973 00:50:05,430 --> 00:50:06,960 And then the only thing we've done 974 00:50:06,960 --> 00:50:09,800 is that, basically, to create our edge we 975 00:50:09,800 --> 00:50:14,160 had to create-- this data didn't come with a record label. 976 00:50:14,160 --> 00:50:15,340 So we don't have any. 977 00:50:15,340 --> 00:50:18,840 So we're constructing a unique record label for each edge 978 00:50:18,840 --> 00:50:21,310 here just so that we have it for the row key. 979 00:50:21,310 --> 00:50:26,800 And then we're prepending the word 'out' onto the row string 980 00:50:26,800 --> 00:50:28,660 and 'in' onto the column string. 981 00:50:28,660 --> 00:50:32,970 So we know when we create our record, 982 00:50:32,970 --> 00:50:35,300 'out' shows the vertex the edge departed from. 983 00:50:35,300 --> 00:50:37,920 And 'in' shows the vertex it arrived at. 984 00:50:37,920 --> 00:50:40,320 And so that's a way of creating the edge. 985 00:50:40,320 --> 00:50:46,340 And then, likewise, we do the degree count and such 986 00:50:46,340 --> 00:50:47,950 to preserve that information so we 987 00:50:47,950 --> 00:50:51,496 can sum the new total number of counts there. 988 00:50:57,410 --> 00:50:59,270 So let's do some queries on that. 989 00:51:05,600 --> 00:51:09,120 So I'm going to ask for 100 random vertices here. 990 00:51:09,120 --> 00:51:11,390 So I get my random vertices. 991 00:51:11,390 --> 00:51:15,880 And then I'm going to do my query of the strings. 992 00:51:15,880 --> 00:51:20,390 But I have to prepend this 'out', slash, onto the value 993 00:51:20,390 --> 00:51:26,820 so I know I'm looking for vertices from which the edge is 994 00:51:26,820 --> 00:51:28,060 departing. 995 00:51:28,060 --> 00:51:30,460 And I'm going to get those edges. 996 00:51:30,460 --> 00:51:31,375 So I get those edges. 997 00:51:31,375 --> 00:51:33,040 So I created the query. 998 00:51:33,040 --> 00:51:34,290 I get the edges. 999 00:51:34,290 --> 00:51:36,070 I'm going to do my thresholding again. 1000 00:51:36,070 --> 00:51:40,010 I want a certain min and max. 1001 00:51:40,010 --> 00:51:46,660 And then I'm going to do the threshold. 1002 00:51:46,660 --> 00:51:48,950 So this just gave me the degree counts. 1003 00:51:48,950 --> 00:51:52,150 And I thresholded between this range. 1004 00:51:52,150 --> 00:52:00,290 And now I do the same thing back with the-- I say, 1005 00:52:00,290 --> 00:52:02,040 give me everything greater than degree min 1006 00:52:02,040 --> 00:52:03,500 and less than degree max. 1007 00:52:03,500 --> 00:52:04,950 I get a new set of rows. 1008 00:52:04,950 --> 00:52:07,800 So that will just return the edges 1009 00:52:07,800 --> 00:52:14,130 that are a part of vertices with this degree range.
1010 00:52:14,130 --> 00:52:21,230 And then I'm going to get all those edges, all the records 1011 00:52:21,230 --> 00:52:26,480 that contain those vertices, through this nested query here. 1012 00:52:26,480 --> 00:52:28,455 The result is this. 1013 00:52:28,455 --> 00:52:32,890 So, basically, this shows me all the edges 1014 00:52:32,890 --> 00:52:40,310 that are a part of this random set of vertices 1015 00:52:40,310 --> 00:52:45,430 that have a degree range between five and 10. 1016 00:52:45,430 --> 00:52:47,780 This is a fairly sophisticated analytic. 1017 00:52:50,984 --> 00:52:52,650 We're doing about seven or eight queries 1018 00:52:52,650 --> 00:52:55,280 here, doing a lot of mathematics. 1019 00:52:55,280 --> 00:52:58,040 And you see how dense it is. 1020 00:52:58,040 --> 00:53:02,190 And, hopefully, from what we've learned in prior classes, 1021 00:53:02,190 --> 00:53:04,170 this has some intuition for you. 1022 00:53:07,080 --> 00:53:08,620 And we'll continue on. 1023 00:53:14,610 --> 00:53:17,240 So now I'm going to do a query with the iterator-- 1024 00:53:17,240 --> 00:53:18,752 again, same type of drill. 1025 00:53:18,752 --> 00:53:20,960 I set the maximum number of elements to the iterator. 1026 00:53:20,960 --> 00:53:22,750 I get my random set of things. 1027 00:53:22,750 --> 00:53:26,570 I create an iterator, again setting the number of elements. 1028 00:53:26,570 --> 00:53:31,450 I initialize the iterator to be over these vertices. 1029 00:53:31,450 --> 00:53:34,600 I then check to see if it returned anything. 1030 00:53:34,600 --> 00:53:39,710 If it did, I'm going to then actually pass 1031 00:53:39,710 --> 00:53:47,670 the rows of that back into it to get the edges containing 1032 00:53:47,670 --> 00:53:50,130 those vertices, and then do a sum 1033 00:53:50,130 --> 00:53:53,030 to compute the in degree, and so on and so forth. 1034 00:53:53,030 --> 00:53:56,750 And then I get here the maximum in degree 1035 00:53:56,750 --> 00:53:59,520 of that set of vertices was 25. 1036 00:53:59,520 --> 00:54:03,320 So that's just an example of that. 1037 00:54:03,320 --> 00:54:05,140 And that was 12. 1038 00:54:05,140 --> 00:54:07,090 And I think 13 is our last one. 1039 00:54:16,300 --> 00:54:20,840 All right, and again, a more complicated example 1040 00:54:20,840 --> 00:54:26,760 showing basically a join over this space creating, 1041 00:54:26,760 --> 00:54:33,420 essentially, a couple of sets of edges, 1042 00:54:33,420 --> 00:54:36,230 a couple of column ranges, iterators, 1043 00:54:36,230 --> 00:54:37,190 and so on and so forth. 1044 00:54:37,190 --> 00:54:38,490 And I won't belabor this point. 1045 00:54:38,490 --> 00:54:41,800 But this just shows how you can combine between using 1046 00:54:41,800 --> 00:54:46,155 the degree table and iterators. 1047 00:54:46,155 --> 00:54:49,470 You have all the tools at your disposal 1048 00:54:49,470 --> 00:54:54,990 that any type of query planning system would have inside it, 1049 00:54:54,990 --> 00:54:58,080 that it would use to make sure that you're not 1050 00:54:58,080 --> 00:55:02,871 over-taxing the system with the results that you're returning. 1051 00:55:02,871 --> 00:55:05,370 And that's a lot of times, if you do a query on any database, 1052 00:55:05,370 --> 00:55:08,100 you get the big spinning watch or whatever. 1053 00:55:08,100 --> 00:55:13,590 It's because the query you asked was simply too large.
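Pulling those incidence-matrix steps together, a rough sketch in D4M style-- every name here (Tedge, TedgeDeg, v1, v2, v, dmin, dmax, offset) is an assumption, and CatStr is assumed to replicate the single 'out,' or 'in,' prefix across the whole list:

    eID = sprintf('%08d,', (1:NumStr(v1)) + offset);   % unique record label per edge for the row key
    E = Assoc(eID, CatStr('out,', '/', v1), '1,') ...  % column 'out/v1': vertex the edge departs from
      + Assoc(eID, CatStr('in,', '/', v2), '1,');      % column 'in/v2': vertex it arrives at
    put(Tedge, E);                                     % the table pair stores the transpose as well

    deg  = str2num(TedgeDeg(CatStr('out,', '/', v), :));  % check the degree table first
    vUse = Row((deg > dmin) < dmax);                      % keep only the mid-degree vertices
    Aout = Tedge(Row(Tedge(:, vUse)), :);                 % nested query: full records of those edges

The degree table itself is kept up to date at insert time, the same way as in the adjacency examples.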
1054 00:55:13,590 --> 00:55:15,240 It also gives a lot of nice places 1055 00:55:15,240 --> 00:55:21,420 to-- if you're making a query system, to interrupt it. 1056 00:55:21,420 --> 00:55:24,290 So if you do the query against the counts, 1057 00:55:24,290 --> 00:55:27,440 you can quickly tell the user, look, you just did a query. 1058 00:55:27,440 --> 00:55:31,390 And this is going to return 10 million results. 1059 00:55:31,390 --> 00:55:33,312 Do you want to proceed? 1060 00:55:33,312 --> 00:55:34,770 And so you, of course, [INAUDIBLE]. 1061 00:55:34,770 --> 00:55:38,600 Or likewise, you can set a maximum number of iterations. 1062 00:55:38,600 --> 00:55:40,840 Like, say, OK, I want to get them back 1063 00:55:40,840 --> 00:55:44,070 in units of 100,000 entries. 1064 00:55:44,070 --> 00:55:46,330 But I only want to go up to a million. 1065 00:55:46,330 --> 00:55:49,220 And then I'm going to pause and get 1066 00:55:49,220 --> 00:55:52,380 some kind of additional guidance from the user to continue. 1067 00:55:52,380 --> 00:55:54,100 So those are the same tools and tricks 1068 00:55:54,100 --> 00:55:57,250 that are in any query planner very elegantly exposed 1069 00:55:57,250 --> 00:56:03,120 to you here for managing these types of queries. 1070 00:56:03,120 --> 00:56:05,750 With that, I want to do some stuff 1071 00:56:05,750 --> 00:56:08,081 where we do a little bit of bigger data sets. 1072 00:56:08,081 --> 00:56:09,580 So I've walked through the examples. 1073 00:56:09,580 --> 00:56:14,830 I want to show you what this is like running on bigger data. 1074 00:56:14,830 --> 00:56:17,560 So let's close all that. 1075 00:56:20,530 --> 00:56:21,510 Close that. 1076 00:56:34,070 --> 00:56:35,400 I want to do this one. 1077 00:56:35,400 --> 00:56:41,620 So now I'm logged into-- I just SSHed 1078 00:56:41,620 --> 00:56:48,180 into classdb02.cloud.llgrid.ll.mit.edu, which 1079 00:56:48,180 --> 00:56:50,710 happens-- it tells you which node it's actually mapped to, 1080 00:56:50,710 --> 00:56:55,030 which is node F-15-11 in our cluster. 1081 00:56:55,030 --> 00:56:57,000 And this is a fairly powerful compute node. 1082 00:56:57,000 --> 00:56:59,564 These are our next generation compute nodes for LLGrid. 1083 00:56:59,564 --> 00:57:01,230 So those of you who've been using LLGrid 1084 00:57:01,230 --> 00:57:02,910 for all these past years may have 1085 00:57:02,910 --> 00:57:05,290 noticed that they're getting a little long in the tooth. 1086 00:57:05,290 --> 00:57:08,180 These are the first set of the new nodes. 1087 00:57:08,180 --> 00:57:12,900 And we'll have about 500 of them total in various systems. 1088 00:57:12,900 --> 00:57:19,960 And so here, I am doing something a little bit larger. 1089 00:57:19,960 --> 00:57:21,245 So let me see here. 1090 00:57:26,060 --> 00:57:28,780 Examples-- so I'm on my LLGrid account here. 1091 00:57:28,780 --> 00:57:33,810 And I'm going to go to 3 and then 2. 1092 00:57:33,810 --> 00:57:37,500 Then I do open dots. 1093 00:57:37,500 --> 00:57:40,030 So that's the directory. 1094 00:57:40,030 --> 00:57:45,520 And so the first thing I did is in my DB setup here, 1095 00:57:45,520 --> 00:57:53,150 you'll notice that I have done class DB 0. 1096 00:57:53,150 --> 00:57:56,050 And also, we don't need to do 1. 1097 00:57:56,050 --> 00:57:58,200 But I will do 2 here. 1098 00:57:58,200 --> 00:57:59,510 I've made this bigger.
1099 00:57:59,510 --> 00:58:03,320 So I have made this now 2 to the 18th vertices, 1100 00:58:03,320 --> 00:58:06,740 instead of what I had before. 1101 00:58:06,740 --> 00:58:11,940 So let's go [INAUDIBLE] anymore. 1102 00:58:11,940 --> 00:58:18,936 So if I do PDB02, it's going to now generate these things. 1103 00:58:18,936 --> 00:58:20,560 And so it's generating a lot more data. 1104 00:58:20,560 --> 00:58:24,684 And you see it's doing about 200,000 vertices per second. 1105 00:58:24,684 --> 00:58:26,850 Just shows you the difference between the capability 1106 00:58:26,850 --> 00:58:32,900 of my laptop and one of these servers here. 1107 00:58:32,900 --> 00:58:36,850 And this will also-- I'm logged onto this system. 1108 00:58:36,850 --> 00:58:38,390 It has 32 cores. 1109 00:58:38,390 --> 00:58:40,340 I can do things in parallel. 1110 00:58:40,340 --> 00:58:46,310 And so, for instance, if I did eval pRUN, 1111 00:58:46,310 --> 00:58:52,690 for those of you who have had the parallel Matlab training, 1112 00:58:52,690 --> 00:58:54,880 as I said before. 1113 00:58:54,880 --> 00:58:57,140 And since I'm logged into this node, 1114 00:58:57,140 --> 00:58:59,120 and I just do curly brackets, it just 1115 00:58:59,120 --> 00:59:00,710 says launch locally on that node. 1116 00:59:00,710 --> 00:59:01,890 Don't launch onto the grid. 1117 00:59:05,600 --> 00:59:10,660 And now it's launching that in parallel on this node-- data one, 1118 00:59:10,660 --> 00:59:11,210 did that. 1119 00:59:11,210 --> 00:59:12,610 Data two, did that. 1120 00:59:12,610 --> 00:59:13,360 Now it's done. 1121 00:59:13,360 --> 00:59:17,170 And the others finished their work too, probably right 1122 00:59:17,170 --> 00:59:18,010 about now. 1123 00:59:18,010 --> 00:59:19,600 So that's just an example of being 1124 00:59:19,600 --> 00:59:21,320 able to do things in parallel. 1125 00:59:21,320 --> 00:59:23,580 We've created here-- I mean, you look at it. 1126 00:59:23,580 --> 00:59:28,000 We did eight times 300,000. 1127 00:59:28,000 --> 00:59:31,880 We did 2.5 million edges in that, essentially, 1128 00:59:31,880 --> 00:59:34,330 five seconds type of thing. 1129 00:59:34,330 --> 00:59:37,420 So multiply this by 4. 1130 00:59:37,420 --> 00:59:39,860 You're doing like a million edges a second just 1131 00:59:39,860 --> 00:59:44,400 in that one type of calculation there. 1132 00:59:44,400 --> 00:59:47,970 Moving on to PDB03. 1133 00:59:47,970 --> 00:59:54,200 And-- oh, I should say, I did modify that program slightly. 1134 00:59:54,200 --> 00:59:54,958 Let's see here. 1135 00:59:59,040 --> 01:00:07,400 So if I look at the line labeled in big capital letters 1136 01:00:07,400 --> 01:00:10,490 PARALLEL, I uncommented it. 1137 01:00:10,490 --> 01:00:13,450 That's what allows each one of the processors, when 1138 01:00:13,450 --> 01:00:16,290 they launched, to then work on different files. 1139 01:00:16,290 --> 01:00:19,650 Otherwise, they all would have worked on the same files. 1140 01:00:19,650 --> 01:00:22,610 So by uncommenting that PARALLEL, 1141 01:00:22,610 --> 01:00:24,730 this now becomes a parallel program 1142 01:00:24,730 --> 01:00:26,420 that I can run with the eval pRUN command. 1143 01:00:26,420 --> 01:00:28,670 Of course, you have to have parallel Matlab installed, 1144 01:00:28,670 --> 01:00:30,753 which of course you all do since you're on LLGrid.
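Concretely, that launch line in pMatlab terms is roughly this-- the script name is a stand-in:

    eval(pRUN('pDB02_DataTEST', 4, {}));  % 4 processes; {} means launch locally on this node

Swapping the {} for a machine specification is what would send the job out onto the grid instead.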
1145 01:00:30,753 --> 01:00:33,650 But for anyone seeing this on the internet, 1146 01:00:33,650 --> 01:00:36,470 they would need to have that software installed too, which 1147 01:00:36,470 --> 01:00:39,554 is also available on the web and installable there. 1148 01:00:39,554 --> 01:00:40,970 So that's all we needed to do, was 1149 01:00:40,970 --> 01:00:44,320 uncomment that one line to make that program parallel, 1150 01:00:44,320 --> 01:00:46,806 and it did the right thing for us as well. 1151 01:00:46,806 --> 01:00:48,680 And we're going to go on to the next example. 1152 01:00:48,680 --> 01:00:51,090 And we did the same thing there. 1153 01:00:51,090 --> 01:00:52,570 We just uncommented PARALLEL. 1154 01:00:52,570 --> 01:00:55,410 And it's now going to create these associative arrays 1155 01:00:55,410 --> 01:00:56,490 in parallel. 1156 01:00:56,490 --> 01:01:07,410 So if I do PDB03-- so it's now actually constructing 1157 01:01:07,410 --> 01:01:08,840 these associative arrays. 1158 01:01:08,840 --> 01:01:10,800 You see it takes about four seconds 1159 01:01:10,800 --> 01:01:11,890 to do each one of those. 1160 01:01:11,890 --> 01:01:17,740 It's doing about 120,000, 130,000 entries per second. 1161 01:01:17,740 --> 01:01:22,041 So this thing will take about 25 seconds to do the whole thing. 1162 01:01:35,730 --> 01:01:50,380 And, again, if we ran that in parallel, it automatically 1163 01:01:50,380 --> 01:01:52,090 tries to kill the last job you ran 1164 01:01:52,090 --> 01:01:54,173 if you're in the same directory so that you're not 1165 01:01:54,173 --> 01:01:57,330 [INAUDIBLE] on top of yourself. 1166 01:01:57,330 --> 01:02:01,180 And now you see it's doing that again. 1167 01:02:01,180 --> 01:02:03,900 And now it's done. 1168 01:02:03,900 --> 01:02:06,780 And the other one is finished as well. 1169 01:02:06,780 --> 01:02:10,310 You can actually check that, if you really want to. 1170 01:02:10,310 --> 01:02:18,062 Just hit Control Z, if I do more [INAUDIBLE] out. 1171 01:02:18,062 --> 01:02:20,520 You can see those are the output files of each one of them. 1172 01:02:20,520 --> 01:02:21,520 I'm not lying. 1173 01:02:21,520 --> 01:02:23,394 They didn't take a ridiculous amount of time. 1174 01:02:23,394 --> 01:02:24,920 They all completed. 1175 01:02:24,920 --> 01:02:26,470 We always encourage people to check 1176 01:02:26,470 --> 01:02:28,845 those .out files, in that [INAUDIBLE] directory. 1177 01:02:28,845 --> 01:02:30,660 That's where it sends all the standard out 1178 01:02:30,660 --> 01:02:32,060 from all the other nodes. 1179 01:02:32,060 --> 01:02:33,720 It's probably the number one feedback 1180 01:02:33,720 --> 01:02:36,570 we get from a user who says, my job didn't run. 1181 01:02:36,570 --> 01:02:38,820 We're like, what does it say in your .out files? 1182 01:02:38,820 --> 01:02:41,270 And usually it's like, oh, there's an error on node 3 1183 01:02:41,270 --> 01:02:43,700 because this calculation is wrong on that particular node, 1184 01:02:43,700 --> 01:02:48,800 or something like that-- so just reminding folks of that. 1185 01:02:48,800 --> 01:02:52,360 Moving on-- so what did we just do? 1186 01:02:52,360 --> 01:02:54,130 We did three. 1187 01:02:54,130 --> 01:02:59,380 So we'll do PDB04 next. 1188 01:03:02,860 --> 01:03:05,200 And this is doing a little bit bigger calculation. 1189 01:03:05,200 --> 01:03:07,840 And so you can see here-- I told you 1190 01:03:07,840 --> 01:03:11,040 it does begin to get bigger here.
1191 01:03:11,040 --> 01:03:13,050 So it started out-- the first iteration 1192 01:03:13,050 --> 01:03:14,760 was about 0.6 seconds. 1193 01:03:14,760 --> 01:03:18,390 And then it goes on to 0.8 seconds. 1194 01:03:18,390 --> 01:03:22,340 If we did that cat approach, it would do it faster. 1195 01:03:25,210 --> 01:03:26,870 I'll show you a little neat trick. 1196 01:03:26,870 --> 01:03:30,600 This is also a parallel program when I run it. 1197 01:03:30,600 --> 01:03:37,600 And, basically, I loop over each file here. 1198 01:03:37,600 --> 01:03:39,540 I'm just doing this little agg thing just 1199 01:03:39,540 --> 01:03:42,460 to sync the processors just for fun so I don't 1200 01:03:42,460 --> 01:03:44,840 have to wait for them to start. 1201 01:03:44,840 --> 01:03:46,000 And then it's going to go. 1202 01:03:46,000 --> 01:03:47,510 And they're going to do their sums. 1203 01:03:47,510 --> 01:03:49,350 And then when they're all done, they're 1204 01:03:49,350 --> 01:03:51,903 going to call this-- so each one will have a local sum. 1205 01:03:51,903 --> 01:03:53,660 And it needs to be pulled together. 1206 01:03:53,660 --> 01:03:56,860 So we have this function called gagg, which basically takes 1207 01:03:56,860 --> 01:04:00,280 associative arrays and will just sum them all together-- a very 1208 01:04:00,280 --> 01:04:02,849 nice tool for doing that. 1209 01:04:02,849 --> 01:04:04,890 And, of course, we had to uncomment that in order 1210 01:04:04,890 --> 01:04:06,000 for that to work. 1211 01:04:06,000 --> 01:04:08,490 And so let's go give that a try. 1212 01:04:08,490 --> 01:04:25,380 And so if we do eval pRUN, so it's launching them. 1213 01:04:25,380 --> 01:04:26,240 And it's reading. 1214 01:04:26,240 --> 01:04:28,281 And then now it's going to have to wait a second. 1215 01:04:28,281 --> 01:04:31,480 OK, so it took about two seconds to pull them all together 1216 01:04:31,480 --> 01:04:33,370 and do the sum across those processors. 1217 01:04:33,370 --> 01:04:36,420 So that's a parallel computation, 1218 01:04:36,420 --> 01:04:40,160 a classic example of what people do with LLGrid 1219 01:04:40,160 --> 01:04:42,320 and can do with D4M is they have a bunch of files. 1220 01:04:42,320 --> 01:04:44,880 Each processor processes them independently. 1221 01:04:44,880 --> 01:04:50,190 And at the end, they pull something all together using 1222 01:04:50,190 --> 01:04:53,960 this gagg command. 1223 01:04:53,960 --> 01:05:04,485 All right, moving on-- so now I'm on database 2 here. 1224 01:05:04,485 --> 01:05:05,610 So I'm going to go to that. 1225 01:05:05,610 --> 01:05:07,862 And let's look at our tables. 1226 01:05:07,862 --> 01:05:10,320 Very little activity-- and you see we have no tables there. 1227 01:05:10,320 --> 01:05:11,400 So I have to create them. 1228 01:05:11,400 --> 01:05:14,590 So I'm going to run PDB [INAUDIBLE] setup 05. 1229 01:05:14,590 --> 01:05:17,616 That's going to create those tables on that database. 1230 01:05:23,600 --> 01:05:28,040 We can now look, see. 1231 01:05:28,040 --> 01:05:30,090 And it created all my tables. 1232 01:05:30,090 --> 01:05:31,765 So now I'm ready to go. 1233 01:05:31,765 --> 01:05:35,915 And now we're going to do the insert again, PDB06. 1234 01:05:35,915 --> 01:05:38,220 I'm going to insert the adjacency matrix. 1235 01:05:49,620 --> 01:05:52,290 And this obviously takes a little bit longer.
1236 01:05:52,290 --> 01:05:55,110 Each one of these, there's a parameter associated 1237 01:05:55,110 --> 01:05:57,860 with the table, which is-- you would think, 1238 01:05:57,860 --> 01:05:59,870 normally, it should just send all its data 1239 01:05:59,870 --> 01:06:01,432 to the database at once. 1240 01:06:01,432 --> 01:06:02,890 But it turns out one thing we found 1241 01:06:02,890 --> 01:06:05,440 is that actually the database prefers 1242 01:06:05,440 --> 01:06:08,990 to receive the data in a smaller increment, typically 1243 01:06:08,990 --> 01:06:11,200 around half a megabyte chunk. 1244 01:06:11,200 --> 01:06:13,300 So every single time it's calling this, 1245 01:06:13,300 --> 01:06:15,734 it's sending half a megabyte, waiting to get the all clear 1246 01:06:15,734 --> 01:06:17,275 again, and then sending the next one. 1247 01:06:17,275 --> 01:06:19,800 And we've actually found that makes a fairly significant-- 1248 01:06:19,800 --> 01:06:23,600 so you can see here, we're getting about 60,000 inserts 1249 01:06:23,600 --> 01:06:24,660 per second. 1250 01:06:24,660 --> 01:06:28,050 This is from one processor. 1251 01:06:28,050 --> 01:06:30,205 And this takes a little while. 1252 01:06:30,205 --> 01:06:33,580 Let's see if we can go and look at it while it's going on. 1253 01:06:33,580 --> 01:06:36,280 If we go here, we should begin to see some. 1254 01:06:36,280 --> 01:06:37,654 So there you go. 1255 01:06:37,654 --> 01:06:39,820 That's what a real insert is beginning to look like. 1256 01:06:39,820 --> 01:06:43,910 It just changes its axis for you dynamically here. 1257 01:06:43,910 --> 01:06:49,050 But we're getting about 60,000 inserts a second there. 1258 01:06:49,050 --> 01:06:51,560 Let me just go along here. 1259 01:06:51,560 --> 01:06:53,793 It'll start leveling off a little bit. 1260 01:07:01,192 --> 01:07:03,775 And then it will show you what's going on in the tables there. 1261 01:07:14,180 --> 01:07:16,750 And, I mean, not too many of you are probably 1262 01:07:16,750 --> 01:07:17,990 database aficionados. 1263 01:07:17,990 --> 01:07:24,340 But 60,000 inserts a second on a single-node database 1264 01:07:24,340 --> 01:07:26,549 is pretty darn amazing. 1265 01:07:26,549 --> 01:07:29,090 I mean, you would mostly have had to use a parallel database. 1266 01:07:29,090 --> 01:07:30,680 And that's actually one of the great powers of Accumulo: 1267 01:07:30,680 --> 01:07:32,300 even though it's 1268 01:07:32,300 --> 01:07:34,927 a parallel database, there's a lot of problems you can now 1269 01:07:34,927 --> 01:07:37,510 do with a single-node database that you would have had to have 1270 01:07:37,510 --> 01:07:39,310 a parallel system to do before. 1271 01:07:39,310 --> 01:07:41,400 And that's really-- because parallel computing 1272 01:07:41,400 --> 01:07:43,180 is a real pain. 1273 01:07:43,180 --> 01:07:43,790 I should know. 1274 01:07:47,079 --> 01:07:49,120 If you can make your problem fast enough 1275 01:07:49,120 --> 01:07:50,830 to now work on a single node, that's 1276 01:07:50,830 --> 01:07:53,864 really a tremendously convenient capability. 1277 01:08:02,460 --> 01:08:09,230 So we inserted there, what, eight million entries-- so 1278 01:08:09,230 --> 01:08:11,210 pretty impressive. 1279 01:08:11,210 --> 01:08:13,340 But that did take a while. 1280 01:08:13,340 --> 01:08:16,330 And so maybe I want to do that in parallel. 1281 01:08:16,330 --> 01:08:24,452 So if I just do eval pRUN, let's try four and see what happens.
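While that parallel run spins up, the chunking knob just described looks roughly like this on the D4M side. A sketch, assuming a DB binding, the table names, and the putBytes property that controls the put chunk size:

    Tadj = DB('Tadj', 'TadjT');   % bind to the adjacency table pair
    Tadj.putBytes = 0.5e6;        % send about half a megabyte per round trip
    put(Tadj, num2str(An));       % the insert then goes out in chunks of that size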
1282 01:08:27,620 --> 01:08:29,040 Now I would expect, actually, this 1283 01:08:29,040 --> 01:08:30,660 to begin to top this thing out. 1284 01:08:30,660 --> 01:08:33,564 And so the individual insert rates on each process 1285 01:08:33,564 --> 01:08:34,813 probably go down a little bit. 1286 01:08:38,930 --> 01:08:41,200 And you'll see it will get a little bit noisy here, 1287 01:08:41,200 --> 01:08:44,520 because now we have four separate processes all doing 1288 01:08:44,520 --> 01:08:45,214 inserts. 1289 01:08:45,214 --> 01:08:46,130 You see there was one. 1290 01:08:46,130 --> 01:08:47,713 It took a little bit, almost a second. 1291 01:08:47,713 --> 01:08:51,180 And you get this noise beginning to happen here. 1292 01:08:51,180 --> 01:08:53,720 But we're getting 50,000 edges per second 1293 01:08:53,720 --> 01:08:56,460 from each process, which means we should be getting 1294 01:08:56,460 --> 01:08:58,819 close to four times that. 1295 01:08:58,819 --> 01:08:59,810 So let's go check. 1296 01:08:59,810 --> 01:09:01,540 What's it seeing here? 1297 01:09:01,540 --> 01:09:04,270 So if we update that-- and there you see, 1298 01:09:04,270 --> 01:09:08,630 we're sort of now climbing the hill, well over 100,000. 1299 01:09:08,630 --> 01:09:10,470 That was our first insert there. 1300 01:09:10,470 --> 01:09:14,224 And now we're entering the second one here. 1301 01:09:14,224 --> 01:09:16,414 Whoops, don't want to check my email. 1302 01:09:25,770 --> 01:09:26,630 Let's see here. 1303 01:09:26,630 --> 01:09:28,040 So how are we doing there? 1304 01:09:28,040 --> 01:09:28,720 Oh, it's done. 1305 01:09:31,750 --> 01:09:34,979 We may not have even gotten the full rise time of that. 1306 01:09:34,979 --> 01:09:38,240 Yeah, so it basically finished before we could even really hit 1307 01:09:38,240 --> 01:09:41,914 the-- it has a little filter here, has a certain resolution. 1308 01:09:41,914 --> 01:09:44,080 You really need to do an insert for about 10 minutes 1309 01:09:44,080 --> 01:09:46,060 before you can really get a sense of that. 1310 01:09:46,060 --> 01:09:52,120 But there you see, we probably got over 200,000 inserts 1311 01:09:52,120 --> 01:09:57,220 a second using four processes on one node. 1312 01:09:57,220 --> 01:10:01,110 And we could probably keep on going up this ramp. 1313 01:10:01,110 --> 01:10:02,760 For this data set, I'd expect we'd 1314 01:10:02,760 --> 01:10:04,980 be able to get maybe 500,000 inserts a second 1315 01:10:04,980 --> 01:10:08,540 if I kept adding processors and stuff like that. 1316 01:10:08,540 --> 01:10:13,710 And if you look at our data, what do we got here? 1317 01:10:13,710 --> 01:10:18,220 We got like 15 million entries now in our database. 1318 01:10:18,220 --> 01:10:21,010 Again, one of the nice things is, for a typical database, 1319 01:10:21,010 --> 01:10:24,770 a lot of times if you have to re-ingest the whole database, 1320 01:10:24,770 --> 01:10:25,480 that's fine. 1321 01:10:25,480 --> 01:10:28,040 In our cyber program, we have a month of data. 1322 01:10:28,040 --> 01:10:30,410 And we can re-ingest it in a couple hours. 1323 01:10:30,410 --> 01:10:32,640 And that's a very powerful tool to be able to say, 1324 01:10:32,640 --> 01:10:33,390 oh, you know what? 1325 01:10:33,390 --> 01:10:34,860 I didn't like the ingestion. 1326 01:10:34,860 --> 01:10:35,500 That's fine. 1327 01:10:35,500 --> 01:10:37,440 I'll just rewrite the ingestion and redo it.
1328 01:10:37,440 --> 01:10:39,810 And this gives you a very powerful tool 1329 01:10:39,810 --> 01:10:41,900 for exploring your data here. 1330 01:10:41,900 --> 01:10:44,920 So that's kind of what I want to do with that. 1331 01:10:44,920 --> 01:10:49,090 One of our students here very generously gave me 1332 01:10:49,090 --> 01:10:50,440 some Twitter data. 1333 01:10:50,440 --> 01:10:53,870 And so I wanted to show you a little bit with that Twitter 1334 01:10:53,870 --> 01:10:56,180 data, because it's probably maybe a hair more 1335 01:10:56,180 --> 01:11:00,610 meaningful than this abstract Kronecker graph data. 1336 01:11:00,610 --> 01:11:04,230 And by definition, Twitter data is about the most public data 1337 01:11:04,230 --> 01:11:05,130 that one can imagine. 1338 01:11:05,130 --> 01:11:07,450 I think no one who posts to Twitter 1339 01:11:07,450 --> 01:11:10,310 is expecting any sense of privacy there. 1340 01:11:10,310 --> 01:11:14,410 So I think we can use that OK. 1341 01:11:14,410 --> 01:11:18,027 So let's see here. 1342 01:11:18,027 --> 01:11:20,236 Let me exit out of that. 1343 01:11:28,270 --> 01:11:37,027 [INAUDIBLE] desktop, [INAUDIBLE], Twitter. 1344 01:11:40,800 --> 01:11:44,800 And so, basically, just a few examples here-- the first thing 1345 01:11:44,800 --> 01:11:47,640 we did is construct the associative array. 1346 01:11:47,640 --> 01:11:49,507 So let's start up here. 1347 01:11:52,130 --> 01:11:54,660 And I think it was like a million Twitter entries. 1348 01:11:54,660 --> 01:11:56,300 Is that right? 1349 01:11:56,300 --> 01:12:02,289 A million entries-- and how many tweets do you think that was? 1350 01:12:02,289 --> 01:12:03,830 We should be able to find out, right? 1351 01:12:03,830 --> 01:12:05,080 We should be able to find out. 1352 01:12:10,150 --> 01:12:12,302 So let's do the first thing here. 1353 01:12:14,950 --> 01:12:17,744 So it's reading these in, in chunks, 1354 01:12:17,744 --> 01:12:19,535 and writing them out to associative arrays. 1355 01:12:23,234 --> 01:12:24,400 That's going to be annoying, 1356 01:12:24,400 --> 01:12:25,100 isn't it? 1357 01:12:25,100 --> 01:12:26,290 Let's go to a faster system. 1358 01:12:40,610 --> 01:12:42,923 So this is running on the database system. 1359 01:12:49,890 --> 01:12:51,610 And I broke it up into chunks of 10. 1360 01:12:51,610 --> 01:12:54,564 I couldn't quite fit the whole thing on my laptop 1361 01:12:54,564 --> 01:12:55,605 in one associative array. 1362 01:12:55,605 --> 01:12:58,699 So I broke it up into chunks of 10. 1363 01:13:07,169 --> 01:13:08,710 Yeah, see, we're still cruising there. 1364 01:13:12,320 --> 01:13:14,140 So that just reads it all in. 1365 01:13:14,140 --> 01:13:16,190 In fact, we can take a look at that file. 1366 01:13:24,530 --> 01:13:26,600 So I just took that exact same example 1367 01:13:26,600 --> 01:13:28,050 and just put his data in-- so just 1368 01:13:28,050 --> 01:13:30,400 to take a look at what that involved, 1369 01:13:30,400 --> 01:13:33,370 pretty much all the same. 1370 01:13:33,370 --> 01:13:34,970 It was one big file. 1371 01:13:34,970 --> 01:13:37,890 But I couldn't process it. 1372 01:13:37,890 --> 01:13:40,530 I mean, I could read it in. 1373 01:13:40,530 --> 01:13:42,810 He did a great job of turning it into triples. 1374 01:13:42,810 --> 01:13:44,890 And I could easily hold those triples. 1375 01:13:44,890 --> 01:13:48,350 But I couldn't quite construct the associative array.
1376 01:13:48,350 --> 01:13:51,470 And so I basically just go through 1377 01:13:51,470 --> 01:13:56,780 and find all the separators, and then basically take them out 1378 01:13:56,780 --> 01:13:59,160 of the vector in memory. 1379 01:13:59,160 --> 01:14:03,960 And so that's what I'm doing here, is I'm looping over here. 1380 01:14:03,960 --> 01:14:06,480 So, basically, I read in all the data. 1381 01:14:06,480 --> 01:14:08,574 I find all the separators. 1382 01:14:08,574 --> 01:14:09,490 And then I go through. 1383 01:14:09,490 --> 01:14:11,540 And it's a little bit of a messy calculation 1384 01:14:11,540 --> 01:14:14,324 to basically do them in these particular blocks. 1385 01:14:14,324 --> 01:14:16,240 And then I can construct the associative array 1386 01:14:16,240 --> 01:14:18,363 and save those out, no problem. 1387 01:14:22,409 --> 01:14:23,200 And let's see here. 1388 01:14:23,200 --> 01:14:25,870 So what else [INAUDIBLE]. 1389 01:14:25,870 --> 01:14:31,247 We can do a little degree calculation from that data. 1390 01:14:31,247 --> 01:14:32,830 So it's now reading each one of those, 1391 01:14:32,830 --> 01:14:34,480 and computing the degrees. 1392 01:14:40,840 --> 01:14:44,444 [INAUDIBLE] do the same thing on this system. 1393 01:14:44,444 --> 01:14:45,887 It's pretty fast. 1394 01:15:05,530 --> 01:15:11,720 Proceed then to-- let's create some tables. 1395 01:15:11,720 --> 01:15:14,845 So I created a special class of tables for that. 1396 01:15:14,845 --> 01:15:18,100 [INAUDIBLE] modify that [INAUDIBLE] that. 1397 01:15:18,100 --> 01:15:20,896 If you go over here, I think it was on this one. 1398 01:15:23,710 --> 01:15:32,300 Nope, I did it on the other database-- database 1, 1399 01:15:32,300 --> 01:15:35,950 got tables there. 1400 01:15:35,950 --> 01:15:39,810 So all I was doing there was plotting the degree 1401 01:15:39,810 --> 01:15:41,260 distribution. 1402 01:15:41,260 --> 01:15:53,400 So this shows us-- so, for each tweet-- 1403 01:15:53,400 --> 01:16:01,790 let's go to figure one-- are you done? 1404 01:16:01,790 --> 01:16:03,130 It's very Twitter-like. 1405 01:16:03,130 --> 01:16:06,640 No, no one is ever done on Twitter. 1406 01:16:06,640 --> 01:16:09,776 So-- wow, what did I do? 1407 01:16:13,640 --> 01:16:16,440 I went to town, didn't I? 1408 01:16:16,440 --> 01:16:17,610 Done now? 1409 01:16:17,610 --> 01:16:20,130 So we go to figure 1. 1410 01:16:20,130 --> 01:16:23,480 You see, for each tweet, this shows us how much information 1411 01:16:23,480 --> 01:16:24,340 was in each tweet. 1412 01:16:24,340 --> 01:16:26,830 And you see that, on average-- this is because he basically 1413 01:16:26,830 --> 01:16:29,340 parsed out each word uniquely. 1414 01:16:29,340 --> 01:16:33,450 So this shows there is about 1,000 pieces of information 1415 01:16:33,450 --> 01:16:39,030 associated with each tweet, which seems a little high. 1416 01:16:39,030 --> 01:16:40,765 So we should probably double-check that. 1417 01:16:43,990 --> 01:16:45,780 And then what did I do? 1418 01:16:50,304 --> 01:16:51,470 So we loaded all of them up. 1419 01:16:51,470 --> 01:16:53,090 We summed the degrees. 1420 01:16:53,090 --> 01:16:58,090 And then-- oh, I said, show me all the locations with counts 1421 01:16:58,090 --> 01:17:04,790 greater than 100, and then all the words with at signs 1422 01:17:04,790 --> 01:17:07,120 greater than 100, and all the hashtags greater than 50. 1423 01:17:07,120 --> 01:17:09,710 So that's what these other guys are.
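Just to make that blocked construction concrete, a rough sketch-- here rt, ct, and vt are assumed to be the row, column, and value triple lists already in memory, each one a char vector of separator-terminated values:

    sep = rt(end);                                 % the list separator in use
    rs = find(rt == sep);  cs = find(ct == sep);  vs = find(vt == sep);
    cut = @(s, ix, i0, i1) s((1 + (i0 > 1) * ix(max(i0 - 1, 1))) : ix(i1));  % records i0..i1
    Nblk = 10;                                     % ten blocks fit in memory; one big one did not
    bnd = round(linspace(0, numel(rs), Nblk + 1));
    for k = 1:Nblk
        i0 = bnd(k) + 1;  i1 = bnd(k + 1);
        A = Assoc(cut(rt, rs, i0, i1), cut(ct, cs, i0, i1), cut(vt, vs, i0, i1));
        save(sprintf('Atweet_%02d.mat', k), 'A');  % write each chunk out
    end

Each saved block is then small enough to load, degree-sum, and insert on its own.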
1424 01:17:09,710 --> 01:17:11,880 So this was the-- essentially, for each tweet, 1425 01:17:11,880 --> 01:17:14,310 how many do you have? 1426 01:17:14,310 --> 01:17:17,970 If we go to figure 2, this shows the degree distribution 1427 01:17:17,970 --> 01:17:21,144 of all the words and other things in there. 1428 01:17:21,144 --> 01:17:23,310 So there's some guy here who is really, really high. 1429 01:17:23,310 --> 01:17:26,090 In fact, we can find him out. 1430 01:17:26,090 --> 01:17:29,590 But, of course, most things occur only once-- like, 1431 01:17:29,590 --> 01:17:31,499 there's a lot of unique keys. 1432 01:17:31,499 --> 01:17:34,040 There's the message ID, which of course is probably something 1433 01:17:34,040 --> 01:17:35,990 that only appears once. 1434 01:17:35,990 --> 01:17:40,000 If we go to figure 3-- so this just shows the locations. 1435 01:17:40,000 --> 01:17:42,910 And this was tweets from the day before the hurricane, 1436 01:17:42,910 --> 01:17:46,370 or the Wednesday before Hurricane Sandy. 1437 01:17:46,370 --> 01:17:51,690 And so this shows us-- but limited to the New York 1438 01:17:51,690 --> 01:17:53,074 area or something like that? 1439 01:17:53,074 --> 01:17:54,990 AUDIENCE: Yeah, 40 miles around New York City. 1440 01:17:54,990 --> 01:17:57,180 JEREMY KEPNER: 40 miles around New York City. 1441 01:17:57,180 --> 01:18:00,760 Basically, this shows all the locations here. 1442 01:18:00,760 --> 01:18:02,619 So this is a classic example of the kind 1443 01:18:02,619 --> 01:18:04,660 of things you want to do, because the first thing 1444 01:18:04,660 --> 01:18:08,660 that we see is that we have some problems with our data, which 1445 01:18:08,660 --> 01:18:14,420 is 'New York City' and 'New York City' with a trailing space-- 1446 01:18:14,420 --> 01:18:15,910 got to go in and correct all those. 1447 01:18:15,910 --> 01:18:17,610 So that's a classic example. 1448 01:18:17,610 --> 01:18:21,590 This is really what D4M-- it's the number one 1449 01:18:21,590 --> 01:18:23,732 thing that people do with D4M, is 1450 01:18:23,732 --> 01:18:25,190 it's the first time that you really 1451 01:18:25,190 --> 01:18:28,620 can do sums and tallies over the entire data. 1452 01:18:28,620 --> 01:18:30,400 And these things just pop out. 1453 01:18:30,400 --> 01:18:32,980 They stick out like a sore thumb. 1454 01:18:32,980 --> 01:18:34,321 Like, oh, got to correct that. 1455 01:18:34,321 --> 01:18:36,070 You can either correct it in the database, 1456 01:18:36,070 --> 01:18:38,110 or correct it afterwards in the query. 1457 01:18:38,110 --> 01:18:40,580 But that just immediately improves the quality 1458 01:18:40,580 --> 01:18:42,270 of everything else you have. 1459 01:18:42,270 --> 01:18:46,170 And then there's this clutter one, like location 'none'. 1460 01:18:46,170 --> 01:18:48,730 Well, clearly, you'd want to just get rid of that, 1461 01:18:48,730 --> 01:18:51,040 or do something with that, and then just 1462 01:18:51,040 --> 01:18:53,530 plain old normally spelled New York. 1463 01:18:53,530 --> 01:18:55,400 So most people can spell correctly. 1464 01:18:55,400 --> 01:18:57,290 And so we're very good. 1465 01:18:57,290 --> 01:18:59,520 But location, iPhone, what's that about? 1466 01:18:59,520 --> 01:19:05,420 Jersey City-- well, we don't care about that. 1467 01:19:05,420 --> 01:19:09,960 South Jersey-- well, what's-- South Jersey people don't know 1468 01:19:09,960 --> 01:19:12,840 that they're 40 miles from New York, I guess?
1469 01:19:12,840 --> 01:19:14,840 AUDIENCE: It's whatever they have. 1470 01:19:14,840 --> 01:19:16,180 JEREMY KEPNER: In their profile. 1471 01:19:16,180 --> 01:19:19,580 So a lot of people in South Jersey say they're 1472 01:19:19,580 --> 01:19:21,850 from New York. 1473 01:19:21,850 --> 01:19:29,070 So what's her name from Jersey Shore? 1474 01:19:29,070 --> 01:19:31,890 Snooki, right? Snooki says she's actually from New York 1475 01:19:31,890 --> 01:19:35,900 City, not South Jersey. 1476 01:19:35,900 --> 01:19:37,830 So there's a great example of that. 1477 01:19:37,830 --> 01:19:40,890 And then let's see here, figure 4. 1478 01:19:40,890 --> 01:19:44,410 So this just shows all the at signs. 1479 01:19:44,410 --> 01:19:50,290 So you see, basically, damnitstrue 1480 01:19:50,290 --> 01:19:55,310 is like the most-- is this like a retweeted thing or something? 1481 01:19:55,310 --> 01:19:56,150 I don't know. 1482 01:19:56,150 --> 01:19:58,870 What does the at sign mean again? 1483 01:19:58,870 --> 01:20:01,119 AUDIENCE: It's to another user. 1484 01:20:01,119 --> 01:20:02,160 JEREMY KEPNER: To a user. 1485 01:20:02,160 --> 01:20:05,070 So most people tweet to damnitstrue in New York. 1486 01:20:05,070 --> 01:20:06,320 There you go. 1487 01:20:06,320 --> 01:20:12,900 Funny fact, relatable quote, and then Donald Trump, 1488 01:20:12,900 --> 01:20:15,620 the real Donald Trump, and then just word at sign. 1489 01:20:15,620 --> 01:20:19,820 So those are another example-- another here 1490 01:20:19,820 --> 01:20:27,320 is a hilarious idiot, badgalv, a Marilyn Monroe ID. 1491 01:20:27,320 --> 01:20:32,569 So there you go, a lot of fun stuff there on Twitter. 1492 01:20:32,569 --> 01:20:35,110 But this is sort of-- he's going to establish this background, 1493 01:20:35,110 --> 01:20:37,193 and then go back and look at what happened during the hurricane. 1494 01:20:37,193 --> 01:20:42,790 So this is all basically the normal situation, very clearly. 1495 01:20:42,790 --> 01:20:47,160 And then the hashtags-- so we can look at the hashtags. 1496 01:20:47,160 --> 01:20:49,030 So what do we got here? 1497 01:20:49,030 --> 01:20:50,205 Favorite movie quotes. 1498 01:20:50,205 --> 01:20:51,997 AUDIENCE: Favorite movie quotes misspelled. 1499 01:20:51,997 --> 01:20:53,663 JEREMY KEPNER: And favorite movie quotes 1500 01:20:53,663 --> 01:20:55,310 misspelled right up there. 1501 01:20:55,310 --> 01:20:58,840 The Knicks, and then what I love the most, 1502 01:20:58,840 --> 01:21:02,880 and all this type-- team follow back. 1503 01:21:02,880 --> 01:21:06,070 I don't know, team auto-- no, what's this one? 1504 01:21:06,070 --> 01:21:06,670 What's TFB? 1505 01:21:09,260 --> 01:21:12,920 Maybe we don't want to know. 1506 01:21:12,920 --> 01:21:16,460 You can always look it up in Urban Dictionary. 1507 01:21:16,460 --> 01:21:17,500 It's a bad one? 1508 01:21:17,500 --> 01:21:20,100 All right, OK, good, we'll leave it at that, 1509 01:21:20,100 --> 01:21:21,270 won't add that to the video. 1510 01:21:24,870 --> 01:21:28,220 So continuing on here, let's see. 1511 01:21:32,580 --> 01:21:33,560 Well, you get the idea. 1512 01:21:33,560 --> 01:21:36,660 And so all these examples, they work in parallel too; 1513 01:21:36,660 --> 01:21:39,910 you get a lot of speedup, lots of interesting stuff like that. 1514 01:21:39,910 --> 01:21:42,310 But that's the very classic kind of thing you do. 1515 01:21:42,310 --> 01:21:43,670 You get data. 1516 01:21:43,670 --> 01:21:44,600 You parse it.
1517 01:21:44,600 --> 01:21:46,210 You maybe stick it in Matlab files 1518 01:21:46,210 --> 01:21:48,547 to do your initial sweep through it. 1519 01:21:48,547 --> 01:21:50,130 But then if it gets really, really big 1520 01:21:50,130 --> 01:21:52,129 and you want to do more detailed things, then you 1521 01:21:52,129 --> 01:21:54,200 insert it in the database, and can do queries there. 1522 01:21:54,200 --> 01:21:57,827 Leverage using your counts, so that you don't accidentally-- 1523 01:21:57,827 --> 01:22:00,035 you can imagine if we put all the tweets in the world 1524 01:22:00,035 --> 01:22:01,535 in and you had location, New York City. 1525 01:22:01,535 --> 01:22:03,160 And you looked at-- you had to say, give me 1526 01:22:03,160 --> 01:22:04,274 all this set of locations. 1527 01:22:04,274 --> 01:22:05,690 And one of them was New York City. 1528 01:22:05,690 --> 01:22:08,360 You'd be like, oh my God, I've just 1529 01:22:08,360 --> 01:22:11,120 done a query that's going to give me 5% of all the data 1530 01:22:11,120 --> 01:22:11,800 back. 1531 01:22:11,800 --> 01:22:14,044 That's going to just flood your system. 1532 01:22:14,044 --> 01:22:15,960 But if you can just do the count, and be like, 1533 01:22:15,960 --> 01:22:18,050 oh, New York City has got a million entries. 1534 01:22:18,050 --> 01:22:19,480 Don't touch that one. 1535 01:22:19,480 --> 01:22:24,570 Or put an iterator on that one so that I only 1536 01:22:24,570 --> 01:22:27,050 handle it in manageable chunks. 1537 01:22:27,050 --> 01:22:28,440 So I want to thank you. 1538 01:22:28,440 --> 01:22:30,810 So hopefully this was worth it. 1539 01:22:30,810 --> 01:22:32,530 We have one more class, which deals 1540 01:22:32,530 --> 01:22:34,490 with a little bit of wrapping up some 1541 01:22:34,490 --> 01:22:38,590 of the theory on this stuff, and some stuff on performance 1542 01:22:38,590 --> 01:22:39,510 metrics. 1543 01:22:39,510 --> 01:22:42,040 And then in two weeks, for those of you who signed up, 1544 01:22:42,040 --> 01:22:44,760 we have the Accumulo folks coming in showing you 1545 01:22:44,760 --> 01:22:49,130 how to run your own database all day 1546 01:22:49,130 --> 01:22:54,250 on just-- we're setting up a database for you guys 1547 01:22:54,250 --> 01:22:55,450 on LLGrid. 1548 01:22:55,450 --> 01:22:58,130 But you're definitely going to run into Accumulo 1549 01:22:58,130 --> 01:22:59,430 instances with your customers. 1550 01:22:59,430 --> 01:23:01,725 It's good to know some basics about that, 1551 01:23:01,725 --> 01:23:03,100 because a lot of times you're not 1552 01:23:03,100 --> 01:23:06,320 going to have all the nice stuff that we've provided. 1553 01:23:06,320 --> 01:23:08,770 And it's good to know how to set up your own Accumulo 1554 01:23:08,770 --> 01:23:10,460 and interact with that in the field. 1555 01:23:10,460 --> 01:23:12,950 So with that, that brings the lecture to an end. 1556 01:23:12,950 --> 01:23:16,940 And happy to stay for any questions, if anybody has them.