The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I want to thank you all for coming. As the holiday season comes along, I'm guessing that our people are getting distracted with other things, so it's a little bit more of an intimate setting today. I urge people to take advantage of that and ask questions. It's often a lot easier to do when there aren't as many people in the room, so I really want to encourage you to do that. So let's get right into this.

All right. Fortunately the mic doesn't actually pick that up at all for the most part, but you can probably all hear the drilling.

All right, so this is Lecture 4. I went a little out of order to get our special Halloween lecture in last time. But we're going to be talking a little bit more about the analysis of what we call structured data.
And we're doing more structured types of analyses on unstructured data. And of course this is signal processing on databases: this is where we take the ideas of detection theory and apply them to the kinds of data we see in databases, strings, and other types of things.

And so this lecture is going to get into somewhat more sophisticated analytics. I think up to this point we've done fairly simple things: basic queries using the technology, pretty simple correlations, pretty simple stuff. Now we're going to begin to get a little bit more sophisticated in some of the things we do. And again, I think as things get more complicated, the benefits of the D4M technology become even more apparent.

So we're just going to get right into this. I'm going to show you our very generic schema, and talk a little bit more about some of the issues that we encounter when we deal with particular databases. The database we're going to be using for this course is called Accumulo, and so I'll be talking about some issues that are more Accumulo specific.
So Accumulo is a triple store. And the way we read data into our triple store is using what we call this exploded transpose pair schema. So we have-- [INAUDIBLE]. We have an input data set that might look like a table like this, where maybe our row key is going to be somehow based on time, and then we have various columns here, which may or may not contain various data.

And the first thing we do is we basically explode it, because our triple stores can hold an arbitrary number of columns; they can add columns dynamically without any cost. We explode this schema by appending our Column 1 and its value together in this way. The value can then be anything, like a 1, anything that we wouldn't want to search on.

And then we have a row key here. I'll get back a little later into why we've done the row key in this way. By itself this doesn't give us any real advantage. But in Accumulo, or any other triple store, or just in D4M associative arrays, we can store the transpose of the table, which means that all these columns are now rows.
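A minimal sketch of the explode-and-transpose idea may help. The lecture does this with D4M associative arrays bound to Accumulo; here plain Python dicts stand in for the table and its transpose pair, and all names are illustrative:

```python
# Sketch (not D4M itself) of the exploded transpose pair schema using
# plain Python dicts: each dense record is exploded into
# (row, "col|val", "1") triples, and every triple is stored in both
# a forward table and its transpose.

def explode(row_key, record, sep="|"):
    """Turn a dense {col: val} record into exploded triples."""
    return [(row_key, f"{col}{sep}{val}", "1")
            for col, val in record.items() if val]

def insert(table, table_t, triples):
    for row, col, val in triples:
        table.setdefault(row, {})[col] = val    # fast row lookups
        table_t.setdefault(col, {})[row] = val  # fast column lookups

T, Tt = {}, {}
insert(T, Tt, explode("01.02.03-a", {"Col1": "aa", "Col2": "bb"}))
print(T["01.02.03-a"])   # all exploded columns in this row
print(Tt["Col1|aa"])     # all rows containing this column
```

Because every triple lands in both dicts, a lookup by row key and a lookup by exploded column are both single-key accesses, which is exactly the effect the transpose pair gives you in the database.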
And this particular database is a row-oriented database, which means it can look up any row key with constant access. So it can look up row keys very quickly. And in D4M a lot of this is hidden from you, so that whenever you do inserts, if you want it to happen, it will store the transpose for you. And so it looks like you have a giant table where you can look up any row key or any column very, very quickly, which makes it very easy to do very complicated analytics.

One little twist here that you may have noticed is that I have flipped this time field here, taking it from essentially little endian and making it big endian. I've always been an advocate of having row keys that have some meaning in them. Sometimes in databases they just create arbitrary random hashes. I think we have enough random data in the world that we don't necessarily need to create more, and so if we can have meaning in our row keys, I think it's very useful, because it makes it that much easier to debug data that actually has meaning.

And so I've often advocated for having the row key be a timelike key.
I think it's good to have timelike keys. And by itself there's nothing wrong with having a little endian row key, except when you go parallel. If this database is actually running on a parallel system, you can run into some issues. People have run into these issues, which is why I now advocate doing something else with the row key, more like this.

In particular, Accumulo, like many databases, when it goes parallel takes the tables and splits them up, and it splits them up by row keys. So it creates contiguous blocks of row keys on different processors. If you have a little endian time row key, it means every single insert will go to the same processor, and that will create a bottleneck. And then what will happen over time is it will migrate that data to the other processors. So it can be a real bottleneck if you have essentially a little endian row key. If you have a big endian row key, it will break up these things, and then when you insert your data, that will naturally cause it to spread out over all the systems.
That is, if your data is coming in in some kind of time order, which happens. We definitely see data coming in that way: today's data comes in today and tomorrow's data comes in tomorrow. You don't want all that data hitting just one processor or one compute node in your parallel database.

Other databases have more sophisticated distribution schemes they can use, sort of a round-robin or modulo type of thing: they'll create a hash that does a modulo so that it eliminates that type of hotspot. But for now we just recommend that people use this.

This does make it difficult to use the row key as your actual time value. And so what you would want to do is also have a column called time that has the actual time in it. And then you could directly look up a time in that way. So that's just a little nuance there, a good sort of design feature.
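To make the flip concrete, here is a small Python sketch. The field layout and separator are illustrative; the point is only that what the lecture calls the "little endian" key puts the most significant time unit first, so time-ordered inserts share a long prefix and all land on one split, while the flipped key leads with the fast-changing unit:

```python
# Flip a timestamp row key so that the fast-changing field comes first.
# A range-partitioned store (like Accumulo) assigns contiguous key
# ranges to processors, so keys sharing a long prefix hotspot one node.

def flip_time_key(key, sep="."):
    """'2014.11.05.10.30.59' -> '59.30.10.05.11.2014'"""
    return sep.join(reversed(key.split(sep)))

burst = ["2014.11.05.10.30.57",
         "2014.11.05.10.30.58",
         "2014.11.05.10.30.59"]
flipped = [flip_time_key(k) for k in burst]
# The originals share a long common prefix (same split); the flipped
# keys differ in their very first field, so they spread across splits.
print(flipped)
```

The trade-off the lecture mentions follows directly: the flipped key no longer sorts in time order, so you keep the actual time in a separate column for range queries.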
Probably something that you wouldn't run into for quite a while, but we've definitely had people say, hey, I went off and implemented it exactly the way you suggested, but now I'm going parallel and seeing this bottleneck. And I'm like, oh. So I'm going to try and correct that right now and tell people that this is the right way to do it.

So starting out simple, we're just going to talk about one of the simplest analytics you can do, which is getting basic statistical information from your data set. So if this is our data set, now we have a bunch of row keys here, timelike with some additional stuff to make them unique rows. And then we have these various columns here, some of which-- the gray ones, I'm saying-- are filled in, and the white ones are empty.

And so what I want to do is just grab a chunk of rows, and then I'm going to compute basically how many times each column appears, essentially a sum. I'll sum by type.
So I'll show you how we can compute just how many entries there were in column 1 or column 2; computing the covariance, and computing covariances by type and by pair, are all different things that we can do.

So I'm going to do all those statistics on the next slide. All those analytics, if you were to do them in another environment, would actually take a fair amount of code. But here we can do them, and each one of them is essentially a one-liner.

So here's my set of row keys. I've just created a list of row keys that have essentially a comma as the separator here. And this is our table T. In the notation we often just refer to table T. This is a binding to an Accumulo table, or it could be any table really, any database that we support.

And so this little bit of code here says, return me all the rows given by this list of rows here, and all the columns. We're using the MATLAB notation: colon means return all the columns. This will then return these results in the form of an associative array.
Now, since the values of that are strings, in this case maybe string values of 1, we have this little function here called dblLogi, which will basically say: ignore whatever the value is; if it's got an entry, give it a 1, and otherwise ignore it. So this is a shorthand. It basically does a logical, and then it does a double, so we can do math on it. So this is our query that gets us our rows and returns them as an associative array with numeric values.

We then compute the column counts. That's just the MATLAB sum command, where the argument tells you the dimension that's being compressed. So it's compressing the first dimension, the rows; it's basically collapsing it into a row vector. So it's just summing. That tells us, for all those rows, how many occurrences of each unique column, of each column type, there were.

And then we can get the covariance, the type-type covariance, by just doing A transpose A, or sqIn; these do the same thing. This one is usually slightly faster, but not a lot.
And so that just does the column type by column type covariance, very simple. And then finally we have this function, which essentially undoes our exploded schema. Let's say we wanted to return it back to the original dense format, where we have essentially four columns, column 1, 2, 3, and 4, and we want the value put back into the value position. So we have this function, col2type-- and I don't know if the name makes sense or not, but that's what we call it-- and it basically just says, oh, this is the delimiter. So it takes each one of those, splits it back out, and stuffs it back in.

And so then you now have that associative array. You can then do a sum again on that to get, say, just how many column 1 instances, column 2 instances, and column 3 instances there are. And likewise, just doing A transpose A or sqIn would then do the covariance of that, and that would tell you, of those higher-level types, how many there were. So this is a lot of high-level information.
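These one-liners have direct counterparts outside MATLAB/D4M. Here is a Python sketch of the same three analytics on a tiny exploded associative array; the dict-of-dicts layout and the sample data are illustrative, and the comments name the D4M operations each step mirrors (sum, A'*A via sqIn, col2type):

```python
# Python stand-in for the one-liner D4M analytics on an exploded
# associative array A (rows x exploded columns, numeric 0/1 values).
from collections import Counter
from itertools import product

A = {  # what dblLogi(T(rows, :)) would give: 1 wherever an entry exists
    "row1": {"Col1|aa": 1, "Col2|bb": 1},
    "row2": {"Col1|aa": 1, "Col3|cc": 1},
}

# sum(A, 1): collapse the row dimension -> count of each exploded column
col_counts = Counter(c for cols in A.values() for c in cols)

# A' * A (sqIn): column/column co-occurrence counts
cov = Counter()
for cols in A.values():
    for c1, c2 in product(cols, cols):
        cov[(c1, c2)] += 1

# col2type: fold "Col1|aa" back to the base column type "Col1"
type_counts = Counter(c.split("|")[0] for cols in A.values() for c in cols)

print(col_counts["Col1|aa"], cov[("Col1|aa", "Col3|cc")], type_counts["Col1"])
```

The off-diagonal entries of `cov` are exactly the "these two column types appear together" counts the lecture uses to spot bad data: a pair that should co-occur but has a zero count is an immediate red flag.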
I really highly recommend that people do this when they first get their data, because it uncovers a lot of bad data right from the get-go. You'll discover columns that are just like, wow, these two column types never appear together, and that doesn't make sense; there's something wrong there. Or, why is this exploded column so common, when that doesn't make sense? So again, it's very, very useful, and we always recommend people start with this analytic.

So again, that's very simple stuff; we've talked about it before. Let's talk about some more sophisticated types of analytics that we can do here.

So I'm going to build what I call a data graph. This is just a graph that is in the data. It may not actually be a real graph; you may have some other graph in mind, but this is what the data supports as a particular kind of graph.

So what we're going to do here is we're going to set a couple of starting columns here: C0, that'll be our set of starting columns. And we're going to set a set of allowed column types.
So we're going to be interested in certain column types. And we're going to create a set of clutter columns. These are columns that we want to ignore; they're either very, very large or whatever.

And so the basic algorithm is that we're going to get all the columns. Our result is going to be called C1 here, and that's going to be the columns of all rows containing C0 that are of type CT, excluding the clutter columns CL. So this is a rather complicated join type of thing that people often want to do. They want to say: look, I want to get all these rows, but I only care about these particular types of columns, and I want to expressly eliminate certain clutter columns that I know are just pointing everywhere.

So let's go look through the D4M code that does this sort of complicated join.

I'm going to specify my C0 here, and this could be a whole list, a whole list of starting columns. I'm going to specify my column types. In this case I'm going to have two column types.
I create a string using StartsWith, which essentially creates column ranges: one column range around column 1, and one column range around column 3. And then I'm going to specify a clutter column here, which is just this A, and again, it could be a whole list as well.

All right, so step one: I'm assuming that this table is bound to one of these exploded transpose pairs, so it will know, when I give it columns to look up, to point to the correct table. So we have a C0 here, and it will say, all right, please return all the data that contains C0, basically all those rows.

I'm then going to say, now, I don't care about their values, I just want the rows. So this is an associative array, and this command, Row, just says, give me those rows. So basically I had a column, I returned all the rows, but now I just want the row keys. I'm now going to take those row keys and pass them back into the row position. So now I will have gotten the entire row.
So basically I got a column, I took those rows, and now I'm going back in and getting the whole rows. So now I have every single row. And since I don't care about the actual values, I just want them to be numeric, I just use the dblLogi command. So basically I've done a rather complicated little piece of query here, in terms of: get me all rows that contain a certain column.

And so that's now an associative array. I'm then going to reduce to the specific allowed columns. So I'm going to pass in: please give me just the ones of these particular column types. I got the whole row, but now I just want to eliminate the others.

I could probably actually have put this in here, but whether it's more efficient to do it here or here, it's six of one, half a dozen of the other. I try to make my table queries either only columns or only rows; it tends to make things simpler.

So now we do that, and we have just those types. And now I want to eliminate the clutter columns. So I have A, which is just of these types.
And I want to then eliminate any of the columns that are in this, basically, column 1. So I had column 1 as one of my types, but I don't care about that one. And so I can just do subtraction: I can basically say, go get the clutter columns and subtract them from the data that I have. And now I'm just going to get those columns, and I have the set of columns; I now have C1.

So I've completed the analytic I just described in about four lines, a rather sophisticated analytic. I could then proceed to look for additional clutter. For instance, I could then query: please give me those, stick the C1 back in, and then sum it up and look for things that had a lot of values, and continue this process. That's just an example of something I might want to do when I have this data set. So these show you the kinds of more sophisticated analytics that you can do.

I want to talk a little bit about data graphs in terms of what are the things you can do here, what is supported, the topology of your data. Remember, this edge list has direction.
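Before going on to direction, the four-step analytic just completed (rows containing C0, keep allowed types CT, subtract clutter CL) can be sketched in Python. Plain dicts again stand in for the exploded transpose pair, and all table contents and names here are made up for illustration:

```python
# Sketch of the four-step data-graph analytic with plain dicts.

Tt = {  # transpose table: exploded column -> rows containing it
    "Col1|a": {"r1": 1, "r2": 1},
    "Col1|b": {"r1": 1},
    "Col3|x": {"r2": 1},
    "Col2|z": {"r1": 1},
}
T = {}  # forward table: row -> exploded columns
for col, rows_in in Tt.items():
    for r in rows_in:
        T.setdefault(r, {})[col] = 1

C0 = ["Col1|a"]            # starting columns
CT = ("Col1|", "Col3|")    # allowed column-type prefixes (StartsWith)
CL = {"Col1|a"}            # clutter columns to exclude

rows = {r for c in C0 for r in Tt.get(c, {})}   # rows containing C0
A = {r: {c: 1 for c in T[r] if c.startswith(CT)} for r in rows}
C1 = {c for cols in A.values() for c in cols} - CL
print(sorted(C1))
```

The set subtraction at the end plays the role of subtracting the clutter associative array in D4M, and feeding `C1` back in as the next `C0` gives you the iterated search described above.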
When you have graphs, they can very well have edges that have direction. And a lot of times people will say, well, I want to get the whole graph. And I'm like, well, if you're doing things that are essentially based on breadth-first search, you may not be able to get the full graph, because you're never going at it from the correct direction. So you are definitely limited by the natural topology of your data.

So for example here, let's say I start with C0 as column 1A. So I now have essentially this vertex; let's call it A. And now here I say, OK, give me all the rows that contain that. All right, so then these two guys pop up. So I get another A, and I get a B. So I got a B here; that's good.

And then I'm going to proceed, go down again. I'm like, all right, I'm going to now say, give me all the rows that contain those columns. And I go down again, and did I get a C? No, I never got a C. I never got a C in any one of these, even though it's all in the data and probably all connected. I never actually got a C.
There, I got the C. When I did it the second time, I got the C, so there we go. So this is an example of a series of breadth-first searches that result in getting the whole graph, but the graph had this topology and wouldn't naturally admit that right away. So certainly in this particular case the data and the queries were good for this; this is what we would call a star, because essentially it's a vertex with everything going into it.

Let's take a different graph. This is what we call a cycle. So we see we have a little cycle here going like this. We start again with A. We get our columns; we get C1s across here. And that's kind of the end of the game. We get A's, we get a B, but we're not going to get anything else when we come back up. We're not going to get anything else. And so we're basically stuck here at B; we weren't able to get to C or D.

So these are the kind of slightly more subtle things that everyone has to worry about. And once you see it, it's kind of, well, of course I'm not going to get the whole graph.
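A toy model can make the star-versus-cycle behavior concrete. This is a simplified sketch, not the actual slide data: each row is an edge, direction is encoded in made-up exploded column types like `out|A` and `in|A`, and a breadth-first step finds rows containing any current column and then collects every column of those rows:

```python
# Toy model of repeated breadth-first steps over an exploded edge table.
# A column such as "in|B" never matches the "out|B" column of the next
# edge, which is how edge direction limits what a search can reach.

def bfs_columns(table, start_cols, steps=4):
    """table: {row_key: set of exploded columns}. Returns columns found."""
    found = set(start_cols)
    for _ in range(steps):
        rows = {r for r, cols in table.items() if found & cols}
        found |= {c for r in rows for c in table[r]}
    return found

# A "star": every edge row touches the hub vertex A, so one query
# from A's column pulls in the whole graph.
star = {"e1": {"out|B", "in|A"}, "e2": {"out|C", "in|A"}}

# A "cycle" A->B->C->D->A: from A we reach B, then get stuck, because
# the column we got back ("in|B") is not the "out|B" of the next edge.
cycle = {"e1": {"out|A", "in|B"}, "e2": {"out|B", "in|C"},
         "e3": {"out|C", "in|D"}, "e4": {"out|D", "in|A"}}

print(sorted(bfs_columns(star, {"in|A"})))    # reaches B and C
print(sorted(bfs_columns(cycle, {"out|A"})))  # stuck after B
```

Running more steps does not help the cycle case; only querying the other column type (the other edge direction) would let the search continue around the loop.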
430 00:23:10,410 --> 00:23:12,430 But you'd be amazed how many teams were like, 431 00:23:12,430 --> 00:23:14,630 I wanted the whole graph, and I just can't do it. 432 00:23:14,630 --> 00:23:17,930 It's like, well, you don't have the edges going 433 00:23:17,930 --> 00:23:20,181 in the right direction for you to do that. 434 00:23:20,181 --> 00:23:21,680 You're going to have to think things 435 00:23:21,680 --> 00:23:23,410 through a little bit more. 436 00:23:23,410 --> 00:23:27,894 So I just want to-- it's kind of a little catch 437 00:23:27,894 --> 00:23:30,310 that I want to point out to people, because it's something 438 00:23:30,310 --> 00:23:31,351 that people can run into. 439 00:23:35,422 --> 00:23:37,630 We're going to do a little different type of analytic 440 00:23:37,630 --> 00:23:38,130 here. 441 00:23:38,130 --> 00:23:40,270 I've changed some of my columns here. 442 00:23:40,270 --> 00:23:41,232 I have some-- 443 00:23:41,232 --> 00:23:42,440 let's call these coordinates. 444 00:23:42,440 --> 00:23:43,856 I'm going to have now with my data 445 00:23:43,856 --> 00:23:47,680 set an x and a y-coordinate that I'm storing in different rows 446 00:23:47,680 --> 00:23:48,270 and columns. 447 00:23:48,270 --> 00:23:50,800 I want to do some kind of space windowing. 448 00:23:50,800 --> 00:23:55,752 I want to find all data within a particular x and y-coordinate. 449 00:23:58,140 --> 00:23:59,640 So what I'm going to do is I'm going 450 00:23:59,640 --> 00:24:03,340 to select a set of data here, a set of rows. 451 00:24:03,340 --> 00:24:07,140 And I'm going to give a space polygon. 452 00:24:07,140 --> 00:24:11,310 And I'm going to query, get the data. 453 00:24:11,310 --> 00:24:13,930 And then I'm going to extract the space coordinates 454 00:24:13,930 --> 00:24:16,320 from the values there, and I'm gonna 455 00:24:16,320 --> 00:24:23,240 return all columns that are within my space window here. 
456 00:24:23,240 --> 00:24:25,375 And again, this is good for finding columns 457 00:24:25,375 --> 00:24:26,583 within your space window. 458 00:24:29,230 --> 00:24:31,930 If you're concerned that you're going 459 00:24:31,930 --> 00:24:36,150 to be getting an awful lot of extra data-- if you have, 460 00:24:36,150 --> 00:24:38,620 let's say you have a coordinate that goes through New York. 461 00:24:38,620 --> 00:24:40,334 And you're concerned that's just going 462 00:24:40,334 --> 00:24:42,500 to-- you don't want New York, 463 00:24:42,500 --> 00:24:44,450 but you happen to be on the same latitude 464 00:24:44,450 --> 00:24:46,340 and longitude as New York. 465 00:24:46,340 --> 00:24:50,670 You can do something called Mortonization, which basically 466 00:24:50,670 --> 00:24:53,200 is essentially imagine taking your strings 467 00:24:53,200 --> 00:24:56,970 of your coordinates and interleaving them. 468 00:24:56,970 --> 00:25:02,840 And now you've essentially created an ASCII-based grid 469 00:25:02,840 --> 00:25:05,240 of the entire planet. 470 00:25:05,240 --> 00:25:08,980 And so that's a way of, if you want to quickly filter down, 471 00:25:08,980 --> 00:25:11,000 you can get a box and then go back 472 00:25:11,000 --> 00:25:13,200 and do the detailed coordinates, to prevent yourself 473 00:25:13,200 --> 00:25:14,780 from having to do that over everything. 474 00:25:14,780 --> 00:25:16,720 So that's a standard trick. 475 00:25:16,720 --> 00:25:18,960 And there's a variety of Mortonization schemes 476 00:25:18,960 --> 00:25:22,570 that people use for interleaving coordinates in this way. 477 00:25:22,570 --> 00:25:25,574 I think Google Earth has a standard box now as well. 478 00:25:25,574 --> 00:25:27,740 I find this the simplest, because you literally just 479 00:25:27,740 --> 00:25:31,444 take the two strings and interleave them. 
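A minimal sketch of that interleaving idea, in Python rather than D4M. The digit strings are hypothetical coordinates, and note that production Morton schemes usually interleave bits rather than characters:

```python
from itertools import zip_longest

def morton_interleave(x, y, pad="0"):
    """Interleave two coordinate strings character by character,
    x first: x0 y0 x1 y1 ...  Padding the shorter string with a fill
    character is what lets the two precisions vary independently."""
    return "".join(a + b for a, b in zip_longest(x, y, fillvalue=pad))

# Hypothetical digit strings for a latitude and a longitude.
print(morton_interleave("4071", "7400"))  # "47047010"
```

Because nearby points share leading digits, they also share a key prefix after interleaving, which is exactly what turns the coarse bounding-box filter into a cheap range query on the sorted keys.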
480 00:25:31,444 --> 00:25:33,360 And if you have a space, you can actually then 481 00:25:33,360 --> 00:25:35,950 do it with variable precision, because if you just, like, 482 00:25:35,950 --> 00:25:37,360 leave a space-- I don't know. 483 00:25:37,360 --> 00:25:39,290 And it all kind of works out pretty nicely. 484 00:25:39,290 --> 00:25:41,510 And you can read the coordinate right there. 485 00:25:41,510 --> 00:25:42,960 Like, the first one is the first coordinate. 486 00:25:42,960 --> 00:25:45,418 And you can even include the plus and minus signs for lat 487 00:25:45,418 --> 00:25:46,380 and lon if you wanted to. 488 00:25:46,380 --> 00:25:51,831 So maybe not the most efficient scheme, but one that's human 489 00:25:51,831 --> 00:25:52,330 readable. 490 00:25:52,330 --> 00:25:57,670 And I'm a big advocate of human readable types of data schemes. 491 00:25:57,670 --> 00:26:00,180 All right, so let's actually do that analytic. 492 00:26:00,180 --> 00:26:05,340 So again, I created my-- selected my row, 493 00:26:05,340 --> 00:26:10,480 got my x and y-coordinates in those rows, 494 00:26:10,480 --> 00:26:15,540 and then figured out which columns they were that 495 00:26:15,540 --> 00:26:17,990 satisfied that. 496 00:26:17,990 --> 00:26:19,360 So let's do that now in code. 497 00:26:22,480 --> 00:26:24,750 Let's see here. 498 00:26:24,750 --> 00:26:37,960 So we have the-- all right, so we have, 499 00:26:37,960 --> 00:26:39,860 in this case I gave it a row range. 500 00:26:39,860 --> 00:26:42,890 So you can do-- this is what a range query looks like. 501 00:26:42,890 --> 00:26:48,470 If you give either an associative array or a table 502 00:26:48,470 --> 00:26:55,040 something that is a triple, essentially that's a string, 503 00:26:55,040 --> 00:26:57,030 colon, and another string, it will 504 00:26:57,030 --> 00:26:59,030 treat that as a range query. 
505 00:26:59,030 --> 00:27:00,770 And we actually support doing-- if you 506 00:27:00,770 --> 00:27:03,310 have multiple sets of triples, it 507 00:27:03,310 --> 00:27:07,220 should handle that, which is good. 508 00:27:07,220 --> 00:27:13,380 I'm going to specify my bounding box here, essentially a box. 509 00:27:13,380 --> 00:27:16,990 And I happen to do it with complex numbers 510 00:27:16,990 --> 00:27:20,310 just for fun, just because complex numbers are a nice way 511 00:27:20,310 --> 00:27:25,700 to store coordinates on a two dimensional plane. 512 00:27:25,700 --> 00:27:28,010 And Matlab supports them very nicely. 513 00:27:28,010 --> 00:27:30,550 Complex numbers are our friends, so there you go. 514 00:27:30,550 --> 00:27:35,030 But I could have just as easily had a set of x and y vectors. 515 00:27:35,030 --> 00:27:36,920 So I'm going to get all the rows. 516 00:27:36,920 --> 00:27:38,200 So I query that. 517 00:27:38,200 --> 00:27:40,810 Very good, we have that. 518 00:27:40,810 --> 00:27:43,360 And then, so that just gives me that set 519 00:27:43,360 --> 00:27:45,760 of data, all those rows. 520 00:27:45,760 --> 00:27:52,470 I then use my StartsWith with my x and y to get just the columns 521 00:27:52,470 --> 00:27:54,820 of those x's and y's. 522 00:27:54,820 --> 00:27:58,690 And I'm now kind of going to convert those exploded back 523 00:27:58,690 --> 00:28:03,220 into a regular table with this col2type function. 524 00:28:03,220 --> 00:28:06,080 So that basically takes those values. 525 00:28:06,080 --> 00:28:08,640 So it takes those coordinate values, so like we saw. 526 00:28:08,640 --> 00:28:12,930 It takes these coordinate values here like the 0 1, 527 00:28:12,930 --> 00:28:15,205 and it puts it back into the value position. 528 00:28:20,270 --> 00:28:21,810 So now I have, though, it will still 529 00:28:21,810 --> 00:28:23,979 be a string in the value position, which 530 00:28:23,979 --> 00:28:25,520 our associative arrays can handle fine. 
531 00:28:25,520 --> 00:28:28,030 But I now want to really treat it like a number. 532 00:28:28,030 --> 00:28:31,200 So we just have overloaded the standard Matlab str2num 533 00:28:31,200 --> 00:28:34,760 function, which will convert those strings 534 00:28:34,760 --> 00:28:36,090 and will store them back. 535 00:28:36,090 --> 00:28:39,440 You now have an associative array with numbers in it. 536 00:28:39,440 --> 00:28:41,850 So we call this Axy. 537 00:28:41,850 --> 00:28:45,610 And now we can do something. 538 00:28:45,610 --> 00:28:48,252 We basically can extract the x values here. 539 00:28:48,252 --> 00:28:49,710 So we have Axy, and say, all right, 540 00:28:49,710 --> 00:28:54,050 give me the x column, and then Axy and give me the y column. 541 00:28:54,050 --> 00:28:59,300 And Matlab has a built in function called inpolygon, 542 00:28:59,300 --> 00:29:00,920 to which you give a polygon. 543 00:29:00,920 --> 00:29:03,250 So I give it the real and the imaginary parts 544 00:29:03,250 --> 00:29:06,570 of my polygon here S and the x and y-coordinates. 545 00:29:06,570 --> 00:29:10,300 And it will return essentially the value 546 00:29:10,300 --> 00:29:16,240 of whether each point is in there, which is great, 547 00:29:16,240 --> 00:29:19,380 because there are many dissertations written 548 00:29:19,380 --> 00:29:21,070 on the point-in-polygon problem. 549 00:29:21,070 --> 00:29:23,710 And it's nice that we have a nice built in Matlab function 550 00:29:23,710 --> 00:29:25,680 to do that. 551 00:29:25,680 --> 00:29:28,400 And then now I have that, and I can just pass that back 552 00:29:28,400 --> 00:29:31,550 into the original A. So I do find. 553 00:29:31,550 --> 00:29:33,910 This actually returns a logical of zeros and ones. 554 00:29:33,910 --> 00:29:36,970 If I do find, then that will return a set of indices. 555 00:29:36,970 --> 00:29:39,165 And I just pass those indices back into A. 
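For readers without Matlab's inpolygon at hand, the windowing step can be sketched with a standalone even-odd ray-casting test. The box and the point list here are hypothetical, not the lecture's data:

```python
def in_polygon(px, py, poly):
    """Ray-casting point-in-polygon test (even-odd rule), a rough
    analogue of Matlab's inpolygon. `poly` is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            # x-coordinate where this edge crosses the horizontal ray at py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

box = [(0, 0), (4, 0), (4, 4), (0, 4)]        # hypothetical space window
points = [(1, 1), (5, 2), (3, 3.5), (-1, 0)]  # hypothetical extracted x/y pairs
kept = [p for p in points if in_polygon(*p, box)]
print(kept)  # [(1, 1), (3, 3.5)]
```

The list of booleans plays the role of the logical vector in the lecture: indices where it is true are the ones passed back into the original array A.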
556 00:29:39,165 --> 00:29:41,690 And then I can get those columns and there 557 00:29:41,690 --> 00:29:45,990 we go, all very standard Matlab-like syntax. 558 00:29:45,990 --> 00:29:48,550 Again, this is a fairly complicated analytic. 559 00:29:48,550 --> 00:29:51,010 If you were doing this using other technologies, 560 00:29:51,010 --> 00:29:52,950 I mean, I'm sure all of us would be 561 00:29:52,950 --> 00:29:54,200 writing a fair amount of code. 562 00:29:54,200 --> 00:29:55,230 And this is just the kind of thing 563 00:29:55,230 --> 00:29:56,646 that we can do very easily in D4M. 564 00:30:02,290 --> 00:30:04,910 Another analytic, which is probably a bit of a stretch, 565 00:30:04,910 --> 00:30:08,710 but I was just having fun here, is doing convolution 566 00:30:08,710 --> 00:30:12,820 on strings, which is a little odd. 567 00:30:12,820 --> 00:30:15,530 But I gave it a whirl. 568 00:30:15,530 --> 00:30:20,410 So what we want to do is we want to convolve some of our data 569 00:30:20,410 --> 00:30:21,040 with a filter. 570 00:30:21,040 --> 00:30:24,050 I mean, convolving with filters is a standard type of thing. 571 00:30:24,050 --> 00:30:28,300 It's the standard way we do detection here. 572 00:30:28,300 --> 00:30:34,650 And so the way we do that is once again, 573 00:30:34,650 --> 00:30:37,940 I give a list of rows that I want here. 574 00:30:37,940 --> 00:30:40,270 I'm going to create a filter, which is just essentially 575 00:30:40,270 --> 00:30:46,950 a 4-wide box. 576 00:30:46,950 --> 00:30:47,870 So I get my rows. 577 00:30:50,580 --> 00:30:59,500 And then I convert them to numeric. 578 00:30:59,500 --> 00:31:05,162 And I'm going to do my convolution on the x columns. 579 00:31:08,360 --> 00:31:09,730 So let's see here. 580 00:31:09,730 --> 00:31:11,410 So I'm going to get these. 581 00:31:11,410 --> 00:31:13,398 I'm basically getting all the x-coordinates. 
582 00:31:16,270 --> 00:31:21,780 I'm going to sum all of those, so I basically now 583 00:31:21,780 --> 00:31:24,450 have all those. 584 00:31:24,450 --> 00:31:28,840 And now I'm going to pop those back into their values. 585 00:31:28,840 --> 00:31:30,710 And now I can do a convolution. 586 00:31:30,710 --> 00:31:36,890 And this convolution works if one of the axes 587 00:31:36,890 --> 00:31:40,160 is sort of like an integer sequence type of thing. 588 00:31:40,160 --> 00:31:44,310 So you can do-- it tries to extend that naturally. 589 00:31:44,310 --> 00:31:47,890 So something to play around with if you want to do convolutions. 590 00:31:47,890 --> 00:31:51,590 We sort of support it. 591 00:31:51,590 --> 00:31:53,870 And I'm sure if any of you do play around with it, 592 00:31:53,870 --> 00:31:55,700 we would be glad to hear your experiences, 593 00:31:55,700 --> 00:31:58,150 think about how we should extend it. 594 00:31:58,150 --> 00:32:00,690 So these are all sort of basic standard first order 595 00:32:00,690 --> 00:32:03,570 statistical analytics that one can do on data sets. 596 00:32:03,570 --> 00:32:05,460 And we can support them very, very well. 597 00:32:05,460 --> 00:32:07,770 Let's do some more complicated, what I would call second order, 598 00:32:07,770 --> 00:32:08,270 analytics. 599 00:32:12,870 --> 00:32:15,260 So I'm going to do something called-- 600 00:32:15,260 --> 00:32:18,480 it's a complicated join essentially-- 601 00:32:18,480 --> 00:32:20,810 what I call a type pair. 602 00:32:20,810 --> 00:32:23,810 So what I want to do here is I want 603 00:32:23,810 --> 00:32:29,840 to find all rows that contain values-- 604 00:32:29,840 --> 00:32:34,370 I want to find rows that have both a value of type 1 605 00:32:34,370 --> 00:32:36,050 and of type 2. 606 00:32:36,050 --> 00:32:40,290 So I'm going to specify this to be, basically x to be type 1 607 00:32:40,290 --> 00:32:41,950 and y to be type 2. 
608 00:32:41,950 --> 00:32:45,400 And I want to find all data that has 609 00:32:45,400 --> 00:32:47,480 entries in both those very, very standard type 610 00:32:47,480 --> 00:32:48,700 of join type of thing. 611 00:32:51,590 --> 00:32:54,350 And this is done a little bit more complicated 612 00:32:54,350 --> 00:32:56,332 than we need it to be just to show you kind 613 00:32:56,332 --> 00:32:57,540 of some of the richness here. 614 00:32:57,540 --> 00:32:59,750 You can kind of take a fork in any way. 615 00:32:59,750 --> 00:33:02,330 We could probably do this whole thing in about two lines, 616 00:33:02,330 --> 00:33:06,550 but I'm kind of showing you some additional features of D4M 617 00:33:06,550 --> 00:33:09,140 in the spirit of this analytic. 618 00:33:09,140 --> 00:33:11,910 So again, I'm just going to use a range query here. 619 00:33:11,910 --> 00:33:13,680 So I have this range. 620 00:33:13,680 --> 00:33:16,940 I'm going to have my type 1 be starts with x, and my type 2 621 00:33:16,940 --> 00:33:18,760 be starts with y. 622 00:33:18,760 --> 00:33:20,050 So I do my query. 623 00:33:20,050 --> 00:33:23,610 I convert all the string 1's to numeric 1's. 624 00:33:23,610 --> 00:33:25,660 And then what I'm going to do is I'm 625 00:33:25,660 --> 00:33:30,600 going to basically, all right, get me all the columns of type 626 00:33:30,600 --> 00:33:39,490 1, sum them all together, find everything that equals-- 627 00:33:39,490 --> 00:33:42,600 and I only care about the 1's that are exactly equal 1. 628 00:33:42,600 --> 00:33:45,910 So like if I had two x's, I'm like no. 629 00:33:45,910 --> 00:33:46,760 I don't want those. 630 00:33:46,760 --> 00:33:52,220 I want exactly one x in this row. 631 00:33:58,830 --> 00:34:03,240 And then I'm going to take those rows that have exactly one x. 632 00:34:03,240 --> 00:34:05,350 I'm going to pass them back into A. 633 00:34:05,350 --> 00:34:10,469 So I now just get the rows that have exactly one x. 
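The exactly-one-of-each filtering can be sketched in plain Python. The row dictionary below is a hypothetical stand-in for the exploded table, with column names like "x|3" standing for type-1 entries and "y|7" for type-2:

```python
# Hypothetical {row_key: [exploded column names]} view of the table.
rows = {
    "r1": ["x|3", "y|7"],
    "r2": ["x|1", "x|2", "y|5"],   # two x's: rejected
    "r3": ["x|4"],                 # no y: rejected
    "r4": ["x|9", "y|2"],
}

def type_pairs(rows):
    """Keep rows with exactly one type-1 ('x|...') and exactly one
    type-2 ('y|...') column, mirroring the sum-then-filter steps
    described in the lecture."""
    keep = {}
    for r, cols in rows.items():
        xs = [c for c in cols if c.startswith("x|")]
        ys = [c for c in cols if c.startswith("y|")]
        if len(xs) == 1 and len(ys) == 1:  # x count is 1 and total is 2
            keep[r] = (xs[0], ys[0])
    return keep

print(type_pairs(rows))  # {'r1': ('x|3', 'y|7'), 'r4': ('x|9', 'y|2')}
```

The two `len(...) == 1` checks play the same role as the "sum equals exactly 1" and then "sum equals exactly 2" filters in the D4M version.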
634 00:34:10,469 --> 00:34:13,840 I'm going to filter it again with ct1 and ct2, 635 00:34:13,840 --> 00:34:17,400 although I don't need to do that. 636 00:34:17,400 --> 00:34:20,030 Then I'm going to sum it again. 637 00:34:20,030 --> 00:34:22,330 And now I'm going to say, all right, 638 00:34:22,330 --> 00:34:24,380 give me only the ones that are exactly 2. 639 00:34:24,380 --> 00:34:26,650 So I know my x is exactly 1. 640 00:34:26,650 --> 00:34:30,320 So in order for it to be exactly 2, that means my y also 641 00:34:30,320 --> 00:34:32,580 had to have only exactly one entry in it. 642 00:34:32,580 --> 00:34:35,719 So now I have the data that just has 643 00:34:35,719 --> 00:34:38,162 exactly one of each of those. 644 00:34:43,050 --> 00:34:45,670 Now I want to create sort of like a cross-correlation pair 645 00:34:45,670 --> 00:34:48,520 mapping of this. 646 00:34:48,520 --> 00:34:51,656 So I'm actually going to look for x's across columns-- 647 00:34:51,656 --> 00:34:53,280 say, I want to look for x's 648 00:34:53,280 --> 00:34:57,980 that appear with more than one y or a y that 649 00:34:57,980 --> 00:35:00,220 appears with more than one x. 650 00:35:00,220 --> 00:35:02,510 So there's a variety of ways to do that. 651 00:35:02,510 --> 00:35:06,200 Here what I'm doing is-- so I have gotten the rows of A that 652 00:35:06,200 --> 00:35:08,800 have exactly one x and y. 653 00:35:08,800 --> 00:35:13,810 I now pass that back again to get my C, to get the x's again. 654 00:35:13,810 --> 00:35:17,400 And one of the things that we've done that's kind of nice 655 00:35:17,400 --> 00:35:21,190 is we've overloaded the syntax of this query 656 00:35:21,190 --> 00:35:24,030 on our associative arrays and on our table queries 657 00:35:24,030 --> 00:35:26,350 such that if it only has one output argument, 658 00:35:26,350 --> 00:35:28,490 it will return an associative array. 
659 00:35:28,490 --> 00:35:30,650 But if you give it three output arguments, 660 00:35:30,650 --> 00:35:35,067 it will return the triple, so in this case, the row, the column, 661 00:35:35,067 --> 00:35:35,650 and the value. 662 00:35:35,650 --> 00:35:38,060 Now I don't care about the row and the value, 663 00:35:38,060 --> 00:35:39,660 I just care about the column. 664 00:35:39,660 --> 00:35:40,640 But that's a nice way. 665 00:35:40,640 --> 00:35:42,262 We're often going to, in certain cases, 666 00:35:42,262 --> 00:35:43,720 want to bump back and forth between 667 00:35:43,720 --> 00:35:49,336 the triples representation and the associative array 668 00:35:49,336 --> 00:35:49,960 implementation. 669 00:35:49,960 --> 00:35:52,060 Now you can always use the find command 670 00:35:52,060 --> 00:35:56,480 around any associative array, just as you can on normal Matlab 671 00:35:56,480 --> 00:35:59,100 matrices, to return the triples. 672 00:35:59,100 --> 00:36:01,740 The advantage of doing it here is 673 00:36:01,740 --> 00:36:07,220 that it's faster, because what we actually 674 00:36:07,220 --> 00:36:10,020 do when we do the query internally 675 00:36:10,020 --> 00:36:12,940 is we actually get the triples and then convert 676 00:36:12,940 --> 00:36:14,180 to an associative array. 677 00:36:14,180 --> 00:36:15,500 And if you just say I want the triples, 678 00:36:15,500 --> 00:36:17,625 we can just shortcut that and give you the triples 679 00:36:17,625 --> 00:36:18,250 right away. 680 00:36:18,250 --> 00:36:20,250 So sometimes if you're dealing with very large 681 00:36:20,250 --> 00:36:24,960 associative arrays or some operation where you 682 00:36:24,960 --> 00:36:28,080 just want to get some more performance back-- 
683 00:36:28,080 --> 00:36:31,350 Especially if you're like, well, I only care about one thing, 684 00:36:31,350 --> 00:36:33,360 I don't care about all of the values, 685 00:36:33,360 --> 00:36:35,600 it doesn't really need to be a full associative array, 686 00:36:35,600 --> 00:36:39,000 then that's a great way to sort of short circuit that. 687 00:36:39,000 --> 00:36:39,890 So we do that here. 688 00:36:39,890 --> 00:36:42,620 And now we can construct a new associative array, 689 00:36:42,620 --> 00:36:47,000 which is just taking the x's and the y's and creating 690 00:36:47,000 --> 00:36:50,140 a new associative array with those. 691 00:36:50,140 --> 00:36:54,210 And that just shows me the correlations between 692 00:36:54,210 --> 00:36:55,590 the x's and the y's. 693 00:36:55,590 --> 00:37:01,110 And I can then find, in ct, basically x's that have more 694 00:37:01,110 --> 00:37:04,070 than one y-- so I've just summed them there-- 695 00:37:04,070 --> 00:37:06,430 or y's with more than one x. 696 00:37:06,430 --> 00:37:09,600 Again, these are very similar to analytics 697 00:37:09,600 --> 00:37:10,780 that people want to do. 698 00:37:10,780 --> 00:37:12,310 And again, very simple to do. 699 00:37:12,310 --> 00:37:15,740 And again, showing you some of the different types of syntax 700 00:37:15,740 --> 00:37:18,120 that are available to you in D4M. 701 00:37:18,120 --> 00:37:21,340 Again, if you're used to using Matlab, these types of tricks 702 00:37:21,340 --> 00:37:22,770 are very natural. 703 00:37:22,770 --> 00:37:26,210 We're just showing you that they also exist within D4M. 704 00:37:30,380 --> 00:37:32,220 So here's another one. 705 00:37:32,220 --> 00:37:37,310 So I want to find a column pair set C1 and C2, 706 00:37:37,310 --> 00:37:42,090 get all columns C1 and C2, and find the rows that have 707 00:37:42,090 --> 00:37:43,950 just one entry in C1 and C2. 
708 00:37:46,570 --> 00:37:48,620 And it basically checks to see if data pairs are 709 00:37:48,620 --> 00:37:50,110 present in the same row. 710 00:37:50,110 --> 00:37:52,497 Again, something that people often want to do. 711 00:37:52,497 --> 00:37:54,080 You've got a complicated type of join. 712 00:37:54,080 --> 00:37:58,440 So here we have a set of columns C1, a set of columns C2. 713 00:38:01,830 --> 00:38:05,150 I want to create this-- I want to sort of interleave these 714 00:38:05,150 --> 00:38:08,270 together into a pair. 715 00:38:08,270 --> 00:38:10,240 So I want to create some concept of a pair. 716 00:38:10,240 --> 00:38:12,170 And so we have this function here 717 00:38:12,170 --> 00:38:14,530 called CatStr, which basically 718 00:38:14,530 --> 00:38:18,380 will take two strings and another delimiter, 719 00:38:18,380 --> 00:38:21,000 and basically, if they are of the same number of strings, 720 00:38:21,000 --> 00:38:22,950 will just glue them together. 721 00:38:22,950 --> 00:38:24,700 If one of these is just a single string, 722 00:38:24,700 --> 00:38:27,900 it will just essentially prepend or append that. 723 00:38:27,900 --> 00:38:30,120 So for instance, if you are wondering how we actually 724 00:38:30,120 --> 00:38:34,250 create these exploded values like Col1|b, 725 00:38:34,250 --> 00:38:36,290 that's just basically using this function here. 726 00:38:36,290 --> 00:38:39,110 We get the values, we get the columns, 727 00:38:39,110 --> 00:38:41,560 we put essentially the pipe thing in the middle, 728 00:38:41,560 --> 00:38:43,530 and it just merges them together. 729 00:38:43,530 --> 00:38:46,010 So we now sort of interleave these two together. 730 00:38:46,010 --> 00:38:51,100 So we'll now have something like Col1|b comma Col3|b, 731 00:38:51,100 --> 00:38:55,500 with comma as the separator. 732 00:38:55,500 --> 00:38:57,540 So now I can create a set of pair mappings 733 00:38:57,540 --> 00:39:00,020 from C1 to its pairs. 
734 00:39:00,020 --> 00:39:04,090 OK, that's A1 to its pairs and A2 to its pairs. 735 00:39:04,090 --> 00:39:07,830 I can get the columns of those A1 and A2. 736 00:39:07,830 --> 00:39:10,040 And then I can find all the pairs by essentially 737 00:39:10,040 --> 00:39:14,010 going through this combination of matrix multiplies and additions 738 00:39:14,010 --> 00:39:15,010 and so on. 739 00:39:15,010 --> 00:39:18,741 So a very sort of complicated analytic done very nicely. 740 00:39:18,741 --> 00:39:20,740 And then there's a whole bunch of different ones 741 00:39:20,740 --> 00:39:21,650 you can do here. 742 00:39:21,650 --> 00:39:23,687 These are almost semantic extensions. 743 00:39:23,687 --> 00:39:25,770 The columns may have several different types, 744 00:39:25,770 --> 00:39:26,940 and you want to act on that. 745 00:39:26,940 --> 00:39:29,930 So for instance, if I have a pair of columns here, 746 00:39:29,930 --> 00:39:33,260 column 1 and column 3, I could say, well, 747 00:39:33,260 --> 00:39:34,800 that also implies this: 748 00:39:34,800 --> 00:39:36,740 column 3 equals column 1. 749 00:39:36,740 --> 00:39:39,530 That's one kind of sort of pair reversal type of thing. 750 00:39:39,530 --> 00:39:40,550 You'll have extensions. 751 00:39:40,550 --> 00:39:42,133 You might say, look, if I have a column 752 00:39:42,133 --> 00:39:44,210 1A, that also implies that really there 753 00:39:44,210 --> 00:39:47,220 should also be a column 2A, and other types 754 00:39:47,220 --> 00:39:48,130 of things like that. 755 00:39:48,130 --> 00:39:49,505 So these are just types of things 756 00:39:49,505 --> 00:39:50,770 that people do with pairs. 757 00:39:50,770 --> 00:39:53,990 They're often very useful. 758 00:39:53,990 --> 00:39:56,610 And I think that basically brings us 759 00:39:56,610 --> 00:40:01,060 to the end of the lecture portion of class. 
760 00:40:01,060 --> 00:40:02,670 So again, just the exploded schema 761 00:40:02,670 --> 00:40:05,820 really allows you to do this very rapidly with your data. 762 00:40:05,820 --> 00:40:10,530 And you can implement very efficient graph analytics 763 00:40:10,530 --> 00:40:13,670 as a sequence of essentially row and column queries, 764 00:40:13,670 --> 00:40:21,250 because we use this very special exploded transpose pair schema. 765 00:40:21,250 --> 00:40:24,860 And increasingly as you become more and more skilled 766 00:40:24,860 --> 00:40:27,240 with this, you will discover that many, many, many 767 00:40:27,240 --> 00:40:31,400 of your analytics really reduce to matrix-matrix multiplies. 768 00:40:31,400 --> 00:40:33,910 That matrix-matrix multiply really 769 00:40:33,910 --> 00:40:38,030 captures sort of all the correlation 770 00:40:38,030 --> 00:40:43,480 that you want to do without having to kind of figure things 771 00:40:43,480 --> 00:40:45,020 out. 772 00:40:45,020 --> 00:40:47,030 All right, so I'm now going to go and show some-- 773 00:40:47,030 --> 00:40:49,260 not these specific analytics, but some analytics 774 00:40:49,260 --> 00:40:52,290 that are more sophisticated based on the Reuters data set. 775 00:40:52,290 --> 00:40:54,720 If you remember a few weeks ago, we 776 00:40:54,720 --> 00:40:58,490 worked with the Reuters data set. 777 00:40:58,490 --> 00:41:00,980 And so let's see here. 778 00:41:00,980 --> 00:41:05,120 So we already did the entity analysis application 779 00:41:05,120 --> 00:41:05,790 a few weeks ago. 780 00:41:05,790 --> 00:41:07,940 I'm going to now do basically what 781 00:41:07,940 --> 00:41:09,430 happens when you construct tracks, 782 00:41:09,430 --> 00:41:12,470 which is a more sophisticated structured analytic. 783 00:41:12,470 --> 00:41:17,690 And the assignment I'll send out is basically 784 00:41:17,690 --> 00:41:19,660 doing more cross correlations. 
785 00:41:19,660 --> 00:41:22,626 For those of you who have kept it going here this far 786 00:41:22,626 --> 00:41:24,000 and continue to do the homeworks, 787 00:41:24,000 --> 00:41:25,830 I'll send this homework out to you, 788 00:41:25,830 --> 00:41:27,920 which is basically just cross correlating 789 00:41:27,920 --> 00:41:29,200 the data sets that you have. 790 00:41:29,200 --> 00:41:33,720 Again, it's not an assignment that really 791 00:41:33,720 --> 00:41:36,930 requires you having done the previous assignments. 792 00:41:36,930 --> 00:41:40,530 Just take any data set, pull it into an associative array, 793 00:41:40,530 --> 00:41:43,960 and then do matrix multiplies to figure out 794 00:41:43,960 --> 00:41:46,719 the cross correlations and what they mean. 795 00:41:46,719 --> 00:41:48,260 All right, so with that, why don't we 796 00:41:48,260 --> 00:41:50,530 take a short five minute break. 797 00:41:50,530 --> 00:41:53,450 And then I'll come back and show you the demo.
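The cross-correlation pattern the homework asks for, multiplying a 0/1 incidence array against its transpose, can be sketched in plain Python with a sparse co-occurrence count. The document and entity names below are made up for illustration:

```python
# Hypothetical 0/1 incidence "matrix": rows are documents,
# columns are the entities each document mentions.
A = {
    "doc1": {"alice", "bob"},
    "doc2": {"alice", "carol"},
    "doc3": {"alice", "bob"},
}

def cross_correlate(A):
    """Sparse analogue of the transpose-times-A matrix-matrix multiply:
    counts how many rows each pair of columns co-occurs in.  The diagonal
    entries (c, c) are just each column's total row count."""
    counts = {}
    for cols in A.values():
        for c1 in cols:
            for c2 in cols:
                counts[(c1, c2)] = counts.get((c1, c2), 0) + 1
    return counts

cc = cross_correlate(A)
print(cc[("alice", "bob")])  # 2: alice and bob co-occur in doc1 and doc3
```

Pairs that never share a row simply never appear in the result, which is the same sparsity the associative-array multiply exploits.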