1 00:00:00,040 --> 00:00:02,410 The following content is provided under a Creative 2 00:00:02,410 --> 00:00:03,790 Commons license. 3 00:00:03,790 --> 00:00:06,030 Your support will help MIT OpenCourseWare 4 00:00:06,030 --> 00:00:10,110 continue to offer high quality educational resources for free. 5 00:00:10,110 --> 00:00:12,680 To make a donation, or to view additional materials 6 00:00:12,680 --> 00:00:16,496 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,496 --> 00:00:17,120 at ocw.mit.edu. 8 00:00:21,362 --> 00:00:27,030 JEREMY KEPNER: All right, so we're doing lecture 06 9 00:00:27,030 --> 00:00:28,120 in the course today. 10 00:00:28,120 --> 00:00:31,240 That's in the-- just remind people it's 11 00:00:31,240 --> 00:00:35,350 in the docs directory here. 12 00:00:35,350 --> 00:00:39,660 And I'm going to be spending a lot of time talking 13 00:00:39,660 --> 00:00:43,060 about a particular application, which I think is actually 14 00:00:43,060 --> 00:00:45,690 representative of a variety of applications 15 00:00:45,690 --> 00:00:51,460 that we do-- a lot of things have statistical properties similar 16 00:00:51,460 --> 00:00:56,826 to a sequence cross correlation. 17 00:00:56,826 --> 00:00:58,130 OK. 18 00:00:58,130 --> 00:01:02,834 So diving right in. 19 00:01:02,834 --> 00:01:04,250 Just going to give an introduction 20 00:01:04,250 --> 00:01:07,650 to this particular problem of genetic sequence analysis 21 00:01:07,650 --> 00:01:14,870 from a computational perspective and how the D4M technology can 22 00:01:14,870 --> 00:01:19,160 really make it pretty easy to do the kinds of things 23 00:01:19,160 --> 00:01:20,890 that people like to do. 24 00:01:20,890 --> 00:01:23,260 And then just talk about the pipeline 25 00:01:23,260 --> 00:01:26,930 that we built to implement this system because, as I've 26 00:01:26,930 --> 00:01:32,360 said before, when you're dealing with this type of large data, 27 00:01:32,360 --> 00:01:34,240 the D4M technology is one piece. 28 00:01:34,240 --> 00:01:37,270 It's often the piece you use to prototype your algorithms. 29 00:01:37,270 --> 00:01:40,730 But to be a part of a system, you usually 30 00:01:40,730 --> 00:01:42,824 have to stitch together a variety of technologies, 31 00:01:42,824 --> 00:01:44,990 databases obviously being an important part of that. 32 00:01:47,860 --> 00:01:50,400 So this is a great chart. 33 00:01:50,400 --> 00:01:58,090 This is the relative cost per DNA sequence over time 34 00:01:58,090 --> 00:02:03,117 here, over the last 20 years. 35 00:02:03,117 --> 00:02:04,700 So we're getting a little cut off here 36 00:02:04,700 --> 00:02:06,760 at the bottom of the screen. 37 00:02:06,760 --> 00:02:09,930 So I think I'm-- hmm. 38 00:02:09,930 --> 00:02:14,780 So you know, this just shows Moore's law, 39 00:02:14,780 --> 00:02:17,410 so we all know that that technology has been increasing 40 00:02:17,410 --> 00:02:18,840 at an incredible rate. 41 00:02:18,840 --> 00:02:24,010 And as we've seen, the cost of DNA sequencing 42 00:02:24,010 --> 00:02:25,630 is going down dramatically. 43 00:02:25,630 --> 00:02:29,790 So for the first DNA sequence, people nominally 44 00:02:29,790 --> 00:02:32,440 say that to sequence the first human genome 45 00:02:32,440 --> 00:02:34,710 was around a billion dollars. 46 00:02:34,710 --> 00:02:45,540 And they're expecting it to be $100 within the next few years.
47 00:02:45,540 --> 00:02:49,620 So having your DNA sequenced will 48 00:02:49,620 --> 00:02:55,010 become probably a fairly routine medical activity 49 00:02:55,010 --> 00:02:59,080 in the next decade. 50 00:02:59,080 --> 00:03:04,100 And so the data generated, you know, 51 00:03:04,100 --> 00:03:08,460 typically a human genome will have billions 52 00:03:08,460 --> 00:03:11,690 of DNA sequences in it. 53 00:03:11,690 --> 00:03:12,996 And that's a lot of data. 54 00:03:15,770 --> 00:03:20,510 What's actually perhaps even more interesting 55 00:03:20,510 --> 00:03:24,001 than sequencing your DNA is sequencing 56 00:03:24,001 --> 00:03:25,500 the DNA of all the other things that 57 00:03:25,500 --> 00:03:29,420 are in you, which is sometimes called the metagenome. 58 00:03:29,420 --> 00:03:35,700 So take a swab and not just get the DNA-- 59 00:03:35,700 --> 00:03:38,280 your DNA, but also of all the other things that 60 00:03:38,280 --> 00:03:43,260 are a part of you, which can be ten times larger than your DNA. 61 00:03:43,260 --> 00:03:46,420 So depending on that, so they now 62 00:03:46,420 --> 00:03:48,690 have developed these high volume sequencers. 63 00:03:48,690 --> 00:03:53,818 Here's an example of one that I believe can do 600. 64 00:03:53,818 --> 00:03:54,990 AUDIENCE: [INAUDIBLE] 65 00:03:54,990 --> 00:03:55,770 JEREMY KEPNER: OK. 66 00:03:55,770 --> 00:03:56,590 No problem. 67 00:03:56,590 --> 00:04:03,960 That 600 billion base pairs a day, 68 00:04:03,960 --> 00:04:07,850 so like 600 gigabytes of data a day. 69 00:04:07,850 --> 00:04:11,610 And this is all data that you want to cross correlate. 70 00:04:11,610 --> 00:04:15,100 I mean, it's your-- it's a-- so that's 71 00:04:15,100 --> 00:04:20,470 what this-- this is a table top, a table top apparatus here 72 00:04:20,470 --> 00:04:23,150 that sells for a few hundred thousand dollars. 73 00:04:23,150 --> 00:04:26,770 And they are even getting into portable sequencers 74 00:04:26,770 --> 00:04:30,100 that you can plug in with a USB connection 75 00:04:30,100 --> 00:04:32,340 into a laptop or something like that. 76 00:04:32,340 --> 00:04:36,370 It-- to do more in the field types of things. 77 00:04:36,370 --> 00:04:39,010 So why would you want to do this? 78 00:04:39,010 --> 00:04:44,250 I think abstractly to understand all that would be good. 79 00:04:44,250 --> 00:04:47,900 Computation plays a huge role because this data is collected. 80 00:04:47,900 --> 00:04:52,630 And it's just sort of abstract snippets of DNA, you know? 81 00:04:52,630 --> 00:04:56,030 Just even assembling them into a-- just your DNA 82 00:04:56,030 --> 00:05:00,050 into a whole process can take a fair amount of computation. 83 00:05:00,050 --> 00:05:02,060 And right now, that is actually something that 84 00:05:02,060 --> 00:05:04,280 takes a fair amount of time. 85 00:05:04,280 --> 00:05:09,060 And so to give you an example, here's a great use case. 86 00:05:09,060 --> 00:05:12,110 This shows, if you recall, in the summer of 2011 87 00:05:12,110 --> 00:05:16,610 there was a virulent E. coli outbreak in Germany. 88 00:05:16,610 --> 00:05:18,162 And not to single out the Germans. 89 00:05:18,162 --> 00:05:19,620 We've certainly had the same things 90 00:05:19,620 --> 00:05:20,840 occur in the United States. 91 00:05:20,840 --> 00:05:24,150 And these occur all across the world. 92 00:05:24,150 --> 00:05:29,880 And so, you know, this shows kind of the time course in May. 93 00:05:29,880 --> 00:05:32,540 You know, the first cases starting appear. 
94 00:05:32,540 --> 00:05:35,990 And then those lead to the first deaths. 95 00:05:35,990 --> 00:05:37,990 And then it spikes. 96 00:05:37,990 --> 00:05:40,340 And then that's just kind of when 97 00:05:40,340 --> 00:05:43,105 you hit this peak is when they really identify the outbreak. 98 00:05:43,105 --> 00:05:44,480 And then they finally figured out 99 00:05:44,480 --> 00:05:48,960 what the-- what it is that's causing people 100 00:05:48,960 --> 00:05:50,340 and begin to remediate it. 101 00:05:50,340 --> 00:05:53,920 But, you know, until you kind of really have this portion, 102 00:05:53,920 --> 00:05:58,430 people are still getting exposed to the thing, usually 103 00:05:58,430 --> 00:06:01,790 before they actually nail it down. 104 00:06:01,790 --> 00:06:03,900 There's lots of rumors flying around. 105 00:06:03,900 --> 00:06:08,920 All other parts of the food chain are disrupted. 106 00:06:08,920 --> 00:06:13,430 You know, any single time a particular product 107 00:06:13,430 --> 00:06:15,750 is implicated, that's hundreds of millions 108 00:06:15,750 --> 00:06:18,910 of dollars of lost business as people just 109 00:06:18,910 --> 00:06:21,810 basically-- you know, they say it's spinach. 110 00:06:21,810 --> 00:06:24,530 Then everyone stops buying spinach for a while. 111 00:06:24,530 --> 00:06:26,200 And, oh, it wasn't spinach. 112 00:06:26,200 --> 00:06:26,910 Sorry. 113 00:06:26,910 --> 00:06:28,650 It was something else. 114 00:06:28,650 --> 00:06:33,180 And so that's-- so there's a dual-- you know, 115 00:06:33,180 --> 00:06:35,960 so they started by implicating this, the cucumbers, 116 00:06:35,960 --> 00:06:37,320 but that wasn't quite right. 117 00:06:37,320 --> 00:06:38,810 They've then sequenced the stuff. 118 00:06:38,810 --> 00:06:41,952 And then they correctly identified it was the sprouts. 119 00:06:41,952 --> 00:06:44,410 At least I believe that was the time course of events here. 120 00:06:44,410 --> 00:06:49,900 So-- and this is sort of the integrated number of deaths. 121 00:06:49,900 --> 00:06:53,060 And so, you know, the story here is obviously 122 00:06:53,060 --> 00:06:56,190 the thing we want to do most is when a person gets sick 123 00:06:56,190 --> 00:06:59,130 here or here, wouldn't it be great to sequence them 124 00:06:59,130 --> 00:07:02,050 immediately, get that information, 125 00:07:02,050 --> 00:07:04,810 know exactly what's causing the problem, 126 00:07:04,810 --> 00:07:09,670 and then be able to start testing the food supply channel 127 00:07:09,670 --> 00:07:12,570 so that you can make a real impact on the mortality? 128 00:07:12,570 --> 00:07:15,060 And then likewise, not have the economic-- 129 00:07:15,060 --> 00:07:18,060 you know, obviously the loss of life 130 00:07:18,060 --> 00:07:23,720 is the preeminent issue here. 131 00:07:23,720 --> 00:07:25,780 But there's also the economic impact, 132 00:07:25,780 --> 00:07:28,410 which certainly the people who are in those businesses 133 00:07:28,410 --> 00:07:30,110 would want to address. 134 00:07:30,110 --> 00:07:31,920 So as you can see, there was a really sort 135 00:07:31,920 --> 00:07:34,280 of a rather long delay here between sort 136 00:07:34,280 --> 00:07:39,200 of when the outbreak started and the DNA sequence released. 
137 00:07:39,200 --> 00:07:43,230 And this was actually a big step forward in the sense 138 00:07:43,230 --> 00:07:45,915 that DNA sequencing really did play-- ended up 139 00:07:45,915 --> 00:07:49,370 playing a role in this process as opposed 140 00:07:49,370 --> 00:07:52,410 to previously where it may not have. 141 00:07:52,410 --> 00:07:56,450 And in the-- and people see that now 142 00:07:56,450 --> 00:08:00,570 and they would love to move this earlier. 143 00:08:00,570 --> 00:08:03,420 So, you know, and obviously in our business 144 00:08:03,420 --> 00:08:05,800 and across, it's not just this type of example. 145 00:08:05,800 --> 00:08:07,550 But there's other types of examples 146 00:08:07,550 --> 00:08:12,870 where rapid sequencing and identification would 147 00:08:12,870 --> 00:08:13,700 be very important. 148 00:08:13,700 --> 00:08:15,158 And there are certainly investments 149 00:08:15,158 --> 00:08:18,170 being made to try and make that more possible. 150 00:08:18,170 --> 00:08:21,610 So an example of what the processing timeline looks 151 00:08:21,610 --> 00:08:23,410 like now, I mean, you're basically starting 152 00:08:23,410 --> 00:08:24,860 with the human infection. 153 00:08:24,860 --> 00:08:26,690 And it could be a natural disease. 154 00:08:26,690 --> 00:08:29,430 Obviously in the DOD, they're very concerned about bioweapons 155 00:08:29,430 --> 00:08:31,070 as well. 156 00:08:31,070 --> 00:08:34,740 So there's the collection of the sample, the preparation, 157 00:08:34,740 --> 00:08:37,470 the analysis, and sort of the overall time 158 00:08:37,470 --> 00:08:39,120 to actionable data. 159 00:08:39,120 --> 00:08:42,470 And really, it's not to say processing is everything 160 00:08:42,470 --> 00:08:44,450 on this, but as part of a whole system, 161 00:08:44,450 --> 00:08:48,540 you could imagine if you could do on-site collection, 162 00:08:48,540 --> 00:08:53,210 automatic preparation, and then very quick analysis, 163 00:08:53,210 --> 00:08:55,250 you could imagine shortening this cycle down 164 00:08:55,250 --> 00:08:57,960 to one day, which would be something that would really 165 00:08:57,960 --> 00:09:01,330 make a huge impact. 166 00:09:01,330 --> 00:09:06,090 Some of the other useful sequences, useful-- I'm sorry, 167 00:09:06,090 --> 00:09:11,690 roles for DNA sequence matching are quickly comparing two data-- 168 00:09:11,690 --> 00:09:12,646 two sets of DNA. 169 00:09:16,715 --> 00:09:22,190 Identification, that is, who is it? 170 00:09:22,190 --> 00:09:25,360 Analysis of mixtures, you know, what type of things 171 00:09:25,360 --> 00:09:29,350 could you determine if someone was related to somebody else? 172 00:09:29,350 --> 00:09:31,085 Ancestry analysis, which can be used 173 00:09:31,085 --> 00:09:32,960 in disease outbreaks, criminal investigation, 174 00:09:32,960 --> 00:09:34,460 and personal medicine. 175 00:09:34,460 --> 00:09:39,930 You know, the set of things is pretty large here. 176 00:09:39,930 --> 00:09:43,250 So I'm now going to explain to you kind of fundamentally what 177 00:09:43,250 --> 00:09:45,810 is the algorithm that we use for doing this matching, 178 00:09:45,810 --> 00:09:48,394 but we're going to explain it in terms of the mathematics 179 00:09:48,394 --> 00:09:49,810 that we've described before, which 180 00:09:49,810 --> 00:09:51,770 is these associative arrays, which actually 181 00:09:51,770 --> 00:09:55,310 make it very, very easy to describe what's going on here.
182 00:09:55,310 --> 00:09:58,350 If I was to describe to you the traditional approaches for how 183 00:09:58,350 --> 00:10:00,150 we do DNA sequence matching, 184 00:10:00,150 --> 00:10:01,690 it would actually be-- that would 185 00:10:01,690 --> 00:10:05,690 be a whole lecture in itself. 186 00:10:05,690 --> 00:10:07,520 So let me get into that algorithm. 187 00:10:07,520 --> 00:10:09,700 And so basically this is it. 188 00:10:09,700 --> 00:10:13,740 On one slide is how we do DNA sequence matching. 189 00:10:13,740 --> 00:10:18,210 So we have a reference sequence here. 190 00:10:18,210 --> 00:10:20,680 This is something that we know, a database 191 00:10:20,680 --> 00:10:22,190 of data that we know. 192 00:10:22,190 --> 00:10:27,990 And it consists of a sequence ID and then a whole bunch 193 00:10:27,990 --> 00:10:30,930 of what are called base pairs. 194 00:10:30,930 --> 00:10:36,430 And this can usually be several hundred long. 195 00:10:36,430 --> 00:10:38,840 And you'll have thousands of these, 196 00:10:38,840 --> 00:10:44,730 each that are a few hundred, maybe 1,000 base pairs long. 197 00:10:44,730 --> 00:10:51,820 And so the standard approach to this is to take these sequences 198 00:10:51,820 --> 00:10:53,970 and break them up into smaller units, which 199 00:10:53,970 --> 00:10:57,430 are called words or mers. 200 00:10:57,430 --> 00:11:00,419 And a standard number is called a 10mer. 201 00:11:00,419 --> 00:11:01,960 So they're basically-- what you would 202 00:11:01,960 --> 00:11:05,070 say is you take the first ten letters, and you say, 203 00:11:05,070 --> 00:11:05,920 "All right. 204 00:11:05,920 --> 00:11:08,100 That's one 10mer." 205 00:11:08,100 --> 00:11:10,460 Then you move it over one, and you 206 00:11:10,460 --> 00:11:14,080 say that's another 10mer, and so on and so forth. 207 00:11:14,080 --> 00:11:19,780 So, if this was a sequence 400 long, 208 00:11:19,780 --> 00:11:22,940 you would have 400 10mers, OK? 209 00:11:22,940 --> 00:11:25,560 And then you're obviously multiplying the total data 210 00:11:25,560 --> 00:11:29,120 volume by a factor of ten because of this thing. 211 00:11:29,120 --> 00:11:31,310 And so for those of us who know signal processing, 212 00:11:31,310 --> 00:11:34,160 this is just standard filtering. 213 00:11:34,160 --> 00:11:35,280 Nothing new here. 214 00:11:35,280 --> 00:11:38,020 Very, very standard type of filtering approach. 215 00:11:38,020 --> 00:11:45,250 So then what we do is for each sequence ID, OK, 216 00:11:45,250 --> 00:11:50,850 this forms the row key of an associative array. 217 00:11:50,850 --> 00:11:55,980 And each 10mer of that sequence forms 218 00:11:55,980 --> 00:11:58,450 a column key of that associative array. 219 00:11:58,450 --> 00:12:01,680 So you can see here each of these rows 220 00:12:01,680 --> 00:12:08,680 shows you all the unique 10mers that appeared in that sequence. 221 00:12:08,680 --> 00:12:11,900 And then a column is a particular 10mer. 222 00:12:11,900 --> 00:12:16,940 So as we can see, certain 10mers appear very commonly. 223 00:12:16,940 --> 00:12:20,400 And some appear in a not so common way. 224 00:12:20,400 --> 00:12:22,510 So that gives us an associative array, 225 00:12:22,510 --> 00:12:26,690 which is a sparse matrix where we have sequence ID by 10mer. 226 00:12:26,690 --> 00:12:28,850 And then we do the same thing with the collection. 227 00:12:28,850 --> 00:12:31,890 So we've collected, in this case, some unknown bacteria 228 00:12:31,890 --> 00:12:32,870 sample.
229 00:12:32,870 --> 00:12:35,550 And we have a similar set of sequences here. 230 00:12:35,550 --> 00:12:36,890 And we do the exact same thing. 231 00:12:36,890 --> 00:12:39,820 We have a sequence ID and then the 10mer. 232 00:12:39,820 --> 00:12:42,270 And then what we want to do is cross correlate, 233 00:12:42,270 --> 00:12:44,710 find all the matches between them. 234 00:12:44,710 --> 00:12:48,560 And so that's just done with matrix multiply. 235 00:12:48,560 --> 00:12:52,280 So A1 times A2 transpose will then result 236 00:12:52,280 --> 00:12:54,740 in a new matrix, which is reference sequence 237 00:12:54,740 --> 00:12:57,580 ID by unknown sequence ID. 238 00:12:57,580 --> 00:12:59,850 And then it will then-- the value in here 239 00:12:59,850 --> 00:13:03,090 will be how many matches they had. 240 00:13:03,090 --> 00:13:07,280 And so generally, if, say, 400 was the maximum possible match, 241 00:13:07,280 --> 00:13:11,880 then you would be looking for things well above 30 or 40 242 00:13:11,880 --> 00:13:13,920 matches between them. 243 00:13:13,920 --> 00:13:20,010 For a true match, maybe 50%, 60%, 70% of them match. 244 00:13:20,010 --> 00:13:23,750 And so then you can just apply a threshold to this 245 00:13:23,750 --> 00:13:26,110 to deter-- to find the true, true matches. 246 00:13:26,110 --> 00:13:28,040 So very simple. 247 00:13:28,040 --> 00:13:31,430 And, you know, there are large software packages out there 248 00:13:31,430 --> 00:13:32,170 that do this. 249 00:13:32,170 --> 00:13:33,670 They essentially do this algorithm 250 00:13:33,670 --> 00:13:36,450 with various twists and variations and stuff like that. 251 00:13:36,450 --> 00:13:38,750 But here we can explain the whole thing 252 00:13:38,750 --> 00:13:41,120 knowing the mathematics of associative arrays 253 00:13:41,120 --> 00:13:43,890 on one slide. 254 00:13:43,890 --> 00:13:50,670 So this calculation is what we call a direct match 255 00:13:50,670 --> 00:13:51,340 calculation. 256 00:13:51,340 --> 00:13:54,970 We're literally comparing every single sequence's 10mers 257 00:13:54,970 --> 00:13:55,995 with all the other ones. 258 00:13:58,870 --> 00:14:03,660 And this is essentially what sequence comparison does. 259 00:14:03,660 --> 00:14:07,930 And it takes a lot of computation to do this. 260 00:14:07,930 --> 00:14:12,710 If you have millions of sequence IDs on one side and millions 261 00:14:12,710 --> 00:14:15,120 of sequence IDs on the other, this very quickly 262 00:14:15,120 --> 00:14:20,890 becomes a large amount of computational effort. 263 00:14:20,890 --> 00:14:24,750 So we are, of course, interested in other techniques that 264 00:14:24,750 --> 00:14:27,140 could possibly accelerate this. 265 00:14:27,140 --> 00:14:31,110 So one of the things we're able to do using the Accumulo 266 00:14:31,110 --> 00:14:34,260 database is ingest this entire set 267 00:14:34,260 --> 00:14:38,850 of data as an associative array into the database. 268 00:14:38,850 --> 00:14:43,790 And using Accumulo's tally features, 269 00:14:43,790 --> 00:14:45,730 have it essentially, as we do the ingestion, 270 00:14:45,730 --> 00:14:50,180 automatically tally the counts for each 10mer. 271 00:14:50,180 --> 00:14:51,720 So we can then essentially construct 272 00:14:51,720 --> 00:14:54,680 a histogram of all the 10mers. 273 00:14:54,680 --> 00:14:59,070 Some 10mers will appear in a very large number of sequences, 274 00:14:59,070 --> 00:15:03,080 and some 10mers will appear in not very many.
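Before getting to that histogram, here is a minimal D4M-style sketch of the direct match calculation described above. The sequence, the sequence ID, and the thresholds are illustrative values only, and A2 (the sample array) would be built the same way from the collected data.

seq = 'ATGCCGTAATGCCGTAATGC';       % one toy reference sequence
k   = 10;                           % 10mer window length
nw  = length(seq) - k + 1;          % number of sliding windows
mer = cell(1, nw);
for i = 1:nw
  mer{i} = seq(i:i+k-1);            % slide the window over by one base
end
r  = repmat('refSeq001,', 1, nw);   % row key: the sequence ID, repeated
c  = sprintf('%s,', mer{:});        % column keys: the 10mers themselves
A1 = Assoc(r, c, 1);                % reference array: sequence ID x 10mer
% With A2 built the same way from the sample, the correlation and the
% threshold are one line each:
%   AA     = A1 * A2.';             % counts of shared 10mers per ID pair
%   strong = AA > 40;               % keep only pairs well above the clutter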
275 00:15:03,080 --> 00:15:07,980 So this is that histogram, or essentially 276 00:15:07,980 --> 00:15:09,040 a degree distribution. 277 00:15:09,040 --> 00:15:11,470 We talked about degree distributions in other lectures. 278 00:15:11,470 --> 00:15:15,060 This tells us that there's one 10mer that 279 00:15:15,060 --> 00:15:21,310 appears in, you know, 80% or 90% of all the sequences. 280 00:15:21,310 --> 00:15:30,320 And then there's 20 or 30 10mers that appear in just a few. 281 00:15:30,320 --> 00:15:33,210 And then there's a distribution, in this case 282 00:15:33,210 --> 00:15:35,520 of sort of almost a log normal curve, that 283 00:15:35,520 --> 00:15:42,070 shows you that really, most of the 10mers 284 00:15:42,070 --> 00:15:49,850 seem to have-- appear in a few hundred sequences. 285 00:15:49,850 --> 00:15:53,100 And so now one thing I've done here 286 00:15:53,100 --> 00:15:59,160 is create certain thresholds to say, all right. 287 00:15:59,160 --> 00:16:05,680 If I only wanted-- if I looked at that large, sparse matrix 288 00:16:05,680 --> 00:16:11,810 of the data, and I wanted to threshold, what-- how many-- 289 00:16:11,810 --> 00:16:13,560 how much of the data would I eliminate? 290 00:16:13,560 --> 00:16:17,460 So if I wanted to eliminate 50% of the data, 291 00:16:17,460 --> 00:16:19,920 I could set a threshold, let's only 292 00:16:19,920 --> 00:16:24,207 look at things that are less than a degree of 10,000. 293 00:16:24,207 --> 00:16:26,790 You might say, well, why would I want to eliminate these very, 294 00:16:26,790 --> 00:16:28,870 very popular things? 295 00:16:28,870 --> 00:16:31,190 Well because they appear everywhere, 296 00:16:31,190 --> 00:16:33,020 a true match is-- they're not going 297 00:16:33,020 --> 00:16:35,390 to give me any information that really tells me 298 00:16:35,390 --> 00:16:37,400 about true matches. 299 00:16:37,400 --> 00:16:38,830 Those are going to be clutter. 300 00:16:38,830 --> 00:16:40,220 Everything has them. 301 00:16:40,220 --> 00:16:44,770 That two sequences share that particular 10mer doesn't really 302 00:16:44,770 --> 00:16:49,000 give me a lot of power in selecting which one it really 303 00:16:49,000 --> 00:16:51,310 belongs to. 304 00:16:51,310 --> 00:16:53,350 So like-- so I can do that. 305 00:16:53,350 --> 00:16:57,300 If I wanted to go down to only 5% of the data, I could say, 306 00:16:57,300 --> 00:17:00,850 you know, I only want to look at 10mers that are 100, 307 00:17:00,850 --> 00:17:04,690 you know, or that have-- that appear in 100 308 00:17:04,690 --> 00:17:06,300 of these sequences or less. 309 00:17:06,300 --> 00:17:08,690 And if I wanted to go even further, you know, 310 00:17:08,690 --> 00:17:11,609 I could go down to 20, 30, 40, 50, 311 00:17:11,609 --> 00:17:14,540 and I would only have one half percent of the data. 312 00:17:14,540 --> 00:17:18,500 I would have eliminated from consideration 99.5% 313 00:17:18,500 --> 00:17:20,310 of the data. 314 00:17:20,310 --> 00:17:25,910 And if I can do that, then 315 00:17:25,910 --> 00:17:31,160 that's very powerful because I can quickly take my sample data 316 00:17:31,160 --> 00:17:31,660 set. 317 00:17:31,660 --> 00:17:33,650 I know all the 10mers it has. 318 00:17:33,650 --> 00:17:35,780 And I can quickly look it up against this histogram 319 00:17:35,780 --> 00:17:36,770 and say, "No. 320 00:17:36,770 --> 00:17:38,030 I don't want to do any.
321 00:17:38,030 --> 00:17:40,820 I only care about this very small section of data, 322 00:17:40,820 --> 00:17:43,580 and I only need to do correlations from that." 323 00:17:43,580 --> 00:17:46,130 So let's see how that works out. 324 00:17:46,130 --> 00:17:49,950 And I should say, this technique is very generic. 325 00:17:49,950 --> 00:17:51,702 You could do it for text matching. 326 00:17:51,702 --> 00:17:53,660 If you have documents, you have the same issue. 327 00:17:53,660 --> 00:17:57,740 Very popular words are not going to tell you really anything 328 00:17:57,740 --> 00:17:59,794 meaningful about whether two documents are 329 00:17:59,794 --> 00:18:00,710 related to each other. 330 00:18:00,710 --> 00:18:02,100 It's going to be clutter. 331 00:18:02,100 --> 00:18:05,220 And other types of records of that thing, you know? 332 00:18:05,220 --> 00:18:07,920 So in the graph theory perspective, 333 00:18:07,920 --> 00:18:09,800 we call these super nodes. 334 00:18:09,800 --> 00:18:11,390 These 10mers are super nodes. 335 00:18:11,390 --> 00:18:13,840 They have connections to many, many things, 336 00:18:13,840 --> 00:18:17,490 and therefore, if you try and connect through them, 337 00:18:17,490 --> 00:18:21,650 it's just going to not give you very useful information. 338 00:18:21,650 --> 00:18:27,140 You know, it's like people visiting Google. 339 00:18:27,140 --> 00:18:30,060 Looking for all the records where people connect-- 340 00:18:30,060 --> 00:18:31,670 visited Google is not really going 341 00:18:31,670 --> 00:18:36,010 to tell you much unless you have more information. 342 00:18:36,010 --> 00:18:41,000 And so it's not a very big distinguishing factor. 343 00:18:41,000 --> 00:18:45,120 So here's an example of the results 344 00:18:45,120 --> 00:18:50,470 of doing-- of selecting this low threshold, 345 00:18:50,470 --> 00:18:55,650 eliminating 99.5% of the data, and then comparing our matches 346 00:18:55,650 --> 00:18:57,590 that we got with what happens when 347 00:18:57,590 --> 00:19:00,060 we've used 100% of the data. 348 00:19:00,060 --> 00:19:05,700 And so what we have here is the true 10mer match and then 349 00:19:05,700 --> 00:19:08,900 the measured sub-sampled match here. 350 00:19:08,900 --> 00:19:15,210 And what you see here is that we get a very, very high success 351 00:19:15,210 --> 00:19:24,870 rate in terms of we basically detect all strong matches using 352 00:19:24,870 --> 00:19:26,920 only half percent of the data. 353 00:19:26,920 --> 00:19:31,560 And, you know, the number of false positives 354 00:19:31,560 --> 00:19:33,600 is extremely low. 355 00:19:33,600 --> 00:19:35,280 In fact, a better way to look at that 356 00:19:35,280 --> 00:19:38,420 is if we look at the cumular-- cumulative probability 357 00:19:38,420 --> 00:19:40,470 of detection. 358 00:19:40,470 --> 00:19:43,890 This shows this that if the actual match, if there 359 00:19:43,890 --> 00:19:48,410 was actually 100 matches between two sequences, 360 00:19:48,410 --> 00:19:55,830 we detect all of those using only 1/20 of the data. 361 00:19:55,830 --> 00:19:59,480 And likewise, in our probability false alarm rate, 362 00:19:59,480 --> 00:20:03,360 we see that if you see more than a match of ten 363 00:20:03,360 --> 00:20:07,610 in the sub-sample data, that is going to be a true match 364 00:20:07,610 --> 00:20:10,070 essentially 100% of the time. 
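Here is a rough sketch of that popularity filter, done in memory with D4M (in the real pipeline the 10mer tallies come out of the Accumulo accumulator table instead). A1 and A2 are the sequence ID by 10mer associative arrays from the earlier sketch, and the cutoffs are illustrative, not the real operating points.

deg    = sum(A1, 1);                   % in how many reference sequences each 10mer appears
rare   = Col(deg < 100);               % keys of the unpopular, non-super-node 10mers
AA     = A1(:, rare) * A2(:, rare).';  % correlate using only a small fraction of the data
strong = AA > 10;                      % matches above this level were essentially always true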
365 00:20:10,070 --> 00:20:16,030 And so this technique allows us to dramatically speed up 366 00:20:16,030 --> 00:20:19,222 the rate at which we can do these comparisons. 367 00:20:19,222 --> 00:20:20,930 And you do have the price to pay that you 368 00:20:20,930 --> 00:20:23,920 have to pre-load your reference data 369 00:20:23,920 --> 00:20:26,190 into this special database. 370 00:20:26,190 --> 00:20:28,950 But since you tend to be more concerned 371 00:20:28,950 --> 00:20:30,990 about comparing with respect to it, 372 00:20:30,990 --> 00:20:34,650 that is a worthwhile investment. 373 00:20:34,650 --> 00:20:38,940 So that's just an example of how you can use these techniques, 374 00:20:38,940 --> 00:20:41,610 use the mathematics of associative arrays, 375 00:20:41,610 --> 00:20:46,490 and these databases together in a coherent way. 376 00:20:49,090 --> 00:20:54,600 So we can do more than just find the number of matches. 377 00:20:54,600 --> 00:20:57,280 So the matrix multiply I showed you 378 00:20:57,280 --> 00:21:02,540 before, A1 times A2 transposed showed us 379 00:21:02,540 --> 00:21:05,360 the counts, the number of matches. 380 00:21:05,360 --> 00:21:07,940 But sometimes we want to know more than that. 381 00:21:07,940 --> 00:21:10,610 We want to know not just the number of matches, 382 00:21:10,610 --> 00:21:14,990 but please show me the exact set of 10mers 383 00:21:14,990 --> 00:21:17,540 that caused the match, OK? 384 00:21:17,540 --> 00:21:18,980 And so this is where, and I think 385 00:21:18,980 --> 00:21:21,110 we talked about this in a previous lecture, 386 00:21:21,110 --> 00:21:24,500 we have these special matrix multiplies that will actually 387 00:21:24,500 --> 00:21:30,710 take the intersecting keys in the matrix multiply and 388 00:21:30,710 --> 00:21:33,567 assign those to the value field or [INAUDIBLE]. 389 00:21:33,567 --> 00:21:35,400 And so that's why we have the special matrix 390 00:21:35,400 --> 00:21:37,860 multiply called CatKeyMul. 391 00:21:37,860 --> 00:21:39,860 And so, for instance here, if we look 392 00:21:39,860 --> 00:21:43,380 at the result of that, which is AK, and we say, 393 00:21:43,380 --> 00:21:48,090 "Show me all the value matches that are greater than six 394 00:21:48,090 --> 00:21:50,560 in their rows and their columns together," 395 00:21:50,560 --> 00:21:53,590 now we can see that this sequence ID 396 00:21:53,590 --> 00:21:55,260 matched with this sequence ID. 397 00:21:55,260 --> 00:21:58,790 And these were the actual 10mers that they 398 00:21:58,790 --> 00:22:00,620 had in common that generated the match. 399 00:22:00,620 --> 00:22:03,740 Now clearly six is not a true match 400 00:22:03,740 --> 00:22:05,790 in this little sample data set. 401 00:22:05,790 --> 00:22:07,450 We don't have any true matches. 402 00:22:07,450 --> 00:22:10,760 But this just shows you what that is like. 403 00:22:10,760 --> 00:22:14,860 And so this is what we call a pedigree preserving 404 00:22:14,860 --> 00:22:15,680 correlation. 405 00:22:15,680 --> 00:22:18,119 That is, it shows you the-- it doesn't just 406 00:22:18,119 --> 00:22:18,910 give you the count. 407 00:22:18,910 --> 00:22:21,840 It shows you where that evidence came from. 408 00:22:21,840 --> 00:22:23,300 And you can track it back. 409 00:22:23,300 --> 00:22:25,670 And this is something we do want to do all the time. 410 00:22:25,670 --> 00:22:27,800 If you imagined two documents that you 411 00:22:27,800 --> 00:22:30,740 wanted to cross correlate, you might say, all right.
412 00:22:30,740 --> 00:22:32,690 I have these two documents, and now I've 413 00:22:32,690 --> 00:22:34,970 cross correlated their word matches. 414 00:22:34,970 --> 00:22:39,220 Well, now I want to know the actual words that matched, 415 00:22:39,220 --> 00:22:40,280 not just the counts. 416 00:22:40,280 --> 00:22:44,150 And you would use the exact same multiply to do that. 417 00:22:44,150 --> 00:22:48,560 Likewise, you could do the word/word correlation 418 00:22:48,560 --> 00:22:49,490 of a document. 419 00:22:49,490 --> 00:22:52,170 So that would be A transpose times 420 00:22:52,170 --> 00:22:55,190 A instead of A, A transpose. 421 00:22:55,190 --> 00:22:58,110 And then it would show you two words. 422 00:22:58,110 --> 00:22:59,750 It would show you the list of documents 423 00:22:59,750 --> 00:23:02,020 that they actually had in common. 424 00:23:02,020 --> 00:23:06,030 So again, this is a powerful-- a powerful tool. 425 00:23:06,030 --> 00:23:08,580 Again, I should remind people when using this, 426 00:23:08,580 --> 00:23:15,390 you do have to be careful when you do, say, CatKeyMul A1 427 00:23:15,390 --> 00:23:20,180 times A1 transpose if you do a square because you will then 428 00:23:20,180 --> 00:23:23,690 end up with this very dense diagonal, 429 00:23:23,690 --> 00:23:27,090 and these lists will get extremely long. 430 00:23:27,090 --> 00:23:30,810 And you can often run out of memory when that happens. 431 00:23:30,810 --> 00:23:34,230 So you do have to be careful when you do correlations, 432 00:23:34,230 --> 00:23:37,180 these CatKeyMul correlations on things 433 00:23:37,180 --> 00:23:40,917 that are going to have very large, overlapping matches. 434 00:23:40,917 --> 00:23:42,500 The regular matrix multiply, you don't 435 00:23:42,500 --> 00:23:44,560 have to worry about that, creating dense. 436 00:23:44,560 --> 00:23:46,370 You know, that's not a problem. 437 00:23:46,370 --> 00:23:48,640 But that's just a little caveat to be aware of. 438 00:23:48,640 --> 00:23:50,390 And there's really nothing we can do about 439 00:23:50,390 --> 00:23:53,015 that if you do square them, you know? 440 00:23:53,015 --> 00:23:55,410 And we've thought about creating a new function, which 441 00:23:55,410 --> 00:23:57,620 is basically squaring with and basically 442 00:23:57,620 --> 00:23:59,250 not doing the diagonal. 443 00:23:59,250 --> 00:24:05,940 And we may end up making that if we can figure that one out. 444 00:24:05,940 --> 00:24:11,440 So once you have those actual specific matches here, 445 00:24:11,440 --> 00:24:15,480 so for example, we have our two reference samples. 446 00:24:15,480 --> 00:24:19,030 And we looked at the ones that were larger. 447 00:24:19,030 --> 00:24:20,420 So here's our two sequences. 448 00:24:20,420 --> 00:24:26,440 This, if we look back at the original data, which actually 449 00:24:26,440 --> 00:24:28,704 stored the locations of that. 450 00:24:28,704 --> 00:24:30,120 So now we're saying, oh, these two 451 00:24:30,120 --> 00:24:32,680 have six matches between them. 452 00:24:32,680 --> 00:24:36,690 Let me look them up through this one line statement here. 453 00:24:36,690 --> 00:24:40,570 Now I can see the actual 10mers that match 454 00:24:40,570 --> 00:24:44,130 and their actual locations in the real data. 
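A small sketch of this pedigree preserving correlation, again assuming A1 and A2 are the sequence ID by 10mer associative arrays from before; the threshold of six just mirrors the toy example on the slide.

AA  = A1 * A2.';                 % numeric correlation: counts of shared 10mers
AAk = CatKeyMul(A1, A2.');       % same pairs, but the values list the shared 10mers
hit = AA > 6;                    % sequence ID pairs with enough overlap
AAk(Row(hit), Col(hit))          % show which 10mers those pairs had in common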
455 00:24:44,130 --> 00:24:48,670 And from that, I can then deduce that, oh, actually this is not 456 00:24:48,670 --> 00:24:51,560 six separate 10mer matches, but it's really 457 00:24:51,560 --> 00:24:57,374 two sort of-- what is this, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. 458 00:24:57,374 --> 00:24:59,700 One 10mer match, and then I guess-- 459 00:24:59,700 --> 00:25:03,580 I think this is like a 17 match, right? 460 00:25:03,580 --> 00:25:04,860 And likewise over here. 461 00:25:04,860 --> 00:25:06,620 You get a similar type of thing. 462 00:25:06,620 --> 00:25:08,470 So that's basically stitching it back 463 00:25:08,470 --> 00:25:10,220 together because that's really what you're 464 00:25:10,220 --> 00:25:13,130 trying to do is find these longer sequences where 465 00:25:13,130 --> 00:25:14,490 it's identical. 466 00:25:14,490 --> 00:25:16,760 And then people can then use those 467 00:25:16,760 --> 00:25:20,430 to determine functionality and other types of things 468 00:25:20,430 --> 00:25:22,150 that have to do with that. 469 00:25:22,150 --> 00:25:25,030 So this is just part of the process of doing that. 470 00:25:25,030 --> 00:25:28,090 And it just shows by having this pedigree-preserving matrix 471 00:25:28,090 --> 00:25:31,350 multiply and the ability to store string values 472 00:25:31,350 --> 00:25:33,570 in the value, that we can actually preserve 473 00:25:33,570 --> 00:25:38,290 that information and do that. 474 00:25:38,290 --> 00:25:40,930 So that just shows you, I think, a really great example. 475 00:25:40,930 --> 00:25:42,679 It's kind of one of the most recent things 476 00:25:42,679 --> 00:25:46,320 we did this summer in terms of the power of D4M coupled 477 00:25:46,320 --> 00:25:52,508 with databases to really do algorithms that are completely 478 00:25:52,508 --> 00:25:55,990 kind of new and interesting. 479 00:25:55,990 --> 00:25:57,490 And as algorithm developers, I think 480 00:25:57,490 --> 00:26:00,095 that's something that we are all very excited about. 481 00:26:00,095 --> 00:26:01,720 So now I'm going to talk about how this 482 00:26:01,720 --> 00:26:03,700 fits into an overall pipeline. 483 00:26:03,700 --> 00:26:06,030 So once again, just for reminders, 484 00:26:06,030 --> 00:26:10,610 you know, D4M, we've talked mostly about working over here 485 00:26:10,610 --> 00:26:13,610 in this space, in the Matlab space, 486 00:26:13,610 --> 00:26:16,400 for doing your analytics with associative arrays. 487 00:26:16,400 --> 00:26:20,660 But they also have ways-- we have very nice bindings too. 488 00:26:20,660 --> 00:26:22,260 In particular, the Accumulo database, 489 00:26:22,260 --> 00:26:25,560 but we can bind to just about any database. 490 00:26:25,560 --> 00:26:26,720 And here's an example. 491 00:26:26,720 --> 00:26:29,810 If I have a table that shows all this data, 492 00:26:29,810 --> 00:26:32,340 and I just wanted to get, please give me 493 00:26:32,340 --> 00:26:35,400 the column of this 10mer sequence, 494 00:26:35,400 --> 00:26:37,730 I would just type that query and it would return 495 00:26:37,730 --> 00:26:39,560 that column for me very nicely. 496 00:26:39,560 --> 00:26:42,890 So it's a very powerful binding to databases. 497 00:26:42,890 --> 00:26:45,290 It's funny because a lot of people, when I talk to them, 498 00:26:45,290 --> 00:26:49,510 are like-- they think D4M is either just about 499 00:26:49,510 --> 00:26:54,760 databases or just about this associative array mathematics. 
500 00:26:54,760 --> 00:26:57,180 And because they usually kind of-- people take sort of-- 501 00:26:57,180 --> 00:27:01,540 they usually see it sort of more used in one context or another. 502 00:27:01,540 --> 00:27:02,940 And it's really about both. 503 00:27:02,940 --> 00:27:06,660 It's really about connecting the two worlds together. 504 00:27:06,660 --> 00:27:08,940 That particular application that we showed you, 505 00:27:08,940 --> 00:27:13,830 this genetic sequence comparison application 506 00:27:13,830 --> 00:27:16,290 is like you-- is a part of a pipeline, 507 00:27:16,290 --> 00:27:20,000 like you would see in any real system. 508 00:27:20,000 --> 00:27:23,380 And so I'm going to show you that pipeline here. 509 00:27:23,380 --> 00:27:30,140 And so we have here the raw data. 510 00:27:30,140 --> 00:27:33,250 So Fasta is just the name of the file format 511 00:27:33,250 --> 00:27:35,900 that all of this DNA sequence comes in, 512 00:27:35,900 --> 00:27:38,030 which basically looks pretty much like a CSV 513 00:27:38,030 --> 00:27:40,080 file of what I just showed you. 514 00:27:40,080 --> 00:27:44,200 I guess that deserves a name as a file format, but, you know. 515 00:27:44,200 --> 00:27:47,210 So it's basically that. 516 00:27:47,210 --> 00:27:56,670 And then one thing we do is we parse that data from the Fasta 517 00:27:56,670 --> 00:28:02,240 data into a triples format, which 518 00:28:02,240 --> 00:28:09,040 allows us to then really work with it as associative arrays. 519 00:28:09,040 --> 00:28:11,240 So it basically creates the 10mers 520 00:28:11,240 --> 00:28:14,500 and puts them into a series of triple files, 521 00:28:14,500 --> 00:28:16,830 or a row triple file, which holds the sequence 522 00:28:16,830 --> 00:28:21,460 ID, the actual 10mer itself, a list of those, 523 00:28:21,460 --> 00:28:25,750 and then the position of where that 10mer appeared 524 00:28:25,750 --> 00:28:28,660 in the sequence. 525 00:28:28,660 --> 00:28:31,280 And then typically what we do then 526 00:28:31,280 --> 00:28:36,480 is we read-- we write a program. 527 00:28:36,480 --> 00:28:38,060 And this can be a parallel program 528 00:28:38,060 --> 00:28:41,170 if we want, usually in Matlab 529 00:28:41,170 --> 00:28:44,410 using D4M, that will read these sequences in 530 00:28:44,410 --> 00:28:48,130 and will often just directly insert them 531 00:28:48,130 --> 00:28:50,040 without any additional formatting. 532 00:28:50,040 --> 00:28:53,410 Just-- we have a way of just inserting triples directly 533 00:28:53,410 --> 00:28:56,000 into the Accumulo database, which is the fastest way 534 00:28:56,000 --> 00:28:59,030 that we can do inserts. 535 00:28:59,030 --> 00:29:05,130 And then it will also convert these to associative arrays 536 00:29:05,130 --> 00:29:09,360 in Matlab and will save them out as mat files 537 00:29:09,360 --> 00:29:11,460 because this will take a little time to convert 538 00:29:11,460 --> 00:29:13,625 all of these things into files. 539 00:29:13,625 --> 00:29:15,250 You're like, well why would we do that? 540 00:29:15,250 --> 00:29:18,520 Well, there's two very good reasons for doing that. 541 00:29:18,520 --> 00:29:21,280 One, the Accumulo database, or any database, 542 00:29:21,280 --> 00:29:23,200 is very good if we want to look up 543 00:29:23,200 --> 00:29:27,880 a small part of a large data set, OK?
544 00:29:27,880 --> 00:29:29,910 So if we have billions of records, 545 00:29:29,910 --> 00:29:32,310 and we want millions of those records, 546 00:29:32,310 --> 00:29:34,180 that's a great use of a database. 547 00:29:34,180 --> 00:29:36,180 However, there are certain times we're like, no. 548 00:29:36,180 --> 00:29:39,710 We want to traverse the entire data set. 549 00:29:39,710 --> 00:29:42,140 Well, databases are actually bad at doing that. 550 00:29:42,140 --> 00:29:45,180 If you say, "I want to scan over the entire database," 551 00:29:45,180 --> 00:29:48,060 it's like doing a billion queries, you know? 552 00:29:48,060 --> 00:29:50,300 And so there's overheads associated with that. 553 00:29:50,300 --> 00:29:54,970 In that instance, it's far more efficient to just save the data 554 00:29:54,970 --> 00:29:56,940 into these associative array files, which 555 00:29:56,940 --> 00:29:58,805 will read in very quickly. 556 00:29:58,805 --> 00:30:00,430 And then you can just-- if you're like, 557 00:30:00,430 --> 00:30:03,902 "I want to do an analysis of the data 558 00:30:03,902 --> 00:30:06,360 in that way where I'm going to want-- I really want to work 559 00:30:06,360 --> 00:30:08,820 with 10% or 20% of the data." 560 00:30:08,820 --> 00:30:12,800 Then having this data already parsed 561 00:30:12,800 --> 00:30:15,490 into this binary format is a very efficient way 562 00:30:15,490 --> 00:30:18,730 to run an application or an analytic that will run over 563 00:30:18,730 --> 00:30:21,030 all of those files. 564 00:30:21,030 --> 00:30:24,080 It also makes parallelism very easy. 565 00:30:24,080 --> 00:30:27,180 You just get-- let different processors 566 00:30:27,180 --> 00:30:28,890 process different files. 567 00:30:28,890 --> 00:30:30,730 You know that's very easy to do. 568 00:30:30,730 --> 00:30:33,510 We have lots of support for that type of model. 569 00:30:33,510 --> 00:30:36,790 And so that's a good reason to do that. 570 00:30:36,790 --> 00:30:39,520 And at worst, you're doubling the size 571 00:30:39,520 --> 00:30:41,506 of the data and your database. 572 00:30:41,506 --> 00:30:42,380 Don't worry about it. 573 00:30:42,380 --> 00:30:45,180 We double data and databases all the time. 574 00:30:45,180 --> 00:30:47,090 Databases are notorious-- if you 575 00:30:47,090 --> 00:30:50,880 put a certain amount of data in, they make it much larger. 576 00:30:50,880 --> 00:30:53,000 Accumulo actually does a very good job of that. 577 00:30:53,000 --> 00:30:55,060 It won't really be that much larger. 578 00:30:55,060 --> 00:30:58,850 But there's no reason not to save those files 579 00:30:58,850 --> 00:31:00,580 as you're doing the ingest. 580 00:31:00,580 --> 00:31:04,920 And then you can do various comparisons. 581 00:31:04,920 --> 00:31:09,700 So for example, we then can-- if this-- 582 00:31:09,700 --> 00:31:13,930 if we save the sample data as a Matlab file, we 583 00:31:13,930 --> 00:31:15,260 could read that in. 584 00:31:15,260 --> 00:31:21,100 And then do our comparison with the reference data 585 00:31:21,100 --> 00:31:22,790 that's sitting inside the database 586 00:31:22,790 --> 00:31:24,280 to get our top matches. 587 00:31:24,280 --> 00:31:26,860 And that's exactly how this application actually works. 588 00:31:26,860 --> 00:31:32,190 So this pipeline goes from raw data to an intermediate format 589 00:31:32,190 --> 00:31:37,830 to a sort of efficient binary format and insertion 590 00:31:37,830 --> 00:31:41,160 to then doing the analytics and comparisons.
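A sketch of one ingest step of that pipeline, under a few assumptions: the Fasta parser has already produced the triple files, r, c, and v hold those triples as delimiter-separated strings (sequence ID, 10mer, position), and Tref is a D4M table binding created elsewhere. The file name is made up.

putTriple(Tref, r, c, v);          % push the raw triples straight into Accumulo -- fastest insert path
A = Assoc(r, c, v);                % also build the associative array in memory
save('fasta_part01_A.mat', 'A');   % keep a binary copy for whole-data-set analytics
% Later, an analytic that needs to sweep 10% or 20% of the data just loads
% these .mat files, while targeted lookups still go through the database.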
591 00:31:41,160 --> 00:31:43,080 And eventually, you'll usually 592 00:31:43,080 --> 00:31:45,760 have another layer, which is you might create a web 593 00:31:45,760 --> 00:31:47,796 interface to something. 594 00:31:47,796 --> 00:31:48,420 We're like, OK. 595 00:31:48,420 --> 00:31:50,760 This is now a full system. 596 00:31:50,760 --> 00:31:53,670 A person can type in your-- or put in their own data. 597 00:31:53,670 --> 00:31:56,180 And then it will then call a function. 598 00:31:56,180 --> 00:32:00,080 So again, this is very standard how this stuff ends up fitting 599 00:32:00,080 --> 00:32:03,440 into an overall pipeline. 600 00:32:03,440 --> 00:32:05,880 And it's certainly a situation where 601 00:32:05,880 --> 00:32:08,107 if you're going to deploy a system, you might decide, 602 00:32:08,107 --> 00:32:08,690 you know what? 603 00:32:08,690 --> 00:32:12,160 I don't want Matlab to be a part of my deployed system. 604 00:32:12,160 --> 00:32:15,420 I want to do that, say, in Java or something else that's 605 00:32:15,420 --> 00:32:18,980 sort of more universal, which we certainly see people do. 606 00:32:18,980 --> 00:32:22,730 It still makes sense to do it with this approach 607 00:32:22,730 --> 00:32:25,900 because the algorithm development and testing 608 00:32:25,900 --> 00:32:30,850 is just much, much easier to do in an environment like D4M. 609 00:32:30,850 --> 00:32:33,000 And then once you have the algorithm correct, 610 00:32:33,000 --> 00:32:34,810 it's now much easier to give that 611 00:32:34,810 --> 00:32:37,570 to someone else who is going to do 612 00:32:37,570 --> 00:32:40,570 the implementation in another environment and deal 613 00:32:40,570 --> 00:32:43,960 with all the issues that are associated with maybe doing 614 00:32:43,960 --> 00:32:45,830 a deployment type of thing. 615 00:32:45,830 --> 00:32:51,090 So one certainly could use the Matlab or the new octave code 616 00:32:51,090 --> 00:32:52,390 in a production environment. 617 00:32:52,390 --> 00:32:54,280 We certainly have seen that. 618 00:32:54,280 --> 00:32:56,650 But often the case is, one has limitations 619 00:32:56,650 --> 00:32:58,410 about what one can deploy. 620 00:32:58,410 --> 00:33:01,880 And so it is still better to do the algorithm development 621 00:33:01,880 --> 00:33:05,500 in this type of environment than to try and do 622 00:33:05,500 --> 00:33:08,460 the algorithm in a deployment language like Java. 623 00:33:13,820 --> 00:33:19,660 One of the things that was very important for this database, 624 00:33:19,660 --> 00:33:22,940 and it's true of most parallel databases, is this: 625 00:33:22,940 --> 00:33:28,200 if we want to get the highest performance insert, that 626 00:33:28,200 --> 00:33:31,645 is, we want to read the data and insert it 627 00:33:31,645 --> 00:33:33,610 as quickly as possible in the database, 628 00:33:33,610 --> 00:33:36,780 typically we'll need to have some kind of parallel program 629 00:33:36,780 --> 00:33:39,420 running, in this case, maybe each reading 630 00:33:39,420 --> 00:33:42,020 different sets of input files and all inserting them 631 00:33:42,020 --> 00:33:44,280 into the parallel database. 632 00:33:44,280 --> 00:33:48,370 And so in Accumulo, they divide your table 633 00:33:48,370 --> 00:33:53,200 amongst the different computers, which they call tablet servers.
634 00:33:53,200 --> 00:33:56,400 And it's very important to avoid the situation 635 00:33:56,400 --> 00:33:59,310 where everyone is inserting and all the data 636 00:33:59,310 --> 00:34:01,990 is being inserted into the same tablet server. 637 00:34:01,990 --> 00:34:05,450 You're not going to get really very good performance. 638 00:34:05,450 --> 00:34:08,350 Now, the way Accumulo splits up its data 639 00:34:08,350 --> 00:34:11,030 is similar to many other databases. 640 00:34:11,030 --> 00:34:13,386 It's sometimes called sharding. 641 00:34:13,386 --> 00:34:14,969 It just means they split up the table, 642 00:34:14,969 --> 00:34:16,679 but the term the database community 643 00:34:16,679 --> 00:34:18,800 uses-- they call it sharding. 644 00:34:18,800 --> 00:34:21,380 What they'll do is they'll basically take the table 645 00:34:21,380 --> 00:34:24,850 and they'll assign certain row keys, in this c-- in Accumulo's 646 00:34:24,850 --> 00:34:27,830 case, certain contiguous sets of row keys 647 00:34:27,830 --> 00:34:31,929 to particular tablet servers. 648 00:34:31,929 --> 00:34:38,170 So, you know, if you had a data set that was uniformly 649 00:34:38,170 --> 00:34:43,699 split over the alphabet, and you were going to split it in two, 650 00:34:43,699 --> 00:34:47,050 the first split would be between m and n. 651 00:34:47,050 --> 00:34:49,560 And so this is called splitting. 652 00:34:49,560 --> 00:34:52,989 And it's very important to try and get good splits 653 00:34:52,989 --> 00:34:56,929 and choose your splits so that you get good performance. 654 00:34:56,929 --> 00:35:00,990 Now D4M has a native interface that allows you to just say, 655 00:35:00,990 --> 00:35:04,110 here are the-- I want these to be the splits of this table. 656 00:35:04,110 --> 00:35:06,690 You can actually compute those and assign them 657 00:35:06,690 --> 00:35:09,000 if you have a parallel instance. 658 00:35:09,000 --> 00:35:12,080 In the class, you will only be working on the databases that 659 00:35:12,080 --> 00:35:13,410 will be set up for you. 660 00:35:13,410 --> 00:35:17,930 The ones we have set up for you are all single node instances. 661 00:35:17,930 --> 00:35:19,700 They do not have multiple tablet servers, 662 00:35:19,700 --> 00:35:22,360 so you don't really have to do-- deal with splitting. 663 00:35:22,360 --> 00:35:24,500 It's only an issue in parallel, but it's 664 00:35:24,500 --> 00:35:29,782 certainly something to be aware of and is often the key. 665 00:35:29,782 --> 00:35:32,820 People will often have a very large Accumulo instance. 666 00:35:32,820 --> 00:35:35,940 And they may only be getting the performance 667 00:35:35,940 --> 00:35:39,934 they would get on just a two node instance, 668 00:35:39,934 --> 00:35:41,850 and usually it's because their splitting needs 669 00:35:41,850 --> 00:35:44,730 to be done in this proper way. 670 00:35:44,730 --> 00:35:47,050 And this is true of all databases, not just Accumulo. 671 00:35:47,050 --> 00:35:49,080 But other databases have the exact same issue.
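A sketch of pre-splitting a table, assuming T is a D4M table binding to a parallel Accumulo instance and that your D4M release exposes a putSplits call for this (check your version); the split points themselves are illustrative.

splits = 'g,n,u,';                 % row-key boundaries between tablet servers
putSplits(T, splits);              % ask Accumulo to shard the table at these keys
% Each parallel ingester can then land its rows on different tablet
% servers instead of everyone piling onto the same one.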
672 00:35:52,740 --> 00:35:57,110 Accumulo is called Accumulo because it has something called 673 00:35:57,110 --> 00:35:59,900 accumulators, which is what actually makes 674 00:35:59,900 --> 00:36:04,960 this whole bioinformatics application really go: 675 00:36:04,960 --> 00:36:12,840 I can denote a column in a table 676 00:36:12,840 --> 00:36:14,770 to be an accumulator column, which 677 00:36:14,770 --> 00:36:18,160 means whenever a new entry is hit, 678 00:36:18,160 --> 00:36:19,950 some action will be performed. 679 00:36:19,950 --> 00:36:24,170 In this case, the default is addition. 680 00:36:24,170 --> 00:36:27,930 And so a standard thing we'll do in our schema, 681 00:36:27,930 --> 00:36:29,940 as we've already talked about, with these 682 00:36:29,940 --> 00:36:34,500 exploded transpose pair schemas that allow 683 00:36:34,500 --> 00:36:36,820 fast lookups in rows and columns, 684 00:36:36,820 --> 00:36:40,840 is we'll create two additional tables, one of which 685 00:36:40,840 --> 00:36:44,010 holds the sums of the rows, and one of which 686 00:36:44,010 --> 00:36:47,470 holds the sums of the columns, which then allows us to do 687 00:36:47,470 --> 00:36:51,790 these very fast lookups of the statistics, which is very 688 00:36:51,790 --> 00:36:56,090 useful for knowing how to avoid accidentally looking up columns 689 00:36:56,090 --> 00:36:59,410 that are present in essentially the whole database. 690 00:36:59,410 --> 00:37:01,110 An issue that happens all the time 691 00:37:01,110 --> 00:37:03,400 is that you'll have a column, and it's essentially 692 00:37:03,400 --> 00:37:05,860 almost a dense column in the database, and you really, 693 00:37:05,860 --> 00:37:08,040 really, really don't want to look that up 694 00:37:08,040 --> 00:37:10,850 because it's basically giving you the whole database. 695 00:37:10,850 --> 00:37:16,400 Well, with this accumulator, you can look up the column first, 696 00:37:16,400 --> 00:37:18,821 and it will give you a count and be like, oh, yeah. 697 00:37:18,821 --> 00:37:20,070 And then you can just say, no. 698 00:37:20,070 --> 00:37:22,810 I want to look up everything but that column. 699 00:37:22,810 --> 00:37:27,010 So very powerful, very powerful tool for doing that. 700 00:37:27,010 --> 00:37:30,810 That's also another reason why we construct 701 00:37:30,810 --> 00:37:34,780 the associative array when we load it for insert 702 00:37:34,780 --> 00:37:36,990 because when we construct the associative array, 703 00:37:36,990 --> 00:37:41,270 we can actually do a quick sum right then and there 704 00:37:41,270 --> 00:37:42,970 of whatever piece we've read in. 705 00:37:42,970 --> 00:37:46,060 And so then when we send that into the database, 706 00:37:46,060 --> 00:37:49,180 we've dramatically reduced the number of-- the amount 707 00:37:49,180 --> 00:37:50,560 of tallying it has to do. 708 00:37:50,560 --> 00:37:52,630 And this can be a huge time saver 709 00:37:52,630 --> 00:37:57,940 because if you don't do that, you're essentially reinserting 710 00:37:57,940 --> 00:38:01,400 the data two more times because, you know, 711 00:38:01,400 --> 00:38:06,000 when you've inserted it into the table and its transpose, 712 00:38:06,000 --> 00:38:07,660 to get the accumulation effect, 713 00:38:07,660 --> 00:38:09,920 you would have to directly insert it. 714 00:38:09,920 --> 00:38:13,330 But if we can do a pre-sum, that reduces 715 00:38:13,330 --> 00:38:18,120 the amount of work on that accumulator table dramatically.
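A sketch of that pre-sum trick, with several assumptions: A is the chunk this ingester just read (a sequence ID by 10mer Assoc), Tdeg is a tally table whose count column has a summing combiner configured on the Accumulo side (that setup is not shown), and putCol and num2str behave as assumed here (putCol renaming the column key, num2str converting the numeric tallies to the string values the database stores).

degChunk = putCol(sum(A, 1).', 'Degree,');  % one partial tally per 10mer in this chunk
put(Tdeg, num2str(degChunk));               % ship a few thousand tallies, not every entry
% The database-side combiner adds these partial sums to the running totals,
% so each chunk touches the accumulator table only once.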
716 00:38:18,120 --> 00:38:20,300 And so again, that's another reason why we do that. 717 00:38:23,780 --> 00:38:26,780 So let's just talk about some of the results that we see. 718 00:38:26,780 --> 00:38:30,000 So here's an example on a-- let's see, 719 00:38:30,000 --> 00:38:32,270 this is an eight node database. 720 00:38:32,270 --> 00:38:34,740 And this shows us the ingest rate 721 00:38:34,740 --> 00:38:37,940 that we get using different numbers of ingesters. 722 00:38:37,940 --> 00:38:40,420 So this is different programs all 723 00:38:40,420 --> 00:38:43,430 trying to insert into this database simultaneously. 724 00:38:43,430 --> 00:38:49,140 And this just shows the difference between having 0 725 00:38:49,140 --> 00:38:55,470 splits and, in this case, 35 splits was the optimal number 726 00:38:55,470 --> 00:38:56,400 that we had. 727 00:38:56,400 --> 00:39:00,370 And you see, it's a rather large difference. 728 00:39:00,370 --> 00:39:02,900 And you don't get-- you get some benefit 729 00:39:02,900 --> 00:39:06,090 with multiple inserters, but that sort of ends 730 00:39:06,090 --> 00:39:08,000 when you don't do this. 731 00:39:08,000 --> 00:39:12,440 So this is just an example of why you want to do that. 732 00:39:12,440 --> 00:39:16,930 Another example is with the actual human DNA database. 733 00:39:16,930 --> 00:39:21,610 This shows us our insert rate on doing-- using 734 00:39:21,610 --> 00:39:29,740 different numbers of inserters. 735 00:39:29,740 --> 00:39:34,750 And yeah, so this is eight tablet servers. 736 00:39:34,750 --> 00:39:37,480 And this just shows the different number of ingesters 737 00:39:37,480 --> 00:39:38,000 here. 738 00:39:38,000 --> 00:39:39,890 And since we're doing proper splitting, 739 00:39:39,890 --> 00:39:42,470 we're getting very nice scaling. 740 00:39:42,470 --> 00:39:45,330 And this is actually the actual output from Accumulo. 741 00:39:45,330 --> 00:39:48,030 It actually has a nice little meter here 742 00:39:48,030 --> 00:39:51,320 that shows you what it says you're actually getting, 743 00:39:51,320 --> 00:39:54,420 which is very, very nice to be able to verify 744 00:39:54,420 --> 00:39:57,300 that you and the database both agree 745 00:39:57,300 --> 00:40:00,500 that your insert rate is-- so this is insert right here. 746 00:40:00,500 --> 00:40:03,510 So you see, we're getting about 4,000-- 400,000 entries 747 00:40:03,510 --> 00:40:06,780 per second, which is an extraordinarily high number 748 00:40:06,780 --> 00:40:09,800 in the database community. 749 00:40:09,800 --> 00:40:11,920 Just to give you a comparison, it 750 00:40:11,920 --> 00:40:13,770 would not be uncommon if you were 751 00:40:13,770 --> 00:40:17,240 to set up a MySQL instance on a single computer 752 00:40:17,240 --> 00:40:21,840 and have one inserter going into it to get maybe 1,000 inserts 753 00:40:21,840 --> 00:40:23,980 per second on a good day. 754 00:40:23,980 --> 00:40:26,100 And so, you know, this is essentially 755 00:40:26,100 --> 00:40:29,090 the core reason why people use Accumulo 756 00:40:29,090 --> 00:40:35,220 is because of this very high insert and query performance. 757 00:40:35,220 --> 00:40:37,820 And this just shows our extrapolated run time. 
758 00:40:37,820 --> 00:40:43,440 If we wanted to, for instance, ingest the entire human FASTA 759 00:40:43,440 --> 00:40:47,450 database here of 4.5 billion entries, 760 00:40:47,450 --> 00:40:51,280 we could do that in eight hours, which 761 00:40:51,280 --> 00:40:53,950 would be a very reasonable amount of time to do that. 762 00:40:53,950 --> 00:40:56,280 So to summarize what we were able to achieve 763 00:40:56,280 --> 00:41:01,290 with this application, this shows one of these diagrams. 764 00:41:01,290 --> 00:41:03,270 I think I showed one in the first lecture. 765 00:41:03,270 --> 00:41:06,630 This is a way of sort of measuring our productivity 766 00:41:06,630 --> 00:41:08,020 and our performance. 767 00:41:08,020 --> 00:41:11,240 So this shows the volume of code that we wrote. 768 00:41:11,240 --> 00:41:12,830 And this shows the run time. 769 00:41:12,830 --> 00:41:16,060 So obviously it is better to be down here. 770 00:41:16,060 --> 00:41:17,900 And this shows BLAST. 771 00:41:17,900 --> 00:41:22,390 So BLAST is the industry standard application 772 00:41:22,390 --> 00:41:24,380 for doing sequence matching. 773 00:41:24,380 --> 00:41:27,650 And I don't want, in any way, to say that we are 774 00:41:27,650 --> 00:41:29,160 doing everything BLAST does. 775 00:41:29,160 --> 00:41:32,990 BLAST is almost a million lines of code. 776 00:41:32,990 --> 00:41:34,840 It does lots and lots of different things. 777 00:41:34,840 --> 00:41:36,780 It handles all different types of file formats 778 00:41:36,780 --> 00:41:39,780 and little nuances and other types of things. 779 00:41:39,780 --> 00:41:44,370 But at its core, it does what we did. 780 00:41:44,370 --> 00:41:51,210 And we were able to basically do that in, you know, 781 00:41:51,210 --> 00:41:55,140 150 lines of D4M code. 782 00:41:55,140 --> 00:41:58,730 And using the database, we were able to do 783 00:41:58,730 --> 00:42:02,480 that 100 times faster. 784 00:42:02,480 --> 00:42:05,990 And so the real power of this technology 785 00:42:05,990 --> 00:42:09,690 is to allow you to develop these algorithms 786 00:42:09,690 --> 00:42:12,650 and implement them and, in this case, 787 00:42:12,650 --> 00:42:16,890 actually leverage the database to essentially replace lookups 788 00:42:16,890 --> 00:42:19,220 with computations in a very intelligent way, 789 00:42:19,220 --> 00:42:21,800 in a way that's knowledgeable about the statistics 790 00:42:21,800 --> 00:42:23,000 of your data. 791 00:42:23,000 --> 00:42:24,290 And that's really the power. 792 00:42:24,290 --> 00:42:26,820 And these are the types of results that people can get. 793 00:42:26,820 --> 00:42:31,600 And this whole result was done with one summer student, a very 794 00:42:31,600 --> 00:42:33,970 smart summer student, and we were 795 00:42:33,970 --> 00:42:36,740 able to put this whole system together in about a couple 796 00:42:36,740 --> 00:42:37,440 of months. 797 00:42:37,440 --> 00:42:40,840 And this is from a person who knew nothing about Accumulo 798 00:42:40,840 --> 00:42:42,400 or D4M or anything like that. 799 00:42:42,400 --> 00:42:44,890 They were smart, 800 00:42:44,890 --> 00:42:49,024 with a good, solid Java background, and very energetic. 801 00:42:49,024 --> 00:42:50,440 But, you know, these are the kinds 802 00:42:50,440 --> 00:42:53,710 of things we've been able to see.
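To give a sense of what "knowledgeable about the statistics of your data" can look like in practice, here is a hedged D4M-style sketch of a statistics-aware match query. It reuses the same illustrative tables as above (sequence IDs as rows, DNA words as columns), assumes the Assoc operations str2num, find, and Row along with the table-subscript query style, and the threshold and variable names are made up:

    % sampleWords is an illustrative list of the 10-mer keys found in a sample.
    maxDegree = 10000;                          % illustrative cutoff for "too common"

    % Ask the accumulator (degree) table how common each word is, and keep
    % only the uncommon ones so we never pull back a near-dense column.
    Adeg = str2num(TdegCol(sampleWords, :));
    rare = Row(Adeg < maxDegree);

    % Look up only those rare words: which reference sequences contain them,
    % and how many rare words each reference sequence shares with the sample.
    Ahits = Tedge(:, rare);
    [seqID, word, val] = find(Ahits);
    Amatch = Assoc(seqID, 'matches,', 1, @sum);

This is the sense in which the sum tables let the database do the heavy filtering for you: a couple of cheap lookups against the statistics take the place of a brute-force pass over everything.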
803 00:42:53,710 --> 00:42:55,680 So just to summarize, again, we 804 00:42:55,680 --> 00:43:02,080 see that these types of techniques 805 00:43:02,080 --> 00:43:04,950 are useful for document analysis, network analysis, and DNA 806 00:43:04,950 --> 00:43:06,570 sequencing. 807 00:43:06,570 --> 00:43:08,070 You know, this is really summarizing 808 00:43:08,070 --> 00:43:10,750 all the applications that we've talked about. 809 00:43:10,750 --> 00:43:14,350 We think there's a pretty big gap between the data 810 00:43:14,350 --> 00:43:17,560 analysis tools we have and what our algorithm developers really need. 811 00:43:17,560 --> 00:43:20,210 And we think D4M is really allowing 812 00:43:20,210 --> 00:43:24,600 us to use a tool like Matlab for its traditional role 813 00:43:24,600 --> 00:43:27,730 in this new domain, which is algorithm development: 814 00:43:27,730 --> 00:43:29,610 figuring things out and getting things right 815 00:43:29,610 --> 00:43:34,490 before you then hand it on to someone else to implement 816 00:43:34,490 --> 00:43:38,159 and actually get into a production environment. 817 00:43:38,159 --> 00:43:40,200 So with that, that brings the lecture to an end. 818 00:43:40,200 --> 00:43:41,700 And then there are some examples where 819 00:43:41,700 --> 00:43:44,290 we show you this stuff. 820 00:43:44,290 --> 00:43:48,100 And there's no homework other 821 00:43:48,100 --> 00:43:52,680 than that I sent you all that link to see 822 00:43:52,680 --> 00:43:55,330 if your access to the database works. 823 00:43:55,330 --> 00:43:57,060 Did everyone-- raise your hand. 824 00:43:57,060 --> 00:43:59,001 Did you check the link out? 825 00:43:59,001 --> 00:43:59,500 Yes? 826 00:43:59,500 --> 00:44:00,260 You logged in? 827 00:44:00,260 --> 00:44:01,470 It worked? 828 00:44:01,470 --> 00:44:03,430 You saw a bunch of databases there? 829 00:44:03,430 --> 00:44:04,190 OK. 830 00:44:04,190 --> 00:44:06,260 If it doesn't work, because the next class is 831 00:44:06,260 --> 00:44:08,435 sort of the penultimate class, I'm 832 00:44:08,435 --> 00:44:10,060 going to go through all these examples. 833 00:44:10,060 --> 00:44:12,480 And the assignment will be to run those examples 834 00:44:12,480 --> 00:44:13,630 on those databases. 835 00:44:13,630 --> 00:44:17,710 So you will be really using D4M, touching a database, 836 00:44:17,710 --> 00:44:20,740 doing real analytics, doing very complicated things. 837 00:44:20,740 --> 00:44:23,480 The whole class is just going to be a demonstration of that. 838 00:44:23,480 --> 00:44:25,990 Everything we've been doing has been leading up to that. 839 00:44:25,990 --> 00:44:28,460 So the concepts are in place so that you can understand 840 00:44:28,460 --> 00:44:30,080 those examples fairly easily. 841 00:44:30,080 --> 00:44:32,490 All right, so we will take a short break here. 842 00:44:32,490 --> 00:44:34,530 And then we will come back, and I will show 843 00:44:34,530 --> 00:44:37,720 you the demo, this week's demo.