The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right, so we're back. We're going to do the examples we just described in the previous notes. Once again, if you have your d4m directory, these are in Examples/Applications/BioBlast. This shows you the two programs we're going to run; we just have two examples. And then we're going to be comparing two pieces of actual data.

So if you look at this, it shows you what that data looks like: a sequence ID, and then very long sequences. It would be nice if they were all the same length, but they certainly are not; they vary dramatically, as you can see. Look at that one, it's super long. So this is a reference bacteria dataset; this is known data. And this is a sequence of some data taken from somebody's palm. It's open source data, so this is just an example. You can imagine one would want to compare the DNA sequences of bacteria on a person's hand with some reference dataset. Obviously, we're not going to be using a large dataset; this is a very small one. We wouldn't expect to have any matches, and we're not going to have any, but this just shows you how that works.

So let me go over here. The first one was this BB1 program. All right, what we've done is we set our word size. We're going to use 10 base pairs, so we're basically dealing with 10-mers. And we wrote a little program that, given a CSV file of FASTA data and a word size, creates the associative array as we described.
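In rough terms, the windowing step inside that reader looks something like the following. This is only a sketch of the idea, not the course's actual FASTA reader, and the sequence string is made up:

    % Sketch: split one sequence string into overlapping 10-mers with a sliding window.
    seq  = 'GCCACTTAAAGCCTTGGAT';   % toy sequence, purely illustrative
    w    = 10;                      % word size (10-mer)
    nmer = length(seq) - w + 1;     % number of overlapping windows
    mers = cell(1, nmer);
    for i = 1:nmer
      mers{i} = seq(i:i+w-1);       % each window becomes one 10-mer column key
    end
    mers = unique(mers);            % the associative array keeps each 10-mer once per row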
So basically each sequence will be a row; we'll get into that. And likewise, each 10-mer will be a column. We do this for the two datasets. Here is our matrix multiply: A1 times A2 transpose gives us A1A2. And here is our pedigreed matrix multiply, CatKeyMul of A1 and A2 transpose, which gives us the A1A2 key array. And so we can now look at that. A1 has 198 sequences, that's the bacteria, with 8,001 unique 10-mers. A2 had 1,000 sequences with 2,365 unique 10-mers. And when you matrix multiply them, you see that 85 of the 195 in the reference had a match with 415 of the 999 in the sample. So we're losing about half in each case because the associative array does not store empty values; there's going to be no key which is a completely empty row or a completely empty column.

So we can now look at that data. If we look at Figure 1, this is A1. Why don't we zoom in? That gives you a better sense of the structure, and you can actually click on it. So here's a 10-mer that appears in an awful lot of things. And these are examples of a row corresponding to a particular sequence ID. So that's A1, the reference. A2, the sample, is pretty much the same type of thing, although with a little bit different structure, maybe more of these dense blobs. Why don't we zoom in on one of these blobs? We've got this one; it begins with GCCAC and GCCTT and other things like that.

And this is the cross correlation. If we go now to Figure 3, this is the cross correlation of all of those, and it holds the count. You can see most of these are just 1, and unless I did some kind of thresholding, I probably couldn't even find the ones that were larger. And then Figure 4 is our pedigreed multiplication, so now it shows you what the actual matching 10-mers were.
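As a rough sketch of those two products, here is a tiny hand-built example. The row and column keys are made up, and D4M Assoc strings use a trailing separator (a comma here); the real BB1 arrays are built from the FASTA files:

    % Sketch only: two tiny associative arrays correlated the same way BB1 does.
    A1 = Assoc('seq1,seq1,seq2,', 'GCCACTTAAA,CCACTTAAAG,GCCACTTAAA,', 1);
    A2 = Assoc('samp7,samp9,',    'GCCACTTAAA,TTTTTTTTTT,',            1);

    A1A2    = A1 * A2.';            % value = number of shared 10-mers per sequence pair
    A1A2key = CatKeyMul(A1, A2.');  % pedigreed product: values list the shared 10-mers
    displayFull(A1A2key)            % shows which 10-mer produced each match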
So this tells us that this reference sequence ID and this sample sequence ID had a match at this 10-mer.

So we're back here. And as you see, when I clicked on that, it printed out the full sequence, so you could just cut and paste it into something else. Likewise, if I wanted to look at this row, I could do A1 of that sequence ID, and that gives me the row. And I can do other things: I could sum the columns up, and that tells me there are 138 unique 10-mers in that sequence. So, those types of things.

All right, I think we pretty much covered that, so now we'll go on to the next example. And I apologize, this takes a little bit longer to run, because now I'm doing two things when I read in the data. I'm doing the same thing I did before, in terms of [INAUDIBLE] I want 10-mers, and then I'm going to read the data again with a slightly different function, which is going to return two associative arrays. One associative array is the same as before, which holds the actual counts, how many times each 10-mer appears. The other gives us a concatenated list of the positions, so if a 10-mer appears five times, it will hold a list of the five locations within the larger sequence, which then allows us to do those things I talked about before. So this takes a minute. Probably with some optimization, I have no doubt this could be made faster. And also, with the video recording going on, we have a lot of pressure on this personal computer.

Just some additional information here in terms of the whos command: we basically did whos on the A variables to show all the different associative arrays I have, as you can see. And this tallies up the bytes associated with each one. So our first associative array, as you saw, was about 700 kilobytes. That basically adds up all the data stored in all the data structures inside it, and it usually does a pretty good job of that.
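In sketch form, that kind of interactive poking around looks roughly like this. The row key is hypothetical (real keys come from the FASTA sequence IDs), and it assumes the values are 1's as in this first read:

    Arow = A1('AB000106.1,', :);   % pull one reference sequence's row (hypothetical key)
    nnz(Arow)                      % how many distinct 10-mers that sequence contains
    whos A*                        % list the A* associative arrays and their bytes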
And you can see there's a significant reduction in the total size. And in this case, we didn't really add that much by adding the keys, so this is a very nice dataset for us to do our CatKeyMul on. There was no dense diagonal or anything like that to get in our way, so that was a very useful thing to have.

At least we're now onto the second read, so almost there. And being able to click on an entry and have it print out works pretty well here when we do the spy plot, because our keys are relatively short. But we do truncate: if you have a row key or a column key that's very long, when we do the plots we truncate it, and then you would never actually know what the true full string was. I think we're just about finished there.

All right, so again, we read in the data, and now I want to make a degree distribution of the data. Just like the degree distributions we did earlier, it's a great way to get a sense of the overall statistics. So I took one of them, A1N. The way I do that is I have the associative array, A1. If I do Adj, the adjacency matrix, that basically just pops out the internal sparse matrix that's stored, and now I can do regular math on that. And we happen to have a little built-in function called OutDegree, which essentially does the appropriate sums in the out direction and returns a sparse matrix, which we can then plot in this log-log form (a rough sketch of this computation appears below).

So this is the reference data. You can see this distribution here, which is very power law-like, although, as we saw in the earlier classes, unless we properly fit and bin it, we wouldn't really know if it was a power law or something like it. But a first-order linear fit of a power law would at least be a good place to start, as it usually is with this data. There's the other one, the sample. It would appear to be even more power law-like, but again, until we bin it, we don't really know what happens.
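Here is the rough sketch of that out-degree computation, done with plain sparse operations rather than the course's OutDegree helper; which direction you take the degree in (rows or columns) is a choice, and this version counts how many sequences each 10-mer appears in:

    A1sp = Adj(A1);                         % pop out the internal sparse matrix
    deg  = full(sum(A1sp > 0, 1));          % for each 10-mer column, how many sequences contain it
    cnt  = accumarray(deg(deg > 0).', 1);   % how many 10-mers have each degree value
    [d, ~, c] = find(cnt);                  % nonzero (degree, count) pairs
    loglog(d, c, 'o');                      % degree distribution on log-log axes
    xlabel('degree'); ylabel('number of 10-mers');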
When we start binning this stuff up here, maybe it will bow up in some way, maybe it won't. Probably it would; probably it would look more like this.

A good question is: I showed you that other dataset, and it didn't look like this. You saw an almost log-normal, Gaussian-looking distribution. It's the same kind of data, still 10-mer data, so the only difference is the size. When you have 10-mers, since you have four choices of letter, you essentially have 4 to the 10th, about a million, total possible 10-mers. And this dataset, in fact, we can count: that's A1, and if we do nnz, we only have 274,000 entries, so we're not fully sampling those possibilities here. In the other dataset, the database was actually 500 million entries, so we really are fully sampling. Basically what's happening is that the rare 10-mers, the ones that don't normally show up, you're hitting them once, and eventually you begin to create that more Gaussian, bell-curve look when you start fully sampling.

I don't know if that's generally true or not, but one could hypothesize, and this is just a guess, that a lot of the data we see is power law because we're not fully sampling the total space, or that total space is growing as we sample it. So domain names, for example: we haven't fully sampled the entire space of domain names, and the set of domain names continues to grow at a fairly significant rate. Now, maybe one day there will stop being new domain names, and then when we start doing our sampling, maybe our distributions of domain names won't look so power law anymore, but more like a log-normal distribution with a kind of bell shape. Just a hypothesis, but certainly many of the datasets we have don't do that.

Now, of course, the easiest way for me to change this, instead of doing 10-mers, would be to do 20-mers.
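As a quick back-of-the-envelope sketch of why the word size matters so much, using the counts quoted above:

    % Sketch of the sampling argument: observed entries versus the size of the k-mer space.
    observed = nnz(A1);      % stored entries in the reference, about 274,000 here
    space10  = 4^10;         % possible 10-mers: 1,048,576, about a million
    space20  = 4^20;         % possible 20-mers: about 1.1e12
    observed / space10       % a noticeable fraction of the 10-mer space is touched
    observed / space20       % essentially zero: chance 20-mer collisions become very rare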
Now I've dramatically increased the space of possibilities, and so the odds of purely statistical matches piling up are very, very low. So that is certainly something people do, too. And in the case of the calculation we were doing, we just got lucky. We actually did not think a 10-mer would be the right sampling for that data; we thought it would be a higher number, like a 15-mer or a 20-mer. It turns out the 10-mer was actually spot on in terms of giving us a nice balance between sampling and statistical power, so that we could eliminate the more common occurrences.

All right, so we plotted the out degree again, a very useful technique. One should always understand the overall statistics of one's data. We're doing the same cross correlation here that we did before, except that in the previous dataset the value didn't store the count (that was a very fast read, and all the values were 1), whereas here we actually have a count. And in this particular correlation, we don't care that, well, this 10-mer appeared 10 times in this one sequence and 3 times in this other sequence, so when we multiply them together we get a 30. We're really more interested in how many unique 10-mers they each have, how many distinct ones they share. So we use the dblLogi (double logical) function to convert all our numbers back to just 0's and 1's, and then do that correlation again. And since the values don't matter when we do the CatKeyMul, we don't need to do that there; we just do the same correlation.

And so here we are: we're finding the sequence pairs that have more than an 8 match, more than eight shared 10-mers. We're actually doing some fairly inside tricks here to do that. You might ask, what is this putVal? Well, the first part is fairly clear: I'm taking the count and saying, please return the sub-associative array of all things with a value greater than 8, more than eight matches. All right, that's good.
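In sketch form, the distinct-match correlation and that greater-than-8 selection look roughly like this; the variable names are illustrative, not necessarily the ones in the BB2 script:

    A1b = dblLogi(A1);        % collapse counts to 0/1 so only distinct 10-mers count
    A2b = dblLogi(A2);
    AA  = A1b * A2b.';        % each value = number of distinct shared 10-mers
    AAbig = AA > 8;           % keep only the sequence pairs with more than 8 matches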
Then I'm flooring that and saying, make those all just 1's. And then I'm doing this trick here, which is called putVal, and I'm giving it this odd string. Right now this is just an associative array without string values. When I put that string in there, since all my entries have a value of 1 and I hand it one string, it makes all of those values a tilde. You might ask, why on earth would we do that? Well, it's because when I AND them together, we have to decide what happens mathematically. When we AND two associative arrays together, if you recall the lecture where we talked about collision functions, we have to pick a collision function for the two strings. This key array stores the pedigreed list, which I really want to keep, and this other one just shows me which pairs are the high matches, and I don't really care about its values. Well, the default collision function throughout d4m, when in doubt, is min. And tilde is lexicographically after every other printable character in the alphabet, so when I do a min here, I'm always going to get back the pedigree values. So there was only one entry greater than 8, and I ANDed them together. It did a min of tilde with this whole string here and said, aha, C is less than tilde, I'm just going to give you the C, and there we go. So this is just a cute trick, but it's a way of using the collision functions and the group theory we talked about in earlier lectures to very cleanly select the value that I want. And so that's what that does.

And then you can see that this sequence ID and this sequence ID had more than eight matches, and these were the exact 10-mers that matched. And then we can go back and look up those two sequences, compare them side by side in the original data, and we see these were the actual positions. And you would probably look at these and say, well, what do we think here? Well, 137 and 138, then 139, 140, 141, and then jumping all the way to 149.
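A tiny sketch of turning that eyeballing into code; the position list here is just the one read off the screen, used for illustration:

    % Group match positions into contiguous runs, so isolated hits stand out.
    pos  = [60 130 137 138 139 140 141 149 172];  % illustrative positions from above
    runs = cumsum([1, diff(pos) ~= 1]);           % new run id whenever the gap exceeds 1
    for r = unique(runs)
      fprintf('run %d: %s\n', r, mat2str(pos(runs == r)));
    end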
So it looks like those are all part of one long group, but 60 and 172 and 130 are probably isolated 10-mers. And then over here, what do we have? We have 111... no, but we do have 119, 120, 121, 122, 123, so that's probably also a bigger group match. In fact, those all line up with this, except for the 149 here. Well, actually, no, some of them match some of the time. So they're each part of a larger piece, but they aren't contiguous with each other. And that's the kind of thing that people who do this then really look at. Of course, a real match would be something with many more sequences; this is all just randomness and things like that.

All right, so that's the demo. I want to thank you. And again, next week, tell your friends who weren't here today that that is the one lecture. If you had to come to one lecture in this entire class, that's it. I didn't tell you that at the beginning, you notice, because then you would have skipped the other seven. But what we've done before will really make that one a lot easier. That's the real one, and when you leave that lecture you will be able to go run all those examples on the test databases we have there from your LLGrid accounts, and have all that kind of fun. And we really want you to do that, because we've released d4m on the web and we need people to test it. It's much better if you guys find the issues, and then we'll make it better. And likewise, it's a step toward you actually using the technology for your own applications. So thank you very much.