The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, we're going to get started here, everyone. I know we have a lot going on here at the lab today, and I'm very pleased that those of you here decided to choose us over all the other things that are going on. So that's great, I'm very pleased.

I wanted to say that I think I've provided feedback to everyone who submitted the first homework over the weekend, and I was very impressed and pleased by those of you who did it. I think you really embraced the spirit of the homework very well. The homework that was just assigned really builds on that one. For those of you who didn't do the first one, you're kind of left out in the cold.
But I would like to give my congratulations to those who took on that second homework, which definitely required a fair amount of contemplation even to attempt. So I want to thank those of you who did that.

So we're going to get started here, and we're going to bring up the slides. We're kind of entering the second of the three sections of the course. The first three lectures were really the first part: a lot of motivation and a lot of theory. We'll be continuing that trend -- we'll still always have plenty of motivation and a little bit of theory -- but we're getting more and more towards things that feel like the things you might actually really do. So we're entering that next phase of the course.

So let's get started. This is lecture 03. We'll pull up an example from some real data and talk about that. Here is the outline of the course.
Actually, just before we do that, a bit of somewhat mandatory review for those of you who are just joining us, or are watching on the web and have jumped ahead to this lecture. The title of the class is Signal Processing on Databases, which is two phrases we don't normally see together. Signal processing is really alluding to detection theory, formal detection theory, and its linear algebraic foundations. And databases is really alluding to dealing with strings, and unstructured data, and graphs. In this course, we're talking about a technology, D4M, which allows us to pull these two views together. So again, we're now on lecture 03, and we will continue on this journey that we have started.

Here's the outline. A little bit of history: I'm going to talk about how the web became the thing it is today, which I think says a lot about why we invented the D4M technology. So I'll talk about the specific gap that it fills, and talk a little bit about the D4M technology and some of the results that we've had.
And then, hopefully, today's demo is more substantive, and so we'll spend more time on that than we did last time, where the demo was really fairly small.

So for those of you who don't remember back in the ancient, primordial days of the web, back in the early 90s, this is what it looked like. If I talk about the hardware side of the web, it was quite simple. The hardware side of the web was all Sun boxes. We all had our Sun workstations, and there were clients, and there were servers, and there were databases. And the databases that we had at the time were SQL databases. Oracle was really just beginning to get going then. Sybase was the dominant player at that stage.

This is an example of the very first modern web page, or website, that was built. I'm sure there were other people doing very similar things at the same time, so it's among the first. I'm not going to be like Al Gore and say I invented the internet.
But a friend of mine, Don Beaudry, and I were working for a company called Vision Intelligence Corporation, which is now part of GE. And we had a data set in a SQL database, and we wanted to make it available. There were a lot of thick-client tools that came with our Sybase package for doing that, but we found them a little bit clunky.

And so my friend Don said, hey, I found this new software. I found a beta release of this software called NCSA Mosaic. For those who don't know, NCSA Mosaic was the first browser, and it was invented at the National Center for Supercomputing Applications. And they released it, and it was awesome. It changed everything.

Just so you know, prior to that -- and not to disrespect anybody who invented anything -- HTTP and HTML were dying. They had come out in 1989, and everyone was like, this is silly. What's this? Why do I have to type this HTTP and then this HTML? It's just very clunky.
And it was getting its clock pretty well cleaned by another technology called Gopher, which was developed at the University of Minnesota, and which was a much easier interface. The server was a lot better. It was a lot easier to do links than in HTML. It was just a menu-driven, text-based way to do that.

Well, Mosaic created a GUI front end to HTML and allowed you to have pictures, and that changed everything. It just caught on like wildfire. It was just a beta at that point in time, and I don't think you could really do pictures yet, but it rendered fonts nicely, and Gopher did not. Gopher was just plain text. So with Mosaic you could have underline and bold, and the links were very clear -- you could click on them. You could even have indents and bullets. You could imagine organizing something that would look like a web page today.

And so we had that, and we were like, oh, well, we'll talk to our database using that. We'll use HTTP put to post a request.
The Mosaic server was so bad that we actually just used the Gopher server to talk to the Mosaic browser. And so we would do puts, and then we used a language called Perl, which many of you know, because it actually had a direct binding to Sybase. And so we would take that, format it, and create SQL queries. And then it would send us back strings, which we would then format as HTML and input to the browser, and it was great. It was very nice, and it was sort of the first modern web page, which we did in 1992. And essentially a good fraction of the internet kind of looks like this today.

Our conclusion at that time was that this browser was a really lousy GUI; that HTTP was really trying to be a file system, but was a really bad file system; that Perl was really bad for analysis; and that SQL -- it's good for credit card transactions, but it wasn't really good for getting data in and out. So our conclusion was that this is an awful lot of work just to view some documents, and it won't catch on. Our conclusion was it won't catch on.
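The glue pattern just described -- a browser request turned into a SQL query, with the result strings formatted back as HTML -- can be sketched in a few lines. This is a minimal illustration only, in Python with an in-memory SQLite database and hypothetical table and column names; the original system used Perl with a direct Sybase binding.

```python
import sqlite3
from html import escape

def handle_request(db, prefix):
    """Turn one browser request parameter into a SQL query, and
    format the rows that come back as an HTML page."""
    rows = db.execute(
        "SELECT name, value FROM records WHERE name LIKE ?",
        (prefix + "%",),
    ).fetchall()
    # Format the returned strings as HTML list items for the browser.
    items = "".join(f"<li>{escape(n)}: {escape(v)}</li>" for n, v in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

# Tiny in-memory stand-in for the SQL database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (name TEXT, value TEXT)")
db.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("alpha", "1"), ("alphabet", "3"), ("beta", "2")],
)

page = handle_request(db, "alpha")
```

The same request-in, query-out, HTML-back loop is still how a large fraction of database-backed web pages work today.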
So with that conclusion, maybe I'm undermining my judgment for the entire course, given that we decided this was a bad idea.

So after that we had what we call the Cambrian explosion of the internet. On the hardware side, we had big, giant data centers emerging, merging servers and databases, which is great. We had all these new types of hardware technologies: laptops, and tablets, and iPhones, and all that kind of stuff. So the hardware technology went forward in leaps and bounds.

And then the browsers really took off, all of them. We have all these different browsers. You can still see I have the Mosaic logo back there, still showing through a little bit. For those of you who don't know, the author of Mosaic at NCSA was a guy by the name of Marc Andreessen. He left NCSA and formed a company called Netscape, which actually, at a moment in time, was the dominant -- they were the internet. They got crushed, although I think Marc Andreessen made a fair amount of money.
But they did open source all their software, and they basically rewrote Mosaic from the ground up, and that became Mozilla and Firefox and all that. So that's still around today. Safari may have borrowed from this; I don't really know.

When Netscape was dominating the internet, Microsoft needed to -- well, first they thought the internet was going to be a fad. And then they were like, all right, this is serious. We need a browser. Mosaic had actually been spun out of the University of Illinois into a company called Spyglass, which may actually still exist today. And then Microsoft bought the browser, which was the original NCSA Mosaic code, and that became Internet Explorer. So it's still around today.

Then there's Chrome, and as far as I know Chrome is a complete rewrite from the ground up. And probably some of the others, like Opera and other browsers that are around today, are complete rewrites too, but they still follow the same model: a browser you use. HTTP still uses the basic put/get syntax. SQL, we're still using today.
We've had a whole explosion of database technologies: Oracle being very dominant, but MySQL, SQL Server, and other types of things; Sybase still around, very common. And we still have two major servers out there, Apache and Windows. So Apache was the Mosaic server. And just so you know, Apache was "a patchy server," meaning it was very buggy. So it was "a patchy server," and that became Apache. So the new Mosaic server became Apache because it was a patchy server. And I bet they had borrowed code from Gopher, so that's why the Gopher, I think, is still showing there a little bit.

We've had a proliferation of languages that are very similar, in terms of gluing these two layers together: Java, et cetera, all that type of stuff.

But the important thing to recognize is that this whole architecture was not put together because somehow we felt the web browser was the optimal tool, or SQL was the optimal tool, or these languages were the optimal tool. They were just things that were lying around that could be repurposed for this job.
And so that's why they're not necessarily optimal. But nevertheless, our conclusion that it would not catch on has to be slightly revised, because obviously it did catch on. The desire to share information was so great that it overcame the high barrier to entry just to set up a web page or something like that.

Now things have changed significantly. We're not the first people to observe that this original architecture is somewhat broken for this type of model. And so today the web looks a lot more like this. You see all of our awesome client hardware, which is great. We have servers and databases that have now moved, in terms of the data center, into shipping containers and other types of things, which is great.

Increasingly, our interfaces look a lot more like video games. I would tell you that a typical app on the iPad feels a lot more like a video game than a browser, which is great, so we're really moving beyond the browser as the dominant way of getting our information.
And companies like Google and others recognized that if you're just presenting data to people, and you're not doing, say, credit card transactions, then a lot of the features of SQL and those types of databases are maybe more than you need, and you can actually make these very scalable databases more simply. They're called triple stores. Google Bigtable was the original one. Riak is another. Cassandra is another. HBase is another. Accumulo, which is the one we use in this class, is another. So this whole new technology space has really driven that and made it a lot more scalable.

But unfortunately, in the middle here, we're still very much left with much of the same technology. And this is kind of the last piece, this glue. And so our conclusion now is that it's a lot of work to view a lot of data, but it's a pretty great view, and we really can view a lot of data. But we still have this middle problem. There's this gap. And so we're trying to address that gap, and we address it with D4M.
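The triple-store idea can be illustrated with a small sketch: every record is a (row key, column key, value) triple, and the whole collection can be viewed as a very large sparse matrix indexed by strings. This is a hypothetical Python illustration of the concept only -- it is not the API of Accumulo, HBase, or D4M (which is written in Matlab), and the keys shown are made up.

```python
# Each record in a triple store is a (row key, column key, value) triple.
triples = [
    ("alice", "color|red",   1),
    ("alice", "city|boston", 1),
    ("bob",   "color|red",   1),
]

# Viewed as a sparse matrix: only the nonzero (row, col) entries are stored.
matrix = {(r, c): v for r, c, v in triples}

def get_row(matrix, row_key):
    """All column/value pairs for one row key -- i.e., one record."""
    return {c: v for (r, c), v in matrix.items() if r == row_key}

def get_col(matrix, col_key):
    """All row/value pairs for one column key -- an inverted index."""
    return {r: v for (r, c), v in matrix.items() if c == col_key}
```

Looking things up by row gives you a record; looking things up by column gives you an inverted index over all records. That both directions fall out of the same sparse-matrix view is the gap-filling idea the lecture is building toward.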
So D4M is formally designed from the ground up to work on triples, the kind of data we store in these modern databases, and to do that kind of analysis in less code than you would otherwise need.

D4M stands for Dynamic Distributed Dimensional Data Model, or, sometimes in quotes, "Databases for Matlab," because it happens to be written in Matlab. It connects triple stores, which can view their data as very large sparse matrices of triples, with the mathematical concept of associative arrays that we talked about last class. That's essentially what D4M does for you.

So that's one holistic view of the technology space that we're dealing with here. Another way to view it: I'm going to talk a little bit about the technologies that brought this to you. So I'm going to tell you a little bit about LLGrid and what we do there, because you're using your LLGrid accounts for this. LLGrid is the core supercomputer at Lincoln Laboratory.
And we're here to help with all your computing needs. Likewise, we're here to help with your D4M needs. If you have a project, and you think it would be useful to use D4M in that project, please contact us. You don't have to just take this class and think that you're on your own. We are a service. We will help you do the schemas, get the code going, get it adapted to your problem. That's what we do.

LLGrid -- we actually have well over this now, but just so I don't have to keep changing the slide, we always say it's about 500 users and about 2,000 processors. It's more than that. To our knowledge, it's really the world's only desktop interactive supercomputer. It's dramatically easier to use than any other supercomputer. And as a result, as an organization, we have the highest fraction of staff using supercomputing on a regular basis of any organization, just because it's so easy. And it's also been the foundation, the vision, for supercomputing in the entire state of Massachusetts. Our ideas have been adopted by other universities in the state and are moving forward.
Not really the technology that we're talking about today, but something that is really the bread and butter of LLGrid, is our Parallel Matlab technology, which will be celebrating its 10th anniversary this year. It's really a way to allow people to do parallel programming relatively easily. D4M interacts seamlessly with this technology, and so you will also be doing parallel database codes with it.

So we have this thing called Parallel Matlab. If you're an LLGrid user, you almost always get the book on Parallel Matlab. It allows you to take things like matrices and chop them up very easily to do complicated calculations. And almost every single parallel program that you want to write can be written with just the addition of a handful of functions. And we get to use what is called the distributed arrays parallel programming model, which is generally recognized as by far the best of all the parallel programming models, if you are comfortable with multi-dimensional arrays. Everyone at MIT is, so this is easy. However, this is not generally true.
Thinking in matrices is not a universal skill, or universal training, and so sometimes we do say this is software for the 1%. So, you know, it is not for everyone. But if you do have this knowledge and background, it allows you to do things very naturally. And as you know, we've talked about all of our associative arrays as a way of looking at data as matrices, and so these two technologies come together very nicely.

And we are really the only place in the world where you can routinely use this model when you do parallel programming. And you might say, well, why do I say it's the best? Well, we run contests. There's a contest every year where we bake off the best parallel programming models, and distributed arrays in high-level environments wins every year. So it's kind of the best model, if you can understand multi-dimensional arrays.

It wouldn't be a class on computing in 2012 unless we talked about the cloud, so we definitely need to talk about the cloud.
And to make it simple for people, we're going to divide the cloud into two halves. You can subdivide the cloud into many, many different components.

There's what we call utility cloud computing, which is what most people use the cloud for. So Gmail, enterprise services, calendars online, basic data sharing, photo sharing, and stuff like that -- that's what we call utility cloud computing. You can do human resources on the cloud. You can do accounting on the cloud. All this type of stuff is on the cloud, and it's a very, very successful business.

What we mostly focus on is what is called data intensive cloud computing. It's based on Hadoop and other types of technologies. But even still, a lot of that is technology that's used by large-scale cloud companies but that they don't really share with you. Google -- although its infrastructure is often associated with Hadoop, Google doesn't actually use Hadoop. Hadoop is based on a small part of the Google infrastructure. Google's infrastructure is vastly larger than that.
And they don't just let anybody log in and say, oh, go ahead, mine all the Google data. They expose services, allowing you to do some of that, but the core technology, they don't expose. All the large companies are like that: they don't expose it to you. They give you selected services and such. But this is really what we're focused on: this data intensive computing.

If I want to further subdivide the cloud from an implementer's perspective: if you own a lot of computing hardware, I mean hundreds of kilowatts or megawatts of computing hardware, you're most likely doing one of these four things.

You could be doing traditional supercomputing, which we all know and love, which is closest to what we do in LLGrid. You could be running a traditional database management system -- so every single time you use a credit card to make a purchase, you're probably connecting to a traditional database management system. You could be doing enterprise computing, which mostly, in this day and age, is run out of VMware. So this is the LLGrid logo. And so you could be doing that.
444 00:21:48,130 --> 00:21:50,470 So that's all the things I earlier mentioned. 445 00:21:50,470 --> 00:21:53,010 Or this new kid on the block, big data, 446 00:21:53,010 --> 00:21:58,380 which is all the buzz today, using Java and MapReduce 447 00:21:58,380 --> 00:22:00,070 and other types of things. 448 00:22:00,070 --> 00:22:03,220 The important thing to recognize is these four areas, 449 00:22:03,220 --> 00:22:08,830 each of these is a multi-tens-of-billions-of-dollars industry. 450 00:22:08,830 --> 00:22:10,660 The IT world has gotten large enough 451 00:22:10,660 --> 00:22:14,160 that we now see specialization down to the hardware level. 452 00:22:14,160 --> 00:22:16,530 The hardware, the processors, that these folks use 453 00:22:16,530 --> 00:22:17,732 are different. 454 00:22:17,732 --> 00:22:19,440 Likewise for these folks and these folks. 455 00:22:19,440 --> 00:22:21,910 You see specialization of chips, now, 456 00:22:21,910 --> 00:22:23,940 for these different types of things. 457 00:22:23,940 --> 00:22:27,550 So each of these is at the center of a multi-billion-dollar 458 00:22:27,550 --> 00:22:30,830 ecosystem, and they each have pros and cons. 459 00:22:30,830 --> 00:22:33,250 And sometimes you can do your mission wholly in one, 460 00:22:33,250 --> 00:22:34,600 but sometimes you have to cross. 461 00:22:34,600 --> 00:22:36,725 And when you do cross, it can be quite a challenge, 462 00:22:36,725 --> 00:22:40,570 because these worlds are not necessarily compatible. 463 00:22:40,570 --> 00:22:43,690 And I would say, just to help people to understand Hadoop, 464 00:22:43,690 --> 00:22:50,090 one must recognize that Java is the first-class 465 00:22:50,090 --> 00:22:53,840 citizen in the Hadoop world. 466 00:22:53,840 --> 00:22:56,490 The entire infrastructure is written in Java. 
467 00:22:56,490 --> 00:22:58,950 It's designed so that if you only know Java, 468 00:22:58,950 --> 00:23:01,860 you can actually administer and manage a cluster, which 469 00:23:01,860 --> 00:23:03,480 is why it's become so popular. 470 00:23:03,480 --> 00:23:04,980 There are so many Java programmers 471 00:23:04,980 --> 00:23:08,280 who just know Java, and they download Hadoop, 472 00:23:08,280 --> 00:23:11,450 and if they have rudimentary access to some computer systems, 473 00:23:11,450 --> 00:23:14,500 they can sort of cobble together a cluster. 474 00:23:14,500 --> 00:23:16,410 I should say the same thing occurred 475 00:23:16,410 --> 00:23:18,500 in the C and Fortran communities in the mid-90s. 476 00:23:18,500 --> 00:23:20,250 There was a technology called MPI, 477 00:23:20,250 --> 00:23:23,890 in fact this is the MPI logo today, which allowed you, 478 00:23:23,890 --> 00:23:28,810 if you knew C or Fortran and had rudimentary access to a network 479 00:23:28,810 --> 00:23:31,920 of workstations, to create a cluster with MPI. 480 00:23:31,920 --> 00:23:34,550 And that really sort of began the whole sort 481 00:23:34,550 --> 00:23:36,590 of parallel cluster computing revolution. 482 00:23:36,590 --> 00:23:38,180 Hadoop does the same thing in Java, 483 00:23:38,180 --> 00:23:40,660 and so for Java programmers it's been 484 00:23:40,660 --> 00:23:44,100 a huge success in that regard. 485 00:23:44,100 --> 00:23:46,310 Just so you know where we fit in with that, 486 00:23:46,310 --> 00:23:49,460 we have LLGrid here, which has really 487 00:23:49,460 --> 00:23:53,890 made traditional supercomputing feel interactive 488 00:23:53,890 --> 00:23:56,140 and on demand and elastic, so that's one of the things 489 00:23:56,140 --> 00:23:57,877 we do with LLGrid that's very unique. 490 00:23:57,877 --> 00:23:58,710 You launch your job. 491 00:23:58,710 --> 00:23:59,543 It runs immediately. 
492 00:23:59,543 --> 00:24:02,690 That is different than the way every other supercomputing 493 00:24:02,690 --> 00:24:05,130 center in the country is run. 494 00:24:05,130 --> 00:24:08,760 Most supercomputing centers are run by, you write your program, 495 00:24:08,760 --> 00:24:13,470 and you launch it, and you wait until the queue says 496 00:24:13,470 --> 00:24:15,110 that it is run. 497 00:24:15,110 --> 00:24:18,090 And in fact, some centers will even 498 00:24:18,090 --> 00:24:21,257 use, as a metric of success, how long their wait is. 499 00:24:21,257 --> 00:24:22,840 It's kind of like a restaurant, right? 500 00:24:22,840 --> 00:24:24,870 How many years do you have to wait 501 00:24:24,870 --> 00:24:27,690 to get a good reservation, or a good doctor right? 502 00:24:27,690 --> 00:24:32,910 And I remember someone saying, oh our system is so successful 503 00:24:32,910 --> 00:24:35,530 that our wait is a week. 504 00:24:35,530 --> 00:24:39,982 You hit submit and you will have to wait a week to run your job. 505 00:24:39,982 --> 00:24:41,690 That's not really consistent with the way 506 00:24:41,690 --> 00:24:42,990 we do business around here. 507 00:24:42,990 --> 00:24:44,590 You're under very tight deadlines. 508 00:24:44,590 --> 00:24:47,859 So we created a very different type of system 10 years ago, 509 00:24:47,859 --> 00:24:49,900 which sort of feels more like what you people are 510 00:24:49,900 --> 00:24:51,690 trying to do in the cloud. 511 00:24:51,690 --> 00:24:53,880 In the Hadoop community, there's been a lot of work 512 00:24:53,880 --> 00:24:55,390 to try and make databases. 513 00:24:55,390 --> 00:24:59,370 What these technologies give you is sort of efficient search, 514 00:24:59,370 --> 00:25:01,025 and so there are a number of databases 515 00:25:01,025 --> 00:25:03,150 that have been developed that sit on top of Hadoop, 516 00:25:03,150 --> 00:25:06,750 HBase being one, and Accumulo being another. 
517 00:25:06,750 --> 00:25:10,630 Accumulo, to my knowledge, is the highest performance 518 00:25:10,630 --> 00:25:12,370 open source triple store in the world, 519 00:25:12,370 --> 00:25:14,990 and probably will remain that way for a very long time. 520 00:25:14,990 --> 00:25:18,460 And so we will be using the Accumulo database. 521 00:25:18,460 --> 00:25:21,060 And then we've created bindings to that. 522 00:25:21,060 --> 00:25:23,420 So we have our D4M technology, which allows 523 00:25:23,420 --> 00:25:25,880 you to bind to databases. 524 00:25:25,880 --> 00:25:28,640 And we also have something called LLGrid MapReduce. 525 00:25:28,640 --> 00:25:31,400 MapReduce is sort of the core programming model 526 00:25:31,400 --> 00:25:33,560 within the Hadoop community. 527 00:25:33,560 --> 00:25:36,410 It's such a trivial model, it almost 528 00:25:36,410 --> 00:25:37,680 doesn't even deserve a name. 529 00:25:37,680 --> 00:25:41,630 It's basically like, write a program given a list of files. 530 00:25:41,630 --> 00:25:44,550 Each program runs on each file independently. 531 00:25:44,550 --> 00:25:46,540 I mean it's a very, very simple model. 532 00:25:46,540 --> 00:25:49,710 It's been around since the beginning of computing, 533 00:25:49,710 --> 00:25:55,040 but the name has become popularized here, 534 00:25:55,040 --> 00:25:58,910 and we now have a way to do MapReduce right in LLGrid. 535 00:25:58,910 --> 00:26:01,610 It's very easy, and people enjoy it. 536 00:26:01,610 --> 00:26:04,170 You can, of course, do the same thing 537 00:26:04,170 --> 00:26:07,860 with other technology in LLGrid, but for those people, 538 00:26:07,860 --> 00:26:10,620 particularly if you're writing Java or-- not Matlab. 539 00:26:10,620 --> 00:26:12,940 We don't recommend people use this for Matlab, 540 00:26:12,940 --> 00:26:15,920 because we have better technologies for doing that. 
541 00:26:15,920 --> 00:26:20,720 But if you're using Python, or Java, or other programs, 542 00:26:20,720 --> 00:26:23,990 then MapReduce is there for you. 543 00:26:23,990 --> 00:26:26,580 As we move towards combining-- so our big vision here 544 00:26:26,580 --> 00:26:28,750 is to combine big compute and big data, 545 00:26:28,750 --> 00:26:31,290 and so we have a whole stack here 546 00:26:31,290 --> 00:26:33,690 as we deal with new applications, 547 00:26:33,690 --> 00:26:38,000 text, cyber, bio, other types of things, the new APIs 548 00:26:38,000 --> 00:26:40,740 that people write, the new types of distributed databases. 549 00:26:40,740 --> 00:26:44,990 So it's really affecting everything. 550 00:26:44,990 --> 00:26:48,792 Just an example of the Hadoop architecture. 551 00:26:48,792 --> 00:26:49,947 You've had this course. 552 00:26:49,947 --> 00:26:51,530 You should at least be able, in a few minutes, 553 00:26:51,530 --> 00:26:53,446 if someone asks, well, what is Hadoop, 554 00:26:53,446 --> 00:26:54,700 to say what it does. 555 00:26:54,700 --> 00:26:56,820 This is the basic architecture. 556 00:26:56,820 --> 00:27:00,080 So you have a Hadoop MapReduce job. 557 00:27:00,080 --> 00:27:02,790 You submit it, and it goes to a JobTracker. 558 00:27:02,790 --> 00:27:07,170 The JobTracker then breaks that up into subtasks, each of which 559 00:27:07,170 --> 00:27:08,720 has its own TaskTracker. 560 00:27:08,720 --> 00:27:14,812 Those tasks then get sent to the DataNodes in the architecture, 561 00:27:14,812 --> 00:27:17,100 and they run on those DataNodes, and the NameNode 562 00:27:17,100 --> 00:27:19,440 keeps track of the actual names of the things. 563 00:27:19,440 --> 00:27:21,710 It's just a very simple Hadoop cluster. 564 00:27:21,710 --> 00:27:24,740 These are the different types of nodes 565 00:27:24,740 --> 00:27:27,230 and processes in the architecture. 
566 00:27:29,830 --> 00:27:32,360 Hadoop's strengths and weaknesses, well it 567 00:27:32,360 --> 00:27:35,405 does allow you to distribute processing of large data. 568 00:27:39,580 --> 00:27:45,450 Best use case is, let's say you had an enormous number of log 569 00:27:45,450 --> 00:27:50,850 files, and you decided you only wanted to search them once, 570 00:27:50,850 --> 00:27:54,500 for one string. 571 00:27:54,500 --> 00:27:57,075 That's the perfect use case for Hadoop. 572 00:27:57,075 --> 00:27:59,450 If you decide, though, that you would like to actually do 573 00:27:59,450 --> 00:28:04,750 more than one search, then it doesn't necessarily 574 00:28:04,750 --> 00:28:06,290 always make sense. 575 00:28:06,290 --> 00:28:09,030 So it's basically designed to run grep 576 00:28:09,030 --> 00:28:11,690 on an enormous number of files. 577 00:28:11,690 --> 00:28:13,610 And I kid you not, there are companies today 578 00:28:13,610 --> 00:28:16,680 that actually do this, like at production scale. 579 00:28:16,680 --> 00:28:18,740 All their analytics are done by just grepping 580 00:28:18,740 --> 00:28:21,620 enormous numbers of log files. 581 00:28:21,620 --> 00:28:25,940 Needless to say, the entire database community cringes, 582 00:28:25,940 --> 00:28:29,990 and rolls over in their graves if they are no longer alive, at this. 583 00:28:29,990 --> 00:28:33,740 Because we have solved this problem 584 00:28:33,740 --> 00:28:36,750 by indexing our data, and that allows you to do fast search. 585 00:28:36,750 --> 00:28:38,650 And that's why people have invented 586 00:28:38,650 --> 00:28:40,470 databases like HBase and Accumulo, 587 00:28:40,470 --> 00:28:43,730 which sit on top of Hadoop, because they recognize that, 588 00:28:43,730 --> 00:28:46,150 really, if you're going to search more than once, 589 00:28:46,150 --> 00:28:48,660 you should index your data. 590 00:28:48,660 --> 00:28:51,480 Again, it is very scalable. 
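The grep-once use case the lecture describes can be sketched in a few lines of Python. This is a toy stand-in for a real Hadoop Java job, with made-up file names; the point is just the shape of it: each map task scans one file independently, and a reduce step concatenates the results.

```python
import os
import re
import tempfile

def grep_map(path, pattern):
    """Map task: scan one log file independently for matching lines."""
    regex = re.compile(pattern)
    with open(path) as f:
        return [line.rstrip("\n") for line in f if regex.search(line)]

def grep_reduce(per_file_hits):
    """Reduce task: concatenate the per-file results."""
    return [hit for hits in per_file_hits for hit in hits]

# Two throwaway "log files" standing in for an enormous collection.
tmp = tempfile.mkdtemp()
for name, text in [("a.log", "ok\nERROR disk full\n"),
                   ("b.log", "ok\nok\nERROR net down\n")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

files = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))
hits = grep_reduce([grep_map(p, r"ERROR") for p in files])
print(hits)  # the two ERROR lines
```

Because every file is scanned independently, the map calls can be farmed out to as many nodes as you have files, which is exactly why the single-search case scales so well, and why a second search costs you a full rescan.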
591 00:28:51,480 --> 00:28:54,980 It is fundamentally designed to have extremely 592 00:28:54,980 --> 00:28:57,160 unreliable hardware in it. 593 00:28:57,160 --> 00:28:58,790 So it is quite resilient. 594 00:28:58,790 --> 00:29:00,810 It does this resiliency at a cost. 595 00:29:00,810 --> 00:29:04,210 It relies on a significant amount of replication. 596 00:29:04,210 --> 00:29:07,370 Typically the standard replication in a Hadoop cluster 597 00:29:07,370 --> 00:29:10,500 is a factor of 3. 598 00:29:10,500 --> 00:29:12,630 If you're in the high performance storage business, 599 00:29:12,630 --> 00:29:15,110 this also makes you cringe, because you're 600 00:29:15,110 --> 00:29:17,160 paying 3x for your storage. 601 00:29:17,160 --> 00:29:21,870 We have techniques like RAID, which allow us to do that much more efficiently. 602 00:29:21,870 --> 00:29:25,070 But again, the expertise required to set up a cluster 603 00:29:25,070 --> 00:29:28,230 and do 3x replication is relatively low, 604 00:29:28,230 --> 00:29:30,020 and so that makes it very appealing 605 00:29:30,020 --> 00:29:35,020 to many, many, many folks, and so it's very useful for that. 606 00:29:35,020 --> 00:29:38,740 Some of the difficulties: the scheduler in Hadoop 607 00:29:38,740 --> 00:29:42,170 is very immature. 608 00:29:42,170 --> 00:29:44,320 Schedulers are very well defined. 609 00:29:44,320 --> 00:29:46,370 There's about two or three standard schedulers 610 00:29:46,370 --> 00:29:48,100 in the supercomputing community. 611 00:29:48,100 --> 00:29:49,490 They're each about 20 years old. 612 00:29:49,490 --> 00:29:52,430 They all have the same 200 features, 613 00:29:52,430 --> 00:29:54,670 and you need about 200 features to really do 614 00:29:54,670 --> 00:29:56,320 a proper scheduler. 615 00:29:56,320 --> 00:29:59,220 Hadoop is about four or five years old, 616 00:29:59,220 --> 00:30:00,890 and it's got about that many features. 617 00:30:00,890 --> 00:30:05,260 And so you do often have to deal with collisions. 
618 00:30:05,260 --> 00:30:08,110 It's easy for one user to hog and monopolise 619 00:30:08,110 --> 00:30:09,910 the entire cluster. 620 00:30:09,910 --> 00:30:11,950 You're often dealing with various overheads 621 00:30:11,950 --> 00:30:14,032 from the scheduler itself. 622 00:30:14,032 --> 00:30:15,990 And this is well known to the Hadoop community. 623 00:30:15,990 --> 00:30:18,660 They're working on it actively, and we'll 624 00:30:18,660 --> 00:30:23,570 see all that-- it's not an easy multi-user environment. 625 00:30:23,570 --> 00:30:26,380 And as I've said before, it fundamentally 626 00:30:26,380 --> 00:30:30,800 relies on the fact that the JVM is on every node. 627 00:30:30,800 --> 00:30:34,020 Because when you're sending a program to every node, 628 00:30:34,020 --> 00:30:37,150 the interpreter for that must exist on every node. 629 00:30:37,150 --> 00:30:39,440 And by definition, the only language 630 00:30:39,440 --> 00:30:41,890 that you're guaranteed to have on every node in a Hadoop 631 00:30:41,890 --> 00:30:43,720 cluster is Java. 632 00:30:43,720 --> 00:30:46,890 Any other language, any other tool, 633 00:30:46,890 --> 00:30:49,570 has to be installed and become a part 634 00:30:49,570 --> 00:30:53,600 of the image for the entire system 635 00:30:53,600 --> 00:30:56,030 so that it can be run on every node. 636 00:30:56,030 --> 00:30:58,570 So that is something that one has to be aware of. 637 00:30:58,570 --> 00:31:00,153 And we've certainly seen it's like, oh 638 00:31:00,153 --> 00:31:02,270 I wrote these great programs in another language. 639 00:31:02,270 --> 00:31:04,020 And it's like, well can't I just run them? 640 00:31:04,020 --> 00:31:07,800 It's like no, that doesn't exist. 641 00:31:07,800 --> 00:31:11,220 Even distributing a fat binary can be difficult, 642 00:31:11,220 --> 00:31:13,980 because it has to be distributed through the Hadoop Distributed 643 00:31:13,980 --> 00:31:14,965 File system. 
644 00:31:14,965 --> 00:31:16,590 All data is distributed there, and that 645 00:31:16,590 --> 00:31:20,140 is not-- you have to essentially write a wrapper program that 646 00:31:20,140 --> 00:31:22,270 then recognizes where to get your binary, 647 00:31:22,270 --> 00:31:27,190 and then executes that thing, and it can be tricky. 648 00:31:27,190 --> 00:31:29,640 But the basic LLGrid MapReduce architecture 649 00:31:29,640 --> 00:31:31,990 simplifies this significantly, although this maybe 650 00:31:31,990 --> 00:31:33,468 doesn't look like it. 651 00:31:36,620 --> 00:31:40,550 Essentially you call LLGrid MapReduce. 652 00:31:40,550 --> 00:31:43,680 It launches a bunch of what are called mapper tasks. 653 00:31:43,680 --> 00:31:45,920 It runs them on your different input files. 654 00:31:45,920 --> 00:31:48,270 When they are done, they have created output files. 655 00:31:48,270 --> 00:31:50,200 And then it runs another program, 656 00:31:50,200 --> 00:31:52,610 if you specified one, that combines the output. 657 00:31:52,610 --> 00:31:54,650 So that's the basic model of MapReduce, 658 00:31:54,650 --> 00:31:58,210 which is you have what's called a map program 659 00:31:58,210 --> 00:32:01,360 and a reduce program, and the map program is 660 00:32:01,360 --> 00:32:02,400 given a list of files. 661 00:32:02,400 --> 00:32:04,220 It runs on those files, one at a time. 662 00:32:04,220 --> 00:32:06,230 Each one of them generates an output, 663 00:32:06,230 --> 00:32:08,870 and the reduce program pulls them all together. 664 00:32:08,870 --> 00:32:12,960 All right, so that's a little tutorial on Hadoop, 665 00:32:12,960 --> 00:32:16,540 because I felt an obligation to explain that to you. 666 00:32:16,540 --> 00:32:19,890 We couldn't be a big data course without doing that. 667 00:32:19,890 --> 00:32:22,650 But it's very simple, and very popular, 668 00:32:22,650 --> 00:32:27,070 and I expect it to maintain its popularity for a very long time. 
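The map-then-reduce file flow just described -- mappers run independently on input files and write output files, then an optional combining program pulls those together -- can be sketched in Python. This is a toy word count, not the actual LLGrid MapReduce API; the file names and the tab-separated output format are made up for illustration.

```python
import collections
import os
import tempfile

def mapper(in_path):
    """Mapper task: count words in one input file, write '<input>.out'."""
    counts = collections.Counter(open(in_path).read().split())
    out_path = in_path + ".out"
    with open(out_path, "w") as f:
        for word, n in counts.items():
            f.write(f"{word}\t{n}\n")
    return out_path

def reducer(out_paths):
    """Reducer task: combine the per-file outputs into global counts."""
    total = collections.Counter()
    for path in out_paths:
        for line in open(path):
            word, n = line.split("\t")
            total[word] += int(n)
    return dict(total)

# Two throwaway input files.
tmp = tempfile.mkdtemp()
for name, text in [("x.txt", "to be or not to be"), ("y.txt", "be")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

inputs = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))
result = reducer([mapper(p) for p in inputs])
print(result)
```

The mappers never talk to each other, which is what makes the model so easy to distribute; all of the coordination lives in the file system and the final reduce step.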
669 00:32:27,070 --> 00:32:29,140 I think it will evolve and get better, 670 00:32:29,140 --> 00:32:32,830 but for the vast majority of people on planet Earth, 671 00:32:32,830 --> 00:32:34,824 this is really the most accessible form 672 00:32:34,824 --> 00:32:37,240 of parallel computing technology that they have available. 673 00:32:37,240 --> 00:32:40,070 So we should all be aware of Hadoop, and its existence, 674 00:32:40,070 --> 00:32:41,320 and what it can do. 675 00:32:41,320 --> 00:32:43,830 Because many of our customers use Hadoop, 676 00:32:43,830 --> 00:32:47,611 and we need to work with them. 677 00:32:47,611 --> 00:32:50,260 All right, so getting back now to D4M. 678 00:32:50,260 --> 00:32:52,310 I think I've mentioned a lot of this before. 679 00:32:52,310 --> 00:32:55,560 The core concept of D4M is the multi-dimensional associative 680 00:32:55,560 --> 00:32:56,260 array. 681 00:32:56,260 --> 00:32:59,080 Again, D4M is designed to sort of overcome that. 682 00:32:59,080 --> 00:33:03,300 For those of us who have more mathematical expertise, 683 00:33:03,300 --> 00:33:05,650 we can do much more sophisticated things 684 00:33:05,650 --> 00:33:08,530 than you might be able to do in Hadoop. 685 00:33:08,530 --> 00:33:13,970 Again, it allows you to look at your data in four ways at once. 686 00:33:13,970 --> 00:33:18,200 You can view it as 2D matrices, and reference rows and columns 687 00:33:18,200 --> 00:33:21,190 with strings, and have values that are strings. 688 00:33:21,190 --> 00:33:23,030 It's one-to-one with a triple store, 689 00:33:23,030 --> 00:33:25,730 so you can easily connect to databases. 690 00:33:25,730 --> 00:33:30,140 Again, it looks like matrices, so you can do linear algebra. 691 00:33:30,140 --> 00:33:33,640 And also, through the duality between adjacency matrices 692 00:33:33,640 --> 00:33:37,990 and graphs, you can think about your data as graphs. 693 00:33:37,990 --> 00:33:41,540 This is composable, as I've said before. 
694 00:33:41,540 --> 00:33:44,640 Almost all of the operations that are performed on an associative 695 00:33:44,640 --> 00:33:46,860 array return another associative array, 696 00:33:46,860 --> 00:33:50,870 and so we can do things like add them, subtract them, and them, 697 00:33:50,870 --> 00:33:54,230 or them, multiply them, and we can do very complicated queries 698 00:33:54,230 --> 00:33:55,212 very easily. 699 00:33:55,212 --> 00:33:56,920 And these can work on associative arrays. 700 00:33:56,920 --> 00:34:01,270 And if we're bound to tables, they can also work on those. 701 00:34:01,270 --> 00:34:03,100 Speaking of tables, I've already talked 702 00:34:03,100 --> 00:34:05,720 about the schema that we always use in this class. 703 00:34:05,720 --> 00:34:08,500 So your standard data might look like this, 704 00:34:08,500 --> 00:34:12,159 say this is a cyber record of a source IP, domain, 705 00:34:12,159 --> 00:34:16,130 and destination IP, this is sort of the standard tabular view. 706 00:34:16,130 --> 00:34:21,340 We explode it by taking the column and appending the value to it, 707 00:34:21,340 --> 00:34:25,310 which creates this very large, sparse table here 708 00:34:25,310 --> 00:34:28,767 which will naturally go into our triple store. 709 00:34:28,767 --> 00:34:30,350 Of course, by itself it doesn't really 710 00:34:30,350 --> 00:34:33,370 gain us anything, because most tables 711 00:34:33,370 --> 00:34:35,474 are either row based or column based. 712 00:34:35,474 --> 00:34:38,320 The databases that we are working with are row based, 713 00:34:38,320 --> 00:34:41,250 which allow you to do fast lookup of a row key. 714 00:34:41,250 --> 00:34:44,360 However, once we've exploded the schema, if we also 715 00:34:44,360 --> 00:34:49,010 store the transpose, we now can index everything here quickly 716 00:34:49,010 --> 00:34:51,210 and efficiently. 
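The exploded schema just described can be sketched in a few lines of Python. The row ID, column names, and the "|" separator here are illustrative choices, not the exact D4M conventions: the idea is that each (column, value) pair becomes its own sparse column, and storing the transposed triples alongside gives fast lookup by column as well as by row.

```python
# One tabular record, keyed by a hypothetical row ID.
record_id = "log-0001"
record = {"srcIP": "128.0.0.1", "domain": "example.com", "destIP": "10.1.2.3"}

# Explode: each column becomes "column|value" with a value of 1,
# yielding (row, column, value) triples for the triple store.
triples = [(record_id, f"{col}|{val}", 1) for col, val in record.items()]

# Store the transpose too, so column lookups are as fast as row lookups.
transpose = [(col_val, row, v) for row, col_val, v in triples]

print(triples[0])  # ('log-0001', 'srcIP|128.0.0.1', 1)
```

With both orientations stored, a row-oriented database can answer "all columns of this record" from the first table and "all records with this value" from the second, which is what makes every string look indexed.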
717 00:34:51,210 --> 00:34:53,290 It looks like we have indexed every single string 718 00:34:53,290 --> 00:34:55,659 in the database with one schema, which 719 00:34:55,659 --> 00:34:59,050 is a very nice, very powerful stepping stone 720 00:34:59,050 --> 00:35:02,472 for getting results. 721 00:35:02,472 --> 00:35:04,180 All right, I'm going to talk a little bit 722 00:35:04,180 --> 00:35:06,120 about what we have done here. 723 00:35:06,120 --> 00:35:08,180 So I'm just going to go over some basic analytics 724 00:35:08,180 --> 00:35:09,960 and show you a little bit more code. 725 00:35:09,960 --> 00:35:11,830 So for example, here is our table. 726 00:35:14,730 --> 00:35:17,390 So this could be viewed as a sparse matrix. 727 00:35:17,390 --> 00:35:19,770 We have various source IPs here. 728 00:35:19,770 --> 00:35:23,470 And I want to just compute some very elementary statistics 729 00:35:23,470 --> 00:35:25,840 on this data. 730 00:35:25,840 --> 00:35:30,910 I should say, computing things like sums and averages 731 00:35:30,910 --> 00:35:36,340 may not sound like very sophisticated statistics. 732 00:35:36,340 --> 00:35:39,460 It's still extraordinarily powerful, 733 00:35:39,460 --> 00:35:42,830 and we are amazed at how valuable it 734 00:35:42,830 --> 00:35:46,320 is, because it usually shows you right 735 00:35:46,320 --> 00:35:50,730 away the bad data in your data. 736 00:35:50,730 --> 00:35:53,680 And you will always have bad data. 737 00:35:53,680 --> 00:35:56,640 And this is probably the first bit of value 738 00:35:56,640 --> 00:35:59,320 add we provide to almost any customer 739 00:35:59,320 --> 00:36:03,160 that we work with on D4M: it's the first time 740 00:36:03,160 --> 00:36:05,980 anybody's actually sort of looked 741 00:36:05,980 --> 00:36:09,270 at their data in totality, and been able to do 742 00:36:09,270 --> 00:36:10,670 these types of things. 
743 00:36:10,670 --> 00:36:12,567 And so usually the first thing we do is like, 744 00:36:12,567 --> 00:36:14,400 all right, we've loaded their data into the schema, 745 00:36:14,400 --> 00:36:15,649 and we've done some basic sums. 746 00:36:15,649 --> 00:36:19,260 And then we'll say, do you know that you have the following 747 00:36:19,260 --> 00:36:22,590 stuck-on switches? 748 00:36:22,590 --> 00:36:25,070 You know, 8% of your data all has 749 00:36:25,070 --> 00:36:29,040 this value in this column, which absolutely can't be right, 750 00:36:29,040 --> 00:36:29,800 you know? 751 00:36:29,800 --> 00:36:32,836 And 8% could be a low enough number that you might just not 752 00:36:32,836 --> 00:36:34,460 encounter it through routine traversal, 753 00:36:34,460 --> 00:36:39,380 but it will stand out extremely clearly in just the most 754 00:36:39,380 --> 00:36:40,730 rudimentary histogram. 755 00:36:40,730 --> 00:36:43,714 And so that's an incredibly valuable piece of [INAUDIBLE]. 756 00:36:43,714 --> 00:36:45,380 They'll usually be like, oh my goodness, 757 00:36:45,380 --> 00:36:47,430 there's something broken in our system. 758 00:36:47,430 --> 00:36:50,920 Or you'll be like, you have this column, 759 00:36:50,920 --> 00:36:52,954 but it's almost never filled in, and it sure 760 00:36:52,954 --> 00:36:54,370 looks like it should be filled in, 761 00:36:54,370 --> 00:36:57,830 so these switches are stuck off. 762 00:36:57,830 --> 00:37:00,360 I encourage you, when you first get the data, 763 00:37:00,360 --> 00:37:03,610 to just kind of do the basic statistics on it 764 00:37:03,610 --> 00:37:07,550 and see where things are working and where things are broken. 765 00:37:07,550 --> 00:37:12,301 And usually you can then fix them. 766 00:37:12,301 --> 00:37:14,175 And most of the time when we tell a customer, 767 00:37:14,175 --> 00:37:16,080 it's like, we're just letting you 768 00:37:16,080 --> 00:37:18,010 know these are issues in your data. 
769 00:37:18,010 --> 00:37:19,110 We can work around them. 770 00:37:19,110 --> 00:37:21,660 We can sort of ignore them and still proceed, 771 00:37:21,660 --> 00:37:23,209 or you can fix them yourselves. 772 00:37:23,209 --> 00:37:24,750 And almost always they're like, no, we 773 00:37:24,750 --> 00:37:25,930 want to fix that ourselves. 774 00:37:25,930 --> 00:37:27,130 That's something that's fundamentally 775 00:37:27,130 --> 00:37:28,005 broken in our system. 776 00:37:28,005 --> 00:37:29,964 We want to make sure that that data is correct. 777 00:37:29,964 --> 00:37:31,463 Because usually that data is flowing 778 00:37:31,463 --> 00:37:33,410 into all sorts of other places and being acted on, 779 00:37:33,410 --> 00:37:35,750 and so that's one thing we're going to do. 780 00:37:35,750 --> 00:37:38,590 So we're going to do some basic statistics here. 781 00:37:38,590 --> 00:37:43,380 We're going to compute how many times each column appears. 782 00:37:43,380 --> 00:37:48,130 We're going to compute how many times each column type appears. 783 00:37:48,130 --> 00:37:51,380 We'll compute some covariance matrices, just some very, very 784 00:37:51,380 --> 00:37:52,962 simple types of things here. 785 00:38:01,870 --> 00:38:05,070 All right, let's move on here. 786 00:38:05,070 --> 00:38:08,420 So this is the basic implementation. 787 00:38:08,420 --> 00:38:11,825 So we're going to give it a set of rows 788 00:38:11,825 --> 00:38:12,950 that we're looking at here. 789 00:38:12,950 --> 00:38:15,210 This could be very large. 790 00:38:15,210 --> 00:38:19,220 We have a table binding, T, so we say, please 791 00:38:19,220 --> 00:38:23,930 go and get me all those rows. 792 00:38:23,930 --> 00:38:25,520 So we get the whole swath of rows, 793 00:38:25,520 --> 00:38:28,730 and that will be returned as an associative array. 
794 00:38:28,730 --> 00:38:32,380 And then, normally I shorten this to double logical. 795 00:38:32,380 --> 00:38:34,912 Since the table always contains strings, 796 00:38:34,912 --> 00:38:36,620 it returns string values, and the first thing 797 00:38:36,620 --> 00:38:37,590 we need to do is get rid of those, 798 00:38:37,590 --> 00:38:38,882 because we're going to do math. 799 00:38:38,882 --> 00:38:41,256 So we turn them into logicals, and then we turn them back 800 00:38:41,256 --> 00:38:42,520 into doubles so we can do math. 801 00:38:46,060 --> 00:38:48,160 You can actually pass regular expressions. 802 00:38:48,160 --> 00:38:49,900 You can also do the StartsWith command. 803 00:38:49,900 --> 00:38:52,110 We want to get the source IP. 804 00:38:52,110 --> 00:38:54,269 We're just interested in source IP and domain. 805 00:38:54,269 --> 00:38:56,560 And the first thing we do is find some popular columns. 806 00:38:56,560 --> 00:38:59,450 So we just type sum, and that shows us the popular columns. 807 00:38:59,450 --> 00:39:01,300 If we want to find popular pairs, 808 00:39:01,300 --> 00:39:04,370 it's a covariance matrix calculation there, 809 00:39:04,370 --> 00:39:10,270 or square in, and find domains with many destination IPs. 810 00:39:10,270 --> 00:39:13,954 As an example, in this data set, 811 00:39:13,954 --> 00:39:15,120 the data set that we had, 812 00:39:15,120 --> 00:39:20,670 these were the most popular things that appeared here. 813 00:39:20,670 --> 00:39:23,280 And you can see, this is all fairly reasonable stuff, 814 00:39:23,280 --> 00:39:29,500 L.L.Bean, a lot of New England Patriots fans in this data set, 815 00:39:29,500 --> 00:39:31,870 a lot of ads, Staples. 816 00:39:31,870 --> 00:39:34,120 These are the types of things it shows you right away. 817 00:39:37,780 --> 00:39:41,020 Here's the covariance matrix of this data. 818 00:39:41,020 --> 00:39:43,860 So as you can see, it's symmetric. 
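The sum and covariance analytics described here can be sketched in plain Python. A tiny dense 0/1 matrix stands in for the sparse associative array, and the exploded column names are made up; the point is that "popular columns" is a column sum, and "popular pairs" is the transpose-times-itself product.

```python
# A tiny 0/1 document-by-exploded-column matrix, dense for clarity.
cols = ["domain|llbean.com", "domain|patriots.com", "destIP|10.0.0.1"]
A = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]]

# "Popular columns": sum down each column.
col_sums = [sum(row[j] for row in A) for j in range(len(cols))]

# "Popular pairs": A' * A counts how often each pair of columns
# co-occurs in the same row -- the covariance-style calculation.
AtA = [[sum(A[i][j] * A[i][k] for i in range(len(A)))
        for k in range(len(cols))]
       for j in range(len(cols))]

print(col_sums)  # [3, 2, 2]
print(AtA[0][1])  # rows where the two domains co-occur: 2
```

Note that the result is symmetric and its diagonal is just the column sums, which matches the structure pointed out on the covariance slide.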
819 00:39:43,860 --> 00:39:48,320 Obviously there is a diagonal, and it 820 00:39:48,320 --> 00:39:50,000 has a bipartite structure. 821 00:39:50,000 --> 00:39:54,900 That is, there are no destination-IP-to-destination-IP links. 822 00:39:54,900 --> 00:39:58,390 And we have source IP to destination IP, to domain, and all 823 00:39:58,390 --> 00:40:00,600 that type of stuff. 824 00:40:00,600 --> 00:40:03,745 The covariance matrix is often an extremely helpful thing 825 00:40:03,745 --> 00:40:05,620 to look at, because it will be the first time 826 00:40:05,620 --> 00:40:08,220 it shows you the full interrelated structure 827 00:40:08,220 --> 00:40:09,150 of your data. 828 00:40:09,150 --> 00:40:13,210 And you can quickly identify dense rows, dense columns, 829 00:40:13,210 --> 00:40:16,360 blocks, chunks, the things that you want to look at, 830 00:40:16,360 --> 00:40:18,110 chunks that aren't going to be interesting 831 00:40:18,110 --> 00:40:20,690 because they're very, very dense or very, very sparse. 832 00:40:20,690 --> 00:40:24,160 Again, just a very basic sort of survey type of tool. 833 00:40:27,310 --> 00:40:29,880 A little shout out here to our colleagues 834 00:40:29,880 --> 00:40:32,130 in the other group, who developed Structured Knowledge 835 00:40:32,130 --> 00:40:32,630 Space. 836 00:40:32,630 --> 00:40:35,560 Structured Knowledge Space is a very powerful analytic 837 00:40:35,560 --> 00:40:38,240 system. 838 00:40:38,240 --> 00:40:41,410 I kind of tell people it's like the Google search 839 00:40:41,410 --> 00:40:43,850 where when you type a search key, 840 00:40:43,850 --> 00:40:47,120 it starts guessing what your next answer is. 841 00:40:47,120 --> 00:40:50,307 But Google does that based on sort of a long sort of analysis 842 00:40:50,307 --> 00:40:52,640 of all the different types of searches people have done, 843 00:40:52,640 --> 00:40:54,250 and it can make a good guess. 
844 00:40:54,250 --> 00:40:58,650 SKS does that much more deeply, in that it maintains a database 845 00:40:58,650 --> 00:41:00,400 on a collection of documents. 846 00:41:00,400 --> 00:41:05,620 And in this case, if you type Afghanistan, 847 00:41:05,620 --> 00:41:08,397 it will then actually go and count the occurrences 848 00:41:08,397 --> 00:41:10,480 of different types of entities in those documents, 849 00:41:10,480 --> 00:41:13,400 and then show you what the possible next key word 850 00:41:13,400 --> 00:41:16,800 choices can be, based on the actual data itself, 851 00:41:16,800 --> 00:41:19,980 not on a set of cached queries 852 00:41:19,980 --> 00:41:21,924 and other types of things. 853 00:41:21,924 --> 00:41:23,840 I don't really know what Google actually does. 854 00:41:23,840 --> 00:41:24,710 I'm just guessing. 855 00:41:24,710 --> 00:41:27,450 I'm sure they do use some data for guiding their search, 856 00:41:27,450 --> 00:41:30,460 but this is a fairly sophisticated analytic. 857 00:41:30,460 --> 00:41:32,954 And one of the big sort of wins, where we 858 00:41:32,954 --> 00:41:34,620 knew we were on the right track with D4M, 859 00:41:34,620 --> 00:41:36,530 is that this had been implemented 860 00:41:36,530 --> 00:41:38,836 in hundreds or thousands of lines of Java and SQL. 861 00:41:38,836 --> 00:41:40,710 And I'm going to show you how we implement it 862 00:41:40,710 --> 00:41:44,450 in one line in D4M. 863 00:41:44,450 --> 00:41:48,750 So let's go over that algorithm. 864 00:41:48,750 --> 00:41:52,240 So here's an example of what my data might look like. 865 00:41:52,240 --> 00:41:54,460 I have a bunch of documents here. 866 00:41:54,460 --> 00:41:57,280 I have a bunch of entities, and wherever an entity appears 867 00:41:57,280 --> 00:41:58,600 in a document, 868 00:41:58,600 --> 00:42:00,580 I have a dot. 
869 00:42:00,580 --> 00:42:03,800 I'm going to have my associative array over here, 870 00:42:03,800 --> 00:42:07,140 which is matching from the strings to the reals. 871 00:42:07,140 --> 00:42:10,430 And my two facets are just going to be two column names here, 872 00:42:10,430 --> 00:42:14,010 so I'm going to pick a facet y1 and y2. 873 00:42:14,010 --> 00:42:16,150 So I'm going to pick these two columns. 874 00:42:16,150 --> 00:42:20,050 All right, so basically that's what this does. 875 00:42:20,050 --> 00:42:23,980 This says get me y1 and get me y2. 876 00:42:23,980 --> 00:42:26,010 I'm using this bar here to say I'm 877 00:42:26,010 --> 00:42:28,800 going to knock away the facet name afterwards 878 00:42:28,800 --> 00:42:30,350 so I can AND them together. 879 00:42:30,350 --> 00:42:34,330 So that gets me all the documents that basically 880 00:42:34,330 --> 00:42:36,970 contain both UN and Carl, and then I 881 00:42:36,970 --> 00:42:38,850 can go and compute the counts that you 882 00:42:38,850 --> 00:42:41,770 saw on the previous page just 883 00:42:41,770 --> 00:42:43,260 by doing this matrix multiply. 884 00:42:43,260 --> 00:42:45,340 So I AND them, I transpose, and I 885 00:42:45,340 --> 00:42:48,690 matrix multiply to get them together, and then there we go. 886 00:42:52,030 --> 00:42:54,960 All right, so now I'm going to get into the demo, 887 00:42:54,960 --> 00:42:57,970 and I think we have plenty of time for that. 888 00:42:57,970 --> 00:43:00,650 I'll sort of set it up, and then we'll take a short break. 889 00:43:00,650 --> 00:43:04,379 People can have cookies, and then we 890 00:43:04,379 --> 00:43:06,670 will get into the demo, which really begins to show you 891 00:43:06,670 --> 00:43:11,054 a way to use D4M on real data. 892 00:43:11,054 --> 00:43:12,970 So the data that we're going to work with here 893 00:43:12,970 --> 00:43:14,320 is the Reuters Corpus. 
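[The facet-search step just described, grab the two entity columns, AND them to find the documents containing both, then transpose and matrix-multiply to count co-occurring entities, can be sketched in plain Python/NumPy. This is not the D4M one-liner itself; the tiny incidence matrix and entity names are invented for illustration:]

```python
import numpy as np

# Toy entity/document incidence matrix E: rows = documents,
# columns = entities; E[d, e] = 1 if entity e appears in document d.
entities = ["UN", "Carl", "DC", "Al"]
E = np.array([
    [1, 1, 0, 1],   # doc1 mentions UN, Carl, Al
    [1, 0, 1, 0],   # doc2 mentions UN, DC
    [1, 1, 1, 0],   # doc3 mentions UN, Carl, DC
    [0, 1, 0, 1],   # doc4 mentions Carl, Al
])
col = {name: i for i, name in enumerate(entities)}

# Facets y1 = "UN", y2 = "Carl": AND the two columns to get the
# documents containing both entities ...
both = E[:, col["UN"]] & E[:, col["Carl"]]   # 1 where a doc has both

# ... then transpose and matrix-multiply to count, for every entity,
# how often it co-occurs with the chosen facet pair.
counts = E.T @ both
for name, c in zip(entities, counts):
    print(name, c)
```

[Here docs 1 and 3 contain both UN and Carl, so the counts are taken over just those two rows.]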
894 00:43:14,320 --> 00:43:20,340 So this was a corpus of data released in 2000 to help spur 895 00:43:20,340 --> 00:43:21,480 research in this community. 896 00:43:21,480 --> 00:43:23,670 We're very grateful to Reuters for doing this. 897 00:43:23,670 --> 00:43:27,030 They gave the data to NIST, who actually 898 00:43:27,030 --> 00:43:29,860 stewards the data sets. 899 00:43:29,860 --> 00:43:32,370 It's a set of Reuters news reports 900 00:43:32,370 --> 00:43:37,290 over this particular time frame, the mid 90s, 901 00:43:37,290 --> 00:43:40,630 and it's like 800,000 total. 902 00:43:40,630 --> 00:43:42,515 And what we have done-- we're not actually 903 00:43:42,515 --> 00:43:44,140 working with the straight Reuters data; 904 00:43:44,140 --> 00:43:46,560 we've run it through various parsers that 905 00:43:46,560 --> 00:43:48,770 have extracted just the entities, so 906 00:43:48,770 --> 00:43:51,540 the people, the places, and the organizations. 907 00:43:51,540 --> 00:43:53,240 So it's sort of a summary. 908 00:43:53,240 --> 00:43:57,980 It's a very terse summarization of the data, if that. 909 00:43:57,980 --> 00:44:01,510 It's really sort of a derived product. 910 00:44:01,510 --> 00:44:03,770 The data is power law, so if we look 911 00:44:03,770 --> 00:44:07,740 at the documents per entity, you see here 912 00:44:07,740 --> 00:44:13,070 that a few people, places, and organizations 913 00:44:13,070 --> 00:44:16,490 appear in lots of places, and most of them 914 00:44:16,490 --> 00:44:20,390 appear in just a few documents, and you see the same thing 915 00:44:20,390 --> 00:44:22,590 when you look at the entities per document. 916 00:44:22,590 --> 00:44:24,130 So essentially, you could view 917 00:44:24,130 --> 00:44:26,690 this as computing the in degrees and the out 918 00:44:26,690 --> 00:44:28,820 degrees of this bipartite graph. 
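[The documents-per-entity and entities-per-document counts the lecture plots are just the column sums and row sums of the document-entity incidence matrix. A Python/NumPy sketch on synthetic, Zipf-weighted data, not the actual Reuters corpus, with sizes chosen only for illustration:]

```python
import numpy as np

# Hypothetical entity/document incidence matrix standing in for the
# Reuters entity data: rows = documents, columns = entities.
rng = np.random.default_rng(0)
n_docs, n_ents = 1000, 200

# Zipf-like column weights: a few "popular" entities, a long tail.
popularity = 1.0 / np.arange(1, n_ents + 1)
E = (rng.random((n_docs, n_ents)) < popularity).astype(int)

# Degrees of the bipartite document-entity graph:
docs_per_entity = E.sum(axis=0)  # how many documents each entity is in
ents_per_doc = E.sum(axis=1)     # how many entities each document has

# Histogram of documents-per-entity; on log-log axes this shows the
# heavy-tailed, power-law-like shape the lecture describes.
hist = np.bincount(docs_per_entity)
```

[Both degree sums count the same set of dots in the matrix, so they must agree; that is a handy sanity check on real data too.]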
919 00:44:28,820 --> 00:44:31,910 And it has the classic power law shape 920 00:44:31,910 --> 00:44:34,450 that we expect to see in our data. 921 00:44:34,450 --> 00:44:37,360 So with that, we're going to go into the demo. 922 00:44:37,360 --> 00:44:41,430 And just to summarize, so just recall, 923 00:44:41,430 --> 00:44:43,430 the evolution of the web has really 924 00:44:43,430 --> 00:44:45,940 created a new class of technologies. 925 00:44:45,940 --> 00:44:50,070 We really see the web moving towards game-style interfaces, 926 00:44:50,070 --> 00:44:52,270 triple store databases, and technologies 927 00:44:52,270 --> 00:44:56,320 like D4M for doing analysis-- very new technology. 928 00:44:56,320 --> 00:44:59,460 And just for the record, 929 00:44:59,460 --> 00:45:01,377 there is no assignment this week. 930 00:45:01,377 --> 00:45:03,710 So for those of you who are still doing the assignments, 931 00:45:03,710 --> 00:45:05,400 yay. 932 00:45:05,400 --> 00:45:08,307 For the example, we actually won't do both of these today; 933 00:45:08,307 --> 00:45:09,390 we'll just do one of them. 934 00:45:09,390 --> 00:45:11,670 Oh, and I keep making this mistake. 935 00:45:11,670 --> 00:45:12,410 I'll fix it. 936 00:45:12,410 --> 00:45:13,430 I'll fix it later. 937 00:45:13,430 --> 00:45:17,020 But it's in the examples directory. 938 00:45:17,020 --> 00:45:19,850 We've now moved on to the second subfolder, apps, 939 00:45:19,850 --> 00:45:21,410 entity analysis. 940 00:45:21,410 --> 00:45:24,120 And you guys are encouraged to run those examples, which 941 00:45:24,120 --> 00:45:25,680 we will now do shortly. 942 00:45:25,680 --> 00:45:27,870 So we will take a short five-minute break, 943 00:45:27,870 --> 00:45:31,870 and then continue on with the demo.