1 00:00:00,040 --> 00:00:02,410 The following content is provided under a Creative 2 00:00:02,410 --> 00:00:03,790 Commons license. 3 00:00:03,790 --> 00:00:06,030 Your support will help MIT OpenCourseWare 4 00:00:06,030 --> 00:00:10,110 continue to offer high quality educational resources for free. 5 00:00:10,110 --> 00:00:12,680 To make a donation, or to view additional materials 6 00:00:12,680 --> 00:00:16,590 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,590 --> 00:00:17,260 at ocw.mit.edu. 8 00:00:21,570 --> 00:00:23,655 JEREMY KEPNER: All right, welcome everybody. 9 00:00:23,655 --> 00:00:28,440 I'm really pleased to see you all here today 10 00:00:28,440 --> 00:00:32,229 for what will be the first of a nine lecture course called 11 00:00:32,229 --> 00:00:35,010 signal processing on databases. 12 00:00:35,010 --> 00:00:39,020 And this is a really special day for me 13 00:00:39,020 --> 00:00:41,685 because this sort of represents a big step 14 00:00:41,685 --> 00:00:45,310 in a set of technology that I've been working 15 00:00:45,310 --> 00:00:49,420 on for a couple of decades, and really 16 00:00:49,420 --> 00:00:53,190 been working intensely on for the last three or four years. 17 00:00:53,190 --> 00:00:57,150 And this is the first course on what 18 00:00:57,150 --> 00:00:58,540 we hope will be a technology that 19 00:00:58,540 --> 00:01:00,890 will become very widely used. 20 00:01:05,030 --> 00:01:08,110 It is a completely unique course on a completely unique 21 00:01:08,110 --> 00:01:11,110 technology, so I want to thank you 22 00:01:11,110 --> 00:01:16,980 for being the guinea pigs of this course. 23 00:01:16,980 --> 00:01:18,800 It's hard enough just doing a new course 24 00:01:18,800 --> 00:01:20,690 on an existing topic. 25 00:01:20,690 --> 00:01:24,920 Doing a new course on a novel topic is quite a challenge, 26 00:01:24,920 --> 00:01:27,630 so we're going to be trying a few different things here. 27 00:01:27,630 --> 00:01:30,700 There are going to be certain topics covered 28 00:01:30,700 --> 00:01:32,810 from more than one angle, and we'll 29 00:01:32,810 --> 00:01:37,796 see which ones sort of make the most sense for you. 30 00:01:37,796 --> 00:01:39,170 The first thing I'd like to do is 31 00:01:39,170 --> 00:01:41,680 talk a little bit about the title of the course, which 32 00:01:41,680 --> 00:01:43,480 is a little bit different. 33 00:01:43,480 --> 00:01:46,610 The title is signal processing on databases, 34 00:01:46,610 --> 00:01:50,470 which is not two phrases, signal processing and databases, that 35 00:01:50,470 --> 00:01:55,220 are usually connected together very often. 36 00:01:55,220 --> 00:01:59,400 And so, signal processing, as we all know, 37 00:01:59,400 --> 00:02:05,160 is, here at Lincoln Labs, one of the most core things 38 00:02:05,160 --> 00:02:06,230 that we do. 39 00:02:06,230 --> 00:02:10,979 In terms of the technology, it's almost difficult 40 00:02:10,979 --> 00:02:14,450 to imagine a technology that's more core to what we do. 41 00:02:14,450 --> 00:02:18,280 The only thing I could think of was systems engineering, maybe. 42 00:02:18,280 --> 00:02:22,430 But signal processing is really at the heart of what we do. 43 00:02:22,430 --> 00:02:25,410 Mathematically, signal processing really, 44 00:02:25,410 --> 00:02:27,800 the heart of that is detection theory. 45 00:02:27,800 --> 00:02:29,700 How do you find things? 46 00:02:29,700 --> 00:02:33,020 Given sets of data, how do you eliminate 47 00:02:33,020 --> 00:02:35,130 things you're not looking for and find the things 48 00:02:35,130 --> 00:02:37,010 that you are looking for? 49 00:02:37,010 --> 00:02:40,850 And then, mathematically, the heart of detection theory 50 00:02:40,850 --> 00:02:44,807 is linear algebra, and so, just a show 51 00:02:44,807 --> 00:02:46,890 of hands how many people have had a linear algebra 52 00:02:46,890 --> 00:02:49,530 course in there? 53 00:02:49,530 --> 00:02:51,650 See, that's really awesome. 54 00:02:51,650 --> 00:02:53,110 See at Lincoln, that's probably-- 55 00:02:53,110 --> 00:02:54,526 this one a few places in the world 56 00:02:54,526 --> 00:02:56,540 where it was pretty unanimous. 57 00:02:56,540 --> 00:02:59,590 You know, there's not too many places where that's the case. 58 00:02:59,590 --> 00:03:02,860 And so this technology is really designed for you, 59 00:03:02,860 --> 00:03:07,350 because it really assumes a foundation in linear algebra, 60 00:03:07,350 --> 00:03:11,180 and that's not as common as you might think. 61 00:03:11,180 --> 00:03:13,330 And so people who haven't had linear algebra, 62 00:03:13,330 --> 00:03:16,259 it's much harder for them to take advantage of the ideas 63 00:03:16,259 --> 00:03:17,550 that we're going to talk about. 64 00:03:17,550 --> 00:03:22,910 So signal processing, detection theory, linear algebra, 65 00:03:22,910 --> 00:03:26,070 that's sort of the root of the first phrase. 66 00:03:26,070 --> 00:03:29,910 The second phrase, databases, when 67 00:03:29,910 --> 00:03:34,970 we think of databases we think of typically, maybe 68 00:03:34,970 --> 00:03:41,290 things that you do, searching on the web, texts, maybe 69 00:03:41,290 --> 00:03:46,310 other things like that, which is not necessarily 70 00:03:46,310 --> 00:03:49,100 consistent with the mathematics that we think about when we 71 00:03:49,100 --> 00:03:51,000 think about signal processing, detection 72 00:03:51,000 --> 00:03:56,030 theory, and linear algebra, we naturally go onto real numbers, 73 00:03:56,030 --> 00:03:58,510 matrices, those types of things. 74 00:03:58,510 --> 00:04:00,010 So what we're really doing here is 75 00:04:00,010 --> 00:04:03,500 we're connecting that mathematics with sort 76 00:04:03,500 --> 00:04:06,160 of a whole new branch of data. 77 00:04:06,160 --> 00:04:10,150 Words and strings and relationships. 78 00:04:10,150 --> 00:04:12,370 And so that's kind of the really novel piece of it, 79 00:04:12,370 --> 00:04:14,620 and it's really the core idea that we're bringing here 80 00:04:14,620 --> 00:04:17,660 is we're going to bring all that machinery that we know so well, 81 00:04:17,660 --> 00:04:19,700 and we're going to try and apply it 82 00:04:19,700 --> 00:04:21,820 to this new area which is becoming increasingly 83 00:04:21,820 --> 00:04:22,870 important for us. 84 00:04:22,870 --> 00:04:24,720 So that's the explanation of the title. 85 00:04:24,720 --> 00:04:26,770 That's how we picked the title. 86 00:04:26,770 --> 00:04:29,430 There's other titles, and so I just 87 00:04:29,430 --> 00:04:31,830 wanted to spend a little time on that. 88 00:04:35,460 --> 00:04:39,240 An enormous number of people have contributed to this work. 89 00:04:39,240 --> 00:04:43,972 This is a list of the people that I could remember 90 00:04:43,972 --> 00:04:45,430 at the time that I typed the slide, 91 00:04:45,430 --> 00:04:48,490 but no doubt I have forgotten somebody important, 92 00:04:48,490 --> 00:04:50,640 and so I apologize for that. 93 00:04:50,640 --> 00:04:55,200 But certainly these people, and the whole team 94 00:04:55,200 --> 00:04:59,720 have been very instrumental in allowing this technology 95 00:04:59,720 --> 00:05:01,340 to move forward here. 96 00:05:07,150 --> 00:05:09,060 So this is going to be the outline 97 00:05:09,060 --> 00:05:12,670 for this particular lecture. 98 00:05:12,670 --> 00:05:14,450 And there's going to be a rhythm to most 99 00:05:14,450 --> 00:05:15,950 of these lectures, which is there's 100 00:05:15,950 --> 00:05:18,960 going to be a lecture, which is going to be PowerPoint-- 101 00:05:18,960 --> 00:05:22,410 I apologize, we all see plenty of PowerPoint-- 102 00:05:22,410 --> 00:05:24,290 mainly to introduce ideas. 103 00:05:24,290 --> 00:05:26,710 And then there will be sort of a demo, which 104 00:05:26,710 --> 00:05:29,460 will go through code that is actually 105 00:05:29,460 --> 00:05:34,134 available in your LLGrid accounts now, 106 00:05:34,134 --> 00:05:35,550 and some of that code will even be 107 00:05:35,550 --> 00:05:37,610 part of homework assignments, if you so choose 108 00:05:37,610 --> 00:05:39,180 to do the homework assignments. 109 00:05:39,180 --> 00:05:40,420 And so that's going to be the flavor. 110 00:05:40,420 --> 00:05:42,350 We'll do the slides, we'll take a short break, 111 00:05:42,350 --> 00:05:46,995 and then we'll do the examples and talk about them. 112 00:05:51,140 --> 00:05:53,580 So let me talk about some of the kinds of-- 113 00:05:53,580 --> 00:05:55,740 and I stole this slide-- let me talk about some 114 00:05:55,740 --> 00:05:59,000 of the kinds of data that the technology that we're 115 00:05:59,000 --> 00:06:06,370 going to be talking about today really, really enables. 116 00:06:06,370 --> 00:06:08,710 It allows us to deal with data that 117 00:06:08,710 --> 00:06:14,390 has to do with relationships, relationships 118 00:06:14,390 --> 00:06:17,560 particularly as those represented by graphs. 119 00:06:17,560 --> 00:06:22,900 So for example, you might have vehicle tracks. 120 00:06:22,900 --> 00:06:28,410 Those represent relationships between locations. 121 00:06:28,410 --> 00:06:31,190 A car goes from one place to another. 122 00:06:31,190 --> 00:06:35,400 That's a relationship between those locations. 123 00:06:35,400 --> 00:06:37,950 Another is social networks. 124 00:06:37,950 --> 00:06:40,320 You know, everyone wants to know who their friends are, 125 00:06:40,320 --> 00:06:43,020 how many friends they are, who they're friends of friends are, 126 00:06:43,020 --> 00:06:44,570 those types of things. 127 00:06:44,570 --> 00:06:47,870 So that's another type of set of relationship. 128 00:06:47,870 --> 00:06:52,500 Again, very textural and oriented, very much oriented 129 00:06:52,500 --> 00:06:55,640 towards words and not numbers. 130 00:06:55,640 --> 00:07:01,830 Another area is cyber networks, a very apropros topic 131 00:07:01,830 --> 00:07:04,720 that people are, at least everyone is impacted by, 132 00:07:04,720 --> 00:07:08,730 and many people are doing work on, which is relationships 133 00:07:08,730 --> 00:07:11,670 between computers, communication between web servers, 134 00:07:11,670 --> 00:07:13,330 those types of things. 135 00:07:13,330 --> 00:07:19,690 And again, that is a very important area for us. 136 00:07:23,120 --> 00:07:25,390 Just to give you sort of a big picture here, 137 00:07:25,390 --> 00:07:27,079 like well, why am I even here? 138 00:07:27,079 --> 00:07:28,120 Why do I care about this? 139 00:07:28,120 --> 00:07:30,134 What are the things that we can do? 140 00:07:30,134 --> 00:07:32,300 I'm going to kind of walk through an example of some 141 00:07:32,300 --> 00:07:35,060 of the things that we have used this technology for that 142 00:07:35,060 --> 00:07:37,000 have been very exciting for us. 143 00:07:37,000 --> 00:07:40,430 And so, trying-- and we'll have examples 144 00:07:40,430 --> 00:07:42,610 like that throughout the course, you 145 00:07:42,610 --> 00:07:45,160 know, how you can actually use this, 146 00:07:45,160 --> 00:07:47,200 how this has had really a real impact. 147 00:07:47,200 --> 00:07:49,150 And so the first one we want to talk about 148 00:07:49,150 --> 00:07:53,860 is some work we've done in the cyber domain. 149 00:07:53,860 --> 00:07:55,930 So here's an example of a data set, 150 00:07:55,930 --> 00:08:00,610 which is a graph that is web traffic. 151 00:08:00,610 --> 00:08:05,540 And so, if we see here, this is the source IPs. 152 00:08:05,540 --> 00:08:08,510 So, you can imagine these are the computers that 153 00:08:08,510 --> 00:08:11,170 are on the inside of some domain, 154 00:08:11,170 --> 00:08:13,135 and then these are the destination IPs, 155 00:08:13,135 --> 00:08:17,780 all web servers that they hit are the green ones around here. 156 00:08:17,780 --> 00:08:20,090 This is sometimes referred to in graph theory 157 00:08:20,090 --> 00:08:21,430 as a bipartite graph. 158 00:08:21,430 --> 00:08:23,630 It's basically two separate sets, 159 00:08:23,630 --> 00:08:26,180 a vertices that talk to each other 160 00:08:26,180 --> 00:08:28,860 but have no, at least in this data set, 161 00:08:28,860 --> 00:08:33,250 we are not recording any connections within the sets. 162 00:08:33,250 --> 00:08:36,750 And then, hopefully you can see here these faint blue lines 163 00:08:36,750 --> 00:08:40,370 here that are the actual connections that show you 164 00:08:40,370 --> 00:08:44,360 which of these computers are connecting 165 00:08:44,360 --> 00:08:46,330 to which of these computers. 166 00:08:46,330 --> 00:08:49,240 There's obviously a lot of data, web traffic data. 167 00:08:49,240 --> 00:08:50,680 There's enormous amount of it. 168 00:08:50,680 --> 00:08:57,030 It shows 90 minutes of data, and in this particular case 169 00:08:57,030 --> 00:09:00,010 there was actually malicious activity 170 00:09:00,010 --> 00:09:03,160 that we were looking to detect in this data set. 171 00:09:06,230 --> 00:09:10,060 So one of the biggest challenges that you have in any data 172 00:09:10,060 --> 00:09:13,400 set like this is data representation. 173 00:09:13,400 --> 00:09:15,680 And sometimes in database parlance 174 00:09:15,680 --> 00:09:17,660 this is called the schema. 175 00:09:17,660 --> 00:09:20,784 So how you represent your data is usually 176 00:09:20,784 --> 00:09:22,200 if you can figure that out, you're 177 00:09:22,200 --> 00:09:26,720 almost halfway to solving the problem. 178 00:09:26,720 --> 00:09:30,120 And so, given that, in this case, 179 00:09:30,120 --> 00:09:32,260 we wanted to use graph techniques, 180 00:09:32,260 --> 00:09:34,110 we discover that our data almost always 181 00:09:34,110 --> 00:09:37,340 doesn't come to us in a nicely formed graph, 182 00:09:37,340 --> 00:09:40,840 and so we have to go through a lot of steps here. 183 00:09:40,840 --> 00:09:43,640 And so one thing you'll see is with this technology 184 00:09:43,640 --> 00:09:47,020 is that there's almost always possible part of a pipeline 185 00:09:47,020 --> 00:09:52,720 from raw data to conditioning the raw data to inserting it 186 00:09:52,720 --> 00:09:57,207 into a database to doing analytics to doing queries 187 00:09:57,207 --> 00:09:58,290 and other types of things. 188 00:09:58,290 --> 00:10:00,864 You're always really going to be using a pipeline. 189 00:10:00,864 --> 00:10:03,280 And you're going to be wanting to use different technology 190 00:10:03,280 --> 00:10:05,807 tools for different stages in the pipeline. 191 00:10:05,807 --> 00:10:07,390 We're not in any way trying to suggest 192 00:10:07,390 --> 00:10:09,348 that the technologies we're talking about today 193 00:10:09,348 --> 00:10:10,960 are the only pieces. 194 00:10:10,960 --> 00:10:12,710 It's an important piece, the technologies 195 00:10:12,710 --> 00:10:15,030 we're talking about, but they're not the only pieces. 196 00:10:15,030 --> 00:10:16,780 And you're always really functioning 197 00:10:16,780 --> 00:10:18,080 in the context of a pipeline. 198 00:10:22,580 --> 00:10:24,620 So this just shows the technology 199 00:10:24,620 --> 00:10:26,410 stack that we will be leveraging, 200 00:10:26,410 --> 00:10:29,490 that we typically leverage in these types of things. 201 00:10:29,490 --> 00:10:31,460 Ultimately, our goal is to get up here 202 00:10:31,460 --> 00:10:33,092 to these graph analytics. 203 00:10:35,366 --> 00:10:37,240 One of the things we'll be talking about most 204 00:10:37,240 --> 00:10:39,780 in this course is the high level languages 205 00:10:39,780 --> 00:10:44,420 that allow you to essentially do graph analytics very easily 206 00:10:44,420 --> 00:10:48,700 and bridge between the databases themselves, 207 00:10:48,700 --> 00:10:51,460 and the high performance computing that you will need. 208 00:10:51,460 --> 00:10:55,820 So again, not only do you have a pipeline of software and data, 209 00:10:55,820 --> 00:10:58,640 you have a pipeline or stack of technology. 210 00:10:58,640 --> 00:11:00,730 So when you're addressing problems like this, 211 00:11:00,730 --> 00:11:02,510 usually you're starting at one place 212 00:11:02,510 --> 00:11:05,110 but you will almost always, when you build a system, 213 00:11:05,110 --> 00:11:07,700 work out and deal with these other types of things. 214 00:11:11,040 --> 00:11:13,350 So that's sort of just a quick overview and example 215 00:11:13,350 --> 00:11:16,299 and I don't expect any of you-- it's probably just kind of, 216 00:11:16,299 --> 00:11:17,840 the purpose of the example's probably 217 00:11:17,840 --> 00:11:19,030 to raise more questions. 218 00:11:19,030 --> 00:11:21,261 Well, how do you really do this? 219 00:11:21,261 --> 00:11:23,510 So now I'm going to sort of take a bit of a right turn 220 00:11:23,510 --> 00:11:26,680 here and really talk about what the course is about, OK. 221 00:11:26,680 --> 00:11:28,860 And I'm going to actually talk about the course 222 00:11:28,860 --> 00:11:30,980 a little bit of a big picture. 223 00:11:30,980 --> 00:11:33,710 And I always felt that every course should tell you 224 00:11:33,710 --> 00:11:37,810 why that course is the most important course you'll ever 225 00:11:37,810 --> 00:11:39,149 take in your career. 226 00:11:39,149 --> 00:11:40,690 You know, I think every single course 227 00:11:40,690 --> 00:11:42,120 you can come up with an argument for that, 228 00:11:42,120 --> 00:11:43,120 but people don't say it. 229 00:11:43,120 --> 00:11:45,525 So I'm going to tell you why this course is 230 00:11:45,525 --> 00:11:47,400 the most important course you will ever take. 231 00:11:47,400 --> 00:11:49,210 Of course, every course is the most important course 232 00:11:49,210 --> 00:11:52,060 you'll ever take, but I'm going to tell you why this one is. 233 00:11:52,060 --> 00:11:54,980 So stepping way back, we're all part of MIT here 234 00:11:54,980 --> 00:11:58,900 and MIT has a formula for success. 235 00:11:58,900 --> 00:12:01,030 MIT is the number one science and engineering 236 00:12:01,030 --> 00:12:03,830 university on earth. 237 00:12:03,830 --> 00:12:07,150 We can actually, by any reasonable measure 238 00:12:07,150 --> 00:12:09,560 that is true. 239 00:12:09,560 --> 00:12:15,470 And in fact, by the people who track this, it's true by a lot. 240 00:12:15,470 --> 00:12:17,530 And so you know, it's a real honor 241 00:12:17,530 --> 00:12:19,540 to be associated with this organization, 242 00:12:19,540 --> 00:12:22,850 to be at an organization that's at its zenith. 243 00:12:22,850 --> 00:12:27,840 And MIT has had this incredible run of success over the last 60 244 00:12:27,840 --> 00:12:31,570 years because fundamentally they recognize the formula 245 00:12:31,570 --> 00:12:34,090 for making discoveries is a combination of theory 246 00:12:34,090 --> 00:12:35,920 and experiment. 247 00:12:35,920 --> 00:12:39,800 Bring those two together, and as an organization, 248 00:12:39,800 --> 00:12:43,030 we've actually organized ourselves, 249 00:12:43,030 --> 00:12:47,240 MIT is organized around this formula in that we have theory 250 00:12:47,240 --> 00:12:50,240 and experiment, and theory is the academic part 251 00:12:50,240 --> 00:12:53,120 of the mission. 252 00:12:53,120 --> 00:12:56,640 That's run out of departments. 253 00:12:56,640 --> 00:12:58,546 Typically, it involves mathematics. 254 00:12:58,546 --> 00:13:00,170 This is a place-- I mean, everyone here 255 00:13:00,170 --> 00:13:02,110 raised their hands to linear algebra, 256 00:13:02,110 --> 00:13:06,060 and I can go to any part of MIT and that will be true. 257 00:13:06,060 --> 00:13:08,700 We probably have the highest mathematical literacy 258 00:13:08,700 --> 00:13:13,310 of any organization on earth, and that's a very powerful, 259 00:13:13,310 --> 00:13:15,500 powerful capability. 260 00:13:15,500 --> 00:13:20,370 The mathematics that we use, they become manifest or real, 261 00:13:20,370 --> 00:13:21,970 typically in the forms of algorithms, 262 00:13:21,970 --> 00:13:26,130 and our algorithms become real in the form of software. 263 00:13:26,130 --> 00:13:28,540 Likewise, on the experimental side, you know, 264 00:13:28,540 --> 00:13:32,220 that's the research that we conduct. 265 00:13:32,220 --> 00:13:38,130 MIT actually has large laboratories. 266 00:13:38,130 --> 00:13:41,550 Laboratories that are fully peered with the departments. 267 00:13:41,550 --> 00:13:44,120 That is actually quite a unique thing 268 00:13:44,120 --> 00:13:46,680 for an academic institution to have laboratories 269 00:13:46,680 --> 00:13:49,060 and laboratory heads that are right up 270 00:13:49,060 --> 00:13:52,690 there with the department heads in how they're positioned. 271 00:13:52,690 --> 00:13:56,470 Lincoln being the largest of them, CSAIL, the Media Lab, 272 00:13:56,470 --> 00:13:58,170 there's a number of them. 273 00:13:58,170 --> 00:13:59,610 So we have this thing. 274 00:13:59,610 --> 00:14:01,270 And the focus of research, of course, 275 00:14:01,270 --> 00:14:04,870 is measurement, which is data, which 276 00:14:04,870 --> 00:14:07,230 often gets reduced to bytes. 277 00:14:07,230 --> 00:14:12,530 And so, often implementing this MIT formula 278 00:14:12,530 --> 00:14:16,990 comes down to marrying software and bytes together 279 00:14:16,990 --> 00:14:17,750 in a computer. 280 00:14:17,750 --> 00:14:21,590 That's sort of where we bring these two ideas together. 281 00:14:21,590 --> 00:14:24,040 And so we're going to be talking a lot about that. 282 00:14:24,040 --> 00:14:28,040 Implementing software bringing bytes together 283 00:14:28,040 --> 00:14:30,950 analyzing together that's where these two things come together, 284 00:14:30,950 --> 00:14:33,950 and I think that's really, really important. 285 00:14:33,950 --> 00:14:35,270 Now what's the tool we use? 286 00:14:35,270 --> 00:14:37,680 What's the machine that we use to bring these together? 287 00:14:37,680 --> 00:14:39,830 It's a computer, all right. 288 00:14:39,830 --> 00:14:42,130 And nowadays, computers have gotten a little bit 289 00:14:42,130 --> 00:14:45,269 more complicated than the way we used to think about them. 290 00:14:45,269 --> 00:14:46,810 We used to think about them typically 291 00:14:46,810 --> 00:14:48,490 as a von Neumann machine, which means 292 00:14:48,490 --> 00:14:52,420 you had data and operations, you know, bytes and software. 293 00:14:52,420 --> 00:14:55,480 And the bytes go into a part of the computer, 294 00:14:55,480 --> 00:14:57,310 they get operated on, they come out. 295 00:14:57,310 --> 00:14:59,970 And that model is still true at a high level, 296 00:14:59,970 --> 00:15:02,140 but the computers we tend to deal with now 297 00:15:02,140 --> 00:15:04,400 are much, much, much more complicated. 298 00:15:04,400 --> 00:15:08,020 So on the left is a standard parallel computer. 299 00:15:08,020 --> 00:15:11,060 Almost every computer used today looks like this. 300 00:15:11,060 --> 00:15:12,890 You have processors, you have memory, 301 00:15:12,890 --> 00:15:16,377 you have some kind of persistent storage, 302 00:15:16,377 --> 00:15:18,210 you have a network connecting them together, 303 00:15:18,210 --> 00:15:22,200 and this form a very complicated what we call memory hierarchy. 304 00:15:22,200 --> 00:15:27,590 It begins with registers here at the top, caches, 305 00:15:27,590 --> 00:15:29,700 local memory, remote memory, and then storage. 306 00:15:29,700 --> 00:15:32,410 And these are the units that go between them. 307 00:15:32,410 --> 00:15:35,970 And there's a lot of implications of this hierarchy, 308 00:15:35,970 --> 00:15:37,850 in terms of what things are-- you know, 309 00:15:37,850 --> 00:15:42,230 bandwidth goes up as you go this way, latency goes down. 310 00:15:42,230 --> 00:15:44,830 Programmability generally goes up the closer you are. 311 00:15:44,830 --> 00:15:46,550 You don't have to worry about things. 312 00:15:46,550 --> 00:15:48,120 And likewise, the capacity of data 313 00:15:48,120 --> 00:15:52,200 goes down as you go traverse this way in a hierarchy. 314 00:15:52,200 --> 00:15:55,110 So all modern computers that you use 315 00:15:55,110 --> 00:15:56,670 are these von Neumann architectures 316 00:15:56,670 --> 00:15:58,727 with multi-level memory hierarchy. 317 00:15:58,727 --> 00:16:00,560 And fundamentally the thing you have to know 318 00:16:00,560 --> 00:16:05,596 is that this architecture selects the algorithms you use. 319 00:16:05,596 --> 00:16:08,620 If algorithms are not going to run well in this architecture, 320 00:16:08,620 --> 00:16:11,730 you probably don't implement them. 321 00:16:11,730 --> 00:16:15,130 So one has to be very much aware of one's algorithms 322 00:16:15,130 --> 00:16:16,890 with respect to this architecture, 323 00:16:16,890 --> 00:16:21,200 because if you pick algorithms that aren't going to run well 324 00:16:21,200 --> 00:16:27,280 in this architecture, it's going to be very, very difficult. 325 00:16:27,280 --> 00:16:32,490 The goal of this course is to teach you 326 00:16:32,490 --> 00:16:37,720 techniques that will allow you to get your work done faster. 327 00:16:37,720 --> 00:16:40,910 And so this is a rather complicated drawing here, 328 00:16:40,910 --> 00:16:46,940 but this shows, given a problem that you want to solve, 329 00:16:46,940 --> 00:16:49,290 if you implement it in software using 330 00:16:49,290 --> 00:16:54,310 a variety of techniques, what the performance 331 00:16:54,310 --> 00:16:55,270 of those techniques is. 332 00:16:55,270 --> 00:16:57,980 So in a certain sense, you could view this as the volume of code 333 00:16:57,980 --> 00:17:00,075 that you have to write to solve a problem. 334 00:17:00,075 --> 00:17:02,320 And then we have implementing the program and C 335 00:17:02,320 --> 00:17:04,089 as a reference point. 336 00:17:04,089 --> 00:17:06,099 So as you move over here and you do 337 00:17:06,099 --> 00:17:10,380 things like in assembly or Java, MapReduce, 338 00:17:10,380 --> 00:17:12,420 you end up writing more code. 339 00:17:12,420 --> 00:17:15,770 If you move over here and do things in C++ or Java 340 00:17:15,770 --> 00:17:19,010 or MATLAB, you write less code, and this shows you the relative 341 00:17:19,010 --> 00:17:20,359 performance. 342 00:17:20,359 --> 00:17:22,710 Again, as you do things in lower level 343 00:17:22,710 --> 00:17:25,829 environments like assembly, you can get more performance. 344 00:17:25,829 --> 00:17:28,860 And as you do things in higher level environments like MATLAB, 345 00:17:28,860 --> 00:17:30,700 you get less performance. 346 00:17:30,700 --> 00:17:34,150 And this is a fairly general trade off space here. 347 00:17:34,150 --> 00:17:36,850 If we add parallel programming models here, 348 00:17:36,850 --> 00:17:39,280 of which there is many, we have direct memory access, 349 00:17:39,280 --> 00:17:44,040 we have message passing, we have a manager worker style 350 00:17:44,040 --> 00:17:47,700 or map reduce style, we have a whole different styles 351 00:17:47,700 --> 00:17:50,150 of programming here. 352 00:17:50,150 --> 00:17:52,160 We're going to want to try and be here. 353 00:17:52,160 --> 00:17:54,570 We're going to want to do things that take less effort 354 00:17:54,570 --> 00:17:56,520 and deliver you more performance. 355 00:17:56,520 --> 00:17:59,270 And so, we are going to be mostly focusing 356 00:17:59,270 --> 00:18:01,940 on MATLAB in this class and a technology 357 00:18:01,940 --> 00:18:05,770 we call [INAUDIBLE] to describe that more later. 358 00:18:05,770 --> 00:18:08,950 But this is a software combination 359 00:18:08,950 --> 00:18:14,110 that allows you to get relatively good performance 360 00:18:14,110 --> 00:18:16,450 at a significantly reduced effort, 361 00:18:16,450 --> 00:18:18,960 and that's really the only thing we're doing in this course. 362 00:18:18,960 --> 00:18:22,230 We're just saying, if you're doing these types of problems, 363 00:18:22,230 --> 00:18:24,040 there's a right tool for the job or there 364 00:18:24,040 --> 00:18:25,720 are right tools for the job, and if you 365 00:18:25,720 --> 00:18:27,840 use the right tool for the job, it will take you 366 00:18:27,840 --> 00:18:29,880 less time to get the job done. 367 00:18:29,880 --> 00:18:33,210 And we're going to try and teach you about these tools 368 00:18:33,210 --> 00:18:36,480 and how they actually work. 369 00:18:36,480 --> 00:18:39,110 Now, a little bit of bait and switch here. 370 00:18:39,110 --> 00:18:41,630 We call the course signal processing databases, 371 00:18:41,630 --> 00:18:44,980 but we're going to get to databases at the very end, 372 00:18:44,980 --> 00:18:48,760 unfortunately, to disappoint you all. 373 00:18:48,760 --> 00:18:53,090 And the reason is because databases 374 00:18:53,090 --> 00:18:54,700 are good for one type of problem, 375 00:18:54,700 --> 00:18:57,910 but it's often at the very end of the game 376 00:18:57,910 --> 00:19:00,380 where you really need to work with databases. 377 00:19:00,380 --> 00:19:03,290 So to understand, we have a variety of different ways 378 00:19:03,290 --> 00:19:07,320 we can solve problems with algorithms, OK. 379 00:19:07,320 --> 00:19:11,440 We can solve our problems just using a single computer's 380 00:19:11,440 --> 00:19:12,600 memory. 381 00:19:12,600 --> 00:19:17,160 We can solve our problems using the memory 382 00:19:17,160 --> 00:19:18,450 on multiple computers. 383 00:19:18,450 --> 00:19:23,920 We can solve our problems using disks or storage, or even 384 00:19:23,920 --> 00:19:27,030 with many disks, OK, and as we go this way, 385 00:19:27,030 --> 00:19:29,780 we can do more and more and more data. 386 00:19:29,780 --> 00:19:32,360 But one of the things that really matters, and it kind of 387 00:19:32,360 --> 00:19:35,690 goes back to that other chart there is well, 388 00:19:35,690 --> 00:19:39,770 when I'm working with this data, what is the chunk size? 389 00:19:39,770 --> 00:19:43,990 What are the blocks that I need to get the data in? 390 00:19:43,990 --> 00:19:51,060 And if I need to get the data in very, very small chunks, 391 00:19:51,060 --> 00:19:53,140 well then obviously having the data in memory 392 00:19:53,140 --> 00:19:55,230 is going to be the most efficient thing I can do. 393 00:19:55,230 --> 00:20:00,190 If I want to just get one record out of many records, 394 00:20:00,190 --> 00:20:05,370 then that's the most efficient, having the data in memory. 395 00:20:05,370 --> 00:20:10,420 Likewise-- so that's where we typically start, right? 396 00:20:10,420 --> 00:20:13,600 If our problem is small when we almost always try and start 397 00:20:13,600 --> 00:20:17,370 with our problems with small, we're going to be working here. 398 00:20:17,370 --> 00:20:21,880 We're going to be writing what we call serial programs on one 399 00:20:21,880 --> 00:20:23,200 computer. 400 00:20:23,200 --> 00:20:25,270 And then, the great thing about serial memory 401 00:20:25,270 --> 00:20:31,030 is that even as our requests size gets larger, 402 00:20:31,030 --> 00:20:34,390 they're really not going to do anything better than just 403 00:20:34,390 --> 00:20:35,670 having the data in memory. 404 00:20:35,670 --> 00:20:39,150 So we'll be able to grow that model up to data sets 405 00:20:39,150 --> 00:20:44,450 that are fairly small, maybe gigabytes, and we can continue 406 00:20:44,450 --> 00:20:46,870 to use that model just fine. 407 00:20:46,870 --> 00:20:49,390 And then, if we decide, well, we're 408 00:20:49,390 --> 00:20:54,220 really moving beyond that, we need to do more data, 409 00:20:54,220 --> 00:20:58,880 we can go to a parallel program to get more memory, 410 00:20:58,880 --> 00:21:01,940 or we can use data in files. 411 00:21:01,940 --> 00:21:03,580 We can write our data to files, we 412 00:21:03,580 --> 00:21:05,663 can read them in, process them, and read them out, 413 00:21:05,663 --> 00:21:07,640 and that model works very, very well. 414 00:21:07,640 --> 00:21:10,440 And in fact, particularly if we're 415 00:21:10,440 --> 00:21:12,874 going to be accessing the data in large chunks. 416 00:21:12,874 --> 00:21:14,290 If I have a lot of data and like I 417 00:21:14,290 --> 00:21:18,030 want to do some processing on the whole piece of it. 418 00:21:21,170 --> 00:21:23,680 Likewise, if we go to a very large data sets, 419 00:21:23,680 --> 00:21:27,240 we find that we can even do parallel programs writing files 420 00:21:27,240 --> 00:21:30,520 in parallel if we want to do extreme large problems that we 421 00:21:30,520 --> 00:21:33,820 want to traverse all the data. 422 00:21:33,820 --> 00:21:37,820 But finally, we're going to come back to this case 423 00:21:37,820 --> 00:21:39,680 where we need a database. 424 00:21:39,680 --> 00:21:42,350 And a database is when we have very large amounts of data 425 00:21:42,350 --> 00:21:44,900 but we want to access-- database, 426 00:21:44,900 --> 00:21:47,250 data that won't fit in our memory, 427 00:21:47,250 --> 00:21:49,160 but we want to access in little bits. 428 00:21:49,160 --> 00:21:50,770 We want to look up something. 429 00:21:50,770 --> 00:21:54,720 And that's really the only time we really need a database, 430 00:21:54,720 --> 00:21:58,520 unless of course someone says this is your data. 431 00:21:58,520 --> 00:22:00,260 It is available to you on a database. 432 00:22:00,260 --> 00:22:02,790 That's the only way you're going to access it. 433 00:22:02,790 --> 00:22:05,270 But if you have control-- so you always 434 00:22:05,270 --> 00:22:06,650 want to follow this path. 435 00:22:06,650 --> 00:22:09,100 You want to start with your serial program memory, 436 00:22:09,100 --> 00:22:12,570 maybe go to a parallel program using files, maybe 437 00:22:12,570 --> 00:22:15,590 parallel files, and then the database is almost 438 00:22:15,590 --> 00:22:17,850 the two of last resort because it 439 00:22:17,850 --> 00:22:22,430 adds a great deal of complexity, and while it can look up 440 00:22:22,430 --> 00:22:25,712 things much quicker, if you actually end up doing something 441 00:22:25,712 --> 00:22:27,170 where you're going to be traversing 442 00:22:27,170 --> 00:22:28,628 a significant fraction of the data, 443 00:22:28,628 --> 00:22:31,280 you will find it's actually significantly slower 444 00:22:31,280 --> 00:22:33,370 than if you have the data on files. 445 00:22:33,370 --> 00:22:35,610 So we're going to show you techniques, 446 00:22:35,610 --> 00:22:38,980 we're going to follow this path, and then hopefully by the end, 447 00:22:38,980 --> 00:22:40,970 you'll be like, oh, well yes, now I 448 00:22:40,970 --> 00:22:43,070 know why we use a database for these instances, 449 00:22:43,070 --> 00:22:44,510 but I realize that I can get most 450 00:22:44,510 --> 00:22:46,850 of my work done using these other techniques that 451 00:22:46,850 --> 00:22:49,670 work just fine. 452 00:22:49,670 --> 00:22:51,570 So hopefully you're going to be learning 453 00:22:51,570 --> 00:22:53,840 what we call the fast path. 454 00:22:53,840 --> 00:22:58,840 So if you want to get a certain level of performance, 455 00:22:58,840 --> 00:23:05,260 OK, if you have no training or minimal training, 456 00:23:05,260 --> 00:23:07,580 this is the path that you follow. 457 00:23:07,580 --> 00:23:10,830 You began using a technology. 458 00:23:10,830 --> 00:23:13,400 Your program gets slower for a while. 459 00:23:13,400 --> 00:23:16,740 You beat your head against the wall and the web for a while. 460 00:23:16,740 --> 00:23:18,540 Maybe eventually, after days, you 461 00:23:18,540 --> 00:23:21,260 get back to where you were, and then slowly you 462 00:23:21,260 --> 00:23:24,881 climb up here to get it to some level of performance weeks 463 00:23:24,881 --> 00:23:25,380 later. 464 00:23:25,380 --> 00:23:29,400 This is something we observe time and time again, 465 00:23:29,400 --> 00:23:33,210 and typically what happens is that right around here 466 00:23:33,210 --> 00:23:35,035 people just give up with the technology 467 00:23:35,035 --> 00:23:38,830 and they say the technology didn't work for them. 468 00:23:38,830 --> 00:23:41,490 The point of this class is to give you the techniques 469 00:23:41,490 --> 00:23:43,450 that an expert would have. 470 00:23:43,450 --> 00:23:45,600 An expert could use the technology, 471 00:23:45,600 --> 00:23:49,670 know exactly how to use it, and then can go on this path 472 00:23:49,670 --> 00:23:52,170 where basically within hours you're getting good performance 473 00:23:52,170 --> 00:23:54,340 and if they really want to take it to the next level 474 00:23:54,340 --> 00:23:56,260 they can spend days doing that. 475 00:23:56,260 --> 00:23:59,900 And so that's kind of our whole philosophy here. 476 00:23:59,900 --> 00:24:03,280 So we're going to actually have some new concepts 477 00:24:03,280 --> 00:24:04,150 in this course. 478 00:24:04,150 --> 00:24:09,960 I think they're talked about in the course 479 00:24:09,960 --> 00:24:14,410 outline a little bit, some of the really new ideas 480 00:24:14,410 --> 00:24:20,170 that you probably wouldn't have run across in other courses. 481 00:24:20,170 --> 00:24:22,700 So we're going to be talking about graphs a lot 482 00:24:22,700 --> 00:24:25,360 in this course. 483 00:24:25,360 --> 00:24:28,160 The standard graph that-- how many people 484 00:24:28,160 --> 00:24:32,160 have taken a graph theory course or a computer science course 485 00:24:32,160 --> 00:24:34,690 that talked about graphs? 486 00:24:34,690 --> 00:24:35,460 Most. 487 00:24:35,460 --> 00:24:40,570 I'm glad to see linear algebra is still the stronger one here. 488 00:24:40,570 --> 00:24:42,962 That's more important. 489 00:24:42,962 --> 00:24:45,170 Between the two, having the linear algebra background 490 00:24:45,170 --> 00:24:46,957 is the more important. 491 00:24:46,957 --> 00:24:48,540 And I can, if you know linear algebra, 492 00:24:48,540 --> 00:24:52,060 I can teach you graph theory pretty easily. 493 00:24:52,060 --> 00:24:54,170 The traditional definitions of graphs, 494 00:24:54,170 --> 00:24:56,660 though, that are taught in most computer science courses, 495 00:24:56,660 --> 00:25:02,990 they tend to focus on what we call 496 00:25:02,990 --> 00:25:06,850 undirected, unweighted graphs. 497 00:25:06,850 --> 00:25:10,381 So these are a graph is a set of vertices and edges, 498 00:25:10,381 --> 00:25:11,880 and so what we're saying when we say 499 00:25:11,880 --> 00:25:13,810 it's an undirected and unweighted graph 500 00:25:13,810 --> 00:25:17,584 is that all the edges are the same between any two vertices. 501 00:25:17,584 --> 00:25:19,250 There's no real difference between them. 502 00:25:19,250 --> 00:25:21,430 Either an edge exists or it doesn't. 503 00:25:21,430 --> 00:25:23,235 It's kind of like a zero or a one. 504 00:25:23,235 --> 00:25:24,610 And then it doesn't really-- when 505 00:25:24,610 --> 00:25:26,890 we say two vertices connected, there's 506 00:25:26,890 --> 00:25:29,880 no directionality of that. 507 00:25:29,880 --> 00:25:34,010 The majority of graph theory and what's taught in class 508 00:25:34,010 --> 00:25:36,670 is based on that. 509 00:25:36,670 --> 00:25:37,260 Excuse me. 510 00:25:37,260 --> 00:25:49,652 Unfortunately, when we get into the real world-- 511 00:25:49,652 --> 00:25:51,610 unfortunately, when we get into the real world, 512 00:25:51,610 --> 00:25:54,930 we find that I've never run into that kind of graph 513 00:25:54,930 --> 00:25:56,214 in the real world. 514 00:25:56,214 --> 00:25:57,630 In the years I've been doing this, 515 00:25:57,630 --> 00:26:01,904 I've never actually run into that kind of graph. 516 00:26:01,904 --> 00:26:03,570 In fact, another part of that is they'll 517 00:26:03,570 --> 00:26:04,930 talk about the distribution. 518 00:26:04,930 --> 00:26:06,700 In fact, you'll often hear something 519 00:26:06,700 --> 00:26:08,770 called an Erdos-Renyi graph, which 520 00:26:08,770 --> 00:26:10,950 is just a random connection. 521 00:26:10,950 --> 00:26:14,170 It is any two vertices are randomly connected 522 00:26:14,170 --> 00:26:15,520 with maybe some probability. 523 00:26:18,410 --> 00:26:21,640 And again, I've never run into that either in the real world. 524 00:26:21,640 --> 00:26:24,270 So the focus of graph theory that we have now 525 00:26:24,270 --> 00:26:26,100 that we have all this data and can compare 526 00:26:26,100 --> 00:26:27,940 the theory and the data, can bury 527 00:26:27,940 --> 00:26:31,810 the theory of the experiment, we see that the theory is wanting. 528 00:26:31,810 --> 00:26:36,830 The theory is not a good stepping stone into the data. 529 00:26:36,830 --> 00:26:42,690 And so what we see is that the real data is not random. 530 00:26:42,690 --> 00:26:44,900 It's often what we call a power law distribution, 531 00:26:44,900 --> 00:26:47,350 that a certain vertices are very popular 532 00:26:47,350 --> 00:26:52,420 and most vertices don't have a lot of connections to them. 533 00:26:52,420 --> 00:26:55,780 Usually the edges are directed, that is, there's meaning. 534 00:26:58,410 --> 00:27:00,240 If I friend you, it's not the same 535 00:27:00,240 --> 00:27:05,420 as you friending me on Facebook or Twitter, or you twit face, 536 00:27:05,420 --> 00:27:10,570 or whatever the technology is of the future. 537 00:27:10,570 --> 00:27:13,240 Usually the edge of self has meaning, has weight. 538 00:27:13,240 --> 00:27:14,650 That weight could be of value. 539 00:27:14,650 --> 00:27:17,086 It could be words, it could be other types of things. 540 00:27:20,830 --> 00:27:22,390 Typically you have multiple edges. 541 00:27:22,390 --> 00:27:24,402 I could send you many friend requests. 542 00:27:24,402 --> 00:27:26,610 You know, that's not the same thing as me sending you 543 00:27:26,610 --> 00:27:27,856 one friend request. 544 00:27:27,856 --> 00:27:29,480 Maybe I can't send you friend requests. 545 00:27:29,480 --> 00:27:30,910 I have multiple friend requests. 546 00:27:30,910 --> 00:27:32,020 I don't know. 547 00:27:32,020 --> 00:27:35,000 But you can have multiple edges between vertices. 548 00:27:35,000 --> 00:27:37,460 And probably most importantly is edges 549 00:27:37,460 --> 00:27:39,460 are often what we call hyper, and this 550 00:27:39,460 --> 00:27:43,830 is a very important concept that is almost never touched 551 00:27:43,830 --> 00:27:46,410 on in graph theory, which is that an edge can connect 552 00:27:46,410 --> 00:27:49,250 multiple vertices at the same time. 553 00:27:49,250 --> 00:27:51,806 A classic example would be an email 554 00:27:51,806 --> 00:27:53,180 that you send to multiple people. 555 00:27:53,180 --> 00:27:56,870 In fact, most emails these days have more than one recipient. 556 00:27:56,870 --> 00:28:01,400 And so, me sending an email to everyone 557 00:28:01,400 --> 00:28:04,260 in the class, one email to everyone in the class, 558 00:28:04,260 --> 00:28:05,170 is a hyper edge. 559 00:28:05,170 --> 00:28:08,720 That edge, the email, connects and has direction, 560 00:28:08,720 --> 00:28:11,530 you know, me with all the recipients. 561 00:28:11,530 --> 00:28:13,960 That's very different than me individually 562 00:28:13,960 --> 00:28:16,700 sending an email to every single one of you, which 563 00:28:16,700 --> 00:28:18,740 would be a standard edge. 564 00:28:18,740 --> 00:28:22,590 So this is an important concept not typically discussed a lot 565 00:28:22,590 --> 00:28:26,680 in standard graph theory that is very important. 566 00:28:26,680 --> 00:28:29,460 So we're going to move beyond the sort of standard definition 567 00:28:29,460 --> 00:28:33,225 of graphs in this class, and that's going to be hard. 568 00:28:33,225 --> 00:28:34,600 We're going to deal with a bigger 569 00:28:34,600 --> 00:28:37,110 definition of linear algebra. 570 00:28:37,110 --> 00:28:40,430 Traditionally, linear algebra is dealing with matrices. 571 00:28:40,430 --> 00:28:44,390 They have indices for the rows and the columns 572 00:28:44,390 --> 00:28:47,130 and the values are real numbers. 573 00:28:47,130 --> 00:28:49,310 Maybe integers, may be complex numbers, 574 00:28:49,310 --> 00:28:51,790 we deal with complex numbers here a lot, 575 00:28:51,790 --> 00:28:54,706 but we're going to be dealing with things like strings. 576 00:28:54,706 --> 00:28:56,330 We're going to be taking linear algebra 577 00:28:56,330 --> 00:28:58,900 and applying it to things like words. 578 00:28:58,900 --> 00:29:00,867 And so that's going to be different. 579 00:29:00,867 --> 00:29:02,450 In fact actually, one of my favorite-- 580 00:29:02,450 --> 00:29:03,670 it might be the part of the course that 581 00:29:03,670 --> 00:29:05,253 puts you all asleep, but it's actually 582 00:29:05,253 --> 00:29:07,860 one of my favorite parts of the course. 583 00:29:07,860 --> 00:29:11,530 And then, bigger definition of processing. 584 00:29:11,530 --> 00:29:13,200 One of the dominant things you'll 585 00:29:13,200 --> 00:29:15,340 hear about in this space-- there's 586 00:29:15,340 --> 00:29:17,290 a lot of technologies out there. 587 00:29:17,290 --> 00:29:19,600 A popular technology is called Hadoop, 588 00:29:19,600 --> 00:29:23,210 and it's map/reduce parallel programming paradigm, which 589 00:29:23,210 --> 00:29:29,960 is a great technology to kind of get people started, 590 00:29:29,960 --> 00:29:33,100 but for people with the mathematical sophistication 591 00:29:33,100 --> 00:29:35,810 that we have, we can use programming models 592 00:29:35,810 --> 00:29:37,950 that are simply better. 593 00:29:37,950 --> 00:29:41,080 If you understand mathematics to the level that we do, 594 00:29:41,080 --> 00:29:44,490 we can give you tools for programming parallel computers 595 00:29:44,490 --> 00:29:47,890 that are just more efficient in terms of you write less code 596 00:29:47,890 --> 00:29:51,054 and you get better performance than using techniques that 597 00:29:51,054 --> 00:29:52,970 really are designed for people that don't have 598 00:29:52,970 --> 00:29:55,150 the mathematical, the linear algebraic background 599 00:29:55,150 --> 00:29:57,200 that you have. 600 00:29:57,200 --> 00:30:02,030 So this is really going to be the foundation of the course. 601 00:30:02,030 --> 00:30:05,230 So let me continue here with the course outline. 602 00:30:05,230 --> 00:30:06,899 So this is going to be, essentially, 603 00:30:06,899 --> 00:30:08,190 the nine lecture of the course. 604 00:30:08,190 --> 00:30:10,500 I think this is still pretty accurate. 605 00:30:10,500 --> 00:30:12,060 So we're in the introductory course, 606 00:30:12,060 --> 00:30:14,240 we're going to review the course and goals. 607 00:30:14,240 --> 00:30:16,400 The next course is really going to be dealing 608 00:30:16,400 --> 00:30:18,440 with this concept of associative arrays, which 609 00:30:18,440 --> 00:30:22,480 is the core technology that sort of brings all this together. 610 00:30:22,480 --> 00:30:25,270 When I talk about extending linear algebra to words, 611 00:30:25,270 --> 00:30:28,770 using funny, using, sorry, by using fuzzy algebra that really 612 00:30:28,770 --> 00:30:31,840 gets into mathematical abstract algebraic concepts 613 00:30:31,840 --> 00:30:33,730 called group theory. 614 00:30:33,730 --> 00:30:37,380 It's not nearly as scary as it may sound. 615 00:30:37,380 --> 00:30:38,220 We'll get into that. 616 00:30:38,220 --> 00:30:39,761 It's actually, I think, quite elegant 617 00:30:39,761 --> 00:30:41,470 and really good to know. 618 00:30:41,470 --> 00:30:44,740 Good to know that when you work with the technology 619 00:30:44,740 --> 00:30:46,575 and you work with associative arrays, 620 00:30:46,575 --> 00:30:49,110 that we've actually thought about the mathematics of how 621 00:30:49,110 --> 00:30:50,880 these things work together. 622 00:30:50,880 --> 00:30:52,970 And when we add a feature, it's usually 623 00:30:52,970 --> 00:30:55,621 because it fits mathematically, and when 624 00:30:55,621 --> 00:30:57,120 we don't add a feature, it's usually 625 00:30:57,120 --> 00:30:58,170 because someone has made a request 626 00:30:58,170 --> 00:31:00,253 and like, yeah, that's really going to take people 627 00:31:00,253 --> 00:31:02,290 into a bad place. 628 00:31:02,290 --> 00:31:04,960 Then we're going to get into really sort of the center 629 00:31:04,960 --> 00:31:06,540 part of the class where we're going 630 00:31:06,540 --> 00:31:09,170 to show you how we do analysis of entities 631 00:31:09,170 --> 00:31:11,745 on unstructured data, and then doing analysis 632 00:31:11,745 --> 00:31:13,445 on unstructured data. 633 00:31:13,445 --> 00:31:15,320 We're going to--I've talked about power laws. 634 00:31:15,320 --> 00:31:18,600 We' going to whole class on power law data, modeling 635 00:31:18,600 --> 00:31:20,685 of that deal, but then we have another class 636 00:31:20,685 --> 00:31:22,707 in cross-correlation, and then we're 637 00:31:22,707 --> 00:31:24,540 really going to get into parallel processing 638 00:31:24,540 --> 00:31:25,270 and databases. 639 00:31:25,270 --> 00:31:27,110 And the last two classes aren't ever 640 00:31:27,110 --> 00:31:28,770 going to really be lecturers. 641 00:31:28,770 --> 00:31:32,094 They're going to be really just the demos, and really-- 642 00:31:32,094 --> 00:31:34,760 because that's where we're going to really bring a lot of things 643 00:31:34,760 --> 00:31:37,110 together, and we're going to walk you through code examples, 644 00:31:37,110 --> 00:31:39,430 but we're going to need to take some time to really walk you 645 00:31:39,430 --> 00:31:41,180 through them, because you're going to see, 646 00:31:41,180 --> 00:31:42,840 wow, there's all these ideas that we've 647 00:31:42,840 --> 00:31:45,300 talked about coming together in these code examples. 648 00:31:45,300 --> 00:31:48,160 So there is, we have lectures. 649 00:31:48,160 --> 00:31:51,520 If you see in the software that's available in your LLGrid 650 00:31:51,520 --> 00:31:54,430 accounts, all the lectures are there already, 651 00:31:54,430 --> 00:31:57,740 and I think we have lectures one through seven. 652 00:31:57,740 --> 00:31:59,940 So seven is the eighth lecture. 653 00:31:59,940 --> 00:32:02,010 Or, I'm sorry, zero through seven, so seven 654 00:32:02,010 --> 00:32:05,602 is the eighth lecture, and that will 655 00:32:05,602 --> 00:32:06,810 be a kind of a short lecture. 656 00:32:06,810 --> 00:32:10,940 And then the ninth lecture will be just going through examples. 657 00:32:13,550 --> 00:32:17,770 So we believe in this whole philosophy 658 00:32:17,770 --> 00:32:23,960 of taking a linear algebraic point of view 659 00:32:23,960 --> 00:32:27,050 with graphs so much that we even wrote a book about it. 660 00:32:27,050 --> 00:32:31,120 So you have all been given copies of this book. 661 00:32:31,120 --> 00:32:36,570 This really gives you the philosophy and the power 662 00:32:36,570 --> 00:32:39,235 of if you think about graphs using the techniques, 663 00:32:39,235 --> 00:32:42,430 so I'd highly encourage you to flip through it. 664 00:32:42,430 --> 00:32:44,750 There are certain parts of this, certain sections here 665 00:32:44,750 --> 00:32:47,710 that are really almost taken directly out of this book, 666 00:32:47,710 --> 00:32:50,680 but a lot of it is supplementary material. 667 00:32:50,680 --> 00:32:53,820 And what you'll discover is that this course is really 668 00:32:53,820 --> 00:32:56,560 the bridge to these techniques. 669 00:32:56,560 --> 00:32:58,890 This course allows you to go from kind 670 00:32:58,890 --> 00:33:02,200 of this mess of unstructured words 671 00:33:02,200 --> 00:33:05,640 and other types of things, and then 672 00:33:05,640 --> 00:33:07,985 put them into this nice linear algebraic format, 673 00:33:07,985 --> 00:33:11,760 so that is now you can do these techniques that 674 00:33:11,760 --> 00:33:13,420 are described in the book. 675 00:33:13,420 --> 00:33:16,200 So this is almost a bridge that and I'd certainly 676 00:33:16,200 --> 00:33:19,050 encourage you to look at that and to ask me 677 00:33:19,050 --> 00:33:20,510 questions about it at any time. 678 00:33:23,330 --> 00:33:27,860 So now I'm going to come back to our original example, which 679 00:33:27,860 --> 00:33:29,694 was the cyber thing and kind of walk us 680 00:33:29,694 --> 00:33:31,360 through that in a little bit more detail 681 00:33:31,360 --> 00:33:32,985 and maybe highlight a few of the things 682 00:33:32,985 --> 00:33:34,654 that I've talked about already. 683 00:33:37,260 --> 00:33:43,499 So here's a very standard data processing pipeline. 684 00:33:43,499 --> 00:33:45,040 So we want to do is we have raw data, 685 00:33:45,040 --> 00:33:48,760 in this case our raw cyber data, which is a series of records. 686 00:33:48,760 --> 00:33:52,070 And we need to convert that into a set of vertices and edge 687 00:33:52,070 --> 00:33:58,620 lists from which we can then do graphs and graphs analysis. 688 00:33:58,620 --> 00:34:03,050 And, as you'll see, typically what you do 689 00:34:03,050 --> 00:34:08,090 is you write parsers to convert this data from some raw format 690 00:34:08,090 --> 00:34:10,659 to some format that is sort of the next step 691 00:34:10,659 --> 00:34:12,370 in your processing chain. 692 00:34:12,370 --> 00:34:14,800 And then we convert these edge lists 693 00:34:14,800 --> 00:34:23,040 into adjacency matrices, which is how we view our graphs. 694 00:34:23,040 --> 00:34:25,530 One word on adjacency matrix, let me go back here. 695 00:34:28,719 --> 00:34:32,219 This is a graph, [LAUGHS] for those of you who don't know. 696 00:34:32,219 --> 00:34:35,800 The set of vertices and edges, OK? 697 00:34:35,800 --> 00:34:38,590 And this is an adjacency matrix of that graph. 698 00:34:38,590 --> 00:34:41,590 Basically every single row is a vertex, 699 00:34:41,590 --> 00:34:43,650 every single column is a vertex, and if an edge 700 00:34:43,650 --> 00:34:47,159 exists we put a dot here. 701 00:34:47,159 --> 00:34:49,489 There's a formal duality between graph theory, 702 00:34:49,489 --> 00:34:51,360 the fundamental operation of graph theory 703 00:34:51,360 --> 00:34:55,030 is what's called breadth-first search, which is shown here, 704 00:34:55,030 --> 00:34:57,440 giving a starting vertex. 705 00:34:57,440 --> 00:35:00,150 You know, traverse its edges to its neighbors, 706 00:35:00,150 --> 00:35:03,870 and we have the same thing going on here. 707 00:35:03,870 --> 00:35:06,180 Given a starting vertex in a vector, 708 00:35:06,180 --> 00:35:10,010 we do a matrix multiply to identify the neighbors. 709 00:35:10,010 --> 00:35:14,781 And so at the deepest, most fundamental level, graph theory 710 00:35:14,781 --> 00:35:16,030 and linear algebra are linked. 711 00:35:16,030 --> 00:35:18,071 Because the fundamental operation of graph theory 712 00:35:18,071 --> 00:35:21,630 is breadth-first search and the fundamental operation 713 00:35:21,630 --> 00:35:25,267 of linear algebra is vector matrix multiply. 714 00:35:28,680 --> 00:35:31,390 So we wish to form these adjacency matrices so we 715 00:35:31,390 --> 00:35:34,177 can take advantage of that. 716 00:35:34,177 --> 00:35:35,760 Here's a little bit more detail on how 717 00:35:35,760 --> 00:35:39,005 we're going to use this technology D4M to do that. 718 00:35:39,005 --> 00:35:41,890 So we have our raw data, we convert the raw data 719 00:35:41,890 --> 00:35:45,200 into a CSV file-- stands for a comma separated value files. 720 00:35:45,200 --> 00:35:48,814 It's kind of the default format of spreadsheets and tables. 721 00:35:48,814 --> 00:35:50,480 It's become very popular over the years. 722 00:35:50,480 --> 00:35:52,688 I remember when it first came out 20 or 30 years ago, 723 00:35:52,688 --> 00:35:53,686 it wasn't that popular. 724 00:35:53,686 --> 00:35:55,560 You know, it's like people did-- but now it's 725 00:35:55,560 --> 00:35:58,805 become extraordinarily popular file format. 726 00:35:58,805 --> 00:36:02,090 It really is ideally suited for this kind of data. 727 00:36:02,090 --> 00:36:05,930 So we convert that into CSV files. 728 00:36:05,930 --> 00:36:08,380 We then take those CSV files, read them 729 00:36:08,380 --> 00:36:11,710 in using our D4M technology to insert them 730 00:36:11,710 --> 00:36:14,290 into a distributed database. 731 00:36:14,290 --> 00:36:16,510 We can then query that distributed database to get 732 00:36:16,510 --> 00:36:18,760 these associative arrays. 733 00:36:18,760 --> 00:36:20,670 And then we can, from the associative arrays, 734 00:36:20,670 --> 00:36:24,320 now we have the full power to do our graph algorithms. 735 00:36:24,320 --> 00:36:28,050 So D4M really helps you in going, you know, 736 00:36:28,050 --> 00:36:30,880 these few steps and setting the table so that you can then you 737 00:36:30,880 --> 00:36:31,735 graph algorithms. 738 00:36:31,735 --> 00:36:33,610 And you can even then do the graph algorithms 739 00:36:33,610 --> 00:36:36,090 without even using D4M at all. 740 00:36:36,090 --> 00:36:39,410 You can just use regular linear algebraic operations in MATLAB 741 00:36:39,410 --> 00:36:40,410 just fine. 742 00:36:40,410 --> 00:36:42,100 But this is that bridge, that connector. 743 00:36:42,100 --> 00:36:44,590 In fact, I highly recommend people do that. 744 00:36:44,590 --> 00:36:47,210 I highly recommend you use D4M for only the parts 745 00:36:47,210 --> 00:36:49,750 of your problem that it's good for, 746 00:36:49,750 --> 00:36:52,310 and use other technologies, you know, 747 00:36:52,310 --> 00:36:56,340 like MATLAB, ordinary matrices, and other types of things. 748 00:36:56,340 --> 00:36:58,640 That's good for you, you'll get better performance. 749 00:36:58,640 --> 00:37:02,260 It's good for me because, if you have problems, 750 00:37:02,260 --> 00:37:04,504 you'll bother somebody else, and not us. 751 00:37:04,504 --> 00:37:06,170 So that's good, it's good for everybody. 752 00:37:11,110 --> 00:37:15,570 So an example of kind of what this looks like. 753 00:37:15,570 --> 00:37:19,310 Here's a proxy log that we got from the data. 754 00:37:19,310 --> 00:37:24,900 It shows a web, essentially a web log, the first thing we do 755 00:37:24,900 --> 00:37:29,700 is essentially convert this into a CSV file. 756 00:37:29,700 --> 00:37:33,500 So basically, each entry here gets a column, 757 00:37:33,500 --> 00:37:36,810 we have an ID for the actual entry. 758 00:37:36,810 --> 00:37:40,410 In this case it's a source IP, a sever IP, a time stamp. 759 00:37:40,410 --> 00:37:45,076 And then, the actual line of text that's associated with it. 760 00:37:45,076 --> 00:37:46,450 And this can go on and on and on. 761 00:37:46,450 --> 00:37:48,680 We can have many different types here. 762 00:37:48,680 --> 00:37:56,050 And if you are a standard database person, 763 00:37:56,050 --> 00:37:59,420 put your SQL hat, which is sort of the standard database, on. 764 00:37:59,420 --> 00:38:01,060 You would just take that data, and you 765 00:38:01,060 --> 00:38:04,090 would make a table that would have these columns, 766 00:38:04,090 --> 00:38:06,510 and you would insert them and away you would go up. 767 00:38:06,510 --> 00:38:07,640 OK? 768 00:38:07,640 --> 00:38:12,990 But what we discover is that-- the challenge becomes when you 769 00:38:12,990 --> 00:38:15,796 want to look up data quickly. 770 00:38:15,796 --> 00:38:16,920 Or you want to insert data. 771 00:38:16,920 --> 00:38:18,794 Let's say you have an enormous volume of data 772 00:38:18,794 --> 00:38:20,850 that you want to insert and you want 773 00:38:20,850 --> 00:38:24,610 to go to look up any, say, IP address, or any time stamp, 774 00:38:24,610 --> 00:38:26,230 or something like that. 775 00:38:26,230 --> 00:38:30,270 What you discover is that SQL might give you 776 00:38:30,270 --> 00:38:32,722 good performance, but there are databases out there 777 00:38:32,722 --> 00:38:34,930 and technology that will give you better performance. 778 00:38:38,290 --> 00:38:42,400 And these databases are called triple stores, 779 00:38:42,400 --> 00:38:48,330 because they store all the data as-- anyone? 780 00:38:48,330 --> 00:38:49,660 Triples, yes. 781 00:38:49,660 --> 00:38:52,550 They store all the data as triples, 782 00:38:52,550 --> 00:38:55,980 which means we will have a row, and a column, 783 00:38:55,980 --> 00:38:57,860 and a value associated with everything. 784 00:38:57,860 --> 00:39:00,930 So what we do is we take our dense data, 785 00:39:00,930 --> 00:39:04,680 and we do what we call, create an exploded table. 786 00:39:04,680 --> 00:39:08,690 So essentially we have the log ID, let me append that, 787 00:39:08,690 --> 00:39:15,025 and we take the column name and we append its value to it, 788 00:39:15,025 --> 00:39:16,650 and server IP. 789 00:39:16,650 --> 00:39:19,210 And this is what you would call an exploded table. 790 00:39:19,210 --> 00:39:21,890 Now, if you're an old SQL person like me, 791 00:39:21,890 --> 00:39:24,180 this schema will immediately give you hives, 792 00:39:24,180 --> 00:39:25,770 and other types of allergic reactions, 793 00:39:25,770 --> 00:39:27,570 and you will want to go to the hospital, 794 00:39:27,570 --> 00:39:31,750 because you're creating an enormous number of columns. 795 00:39:31,750 --> 00:39:35,420 And in SQL you where you can dynamically create columns, 796 00:39:35,420 --> 00:39:37,550 but at your peril. 797 00:39:37,550 --> 00:39:40,390 Basically, you can create columns dynamically, 798 00:39:40,390 --> 00:39:42,060 but the database reserves the right 799 00:39:42,060 --> 00:39:45,510 to recopy the entire table, and reformat the entire table, 800 00:39:45,510 --> 00:39:47,392 if it so requires. 801 00:39:47,392 --> 00:39:49,350 Well triple stores don't have this requirement. 802 00:39:49,350 --> 00:39:52,650 Triple stores only store the non-blank entries. 803 00:39:52,650 --> 00:39:55,780 And so they're perfectly happy with as many columns you want. 804 00:39:55,780 --> 00:39:59,700 If you have a billion rows, 10 billion columns, completely 805 00:39:59,700 --> 00:40:01,760 fine with that. 806 00:40:01,760 --> 00:40:02,988 So that's a powerful thing. 807 00:40:05,740 --> 00:40:07,360 So I think we talked about that. 808 00:40:07,360 --> 00:40:10,170 So we use the indices as rows, and we 809 00:40:10,170 --> 00:40:15,665 create unique type-value pairs for every single column. 810 00:40:19,820 --> 00:40:28,770 Now, by itself, this schema does you absolutely no good. 811 00:40:28,770 --> 00:40:32,580 You basically just made your life harder, 812 00:40:32,580 --> 00:40:37,930 because, by itself, most database tables, 813 00:40:37,930 --> 00:40:41,576 database systems, including triple stores, 814 00:40:41,576 --> 00:40:42,950 are going to have an orientation. 815 00:40:42,950 --> 00:40:45,158 They're going to be row oriented, or column oriented. 816 00:40:45,158 --> 00:40:47,580 And in this case, the databases we will be talking about 817 00:40:47,580 --> 00:40:49,530 are almost always row oriented. 818 00:40:49,530 --> 00:40:52,990 Which means they can get a row in constant time, 819 00:40:52,990 --> 00:40:56,870 give it a row key like this and you can look it up 820 00:40:56,870 --> 00:40:57,550 very quickly. 821 00:40:57,550 --> 00:40:59,450 And it will give it back to you very quickly. 822 00:40:59,450 --> 00:41:00,830 That's what it's design to do. 823 00:41:00,830 --> 00:41:03,550 And you can insert enormous volumes of data, 824 00:41:03,550 --> 00:41:06,650 and they will preserve that contract. 825 00:41:06,650 --> 00:41:09,010 You can be inserting millions of entries per second, 826 00:41:09,010 --> 00:41:11,530 and you can still look up stuff in milliseconds. 827 00:41:11,530 --> 00:41:15,100 So very powerful technology, but all we've done so far 828 00:41:15,100 --> 00:41:19,010 has made it so that the format of my data 829 00:41:19,010 --> 00:41:21,450 is in this funny format, coming out the other side. 830 00:41:21,450 --> 00:41:24,240 I've just made myself harder. 831 00:41:24,240 --> 00:41:34,380 Well, there is a payoff, which is we also store 832 00:41:34,380 --> 00:41:36,670 the transpose of the table. 833 00:41:36,670 --> 00:41:39,540 So, to anybody who is used to sparse linear algebra, 834 00:41:39,540 --> 00:41:41,990 this is an old ancient technique. 835 00:41:41,990 --> 00:41:44,320 We almost always store A and A transpose, 836 00:41:44,320 --> 00:41:46,230 because in sparse linear algebra we'll 837 00:41:46,230 --> 00:41:50,260 either have a row oriented or column based store. 838 00:41:50,260 --> 00:41:52,370 And then we just want to keep one around 839 00:41:52,370 --> 00:41:54,731 because certain operations are going to be faster 840 00:41:54,731 --> 00:41:56,480 with the transpose, and certain operations 841 00:41:56,480 --> 00:41:58,380 are going to be faster with the matrix. 842 00:41:58,380 --> 00:42:00,900 We do the exact same thing here, and combined 843 00:42:00,900 --> 00:42:03,410 with our exploded schema, now every single one 844 00:42:03,410 --> 00:42:05,600 of those columns is a row. 845 00:42:05,600 --> 00:42:08,600 And now we have indexed every single string 846 00:42:08,600 --> 00:42:10,270 in the entire database. 847 00:42:10,270 --> 00:42:14,190 Once, with one simple schema of two tables. 848 00:42:14,190 --> 00:42:18,600 And not to say that all your databases will only 849 00:42:18,600 --> 00:42:24,040 have two tables, but they might be three tables or four 850 00:42:24,040 --> 00:42:26,300 tables-- is typically what we see, 851 00:42:26,300 --> 00:42:29,960 which is a huge reduction compared 852 00:42:29,960 --> 00:42:31,950 to the standard database schemas that one 853 00:42:31,950 --> 00:42:35,030 uses where it's easy to have dozens and dozens of tables, 854 00:42:35,030 --> 00:42:38,550 even 100 of tables in a well thought out schema. 855 00:42:38,550 --> 00:42:41,920 Here, essentially it's one schema to rule them all. 856 00:42:41,920 --> 00:42:43,420 One table and its pair, and you've 857 00:42:43,420 --> 00:42:45,070 indexed every single thing. 858 00:42:45,070 --> 00:42:47,302 And the cost is, you've doubled your storage. 859 00:42:47,302 --> 00:42:49,010 And every single database person that you 860 00:42:49,010 --> 00:42:52,650 talk to-- if I said in exchange for doubling your storage I 861 00:42:52,650 --> 00:42:55,950 would give you a fast index access to every single string 862 00:42:55,950 --> 00:42:57,540 in your database, they'll usually 863 00:42:57,540 --> 00:43:02,280 take that deal 100 times out of 100. 864 00:43:02,280 --> 00:43:05,000 So that's the power here of this technology, 865 00:43:05,000 --> 00:43:07,789 is you can reference enormous data sets. 866 00:43:07,789 --> 00:43:10,080 And we'll come back to this schema over and over again. 867 00:43:10,080 --> 00:43:11,530 But that's sort of the madness. 868 00:43:11,530 --> 00:43:15,030 We'll be-- as we move forward with associate arrays we always 869 00:43:15,030 --> 00:43:16,470 view things in this way. 870 00:43:16,470 --> 00:43:17,947 And I want you to know, at the end, 871 00:43:17,947 --> 00:43:19,780 there's a really good reason why we do that. 872 00:43:22,760 --> 00:43:28,380 So now we've ingested our data into the database 873 00:43:28,380 --> 00:43:29,370 and we can do queries. 874 00:43:29,370 --> 00:43:32,270 So if I have a binding to a table-- and one 875 00:43:32,270 --> 00:43:33,980 of the things in the deformed technology 876 00:43:33,980 --> 00:43:35,480 that we do that's very nice for you, 877 00:43:35,480 --> 00:43:38,210 is if you're using a table and its transpose, 878 00:43:38,210 --> 00:43:40,060 we hide that for you completely. 879 00:43:40,060 --> 00:43:41,790 If you do an insert into the table, 880 00:43:41,790 --> 00:43:44,860 it will automatically insert its transpose. 881 00:43:44,860 --> 00:43:47,516 And if you do a look-up on rows, it will know to use one table, 882 00:43:47,516 --> 00:43:48,890 and it does a look-up on columns, 883 00:43:48,890 --> 00:43:50,870 it will know to use the other table. 884 00:43:50,870 --> 00:43:53,370 So here we're doing essentially a range query. 885 00:43:53,370 --> 00:43:58,190 This just says give me all the data from this range. 886 00:43:58,190 --> 00:44:01,500 Time stamp May whatever to, it looks 887 00:44:01,500 --> 00:44:03,430 like about a 3 hour time window. 888 00:44:03,430 --> 00:44:05,140 So that's a query into the database, 889 00:44:05,140 --> 00:44:08,770 and it returns an associative array of keys. 890 00:44:08,770 --> 00:44:10,350 And here is what that looks like. 891 00:44:10,350 --> 00:44:11,724 It looks like a bunch of triples. 892 00:44:14,930 --> 00:44:17,370 I'm going to then take the rows of those keys. 893 00:44:17,370 --> 00:44:22,580 So basically I found every single row, 894 00:44:22,580 --> 00:44:24,160 within a particular time stamp, and I 895 00:44:24,160 --> 00:44:26,170 want to get the whole row. 896 00:44:26,170 --> 00:44:31,200 So I can get the row of that and pass that back into the table 897 00:44:31,200 --> 00:44:33,190 to get the actual data. 898 00:44:33,190 --> 00:44:34,790 Now you see I get the whole row. 899 00:44:34,790 --> 00:44:37,856 So I get server IP, and all that stuff, the complete list. 900 00:44:42,050 --> 00:44:44,540 Now I'm going to do a little algebra here. 901 00:44:44,540 --> 00:44:46,780 I want to create the graph. 902 00:44:46,780 --> 00:44:52,480 It's all source IPs and server IPs. 903 00:44:52,480 --> 00:44:56,440 I get to do this with this basically correlation here, 904 00:44:56,440 --> 00:44:58,610 which is essentially a matrix-matrix multiply. 905 00:44:58,610 --> 00:45:04,030 That's it, I've just constructed the entire source IP and server 906 00:45:04,030 --> 00:45:07,170 IP graph from all the data in this time window. 907 00:45:07,170 --> 00:45:09,080 That is the power of D4M. 908 00:45:09,080 --> 00:45:13,040 It allows you to do correlations on this kind of data, 909 00:45:13,040 --> 00:45:16,160 same way we do correlations using traditional techniques 910 00:45:16,160 --> 00:45:18,020 in linear algebra. 911 00:45:18,020 --> 00:45:21,320 And you can be like, "Wow." 912 00:45:21,320 --> 00:45:22,800 This is motivational at this point. 913 00:45:22,800 --> 00:45:25,390 We'll get into exactly how you do this, 914 00:45:25,390 --> 00:45:28,940 and exactly how this works in the example. 915 00:45:28,940 --> 00:45:30,980 But this is really what people use 916 00:45:30,980 --> 00:45:34,890 D4M for, is to get to this point where can do 917 00:45:34,890 --> 00:45:36,660 these types of manipulations. 918 00:45:36,660 --> 00:45:39,650 And the key to deduction theory is doing correlations. 919 00:45:39,650 --> 00:45:41,910 Correlations allow us to determine 920 00:45:41,910 --> 00:45:43,630 what the backgrounds of our data are, 921 00:45:43,630 --> 00:45:45,600 what the clutter of our data are, 922 00:45:45,600 --> 00:45:49,200 and then we are on our way with traditional detection 923 00:45:49,200 --> 00:45:51,710 theory, and signal processing techniques, and all the things 924 00:45:51,710 --> 00:45:52,584 that we know we love. 925 00:45:52,584 --> 00:45:57,360 So this is kind of that connection made manifest. 926 00:45:57,360 --> 00:46:00,650 And so here you can see, we now have an associative array 927 00:46:00,650 --> 00:46:03,870 or an adjacency matrix that eventually, essentially 928 00:46:03,870 --> 00:46:07,106 sorts IPs and their server IPs. 929 00:46:07,106 --> 00:46:08,870 You can actually plot that and that's 930 00:46:08,870 --> 00:46:12,910 how we get the adjacency matrix of this graph, 931 00:46:12,910 --> 00:46:15,480 G. This is what it is. 932 00:46:15,480 --> 00:46:17,960 That's how we constructed-- this is the code we actually 933 00:46:17,960 --> 00:46:19,484 used to construct that graph. 934 00:46:22,860 --> 00:46:23,840 So moving on here. 935 00:46:23,840 --> 00:46:26,750 So this just shows the kind of the whole thing here. 936 00:46:26,750 --> 00:46:29,990 We had a whole week's worth of proxy data. 937 00:46:29,990 --> 00:46:32,670 This just shows you the timing, that it took about two hours 938 00:46:32,670 --> 00:46:38,330 to ingest it, and about three hours to do this processing. 939 00:46:38,330 --> 00:46:43,500 100 million proxy logs, 44.5 billion triples. 940 00:46:43,500 --> 00:46:45,390 This is big data. 941 00:46:45,390 --> 00:46:47,630 You know, really this is big data. 942 00:46:47,630 --> 00:46:50,860 This is the kind of thing that you can do. 943 00:46:50,860 --> 00:46:53,030 Whatever anybody else is doing, you 944 00:46:53,030 --> 00:46:56,372 should be able to do on larger scales than other people 945 00:46:56,372 --> 00:46:57,288 using this technology. 946 00:46:59,920 --> 00:47:03,370 So that brings us to the end of the lecture. 947 00:47:03,370 --> 00:47:07,112 So just to summarize, big data of this type 948 00:47:07,112 --> 00:47:09,320 is found in a wide range of areas, document analysis, 949 00:47:09,320 --> 00:47:12,520 computer networks, DNA sequencing. 950 00:47:12,520 --> 00:47:14,300 There's a kind of a gap between the tools 951 00:47:14,300 --> 00:47:16,680 that people traditionally use to do this problem, 952 00:47:16,680 --> 00:47:20,420 and D4M, this technology, fills this gap. 953 00:47:20,420 --> 00:47:24,260 So, with that, are there any questions 954 00:47:24,260 --> 00:47:27,730 before we get onto the examples slide? 955 00:47:27,730 --> 00:47:28,258 Yes? 956 00:47:28,258 --> 00:47:29,174 AUDIENCE: [INAUDIBLE]? 957 00:47:34,799 --> 00:47:35,590 JEREMY KEPNER: Yes. 958 00:47:35,590 --> 00:47:36,506 AUDIENCE: [INAUDIBLE]? 959 00:47:39,740 --> 00:47:43,620 JEREMY KEPNER: So the data-- 960 00:47:43,717 --> 00:47:45,300 AUDIENCE: Can you repeat the question? 961 00:47:45,300 --> 00:47:46,758 JEREMY KEPNER: Oh yes, the question 962 00:47:46,758 --> 00:47:50,070 was does it generalize to multi-dimensions, I guess? 963 00:47:50,070 --> 00:47:54,935 And so, there may be kind of two questions there. 964 00:47:57,870 --> 00:48:02,400 If you had-- if I just correlated all the data, 965 00:48:02,400 --> 00:48:06,300 took that table out, and just fully correlated it, 966 00:48:06,300 --> 00:48:09,110 I would end up with a giant matrix 967 00:48:09,110 --> 00:48:12,890 of blocks, each one representing different correlations. 968 00:48:12,890 --> 00:48:16,560 So the source IP, server IP correlation, the source IP, 969 00:48:16,560 --> 00:48:18,380 source IP correlation, which would be empty 970 00:48:18,380 --> 00:48:20,280 because it's a bipartite graph. 971 00:48:20,280 --> 00:48:21,790 And then all the other correlations 972 00:48:21,790 --> 00:48:25,160 would form different blocks, and that type of thing. 973 00:48:25,160 --> 00:48:28,284 So in that sense we support multi-dimensions. 974 00:48:28,284 --> 00:48:30,450 Whether we support-- but supporting multi-dimensions 975 00:48:30,450 --> 00:48:33,180 in terms of tensors, OK? 976 00:48:33,180 --> 00:48:35,860 Well, generally in sparse linear algebra, 977 00:48:35,860 --> 00:48:39,770 there's very minimal support for higher dimensions. 978 00:48:39,770 --> 00:48:41,930 I'll put a little shout out to my friend Tammy 979 00:48:41,930 --> 00:48:45,630 Kolda, who has a technology called the Sparse Tensor 980 00:48:45,630 --> 00:48:46,270 Toolbox. 981 00:48:46,270 --> 00:48:51,730 She's a scientist at Sandia and so I encourage you to do that. 982 00:48:51,730 --> 00:48:54,050 But all of our stuff is in two dimensions 983 00:48:54,050 --> 00:48:55,840 from a linear algebra perspective. 984 00:48:55,840 --> 00:48:57,820 Also linear algebraically, the math 985 00:48:57,820 --> 00:49:01,280 there is a little bit more common than when 986 00:49:01,280 --> 00:49:02,710 you get into the tensor. 987 00:49:02,710 --> 00:49:05,137 And then you have these kind of weird products 988 00:49:05,137 --> 00:49:07,470 that you're starting, which you can define, but are just 989 00:49:07,470 --> 00:49:10,020 not commonly taught. 990 00:49:10,020 --> 00:49:12,382 So and you might be like, well you're really 991 00:49:12,382 --> 00:49:13,340 losing something there. 992 00:49:13,340 --> 00:49:15,200 I'd really like to have multiple dimensions. 993 00:49:15,200 --> 00:49:18,780 Well, when we've looked at it, the fact of the matter 994 00:49:18,780 --> 00:49:23,560 is the sparsity always increases as you go to higher dimensions. 995 00:49:23,560 --> 00:49:25,520 And the sparsity is sparse enough. 996 00:49:25,520 --> 00:49:27,310 So typically the kinds of data that we 997 00:49:27,310 --> 00:49:30,200 will have-- if you have n vertices, 998 00:49:30,200 --> 00:49:33,840 you'll have maybe 10 times n edges. 999 00:49:33,840 --> 00:49:38,230 So the sparsity is extremely low. 1000 00:49:38,230 --> 00:49:39,900 We added a dimension. 1001 00:49:39,900 --> 00:49:42,050 This sort of internal data structure costs 1002 00:49:42,050 --> 00:49:44,820 associated with adding that dimension would not 1003 00:49:44,820 --> 00:49:45,570 reap the benefits. 1004 00:49:45,570 --> 00:49:47,920 Because now your sparsity is even lower. 1005 00:49:47,920 --> 00:49:50,630 And so that's generally not something 1006 00:49:50,630 --> 00:49:52,034 that is commonly used. 1007 00:49:52,034 --> 00:49:53,700 But, if you're interested exploring that 1008 00:49:53,700 --> 00:49:56,210 I certainly encourage people to download that toolbox. 1009 00:49:56,210 --> 00:50:01,694 Very well written software, and people can explore that. 1010 00:50:01,694 --> 00:50:02,360 Other questions? 1011 00:50:02,360 --> 00:50:03,980 That was an outstanding question. 1012 00:50:03,980 --> 00:50:07,514 And other questions at this time? 1013 00:50:07,514 --> 00:50:09,430 All right well why don't we stop the recording 1014 00:50:09,430 --> 00:50:13,140 right now, and then we will switch-- why don't we 1015 00:50:13,140 --> 00:50:14,980 take like a five minute break. 1016 00:50:14,980 --> 00:50:17,540 I think there's some water out there and stuff for people. 1017 00:50:17,540 --> 00:50:21,040 And then we'll get back to the examples, which probably 1018 00:50:21,040 --> 00:50:22,280 will be a lot more fun. 1019 00:50:22,280 --> 00:50:27,760 So hopefully wake you up after these new graphs.