The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Thank you all for coming. I'm glad to see that our attendance after the first lecture has not dropped dramatically--a little bit. That's to be expected. But so far I haven't managed to scare you all off. So this is the second lecture of the Signal Processing on Databases course.

The format of the lecture will be the same as the last one. We'll have some slide material for the first half, then we'll take a short break, and then there'll be some example demo programs. And so with that, we'll get into it. All of this material is available in your LLGrid accounts. In the Tools directory you get a directory like this, and the documents and the slides are right here. So we'll go to lecture 01.

For those of you who weren't there, or to recap for those who will be viewing this on the web later, the title of the course is Signal Processing on Databases--two terms that we often don't see together. The signal processing element is really alluding to detection theory, and to the deeper linear algebra basis of detection theory. The databases element is really alluding to unstructured data--data that we often represent with strings and words and sort of sparse relationships. And so this course is really bringing together these two normally quite separate ideas, because there's a lot of need to do so.

So again, this is the first course ever taught on this topic. There really isn't a prior model for how to teach it. So the approach that we're taking is to have a lot of examples that show how people use this technology.
And we'll also have some theory to go along with those examples. So hopefully, through some combination of those different types of teaching vehicles, each of you will be able to find something that is helpful.

The outline of today's lecture is that I'm going to go through an example where we analyzed some citation data. Citation data is one of the most common examples used in the field because it's so readily available and so easy to get, and there are no issues associated with it. So you have lots of academics analyzing themselves, which is, of course, very symmetric, right? And so there's a lot of work on citation data. We are going to be looking at some citation data here, but doing it using the D4M technology that you all have access to, in a way that we think is more efficient, both in terms of the performance you'll get on your analytics and in terms of the total time it will take you. Of course, it does require a certain level of comfort with linear algebra. That's sort of the mathematical foundation of the class.

So we're going to get right into it. I talked in the previous lecture about these special schemas that we set up for processing this kind of data, which allow us to take essentially any kind of data and pull it into one basic architecture that is really a good starting point for analysis.

So here's a very classic type of data that you might get in citation data. In this particular data set, we have a universal identifier for each citation--every single citation is just given some index in the data set. And we'll have an author, a doc ID, and then a reference doc ID. The reference doc ID is basically the document ID of the document that's being referenced, and the doc ID is the one that the reference is coming from. That may seem a little opposite, but that's sort of what it means. And you can have absences of both.
You can have references that refer to another document for which no document ID has been constructed. You can also have references within a document that, for some reason or other, doesn't have a document ID itself. I think the important thing to understand here is that this kind of incompleteness--in what might be as clean a data set as you would expect, the scientific citation data set--is very normal.

You never get data sets that are even remotely perfect. If you get data sets that feel even about 80% solid, you're doing very well. It's very, very common to have data sets that are incomplete, that have mistakes, null values, mislabeled fields, et cetera. It is just a natural part of the process. Even though more and more of these sets are created by machines, and you would hope that would eliminate some of those issues, well, the machines have the same problems. And they can make mistakes on a much larger scale, more quickly, than humans can. So incompleteness definitely exists.

The data arrives in sort of a standard SQL or tabular format, in which you have records, columns, and then values. What we generally tend to do is pull it into what we call the exploded schema, where we still use the records--those are the row keys--and we create column keys, which are essentially the column label and the column value appended together. So every single unique value will have its own column, and we build it out that way. This creates a very, very large, sparse structure. Once we store the transpose of that in our database as well, we get essentially constant-time access to any string in the database. So we can now look up any author very quickly, any document very quickly, any record very quickly. And so it's a very natural schema for handling this kind of data.
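To make the exploded schema concrete, here is a minimal plain-MATLAB sketch--not the actual D4M Assoc class, and the record IDs, field names, and values are made up for illustration--showing how a few tabular entries become one big sparse matrix with one column per unique label/value pair, plus its stored transpose.

    % Minimal sketch of the exploded schema, with made-up records and fields.
    recs = {'rec1','rec1','rec2','rec2'};            % row keys: one per record
    cols = {'author|j.smith','docid|doc7', ...       % column keys: label|value
            'author|a.jones','refid|doc7'};
    rowU = unique(recs);  colU = unique(cols);       % dictionaries of unique keys
    [~,i] = ismember(recs,rowU);
    [~,j] = ismember(cols,colU);
    E  = sparse(i,j,1,numel(rowU),numel(colU));      % large, sparse incidence matrix
    Et = E.';                                        % stored transpose: column keys can
                                                     % be looked up as fast as row keys

In real D4M usage the row and column keys stay as strings inside an associative array, and the table pair lives in the database; the sketch only shows the shape of the thing.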
Typically, we just hold the value 1 as the actual value in the database. We can put other numbers in there, but it generally should be a number that you would never want to search on. In many of the databases we're working with, you can look up a row key--or, with this schema, a column key--very quickly, but looking up a value requires a complete traversal of the table, which is, by definition, very expensive. So there's definitely information you can put in there, but it should be information you would never really want to look up.

In addition, this particular data set has some fields that are quite long. We might have the fully formatted reference, which in a journal article can be quite long--if it's got the full name and the journal volume and maybe all the authors, that can be quite long, and it's very unstructured. We might have the title. We might have the abstract. These are longer blobs of data from which the other tables have essentially already extracted all the key information. But often we still want access to this information as a complete, uninterrupted piece of data. If you actually find a record, you might say, well, I'd like to see the abstract formatted exactly as it was in the journal. A lot of times when you pull this stuff out and do keywords, you lose punctuation, capitalization, and other things that make it easier to read but don't really help you with indexing. So we actually recommend that that kind of information be stored in a completely separate table--a traditional table, which would just be some kind of row key and then the actual reference as the value. You're never going to search the contents of this table, but it allows you to at least have access to the data.
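As a rough sketch of that separate raw-text table--using a plain MATLAB containers.Map as a stand-in for a real database table, with invented keys and text--the long fields are stored whole, keyed by record ID, and never mixed into the searchable exploded table.

    % Hypothetical raw-text side table: long, unsearchable blobs keyed by record ID.
    rawText = containers.Map();
    rawText('rec1') = 'J. Smith, "A Hypothetical Paper," Journal of Examples, vol. 1, 2001.';
    rawText('rec2') = 'Abstract: this made-up abstract is kept exactly as formatted.';

    disp(rawText('rec1'))   % fast lookup by row key; the contents are never searched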
So this is a small addition to our standard schema: we have the exploded schema, with all the information we would ever want to search on, plus its transpose, which gives us very fast search. And you may have an additional table which just stores the data in a raw format separately. You can look it up quickly, but it won't get in the way of any of your searches. The problem is that if you're doing searches and you have very large fields mixed in with very small fields, it can become a performance bottleneck to handle them at the exact same time.

Another table that we used in this data set essentially indexes all the words in the title and the abstract, and their N-grams. The 1-gram would just be the individual words in a title: if a word appeared in a title, you would create a new column for that word with respect to that reference. A 2-gram would be word pairs, a trigram would be all words that occur together in threes, and on and on. Often for text analytics you want to store all of these. So you might have an input data set with various identifiers--a title that consists of a set of words, an abstract that consists of a set of words. And then typically, for the 1-gram of the title word a, we might format it like this, and hold as the value where that word appeared in the document. This allows you to come back and reconstruct the original documents.

That's a perfectly good example of something that we don't tend to need to search on--the exact position of a word in a document. We don't really care that the word was in the eighth position or the 10th position or whatever. So we can just create a list of all of these position values. That way, if you later want to reconstruct the original text, you can do that. Between a column and its positions, you have enough information to do that.
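Here is a rough sketch of that idea in plain MATLAB; the 'title1|' and 'title2|' column-naming convention is an assumption made up for illustration, not necessarily the exact D4M convention. Each 1-gram and 2-gram of a toy title becomes a column key, and the stored value is the word position, so the original order can be rebuilt.

    ttl  = {'sparse','matrix','methods'};     % toy title, already split into words
    cols = {};  vals = [];
    for k = 1:numel(ttl)                      % 1-grams: one column per word
        cols{end+1} = ['title1|' ttl{k}];
        vals(end+1) = k;                      % value = position of the word
    end
    for k = 1:numel(ttl)-1                    % 2-grams: adjacent word pairs
        cols{end+1} = ['title2|' ttl{k} ',' ttl{k+1}];
        vals(end+1) = k;
    end
    % cols: {'title1|sparse','title1|matrix','title1|methods',
    %        'title2|sparse,matrix','title2|matrix,methods'}
    % vals: [1 2 3 1 2]  -> with the columns, enough to reconstruct the title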
So that's an example of another type of schema that we use and how we handle this type of data for doing these types of searches. And now it makes certain lookups very easy. If I wanted to say, please give me all documents that had this word d in the abstract, I would simply go over here, look up the row key, get these two columns, and then I could take those two column labels and look them up here, and I would get all the N-grams. Or I could take those row keys and look them up in any of the previous tables and get that information. So that's kind of what you can do there. I'm going over this quickly--the examples go through it in very specific detail--but this is just showing you how we tend to think about these data sets in a way that's very flexible, and showing you some of the nuances. It's not just one table; it ends up being two or three tables. But still, that really covers what you're looking for.

Again, when you're processing data like this, you're always thinking about a pipeline. And D4M, the technology we talk about here, is only one piece of that pipeline. We're not saying that the technologies we're talking about cover every single step of the pipeline. In this particular case, which I think was a 150-gigabyte data set, step one was getting a hard disk drive sent to us and copying it. It came as a giant zipped XML file, so obviously we had to uncompress it. And then from there we had to write a parser of the XML that spat it out into just a series of triples, which could then be inserted into our database the way we want. Now, D4M is very good for developing how you're going to want to parse out those triples initially. But if you're going to do high-volume parsing of data, I would really recommend using a low-level language like C.
Generally, C is an outstanding language for doing this very repetitive, monotonous parsing, it has excellent support for things like XML, and it will run about as fast as you can possibly go. Yes, you end up writing more code to do that, but it tends to be kind of a do-once type of thing--you actually do it more than once, usually, to get it right. And it's nice that our parser, which we wrote in C++, could take these several hundred gigabytes of XML data and convert them into triples in under an hour. In other environments--if you try to do this in, say, Python, or even in D4M, at high volume--it's going to take a lot longer. So it's the right tool for the job. It will take you more lines of code, but we have found that it's usually a worthwhile investment, because the parsing can take a long time.

Once we have the triples, typically what we then do is read them into D4M and construct associative arrays out of them. We talked a little bit about associative arrays last time, and I'll get to them again, but these are essentially a MATLAB data structure. And then we generally save them to files.

We always recommend that, right before you do an insert into a database, you write the data out in some sort of efficient file format. Even when you write high-performance parsers--and especially when you don't--the parsing can be a significant activity. If you write the data out right before you insert it, then if you ever need to go back and re-insert the data, it's now a very, very fast thing to do. Databases, especially during development, can get corrupted; you try to recover them, but they don't recover.
Just knowing that you have the parsed files lying around someplace, and that you can process them whenever you want, is a very useful technique. It's usually not such a large expense from a data perspective, and it lets you essentially re-insert the data at the full rate the database can handle, as opposed to redoing your parsing, which can sometimes take 10 times as long. We've definitely seen many folks doing full text parsing in Java or Python that is often 10 times slower than the actual insert into the database. So it's nice to be able to checkpoint that data. And then from there we read the files in and insert them into our database table pairs--the keys and their transpose, the N-grams and their transpose--and then the actual raw text itself.

I should also say, keeping these associative arrays around as files is very useful in its own right. The database, as we talked about before, is very good if you have a large volume of data and you want to look up a small piece of it quickly. If you want to re-traverse the entire data set--do some kind of analysis on the entire data set--that's going to take about the same time as it took to insert the data. You're going to have to pull all the data out if you want to analyze all of it; there's no magic there. So the database is very useful for looking up small pieces, but if you want to traverse the whole thing, having those files around is a very convenient thing. If you say, I'm going to traverse all the data, you might as well just begin with the files and traverse them. And that's very easy: the file system will read it in much faster than the database--the bandwidth of file systems is much higher than that of databases--and it's also very easy to handle in a parallel way. You just have different processes reading in different files. So if you're going to be doing something that traverses an entire set of data, we really recommend that you do that from files.
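A minimal sketch of that checkpoint-before-insert habit, assuming hypothetical file and variable names: save the parsed triples to a .mat file, and insert (or traverse) from the file, so a corrupted database or a full re-scan never forces you to re-parse.

    % Parsed triples from the (hypothetical) parser stage.
    r = {'rec1';'rec2'};  c = {'author|j.smith';'author|a.jones'};  v = [1;1];

    save('triples_part001.mat','r','c','v');   % checkpoint right before insert

    S = load('triples_part001.mat');           % later: reload at file-system speed
    % ... insert S.r, S.c, S.v into the database tables, or scan them directly
    % for whole-data-set analytics without touching the database at all.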
And in fact, we even see folks say, well, I might have a query that requires 3% or 4% of the data. Then we would often recommend, if you know you're going to be working on that same data set over and over again, querying it out of the database, saving it to a set of files, and then just working with those files over and over again. Now you never have to deal with the latencies or other issues associated with the database--you've completely isolated yourself from that. So databases are great; you need to use them for the right thing. But the file system is also great. And often, in the kind of algorithm development work that we do, testing and developing on the files is better than using the database as your fine-grained access mechanism.

So this particular data set had 42 million records. Just to show you how long things took using this pipeline: we uncompressed the XML file in one hour--that's just running gzip. Our parser, which converted the XML structure into triples and was written in C#, took two hours. That's very, very nice; as we debugged the parser, we could rewrite it and not have to worry about, oh my goodness, this is going to take days and days. Constructing the D4M associative arrays from the triples took about a day--that just shows you the scale there. And then we inserted the triples into Accumulo. This was a single-node database--not a powerful computer, a dual-core, circa-2006 server--but we sustained an insert rate of about 10,000 to 100,000 entries per second, which is extremely good for this particular database, which was Accumulo. Inserting the keys took about two days, inserting the text took about one day, and inserting the N-grams took about 10 days for this particular data set. We ended up not using the N-grams very much in this data set--mostly the keys.
And so there you go--that gives you an idea. If the database itself is running in parallel, you can increase those insert rates significantly, but this is the basic single-node insert performance that we tend to see. Yes, question over here.

AUDIENCE: Silly question. Is Accumulo a type of database?

JEREMY KEPNER: Ah, yes. I talked about that a little bit in the first class. Accumulo is a type of database called a triple store, or sometimes called a NoSQL database. It's a new class of database based on the architecture published in the Google Bigtable paper, which is about five or six years old. There are a number of databases that have been built with this technology. It sits on top of a file system infrastructure called Hadoop, and it's a very, very high-performance database for what it does. That's the database that we use here. And it's open source, part of the Apache project; you can download it. At the end of the class, we'll actually be doing examples working with databases. We are developing an infrastructure, as part of LLGrid, to host these databases so that you don't really have to mess around with the details of them. Our hope is to have that infrastructure ready by the last two classes so that you can actually try it out.

So the next thing I want to do, now that we've built these databases, is show you how we construct graphs from this exploded schema. Formally, this exploded schema is what you would call an incidence matrix. In graph theory, graphs generally can be represented as two different types of matrices. One is an adjacency matrix, which is a matrix where each row and column is a vertex, and if there is an edge between two vertices, there's a value associated with that--generally very sparse in the types of data sets that we work with. That is great for covering certain types of graphs.
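As a tiny illustration--the graph here is arbitrary, just to show the representation--a directed graph with numbered vertices becomes a sparse matrix where A(i,j) is nonzero when there is an edge from vertex i to vertex j.

    src = [1 1 2 3 4];                 % edge sources  (made-up 4-vertex graph)
    dst = [2 3 4 4 1];                 % edge destinations
    A   = sparse(src,dst,1,4,4);       % adjacency matrix: one nonzero per edge
    full(A)                            % tiny here, but very sparse in real data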
The adjacency matrix handles directed graphs very well. It does not handle graphs with multiple edges very well, and it does not handle graphs with hyper-edges--edges that connect multiple vertices at the same time. And the data that we have here is very much in that latter category. A citation, essentially, is a hyper-edge. Basically, one of these records connects a document with sets of people, with sets of titles, with sets of all different types of things. So it's a very, very rich edge connecting a lot of vertices, and a traditional adjacency matrix doesn't really capture it. That's why we store the data as an incidence matrix, which has essentially one row for each edge, and the columns are vertices. Then you basically have information that shows which vertices are connected by that edge. So that's how we store the data in a lossless fashion, and we have a mechanism for storing it and indexing it fairly efficiently.

That said, the types of analyses we want to do tend to very quickly get us to adjacency matrices. For those of you who are familiar with radar--and this is probably the only place where I could say that--I almost think of one as the voltage domain and the other as the power domain. You do lose information when you go to the power domain, but often, to make progress on your analytics, you have to do that. And so the adjacency matrix is a formal projection of the incidence matrix: essentially, by multiplying the incidence matrix by itself, or squaring it, you can get the adjacency matrix.

So let's look at that. Here's an adjacency matrix of this data that shows the source document and the cited document. And this just shows what we see in this data. The corner up here--if you see my mouse going up here, scream, so we'll stay in the safe area. This is the source document, and this is the cited document. This is a significant subset of this data set.
I think it's only 1,000 records--it's sort of a random sample. If I showed you the real, full data set, obviously it would just be solid blue. The source document ID is decreasing--basically, newer documents are at the bottom, older documents at the top. I assume everyone knows why we have the sharp-edged boundary here.

AUDIENCE: Can you cite future documents?

JEREMY KEPNER: You can't cite future documents, right? This is the Einstein event cone of this data set. You can't cite documents into the future, so that's why we have that boundary there. And you can almost see it here--and I don't think it's just an optical illusion--you see how it's denser here and sparser here, which just shows that as documents get published over time, they get cited less and less going into the future. I think one statistic is that if you eliminated self-citations, most published journal articles are never cited. So thank God for self-citation.

This shows the degree distribution of this data. We'll get into this in much greater detail later. But on this small subset of data, basically, for each document we count how many times it was cited--so these are documents that are cited more--and then we count how many documents have that number of citations. 10 to the 0 is basically documents that are cited once, and this shows the number of documents in this set--20,000, 30,000, 40,000--that have one citation, going on down to the one that is cited the most. This is what is called a power law distribution. Basically, it just says most things are not cited very often and some things are cited a lot. You see those fall on an approximately linear negative slope in a log-log plot. This is something that we see all the time in what we call sub-sampled data, where essentially the space of citations has not been fully sampled.
If people stopped publishing new documents and all citations were only of older things, then over time you would probably see this begin to take on a bit more of a bell-shaped curve, OK? But as long as you're in an expanding world, you will get this type of power law. A great many of the data sets that we tend to work with fall into this category. And in fact, if you work with data that is a byproduct of artificial, human-induced phenomena, this is what you should expect. If you don't see it, then you kind of want to question what's going on there--there's usually something going on there.

So those are some overall statistics. The adjacency matrix and the degree distribution are two things that we often look at. I'm going to move things forward here and talk about the different types of adjacency matrices we might construct from the incidence matrix. In general, the adjacency matrix is often denoted by the letter A. For incidence matrices, you would think we might use the letter I, but I has obviously been taken in matrix theory, so we tend to use the letter E, for edge matrix.

So I use E as the associative array representation of my edge matrix here. If I want to find the author-author correlation matrix, I can basically just do E-starts-with-author transpose times E-starts-with-author. The starts-with-author part is the part of the incidence matrix that has just the authors; I now have an associative array of just the authors, and then I square it with itself. You get, obviously, a square symmetric matrix that is dense along the diagonal, because in every single document an author appears with himself or herself. And then this just shows you which authors appear with which other authors. If two authors appear on the same article together, they get a dot. This is a very classic type of graph, the co-author graph. It's well studied and has a variety of phenomena associated with it.
And here, it's constructed very simply from this matrix-matrix multiply. I'm going to call this the inner square--we actually have a shortcut for this in D4M called [INAUDIBLE]. So [INAUDIBLE] means the matrix transposed times itself, which sort of has an inner product feel to it. No one has their name associated with this product; it seems like someone missed an opportunity there. All of the special matrices are named after Germans, so some German missed an opportunity back in the day. Maybe we can call it the inner Kepner product. That would sufficiently obscure it, like the Hadamard product, which I always have to look up.

And here is the outer product. This shows you the distribution of documents that share common authors. So it's the same idea. What I'm trying to show you is that these adjacency matrices are formally projections of the incidence matrix onto a subspace. And whether you do the inner squaring or the outer squaring product, both have valuable information. One of them shows us which authors have published together; the other one shows us which documents have common authors. Both are very legitimate things to analyze. In each case, we've lost some information by constructing it, but probably out of necessity--we have to do this to make progress on the analysis that we're doing.

Continuing, we can look at the institutions. Here's the inner squaring product on institutions, which shows you which institutions are on the same paper. And likewise, this shows us which documents share the same institution. Same thing with keywords: which pairs of keywords occur in the same document, which documents share the same keywords, et cetera. Typically, when we do this matrix multiply, if we have the value be a 1, then the value of the result will be the count.
So this would show you how many keywords are shared by a pair of documents, or, in the previous one, how many times a pair of keywords appear together in the same document. Again, valuable information for constructing analytics. And so these inner and outer Kepner products, or whatever you want to call them, are very useful in that regard.

So now we're going to take a little bit of a turn here. This is really going to lead up to the examples. We're going to come back to a point I made before and revisit it, which is the issues associated with adjacency matrices, hypergraphs, multi-edge graphs, and those types of things. I think I've already talked about this, but just to remind people: here's a directed graph, and we've numbered all the vertices. This is the adjacency matrix of that graph--basically, if two vertices have an edge, they get a dot. The fundamental operation of graph theory is what we call breadth-first search: you start at a vertex and then go to its neighbors, following the edges. The fundamental operation of linear algebra is vector-matrix multiply: if I construct a vector with a value in a particular vertex and I do the matrix multiply, I find the edges and the result is the neighbors. So you have this formal duality between the fundamental operation of graph theory, which is breadth-first search, and the fundamental operation of linear algebra. This is philosophically how we will think about that, and I think we've already shown a little bit of it when we talked about these projections from incidence matrices to adjacency matrices.

And again, to get into that a little further and bring the point home, the traditional graph theory that we are taught in graph theory courses tends to focus on what we call undirected, unweighted graphs. These are graphs with vertices where you just say whether there's a connection between vertices. You may not have any information about whether there are multiple connections between vertices, and you certainly can't do hyper-edges.
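Here is a small plain-MATLAB sketch of the duality just described, using the same kind of arbitrary toy graph as above (this is just the underlying idea, not the course demo code): one step of breadth-first search is a vector-matrix multiply against the adjacency matrix.

    src = [1 1 2 3 4];  dst = [2 3 4 4 1];
    A  = sparse(src,dst,1,4,4);     % directed adjacency matrix of a toy graph

    v0 = sparse(1,1,1,1,4);         % start vector: a single 1 at vertex 1
    v1 = v0*A;                      % one BFS step: nonzeros at vertices 2 and 3
    v2 = v1*A;                      % next step: reaches vertex 4 (value 2 = two paths)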
And undirected, unweighted graphs give you a very black and white picture of the world, which is not really what the world looks like. In fact, here's a painting, courtesy of a friend of mine who is an artist who does this type of painting--I tend to like it. And this is the same painting I showed before, but as it really looks. What you see is that this is really more representative of the true world. Just by drawing some straight lines with colors on a canvas, we have already gone well beyond what we could easily represent with undirected, unweighted graphs.

First of all, our edges in this case have color. We have five distinct colors, which is information that would not be captured in the traditional unweighted, undirected graph representation. We have 20 distinct vertices--I've labeled every intersection or ending of a line as a vertex. So we have five colors and 20 distinct vertices. We have 12 things that we really should consider to be multi-edges: essentially multiple edges that connect the same vertices, and I've labeled them by their color here. We have 19 hyper-edges: edges that fundamentally connect more than two vertices. If we look at this example here, P3 connects this vertex, this vertex, this vertex, and this vertex with one edge. We could, of course, topologically say, well, this is the same as just having an edge from here to here, or here to here to here. But that's not what's really going on--we are throwing away information if we decompose that. Finally, we often have a concept of edge ordering. From the layering of the lines, you can infer that certain edges were drawn before others, and ordering is also very important. Does anyone want to guess what the first color of this painting was? When the artist started, they painted one color first. Anyone want to guess what that color was?
AUDIENCE: Red.

JEREMY KEPNER: What?

AUDIENCE: Brown?

JEREMY KEPNER: It was an orange--it was orange. And I would never have guessed that except by talking to the artist, which I found interesting. When she was telling me how she did it, it was so counterintuitive to the perception of the work, which I think is very, very interesting.

So if you were to represent this as a standard graph, you would create 53 standard edges. This hyper-edge here, you would break up into three separate edges; this hyper-edge you would break up into two separate edges; et cetera. So you have 53 standard edges. One of the basic observations is that the standard edge representation fragments the hyper-edges, and a lot of information is lost. The digraph representation compresses the multi-edges, and a lot of information is lost. The matrix representation drops the edge labels, and a lot of information is lost. And the standard graph representation drops the edge order--again, more information loss. So we really need something better than the traditional way to do that.

The solution to this problem is to use the incidence matrix formulation, where we assign every single edge a label, a color, and the order in which it was laid down, and then record which vertices it touches. So this edge B1 is blue, it's in the second order group, and it connects these three vertices. And you see various structures appearing here from the different types of things. So this is how we would represent this data as an incidence matrix in a way that preserves all the information. That's really the power of the incidence matrix approach.

All right. So that actually brings us to the end of the first--oh, we have a question here.

AUDIENCE: So this is great, and I've worked with [INAUDIBLE] before.
But what kind of analysis can you actually do on that?

JEREMY KEPNER: So the analysis you typically do is you take projections into adjacency matrices. Yep. This way, you're preserving the ability to do all of those projections. And sometimes, to make progress when your data sets are large, you're just forced to make some projections to get started. Theoretically, one could argue that the right way to proceed is: you have the incidence matrix; you derive the analytics you want, maybe using adjacency matrices as an analytic intermediate step; but then, once you've figured out the analysis, you go back and make sure that the analytic runs directly on the raw incidence matrix itself, and save yourself the time of actually constructing the adjacency matrix. Sometimes this space is so large, though, that it can be difficult to get our minds around it. And sometimes it can be useful to just say, you know what, I really think these projections are going to be useful ones to get started with. Rather than getting lost in this forest, sometimes it's better to say, I'm just going to start by projecting--I'm going to create a few of these adjacency matrices to get started, make progress, do my analytics, and then figure out if I need to fix it from there. Because working in this space directly can be a little bit tricky. For every single analysis, it's like working in the voltage domain: you're constantly having to keep extra theoretical machinery around. And sometimes, to make progress, you're better off just going to the power domain, making progress, and maybe discovering later whether you can fix it up by doing other types of things.

Are there other questions before we come to the end of this part of the lecture? Great. OK, thank you. So we'll take a short break here, and then we'll proceed to the demo. There is a sign-up sheet here.
If you haven't signed up, please write your name down. Apparently this is extremely important for the accounting purposes of the laboratory, so I would encourage you to do that. We'll start up again in about five minutes.