The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Welcome back. We're now going to get into the demo portion of lecture 03. These demos are in the program directory, in the examples. We've now moved on from the intro into the applications, and we're going to do this with entities extracted from a data set. So we go into the entity analysis directory, and I'll start my shell here.

I'll start our MATLAB session here. I always run MATLAB from the shell; if you prefer to run it from the IDE, that's fine. And again, our D4M code runs in a variety of environments, not just MATLAB.

All right, that's started up. Before we do that, let's look at our data set. This is a nice little data set, about 2.4 megabytes of data: some Reuters documents processed through our entity extractor here at Lincoln Laboratory. So let's look at this, say, with Microsoft Excel.

All right, you can see that pretty clearly. We basically just have a row key here to read this in. We have a document column, which is the actual document the entity appears in; we can widen that. We have the entity, which happens to come out alphabetical here. We have the word position in the document, which tells you that this entity appeared in these positions in this document. And then the various types: location, person, organization, those kinds of things. So that's what we extracted from these documents. There are no verbs here. We just tell you that a person, a place, or an organization appeared in this document. It doesn't say anything about what they're doing, or anything like that.
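Schematically, each row of that CSV ties one extracted entity to one document, along these lines. The values below are purely illustrative, not actual rows from the data set:

    1,doc,entity,position,type
    1,doc0042.txt,singapore,101;187;,LOCATION
    2,doc0042.txt,michael chang,12;,PERSON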
But this just shows an example of the kind of data that we have here. So there's that.

All right. The first thing we're going to do, in the first script, entity analysis one, is read the data in.

All right, there we go. This first line just reads the file in. It's a fair amount of data, and we're turning it all into an associative array. We actually have some faster parsers that will read it in, but ReadCSV is kind of the most general purpose one, and it's fairly nice. So it's going to go; it's running through a whole bunch of stuff here, and it tries to pull together its graphics.

All right, so let's take a look at what we did here. We read the data into, essentially, an edge-list associative array. I'll just display the first few rows here so you can see. It's alphabetized: the first column is sorted lexicographically, which is always something to be aware of. That 1 is followed by 10, which is followed by 100, and then 1,000 in this sorting. Here are the documents, here's the entity, there are the positions. So it just looks like the normal table that we read in, but it is an associative array.

Now I want to pull this into our exploded schema, and this table didn't necessarily come to us in the easiest format to do that in, so there are a few tricks we have to do. We're going to go and get each document, basically. We have a row, a column, and the doc, which is the value. Whenever you do a query on an associative array, one of the powerful things about MATLAB is that you can have a variable number of output arguments, and that changes the behavior. If I had assigned this to a single output, I would have just gotten an associative array vector.
But if I give it three output arguments, it will give me the triples that went into constructing that, because sometimes that's what you want. It's also faster: otherwise, every time I take a sub-associative array, I effectively have to pull out the triples and then reconstitute the associative array, and that can take some time. This just saves you a step; it goes directly to the triples, which is nice.

So I basically say: give me the document column, give me the entity column, give me the position column. Now I have all of these, and since I know the table is dense, I know these are all lined up; the first doc string, entity string, position string, and type string all correspond. That's me exploiting things a little bit: I know this is a dense table, and I'm not going to have anything missing there.

Now I'm going to interleave the type and entity strings. I have this function called CatStr. Basically, CatStr takes one string list and another string list and interleaves them with a single-character separator. And now I can construct a new sparse matrix which has the docs for the row key, this new thing called type/entity for the columns, and the position thrown in as the value. And that is the exploded schema we just talked about, done for us in a couple of lines. There are other functions that do that as well.

And since I don't want to repeat this, because it took a little while to read in, I'm just going to save it as a MATLAB binary, which will read in very quickly from now on. Which is nice. So I can save it. There you go.

Just to show you, if I show you the first entry here in the row, this shows you the different types of columns that were in that data set: the row key here, the various columns, and the locations of that word within each document.
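To make those steps concrete, here is a minimal sketch of the read-and-explode sequence, using ReadCSV, CatStr, and Assoc as they're named in the D4M distribution; the file names and the '/' separator are my assumptions:

    % Read the CSV into a dense edge-list associative array (file name assumed).
    E = ReadCSV('Entity.csv');

    % Three output arguments return the triples directly, skipping the
    % sub-associative-array reconstruction step.
    [r, c, doc]      = E(:, 'doc,');
    [r, c, entity]   = E(:, 'entity,');
    [r, c, position] = E(:, 'position,');
    [r, c, type]     = E(:, 'type,');

    % Interleave type and entity with a single-character separator,
    % giving column keys like 'LOCATION/new york,'.
    typeEntity = CatStr(type, '/', entity);

    % Exploded schema: document x type/entity, with position as the value.
    E = Assoc(doc, typeEntity, position);

    % Save as a MATLAB binary so later scripts can reload it quickly.
    save('Entity.mat', 'E');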
If I display it, I see that I ended up creating an associative array, and it consists of almost 10,000 unique documents and 3,657 unique entities in that data set.

Then we can take a look at that: we can do spy of the transpose. If we look at it here and zoom in, you can see different types of structure. Click on one of the entries, and it tells us that the location Singapore appeared in this document in these two word positions.

So that's typically about how many lines it takes to take a basic CSV file and cast it into an associative array, so you can start doing real work. That's typically what we see. If you find it's getting a lot longer than that, that probably means there's a trick you're not quite getting. And there again, we're always happy to help people with this very important first step: taking your data and getting it into this format so that you can do subsequent analysis on it.

All right, so there's that. Let's go on to the next example here. We're going to do some statistics on this.

See, that took a lot less time: loading it in from a binary is much, much faster. Faster, in fact, than if you put it in a database and read it out of the database. It's much faster to read a file, so we always encourage people to do that.

This just displays the size; that's what we had in the original data. And nnz shows the number of entries.

All right. I now want to undo what I did before: convert it back into a dense table, ripping those type/entity columns back apart, because I just want to count how many locations, organizations, and people I have.
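Before looking at how the script does it, note that a quick version of this count can be read straight off the exploded columns, since every column key begins with its type. The exact type spellings here are assumptions:

    % Count entries of each type via prefix queries on the exploded schema.
    nLoc  = nnz(E(:, StartsWith('LOCATION/,')));
    nOrg  = nnz(E(:, StartsWith('ORGANIZATION/,')));
    nPer  = nnz(E(:, StartsWith('PERSON/,')));
    nTime = nnz(E(:, StartsWith('TIME/,')));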
So we have this function here called col2type, and another function, whose precise name I forget, that does the reverse. Basically: give me an associative array, give me a delimiter for the columns, and I'm going to rip the column back apart, stick the entity back into the value position, and essentially make it dense again. So we do that.

We then throw away the actual values. We do this dblLogi, which converts everything to ones, and then we can sum along the rows. So we collapsed along rows, and now we have a count. It tells me that in this data set I have 9,600 locations, and likewise counts for organizations, people, and times. Those are the distinct types there.

All right. Let's see, we can count each entity. We have our thing here; we just do a sum. This is the original exploded schema, and I've basically summed all the rows together, so now we're getting a count for each unique entity. And another way to get the triples is to use the find command. It works the same way it does in MATLAB: it gives you a set of triples. A temp, which we don't really care about because we've collapsed that dimension down to all ones; the actual entity; and the count. And I'm going to create a new associative array out of that, so we now have the counts and the entities.

And I'm going to plot a histogram of that. I'm going to take my count thing and just get the locations out of it. Using that StartsWith command, I say: get me all the locations. Then I say give me the adjacency matrix, which returns a regular sparse matrix, and I can do sum, full, loglog. Now, this is the classic degree distribution of all the locations in this data. And it shows us that a few locations are very common, appearing in most of the documents, while a lot of locations appear in only one document.
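The demo builds a count-by-entity associative array for this; the same power-law check can be sketched with plain MATLAB once we have the counts. The plotting details here are mine, not the demo's exact code:

    % Documents-per-entity counts, restricted to locations.
    cnt = sum(dblLogi(E), 1);                % 1 x Nentity associative array
    loc = cnt(:, StartsWith('LOCATION/,'));  % just the locations
    d   = full(Adj(loc));                    % back to an ordinary vector

    % Degree distribution: how many locations occur in exactly k documents.
    k = 1:max(d);
    n = hist(d, k);
    loglog(k, n, 'o');                       % a power law shows up as a line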
So this is the classic, what we call power-law, degree distribution. Again, it's always a good thing to check. Sometimes you can't do this on the full data set, because it's so large, but even on a sample, you want to know this from the beginning. Am I dealing with something that looks like a power law, like everything else? Or, what do you know, I see a big spike here; what's that about? Or it looks like a bell curve, or something like that.

So this is just really computing means: good. Computing basic histograms: very good. Without this basic information, it's really hard to advance in detection theory. And this is probably where we take a slightly different philosophical approach than a lot of the data mining community, which tends to focus on just finding stuff that's interesting. We tend to focus on understanding the data, modeling the data, coming up with models for what we're trying to find, and then doing that. Rather than just "show me something interesting in the data," or "show me a low-probability covariance," or something like that, which tends to be more the basis of the data mining community. Although, no doubt, I've probably offended them horribly by simplifying it that way, in which case I apologize to people in the [INAUDIBLE] community.

But here we do more traditional detection theory, and the first thing you want to do is get the distribution. Typically, one thing you'll do immediately is put what amounts to a low-pass and a high-pass filter on it. Things that appear only once are so unique that they give us no information, and things that appear everywhere likewise give us no information about the data. So you might have a high-pass and a low-pass filter that just eliminates these high-end and low-end types of things. It's a very standard signal processing technique, and it's just as relevant here as elsewhere.
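As a sketch of such a band-pass filter in this setting, assuming D4M's relational operators keep the underlying values, as I recall from the reference, and with cutoffs that are my assumptions rather than values from the lecture:

    % Band-pass the entities: drop singletons (too unique to correlate) and
    % near-universal entities (no discriminating information).
    cnt  = sum(dblLogi(E), 1);          % documents per entity
    band = (cnt > 1);                   % "high-pass": appears more than once
    band = (band < 0.5 * size(E, 1));   % "low-pass": in under half the docs
    Ef   = E(:, Col(band));             % keep only the surviving entities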
All right. So: statistics, histograms, all good things, and easy to do in D4M on our data. Don't be afraid to take your data, dump it back into regular MATLAB arrays, and just do the regular things you would do with MATLAB matrices. We highly encourage that.

Now we're going to do the facet algorithm that I talked about in the lecture. Let's do that.

So once again, we've loaded our data in very quickly. We've decided to convert directly to numeric, getting rid of all those word positions, because we don't really care about them here.

Now, one thing I like about this data set: for those of us who are a little older, we remember the news of the 1990s. This is all a trip to remember [INAUDIBLE]. For those of you who are a little younger, who were in elementary school then, this will not mean anything to you. But there's a lot of stuff about the Clintons in this data set, so that's always fun.

So the first facet we're going to do is: we want to look at the location New York and the person Michael Chang. Does anybody remember who Michael Chang was? Tennis player. Yes, he was kind of this hip tennis player who used to battle with Andre Agassi. I don't know how long his career really lasted, but this is right at about his peak, I believe.

So now, to show you that the equation I showed you in the slides wasn't a lie: we take our edge, or incidence, matrix, we pick the column New York, and then we pick the column Michael Chang. We need to do this NoCol thing, because we now have one column vector with the column Michael Chang
and one with the column New York, and if we add them together as they are, they have no intersection. So NoCol just pops those two column names away. Now they're effectively both the same column, and we can add them together to find all the documents that contain both New York and Michael Chang.

So we do that, and we've added them together. This is now a new column vector, and we transpose it to make it a row vector. Then, when we matrix multiply it back against the original data set, we've just computed the histogram of the other entities in the documents that contain both Michael Chang and New York. And as we can see, Michael Chang appears three times, which tells us there are three documents, and New York appears three times. But we also see the Czech Republic, and Austria, and Spain, and the United States.

Do you want to guess why the Czech Republic? Was Ivan Lendl still playing then? I'm just wondering if he had a lot of matches against him, or would he already be into [INAUDIBLE] or something like that? I don't know. He probably had some Czech arch-rival or something like that. Or maybe there was just a tournament, and these are that type of thing. With the Reuters data sets, you do have the problem that the person who types the article always puts in the location where they filed it. So New York could be in these just because the reporter happens to be based in New York and typed the article from there. So that's always fun.

We can also normalize this. What I'm going to do is take the facets that we computed here and normalize them by their sums. The facet shows how many times something showed up, but it doesn't tell me whether that something is really common or really rare.
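Here is a compact sketch of the whole facet computation just described, with the normalization at the end. NoCol, dblLogi, Adj, and displayFull are named as in the D4M distribution; the == 2 filter and putAdj are my reading of the demo, not confirmed code:

    % Facet: histogram of entities co-occurring with both query entities.
    A = dblLogi(E);                        % values -> 1
    x = 'LOCATION/new york,';
    y = 'PERSON/michael chang,';
    F = NoCol(A(:, x)) + NoCol(A(:, y));   % aligned indicator sum per document
    F = dblLogi(F == 2);                   % keep only docs containing *both*
    Fac = transpose(F) * A;                % row vector of co-occurrence counts
    displayFull(Fac);

    % Normalize by each entity's total count: values near 1 mean the entity
    % appears almost exclusively alongside the query pair.
    tot  = sum(A, 1);
    FacN = putAdj(Fac, Adj(Fac) ./ Adj(tot(:, Col(Fac))));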
So now, of the other entities that appear in these documents, we want to see: how many of them are really, really popular, and how many of them really only appear in these documents with Michael Chang and New York? As you can see here, we have Belarus, the Czech Republic, Virginia for some reason. And this shows you that these have a very low value: they're being divided by a fairly large number, which means they appear in a lot of places. They happen to appear in this same set of documents, but they also happen to appear in a lot of other documents. As opposed to Michael Joyce, or Virginia Wade; that's a tennis tournament, right, the Virginia Wade tennis tournament or something like that? These are more likely to be a real relation that exists between these entities.

So that just shows you how you do that. Again, a very powerful technique, a basic covariance type of math. Facet search is very powerful.

All right, that was three, so we're moving on to four. Now we're going to do some stuff with graphs. All kinds of graphs here, all kinds of graphs.

So once again, we've loaded our data in very quickly, and we're just going to make a copy of it. Then I'm going to convert the original to numbers, getting rid of those positions. Before, we did essentially the facets, the correlation between two things. But the real power of D4M is that, for about the same amount of work, we can correlate everything simultaneously. So we're going to do sqIn, which is basically the same as E transpose times E. It's a little bit faster when we're doing self-correlations, but not by much; you could have typed E transpose times E. And then we're going to show the structure of that. Remember we had 3,657 unique entities, so we've now constructed a square, 3,657 by 3,657, entity-by-entity matrix.
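In code, that simultaneous correlation is a one-liner; sqIn is the D4M name used in the lecture, and dblLogi is my assumption for the numeric conversion mentioned:

    A = dblLogi(E);   % drop the word positions, keep the 0/1 structure
    G = sqIn(A);      % same result as transpose(A) * A, slightly faster
    spy(G);           % view the 3,657 x 3,657 entity-entity structure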
And we're going to plot that. So let's see here; hopefully this won't crash our computer. So this is the full entity-entity correlation matrix, and you can see all the structure. These are the locations, so this block here is the location-by-location correlation. This block here is the person-person correlation; it obviously has a dense diagonal, and it's symmetric. And then down here are the time-time correlations.

The first thing you'll probably notice is these dense blocks of time, which have to do with the fact that this is a finite set of Reuters documents. There are only 35 unique days, or times, associated with the documents themselves. So this type of structure shows that if I want just the times associated with the reporting, I can get that, or the references to some date. You can see there are times in the past, and in the future, and various types of things like that. And that shows you the structure; it's a very nice structure.

We can zoom in here, and you see more sub-structures emerge. What's that one? Let me take a look. Here: United States, very popular, dense. You see that very dense band for the United States here. Who's this? This is probably America or something like that. United Nations Security Council, very popular. Ah, these are the organizations here; we don't have so many of them, and that's the most popular.

So you can just get this feel for the data, which is very powerful. Because often, when you work with your customer and you do this, you can be just about 100% sure you're the first person to actually get a view,
a high-level view, of the overall structure of the data from a very powerful perspective. Because they simply don't have a way to do this, and using the sparse matrix as a way to view the data is very, very powerful.

All right, let's get rid of that. So we did the entity-entity graph. Again, whenever we take an incidence matrix and square it, or correlate it, the result is an adjacency matrix, a graph between those two things. These adjacency matrices are a projection from the incidence matrix. We've lost some information; we've always thrown away some information in doing that. But often that's exactly what we want to do: project it down into a subspace that we find more interesting.

Something else we can do: let's just look at the people beginning with J. I'm going to create a little starts-with entity range here with our StartsWith command. I can pass that in here to grab just the records whose entity begins with J, and I'm going to do the correlation of them using this pedigree-preserving matrix multiply that we have, this very special one. We'll take a look at that, and that will explain what I mean by pedigree preserving.

So we go here. Again, this shows you all the people that appear together whose names begin with J. And it's, of course, a symmetric matrix with a dense diagonal. You click on an entry, and it tells you: Jennifer Hurley and James Harvey appeared in this document together. When we do that special matrix multiply, it doesn't preserve the values in the normal matrix multiply sense. Instead, it preserves the labels of the common dimension that's eliminated. When we do a matrix multiply, we're eliminating a dimension; here we take that common intersection and throw it into the value. So you can now say: Jennifer Hurley and James Harvey were connected by this document.
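A sketch of that pedigree-preserving step; CatKeyMul is the D4M name for this multiply, and the key spellings are my assumptions:

    % Restrict to people whose names begin with j, then correlate, keeping
    % the linking document keys as the values instead of numeric counts.
    p  = StartsWith('PERSON/j,');
    Ej = dblLogi(E(:, p));
    Gj = CatKeyMul(transpose(Ej), Ej);
    spy(Gj);

    % The value tells you *which* documents connect a pair, e.g.:
    % Gj('PERSON/jennifer hurley,', 'PERSON/james harvey,')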
And so it's a very powerful tool for doing that.

You'll notice I restricted it to J. Why didn't I just do the whole thing? Well, you see this dense diagonal here. If you had a very popular person, look at all the hits we're going to get when we correlate it with itself. This becomes a performance bottleneck when we start creating enormously long values. So we always have to be careful when we do these pedigree-preserving correlations. If we're correlating one set with a different set, we won't have a dense diagonal, and it will be fine. But if we're correlating something with itself, we always have to be on the lookout for these dense diagonals, in which case we could have an issue where we create a very, very large value.

And the fact of the matter is, the way we sort the strings is that we convert the list to a dense char matrix and then call the MATLAB sort routine; it's the only way we can sort it. So the width of that matrix is the length of the longest string in the list. If they're all about the same length, that's fine. But if you have mostly things that are 20 characters, and then you have something that's 2,000 characters long, and you create a 100,000 by 2,000 matrix, you've just consumed an awful lot of memory, and then you're sorting all of that. So whenever people say it's slow, or they run out of memory, it's almost always because they have a string list that we're trying to manipulate by sorting, and one of the values in that list is just extremely long. So that's just a little caveat. It's a very powerful technique, but you have to watch out for this diagonal, and there are ways to work around it that are usually data specific. But again, this pedigree-preserving correlation is very powerful, and something that people usually want.
Especially since we almost always store just a one in the value, we're not really throwing away any information. Sometimes we do want the counts, because we want to know how many times those co-occurrences happened. If we hadn't done the pedigree-preserving CatKeyMul, then we would have gotten a value of three here, and so we could have counted; it would have told us. Sometimes we do both: one multiply that gives us the count, and one that gives us the pedigree, and we just store them both. There are different things we want to do with one versus the other. So again, we're having fun with the different types of matrix multiplies that we can do in the space of associative arrays.

All right, we're running out of time here, but I think we're almost done. Let me just see here. Then we did the document-document correlation. This is sqOut, which is E times E transpose. We'll just take a quick look at that.

So here we are. This just shows us which documents are closely related to each other, which documents have a lot of common entities. It says here that these two documents shared one common entity. Let's try other ones. Just clicking around here, I'm not having a lot of luck finding one that's got a lot in it. So maybe around here? Nope. No. Well, certainly if I click on the diagonal... nope. Wow. Basically pretty sparse, sparse correlations between these documents.
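The document-document direction is the mirror image; sqOut is the D4M name used in the lecture:

    A  = dblLogi(E);
    GD = sqOut(A);    % same result as A * transpose(A)
    spy(GD);          % ~10,000 x ~10,000; values = number of shared entities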
So all right. I think we'll now go to the last example. Is there a last example? Yes, one last one.

The stuff I've shown you so far is pretty easy stuff, though pretty sophisticated by a lot of the standards out there, in terms of being able to do sums, and histograms, and basic correlations. That's often significantly beyond the run-of-the-mill analysis you'll run into if you just go into a place doing basic analysis.

So again, we load the data, we convert everything to numeric, and we've now squared it, so we've created the graph. Now I want to get rid of that annoying diagonal, so I take the diagonal out. Just so you know, all I did is this: I know the matrix is square, so I can do certain things without worrying. I can take the adjacency matrix, which is just going to be a regular MATLAB sparse matrix; I can take its diagonal; and since it's square, I can subtract the one from the other and just insert the result back in without any harm done. I took a square adjacency matrix out, did some manipulation, and stuffed it back in; the dimensions are the same, so it should be OK. I might have an issue in that I may have introduced an empty row or an empty column, which could cause me a few problems.

I'm now going to get the triples of that, because I want to normalize the values. The diagonal showed me how many counts go with each, and I want to normalize the entities in a document. Because sometimes you'll get a document that's like, "here's a list of every single country that participated in this UN meeting," and it'll have a hundred and something countries. A country appearing in that document is not such a big deal, because all the countries appear in that document. So this allows us to account for that: if we want to do a correlation, we take the actual value and divide it by the minimum, since the smaller of the two counts is the maximum number of hits you could possibly get in the correlation.
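A sketch of that diagonal surgery and the min-count normalization, round-tripping through the plain sparse matrix as described; putAdj is my assumption for stuffing the result back into the associative array:

    G  = sqIn(dblLogi(E));   % co-occurrence graph; diagonal = total counts
    Ad = Adj(G);             % regular MATLAB sparse matrix
    d  = diag(Ad);           % keep the counts before removing them
    Ad = Ad - diag(d);       % knock out the self-correlation diagonal

    % Normalize each edge by the smaller of the two endpoint counts: the
    % less-common endpoint bounds how many co-occurrences are even possible.
    [i, j, v] = find(Ad);
    vn = v ./ min(d(i), d(j));
    Gn = putAdj(G, sparse(i, j, vn, size(Ad, 1), size(Ad, 2)));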
You can also do multi-facet queries. Say we want location against all people. So here's a more complicated query, which says: find me all the people that appeared more than four times, and with a probability greater than 0.3, with respect to New York. So here's New York, and all these different people who appeared with New York, and almost exclusively appeared with New York.

So, John Kennedy: this is his son, who was still alive at that time and very popular in the news. Let's focus on him. Let's get John Kennedy, and let's get his neighborhood. We can get all the other people who appeared in documents with John Kennedy, and we can plot that neighborhood. You go here: this is his neighborhood. There's John Kennedy, his row and column, and these are all the other people that appeared with him; you see his dense row and column. And then these are what are called the triangles: other people who appeared in documents together, but not necessarily with John Kennedy. So George Bush and Jim Wright appeared in a document together, though not necessarily with John Kennedy. We can actually find those people by just doing this basic arithmetic here, and this shows us all the triangles: all the other people who also appeared in documents together, who appeared with John Kennedy.

So with that, we're running a little bit late here, so we will wrap it up. And again, I encourage you to go into the code, run these examples, and try them out. This should give you your first sense of really working with some real data and playing with stuff. So, thank you very much.