1 00:00:00,040 --> 00:00:02,410 The following content is provided under a Creative 2 00:00:02,410 --> 00:00:03,790 Commons license. 3 00:00:03,790 --> 00:00:06,030 Your support will help MIT OpenCourseWare 4 00:00:06,030 --> 00:00:10,100 continue to offer high quality educational resources for free. 5 00:00:10,100 --> 00:00:12,680 To make a donation or to view additional materials 6 00:00:12,680 --> 00:00:16,590 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,590 --> 00:00:18,270 at ocw.mit.edu. 8 00:00:20,969 --> 00:00:22,010 JEREMY KEPNER: All right. 9 00:00:22,010 --> 00:00:25,730 We're back. 10 00:00:25,730 --> 00:00:29,025 So I think we just went over a sort 11 00:00:29,025 --> 00:00:31,150 of a tour of some of the more complicated analytics 12 00:00:31,150 --> 00:00:32,439 that people can do. 13 00:00:32,439 --> 00:00:34,760 And now, we're going to show some examples of not 14 00:00:34,760 --> 00:00:37,200 those specific analytics, but using the Reuters data 15 00:00:37,200 --> 00:00:39,530 set that we had before some of the analytics 16 00:00:39,530 --> 00:00:40,810 that we can do here. 17 00:00:40,810 --> 00:00:44,700 So again, this is in the D4M API. 18 00:00:44,700 --> 00:00:46,580 We go into the Examples directory. 19 00:00:46,580 --> 00:00:48,040 It's Apps. 20 00:00:48,040 --> 00:00:50,320 And now we're going to deal with tracks. 21 00:00:50,320 --> 00:00:54,250 And this is a data set here. 22 00:00:54,250 --> 00:00:56,440 The data actually, this Entity.mat 23 00:00:56,440 --> 00:01:00,005 was actually created in the previous. 24 00:01:02,720 --> 00:01:04,110 And this entity analysis actually 25 00:01:04,110 --> 00:01:05,560 creates the Entity.mat. 26 00:01:05,560 --> 00:01:08,210 It's the same file. 27 00:01:08,210 --> 00:01:10,770 So we just have over here. 28 00:01:10,770 --> 00:01:13,708 And so let's get started. 29 00:01:16,500 --> 00:01:21,680 And just to remind people, we do load entity.mat. 
30 00:01:26,529 --> 00:01:28,960 You see we have this E here. 31 00:01:28,960 --> 00:01:30,780 That's the data. 32 00:01:30,780 --> 00:01:36,340 It represents, essentially, almost 10,000 documents 33 00:01:36,340 --> 00:01:38,500 and 3,600 entities in that. 34 00:01:38,500 --> 00:01:45,116 If we do spy of E transpose. 35 00:01:45,116 --> 00:01:47,282 AUDIENCE: Can you move the bottom of your screen up? 36 00:01:47,282 --> 00:01:50,385 Because we're seeing the top half of your thing. 37 00:02:02,040 --> 00:02:03,520 JEREMY KEPNER: That better? 38 00:02:03,520 --> 00:02:04,500 AUDIENCE: Much better. 39 00:02:04,500 --> 00:02:05,250 JEREMY KEPNER: OK. 40 00:02:09,300 --> 00:02:11,460 And so there you go. 41 00:02:11,460 --> 00:02:13,260 That is the entire data set. 42 00:02:13,260 --> 00:02:17,640 Again, spy plot, very useful tool for doing that. 43 00:02:17,640 --> 00:02:20,260 So you can see we have locations and people 44 00:02:20,260 --> 00:02:23,310 and times in this data set. 45 00:02:23,310 --> 00:02:25,640 We also have organizations in here. 46 00:02:25,640 --> 00:02:27,870 You can zoom in on it if you want to. 47 00:02:31,870 --> 00:02:34,820 You can see, these are the locations, and then 48 00:02:34,820 --> 00:02:37,540 the organizations, and the people. 49 00:02:37,540 --> 00:02:40,360 Zoom in a little bit more, you can actually 50 00:02:40,360 --> 00:02:41,840 look at the actual values here. 51 00:02:41,840 --> 00:02:44,140 And you see there's this common popular location here. 52 00:02:44,140 --> 00:02:45,595 What is that? 53 00:02:45,595 --> 00:02:47,250 Ah, it's the United States. 54 00:02:47,250 --> 00:02:49,530 Yes, the United States does appear a lot 55 00:02:49,530 --> 00:02:51,930 in Reuters documents as one would expect. 56 00:02:51,930 --> 00:02:53,480 What do we think this one is here? 57 00:02:53,480 --> 00:02:54,616 Any guesses? 58 00:02:54,616 --> 00:02:55,430 AUDIENCE: New York. 59 00:02:55,430 --> 00:02:56,651 JEREMY KEPNER: What?
60 00:02:56,651 --> 00:02:58,020 AUDIENCE: Is that New York? 61 00:02:58,020 --> 00:02:58,330 JEREMY KEPNER: Yeah. 62 00:02:58,330 --> 00:02:58,890 Oh, New York. 63 00:02:58,890 --> 00:02:59,889 See, it's already there. 64 00:02:59,889 --> 00:03:00,420 New York. 65 00:03:00,420 --> 00:03:02,400 Yes, we can read. 66 00:03:02,400 --> 00:03:03,940 All right. 67 00:03:03,940 --> 00:03:07,110 Organizations, anything really popular here? 68 00:03:07,110 --> 00:03:09,420 Maybe this guy, is he popular? 69 00:03:09,420 --> 00:03:10,947 International Red Cross, right? 70 00:03:10,947 --> 00:03:13,280 And people here, we don't have any really popular people 71 00:03:13,280 --> 00:03:15,504 in this list. 72 00:03:15,504 --> 00:03:18,840 Ahmad Shah Massoud, I have no idea who that person was back 73 00:03:18,840 --> 00:03:21,310 in 1996. 74 00:03:21,310 --> 00:03:24,640 Anyway, so that's the data. 75 00:03:24,640 --> 00:03:27,670 And you see, by the way, when every single time you click. 76 00:03:27,670 --> 00:03:30,980 Because when we make those spy plots, 77 00:03:30,980 --> 00:03:33,910 I have to do some compression on the strings just 78 00:03:33,910 --> 00:03:36,350 to make it work. 79 00:03:36,350 --> 00:03:39,010 But we actually always print out onto the screen 80 00:03:39,010 --> 00:03:40,790 the full exact string. 81 00:03:40,790 --> 00:03:43,480 So if I want that string, you can just then copy it 82 00:03:43,480 --> 00:03:45,657 if you want to say copy and paste 83 00:03:45,657 --> 00:03:46,740 it or something like that. 84 00:03:46,740 --> 00:03:49,031 Or the full person or something like that, you can then 85 00:03:49,031 --> 00:03:51,370 copy that and paste that. 86 00:03:51,370 --> 00:03:59,772 You can do something like that like, E, this guy, yeah. 87 00:03:59,772 --> 00:04:00,730 And there, it's showed. 88 00:04:00,730 --> 00:04:05,000 That's all the documents that contain this person's name. 
89 00:04:05,000 --> 00:04:08,350 And then this shows the character position, I believe, 90 00:04:08,350 --> 00:04:10,130 that they appeared in the document. 91 00:04:10,130 --> 00:04:15,550 So again, you can have all kinds of fun with that. 92 00:04:15,550 --> 00:04:17,560 You can do Row of that. 93 00:04:17,560 --> 00:04:18,839 That gets us all those rows. 94 00:04:18,839 --> 00:04:21,085 And then we can pass those rows back in. 95 00:04:21,085 --> 00:04:27,980 And let's say we want to do starts with, 96 00:04:27,980 --> 00:04:31,384 let's see here, how about organization? 97 00:04:31,384 --> 00:04:32,800 Everyone correct my spelling here. 98 00:04:32,800 --> 00:04:39,050 I'm going to type this wrong, organization slash. 99 00:04:39,050 --> 00:04:42,532 All right, there we go. 100 00:04:42,532 --> 00:04:44,120 AUDIENCE: Starts with. 101 00:04:44,120 --> 00:04:44,190 JEREMY KEPNER: Yeah. 102 00:04:44,190 --> 00:04:45,190 I always get that wrong. 103 00:04:45,190 --> 00:04:46,917 Stars with-- starts with. 104 00:04:46,917 --> 00:04:47,417 See? 105 00:04:47,417 --> 00:04:48,251 There you go. 106 00:04:48,251 --> 00:04:48,810 All right. 107 00:04:48,810 --> 00:04:49,351 There you go. 108 00:04:49,351 --> 00:04:51,540 So this shows you all the organizations 109 00:04:51,540 --> 00:04:54,080 that are cited in the documents that 110 00:04:54,080 --> 00:04:56,366 contained this person's name. 111 00:04:56,366 --> 00:04:56,865 You know? 112 00:04:59,694 --> 00:05:01,360 This is kind of the spirit of it, right? 113 00:05:01,360 --> 00:05:03,090 I mean, it's like just, oh, I want that? 114 00:05:03,090 --> 00:05:03,760 I can get that. 115 00:05:03,760 --> 00:05:06,397 Oh, I could then say, oh, all right, well get me-- you know? 116 00:05:06,397 --> 00:05:08,480 And you could just keep going and going and going. 117 00:05:08,480 --> 00:05:13,982 If I did r, c, v of that, right, it 118 00:05:13,982 --> 00:05:15,440 would then return those as triples.
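The chained query described above (find a person's documents, pass those rows back in, keep the organization columns, then flatten to triples) can be sketched in plain Python. This is only an illustration of the idea, not D4M's API: the dictionary layout, the `select` and `row_keys` helpers, and all of the toy keys and values are assumptions made for the example.

```python
# Toy stand-in for the entity array E (hypothetical data, not the Reuters set).
# Keys are (document, "TYPE/entity") pairs; values are character positions.
E = {
    ("doc1.txt", "PERSON/alice smith"): "17;",
    ("doc1.txt", "ORGANIZATION/red cross"): "52;",
    ("doc1.txt", "LOCATION/new york"): "3;",
    ("doc2.txt", "PERSON/alice smith"): "8;",
    ("doc2.txt", "ORGANIZATION/united nations"): "40;",
    ("doc3.txt", "PERSON/bob jones"): "5;",
    ("doc3.txt", "ORGANIZATION/red cross"): "61;",
}


def select(A, row_ok=lambda r: True, col_ok=lambda c: True):
    """Return the sub-array whose row and column keys pass both predicates."""
    return {(r, c): v for (r, c), v in A.items() if row_ok(r) and col_ok(c)}


def row_keys(A):
    """Sorted unique row keys of an array."""
    return sorted({r for r, _ in A})


# 1. All documents that mention one person (a column query).
person = "PERSON/alice smith"
docs = row_keys(select(E, col_ok=lambda c: c == person))

# 2. Pass those rows back in and keep only the organization columns.
orgs = select(E,
              row_ok=lambda r: r in docs,
              col_ok=lambda c: c.startswith("ORGANIZATION/"))

# 3. Flatten the result into (row, column, value) triples.
triples = sorted((r, c, v) for (r, c), v in orgs.items())

print(docs)
for triple in triples:
    print(triple)
```

Each step returns something that can be fed straight into the next one, which is the "keep going and going" style of exploration the speaker is pointing at.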
119 00:05:15,440 --> 00:05:17,180 So r, those are all the documents. 120 00:05:17,180 --> 00:05:19,680 C, those are all the columns. 121 00:05:19,680 --> 00:05:21,840 V, there we go. 122 00:05:21,840 --> 00:05:23,260 You get the idea. 123 00:05:23,260 --> 00:05:27,830 So again, very, very powerful type of syntax there. 124 00:05:27,830 --> 00:05:28,996 All right. 125 00:05:28,996 --> 00:05:30,870 So I'll just give you a sense of the data set 126 00:05:30,870 --> 00:05:31,786 that we're looking at. 127 00:05:31,786 --> 00:05:34,040 So let's look at the first example here. 128 00:05:34,040 --> 00:05:37,432 So we're going to do track, analytic, build, test. 129 00:05:37,432 --> 00:05:39,640 So we're going to build some tracks out of this data. 130 00:05:39,640 --> 00:05:41,500 So I have these documents. 131 00:05:41,500 --> 00:05:46,110 And they have locations in them, and people, and times. 132 00:05:46,110 --> 00:05:50,830 So I could say, hey, if there's a person and a location 133 00:05:50,830 --> 00:05:53,370 and a time in a document, maybe I could 134 00:05:53,370 --> 00:05:55,760 call that sort of a track. 135 00:05:55,760 --> 00:05:56,480 So let's do that. 136 00:05:56,480 --> 00:05:57,840 So what do we do here? 137 00:05:57,840 --> 00:06:01,560 So the first thing we do is we load the data. 138 00:06:01,560 --> 00:06:03,560 The string values are those character positions. 139 00:06:03,560 --> 00:06:04,935 We don't really care about those. 140 00:06:04,935 --> 00:06:06,900 So we're going to just get rid of them 141 00:06:06,900 --> 00:06:09,400 and just convert it to a numeric like we've always 142 00:06:09,400 --> 00:06:10,550 been doing here. 143 00:06:10,550 --> 00:06:14,600 And then I'm going to say my thing 144 00:06:14,600 --> 00:06:16,990 that I want to track is going to be anything starting 145 00:06:16,990 --> 00:06:17,730 with person. 146 00:06:17,730 --> 00:06:18,960 So I set that.
147 00:06:18,960 --> 00:06:22,360 And my time thing is going to be anything starting with time. 148 00:06:22,360 --> 00:06:24,710 And my location thing is going to be anything 149 00:06:24,710 --> 00:06:25,920 here starting with location. 150 00:06:25,920 --> 00:06:29,114 So I've done the starts with to get these ranges. 151 00:06:29,114 --> 00:06:30,780 And now, the first thing I'm going to do 152 00:06:30,780 --> 00:06:37,690 is I want to limit my data to only rows that have at least 153 00:06:37,690 --> 00:06:39,690 one of all three of those. 154 00:06:39,690 --> 00:06:43,630 So I'm not dealing with cases where I have a person and a location 155 00:06:43,630 --> 00:06:44,680 and no time or whatever. 156 00:06:44,680 --> 00:06:46,470 So I'm just going to clean that up. 157 00:06:46,470 --> 00:06:49,780 So basically, I get all the people. 158 00:06:49,780 --> 00:06:51,480 And I sum that. 159 00:06:51,480 --> 00:06:56,550 I basically sum across the columns. 160 00:06:56,550 --> 00:06:59,010 So I basically compress the columns. 161 00:06:59,010 --> 00:07:01,340 And then I sum the rows. 162 00:07:01,340 --> 00:07:03,550 All right. 163 00:07:03,550 --> 00:07:06,620 There we go, sum those. 164 00:07:06,620 --> 00:07:09,397 I get, then, all three. 165 00:07:09,397 --> 00:07:10,730 And then I filter them back out. 166 00:07:10,730 --> 00:07:13,230 And that just reduces this to 167 00:07:13,230 --> 00:07:16,911 just the ones that have these. 168 00:07:20,480 --> 00:07:21,330 Let's see here. 169 00:07:21,330 --> 00:07:25,330 So now, I want to collapse these. 170 00:07:25,330 --> 00:07:29,070 I want to create, essentially, just edges and times. 171 00:07:29,070 --> 00:07:32,700 So I can do that with the col to type syntax and the val 172 00:07:32,700 --> 00:07:33,780 to col syntax.
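The clean-up step described above (keep only documents that have at least one person, one time, and one location) can be sketched as follows. This is a plain-Python illustration under assumed data, not the D4M code from the example file; the key layout and the toy entries are hypothetical.

```python
from collections import defaultdict

# Hypothetical numeric entity array: (document, "TYPE/entity") -> 1.
E = {
    ("doc1.txt", "PERSON/alice smith"): 1,
    ("doc1.txt", "TIME/1996-11-12"): 1,
    ("doc1.txt", "LOCATION/new york"): 1,
    ("doc2.txt", "PERSON/bob jones"): 1,
    ("doc2.txt", "LOCATION/buffalo"): 1,   # no TIME entry, so doc2 is dropped
    ("doc3.txt", "TIME/1996-11-13"): 1,    # only a TIME entry, also dropped
}

PREFIXES = ("PERSON/", "TIME/", "LOCATION/")

# Count, per document, how many entries fall under each prefix
# (the equivalent of summing each column range along the rows).
counts = defaultdict(lambda: defaultdict(int))
for (doc, col), val in E.items():
    for prefix in PREFIXES:
        if col.startswith(prefix):
            counts[doc][prefix] += val

# Keep documents with a nonzero count for all three prefixes.
keep = {doc for doc, c in counts.items() if all(c[p] > 0 for p in PREFIXES)}
E_clean = {(d, c): v for (d, c), v in E.items() if d in keep}

print(sorted(keep))
```

Summing per prefix and then intersecting is one straightforward way to express "at least one of all three"; the lecture does the same thing with sums over column ranges of the associative array.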
173 00:07:33,780 --> 00:07:37,710 And going and bopping back and forth between that, 174 00:07:37,710 --> 00:07:40,840 I get a set of edges, the edge list, which is essentially 175 00:07:40,840 --> 00:07:43,629 the document and the time. 176 00:07:43,629 --> 00:07:45,670 And now, I'm going to combine these back together 177 00:07:45,670 --> 00:07:49,770 into a new associative array, which essentially still 178 00:07:49,770 --> 00:07:54,780 has the same text label, which is essentially the document. 179 00:07:54,780 --> 00:07:56,710 But it has columns of time. 180 00:07:56,710 --> 00:07:58,200 And the value is space. 181 00:07:58,200 --> 00:07:58,700 All right. 182 00:07:58,700 --> 00:08:01,180 And then I'm going to do another one, which is, 183 00:08:01,180 --> 00:08:06,480 again, has a row, which is the document or the edge. 184 00:08:06,480 --> 00:08:10,710 And it has a column of space and a value of time. 185 00:08:10,710 --> 00:08:14,740 And now, I can construct a track from this 186 00:08:14,740 --> 00:08:20,300 through this wonderful sparse matrix multiply. 187 00:08:20,300 --> 00:08:25,360 So essentially, I transpose Etx. 188 00:08:25,360 --> 00:08:28,190 And then I'm going to just get the people 189 00:08:28,190 --> 00:08:31,720 and convert those numeric values. 190 00:08:31,720 --> 00:08:33,600 And then we do this cat value mul, which 191 00:08:33,600 --> 00:08:36,260 will actually convert that. 192 00:08:36,260 --> 00:08:37,220 These are time tracks. 193 00:08:37,220 --> 00:08:38,679 And these are space tracks. 194 00:08:38,679 --> 00:08:40,220 And again, it's a little difficult 195 00:08:40,220 --> 00:08:44,710 to explain to you exactly why this works, 196 00:08:44,710 --> 00:08:47,410 why these matrix multiplies give us the answer 197 00:08:47,410 --> 00:08:48,399 that we're going for. 198 00:08:48,399 --> 00:08:50,940 Because we're going to have to sit and think about the actual 199 00:08:50,940 --> 00:08:51,780 matrices. 
200 00:08:51,780 --> 00:08:54,120 This is a great example to go do, and then 201 00:08:54,120 --> 00:08:56,890 explore these various associative arrays to actually 202 00:08:56,890 --> 00:09:02,620 see why these matrix multiplies actually give us 203 00:09:02,620 --> 00:09:04,150 the answer that we want. 204 00:09:04,150 --> 00:09:08,320 So in fact, we can take a look at those. 205 00:09:08,320 --> 00:09:09,866 I just want to look at Figure 1 here. 206 00:09:12,830 --> 00:09:15,150 So this shows us, basically, and I 207 00:09:15,150 --> 00:09:20,190 plotted the transpose of this, the people on the right. 208 00:09:20,190 --> 00:09:22,000 And then these are times. 209 00:09:22,000 --> 00:09:24,800 And so, basically, for each row here, I 210 00:09:24,800 --> 00:09:26,870 have a listing of times. 211 00:09:26,870 --> 00:09:31,680 And if I click on one of them, it will give me a location. 212 00:09:31,680 --> 00:09:32,180 OK. 213 00:09:36,860 --> 00:09:38,600 In fact, I think that number's the number 214 00:09:38,600 --> 00:09:40,380 of times that appears. 215 00:09:40,380 --> 00:09:43,990 So basically, we have here the person, 216 00:09:43,990 --> 00:09:48,990 Daniel Smith was in a document with a time stamp of 1996, 217 00:09:48,990 --> 00:09:49,750 November 12. 218 00:09:49,750 --> 00:09:51,960 Oh, that's almost-- yesterday. 219 00:09:51,960 --> 00:10:01,700 And then the location New York appeared once in that. 220 00:10:01,700 --> 00:10:02,520 Here's another one. 221 00:10:02,520 --> 00:10:05,020 And so this is a track of sorts. 222 00:10:05,020 --> 00:10:08,340 We basically have a person and a set of times 223 00:10:08,340 --> 00:10:11,410 and a set of tracks. 224 00:10:11,410 --> 00:10:15,150 That's one kind of track. 225 00:10:15,150 --> 00:10:16,650 Another kind of track here is now 226 00:10:16,650 --> 00:10:20,637 we have person and locations. 227 00:10:20,637 --> 00:10:22,720 So that was the other matrix multiply that we did. 
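To make the two track representations concrete, here is a minimal plain-Python sketch that joins people, times, and locations on the document they share. It is not the sparse matrix multiply used in the lecture, only an explicit loop that produces the same shape of result on toy data. The `pick` helper is hypothetical, and the document names and most values are invented (only the Daniel Smith / New York / 12 November 1996 combination echoes the transcript).

```python
# Toy documents: each lists the entities found in it (illustrative values).
docs = {
    "doc1.txt": ["PERSON/daniel smith", "TIME/1996-11-12", "LOCATION/new york"],
    "doc2.txt": ["PERSON/carole king", "TIME/1996-11-13", "LOCATION/buffalo"],
    "doc3.txt": ["PERSON/carole king", "TIME/1996-11-20", "LOCATION/new york"],
}


def pick(cols, prefix):
    """Pick one entity with the given prefix (first in sorted order)."""
    return sorted(c for c in cols if c.startswith(prefix))[0]


time_tracks = {}   # (person, time) -> location
space_tracks = {}  # (person, location) -> time

for cols in docs.values():
    t = pick(cols, "TIME/")
    x = pick(cols, "LOCATION/")
    for p in (c for c in cols if c.startswith("PERSON/")):
        time_tracks[(p, t)] = x
        space_tracks[(p, x)] = t

for key in sorted(time_tracks):
    print(key, "->", time_tracks[key])
```

Both dictionaries hold the same triples; which one is more convenient depends on whether later queries are keyed by time or by location.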
228 00:10:22,720 --> 00:10:26,180 And so now, we have person Carole King, location Buffalo, 229 00:10:26,180 --> 00:10:29,130 and on this time. 230 00:10:29,130 --> 00:10:32,502 So those are two different ways of representing the tracks. 231 00:10:32,502 --> 00:10:33,710 Obviously, these are triples. 232 00:10:33,710 --> 00:10:35,680 But then you can use either of these matrices 233 00:10:35,680 --> 00:10:39,260 to do additional queries and other types of things. 234 00:10:39,260 --> 00:10:40,150 All right. 235 00:10:40,150 --> 00:10:42,905 So that just shows you how using matrix multiplies 236 00:10:42,905 --> 00:10:44,280 and other types of things you can 237 00:10:44,280 --> 00:10:48,370 construct more sophisticated graphs or data structures, 238 00:10:48,370 --> 00:10:52,630 in this case tracks, which is a very interesting type of thing. 239 00:10:52,630 --> 00:10:54,032 Let's move on to the next one. 240 00:10:54,032 --> 00:10:54,990 That's going to be TA2. 241 00:11:01,500 --> 00:11:03,830 So this is a slightly more sophisticated track builder. 242 00:11:03,830 --> 00:11:06,320 Again, so when I read the data in, 243 00:11:06,320 --> 00:11:09,090 I create my three sort of categories here, 244 00:11:09,090 --> 00:11:11,820 the object, and the time, and the location, 245 00:11:11,820 --> 00:11:13,020 or the coordinate. 246 00:11:13,020 --> 00:11:15,960 And then I have a function here called 247 00:11:15,960 --> 00:11:18,600 find tracks, which actually just goes and creates those tracks 248 00:11:18,600 --> 00:11:24,070 that I essentially did in the last section. 249 00:11:24,070 --> 00:11:25,770 To be honest with you, the reason 250 00:11:25,770 --> 00:11:28,210 I did that is because some of those matrix multiplies 251 00:11:28,210 --> 00:11:30,790 used to be really, really, really slow.
252 00:11:30,790 --> 00:11:33,540 And so I did a sort of special function 253 00:11:33,540 --> 00:11:36,480 that took advantage of certain properties of the data 254 00:11:36,480 --> 00:11:40,000 to make it find these tracks much faster. 255 00:11:40,000 --> 00:11:42,660 Eventually, I broke down and just optimized the matrix 256 00:11:42,660 --> 00:11:44,610 multiply. 257 00:11:44,610 --> 00:11:47,840 In the past when I ran that last query, before it would have 258 00:11:47,840 --> 00:11:50,810 taken like a minute to actually run the analytic, 259 00:11:50,810 --> 00:11:52,140 which got annoying. 260 00:11:52,140 --> 00:11:53,550 So I optimized it. 261 00:11:53,550 --> 00:11:54,960 But we still have this code. 262 00:11:54,960 --> 00:11:59,600 This code shows all kinds of little tricks and techniques 263 00:11:59,600 --> 00:12:02,290 for doing things that are slightly better 264 00:12:02,290 --> 00:12:04,539 and using triples instead of associative arrays 265 00:12:04,539 --> 00:12:05,830 if you want to do optimization. 266 00:12:05,830 --> 00:12:07,990 So we leave it here. 267 00:12:07,990 --> 00:12:10,410 But the matrix multiply performance 268 00:12:10,410 --> 00:12:14,960 is now good enough that these tricks are less necessary. 269 00:12:14,960 --> 00:12:17,310 So what we want to do here, we have this track now. 270 00:12:17,310 --> 00:12:19,180 And I want to do a track query. 271 00:12:19,180 --> 00:12:23,230 So I have a person here, Michael Chang, 272 00:12:23,230 --> 00:12:26,930 another person Javier Sanchez. 273 00:12:26,930 --> 00:12:29,310 Now, Michael Chang was a tennis player at this time. 274 00:12:29,310 --> 00:12:33,160 Was Javier Sanchez also a tennis player at this time? 275 00:12:33,160 --> 00:12:33,720 I don't know. 276 00:12:33,720 --> 00:12:35,511 I think there was a Javier Sanchez that was 277 00:12:35,511 --> 00:12:36,950 a tennis player at that time. 278 00:12:36,950 --> 00:12:38,580 So we just want to look.
279 00:12:38,580 --> 00:12:42,100 We're going to just do, essentially, here A and just 280 00:12:42,100 --> 00:12:44,800 say give me of this track. 281 00:12:44,800 --> 00:12:52,510 And say give me the listings for these two people, P1, P2. 282 00:12:52,510 --> 00:12:55,260 And then we use our Display Full command to sort of make them 283 00:12:55,260 --> 00:12:57,450 in a nice neat tabular format. 284 00:12:57,450 --> 00:12:59,440 And you see here, basically, here 285 00:12:59,440 --> 00:13:03,870 is Javier Sanchez' listing. 286 00:13:03,870 --> 00:13:04,440 OK. 287 00:13:04,440 --> 00:13:05,610 And here is Michael Chang's. 288 00:13:05,610 --> 00:13:07,490 And you see there's no overlap here. 289 00:13:07,490 --> 00:13:13,280 We don't ever have them in the same time or same place. 290 00:13:13,280 --> 00:13:15,280 We can also do things like track windows. 291 00:13:15,280 --> 00:13:18,920 So we can say I want to set a time range here 292 00:13:18,920 --> 00:13:22,280 and a location, Australia. 293 00:13:22,280 --> 00:13:24,250 So if we have our track thing here 294 00:13:24,250 --> 00:13:28,940 and I say, all right, give me the time range, T, 295 00:13:28,940 --> 00:13:33,830 and then equal to all locations in Australia, 296 00:13:33,830 --> 00:13:37,140 this shows me all tracks that essentially 297 00:13:37,140 --> 00:13:42,410 went through this location in this time window. 298 00:13:42,410 --> 00:13:45,010 And these are the different folks that they list, 299 00:13:45,010 --> 00:13:48,680 Sanchez, Melissa Russo, whoever that was, 300 00:13:48,680 --> 00:13:50,660 Michael Chang, and Michelle Martin. 301 00:13:50,660 --> 00:13:54,230 So those are just an example of a more sophisticated analytic. 302 00:13:54,230 --> 00:13:56,180 And here, we're using the fact that 303 00:13:56,180 --> 00:14:00,830 for our associative arrays, we actually 304 00:14:00,830 --> 00:14:03,540 have defined equals equals. 
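The track window query described above (restrict to a time range, then keep entries whose value equals one constant location, then read off the people) can be sketched in plain Python. The array orientation (time by person, with location as the value), the helper names `in_window` and `equals`, and every entry in the toy data are assumptions for illustration; this is not the D4M syntax itself.

```python
# Hypothetical track array: (time, person) -> location.
A = {
    ("TIME/1996-11-01", "PERSON/michael chang"): "LOCATION/australia",
    ("TIME/1996-11-05", "PERSON/javier sanchez"): "LOCATION/australia",
    ("TIME/1996-11-09", "PERSON/michael chang"): "LOCATION/new york",
    ("TIME/1996-12-20", "PERSON/michelle martin"): "LOCATION/australia",
}


def in_window(A, t_start, t_end):
    """Keep rows whose time key lies in [t_start, t_end].

    ISO-style dates compare correctly as plain strings.
    """
    return {(t, p): x for (t, p), x in A.items() if t_start <= t <= t_end}


def equals(A, constant):
    """Keep entries whose value equals a single constant (not a list)."""
    return {k: v for k, v in A.items() if v == constant}


window = in_window(A, "TIME/1996-11-01", "TIME/1996-11-30")
hits = equals(window, "LOCATION/australia")
people = sorted({person for _, person in hits})

print(people)
```

Like the comparison in the lecture, `equals` returns another array rather than a boolean, so its column keys can be listed directly.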
305 00:14:03,540 --> 00:14:07,940 So this only works, though, if x is a constant. 306 00:14:07,940 --> 00:14:10,250 So it will check the value to see 307 00:14:10,250 --> 00:14:13,322 if that value, if it's a string, if it's equal to that string, 308 00:14:13,322 --> 00:14:14,780 or if it's a numeric value, if it's 309 00:14:14,780 --> 00:14:17,070 equal to that numeric value. 310 00:14:17,070 --> 00:14:20,960 But it only works with a constant. 311 00:14:20,960 --> 00:14:23,530 One could argue that maybe I should make this work 312 00:14:23,530 --> 00:14:26,292 for a list of strings. 313 00:14:26,292 --> 00:14:28,000 But then the MATLAB syntax doesn't really 314 00:14:28,000 --> 00:14:29,370 work there either. 315 00:14:29,370 --> 00:14:34,704 If I have a matrix equals equal to a list of-- I don't know. 316 00:14:34,704 --> 00:14:36,120 I don't know if that really works. 317 00:14:36,120 --> 00:14:39,400 So we try and preserve the MATLAB syntax where we can. 318 00:14:39,400 --> 00:14:41,700 And again, then we're just getting the columns. 319 00:14:41,700 --> 00:14:43,790 Again, this thing returns an associative array. 320 00:14:43,790 --> 00:14:46,450 This equal equals returns the associative array 321 00:14:46,450 --> 00:14:49,010 of all things that are equal to that. 322 00:14:49,010 --> 00:14:51,151 And then we can look at the columns. 323 00:14:51,151 --> 00:14:53,680 All right. 324 00:14:53,680 --> 00:14:59,030 Moving on here, next one's TA3. 325 00:14:59,030 --> 00:15:01,222 So those are fairly simple track builders. 326 00:15:01,222 --> 00:15:02,680 Let's begin to do something that's, 327 00:15:02,680 --> 00:15:08,990 I think, kind of-- doing that track analytic, 328 00:15:08,990 --> 00:15:11,780 one could imagine doing that with existing techniques that 329 00:15:11,780 --> 00:15:14,510 are out there, existing tools and stuff like that. 330 00:15:14,510 --> 00:15:15,526 It would be long. 
331 00:15:15,526 --> 00:15:17,150 You would write a lot of code to do it. 332 00:15:17,150 --> 00:15:19,507 But you could do it. 333 00:15:19,507 --> 00:15:21,590 Now, let's do some things where you go, like, wow, 334 00:15:21,590 --> 00:15:24,120 this is really something that would just 335 00:15:24,120 --> 00:15:27,058 be prohibitively complicated to do it using other techniques. 336 00:15:35,210 --> 00:15:37,070 So once again, we load our data. 337 00:15:37,070 --> 00:15:38,990 We convert it to numeric. 338 00:15:38,990 --> 00:15:43,570 We get our object and our time and our space keys. 339 00:15:43,570 --> 00:15:45,400 We find our tracks. 340 00:15:45,400 --> 00:15:47,720 And then we've built something called FindTrackGraph. 341 00:15:50,310 --> 00:15:51,370 All right. 342 00:15:51,370 --> 00:15:53,690 And this is actually not that complicated. 343 00:15:53,690 --> 00:15:55,930 But it is more than, like, one or two lines. 344 00:15:55,930 --> 00:15:59,210 But what it does, it says, OK, I have this track. 345 00:15:59,210 --> 00:16:03,910 This track is a sequence of locations in a particular time 346 00:16:03,910 --> 00:16:04,770 order. 347 00:16:04,770 --> 00:16:07,280 Well, now, I want to build a graph that's 348 00:16:07,280 --> 00:16:09,580 location by location. 349 00:16:09,580 --> 00:16:13,200 So if a track started in one location, 350 00:16:13,200 --> 00:16:16,210 and then its next destination was another, 351 00:16:16,210 --> 00:16:18,870 that will, of course, create a new graph. 352 00:16:18,870 --> 00:16:19,680 OK. 353 00:16:19,680 --> 00:16:22,800 So I now have a new graph, which is essentially 354 00:16:22,800 --> 00:16:26,320 220 by 220 locations. 355 00:16:26,320 --> 00:16:29,210 And we can actually take a look at that. 356 00:16:29,210 --> 00:16:31,700 And that's this graph here. 
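The location-by-location graph described above can be sketched in plain Python: order each person's sightings by time and count every consecutive pair of locations as one edge. The function name `find_track_graph` is a hypothetical stand-in written for this example and is not the library's FindTrackGraph; the track data is invented.

```python
from collections import Counter, defaultdict

# Hypothetical tracks: (person, time) -> location.
tracks = {
    ("PERSON/a", "TIME/1996-11-01"): "LOCATION/belgium",
    ("PERSON/a", "TIME/1996-11-03"): "LOCATION/albania",
    ("PERSON/a", "TIME/1996-11-07"): "LOCATION/albania",
    ("PERSON/b", "TIME/1996-11-02"): "LOCATION/belgium",
    ("PERSON/b", "TIME/1996-11-04"): "LOCATION/albania",
}


def find_track_graph(tracks):
    """Count transitions (from_location, to_location) over all tracks."""
    by_person = defaultdict(list)
    for (person, time), location in tracks.items():
        by_person[person].append((time, location))

    graph = Counter()
    for visits in by_person.values():
        visits.sort()  # time order within one person's track
        for (_, src), (_, dst) in zip(visits, visits[1:]):
            graph[(src, dst)] += 1
    return graph


G = find_track_graph(tracks)
for edge, count in sorted(G.items()):
    print(edge, count)
```

A person who is seen in the same place twice in a row contributes a self-loop, which is one reason the diagonal of such a graph tends to be dense.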
357 00:16:31,700 --> 00:16:36,260 So this basically says, you know, 358 00:16:36,260 --> 00:16:39,320 there was a track that started in Belgium, 359 00:16:39,320 --> 00:16:41,200 and then its next stop was Albania. 360 00:16:43,864 --> 00:16:44,780 Or here's another one. 361 00:16:44,780 --> 00:16:47,420 It started in Australia and ended in Colombia. 362 00:16:47,420 --> 00:16:52,920 And obviously, we have a dense diagonal here, 363 00:16:52,920 --> 00:16:56,260 because by definition-- well, actually 364 00:16:56,260 --> 00:16:59,516 a lot of times, that's just the way it works. 365 00:16:59,516 --> 00:17:01,140 And so again, here's Damascus, Florida, 366 00:17:01,140 --> 00:17:02,098 all this type of thing. 367 00:17:02,098 --> 00:17:04,784 So now, we've created a new graph of these tracks. 368 00:17:07,980 --> 00:17:13,690 Now, we can do something like a track pattern. 369 00:17:13,690 --> 00:17:19,530 So let's say I just want to look at the tracks associated 370 00:17:19,530 --> 00:17:24,140 with people associated with the organization International 371 00:17:24,140 --> 00:17:25,890 Monetary Fund. 372 00:17:25,890 --> 00:17:30,950 So I'm going to have starts with person. 373 00:17:30,950 --> 00:17:34,570 And I'm going to limit my data. 374 00:17:34,570 --> 00:17:38,430 So I'm going to basically limit it 375 00:17:38,430 --> 00:17:40,270 to data that begins with the organization. 376 00:17:40,270 --> 00:17:44,120 So now, I'm building a new graph, G0. 377 00:17:44,120 --> 00:17:45,120 OK. 378 00:17:45,120 --> 00:17:48,170 FindTrackGraph of just that data set. 379 00:17:48,170 --> 00:17:50,940 So I've basically taken my A, which is this graph, 380 00:17:50,940 --> 00:17:53,670 and I said, oh, just go back and find me 381 00:17:53,670 --> 00:17:57,160 the people associated with the International Monetary Fund.
382 00:17:57,160 --> 00:18:00,570 And now, I can do things like, all right-- 383 00:18:00,570 --> 00:18:03,590 because in this track graph the value 384 00:18:03,590 --> 00:18:06,500 is the number of times that occurred. 385 00:18:06,500 --> 00:18:10,570 So for instance, if I say show me now 386 00:18:10,570 --> 00:18:17,250 all edges that occurred more than twice 387 00:18:17,250 --> 00:18:23,010 and where the tracks that were due to people 388 00:18:23,010 --> 00:18:24,870 associated with the International Monetary 389 00:18:24,870 --> 00:18:30,105 Fund were greater than 20% of all the tracks that occurred. 390 00:18:30,105 --> 00:18:31,730 So basically, I'm looking for something 391 00:18:31,730 --> 00:18:35,560 that happened more than twice, and that IMF folks did 392 00:18:35,560 --> 00:18:36,770 more than twice. 393 00:18:36,770 --> 00:18:39,830 And of all the data, the IMF people 394 00:18:39,830 --> 00:18:41,724 did it a lot of the time. 395 00:18:41,724 --> 00:18:42,890 So we're in a lot of people. 396 00:18:42,890 --> 00:18:45,690 It wasn't just a really, really popular track. 397 00:18:45,690 --> 00:18:49,070 And so we see here, now we get Karachi and Afghanistan. 398 00:18:49,070 --> 00:18:51,760 So we had, essentially, at least two of those. 399 00:18:51,760 --> 00:18:54,490 And all of them were people associated with the IMF. 400 00:18:54,490 --> 00:18:57,030 Afghanistan and Britain. 401 00:18:57,030 --> 00:18:58,460 Here's Britain and England. 402 00:18:58,460 --> 00:19:03,040 Well, obviously that's a little bit-- Britain and England 403 00:19:03,040 --> 00:19:05,050 are the same thing, right? 404 00:19:05,050 --> 00:19:07,430 Here's Islamabad, Islamabad, Moscow, Moscow. 405 00:19:07,430 --> 00:19:10,600 So this just shows you the kinds of things that you 406 00:19:10,600 --> 00:19:12,670 can do with typing in again. 407 00:19:12,670 --> 00:19:15,450 This is a very sophisticated type of analytic.
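The track pattern filter described above compares two graphs: one built from everyone's tracks and one built only from tracks of people tied to a given organization. A plain-Python sketch, assuming both graphs are dictionaries of edge counts, might look like this. The edge counts are invented for illustration, and the two thresholds mirror the "more than twice" and "greater than 20%" conditions from the lecture.

```python
# Hypothetical edge counts: (from_location, to_location) -> number of tracks.
G = {   # all people
    ("LOCATION/karachi", "LOCATION/afghanistan"): 4,
    ("LOCATION/new york", "LOCATION/new york"): 100,
    ("LOCATION/moscow", "LOCATION/moscow"): 5,
    ("LOCATION/paris", "LOCATION/london"): 2,
}
G0 = {  # only people associated with the organization of interest
    ("LOCATION/karachi", "LOCATION/afghanistan"): 3,
    ("LOCATION/new york", "LOCATION/new york"): 5,
    ("LOCATION/moscow", "LOCATION/moscow"): 3,
    ("LOCATION/paris", "LOCATION/london"): 2,
}

MIN_COUNT = 2       # edge must occur more than twice in G0
MIN_FRACTION = 0.2  # and account for more than 20% of that edge in G

pattern = {
    edge: count
    for edge, count in G0.items()
    if count > MIN_COUNT and count / G[edge] > MIN_FRACTION
}

for edge in sorted(pattern):
    print(edge, pattern[edge], G[edge])
```

The second condition is what removes edges that are simply popular with everyone: the New York self-loop passes the count test but makes up only 5% of all such tracks, so it is dropped.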
408 00:19:15,450 --> 00:19:17,890 If you were to try to do those things using existing 409 00:19:17,890 --> 00:19:21,894 types of things-- if you knew this was the analytic to do 410 00:19:21,894 --> 00:19:23,560 and someone handed this to you and said, 411 00:19:23,560 --> 00:19:25,280 go implement it using another technique, 412 00:19:25,280 --> 00:19:26,860 you definitely could do it. 413 00:19:26,860 --> 00:19:30,450 But discovering the analytic using existing techniques 414 00:19:30,450 --> 00:19:32,650 would be very, very time consuming. 415 00:19:32,650 --> 00:19:35,370 And this kind of tool is very, very easy 416 00:19:35,370 --> 00:19:38,000 for you to explore and get the analytic just right. 417 00:19:38,000 --> 00:19:40,140 And I would say that in a certain sense what 418 00:19:40,140 --> 00:19:43,270 D4M is doing here is playing the same role that we've always 419 00:19:43,270 --> 00:19:46,460 used MATLAB for in signal processing. 420 00:19:46,460 --> 00:19:48,787 People use it to play with their algorithms, 421 00:19:48,787 --> 00:19:51,370 to figure out their algorithms, to get their algorithms right. 422 00:19:51,370 --> 00:19:53,870 They know this will give us the right answer. 423 00:19:53,870 --> 00:19:56,330 And then when they deploy it and actually 424 00:19:56,330 --> 00:19:59,430 make it part of a real system, sometimes they'll 425 00:19:59,430 --> 00:20:00,870 just take the MATLAB code and make 426 00:20:00,870 --> 00:20:02,080 it a part of the real system. 427 00:20:02,080 --> 00:20:05,470 But more often than not, the target system or the target 428 00:20:05,470 --> 00:20:09,710 application will require you to port it to some other language, 429 00:20:09,710 --> 00:20:16,160 maybe C++, maybe Java, for deployment reasons. 430 00:20:16,160 --> 00:20:18,340 We still see that happening today 431 00:20:18,340 --> 00:20:20,750 that, you know, algorithm development is one thing, 432 00:20:20,750 --> 00:20:23,070 deployment is another thing.
433 00:20:23,070 --> 00:20:26,020 Even if people use the same language 434 00:20:26,020 --> 00:20:27,976 for doing algorithm and deployment, 435 00:20:27,976 --> 00:20:29,600 usually deployment people end up having 436 00:20:29,600 --> 00:20:31,900 to completely rewrite what the algorithm analysts wrote 437 00:20:31,900 --> 00:20:32,400 anyway. 438 00:20:32,400 --> 00:20:35,057 Because the algorithm analysts had certain things 439 00:20:35,057 --> 00:20:36,140 they were concerned about. 440 00:20:36,140 --> 00:20:38,050 And the deployment person will have 441 00:20:38,050 --> 00:20:40,580 completely different issues that they have to worry about. 442 00:20:40,580 --> 00:20:43,310 But I think D4M allows you to still do 443 00:20:43,310 --> 00:20:47,130 that same kind of model on these new types of data 444 00:20:47,130 --> 00:20:49,520 in a very useful and productive way 445 00:20:49,520 --> 00:20:53,360 to get the productivity that we want out of that. 446 00:20:53,360 --> 00:20:54,620 All right. 447 00:20:54,620 --> 00:20:55,450 Let's see here. 448 00:20:55,450 --> 00:21:00,860 So finally, one last thing, a more complicated analytic, 449 00:21:00,860 --> 00:21:04,630 which I call sort of a multiple hypothesis tracker. 450 00:21:04,630 --> 00:21:09,340 Essentially, what we're doing here is we're loading the data. 451 00:21:09,340 --> 00:21:13,310 We're going to just focus on one person here. 452 00:21:13,310 --> 00:21:17,050 And then the locations are specified by time. 453 00:21:17,050 --> 00:21:19,470 I mean, the time is specified by the time column. 454 00:21:19,470 --> 00:21:21,362 And location is specified by location. 455 00:21:21,362 --> 00:21:23,320 And then I'm going to have this function called 456 00:21:23,320 --> 00:21:27,780 find multiple hypothesis trackers for Michael Chang 457 00:21:27,780 --> 00:21:29,840 with respect to the data set E.
458 00:21:29,840 --> 00:21:33,670 And what this does is this says, all right, 459 00:21:33,670 --> 00:21:37,930 in the previous thing, I was basically just making one pick. 460 00:21:37,930 --> 00:21:42,140 If I had a document and Michael Chang was in it 461 00:21:42,140 --> 00:21:45,760 and there was multiple locations and times, 462 00:21:45,760 --> 00:21:48,930 I would just sort of pick one of those locations and times. 463 00:21:48,930 --> 00:21:52,530 Here, I'm now going to show you, for Michael Chang, 464 00:21:52,530 --> 00:21:55,610 for each document, all the locations and times. 465 00:21:55,610 --> 00:22:01,260 So here's, basically, Africa and time. 466 00:22:01,260 --> 00:22:02,410 And let's see here. 467 00:22:02,410 --> 00:22:04,580 Maybe we can make this a little smaller. 468 00:22:04,580 --> 00:22:09,940 That would probably help a bit, a little smaller. 469 00:22:09,940 --> 00:22:10,700 There we go. 470 00:22:10,700 --> 00:22:11,910 You can barely see that. 471 00:22:11,910 --> 00:22:16,220 But you see here, this basically shows all the times, 472 00:22:16,220 --> 00:22:17,670 all the locations. 473 00:22:17,670 --> 00:22:21,970 And this shows you, essentially, in theory 474 00:22:21,970 --> 00:22:26,604 the true track could be going through any one of these. 475 00:22:26,604 --> 00:22:28,770 There isn't a single track really for Michael Chang. 476 00:22:28,770 --> 00:22:30,820 There's multiple potential tracks. 477 00:22:30,820 --> 00:22:32,700 And then this complex value I happened 478 00:22:32,700 --> 00:22:35,350 to store here just because I'm using complex values, 479 00:22:35,350 --> 00:22:40,060 the first one is the character distance. 480 00:22:40,060 --> 00:22:45,370 The location is-- so Michael Chang 481 00:22:45,370 --> 00:22:49,580 appears in a particular word position in the document. 
482 00:22:49,580 --> 00:22:54,220 And it tells me that Austria appears 278 characters 483 00:22:54,220 --> 00:23:00,960 before him and that the time stamp appears 11 characters 484 00:23:00,960 --> 00:23:02,680 before his name. 485 00:23:02,680 --> 00:23:08,570 And so over here, you see the time stamp 486 00:23:08,570 --> 00:23:13,040 appears-- this is a different document, 487 00:23:13,040 --> 00:23:16,820 but with the same time-- in different locations. 488 00:23:16,820 --> 00:23:20,130 So you can then use this data to actually go back and say, 489 00:23:20,130 --> 00:23:23,350 you know what, I only want to pick the words that 490 00:23:23,350 --> 00:23:25,460 are closest to Michael Chang. 491 00:23:25,460 --> 00:23:28,590 And I want that to actually be the real track for Michael 492 00:23:28,590 --> 00:23:29,090 Chang. 493 00:23:29,090 --> 00:23:32,500 So that just shows the more complicated things 494 00:23:32,500 --> 00:23:34,070 that we can do with that. 495 00:23:34,070 --> 00:23:39,510 So with that, that leads us to the end of the examples. 496 00:23:39,510 --> 00:23:42,260 And if there's any questions, I'll 497 00:23:42,260 --> 00:23:44,780 be happy to take them now if there's any. 498 00:23:47,464 --> 00:23:48,130 All right, good. 499 00:23:48,130 --> 00:23:50,130 I'm just showing you some of the kinds of things 500 00:23:50,130 --> 00:23:51,040 that people can do. 501 00:23:51,040 --> 00:23:53,700 And so there we go.
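The final idea, keeping every candidate location per document and then choosing the one nearest the person's name, can be sketched in plain Python. Several things here are assumptions made for the example: that the real part of the complex value holds the character offset to the location and the imaginary part the offset to the time stamp, that negative means "before the name", and all candidate entries other than the Austria offsets quoted in the lecture.

```python
# Hypothetical candidates for one person:
# (document, location, time) -> complex(offset_to_location, offset_to_time).
candidates = {
    ("doc1.txt", "LOCATION/austria", "TIME/1996-11-12"): complex(-278, -11),
    ("doc1.txt", "LOCATION/vienna", "TIME/1996-11-12"): complex(40, -11),
    ("doc2.txt", "LOCATION/africa", "TIME/1996-11-12"): complex(-15, 90),
    ("doc2.txt", "LOCATION/paris", "TIME/1996-11-12"): complex(300, 90),
}

# For each document keep the candidate whose location is closest
# (in characters) to the person's name.
best = {}
for (doc, location, time), offset in candidates.items():
    distance = abs(offset.real)
    if doc not in best or distance < best[doc][0]:
        best[doc] = (distance, location, time)

# The resulting single track, ordered by time.
track = sorted((time, location, doc) for doc, (_, location, time) in best.items())

for step in track:
    print(step)
```

Nearest-word selection is only one possible rule; because all candidates are retained, a different rule (for example, weighting the time offset as well) could be applied later without rebuilding the data.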