1
00:00:00,040 --> 00:00:02,410
The following content is
provided under a Creative

2
00:00:02,410 --> 00:00:03,790
Commons license.

3
00:00:03,790 --> 00:00:06,030
Your support will help
MIT OpenCourseWare

4
00:00:06,030 --> 00:00:10,100
continue to offer high quality
educational resources for free.

5
00:00:10,100 --> 00:00:12,680
To make a donation or to
view additional materials

6
00:00:12,680 --> 00:00:16,590
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,590 --> 00:00:18,270
at ocw.mit.edu.

8
00:00:20,969 --> 00:00:22,010
JEREMY KEPNER: All right.

9
00:00:22,010 --> 00:00:25,730
We're back.

10
00:00:25,730 --> 00:00:29,025
So I think we just
went over a sort

11
00:00:29,025 --> 00:00:31,150
of a tour of some of the
more complicated analytics

12
00:00:31,150 --> 00:00:32,439
that people can do.

13
00:00:32,439 --> 00:00:34,760
And now, we're going to
show some examples of not

14
00:00:34,760 --> 00:00:37,200
those specific analytics,
but using the Reuters data

15
00:00:37,200 --> 00:00:39,530
set that we had before
some of the analytics

16
00:00:39,530 --> 00:00:40,810
that we can do here.

17
00:00:40,810 --> 00:00:44,700
So again, this is
in the D4M API.

18
00:00:44,700 --> 00:00:46,580
We go into the
Examples directory.

19
00:00:46,580 --> 00:00:48,040
It's Apps.

20
00:00:48,040 --> 00:00:50,320
And now we're going
to deal with tracks.

21
00:00:50,320 --> 00:00:54,250
And this is a data set here.

22
00:00:54,250 --> 00:00:56,440
The data actually,
this Entity.mat

23
00:00:56,440 --> 00:01:00,005
was actually created
in the previous.

24
00:01:02,720 --> 00:01:04,110
And this entity
analysis actually

25
00:01:04,110 --> 00:01:05,560
creates the Entity.mat.

26
00:01:05,560 --> 00:01:08,210
It's the same file.

27
00:01:08,210 --> 00:01:10,770
So we just have over here.

28
00:01:10,770 --> 00:01:13,708
And so let's get started.

29
00:01:16,500 --> 00:01:21,680
And just to remind people,
we do load entity.mat.

30
00:01:26,529 --> 00:01:28,960
You see we have this E here.

31
00:01:28,960 --> 00:01:30,780
That's the data.

32
00:01:30,780 --> 00:01:36,340
It represents, essentially,
almost 10,000 documents

33
00:01:36,340 --> 00:01:38,500
and 3,600 entities in that.

34
00:01:38,500 --> 00:01:45,116
If we do spy E. transpose.

35
00:01:45,116 --> 00:01:47,282
AUDIENCE: Can you move the
bottom of your screen up?

36
00:01:47,282 --> 00:01:50,385
Because we're seeing the
top half of your thing.

37
00:02:02,040 --> 00:02:03,520
JEREMY KEPNER: That better?

38
00:02:03,520 --> 00:02:04,500
AUDIENCE: Much better.

39
00:02:04,500 --> 00:02:05,250
JEREMY KEPNER: OK.

40
00:02:09,300 --> 00:02:11,460
And so there you go.

41
00:02:11,460 --> 00:02:13,260
That is the entire data set.

42
00:02:13,260 --> 00:02:17,640
Again, spy plot, very
useful tool for doing that.

43
00:02:17,640 --> 00:02:20,260
So you can see we have
locations and people

44
00:02:20,260 --> 00:02:23,310
and times in this data set.

45
00:02:23,310 --> 00:02:25,640
We also have
organizations in here.

46
00:02:25,640 --> 00:02:27,870
You can zoom in on
it if you want to.

47
00:02:31,870 --> 00:02:34,820
You can see, the are
the locations, and then

48
00:02:34,820 --> 00:02:37,540
the organizations,
and the people.

49
00:02:37,540 --> 00:02:40,360
Zoom in a little bit
more, you can actually

50
00:02:40,360 --> 00:02:41,840
look at the actual values here.

51
00:02:41,840 --> 00:02:44,140
And you see there's this
common popular location here.

52
00:02:44,140 --> 00:02:45,595
What is that?

53
00:02:45,595 --> 00:02:47,250
Ah, it's the United States.

54
00:02:47,250 --> 00:02:49,530
Yes, the United States
does appear a lot

55
00:02:49,530 --> 00:02:51,930
in Reuters documents
as one would expect.

56
00:02:51,930 --> 00:02:53,480
What do we think
this one is here?

57
00:02:53,480 --> 00:02:54,616
Any guesses?

58
00:02:54,616 --> 00:02:55,430
AUDIENCE: New York.

59
00:02:55,430 --> 00:02:56,651
JEREMY KEPNER: What?

60
00:02:56,651 --> 00:02:58,020
AUDIENCE: Is that New York?

61
00:02:58,020 --> 00:02:58,330
JEREMY KEPNER: Yeah.

62
00:02:58,330 --> 00:02:58,890
Oh, New York.

63
00:02:58,890 --> 00:02:59,889
See, it's already there.

64
00:02:59,889 --> 00:03:00,420
New York.

65
00:03:00,420 --> 00:03:02,400
Yes, we can read.

66
00:03:02,400 --> 00:03:03,940
All right.

67
00:03:03,940 --> 00:03:07,110
Organizations, anything
really popular here?

68
00:03:07,110 --> 00:03:09,420
Maybe this guy, is he popular?

69
00:03:09,420 --> 00:03:10,947
International Red Cross, right?

70
00:03:10,947 --> 00:03:13,280
And people here, we don't
have any really popular people

71
00:03:13,280 --> 00:03:15,504
in this list.

72
00:03:15,504 --> 00:03:18,840
Ahmad Shah Massoud, I have no
idea who that person was back

73
00:03:18,840 --> 00:03:21,310
in 1996.

74
00:03:21,310 --> 00:03:24,640
Anyway, so that's the data.

75
00:03:24,640 --> 00:03:27,670
And you see, by the way, when
every single time you click.

76
00:03:27,670 --> 00:03:30,980
Because when we make
those spy plots,

77
00:03:30,980 --> 00:03:33,910
I have to do some compression
on the strings just

78
00:03:33,910 --> 00:03:36,350
to make it work.

79
00:03:36,350 --> 00:03:39,010
But we actually always
print out onto the screen

80
00:03:39,010 --> 00:03:40,790
the full exact string.

81
00:03:40,790 --> 00:03:43,480
So if I want that string,
you can just then copy it

82
00:03:43,480 --> 00:03:45,657
if you want to
say copy and paste

83
00:03:45,657 --> 00:03:46,740
it or something like that.

84
00:03:46,740 --> 00:03:49,031
Or the full person or something
like that, you can then

85
00:03:49,031 --> 00:03:51,370
copy that and paste that.

86
00:03:51,370 --> 00:03:59,772
You can do something like
that like, E, this guy, yeah.

87
00:03:59,772 --> 00:04:00,730
And there, it's showed.

88
00:04:00,730 --> 00:04:05,000
That's all the documents that
contain this person's name.

89
00:04:05,000 --> 00:04:08,350
And then this shows the
character position, I believe,

90
00:04:08,350 --> 00:04:10,130
that they appeared
in the document.

91
00:04:10,130 --> 00:04:15,550
So again, you have all
kinds of fun with that.

92
00:04:15,550 --> 00:04:17,560
You can do row, that.

93
00:04:17,560 --> 00:04:18,839
That gets us all those rows.

94
00:04:18,839 --> 00:04:21,085
And then we can pass
those rows back in.

95
00:04:21,085 --> 00:04:27,980
And let's say we want
to do starts with,

96
00:04:27,980 --> 00:04:31,384
let's see here, how
about organization?

97
00:04:31,384 --> 00:04:32,800
Everyone correct
my spelling here.

98
00:04:32,800 --> 00:04:39,050
I'm going to type this
wrong, organization slash.

99
00:04:39,050 --> 00:04:42,532
All right, there we go.

100
00:04:42,532 --> 00:04:44,120
AUDIENCE: Starts with.

101
00:04:44,120 --> 00:04:44,190
JEREMY KEPNER: Yeah.

102
00:04:44,190 --> 00:04:45,190
I always get that wrong.

103
00:04:45,190 --> 00:04:46,917
Stars with-- starts with.

104
00:04:46,917 --> 00:04:47,417
See?

105
00:04:47,417 --> 00:04:48,251
There you go.

106
00:04:48,251 --> 00:04:48,810
All right.

107
00:04:48,810 --> 00:04:49,351
There you go.

108
00:04:49,351 --> 00:04:51,540
So this shows you
all the organizations

109
00:04:51,540 --> 00:04:54,080
that are cited in
the documents that

110
00:04:54,080 --> 00:04:56,366
contained this person's name.

111
00:04:56,366 --> 00:04:56,865
You know?

112
00:04:59,694 --> 00:05:01,360
This is kind of the
spirit of it, right?

113
00:05:01,360 --> 00:05:03,090
I mean, it's like
just, oh, I want that?

114
00:05:03,090 --> 00:05:03,760
I can get that.

115
00:05:03,760 --> 00:05:06,397
Oh, I could then say, oh, all
right, well get me-- you know?

116
00:05:06,397 --> 00:05:08,480
And you could just keep
going and going and going.

117
00:05:08,480 --> 00:05:13,982
If I did r c s of
that, right, it

118
00:05:13,982 --> 00:05:15,440
would never return
those as triple.

119
00:05:15,440 --> 00:05:17,180
So r, those are
all the documents.

120
00:05:17,180 --> 00:05:19,680
C, those are all the columns.

121
00:05:19,680 --> 00:05:21,840
V, there we go.

122
00:05:21,840 --> 00:05:23,260
You get the idea.

123
00:05:23,260 --> 00:05:27,830
So again, very, very powerful
type of syntax there.

124
00:05:27,830 --> 00:05:28,996
All right.

125
00:05:28,996 --> 00:05:30,870
So I'll just give you
a sense of the data set

126
00:05:30,870 --> 00:05:31,786
that we're looking at.

127
00:05:31,786 --> 00:05:34,040
So let's look at the
first example here.

128
00:05:34,040 --> 00:05:37,432
So we're going to do track,
analytic, build, test.

129
00:05:37,432 --> 00:05:39,640
So we're going to build some
tracks out of this data.

130
00:05:39,640 --> 00:05:41,500
So I have these documents.

131
00:05:41,500 --> 00:05:46,110
And they have locations in
them, and people, and times.

132
00:05:46,110 --> 00:05:50,830
So I could say, hey, if
there's a person and a location

133
00:05:50,830 --> 00:05:53,370
and a time in a
document, maybe I could

134
00:05:53,370 --> 00:05:55,760
call that's sort of a track.

135
00:05:55,760 --> 00:05:56,480
So let's do that.

136
00:05:56,480 --> 00:05:57,840
So what do we do here?

137
00:05:57,840 --> 00:06:01,560
So the first thing we do
is we loaded the data.

138
00:06:01,560 --> 00:06:03,560
The string values are
those character positions.

139
00:06:03,560 --> 00:06:04,935
We don't really
care about those.

140
00:06:04,935 --> 00:06:06,900
So we're going to
just get rid of them

141
00:06:06,900 --> 00:06:09,400
and just convert it to a
numeric like we've always

142
00:06:09,400 --> 00:06:10,550
been doing here.

143
00:06:10,550 --> 00:06:14,600
And then I'm going
to say my thing

144
00:06:14,600 --> 00:06:16,990
that I want to track is
going to be anything starting

145
00:06:16,990 --> 00:06:17,730
with person.

146
00:06:17,730 --> 00:06:18,960
So I set that.

147
00:06:18,960 --> 00:06:22,360
And my time thing is going to
be anything starting my time.

148
00:06:22,360 --> 00:06:24,710
And my location thing
is going to be anything

149
00:06:24,710 --> 00:06:25,920
here starting with location.

150
00:06:25,920 --> 00:06:29,114
So I've done the starts
with to get these ranges.

151
00:06:29,114 --> 00:06:30,780
And now, the first
thing I'm going to do

152
00:06:30,780 --> 00:06:37,690
is I want to limit my data to
only rows that have at least

153
00:06:37,690 --> 00:06:39,690
one of all three of those.

154
00:06:39,690 --> 00:06:43,630
So I'm not dealing with I
have a person and a location

155
00:06:43,630 --> 00:06:44,680
and no time or whatever.

156
00:06:44,680 --> 00:06:46,470
So I'm just going
to clean that up.

157
00:06:46,470 --> 00:06:49,780
So basically, I
get all the people.

158
00:06:49,780 --> 00:06:51,480
And I sum that.

159
00:06:51,480 --> 00:06:56,550
I basically sum
across the columns.

160
00:06:56,550 --> 00:06:59,010
So I basically
compress the columns.

161
00:06:59,010 --> 00:07:01,340
And then I sum the rows.

162
00:07:01,340 --> 00:07:03,550
All right.

163
00:07:03,550 --> 00:07:06,620
There we go, sum those.

164
00:07:06,620 --> 00:07:09,397
I get, then, all three.

165
00:07:09,397 --> 00:07:10,730
And then I filter them back out.

166
00:07:10,730 --> 00:07:13,230
And that just reduces this
to the ones that contain

167
00:07:13,230 --> 00:07:16,911
just the ones that have these.

168
00:07:20,480 --> 00:07:21,330
Let's see here.

169
00:07:21,330 --> 00:07:25,330
So now, I want to
collapse these.

170
00:07:25,330 --> 00:07:29,070
I want to create, essentially,
just edges and times.

171
00:07:29,070 --> 00:07:32,700
So I can do that with the call
to type syntax and the val

172
00:07:32,700 --> 00:07:33,780
to call syntax.

173
00:07:33,780 --> 00:07:37,710
And going and bopping back
and forth between that,

174
00:07:37,710 --> 00:07:40,840
I get a set of edges, the edge
list, which is essentially

175
00:07:40,840 --> 00:07:43,629
the document and the time.

176
00:07:43,629 --> 00:07:45,670
And now, I'm going to
combine these back together

177
00:07:45,670 --> 00:07:49,770
into a new associative array,
which essentially still

178
00:07:49,770 --> 00:07:54,780
has the same text label, which
is essentially the document.

179
00:07:54,780 --> 00:07:56,710
But it has columns of time.

180
00:07:56,710 --> 00:07:58,200
And the value is space.

181
00:07:58,200 --> 00:07:58,700
All right.

182
00:07:58,700 --> 00:08:01,180
And then I'm going to do
another one, which is,

183
00:08:01,180 --> 00:08:06,480
again, has a row, which is
the document or the edge.

184
00:08:06,480 --> 00:08:10,710
And it has a column of
space and a value of time.

185
00:08:10,710 --> 00:08:14,740
And now, I can construct
a track from this

186
00:08:14,740 --> 00:08:20,300
through this wonderful
sparse matrix multiply.

187
00:08:20,300 --> 00:08:25,360
So essentially, I transpose Etx.

188
00:08:25,360 --> 00:08:28,190
And then I'm going to
just get the people

189
00:08:28,190 --> 00:08:31,720
and convert those
numeric values.

190
00:08:31,720 --> 00:08:33,600
And then we do this
cat value mul, which

191
00:08:33,600 --> 00:08:36,260
will actually convert that.

192
00:08:36,260 --> 00:08:37,220
These are time tracks.

193
00:08:37,220 --> 00:08:38,679
And these are space tracks.

194
00:08:38,679 --> 00:08:40,220
And again, it's a
little difficult

195
00:08:40,220 --> 00:08:44,710
to explain to you
exactly why this works,

196
00:08:44,710 --> 00:08:47,410
why these matrix multiplies
give us the answer

197
00:08:47,410 --> 00:08:48,399
that we're going for.

198
00:08:48,399 --> 00:08:50,940
Because we're going to have to
sit and think about the actual

199
00:08:50,940 --> 00:08:51,780
matrices.

200
00:08:51,780 --> 00:08:54,120
This is a great example
to go do, and then

201
00:08:54,120 --> 00:08:56,890
explore these various
associative arrays to actually

202
00:08:56,890 --> 00:09:02,620
see why these matrix
multiplies actually give us

203
00:09:02,620 --> 00:09:04,150
the answer that we want.

204
00:09:04,150 --> 00:09:08,320
So in fact, we can
take a look at those.

205
00:09:08,320 --> 00:09:09,866
I just want to look
at Figure 1 here.

206
00:09:12,830 --> 00:09:15,150
So this shows us,
basically, and I

207
00:09:15,150 --> 00:09:20,190
plotted the transpose of
this, the people on the right.

208
00:09:20,190 --> 00:09:22,000
And then these are times.

209
00:09:22,000 --> 00:09:24,800
And so, basically,
for each row here, I

210
00:09:24,800 --> 00:09:26,870
have a listing of times.

211
00:09:26,870 --> 00:09:31,680
And if I click on one of them,
it will give me a location.

212
00:09:31,680 --> 00:09:32,180
OK.

213
00:09:36,860 --> 00:09:38,600
In fact, I think that
number's the number

214
00:09:38,600 --> 00:09:40,380
of times that appears.

215
00:09:40,380 --> 00:09:43,990
So basically, we
have here the person,

216
00:09:43,990 --> 00:09:48,990
Daniel Smith was in a document
with a time stamp of 1996,

217
00:09:48,990 --> 00:09:49,750
November 12.

218
00:09:49,750 --> 00:09:51,960
Oh, that's almost-- yesterday.

219
00:09:51,960 --> 00:10:01,700
And then the location New
York appeared once in that.

220
00:10:01,700 --> 00:10:02,520
Here's another one.

221
00:10:02,520 --> 00:10:05,020
And so this is a track of sorts.

222
00:10:05,020 --> 00:10:08,340
We basically have a
person and a set of times

223
00:10:08,340 --> 00:10:11,410
and a set of tracks.

224
00:10:11,410 --> 00:10:15,150
That's one kind of track.

225
00:10:15,150 --> 00:10:16,650
Another kind of
track here is now

226
00:10:16,650 --> 00:10:20,637
we have person and locations.

227
00:10:20,637 --> 00:10:22,720
So that was the other
matrix multiply that we did.

228
00:10:22,720 --> 00:10:26,180
And so now, we have person
Carole King, location Buffalo,

229
00:10:26,180 --> 00:10:29,130
and on this time.

230
00:10:29,130 --> 00:10:32,502
So those are two different ways
of representing the tracks.

231
00:10:32,502 --> 00:10:33,710
Obviously, these are triples.

232
00:10:33,710 --> 00:10:35,680
But then you can use
either of these matrices

233
00:10:35,680 --> 00:10:39,260
to do additional queries
and other types of things.

234
00:10:39,260 --> 00:10:40,150
All right.

235
00:10:40,150 --> 00:10:42,905
So that just shows you how
using matrix multiplies

236
00:10:42,905 --> 00:10:44,280
and other types
of things you can

237
00:10:44,280 --> 00:10:48,370
construct more sophisticated
graphs or data structures,

238
00:10:48,370 --> 00:10:52,630
in this case tracks, which is a
very interesting type of thing.

239
00:10:52,630 --> 00:10:54,032
Let's move on to the next one.

240
00:10:54,032 --> 00:10:54,990
That's going to be TA2.

241
00:11:01,500 --> 00:11:03,830
So this is a slightly more
sophisticated tract builder.

242
00:11:03,830 --> 00:11:06,320
Again, so when I
read the data in,

243
00:11:06,320 --> 00:11:09,090
i create my three sort
of categories here,

244
00:11:09,090 --> 00:11:11,820
the object, and the
time, and the location,

245
00:11:11,820 --> 00:11:13,020
or the coordinate.

246
00:11:13,020 --> 00:11:15,960
And then I have a
function here called

247
00:11:15,960 --> 00:11:18,600
find tracks, which actually just
goes and creates those tracks

248
00:11:18,600 --> 00:11:24,070
that I essentially did
in the last section.

249
00:11:24,070 --> 00:11:25,770
To be honest with
you, the reason

250
00:11:25,770 --> 00:11:28,210
I did that is because some
of those matrix multiplies

251
00:11:28,210 --> 00:11:30,790
used to be really,
really, really slow.

252
00:11:30,790 --> 00:11:33,540
And so I did a sort
of special function

253
00:11:33,540 --> 00:11:36,480
that took advantage of
certain properties of the data

254
00:11:36,480 --> 00:11:40,000
to make it find these
tracks much faster.

255
00:11:40,000 --> 00:11:42,660
Eventually, I broke down and
just optimized the matrix

256
00:11:42,660 --> 00:11:44,610
multiply.

257
00:11:44,610 --> 00:11:47,840
In the past when I ran that
last query, before it would have

258
00:11:47,840 --> 00:11:50,810
taken like a minute to
actually run the analytic,

259
00:11:50,810 --> 00:11:52,140
which got annoying.

260
00:11:52,140 --> 00:11:53,550
So I optimized it.

261
00:11:53,550 --> 00:11:54,960
But we still have this code.

262
00:11:54,960 --> 00:11:59,600
This code shows all kinds of
little tricks and techniques

263
00:11:59,600 --> 00:12:02,290
for doing things that
are slightly better

264
00:12:02,290 --> 00:12:04,539
and using triples instead
of associate arrays

265
00:12:04,539 --> 00:12:05,830
if you want to do optimization.

266
00:12:05,830 --> 00:12:07,990
So we leave it here.

267
00:12:07,990 --> 00:12:10,410
But the matrix
multiply performance

268
00:12:10,410 --> 00:12:14,960
is now pretty good that these
tricks are less necessary.

269
00:12:14,960 --> 00:12:17,310
So what we want to do here,
we have this track now.

270
00:12:17,310 --> 00:12:19,180
And I want to do a track query.

271
00:12:19,180 --> 00:12:23,230
So I have a person
here, Michael Chang,

272
00:12:23,230 --> 00:12:26,930
another person Javier Sanchez.

273
00:12:26,930 --> 00:12:29,310
Now, Michael Chang was a
tennis player at this time.

274
00:12:29,310 --> 00:12:33,160
Was Javier Sanchez also a
tennis player at this time?

275
00:12:33,160 --> 00:12:33,720
I don't know.

276
00:12:33,720 --> 00:12:35,511
I think there was a
Javier Sanchez that was

277
00:12:35,511 --> 00:12:36,950
a tennis player at that time.

278
00:12:36,950 --> 00:12:38,580
So we just want to look.

279
00:12:38,580 --> 00:12:42,100
We're going to just do,
essentially, here A and just

280
00:12:42,100 --> 00:12:44,800
say give me of this track.

281
00:12:44,800 --> 00:12:52,510
And say give me the listings
for these two people, P1, P2.

282
00:12:52,510 --> 00:12:55,260
And then we use our Display Full
command to sort of make them

283
00:12:55,260 --> 00:12:57,450
in a nice neat tabular format.

284
00:12:57,450 --> 00:12:59,440
And you see here,
basically, here

285
00:12:59,440 --> 00:13:03,870
is Javier Sanchez' listing.

286
00:13:03,870 --> 00:13:04,440
OK.

287
00:13:04,440 --> 00:13:05,610
And here is Michael Chang's.

288
00:13:05,610 --> 00:13:07,490
And you see there's
no overlap here.

289
00:13:07,490 --> 00:13:13,280
We don't ever have them in
the same time or same place.

290
00:13:13,280 --> 00:13:15,280
We can also do things
like track windows.

291
00:13:15,280 --> 00:13:18,920
So we can say I want to
set a time range here

292
00:13:18,920 --> 00:13:22,280
and a location, Australia.

293
00:13:22,280 --> 00:13:24,250
So if we have our
track thing here

294
00:13:24,250 --> 00:13:28,940
and I say, all right,
give me the time range, T,

295
00:13:28,940 --> 00:13:33,830
and then equal to all
locations in Australia,

296
00:13:33,830 --> 00:13:37,140
this shows me all
tracks that essentially

297
00:13:37,140 --> 00:13:42,410
went through this location
in this time window.

298
00:13:42,410 --> 00:13:45,010
And these are the different
folks that they list,

299
00:13:45,010 --> 00:13:48,680
Sanchez, Melissa Russo,
whoever that was,

300
00:13:48,680 --> 00:13:50,660
Michael Chang, and
Michelle Martin.

301
00:13:50,660 --> 00:13:54,230
So those are just an example of
a more sophisticated analytic.

302
00:13:54,230 --> 00:13:56,180
And here, we're
using the fact that

303
00:13:56,180 --> 00:14:00,830
for our associative
arrays, we actually

304
00:14:00,830 --> 00:14:03,540
have defined equals equals.

305
00:14:03,540 --> 00:14:07,940
So this only works,
though, if x is a constant.

306
00:14:07,940 --> 00:14:10,250
So it will check
the value to see

307
00:14:10,250 --> 00:14:13,322
if that value, if it's a string,
if it's equal to that string,

308
00:14:13,322 --> 00:14:14,780
or if it's a numeric
value, if it's

309
00:14:14,780 --> 00:14:17,070
equal to that numeric value.

310
00:14:17,070 --> 00:14:20,960
But it only works
with a constant.

311
00:14:20,960 --> 00:14:23,530
One could argue that maybe
I should make this work

312
00:14:23,530 --> 00:14:26,292
for a list of strings.

313
00:14:26,292 --> 00:14:28,000
But then the MATLAB
syntax doesn't really

314
00:14:28,000 --> 00:14:29,370
work there either.

315
00:14:29,370 --> 00:14:34,704
If I have a matrix equals equal
to a list of-- I don't know.

316
00:14:34,704 --> 00:14:36,120
I don't know if
that really works.

317
00:14:36,120 --> 00:14:39,400
So we try and preserve the
MATLAB syntax where we can.

318
00:14:39,400 --> 00:14:41,700
And again, then we're
just getting the columns.

319
00:14:41,700 --> 00:14:43,790
Again, this thing returns
an associative array.

320
00:14:43,790 --> 00:14:46,450
This equal equals returns
the associative array

321
00:14:46,450 --> 00:14:49,010
of all things that
are equal to that.

322
00:14:49,010 --> 00:14:51,151
And then we can
look at the columns.

323
00:14:51,151 --> 00:14:53,680
All right.

324
00:14:53,680 --> 00:14:59,030
Moving on here, next one's TA3.

325
00:14:59,030 --> 00:15:01,222
So those are fairly
simple track builders.

326
00:15:01,222 --> 00:15:02,680
Let's begin to do
something that's,

327
00:15:02,680 --> 00:15:08,990
I think, kind of-- doing
that track analytic,

328
00:15:08,990 --> 00:15:11,780
one could imagine doing that
with existing techniques that

329
00:15:11,780 --> 00:15:14,510
are out there, existing
tools and stuff like that.

330
00:15:14,510 --> 00:15:15,526
It would be long.

331
00:15:15,526 --> 00:15:17,150
You would write a
lot of code to do it.

332
00:15:17,150 --> 00:15:19,507
But you could do it.

333
00:15:19,507 --> 00:15:21,590
Now, let's do some things
where you go, like, wow,

334
00:15:21,590 --> 00:15:24,120
this is really something
that would just

335
00:15:24,120 --> 00:15:27,058
be prohibitively complicated to
do it using other techniques.

336
00:15:35,210 --> 00:15:37,070
So once again, we load our data.

337
00:15:37,070 --> 00:15:38,990
We convert it to numeric.

338
00:15:38,990 --> 00:15:43,570
We get our object and our
time and our space keys.

339
00:15:43,570 --> 00:15:45,400
We find our tracks.

340
00:15:45,400 --> 00:15:47,720
And then we've built something
called FindTrackGraph.

341
00:15:50,310 --> 00:15:51,370
All right.

342
00:15:51,370 --> 00:15:53,690
And this is actually
not that complicated.

343
00:15:53,690 --> 00:15:55,930
But it is more than,
like, one or two lines.

344
00:15:55,930 --> 00:15:59,210
But what it does, it says,
OK, I have this track.

345
00:15:59,210 --> 00:16:03,910
This track is a sequence of
locations in a particular time

346
00:16:03,910 --> 00:16:04,770
order.

347
00:16:04,770 --> 00:16:07,280
Well, now, I want to
build a graph that's

348
00:16:07,280 --> 00:16:09,580
location by location.

349
00:16:09,580 --> 00:16:13,200
So if a track started
in one location,

350
00:16:13,200 --> 00:16:16,210
and then its next
destination was another,

351
00:16:16,210 --> 00:16:18,870
that will, of course,
create a new graph.

352
00:16:18,870 --> 00:16:19,680
OK.

353
00:16:19,680 --> 00:16:22,800
So I now have a new graph,
which is essentially

354
00:16:22,800 --> 00:16:26,320
220 by 220 locations.

355
00:16:26,320 --> 00:16:29,210
And we can actually
take a look at that.

356
00:16:29,210 --> 00:16:31,700
And that's this graph here.

357
00:16:31,700 --> 00:16:36,260
So this basically
says, you know,

358
00:16:36,260 --> 00:16:39,320
there was a track that
started in Belgium,

359
00:16:39,320 --> 00:16:41,200
and then its next
stop was Albania.

360
00:16:43,864 --> 00:16:44,780
Or here's another one.

361
00:16:44,780 --> 00:16:47,420
It started in Australia
and ended in Colombia.

362
00:16:47,420 --> 00:16:52,920
And obviously, we have
a dense diagonal here,

363
00:16:52,920 --> 00:16:56,260
because by definition--
well, actually

364
00:16:56,260 --> 00:16:59,516
a lot of times, that's
just the way it works.

365
00:16:59,516 --> 00:17:01,140
And so again, here's
Damascus, Florida,

366
00:17:01,140 --> 00:17:02,098
all this type of thing.

367
00:17:02,098 --> 00:17:04,784
So now, we've created a
new graph of these tracks.

368
00:17:07,980 --> 00:17:13,690
Now, we can do something
like a track pattern.

369
00:17:13,690 --> 00:17:19,530
So let's say I just want to
look at the tracks associated

370
00:17:19,530 --> 00:17:24,140
with people associated with
the organization International

371
00:17:24,140 --> 00:17:25,890
Monetary Fund.

372
00:17:25,890 --> 00:17:30,950
So I'm going have
starts with person.

373
00:17:30,950 --> 00:17:34,570
And I'm going to limit my data.

374
00:17:34,570 --> 00:17:38,430
So I'm going to
basically limit it

375
00:17:38,430 --> 00:17:40,270
to data that begins
with the organization.

376
00:17:40,270 --> 00:17:44,120
So now, I'm building
a new graph, G0.

377
00:17:44,120 --> 00:17:45,120
OK.

378
00:17:45,120 --> 00:17:48,170
FindTrackGraph of
just that data set.

379
00:17:48,170 --> 00:17:50,940
So I've basically taken
my A, which is this graph,

380
00:17:50,940 --> 00:17:53,670
and I said, oh, just
go back and find me

381
00:17:53,670 --> 00:17:57,160
the people associated with the
International Monetary Fund.

382
00:17:57,160 --> 00:18:00,570
And now, I can do
things like, all right--

383
00:18:00,570 --> 00:18:03,590
because this track
graph the value

384
00:18:03,590 --> 00:18:06,500
is the number of
times that occurred.

385
00:18:06,500 --> 00:18:10,570
So for instance, if
I say show me now

386
00:18:10,570 --> 00:18:17,250
all edges that occurred
more than twice

387
00:18:17,250 --> 00:18:23,010
and where are the tracks
that were due to people

388
00:18:23,010 --> 00:18:24,870
associate the
International Monetary

389
00:18:24,870 --> 00:18:30,105
Fund were greater than 20% of
all the tracks that occurred.

390
00:18:30,105 --> 00:18:31,730
So basically, I'm
looking for something

391
00:18:31,730 --> 00:18:35,560
that happened more than
twice, in that IMF folks did

392
00:18:35,560 --> 00:18:36,770
more than twice.

393
00:18:36,770 --> 00:18:39,830
And of all the
data, the IMF people

394
00:18:39,830 --> 00:18:41,724
did it a lot of the time.

395
00:18:41,724 --> 00:18:42,890
So we're in a lot of people.

396
00:18:42,890 --> 00:18:45,690
It wasn't just a really,
really popular track.

397
00:18:45,690 --> 00:18:49,070
And so we see here, now we
get Karachi and Afghanistan.

398
00:18:49,070 --> 00:18:51,760
So we had, essentially,
at least two of those.

399
00:18:51,760 --> 00:18:54,490
And all of them were people
associated with the IMF.

400
00:18:54,490 --> 00:18:57,030
Afghanistan and Britain.

401
00:18:57,030 --> 00:18:58,460
Here's Britain and England.

402
00:18:58,460 --> 00:19:03,040
Well, obviously that's a little
bit-- Britain and England

403
00:19:03,040 --> 00:19:05,050
are the same thing, right?

404
00:19:05,050 --> 00:19:07,430
Here's Islamabad,
Islamabad, Moscow, Moscow.

405
00:19:07,430 --> 00:19:10,600
So here just shows you the
kinds of things that you

406
00:19:10,600 --> 00:19:12,670
can do with typing in again.

407
00:19:12,670 --> 00:19:15,450
This is a very sophisticated
type of analytic.

408
00:19:15,450 --> 00:19:17,890
If you were to try and to do
those things using existing

409
00:19:17,890 --> 00:19:21,894
types of things-- if you knew
this was the analytic to do

410
00:19:21,894 --> 00:19:23,560
and someone handed
this to you and said,

411
00:19:23,560 --> 00:19:25,280
go implement it using
another technique,

412
00:19:25,280 --> 00:19:26,860
you definitely could do it.

413
00:19:26,860 --> 00:19:30,450
But discovering the analytic
using existing techniques

414
00:19:30,450 --> 00:19:32,650
would be very, very
time consuming.

415
00:19:32,650 --> 00:19:35,370
And this kind of tool
is very, very easy

416
00:19:35,370 --> 00:19:38,000
for you to explore and get
the analytic just right.

417
00:19:38,000 --> 00:19:40,140
And I would say that
in a certain sense what

418
00:19:40,140 --> 00:19:43,270
D4M is doing here is doing the
same rule that we've always

419
00:19:43,270 --> 00:19:46,460
used MATLAB for in
signal processing.

420
00:19:46,460 --> 00:19:48,787
People use it to play
with their algorithms,

421
00:19:48,787 --> 00:19:51,370
to figure out their algorithms,
to get their algorithms right.

422
00:19:51,370 --> 00:19:53,870
They know this will give
us the right answer.

423
00:19:53,870 --> 00:19:56,330
And then when they
deploy it and actually

424
00:19:56,330 --> 00:19:59,430
make it part of a real
system, sometimes they'll

425
00:19:59,430 --> 00:20:00,870
just take the
MATLAB code and make

426
00:20:00,870 --> 00:20:02,080
it a part of the real system.

427
00:20:02,080 --> 00:20:05,470
But more often than not, the
target system or the target

428
00:20:05,470 --> 00:20:09,710
application will require you to
port it to some other language,

429
00:20:09,710 --> 00:20:16,160
maybe C++, maybe Java,
for deployment reasons.

430
00:20:16,160 --> 00:20:18,340
We still see that
happening today

431
00:20:18,340 --> 00:20:20,750
that, you know, algorithm
development is one thing,

432
00:20:20,750 --> 00:20:23,070
deployment is another thing.

433
00:20:23,070 --> 00:20:26,020
Even if people use
the same language

434
00:20:26,020 --> 00:20:27,976
for doing algorithm
and deployment,

435
00:20:27,976 --> 00:20:29,600
usually deployment
people end up having

436
00:20:29,600 --> 00:20:31,900
to completely rewrite what
the algorithm analysts wrote

437
00:20:31,900 --> 00:20:32,400
anyway.

438
00:20:32,400 --> 00:20:35,057
Because the algorithm
analysts had certain things

439
00:20:35,057 --> 00:20:36,140
they were concerned about.

440
00:20:36,140 --> 00:20:38,050
And the deployment
person will have

441
00:20:38,050 --> 00:20:40,580
completely different issues
that they have to worry about.

442
00:20:40,580 --> 00:20:43,310
But I think D4M
allows you to still do

443
00:20:43,310 --> 00:20:47,130
that same kind of model
on these new types of data

444
00:20:47,130 --> 00:20:49,520
in a very useful
and productive way

445
00:20:49,520 --> 00:20:53,360
to get the productivity
that we want out of that.

446
00:20:53,360 --> 00:20:54,620
All right.

447
00:20:54,620 --> 00:20:55,450
Let's see here.

448
00:20:55,450 --> 00:21:00,860
So finally, one last thing,
a more complicated analytic,

449
00:21:00,860 --> 00:21:04,630
which I call sort of a
multiple hypothesis tracker.

450
00:21:04,630 --> 00:21:09,340
Essentially, what we're doing
here is we're loading the data.

451
00:21:09,340 --> 00:21:13,310
We're going to just
focus on one person here.

452
00:21:13,310 --> 00:21:17,050
And then the locations
are specified by time.

453
00:21:17,050 --> 00:21:19,470
I mean, the time is
specified by the time column.

454
00:21:19,470 --> 00:21:21,362
And location is
specified by location.

455
00:21:21,362 --> 00:21:23,320
And then I'm going to
have this function called

456
00:21:23,320 --> 00:21:27,780
find multiply hypothesis
trackers for Michael Chang

457
00:21:27,780 --> 00:21:29,840
with respect to the data set E.

458
00:21:29,840 --> 00:21:33,670
And what this does is
this says, all right,

459
00:21:33,670 --> 00:21:37,930
in the previous thing, I was
basically just making one pick.

460
00:21:37,930 --> 00:21:42,140
If I had a document and
Michael Chang was in it

461
00:21:42,140 --> 00:21:45,760
and there was multiple
locations and times,

462
00:21:45,760 --> 00:21:48,930
I would just sort of pick one
of those locations and times.

463
00:21:48,930 --> 00:21:52,530
Here, I'm now going to show
you, for Michael Chang,

464
00:21:52,530 --> 00:21:55,610
for each document, all
the locations and times.

465
00:21:55,610 --> 00:22:01,260
So here's, basically,
Africa and time.

466
00:22:01,260 --> 00:22:02,410
And let's see here.

467
00:22:02,410 --> 00:22:04,580
Maybe we can make
this a little smaller.

468
00:22:04,580 --> 00:22:09,940
That would probably help
a bit, a little smaller.

469
00:22:09,940 --> 00:22:10,700
There we go.

470
00:22:10,700 --> 00:22:11,910
You can barely see that.

471
00:22:11,910 --> 00:22:16,220
But you see here, this
basically shows all the times,

472
00:22:16,220 --> 00:22:17,670
all the locations.

473
00:22:17,670 --> 00:22:21,970
And this shows you,
essentially, in theory

474
00:22:21,970 --> 00:22:26,604
the true track could be going
through any one of these.

475
00:22:26,604 --> 00:22:28,770
There isn't a single track
really for Michael Chang.

476
00:22:28,770 --> 00:22:30,820
There's multiple
potential tracks.

477
00:22:30,820 --> 00:22:32,700
And then this complex
value I happened

478
00:22:32,700 --> 00:22:35,350
to store here just because
I'm using complex values,

479
00:22:35,350 --> 00:22:40,060
the first one is the
character distance.

480
00:22:40,060 --> 00:22:45,370
The location is--
so Michael Chang

481
00:22:45,370 --> 00:22:49,580
appears in a particular word
position in the document.

482
00:22:49,580 --> 00:22:54,220
And it tells me that Austria
appears 278 characters

483
00:22:54,220 --> 00:23:00,960
before him and that the time
stamp appears 11 characters

484
00:23:00,960 --> 00:23:02,680
before his name.

485
00:23:02,680 --> 00:23:08,570
And so over here, you
see the time stamp

486
00:23:08,570 --> 00:23:13,040
appears-- this is a
different document,

487
00:23:13,040 --> 00:23:16,820
but with the same time--
in different locations.

488
00:23:16,820 --> 00:23:20,130
So you can then use this data
to actually go back and say,

489
00:23:20,130 --> 00:23:23,350
you know what, I only
to pick the words that

490
00:23:23,350 --> 00:23:25,460
are closest to Michael Chang.

491
00:23:25,460 --> 00:23:28,590
And I want that to actually
be the real track for Michael

492
00:23:28,590 --> 00:23:29,090
Chang.

493
00:23:29,090 --> 00:23:32,500
So that just shows the
more complicated things

494
00:23:32,500 --> 00:23:34,070
that we can do with that.

495
00:23:34,070 --> 00:23:39,510
So with that, that leads us
to the end of the examples.

496
00:23:39,510 --> 00:23:42,260
And if there's any
questions, I'll

497
00:23:42,260 --> 00:23:44,780
be happy to take them
now if there's any.

498
00:23:47,464 --> 00:23:48,130
All right, good.

499
00:23:48,130 --> 00:23:50,130
I'm just showing you some
of the kinds of things

500
00:23:50,130 --> 00:23:51,040
that people can do.

501
00:23:51,040 --> 00:23:53,700
And so there we go.