The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Welcome back. We're now going to get into the demo portion of lecture 03. These demos are in the program directory, in the examples. We've now moved on from the intro into the applications, and we're going to do this with entities extracted from a data set. So we go into the entity analysis directory, and I'll start my shell here.

I'll start our MATLAB session here. I always run MATLAB from the shell; if you prefer to run it from the IDE, that's fine. And again, our D4M code runs in a variety of environments, not just MATLAB.

All right, that's started up. Before we do that, let's look at our data set. This is a nice little data set, about 2.4 megabytes of data: some Reuters documents processed through our entity extractor here at Lincoln Laboratory. So let's look at this, say, with Microsoft Excel.

All right, you can see that pretty clearly. We basically just have a row key here to read this in. We have a document column, which is the actual document the entity appears in; we can widen that. We have the entity, which happens to come out alphabetical here. We have the word position in the document, which tells you that this entity appeared in these positions in this document. And then the various types: location, person, organization, those kinds of things. So that's what we extracted from these documents. There are no verbs here. We just tell you that a person, a place, or an organization appeared in this document. It doesn't say anything about what they're doing, or anything like that.
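Schematically, each row of that CSV ties one extracted entity to one document, along these lines. The values below are purely illustrative, not actual rows from the data set:

    1,doc,entity,position,type
    1,doc0042.txt,singapore,101;187;,LOCATION
    2,doc0042.txt,michael chang,12;,PERSON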
But this just shows an example of the kind of data that we have here. So there's that.

All right. The first thing we're going to do, in the first script, entity analysis one, is read the data in.

All right, there we go. This first line just reads the file in. It's a fair amount of data, and we're turning it all into an associative array. We actually have some faster parsers that will read it in, but ReadCSV is kind of the most general purpose one, and it's fairly nice. So it's going to go; it's running through a whole bunch of stuff here, and it tries to pull together its graphics.

All right, so let's take a look at what we did here. We read the data into, essentially, an edge-list associative array. I'll just display the first few rows here so you can see. It's alphabetized: the first column is sorted lexicographically, which is always something to be aware of. That 1 is followed by 10, which is followed by 100, and then 1,000 in this sorting. Here are the documents, here's the entity, there are the positions. So it just looks like the normal table that we read in, but it is an associative array.

Now I want to pull this into our exploded schema, and this table didn't necessarily come to us in the easiest format to do that in, so there are a few tricks we have to do. We're going to go and get each document, basically. We have a row, a column, and the doc, which is the value. Whenever you do a query on an associative array, one of the powerful things about MATLAB is that you can have a variable number of output arguments, and that changes the behavior. If I had assigned this to a single output, I would have just gotten an associative array vector.
But if I give it three output arguments, it will give me the triples that went into constructing that, because sometimes that's what you want. It's also faster: otherwise, every time I take a sub-associative array, I effectively have to pull out the triples and then reconstitute the associative array, and that can take some time. This just saves you a step; it goes directly to the triples, which is nice.

So I basically say: give me the document column, give me the entity column, give me the position column. Now I have all of these, and since I know the table is dense, I know these are all lined up; the first doc string, entity string, position string, and type string all correspond. That's me exploiting things a little bit: I know this is a dense table, and I'm not going to have anything missing there.

Now I'm going to interleave the type and entity strings. I have this function called CatStr. Basically, CatStr takes one string list and another string list and interleaves them with a single-character separator. And now I can construct a new sparse matrix which has the docs for the row key, this new thing called type/entity for the columns, and the position thrown in as the value. And that is the exploded schema we just talked about, done for us in a couple of lines. There are other functions that do that as well.

And since I don't want to repeat this, because it took a little while to read in, I'm just going to save it as a MATLAB binary, which will read in very quickly from now on. Which is nice. So I can save it. There you go.

Just to show you, if I show you the first entry here in the row, this shows you the different types of columns that were in that data set: the row key here, the various columns, and the locations of that word within each document.
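To make those steps concrete, here is a minimal sketch of the read-and-explode sequence, using ReadCSV, CatStr, and Assoc as they're named in the D4M distribution; the file names and the '/' separator are my assumptions:

    % Read the CSV into a dense edge-list associative array (file name assumed).
    E = ReadCSV('Entity.csv');

    % Three output arguments return the triples directly, skipping the
    % sub-associative-array reconstruction step.
    [r, c, doc]      = E(:, 'doc,');
    [r, c, entity]   = E(:, 'entity,');
    [r, c, position] = E(:, 'position,');
    [r, c, type]     = E(:, 'type,');

    % Interleave type and entity with a single-character separator,
    % giving column keys like 'LOCATION/new york,'.
    typeEntity = CatStr(type, '/', entity);

    % Exploded schema: document x type/entity, with position as the value.
    E = Assoc(doc, typeEntity, position);

    % Save as a MATLAB binary so later scripts can reload it quickly.
    save('Entity.mat', 'E');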
If I display it, I see that I ended up creating an associative array, and it consists of almost 10,000 unique documents and 3,657 unique entities in that data set.

Then we can take a look at that: we can do spy of the transpose. If we look at it here and zoom in, you can see different types of structure. Click on one of the entries, and it tells us that the location Singapore appeared in this document in these two word positions.

So that's typically about how many lines it takes to take a basic CSV file and cast it into an associative array, so you can start doing real work. That's typically what we see. If you find it's getting a lot longer than that, that probably means there's a trick you're not quite getting. And there again, we're always happy to help people with this very important first step: taking your data and getting it into this format so that you can do subsequent analysis on it.

All right, so there's that. Let's go on to the next example here. We're going to do some statistics on this.

See, that took a lot less time: loading it in from a binary is much, much faster. Faster, in fact, than if you put it in a database and read it out of the database. It's much faster to read a file, so we always encourage people to do that.

This just displays the size; that's what we had in the original data. And nnz shows the number of entries.

All right. I now want to undo what I did before: convert it back into a dense table, ripping those type/entity columns back apart, because I just want to count how many locations, organizations, and people I have.
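Before looking at how the script does it, note that a quick version of this count can be read straight off the exploded columns, since every column key begins with its type. The exact type spellings here are assumptions:

    % Count entries of each type via prefix queries on the exploded schema.
    nLoc  = nnz(E(:, StartsWith('LOCATION/,')));
    nOrg  = nnz(E(:, StartsWith('ORGANIZATION/,')));
    nPer  = nnz(E(:, StartsWith('PERSON/,')));
    nTime = nnz(E(:, StartsWith('TIME/,')));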
So we have this function here called col2type, and another function, whose precise name I forget, that does the reverse. Basically: give me an associative array, give me a delimiter for the columns, and I'm going to rip the column back apart, stick the entity back into the value position, and essentially make it dense again. So we do that.

We then throw away the actual values. We do this dblLogi, which converts everything to ones, and then we can sum along the rows. So we collapsed along rows, and now we have a count. It tells me that in this data set I have 9,600 locations, and likewise counts for organizations, people, and times. Those are the distinct types there.

All right. Let's see, we can count each entity. We have our thing here; we just do a sum. This is the original exploded schema, and I've basically summed all the rows together, so now we're getting a count for each unique entity. And another way to get the triples is to use the find command. It works the same way it does in MATLAB: it gives you a set of triples. A temp, which we don't really care about because we've collapsed that dimension down to all ones; the actual entity; and the count. And I'm going to create a new associative array out of that, so we now have the counts and the entities.

And I'm going to plot a histogram of that. I'm going to take my count thing and just get the locations out of it. Using that StartsWith command, I say: get me all the locations. Then I say give me the adjacency matrix, which returns a regular sparse matrix, and I can do sum, full, loglog. Now, this is the classic degree distribution of all the locations in this data. And it shows us that a few locations are very common, appearing in most of the documents, while a lot of locations appear in only one document.
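The demo builds a count-by-entity associative array for this; the same power-law check can be sketched with plain MATLAB once we have the counts. The plotting details here are mine, not the demo's exact code:

    % Documents-per-entity counts, restricted to locations.
    cnt = sum(dblLogi(E), 1);                % 1 x Nentity associative array
    loc = cnt(:, StartsWith('LOCATION/,'));  % just the locations
    d   = full(Adj(loc));                    % back to an ordinary vector

    % Degree distribution: how many locations occur in exactly k documents.
    k = 1:max(d);
    n = hist(d, k);
    loglog(k, n, 'o');                       % a power law shows up as a line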
So this is the classic, what we call power-law, degree distribution. Again, it's always a good thing to check. Sometimes you can't do this on the full data set, because it's so large, but even on a sample, you want to know this from the beginning. Am I dealing with something that looks like a power law, like everything else? Or, what do you know, I see a big spike here; what's that about? Or it looks like a bell curve, or something like that.

So this is just really computing means: good. Computing basic histograms: very good. Without this basic information, it's really hard to advance in detection theory. And this is probably where we take a slightly different philosophical approach than a lot of the data mining community, which tends to focus on just finding stuff that's interesting. We tend to focus on understanding the data, modeling the data, coming up with models for what we're trying to find, and then doing that. Rather than just "show me something interesting in the data," or "show me a low-probability covariance," or something like that, which tends to be more the basis of the data mining community. Although, no doubt, I've probably offended them horribly by simplifying it that way, in which case I apologize to people in the [INAUDIBLE] community.

But here we do more traditional detection theory, and the first thing you want to do is get the distribution. Typically, one thing you'll do immediately is put what amounts to a low-pass and a high-pass filter on it. Things that appear only once are so unique that they give us no information, and things that appear everywhere likewise give us no information about the data. So you might have a high-pass and a low-pass filter that just eliminates these high-end and low-end types of things. It's a very standard signal processing technique, and it's just as relevant here as elsewhere.
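As a sketch of such a band-pass filter in this setting, assuming D4M's relational operators keep the underlying values, as I recall from the reference, and with cutoffs that are my assumptions rather than values from the lecture:

    % Band-pass the entities: drop singletons (too unique to correlate) and
    % near-universal entities (no discriminating information).
    cnt  = sum(dblLogi(E), 1);          % documents per entity
    band = (cnt > 1);                   % "high-pass": appears more than once
    band = (band < 0.5 * size(E, 1));   % "low-pass": in under half the docs
    Ef   = E(:, Col(band));             % keep only the surviving entities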
All right. So: statistics, histograms, all good things, and easy to do in D4M on our data. Don't be afraid to take your data, dump it back into regular MATLAB arrays, and just do the regular things you would do with MATLAB matrices. We highly encourage that.

Now we're going to do the facet algorithm that I talked about in the lecture. Let's do that.

So once again, we've loaded our data in very quickly. We've decided to convert directly to numeric, getting rid of all those word positions, because we don't really care about them here.

Now, one thing I like about this data set: for those of us who are a little older, we remember the news of the 1990s. This is all a trip to remember [INAUDIBLE]. For those of you who are a little younger, who were in elementary school then, this will not mean anything to you. But there's a lot of stuff about the Clintons in this data set, so that's always fun.

So the first facet we're going to do is: we want to look at the location New York and the person Michael Chang. Does anybody remember who Michael Chang was? Tennis player. Yes, he was kind of this hip tennis player who used to battle with Andre Agassi. I don't know how long his career really lasted, but this is right at about his peak, I believe.

So now, to show you that the equation I showed you in the slides wasn't a lie: we take our edge, or incidence, matrix, we pick the column New York, and then we pick the column Michael Chang. We need to do this NoCol thing, because we now have one column vector with the column Michael Chang
and one with the column New York, and if we add them together as they are, they have no intersection. So NoCol just pops those two column names away. Now they're effectively both the same column, and we can add them together to find all the documents that contain both New York and Michael Chang.

So we do that, and we've added them together. This is now a new column vector, and we transpose it to make it a row vector. Then, when we matrix multiply it back against the original data set, we've just computed the histogram of the other entities in the documents that contain both Michael Chang and New York. And as we can see, Michael Chang appears three times, which tells us there are three documents, and New York appears three times. But we also see the Czech Republic, and Austria, and Spain, and the United States.

Do you want to guess why the Czech Republic? Was Ivan Lendl still playing then? I'm just wondering if he had a lot of matches against him, or would he already be into [INAUDIBLE] or something like that? I don't know. He probably had some Czech arch-rival or something like that. Or maybe there was just a tournament, and these are that type of thing. With the Reuters data sets, you do have the problem that the person who types the article always puts in the location where they filed it. So New York could be in these just because the reporter happens to be based in New York and typed the article from there. So that's always fun.

We can also normalize this. What I'm going to do is take the facets that we computed here and normalize them by their sums. The facet shows how many times something showed up, but it doesn't tell me whether that something is really common or really rare.
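Here is a compact sketch of the whole facet computation just described, with the normalization at the end. NoCol, dblLogi, Adj, and displayFull are named as in the D4M distribution; the == 2 filter and putAdj are my reading of the demo, not confirmed code:

    % Facet: histogram of entities co-occurring with both query entities.
    A = dblLogi(E);                        % values -> 1
    x = 'LOCATION/new york,';
    y = 'PERSON/michael chang,';
    F = NoCol(A(:, x)) + NoCol(A(:, y));   % aligned indicator sum per document
    F = dblLogi(F == 2);                   % keep only docs containing *both*
    Fac = transpose(F) * A;                % row vector of co-occurrence counts
    displayFull(Fac);

    % Normalize by each entity's total count: values near 1 mean the entity
    % appears almost exclusively alongside the query pair.
    tot  = sum(A, 1);
    FacN = putAdj(Fac, Adj(Fac) ./ Adj(tot(:, Col(Fac))));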
So now, of the other entities that appear in these documents, we want to see: how many of them are really, really popular, and how many of them really only appear in these documents with Michael Chang and New York? As you can see here, we have Belarus, the Czech Republic, Virginia for some reason. And this shows you that these have a very low value: they're being divided by a fairly large number, which means they appear in a lot of places. They happen to appear in this same set of documents, but they also happen to appear in a lot of other documents. As opposed to Michael Joyce, or Virginia Wade; that's a tennis tournament, right, the Virginia Wade tennis tournament or something like that? These are more likely to be a real relation that exists between these entities.

So that just shows you how you do that. Again, a very powerful technique, a basic covariance type of math. Facet search is very powerful.

All right, that was three, so we're moving on to four. Now we're going to do some stuff with graphs. All kinds of graphs here, all kinds of graphs.

So once again, we've loaded our data in very quickly, and we're just going to make a copy of it. Then I'm going to convert the original to numbers, getting rid of those positions. Before, we did essentially the facets, the correlation between two things. But the real power of D4M is that, for about the same amount of work, we can correlate everything simultaneously. So we're going to do sqIn, which is basically the same as E transpose times E. It's a little bit faster when we're doing self-correlations, but not by much; you could have typed E transpose times E. And then we're going to show the structure of that. Remember we had 3,657 unique entities, so we've now constructed a square, 3,657 by 3,657, entity-by-entity matrix.
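In code, that simultaneous correlation is a one-liner; sqIn is the D4M name used in the lecture, and dblLogi is my assumption for the numeric conversion mentioned:

    A = dblLogi(E);   % drop the word positions, keep the 0/1 structure
    G = sqIn(A);      % same result as transpose(A) * A, slightly faster
    spy(G);           % view the 3,657 x 3,657 entity-entity structure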
And we're going to plot that. So let's see here; hopefully this won't crash our computer. So this is the full entity-entity correlation matrix, and you can see all the structure. These are the locations, so this block here is the location-by-location correlation. This block here is the person-person correlation; it obviously has a dense diagonal, and it's symmetric. And then down here are the time-time correlations.

The first thing you'll probably notice is these dense blocks of time, which have to do with the fact that this is a finite set of Reuters documents. There are only 35 unique days, or times, associated with the documents themselves. So this type of structure shows that if I want just the times associated with the reporting, I can get that, or the references to some date. You can see there are times in the past, and in the future, and various types of things like that. And that shows you the structure; it's a very nice structure.

We can zoom in here, and you see more sub-structures emerge. What's that one? Let me take a look. Here: United States, very popular, dense. You see that very dense band for the United States here. Who's this? This is probably America or something like that. United Nations Security Council, very popular. Ah, these are the organizations here; we don't have so many of them, and that's the most popular.

So you can just get this feel for the data, which is very powerful. Because often, when you work with your customer and you do this, you can be just about 100% sure you're the first person to actually get a view,
a high-level view, of the overall structure of the data from a very powerful perspective. Because they simply don't have a way to do this, and using the sparse matrix as a way to view the data is very, very powerful.

All right, let's get rid of that. So we did the entity-entity graph. Again, whenever we take an incidence matrix and square it, or correlate it, the result is an adjacency matrix, a graph between those two things. These adjacency matrices are a projection from the incidence matrix. We've lost some information; we've always thrown away some information in doing that. But often that's exactly what we want to do: project it down into a subspace that we find more interesting.

Something else we can do: let's just look at the people beginning with J. I'm going to create a little starts-with entity range here with our StartsWith command. I can pass that in here to grab just the records whose entity begins with J, and I'm going to do the correlation of them using this pedigree-preserving matrix multiply that we have, this very special one. We'll take a look at that, and that will explain what I mean by pedigree preserving.

So we go here. Again, this shows you all the people that appear together whose names begin with J. And it's, of course, a symmetric matrix with a dense diagonal. You click on an entry, and it tells you: Jennifer Hurley and James Harvey appeared in this document together. When we do that special matrix multiply, it doesn't preserve the values in the normal matrix multiply sense. Instead, it preserves the labels of the common dimension that's eliminated. When we do a matrix multiply, we're eliminating a dimension; here we take that common intersection and throw it into the value. So you can now say: Jennifer Hurley and James Harvey were connected by this document.
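A sketch of that pedigree-preserving step; CatKeyMul is the D4M name for this multiply, and the key spellings are my assumptions:

    % Restrict to people whose names begin with j, then correlate, keeping
    % the linking document keys as the values instead of numeric counts.
    p  = StartsWith('PERSON/j,');
    Ej = dblLogi(E(:, p));
    Gj = CatKeyMul(transpose(Ej), Ej);
    spy(Gj);

    % The value tells you *which* documents connect a pair, e.g.:
    % Gj('PERSON/jennifer hurley,', 'PERSON/james harvey,')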
And so it's a very powerful tool for doing that.

You'll notice I restricted it to J. Why didn't I just do the whole thing? Well, you see this dense diagonal here. If you had a very popular person, look at all the hits we're going to get when we correlate it with itself. This becomes a performance bottleneck when we start creating enormously long values. So we always have to be careful when we do these pedigree-preserving correlations. If we're correlating one set with a different set, we won't have a dense diagonal, and it will be fine. But if we're correlating something with itself, we always have to be on the lookout for these dense diagonals, in which case we could have an issue where we create a very, very large value.

And the fact of the matter is, the way we sort the strings is that we convert the list to a dense char matrix and then call the MATLAB sort routine; it's the only way we can sort it. So the width of that matrix is the length of the longest string in the list. If they're all about the same length, that's fine. But if you have mostly things that are 20 characters, and then you have something that's 2,000 characters long, and you create a 100,000 by 2,000 matrix, you've just consumed an awful lot of memory, and then you're sorting all of that. So whenever people say it's slow, or they run out of memory, it's almost always because they have a string list that we're trying to manipulate by sorting, and one of the values in that list is just extremely long. So that's just a little caveat. It's a very powerful technique, but you have to watch out for this diagonal, and there are ways to work around it that are usually data specific. But again, this pedigree-preserving correlation is very powerful, and something that people usually want.
Especially since we almost always store just a one in the value, we're not really throwing away any information. Sometimes we do want the counts, because we want to know how many times those co-occurrences happened. If we hadn't done the pedigree-preserving CatKeyMul, then we would have gotten a value of three here, and so we could have counted; it would have told us. Sometimes we do both: one multiply that gives us the count, and one that gives us the pedigree, and we just store them both. There are different things we want to do with one versus the other. So again, we're having fun with the different types of matrix multiplies that we can do in the space of associative arrays.

All right, we're running out of time here, but I think we're almost done. Let me just see here. Then we did the document-document correlation. This is sqOut, which is E times E transpose. We'll just take a quick look at that.

So here we are. This just shows us which documents are closely related to each other, which documents have a lot of common entities. It says here that these two documents shared one common entity. Let's try other ones. Just clicking around here, I'm not having a lot of luck finding one that's got a lot in it. So maybe around here? Nope. No. Well, certainly if I click on the diagonal... nope. Wow. Basically pretty sparse, sparse correlations between these documents.
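The document-document direction is the mirror image; sqOut is the D4M name used in the lecture:

    A  = dblLogi(E);
    GD = sqOut(A);    % same result as A * transpose(A)
    spy(GD);          % ~10,000 x ~10,000; values = number of shared entities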
So all right. I think we'll now go to the last example. Is there a last example? Yes, one last one.

The stuff I've shown you so far is pretty easy stuff, though pretty sophisticated by a lot of the standards out there, in terms of being able to do sums, and histograms, and basic correlations. That's often significantly beyond the run-of-the-mill analysis you'll run into if you just go into a place doing basic analysis.

So again, we load the data, we convert everything to numeric, and we've now squared it, so we've created the graph. Now I want to get rid of that annoying diagonal, so I take the diagonal out. Just so you know, all I did is this: I know the matrix is square, so I can do certain things without worrying. I can take the adjacency matrix, which is just going to be a regular MATLAB sparse matrix; I can take its diagonal; and since it's square, I can subtract the one from the other and just insert the result back in without any harm done. I took a square adjacency matrix out, did some manipulation, and stuffed it back in; the dimensions are the same, so it should be OK. I might have an issue in that I may have introduced an empty row or an empty column, which could cause me a few problems.

I'm now going to get the triples of that, because I want to normalize the values. The diagonal showed me how many counts go with each, and I want to normalize the entities in a document. Because sometimes you'll get a document that's like, "here's a list of every single country that participated in this UN meeting," and it'll have a hundred and something countries. A country appearing in that document is not such a big deal, because all the countries appear in that document. So this allows us to account for that: if we want to do a correlation, we take the actual value and divide it by the minimum, since the smaller of the two counts is the maximum number of hits you could possibly get in the correlation.
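A sketch of that diagonal surgery and the min-count normalization, round-tripping through the plain sparse matrix as described; putAdj is my assumption for stuffing the result back into the associative array:

    G  = sqIn(dblLogi(E));   % co-occurrence graph; diagonal = total counts
    Ad = Adj(G);             % regular MATLAB sparse matrix
    d  = diag(Ad);           % keep the counts before removing them
    Ad = Ad - diag(d);       % knock out the self-correlation diagonal

    % Normalize each edge by the smaller of the two endpoint counts: the
    % less-common endpoint bounds how many co-occurrences are even possible.
    [i, j, v] = find(Ad);
    vn = v ./ min(d(i), d(j));
    Gn = putAdj(G, sparse(i, j, vn, size(Ad, 1), size(Ad, 2)));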
You can also do multi-facet queries. Say we want location against all people. So here's a more complicated query, which says: find me all the people that appeared more than four times, and with a probability greater than 0.3, with respect to New York. So here's New York, and all these different people who appeared with New York, and almost exclusively appeared with New York.

So, John Kennedy: this is his son, who was still alive at that time and very popular in the news. Let's focus on him. Let's get John Kennedy, and let's get his neighborhood. We can get all the other people who appeared in documents with John Kennedy, and we can plot that neighborhood. You go here: this is his neighborhood. There's John Kennedy, his row and column, and these are all the other people that appeared with him; you see his dense row and column. And then these are what are called the triangles: other people who appeared in documents together, but not necessarily with John Kennedy. So George Bush and Jim Wright appeared in a document together, though not necessarily with John Kennedy. We can actually find those people by just doing this basic arithmetic here, and this shows us all the triangles: all the other people who also appeared in documents together, who appeared with John Kennedy.

So with that, we're running a little bit late here, so we will wrap it up. And again, I encourage you to go into the code, run these examples, and try them out. This should give you your first sense of really working with some real data and playing with stuff. So, thank you very much.