The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: Thank you all for coming. I'm glad to see that our attendance after the first lecture has not dropped dramatically--a little bit. That's to be expected. But so far I haven't managed to scare you all off. So this is the second lecture of the Signal Processing on Databases course.

The format of the lecture will be the same as the last one. We'll have some slide material for the first half, then we'll take a short break, and then there'll be some example demo programs. And so with that, we'll get into it. All of this material is available in your LLGrid accounts. In the Tools directory you get a directory like this, and the documents and the slides are right here. So we'll go to lecture 01.

For those of you who weren't there, or to recap for those who will be viewing this on the web later, the title of the course is Signal Processing on Databases--two terms that we often don't see together. The signal processing element is really alluding to detection theory, and to the deeper linear algebra basis of detection theory. The databases element is really alluding to unstructured data--data that we often represent with strings and words and sort of sparse relationships. And so this course is really bringing together these two normally quite separate ideas, because there's a lot of need to do so.

So again, this is the first course ever taught on this topic. There really isn't a prior model for how to teach it. So the approach that we're taking is to have a lot of examples that show how people use this technology.
And we'll also have some theory to go along with those examples. So hopefully, through some combination of those different types of teaching vehicles, each of you will be able to find something that is helpful.

The outline of today's lecture is that I'm going to go through an example where we analyzed some citation data. Citation data is one of the most common examples used in the field because it's so readily available and so easy to get, and there are no issues associated with it. So you have lots of academics analyzing themselves, which is, of course, very symmetric, right? And so there's a lot of work on citation data. We are going to be looking at some citation data here, but doing it using the D4M technology that you all have access to, in a way that we think is more efficient, both in terms of the performance you'll get on your analytics and in terms of the total time it will take you. Of course, it does require a certain level of comfort with linear algebra. That's sort of the mathematical foundation of the class.

So we're going to get right into it. I talked in the previous lecture about these special schemas that we set up for processing this kind of data, which allow us to take essentially any kind of data and pull it into one basic architecture that is really a good starting point for analysis.

So here's a very classic type of data that you might get in citation data. In this particular data set, we have a universal identifier for each citation--every single citation is just given some index in the data set. And we'll have an author, a doc ID, and then a reference doc ID. The reference doc ID is basically the document ID of the document that's being referenced, and the doc ID is the one that the reference is coming from. That may seem a little opposite, but that's sort of what it means. And you can have absences of both.
You can have references that refer to another document for which no document ID has been constructed. You can also have references within a document that, for some reason or other, doesn't have a document ID itself. I think the important thing to understand here is that this kind of incompleteness--in what might be as clean a data set as you would expect, the scientific citation data set--is very normal.

You never get data sets that are even remotely perfect. If you get data sets that feel even about 80% solid, you're doing very well. It's very, very common to have data sets that are incomplete, that have mistakes, null values, mislabeled fields, et cetera. It is just a natural part of the process. Even though more and more of these sets are created by machines, and you would hope that would eliminate some of those issues, well, the machines have the same problems. And they can make mistakes on a much larger scale, more quickly, than humans can. So incompleteness definitely exists.

The data arrives in sort of a standard SQL or tabular format, in which you have records, columns, and then values. What we generally tend to do is pull it into what we call the exploded schema, where we still use the records--those are the row keys--and we create column keys, which are essentially the column label and the column value appended together. So every single unique value will have its own column, and we build it out that way. This creates a very, very large, sparse structure. Once we store the transpose of that in our database as well, we get essentially constant-time access to any string in the database. So we can now look up any author very quickly, any document very quickly, any record very quickly. And so it's a very natural schema for handling this kind of data.
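To make the exploded schema concrete, here is a minimal plain-MATLAB sketch--not the actual D4M Assoc class, and the record IDs, field names, and values are made up for illustration--showing how a few tabular entries become one big sparse matrix with one column per unique label/value pair, plus its stored transpose.

    % Minimal sketch of the exploded schema, with made-up records and fields.
    recs = {'rec1','rec1','rec2','rec2'};            % row keys: one per record
    cols = {'author|j.smith','docid|doc7', ...       % column keys: label|value
            'author|a.jones','refid|doc7'};
    rowU = unique(recs);  colU = unique(cols);       % dictionaries of unique keys
    [~,i] = ismember(recs,rowU);
    [~,j] = ismember(cols,colU);
    E  = sparse(i,j,1,numel(rowU),numel(colU));      % large, sparse incidence matrix
    Et = E.';                                        % stored transpose: column keys can
                                                     % be looked up as fast as row keys

In real D4M usage the row and column keys stay as strings inside an associative array, and the table pair lives in the database; the sketch only shows the shape of the thing.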
Typically, we just hold the value 1 as the actual value in the database. We can put other numbers in there, but it generally should be a number that you would never want to search on. In many of the databases we're working with, you can look up a row key--or, with this schema, a column key--very quickly, but looking up a value requires a complete traversal of the table, which is, by definition, very expensive. So there's definitely information you can put in there, but it should be information you would never really want to look up.

In addition, this particular data set has some fields that are quite long. We might have the fully formatted reference, which in a journal article can be quite long--if it's got the full name and the journal volume and maybe all the authors, that can be quite long, and it's very unstructured. We might have the title. We might have the abstract. These are longer blobs of data from which the other tables have essentially already extracted all the key information. But often we still want access to this information as a complete, uninterrupted piece of data. If you actually find a record, you might say, well, I'd like to see the abstract formatted exactly as it was in the journal. A lot of times when you pull this stuff out and do keywords, you lose punctuation, capitalization, and other things that make it easier to read but don't really help you with indexing. So we actually recommend that that kind of information be stored in a completely separate table--a traditional table, which would just be some kind of row key and then the actual reference as the value. You're never going to search the contents of this table, but it allows you to at least have access to the data.
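As a rough sketch of that separate raw-text table--using a plain MATLAB containers.Map as a stand-in for a real database table, with invented keys and text--the long fields are stored whole, keyed by record ID, and never mixed into the searchable exploded table.

    % Hypothetical raw-text side table: long, unsearchable blobs keyed by record ID.
    rawText = containers.Map();
    rawText('rec1') = 'J. Smith, "A Hypothetical Paper," Journal of Examples, vol. 1, 2001.';
    rawText('rec2') = 'Abstract: this made-up abstract is kept exactly as formatted.';

    disp(rawText('rec1'))   % fast lookup by row key; the contents are never searched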
So this is a small addition to our standard schema: we have the exploded schema, with all the information we would ever want to search on, plus its transpose, which gives us very fast search. And you may have an additional table which just stores the data in a raw format separately. You can look it up quickly, but it won't get in the way of any of your searches. The problem is that if you're doing searches and you have very large fields mixed in with very small fields, it can become a performance bottleneck to handle them at the exact same time.

Another table that we used in this data set essentially indexes all the words in the title and the abstract, and their N-grams. The 1-gram would just be the individual words in a title: if a word appeared in a title, you would create a new column for that word with respect to that reference. A 2-gram would be word pairs, a trigram would be all words that occur together in threes, and on and on. Often for text analytics you want to store all of these. So you might have an input data set with various identifiers--a title that consists of a set of words, an abstract that consists of a set of words. And then typically, for the 1-gram of the title word a, we might format it like this, and hold as the value where that word appeared in the document. This allows you to come back and reconstruct the original documents.

That's a perfectly good example of something that we don't tend to need to search on--the exact position of a word in a document. We don't really care that the word was in the eighth position or the 10th position or whatever. So we can just create a list of all of these position values. That way, if you later want to reconstruct the original text, you can do that. Between a column and its positions, you have enough information to do that.
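Here is a rough sketch of that idea in plain MATLAB; the 'title1|' and 'title2|' column-naming convention is an assumption made up for illustration, not necessarily the exact D4M convention. Each 1-gram and 2-gram of a toy title becomes a column key, and the stored value is the word position, so the original order can be rebuilt.

    ttl  = {'sparse','matrix','methods'};     % toy title, already split into words
    cols = {};  vals = [];
    for k = 1:numel(ttl)                      % 1-grams: one column per word
        cols{end+1} = ['title1|' ttl{k}];
        vals(end+1) = k;                      % value = position of the word
    end
    for k = 1:numel(ttl)-1                    % 2-grams: adjacent word pairs
        cols{end+1} = ['title2|' ttl{k} ',' ttl{k+1}];
        vals(end+1) = k;
    end
    % cols: {'title1|sparse','title1|matrix','title1|methods',
    %        'title2|sparse,matrix','title2|matrix,methods'}
    % vals: [1 2 3 1 2]  -> with the columns, enough to reconstruct the title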
So that's an example of another type of schema that we use and how we handle this type of data for doing these types of searches. And now it makes certain lookups very easy. If I wanted to say, please give me all documents that had this word d in the abstract, I would simply go over here, look up the row key, get these two columns, and then I could take those two column labels and look them up here, and I would get all the N-grams. Or I could take those row keys and look them up in any of the previous tables and get that information. So that's kind of what you can do there. I'm going over this quickly--the examples go through it in very specific detail--but this is just showing you how we tend to think about these data sets in a way that's very flexible, and showing you some of the nuances. It's not just one table; it ends up being two or three tables. But still, that really covers what you're looking for.

Again, when you're processing data like this, you're always thinking about a pipeline. And D4M, the technology we talk about here, is only one piece of that pipeline. We're not saying that the technologies we're talking about cover every single step of the pipeline. In this particular case, which I think was a 150-gigabyte data set, step one was getting a hard disk drive sent to us and copying it. It came as a giant zipped XML file, so obviously we had to uncompress it. And then from there we had to write a parser of the XML that spat it out into just a series of triples, which could then be inserted into our database the way we want. Now, D4M is very good for developing how you're going to want to parse out those triples initially. But if you're going to do high-volume parsing of data, I would really recommend using a low-level language like C.
Generally, C is an outstanding language for doing this very repetitive, monotonous parsing, it has excellent support for things like XML, and it will run about as fast as you can possibly go. Yes, you end up writing more code to do that, but it tends to be kind of a do-once type of thing--you actually do it more than once, usually, to get it right. And it's nice that our parser, which we wrote in C++, could take these several hundred gigabytes of XML data and convert them into triples in under an hour. In other environments--if you try to do this in, say, Python, or even in D4M, at high volume--it's going to take a lot longer. So it's the right tool for the job. It will take you more lines of code, but we have found that it's usually a worthwhile investment, because the parsing can take a long time.

Once we have the triples, typically what we then do is read them into D4M and construct associative arrays out of them. We talked a little bit about associative arrays last time, and I'll get to them again, but these are essentially a MATLAB data structure. And then we generally save them to files.

We always recommend that, right before you do an insert into a database, you write the data out in some sort of efficient file format. Even when you write high-performance parsers--and especially when you don't--the parsing can be a significant activity. If you write the data out right before you insert it, then if you ever need to go back and re-insert the data, it's now a very, very fast thing to do. Databases, especially during development, can get corrupted; you try to recover them, but they don't recover.
Just knowing that you have the parsed files lying around someplace, and that you can process them whenever you want, is a very useful technique. It's usually not such a large expense from a data perspective, and it lets you essentially re-insert the data at the full rate the database can handle, as opposed to redoing your parsing, which can sometimes take 10 times as long. We've definitely seen many folks doing full text parsing in Java or Python that is often 10 times slower than the actual insert into the database. So it's nice to be able to checkpoint that data. And then from there we read the files in and insert them into our database table pairs--the keys and their transpose, the N-grams and their transpose--and then the actual raw text itself.

I should also say, keeping these associative arrays around as files is very useful in its own right. The database, as we talked about before, is very good if you have a large volume of data and you want to look up a small piece of it quickly. If you want to re-traverse the entire data set--do some kind of analysis on the entire data set--that's going to take about the same time as it took to insert the data. You're going to have to pull all the data out if you want to analyze all of it; there's no magic there. So the database is very useful for looking up small pieces, but if you want to traverse the whole thing, having those files around is a very convenient thing. If you say, I'm going to traverse all the data, you might as well just begin with the files and traverse them. And that's very easy: the file system will read it in much faster than the database--the bandwidth of file systems is much higher than that of databases--and it's also very easy to handle in a parallel way. You just have different processes reading in different files. So if you're going to be doing something that traverses an entire set of data, we really recommend that you do that from files.
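A minimal sketch of that checkpoint-before-insert habit, assuming hypothetical file and variable names: save the parsed triples to a .mat file, and insert (or traverse) from the file, so a corrupted database or a full re-scan never forces you to re-parse.

    % Parsed triples from the (hypothetical) parser stage.
    r = {'rec1';'rec2'};  c = {'author|j.smith';'author|a.jones'};  v = [1;1];

    save('triples_part001.mat','r','c','v');   % checkpoint right before insert

    S = load('triples_part001.mat');           % later: reload at file-system speed
    % ... insert S.r, S.c, S.v into the database tables, or scan them directly
    % for whole-data-set analytics without touching the database at all.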
And in fact, we even see folks say, well, I might have a query that requires 3% or 4% of the data. Then we would often recommend, if you know you're going to be working on that same data set over and over again, querying it out of the database, saving it to a set of files, and then just working with those files over and over again. Now you never have to deal with the latencies or other issues associated with the database--you've completely isolated yourself from that. So databases are great; you need to use them for the right thing. But the file system is also great. And often, in the kind of algorithm development work that we do, testing and developing on the files is better than using the database as your fine-grained access mechanism.

So this particular data set had 42 million records. Just to show you how long things took using this pipeline: we uncompressed the XML file in one hour--that's just running gzip. Our parser, which converted the XML structure into triples and was written in C#, took two hours. That's very, very nice; as we debugged the parser, we could rewrite it and not have to worry about, oh my goodness, this is going to take days and days. Constructing the D4M associative arrays from the triples took about a day--that just shows you the scale there. And then we inserted the triples into Accumulo. This was a single-node database--not a powerful computer, a dual-core, circa-2006 server--but we sustained an insert rate of about 10,000 to 100,000 entries per second, which is extremely good for this particular database, which was Accumulo. Inserting the keys took about two days, inserting the text took about one day, and inserting the N-grams took about 10 days for this particular data set. We ended up not using the N-grams very much in this data set--mostly the keys.
And so there you go--that gives you an idea. If the database itself is running in parallel, you can increase those insert rates significantly, but this is the basic single-node insert performance that we tend to see. Yes, question over here.

AUDIENCE: Silly question. Is Accumulo a type of database?

JEREMY KEPNER: Ah, yes. I talked about that a little bit in the first class. Accumulo is a type of database called a triple store, or sometimes called a NoSQL database. It's a new class of database based on the architecture published in the Google Bigtable paper, which is about five or six years old. There are a number of databases that have been built with this technology. It sits on top of a file system infrastructure called Hadoop, and it's a very, very high-performance database for what it does. That's the database that we use here. And it's open source, part of the Apache project; you can download it. At the end of the class, we'll actually be doing examples working with databases. We are developing an infrastructure, as part of LLGrid, to host these databases so that you don't really have to mess around with the details of them. Our hope is to have that infrastructure ready by the last two classes so that you can actually try it out.

So the next thing I want to do, now that we've built these databases, is show you how we construct graphs from this exploded schema. Formally, this exploded schema is what you would call an incidence matrix. In graph theory, graphs generally can be represented as two different types of matrices. One is an adjacency matrix, which is a matrix where each row and column is a vertex, and if there is an edge between two vertices, there's a value associated with that--generally very sparse in the types of data sets that we work with. That is great for covering certain types of graphs.
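As a tiny illustration--the graph here is arbitrary, just to show the representation--a directed graph with numbered vertices becomes a sparse matrix where A(i,j) is nonzero when there is an edge from vertex i to vertex j.

    src = [1 1 2 3 4];                 % edge sources  (made-up 4-vertex graph)
    dst = [2 3 4 4 1];                 % edge destinations
    A   = sparse(src,dst,1,4,4);       % adjacency matrix: one nonzero per edge
    full(A)                            % tiny here, but very sparse in real data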
The adjacency matrix handles directed graphs very well. It does not handle graphs with multiple edges very well, and it does not handle graphs with hyper-edges--edges that connect multiple vertices at the same time. And the data that we have here is very much in that latter category. A citation, essentially, is a hyper-edge. Basically, one of these records connects a document with sets of people, with sets of titles, with sets of all different types of things. So it's a very, very rich edge connecting a lot of vertices, and a traditional adjacency matrix doesn't really capture it. That's why we store the data as an incidence matrix, which has essentially one row for each edge, and the columns are vertices. Then you basically have information that shows which vertices are connected by that edge. So that's how we store the data in a lossless fashion, and we have a mechanism for storing it and indexing it fairly efficiently.

That said, the types of analyses we want to do tend to very quickly get us to adjacency matrices. For those of you who are familiar with radar--and this is probably the only place where I could say that--I almost think of one as the voltage domain and the other as the power domain. You do lose information when you go to the power domain, but often, to make progress on your analytics, you have to do that. And so the adjacency matrix is a formal projection of the incidence matrix: essentially, by multiplying the incidence matrix by itself, or squaring it, you can get the adjacency matrix.

So let's look at that. Here's an adjacency matrix of this data that shows the source document and the cited document. And this just shows what we see in this data. The corner up here--if you see my mouse going up here, scream, so we'll stay in the safe area. This is the source document, and this is the cited document. This is a significant subset of this data set.
I think it's only 1,000 records--it's sort of a random sample. If I showed you the real, full data set, obviously it would just be solid blue. The source document ID is decreasing--basically, newer documents are at the bottom, older documents at the top. I assume everyone knows why we have the sharp-edged boundary here.

AUDIENCE: Can you cite future documents?

JEREMY KEPNER: You can't cite future documents, right? This is the Einstein event cone of this data set. You can't cite documents into the future, so that's why we have that boundary there. And you can almost see it here--and I don't think it's just an optical illusion--you see how it's denser here and sparser here, which just shows that as documents get published over time, they get cited less and less going into the future. I think one statistic is that if you eliminated self-citations, most published journal articles are never cited. So thank God for self-citation.

This shows the degree distribution of this data. We'll get into this in much greater detail later. But on this small subset of data, basically, for each document we count how many times it was cited--so these are documents that are cited more--and then we count how many documents have that number of citations. 10 to the 0 is basically documents that are cited once, and this shows the number of documents in this set--20,000, 30,000, 40,000--that have one citation, going on down to the one that is cited the most. This is what is called a power law distribution. Basically, it just says most things are not cited very often and some things are cited a lot. You see those fall on an approximately linear negative slope in a log-log plot. This is something that we see all the time in what we call sub-sampled data, where essentially the space of citations has not been fully sampled.
If people stopped publishing new documents and all citations were only of older things, then over time you would probably see this begin to take on a bit more of a bell-shaped curve, OK? But as long as you're in an expanding world, you will get this type of power law. A great many of the data sets that we tend to work with fall into this category. And in fact, if you work with data that is a byproduct of artificial, human-induced phenomena, this is what you should expect. If you don't see it, then you kind of want to question what's going on there--there's usually something going on there.

So those are some overall statistics. The adjacency matrix and the degree distribution are two things that we often look at. I'm going to move things forward here and talk about the different types of adjacency matrices we might construct from the incidence matrix. In general, the adjacency matrix is often denoted by the letter A. For incidence matrices, you would think we might use the letter I, but I has obviously been taken in matrix theory, so we tend to use the letter E, for edge matrix.

So I use E as the associative array representation of my edge matrix here. If I want to find the author-author correlation matrix, I can basically just do E-starts-with-author transpose times E-starts-with-author. The starts-with-author part is the part of the incidence matrix that has just the authors; I now have an associative array of just the authors, and then I square it with itself. You get, obviously, a square symmetric matrix that is dense along the diagonal, because in every single document an author appears with himself or herself. And then this just shows you which authors appear with which other authors. If two authors appear on the same article together, they get a dot. This is a very classic type of graph, the co-author graph. It's well studied and has a variety of phenomena associated with it.
And here, it's constructed very simply from this matrix-matrix multiply. I'm going to call this the inner square--we actually have a shortcut for this in D4M called [INAUDIBLE]. So [INAUDIBLE] means the matrix transposed times itself, which sort of has an inner product feel to it. No one has their name associated with this product; it seems like someone missed an opportunity there. All of the special matrices are named after Germans, so some German missed an opportunity back in the day. Maybe we can call it the inner Kepner product. That would sufficiently obscure it, like the Hadamard product, which I always have to look up.

And here is the outer product. This shows you the distribution of documents that share common authors. So it's the same idea. What I'm trying to show you is that these adjacency matrices are formally projections of the incidence matrix onto a subspace. And whether you do the inner squaring or the outer squaring product, both have valuable information. One of them shows us which authors have published together; the other one shows us which documents have common authors. Both are very legitimate things to analyze. In each case, we've lost some information by constructing it, but probably out of necessity--we have to do this to make progress on the analysis that we're doing.

Continuing, we can look at the institutions. Here's the inner squaring product on institutions, which shows you which institutions are on the same paper. And likewise, this shows us which documents share the same institution. Same thing with keywords: which pairs of keywords occur in the same document, which documents share the same keywords, et cetera. Typically, when we do this matrix multiply, if we have the value be a 1, then the value of the result will be the count.
So this would show you how many keywords are shared by a pair of documents, or, in the previous one, how many times a pair of keywords appear together in the same document. Again, valuable information for constructing analytics. And so these inner and outer Kepner products, or whatever you want to call them, are very useful in that regard.

So now we're going to take a little bit of a turn here. This is really going to lead up to the examples. We're going to come back to a point I made before and revisit it, which is the issues associated with adjacency matrices, hypergraphs, multi-edge graphs, and those types of things. I think I've already talked about this, but just to remind people: here's a directed graph, and we've numbered all the vertices. This is the adjacency matrix of that graph--basically, if two vertices have an edge, they get a dot. The fundamental operation of graph theory is what we call breadth-first search: you start at a vertex and then go to its neighbors, following the edges. The fundamental operation of linear algebra is vector-matrix multiply: if I construct a vector with a value in a particular vertex and I do the matrix multiply, I find the edges and the result is the neighbors. So you have this formal duality between the fundamental operation of graph theory, which is breadth-first search, and the fundamental operation of linear algebra. This is philosophically how we will think about that, and I think we've already shown a little bit of it when we talked about these projections from incidence matrices to adjacency matrices.

And again, to get into that a little further and bring the point home, the traditional graph theory that we are taught in graph theory courses tends to focus on what we call undirected, unweighted graphs. These are graphs with vertices where you just say whether there's a connection between vertices. You may not have any information about whether there are multiple connections between vertices, and you certainly can't do hyper-edges.
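Here is a small plain-MATLAB sketch of the duality just described, using the same kind of arbitrary toy graph as above (this is just the underlying idea, not the course demo code): one step of breadth-first search is a vector-matrix multiply against the adjacency matrix.

    src = [1 1 2 3 4];  dst = [2 3 4 4 1];
    A  = sparse(src,dst,1,4,4);     % directed adjacency matrix of a toy graph

    v0 = sparse(1,1,1,1,4);         % start vector: a single 1 at vertex 1
    v1 = v0*A;                      % one BFS step: nonzeros at vertices 2 and 3
    v2 = v1*A;                      % next step: reaches vertex 4 (value 2 = two paths)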
And undirected, unweighted graphs give you a very black and white picture of the world, which is not really what the world looks like. In fact, here's a painting, courtesy of a friend of mine who is an artist who does this type of painting--I tend to like it. And this is the same painting I showed before, but as it really looks. What you see is that this is really more representative of the true world. Just by drawing some straight lines with colors on a canvas, we have already gone well beyond what we could easily represent with undirected, unweighted graphs.

First of all, our edges in this case have color. We have five distinct colors, which is information that would not be captured in the traditional unweighted, undirected graph representation. We have 20 distinct vertices--I've labeled every intersection or ending of a line as a vertex. So we have five colors and 20 distinct vertices. We have 12 things that we really should consider to be multi-edges: essentially multiple edges that connect the same vertices, and I've labeled them by their color here. We have 19 hyper-edges: edges that fundamentally connect more than two vertices. If we look at this example here, P3 connects this vertex, this vertex, this vertex, and this vertex with one edge. We could, of course, topologically say, well, this is the same as just having an edge from here to here, or here to here to here. But that's not what's really going on--we are throwing away information if we decompose that. Finally, we often have a concept of edge ordering. From the layering of the lines, you can infer that certain edges were drawn before others, and ordering is also very important. Does anyone want to guess what the first color of this painting was? When the artist started, they painted one color first. Anyone want to guess what that color was?
AUDIENCE: Red.

JEREMY KEPNER: What?

AUDIENCE: Brown?

JEREMY KEPNER: It was an orange--it was orange. And I would never have guessed that except by talking to the artist, which I found interesting. When she was telling me how she did it, it was so counterintuitive to the perception of the work, which I think is very, very interesting.

So if you were to represent this as a standard graph, you would create 53 standard edges. This hyper-edge here, you would break up into three separate edges; this hyper-edge you would break up into two separate edges; et cetera. So you have 53 standard edges. One of the basic observations is that the standard edge representation fragments the hyper-edges, and a lot of information is lost. The digraph representation compresses the multi-edges, and a lot of information is lost. The matrix representation drops the edge labels, and a lot of information is lost. And the standard graph representation drops the edge order--again, more information loss. So we really need something better than the traditional way to do that.

The solution to this problem is to use the incidence matrix formulation, where we assign every single edge a label, a color, and the order in which it was laid down, and then record which vertices it touches. So this edge B1 is blue, it's in the second order group, and it connects these three vertices. And you see various structures appearing here from the different types of things. So this is how we would represent this data as an incidence matrix in a way that preserves all the information. That's really the power of the incidence matrix approach.

All right. So that actually brings us to the end of the first--oh, we have a question here.

AUDIENCE: So this is great, and I've worked with [INAUDIBLE] before.
But what kind of analysis can you actually do on that?

JEREMY KEPNER: So the analysis you typically do is you take projections into adjacency matrices. Yep. This way, you're preserving the ability to do all of those projections. And sometimes, to make progress when your data sets are large, you're just forced to make some projections to get started. Theoretically, one could argue that the right way to proceed is: you have the incidence matrix; you derive the analytics you want, maybe using adjacency matrices as an analytic intermediate step; but then, once you've figured out the analysis, you go back and make sure that the analytic runs directly on the raw incidence matrix itself, and save yourself the time of actually constructing the adjacency matrix. Sometimes this space is so large, though, that it can be difficult to get our minds around it. And sometimes it can be useful to just say, you know what, I really think these projections are going to be useful ones to get started with. Rather than getting lost in this forest, sometimes it's better to say, I'm just going to start by projecting--I'm going to create a few of these adjacency matrices to get started, make progress, do my analytics, and then figure out if I need to fix it from there. Because working in this space directly can be a little bit tricky. For every single analysis, it's like working in the voltage domain: you're constantly having to keep extra theoretical machinery around. And sometimes, to make progress, you're better off just going to the power domain, making progress, and maybe discovering later whether you can fix it up by doing other types of things.

Are there other questions before we come to the end of this part of the lecture? Great. OK, thank you. So we'll take a short break here, and then we'll proceed to the demo. There is a sign-up sheet here.
If you haven't signed up, please write your name down. Apparently this is extremely important for the accounting purposes of the laboratory, so I would encourage you to do that. We'll start up again in about five minutes.