The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, we're going to get started here, everyone. I know we have a lot going on here at the lab today, and I'm very pleased that those of you here decided to choose us over all the other things that are going on. So that's great, I'm very pleased.

I wanted to say that I think I've provided feedback to everyone who submitted the first homework over the weekend, and I was very impressed and pleased by those of you who did it. I think you really embraced the spirit of the homework very well. The homework that was just assigned really builds on that one. For those of you who didn't do the first one, you're kind of left out in the cold.
But I would like to give my congratulations to those who took on that second homework, which definitely required a fair amount of contemplation even to attempt. So I want to thank those of you who did that.

So we're going to get started here, and we're going to bring up the slides. We're kind of entering the second of the three sections of the course. The first three lectures were really the first part: a lot of motivation and a lot of theory. We'll be continuing that trend -- we'll still always have plenty of motivation and a little bit of theory -- but we're getting more and more towards things that feel like the things you might actually really do. So we're entering that next phase of the course.

So let's get started. This is lecture 03. We'll pull up an example from some real data and talk about that. Here is the outline of the course.
Actually, just before we do that, a bit of somewhat mandatory review for those of you who are just joining us, or are watching on the web and have jumped ahead to this lecture. The title of the class is Signal Processing on Databases, which is two phrases we don't normally see together. Signal processing is really alluding to detection theory, formal detection theory, and its linear algebraic foundations. And databases is really alluding to dealing with strings, and unstructured data, and graphs. In this course, we're talking about a technology, D4M, which allows us to pull these two views together. So again, we're now on lecture 03, and we will continue on this journey that we have started.

Here's the outline. A little bit of history: I'm going to talk about how the web became the thing it is today, which I think says a lot about why we invented the D4M technology. So I'll talk about the specific gap that it fills, and talk a little bit about the D4M technology and some of the results that we've had.
And then, hopefully, today's demo is more substantive, and so we'll spend more time on that than we did last time, where the demo was really fairly small.

So for those of you who don't remember back in the ancient, primordial days of the web, back in the early 90s, this is what it looked like. If I talk about the hardware side of the web, it was quite simple. The hardware side of the web was all Sun boxes. We all had our Sun workstations, and there were clients, and there were servers, and there were databases. And the databases that we had at the time were SQL databases. Oracle was really just beginning to get going then. Sybase was the dominant player at that stage.

This is an example of the very first modern web page, or website, that was built. I'm sure there were other people doing very similar things at the same time, so it's among the first. I'm not going to be like Al Gore and say I invented the internet.
But a friend of mine, Don Beaudry, and I were working for a company called Vision Intelligence Corporation, which is now part of GE. And we had a data set in a SQL database, and we wanted to make it available. There were a lot of thick-client tools that came with our Sybase package for doing that, but we found them a little bit clunky.

And so my friend Don said, hey, I found this new software. I found a beta release of this software called NCSA Mosaic. For those who don't know, NCSA Mosaic was the first browser, and it was invented at the National Center for Supercomputing Applications. And they released it, and it was awesome. It changed everything.

Just so you know, prior to that -- and not to disrespect anybody who invented anything -- HTTP and HTML were dying. They had come out in 1989, and everyone was like, this is silly. What's this? Why do I have to type this HTTP and then this HTML? It's just very clunky.
And it was getting its clock pretty well cleaned by another technology called Gopher, which was developed at the University of Minnesota, and which was a much easier interface. The server was a lot better. It was a lot easier to do links than in HTML. It was just a menu-driven, text-based way to do that.

Well, Mosaic created a GUI front end to HTML and allowed you to have pictures, and that changed everything. It just caught on like wildfire. It was just a beta at that point in time, and I don't think you could really do pictures yet, but it rendered fonts nicely, and Gopher did not. Gopher was just plain text. So with Mosaic you could have underline and bold, and the links were very clear -- you could click on them. You could even have indents and bullets. You could imagine organizing something that would look like a web page today.

And so we had that, and we were like, oh, well, we'll talk to our database using that. We'll use HTTP put to post a request.
The Mosaic server was so bad that we actually just used the Gopher server to talk to the Mosaic browser. And so we would do puts, and then we used a language called Perl, which many of you know, because it actually had a direct binding to Sybase. And so we would take that, format it, and create SQL queries. And then it would send us back strings, which we would then format as HTML and input to the browser, and it was great. It was very nice, and it was sort of the first modern web page, which we did in 1992. And essentially a good fraction of the internet kind of looks like this today.

Our conclusion at that time was that this browser was a really lousy GUI; that HTTP was really trying to be a file system, but was a really bad file system; that Perl was really bad for analysis; and that SQL -- it's good for credit card transactions, but it wasn't really good for getting data in and out. So our conclusion was that this is an awful lot of work just to view some documents, and it won't catch on. Our conclusion was it won't catch on.
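The glue pattern just described -- a browser request turned into a SQL query, with the result strings formatted back as HTML -- can be sketched in a few lines. This is a minimal illustration only, in Python with an in-memory SQLite database and hypothetical table and column names; the original system used Perl with a direct Sybase binding.

```python
import sqlite3
from html import escape

def handle_request(db, prefix):
    """Turn one browser request parameter into a SQL query, and
    format the rows that come back as an HTML page."""
    rows = db.execute(
        "SELECT name, value FROM records WHERE name LIKE ?",
        (prefix + "%",),
    ).fetchall()
    # Format the returned strings as HTML list items for the browser.
    items = "".join(f"<li>{escape(n)}: {escape(v)}</li>" for n, v in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

# Tiny in-memory stand-in for the SQL database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (name TEXT, value TEXT)")
db.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("alpha", "1"), ("alphabet", "3"), ("beta", "2")],
)

page = handle_request(db, "alpha")
```

The same request-in, query-out, HTML-back loop is still how a large fraction of database-backed web pages work today.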
So with that conclusion, maybe I'm undermining my judgment for the entire course, given that we decided this was a bad idea.

So after that we had what we call the Cambrian explosion of the internet. On the hardware side, we had big, giant data centers emerging, merging servers and databases, which is great. We had all these new types of hardware technologies: laptops, and tablets, and iPhones, and all that kind of stuff. So the hardware technology went forward in leaps and bounds.

And then the browsers really took off, all of them. We have all these different browsers. You can still see I have the Mosaic logo back there, still showing through a little bit. For those of you who don't know, the author of Mosaic at NCSA was a guy by the name of Marc Andreessen. He left NCSA and formed a company called Netscape, which actually, at a moment in time, was the dominant -- they were the internet. They got crushed, although I think Marc Andreessen made a fair amount of money.
But they did open source all their software, and they basically rewrote Mosaic from the ground up, and that became Mozilla and Firefox and all that. So that's still around today. Safari may have borrowed from this; I don't really know.

When Netscape was dominating the internet, Microsoft needed to -- well, first they thought the internet was going to be a fad. And then they were like, all right, this is serious. We need a browser. Mosaic had actually been spun out of the University of Illinois into a company called Spyglass, which may actually still exist today. And then Microsoft bought the browser, which was the original NCSA Mosaic code, and that became Internet Explorer. So it's still around today.

Then there's Chrome, and as far as I know Chrome is a complete rewrite from the ground up. And probably some of the others, like Opera and other browsers that are around today, are complete rewrites too, but they still follow the same model: a browser you use. HTTP still uses the basic put/get syntax. SQL, we're still using today.
We've had a whole explosion of database technologies: Oracle being very dominant, but MySQL, SQL Server, and other types of things; Sybase still around, very common. And we still have two major servers out there, Apache and Windows. So Apache was the Mosaic server. And just so you know, Apache was "a patchy server," meaning it was very buggy. So it was "a patchy server," and that became Apache. So the new Mosaic server became Apache because it was a patchy server. And I bet they had borrowed code from Gopher, so that's why the Gopher, I think, is still showing there a little bit.

We've had a proliferation of languages that are very similar, in terms of gluing these two layers together: Java, et cetera, all that type of stuff.

But the important thing to recognize is that this whole architecture was not put together because somehow we felt the web browser was the optimal tool, or SQL was the optimal tool, or these languages were the optimal tool. They were just things that were lying around that could be repurposed for this job.
And so that's why they're not necessarily optimal. But nevertheless, our conclusion that it would not catch on has to be slightly revised, because obviously it did catch on. The desire to share information was so great that it overcame the high barrier to entry just to set up a web page or something like that.

Now things have changed significantly. We're not the first people to observe that this original architecture is somewhat broken for this type of model. And so today the web looks a lot more like this. You see all of our awesome client hardware, which is great. We have servers and databases that have now moved, in terms of the data center, into shipping containers and other types of things, which is great.

Increasingly, our interfaces look a lot more like video games. I would tell you that a typical app on the iPad feels a lot more like a video game than a browser, which is great, so we're really moving beyond the browser as the dominant way of getting our information.
And companies like Google and others recognized that if you're just presenting data to people, and you're not doing, say, credit card transactions, then a lot of the features of SQL and those types of databases are maybe more than you need, and you can actually make these very scalable databases more simply. They're called triple stores. Google Bigtable was the original one. Riak is another. Cassandra is another. HBase is another. Accumulo, which is the one we use in this class, is another. So this whole new technology space has really driven that and made it a lot more scalable.

But unfortunately, in the middle here, we're still very much left with much of the same technology. And this is kind of the last piece, this glue. And so our conclusion now is that it's a lot of work to view a lot of data, but it's a pretty great view, and we really can view a lot of data. But we still have this middle problem. There's this gap. And so we're trying to address that gap, and we address it with D4M.
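The triple-store idea can be illustrated with a small sketch: every record is a (row key, column key, value) triple, and the whole collection can be viewed as a very large sparse matrix indexed by strings. This is a hypothetical Python illustration of the concept only -- it is not the API of Accumulo, HBase, or D4M (which is written in Matlab), and the keys shown are made up.

```python
# Each record in a triple store is a (row key, column key, value) triple.
triples = [
    ("alice", "color|red",   1),
    ("alice", "city|boston", 1),
    ("bob",   "color|red",   1),
]

# Viewed as a sparse matrix: only the nonzero (row, col) entries are stored.
matrix = {(r, c): v for r, c, v in triples}

def get_row(matrix, row_key):
    """All column/value pairs for one row key -- i.e., one record."""
    return {c: v for (r, c), v in matrix.items() if r == row_key}

def get_col(matrix, col_key):
    """All row/value pairs for one column key -- an inverted index."""
    return {r: v for (r, c), v in matrix.items() if c == col_key}
```

Looking things up by row gives you a record; looking things up by column gives you an inverted index over all records. That both directions fall out of the same sparse-matrix view is the gap-filling idea the lecture is building toward.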
So D4M is formally designed from the ground up to work on triples, the kind of data we store in these modern databases, and to do that kind of analysis in less code than you would otherwise need.

D4M stands for Dynamic Distributed Dimensional Data Model, or, sometimes in quotes, "Databases for Matlab," because it happens to be written in Matlab. It connects triple stores, which can view their data as very large sparse matrices of triples, with the mathematical concept of associative arrays that we talked about last class. That's essentially what D4M does for you.

So that's one holistic view of the technology space that we're dealing with here. Another way to view it: I'm going to talk a little bit about the technologies that brought this to you. So I'm going to tell you a little bit about LLGrid and what we do there, because you're using your LLGrid accounts for this. LLGrid is the core supercomputer at Lincoln Laboratory.
And we're here to help with all your computing needs. Likewise, we're here to help with your D4M needs. If you have a project, and you think it would be useful to use D4M in that project, please contact us. You don't have to just take this class and think that you're on your own. We are a service. We will help you do the schemas, get the code going, get it adapted to your problem. That's what we do.

LLGrid -- we actually have well over this now, but just so I don't have to keep changing the slide, we always say it's about 500 users and about 2,000 processors. It's more than that. To our knowledge, it's really the world's only desktop interactive supercomputer. It's dramatically easier to use than any other supercomputer. And as a result, as an organization, we have the highest fraction of staff using supercomputing on a regular basis of any organization, just because it's so easy. And it's also been the foundation, the vision, for supercomputing in the entire state of Massachusetts. Our ideas have been adopted by other universities in the state and are moving forward.
Not really the technology that we're talking about today, but something that is really the bread and butter of LLGrid, is our Parallel Matlab technology, which will be celebrating its 10th anniversary this year. It's really a way to allow people to do parallel programming relatively easily. D4M interacts seamlessly with this technology, and so you will also be doing parallel database codes with it.

So we have this thing called Parallel Matlab. If you're an LLGrid user, you almost always get the book on Parallel Matlab. It allows you to take things like matrices and chop them up very easily to do complicated calculations. And almost every single parallel program that you want to write can be written with just the addition of a handful of functions. And we get to use what is called the distributed arrays parallel programming model, which is generally recognized as by far the best of all the parallel programming models, if you are comfortable with multi-dimensional arrays. Everyone at MIT is, so this is easy. However, this is not generally true.
Thinking in matrices is not a universal skill, or universal training, and so sometimes we do say this is software for the 1%. So, you know, it is not for everyone. But if you do have this knowledge and background, it allows you to do things very naturally. And as you know, we've talked about all of our associative arrays as a way of looking at data as matrices, and so these two technologies come together very nicely.

And we are really the only place in the world where you can routinely use this model when you do parallel programming. And you might say, well, why do I say it's the best? Well, we run contests. There's a contest every year where we bake off the best parallel programming models, and distributed arrays in high-level environments wins every year. So it's kind of the best model, if you can understand multi-dimensional arrays.

It wouldn't be a class on computing in 2012 unless we talked about the cloud, so we definitely need to talk about the cloud.
And to make it simple for people, we're going to divide the cloud into two halves. You can subdivide the cloud into many, many different components.

There's what we call utility cloud computing, which is what most people use the cloud for. So Gmail, enterprise services, calendars online, basic data sharing, photo sharing, and stuff like that -- that's what we call utility cloud computing. You can do human resources on the cloud. You can do accounting on the cloud. All this type of stuff is on the cloud, and it's a very, very successful business.

What we mostly focus on is what is called data intensive cloud computing. It's based on Hadoop and other types of technologies. But even still, a lot of that is technology that's used by large-scale cloud companies but that they don't really share with you. Google -- although its infrastructure is often associated with Hadoop, Google doesn't actually use Hadoop. Hadoop is based on a small part of the Google infrastructure. Google's infrastructure is vastly larger than that.
And they don't just let anybody log in and say, oh, go ahead, mine all the Google data. They expose services, allowing you to do some of that, but the core technology, they don't expose. All the large companies are like that: they don't expose it to you. They give you selected services and such. But this is really what we're focused on: this data intensive computing.

If I want to further subdivide the cloud from an implementer's perspective: if you own a lot of computing hardware, I mean hundreds of kilowatts or megawatts of computing hardware, you're most likely doing one of these four things.

You could be doing traditional supercomputing, which we all know and love, which is closest to what we do in LLGrid. You could be running a traditional database management system -- so every single time you use a credit card to make a purchase, you're probably connecting to a traditional database management system. You could be doing enterprise computing, which mostly, in this day and age, is run out of VMware. So this is the LLGrid logo. And so you could be doing that.
444 00:21:48,130 --> 00:21:50,470 So that's all the things I earlier mentioned. 445 00:21:50,470 --> 00:21:53,010 Or this new kid on the block, big data, 446 00:21:53,010 --> 00:21:58,380 which is all the buzz today, using Java and MapReduce 447 00:21:58,380 --> 00:22:00,070 and other types of things. 448 00:22:00,070 --> 00:22:03,220 The important thing to recognize is these four areas, 449 00:22:03,220 --> 00:22:08,830 each of these is a multi-tens-of-billions-of-dollars industry. 450 00:22:08,830 --> 00:22:10,660 The IT world has gotten large enough 451 00:22:10,660 --> 00:22:14,160 that we now see specialization down to the hardware level. 452 00:22:14,160 --> 00:22:16,530 The hardware, the processors, that these folks use 453 00:22:16,530 --> 00:22:17,732 are different. 454 00:22:17,732 --> 00:22:19,440 Likewise for these folks and these folks. 455 00:22:19,440 --> 00:22:21,910 You see specialization of chips, now, 456 00:22:21,910 --> 00:22:23,940 for these different types of things. 457 00:22:23,940 --> 00:22:27,550 So each of these is at the center of a multi-billion-dollar 458 00:22:27,550 --> 00:22:30,830 ecosystem, and they each have pros and cons. 459 00:22:30,830 --> 00:22:33,250 And sometimes you can do your mission wholly in one, 460 00:22:33,250 --> 00:22:34,600 but sometimes you have to cross. 461 00:22:34,600 --> 00:22:36,725 And when you do cross, it can be quite a challenge, 462 00:22:36,725 --> 00:22:40,570 because these worlds are not necessarily compatible. 463 00:22:40,570 --> 00:22:43,690 And I would say, just to help people to understand Hadoop, 464 00:22:43,690 --> 00:22:50,090 one must recognize that Java is the first-class 465 00:22:50,090 --> 00:22:53,840 citizen in the Hadoop world. 466 00:22:53,840 --> 00:22:56,490 The entire infrastructure is written in Java. 
467 00:22:56,490 --> 00:22:58,950 It's designed so that if you only know Java, 468 00:22:58,950 --> 00:23:01,860 you can actually administer and manage a cluster, which 469 00:23:01,860 --> 00:23:03,480 is why it's become so popular. 470 00:23:03,480 --> 00:23:04,980 There are so many Java programmers 471 00:23:04,980 --> 00:23:08,280 who just know Java, and they download Hadoop, 472 00:23:08,280 --> 00:23:11,450 and if they have rudimentary access to some computer systems, 473 00:23:11,450 --> 00:23:14,500 they can sort of cobble together a cluster. 474 00:23:14,500 --> 00:23:16,410 I should say the same thing occurred 475 00:23:16,410 --> 00:23:18,500 in the C and Fortran communities in the mid-90s. 476 00:23:18,500 --> 00:23:20,250 There was a technology called MPI, 477 00:23:20,250 --> 00:23:23,890 in fact this is the MPI logo today, which allowed you, 478 00:23:23,890 --> 00:23:28,810 if you knew C or Fortran and had rudimentary access to a network 479 00:23:28,810 --> 00:23:31,920 of workstations, to create a cluster with MPI. 480 00:23:31,920 --> 00:23:34,550 And that really sort of began the whole sort 481 00:23:34,550 --> 00:23:36,590 of parallel cluster computing revolution. 482 00:23:36,590 --> 00:23:38,180 Hadoop does the same thing in Java, 483 00:23:38,180 --> 00:23:40,660 and so for Java programmers it's been 484 00:23:40,660 --> 00:23:44,100 a huge success in that regard. 485 00:23:44,100 --> 00:23:46,310 Just so you know where we fit in with that, 486 00:23:46,310 --> 00:23:49,460 we have LLGrid here, which has really 487 00:23:49,460 --> 00:23:53,890 made traditional supercomputing feel interactive 488 00:23:53,890 --> 00:23:56,140 and on demand and elastic, so that's one of the things 489 00:23:56,140 --> 00:23:57,877 we do with LLGrid that's very unique. 490 00:23:57,877 --> 00:23:58,710 You launch your job. 491 00:23:58,710 --> 00:23:59,543 It runs immediately. 
492 00:23:59,543 --> 00:24:02,690 That is different than the way every other supercomputing 493 00:24:02,690 --> 00:24:05,130 center in the country is run. 494 00:24:05,130 --> 00:24:08,760 Most supercomputing centers are run by, you write your program, 495 00:24:08,760 --> 00:24:13,470 and you launch it, and you wait until the queue says 496 00:24:13,470 --> 00:24:15,110 that it is run. 497 00:24:15,110 --> 00:24:18,090 And in fact, some centers will even 498 00:24:18,090 --> 00:24:21,257 use, as a metric of success, how long their wait is. 499 00:24:21,257 --> 00:24:22,840 It's kind of like a restaurant, right? 500 00:24:22,840 --> 00:24:24,870 How many years do you have to wait 501 00:24:24,870 --> 00:24:27,690 to get a good reservation, or a good doctor right? 502 00:24:27,690 --> 00:24:32,910 And I remember someone saying, oh our system is so successful 503 00:24:32,910 --> 00:24:35,530 that our wait is a week. 504 00:24:35,530 --> 00:24:39,982 You hit submit and you will have to wait a week to run your job. 505 00:24:39,982 --> 00:24:41,690 That's not really consistent with the way 506 00:24:41,690 --> 00:24:42,990 we do business around here. 507 00:24:42,990 --> 00:24:44,590 You're under very tight deadlines. 508 00:24:44,590 --> 00:24:47,859 So we created a very different type of system 10 years ago, 509 00:24:47,859 --> 00:24:49,900 which sort of feels more like what you people are 510 00:24:49,900 --> 00:24:51,690 trying to do in the cloud. 511 00:24:51,690 --> 00:24:53,880 In the Hadoop community, there's been a lot of work 512 00:24:53,880 --> 00:24:55,390 to try and make databases. 513 00:24:55,390 --> 00:24:59,370 What these technologies give you is sort of efficient search, 514 00:24:59,370 --> 00:25:01,025 and so there are a number of databases 515 00:25:01,025 --> 00:25:03,150 that have been developed that sit on top of Hadoop, 516 00:25:03,150 --> 00:25:06,750 HBase being one, and Accumulo being another. 
517 00:25:06,750 --> 00:25:10,630 Accumulo, to my knowledge, is the highest performance 518 00:25:10,630 --> 00:25:12,370 open source triple store in the world, 519 00:25:12,370 --> 00:25:14,990 and probably will remain that way for a very long time. 520 00:25:14,990 --> 00:25:18,460 And so we will be using the Accumulo database. 521 00:25:18,460 --> 00:25:21,060 And then we've created bindings to that. 522 00:25:21,060 --> 00:25:23,420 So we have our D4M technology, which allows 523 00:25:23,420 --> 00:25:25,880 you to bind to databases. 524 00:25:25,880 --> 00:25:28,640 And we also have something called LLGrid MapReduce. 525 00:25:28,640 --> 00:25:31,400 MapReduce is sort of the core programming model 526 00:25:31,400 --> 00:25:33,560 within the Hadoop community. 527 00:25:33,560 --> 00:25:36,410 It's such a trivial model, it almost 528 00:25:36,410 --> 00:25:37,680 doesn't even deserve a name. 529 00:25:37,680 --> 00:25:41,630 It's basically like, write a program given a list of files. 530 00:25:41,630 --> 00:25:44,550 Each program runs on each file independently. 531 00:25:44,550 --> 00:25:46,540 I mean it's a very, very simple model. 532 00:25:46,540 --> 00:25:49,710 It's been around since the beginning of computing, 533 00:25:49,710 --> 00:25:55,040 but the name has become popularized here, 534 00:25:55,040 --> 00:25:58,910 and we now have a way to do MapReduce right in LLGrid. 535 00:25:58,910 --> 00:26:01,610 It's very easy, and people enjoy it. 536 00:26:01,610 --> 00:26:04,170 You can, of course, do the same thing 537 00:26:04,170 --> 00:26:07,860 with other technology in LLGrid, but for those people, 538 00:26:07,860 --> 00:26:10,620 particularly if you're writing Java or-- not Matlab. 539 00:26:10,620 --> 00:26:12,940 We don't recommend people use this for Matlab, 540 00:26:12,940 --> 00:26:15,920 because we have better technologies for doing that. 
541 00:26:15,920 --> 00:26:20,720 But if you're using Python, or Java, or other programs, 542 00:26:20,720 --> 00:26:23,990 then MapReduce is there for you. 543 00:26:23,990 --> 00:26:26,580 As we move towards combining-- so our big vision here 544 00:26:26,580 --> 00:26:28,750 is to combine big compute and big data, 545 00:26:28,750 --> 00:26:31,290 and so we have a whole stack here 546 00:26:31,290 --> 00:26:33,690 as we deal with new applications, 547 00:26:33,690 --> 00:26:38,000 text, cyber, bio, other types of things, the new APIs 548 00:26:38,000 --> 00:26:40,740 that people write, the new types of distributed databases. 549 00:26:40,740 --> 00:26:44,990 So it's really affecting everything. 550 00:26:44,990 --> 00:26:48,792 Just an example of the Hadoop architecture. 551 00:26:48,792 --> 00:26:49,947 You've had this course. 552 00:26:49,947 --> 00:26:51,530 You should at least be able, in a few minutes, 553 00:26:51,530 --> 00:26:53,446 if someone asks, well, what is Hadoop, 554 00:26:53,446 --> 00:26:54,700 to say what it does. 555 00:26:54,700 --> 00:26:56,820 This is the basic architecture. 556 00:26:56,820 --> 00:27:00,080 So you have a Hadoop MapReduce job. 557 00:27:00,080 --> 00:27:02,790 You submit it, and it goes to a JobTracker. 558 00:27:02,790 --> 00:27:07,170 The JobTracker then breaks that up into subtasks, each of which 559 00:27:07,170 --> 00:27:08,720 has its own TaskTracker. 560 00:27:08,720 --> 00:27:14,812 Those tasks then get sent to the DataNodes in the architecture, 561 00:27:14,812 --> 00:27:17,100 and they run on those DataNodes, and the NameNode 562 00:27:17,100 --> 00:27:19,440 keeps track of the actual names of the things. 563 00:27:19,440 --> 00:27:21,710 It's just a very simple Hadoop cluster. 564 00:27:21,710 --> 00:27:24,740 These are the different types of nodes 565 00:27:24,740 --> 00:27:27,230 and processes in the architecture. 
566 00:27:29,830 --> 00:27:32,360 Hadoop's strengths and weaknesses, well it 567 00:27:32,360 --> 00:27:35,405 does allow you to distribute processing of large data. 568 00:27:39,580 --> 00:27:45,450 Best use case is, let's say you had an enormous number of log 569 00:27:45,450 --> 00:27:50,850 files, and you decided you only wanted to search them once, 570 00:27:50,850 --> 00:27:54,500 for one string. 571 00:27:54,500 --> 00:27:57,075 That's the perfect use case for Hadoop. 572 00:27:57,075 --> 00:27:59,450 If you decide, though, that you would like to actually do 573 00:27:59,450 --> 00:28:04,750 more than one search, then it doesn't necessarily 574 00:28:04,750 --> 00:28:06,290 always make sense. 575 00:28:06,290 --> 00:28:09,030 So it's basically designed to run grep 576 00:28:09,030 --> 00:28:11,690 on an enormous number of files. 577 00:28:11,690 --> 00:28:13,610 And I kid you not, there are companies today 578 00:28:13,610 --> 00:28:16,680 that actually do this, like at production scale. 579 00:28:16,680 --> 00:28:18,740 All their analytics are done by just grepping 580 00:28:18,740 --> 00:28:21,620 enormous numbers of log files. 581 00:28:21,620 --> 00:28:25,940 Needless to say, the entire database community cringes, 582 00:28:25,940 --> 00:28:29,990 and rolls over in their graves if they are no longer alive, at this. 583 00:28:29,990 --> 00:28:33,740 Because we have solved this problem 584 00:28:33,740 --> 00:28:36,750 by indexing our data, and that allows you to do fast search. 585 00:28:36,750 --> 00:28:38,650 And that's why people have invented 586 00:28:38,650 --> 00:28:40,470 databases like HBase and Accumulo, 587 00:28:40,470 --> 00:28:43,730 which sit on top of Hadoop, because they recognize that, 588 00:28:43,730 --> 00:28:46,150 really, if you're going to search more than once, 589 00:28:46,150 --> 00:28:48,660 you should index your data. 590 00:28:48,660 --> 00:28:51,480 Again, it is very scalable. 
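The grep-once use case the lecture describes can be sketched in a few lines of Python. This is a toy stand-in for a real Hadoop Java job, with made-up file names; the point is just the shape of it: each map task scans one file independently, and a reduce step concatenates the results.

```python
import os
import re
import tempfile

def grep_map(path, pattern):
    """Map task: scan one log file independently for matching lines."""
    regex = re.compile(pattern)
    with open(path) as f:
        return [line.rstrip("\n") for line in f if regex.search(line)]

def grep_reduce(per_file_hits):
    """Reduce task: concatenate the per-file results."""
    return [hit for hits in per_file_hits for hit in hits]

# Two throwaway "log files" standing in for an enormous collection.
tmp = tempfile.mkdtemp()
for name, text in [("a.log", "ok\nERROR disk full\n"),
                   ("b.log", "ok\nok\nERROR net down\n")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

files = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))
hits = grep_reduce([grep_map(p, r"ERROR") for p in files])
print(hits)  # the two ERROR lines
```

Because every file is scanned independently, the map calls can be farmed out to as many nodes as you have files, which is exactly why the single-search case scales so well, and why a second search costs you a full rescan.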
591 00:28:51,480 --> 00:28:54,980 It is fundamentally designed to have extremely 592 00:28:54,980 --> 00:28:57,160 unreliable hardware in it. 593 00:28:57,160 --> 00:28:58,790 So it is quite resilient. 594 00:28:58,790 --> 00:29:00,810 It does this resiliency at a cost. 595 00:29:00,810 --> 00:29:04,210 It relies on a significant amount of replication. 596 00:29:04,210 --> 00:29:07,370 Typically the standard replication in a Hadoop cluster 597 00:29:07,370 --> 00:29:10,500 is a factor of 3. 598 00:29:10,500 --> 00:29:12,630 If you're in the high performance storage business, 599 00:29:12,630 --> 00:29:15,110 this also makes you cringe, because you're 600 00:29:15,110 --> 00:29:17,160 paying 3x for your storage. 601 00:29:17,160 --> 00:29:21,870 We have techniques like RAID, which allow us to do that much more efficiently. 602 00:29:21,870 --> 00:29:25,070 But again, the expertise required to set up a cluster 603 00:29:25,070 --> 00:29:28,230 and do 3x replication is relatively low, 604 00:29:28,230 --> 00:29:30,020 and so that makes it very appealing 605 00:29:30,020 --> 00:29:35,020 to many, many, many folks, and so it's very useful for that. 606 00:29:35,020 --> 00:29:38,740 Some of the difficulties: the scheduler in Hadoop 607 00:29:38,740 --> 00:29:42,170 is very immature. 608 00:29:42,170 --> 00:29:44,320 Schedulers are very well defined. 609 00:29:44,320 --> 00:29:46,370 There's about two or three standard schedulers 610 00:29:46,370 --> 00:29:48,100 in the supercomputing community. 611 00:29:48,100 --> 00:29:49,490 They're each about 20 years old. 612 00:29:49,490 --> 00:29:52,430 They all have the same 200 features, 613 00:29:52,430 --> 00:29:54,670 and you need about 200 features to really do 614 00:29:54,670 --> 00:29:56,320 a proper scheduler. 615 00:29:56,320 --> 00:29:59,220 Hadoop is about four or five years old, 616 00:29:59,220 --> 00:30:00,890 and it's got about that many features. 617 00:30:00,890 --> 00:30:05,260 And so you do often have to deal with collisions. 
618 00:30:05,260 --> 00:30:08,110 It's easy for one user to hog and monopolise 619 00:30:08,110 --> 00:30:09,910 the entire cluster. 620 00:30:09,910 --> 00:30:11,950 You're often dealing with various overheads 621 00:30:11,950 --> 00:30:14,032 from the scheduler itself. 622 00:30:14,032 --> 00:30:15,990 And this is well known to the Hadoop community. 623 00:30:15,990 --> 00:30:18,660 They're working on it actively, and we'll 624 00:30:18,660 --> 00:30:23,570 see all that-- it's not an easy multi-user environment. 625 00:30:23,570 --> 00:30:26,380 And as I've said before, it fundamentally 626 00:30:26,380 --> 00:30:30,800 relies on the fact that the JVM is on every node. 627 00:30:30,800 --> 00:30:34,020 Because when you're sending a program to every node, 628 00:30:34,020 --> 00:30:37,150 the interpreter for that must exist on every node. 629 00:30:37,150 --> 00:30:39,440 And by definition, the only language 630 00:30:39,440 --> 00:30:41,890 that you're guaranteed to have on every node in a Hadoop 631 00:30:41,890 --> 00:30:43,720 cluster is Java. 632 00:30:43,720 --> 00:30:46,890 Any other language, any other tool, 633 00:30:46,890 --> 00:30:49,570 has to be installed and become a part 634 00:30:49,570 --> 00:30:53,600 of the image for the entire system 635 00:30:53,600 --> 00:30:56,030 so that it can be run on every node. 636 00:30:56,030 --> 00:30:58,570 So that is something that one has to be aware of. 637 00:30:58,570 --> 00:31:00,153 And we've certainly seen it's like, oh 638 00:31:00,153 --> 00:31:02,270 I wrote these great programs in another language. 639 00:31:02,270 --> 00:31:04,020 And it's like, well can't I just run them? 640 00:31:04,020 --> 00:31:07,800 It's like no, that doesn't exist. 641 00:31:07,800 --> 00:31:11,220 Even distributing a fat binary can be difficult, 642 00:31:11,220 --> 00:31:13,980 because it has to be distributed through the Hadoop Distributed 643 00:31:13,980 --> 00:31:14,965 File system. 
644 00:31:14,965 --> 00:31:16,590 All data is distributed there, and that 645 00:31:16,590 --> 00:31:20,140 is not-- you have to essentially write a wrapper program that 646 00:31:20,140 --> 00:31:22,270 then recognizes where to get your binary, 647 00:31:22,270 --> 00:31:27,190 and then executes that thing, and it can be tricky. 648 00:31:27,190 --> 00:31:29,640 But the basic LLGrid MapReduce architecture 649 00:31:29,640 --> 00:31:31,990 simplifies this significantly, although this maybe 650 00:31:31,990 --> 00:31:33,468 doesn't look like it. 651 00:31:36,620 --> 00:31:40,550 Essentially you call LLGrid MapReduce. 652 00:31:40,550 --> 00:31:43,680 It launches a bunch of what are called mapper tasks. 653 00:31:43,680 --> 00:31:45,920 It runs them on your different input files. 654 00:31:45,920 --> 00:31:48,270 When they are done, they have created output files. 655 00:31:48,270 --> 00:31:50,200 And then it runs another program, 656 00:31:50,200 --> 00:31:52,610 if you specified one, that combines the output. 657 00:31:52,610 --> 00:31:54,650 So that's the basic model of MapReduce, 658 00:31:54,650 --> 00:31:58,210 which is you have what's called a map program 659 00:31:58,210 --> 00:32:01,360 and a reduce program, and the map program is 660 00:32:01,360 --> 00:32:02,400 given a list of files. 661 00:32:02,400 --> 00:32:04,220 It runs on those files, one at a time. 662 00:32:04,220 --> 00:32:06,230 Each one of them generates an output, 663 00:32:06,230 --> 00:32:08,870 and the reduce program pulls them all together. 664 00:32:08,870 --> 00:32:12,960 All right, so that's a little tutorial on Hadoop, 665 00:32:12,960 --> 00:32:16,540 because I felt an obligation to explain that to you. 666 00:32:16,540 --> 00:32:19,890 We couldn't be a big data course without doing that. 667 00:32:19,890 --> 00:32:22,650 But it's very simple, and very popular, 668 00:32:22,650 --> 00:32:27,070 and I expect it to maintain its popularity for a very long time. 
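The map-then-reduce file flow just described -- mappers run independently on input files and write output files, then an optional combining program pulls those together -- can be sketched in Python. This is a toy word count, not the actual LLGrid MapReduce API; the file names and the tab-separated output format are made up for illustration.

```python
import collections
import os
import tempfile

def mapper(in_path):
    """Mapper task: count words in one input file, write '<input>.out'."""
    counts = collections.Counter(open(in_path).read().split())
    out_path = in_path + ".out"
    with open(out_path, "w") as f:
        for word, n in counts.items():
            f.write(f"{word}\t{n}\n")
    return out_path

def reducer(out_paths):
    """Reducer task: combine the per-file outputs into global counts."""
    total = collections.Counter()
    for path in out_paths:
        for line in open(path):
            word, n = line.split("\t")
            total[word] += int(n)
    return dict(total)

# Two throwaway input files.
tmp = tempfile.mkdtemp()
for name, text in [("x.txt", "to be or not to be"), ("y.txt", "be")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)

inputs = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))
result = reducer([mapper(p) for p in inputs])
print(result)
```

The mappers never talk to each other, which is what makes the model so easy to distribute; all of the coordination lives in the file system and the final reduce step.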
669 00:32:27,070 --> 00:32:29,140 I think it will evolve and get better, 670 00:32:29,140 --> 00:32:32,830 but for the vast majority of people on planet Earth, 671 00:32:32,830 --> 00:32:34,824 this is really the most accessible form 672 00:32:34,824 --> 00:32:37,240 of parallel computing technology that they have available. 673 00:32:37,240 --> 00:32:40,070 So we should all be aware of Hadoop, and its existence, 674 00:32:40,070 --> 00:32:41,320 and what it can do. 675 00:32:41,320 --> 00:32:43,830 Because many of our customers use Hadoop, 676 00:32:43,830 --> 00:32:47,611 and we need to work with them. 677 00:32:47,611 --> 00:32:50,260 All right, so getting back now to D4M. 678 00:32:50,260 --> 00:32:52,310 I think I've mentioned a lot of this before. 679 00:32:52,310 --> 00:32:55,560 The core concept of D4M is the multi-dimensional associative 680 00:32:55,560 --> 00:32:56,260 array. 681 00:32:56,260 --> 00:32:59,080 Again, D4M is designed to sort of overcome that. 682 00:32:59,080 --> 00:33:03,300 For those of us who have more mathematical expertise, 683 00:33:03,300 --> 00:33:05,650 we can do much more sophisticated things 684 00:33:05,650 --> 00:33:08,530 than you might be able to do in Hadoop. 685 00:33:08,530 --> 00:33:13,970 Again, it allows you to look at your data in four ways at once. 686 00:33:13,970 --> 00:33:18,200 You can view it as 2D matrices, and reference rows and columns 687 00:33:18,200 --> 00:33:21,190 with strings, and have values that are strings. 688 00:33:21,190 --> 00:33:23,030 It's one-to-one with a triple store, 689 00:33:23,030 --> 00:33:25,730 so you can easily connect to databases. 690 00:33:25,730 --> 00:33:30,140 Again, it looks like matrices, so you can do linear algebra. 691 00:33:30,140 --> 00:33:33,640 And also, through the duality between adjacency matrices 692 00:33:33,640 --> 00:33:37,990 and graphs, you can think about your data as graphs. 693 00:33:37,990 --> 00:33:41,540 This is composable, as I've said before. 
694 00:33:41,540 --> 00:33:44,640 Almost all of the operations that are performed on an associative 695 00:33:44,640 --> 00:33:46,860 array return another associative array, 696 00:33:46,860 --> 00:33:50,870 and so we can do things like add them, subtract them, and them, 697 00:33:50,870 --> 00:33:54,230 or them, multiply them, and we can do very complicated queries 698 00:33:54,230 --> 00:33:55,212 very easily. 699 00:33:55,212 --> 00:33:56,920 And these can work on associative arrays. 700 00:33:56,920 --> 00:34:01,270 And if we're bound to tables, they can also work on those. 701 00:34:01,270 --> 00:34:03,100 Speaking of tables, I've already talked 702 00:34:03,100 --> 00:34:05,720 about the schema that we always use in this class. 703 00:34:05,720 --> 00:34:08,500 So your standard data might look like this, 704 00:34:08,500 --> 00:34:12,159 say this is a cyber record of a source IP, domain, 705 00:34:12,159 --> 00:34:16,130 and destination IP, this is sort of the standard tabular view. 706 00:34:16,130 --> 00:34:21,340 We explode it by taking the column and appending the value to it, 707 00:34:21,340 --> 00:34:25,310 which creates this very large, sparse table here 708 00:34:25,310 --> 00:34:28,767 which will naturally go into our triple store. 709 00:34:28,767 --> 00:34:30,350 Of course, by itself it doesn't really 710 00:34:30,350 --> 00:34:33,370 gain us anything, because most tables 711 00:34:33,370 --> 00:34:35,474 are either row based or column based. 712 00:34:35,474 --> 00:34:38,320 The databases that we are working with are row based, 713 00:34:38,320 --> 00:34:41,250 which allow you to do fast lookup of a row key. 714 00:34:41,250 --> 00:34:44,360 However, once we've exploded the schema, if we also 715 00:34:44,360 --> 00:34:49,010 store the transpose, we now can index everything here quickly 716 00:34:49,010 --> 00:34:51,210 and efficiently. 
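The exploded schema just described can be sketched in a few lines of Python. The row ID, column names, and the "|" separator here are illustrative choices, not the exact D4M conventions: the idea is that each (column, value) pair becomes its own sparse column, and storing the transposed triples alongside gives fast lookup by column as well as by row.

```python
# One tabular record, keyed by a hypothetical row ID.
record_id = "log-0001"
record = {"srcIP": "128.0.0.1", "domain": "example.com", "destIP": "10.1.2.3"}

# Explode: each column becomes "column|value" with a value of 1,
# yielding (row, column, value) triples for the triple store.
triples = [(record_id, f"{col}|{val}", 1) for col, val in record.items()]

# Store the transpose too, so column lookups are as fast as row lookups.
transpose = [(col_val, row, v) for row, col_val, v in triples]

print(triples[0])  # ('log-0001', 'srcIP|128.0.0.1', 1)
```

With both orientations stored, a row-oriented database can answer "all columns of this record" from the first table and "all records with this value" from the second, which is what makes every string look indexed.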
717 00:34:51,210 --> 00:34:53,290 It looks like we have indexed every single string 718 00:34:53,290 --> 00:34:55,659 in the database with one schema, which 719 00:34:55,659 --> 00:34:59,050 is a very nice, very powerful stepping stone 720 00:34:59,050 --> 00:35:02,472 for getting results. 721 00:35:02,472 --> 00:35:04,180 All right, I'm going to talk a little bit 722 00:35:04,180 --> 00:35:06,120 about what we have done here. 723 00:35:06,120 --> 00:35:08,180 So I'm just going to go over some basic analytics 724 00:35:08,180 --> 00:35:09,960 and show you a little bit more code. 725 00:35:09,960 --> 00:35:11,830 So for example, here is our table. 726 00:35:14,730 --> 00:35:17,390 So this could be viewed as a sparse matrix. 727 00:35:17,390 --> 00:35:19,770 We have various source IPs here. 728 00:35:19,770 --> 00:35:23,470 And I want to just compute some very elementary statistics 729 00:35:23,470 --> 00:35:25,840 on this data. 730 00:35:25,840 --> 00:35:30,910 I should say, computing things like sums and averages 731 00:35:30,910 --> 00:35:36,340 may not sound like very sophisticated statistics. 732 00:35:36,340 --> 00:35:39,460 It's still extraordinarily powerful, 733 00:35:39,460 --> 00:35:42,830 and we are amazed at how valuable it 734 00:35:42,830 --> 00:35:46,320 is, because it usually shows you right 735 00:35:46,320 --> 00:35:50,730 away the bad data in your data. 736 00:35:50,730 --> 00:35:53,680 And you will always have bad data. 737 00:35:53,680 --> 00:35:56,640 And this is probably the first bit of value 738 00:35:56,640 --> 00:35:59,320 add we provide to almost any customer 739 00:35:59,320 --> 00:36:03,160 that we work with on D4M: it's the first time 740 00:36:03,160 --> 00:36:05,980 anybody's actually sort of looked 741 00:36:05,980 --> 00:36:09,270 at their data in totality, and been able to do 742 00:36:09,270 --> 00:36:10,670 these types of things. 
743 00:36:10,670 --> 00:36:12,567 And so usually the first thing we do is like, 744 00:36:12,567 --> 00:36:14,400 all right, we've loaded their data into the schema, 745 00:36:14,400 --> 00:36:15,649 and we've done some basic sums. 746 00:36:15,649 --> 00:36:19,260 And then we'll say, do you know that you have the following 747 00:36:19,260 --> 00:36:22,590 stuck-on switches? 748 00:36:22,590 --> 00:36:25,070 You know, 8% of your data all has 749 00:36:25,070 --> 00:36:29,040 this value in this column, which absolutely can't be right, 750 00:36:29,040 --> 00:36:29,800 you know? 751 00:36:29,800 --> 00:36:32,836 And 8% could be a low enough number that you might just not 752 00:36:32,836 --> 00:36:34,460 encounter it through routine traversal, 753 00:36:34,460 --> 00:36:39,380 but it will stand out extremely clearly in just the most 754 00:36:39,380 --> 00:36:40,730 rudimentary histogram. 755 00:36:40,730 --> 00:36:43,714 And so that's an incredibly valuable piece of [INAUDIBLE]. 756 00:36:43,714 --> 00:36:45,380 They'll usually be like, oh my goodness, 757 00:36:45,380 --> 00:36:47,430 there's something broken in our system. 758 00:36:47,430 --> 00:36:50,920 Or you'll be like, you have this column, 759 00:36:50,920 --> 00:36:52,954 but it's almost never filled in, and it sure 760 00:36:52,954 --> 00:36:54,370 looks like it should be filled in, 761 00:36:54,370 --> 00:36:57,830 so these switches are stuck off. 762 00:36:57,830 --> 00:37:00,360 I encourage you, when you first get the data, 763 00:37:00,360 --> 00:37:03,610 to just kind of do the basic statistics on it 764 00:37:03,610 --> 00:37:07,550 and see where things are working and where things are broken. 765 00:37:07,550 --> 00:37:12,301 And usually you can then fix them. 766 00:37:12,301 --> 00:37:14,175 And most of the time when we tell a customer, 767 00:37:14,175 --> 00:37:16,080 it's like, we're just letting you 768 00:37:16,080 --> 00:37:18,010 know these are issues in your data. 
769 00:37:18,010 --> 00:37:19,110 We can work around them. 770 00:37:19,110 --> 00:37:21,660 We can sort of ignore them and still proceed, 771 00:37:21,660 --> 00:37:23,209 or you can fix them yourselves. 772 00:37:23,209 --> 00:37:24,750 And almost always they're like, no, we 773 00:37:24,750 --> 00:37:25,930 want to fix that ourselves. 774 00:37:25,930 --> 00:37:27,130 That's something that's fundamentally 775 00:37:27,130 --> 00:37:28,005 broken in our system. 776 00:37:28,005 --> 00:37:29,964 We want to make sure that that data is correct. 777 00:37:29,964 --> 00:37:31,463 Because usually that data is flowing 778 00:37:31,463 --> 00:37:33,410 into all sorts of other places and being acted on, 779 00:37:33,410 --> 00:37:35,750 and so that's one thing we're going to do. 780 00:37:35,750 --> 00:37:38,590 So we're going to do some basic statistics here. 781 00:37:38,590 --> 00:37:43,380 We're going to compute how many times each column appears. 782 00:37:43,380 --> 00:37:48,130 We're going to compute how many times each column type appears. 783 00:37:48,130 --> 00:37:51,380 We'll compute some covariance matrices, just some very, very 784 00:37:51,380 --> 00:37:52,962 simple types of things here. 785 00:38:01,870 --> 00:38:05,070 All right, let's move on here. 786 00:38:05,070 --> 00:38:08,420 So this is the basic implementation. 787 00:38:08,420 --> 00:38:11,825 So we're going to give it a set of rows 788 00:38:11,825 --> 00:38:12,950 that we're looking at here. 789 00:38:12,950 --> 00:38:15,210 This could be very large. 790 00:38:15,210 --> 00:38:19,220 We have a table binding, T, so we say, please 791 00:38:19,220 --> 00:38:23,930 go and get me all those rows. 792 00:38:23,930 --> 00:38:25,520 So we get the whole swath of rows, 793 00:38:25,520 --> 00:38:28,730 and that will be returned as an associative array. 
794 00:38:28,730 --> 00:38:32,380 And then, normally I shorten this to double logical. 795 00:38:32,380 --> 00:38:34,912 Since the table always contains strings, 796 00:38:34,912 --> 00:38:36,620 it returns string values, and the first thing 797 00:38:36,620 --> 00:38:37,590 we need to do is get rid of those, 798 00:38:37,590 --> 00:38:38,882 because we're going to do math. 799 00:38:38,882 --> 00:38:41,256 So we turn them into logicals, and then we turn them back 800 00:38:41,256 --> 00:38:42,520 into doubles so we can do math. 801 00:38:46,060 --> 00:38:48,160 You can actually pass regular expressions. 802 00:38:48,160 --> 00:38:49,900 You can also do the StartsWith command. 803 00:38:49,900 --> 00:38:52,110 We want to get the source IP. 804 00:38:52,110 --> 00:38:54,269 We're just interested in source IP and domain. 805 00:38:54,269 --> 00:38:56,560 And the first thing we do is find some popular columns. 806 00:38:56,560 --> 00:38:59,450 So we just type sum, and that shows us the popular columns. 807 00:38:59,450 --> 00:39:01,300 If we want to find popular pairs, 808 00:39:01,300 --> 00:39:04,370 it's a covariance matrix calculation there, 809 00:39:04,370 --> 00:39:10,270 or square in, and find domains with many destination IPs. 810 00:39:10,270 --> 00:39:13,954 As an example, in this data set, 811 00:39:13,954 --> 00:39:15,120 the data set that we had, 812 00:39:15,120 --> 00:39:20,670 these were the most popular things that appeared here. 813 00:39:20,670 --> 00:39:23,280 And you can see, this is all fairly reasonable stuff, 814 00:39:23,280 --> 00:39:29,500 L.L.Bean, a lot of New England Patriots fans in this data set, 815 00:39:29,500 --> 00:39:31,870 a lot of ads, Staples. 816 00:39:31,870 --> 00:39:34,120 These are the types of things it shows you right away. 817 00:39:37,780 --> 00:39:41,020 Here's the covariance matrix of this data. 818 00:39:41,020 --> 00:39:43,860 So as you can see, it's symmetric. 
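The sum and covariance analytics described here can be sketched in plain Python. A tiny dense 0/1 matrix stands in for the sparse associative array, and the exploded column names are made up; the point is that "popular columns" is a column sum, and "popular pairs" is the transpose-times-itself product.

```python
# A tiny 0/1 document-by-exploded-column matrix, dense for clarity.
cols = ["domain|llbean.com", "domain|patriots.com", "destIP|10.0.0.1"]
A = [[1, 1, 0],
     [1, 0, 1],
     [1, 1, 1]]

# "Popular columns": sum down each column.
col_sums = [sum(row[j] for row in A) for j in range(len(cols))]

# "Popular pairs": A' * A counts how often each pair of columns
# co-occurs in the same row -- the covariance-style calculation.
AtA = [[sum(A[i][j] * A[i][k] for i in range(len(A)))
        for k in range(len(cols))]
       for j in range(len(cols))]

print(col_sums)  # [3, 2, 2]
print(AtA[0][1])  # rows where the two domains co-occur: 2
```

Note that the result is symmetric and its diagonal is just the column sums, which matches the structure pointed out on the covariance slide.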
819 00:39:43,860 --> 00:39:48,320 Obviously there is a diagonal, and it 820 00:39:48,320 --> 00:39:50,000 has a bipartite structure. 821 00:39:50,000 --> 00:39:54,900 That is, there are no destination-IP-to-destination-IP links. 822 00:39:54,900 --> 00:39:58,390 And we have source IP to destination IP, to domain, and all 823 00:39:58,390 --> 00:40:00,600 that type of stuff. 824 00:40:00,600 --> 00:40:03,745 The covariance matrix is often an extremely helpful thing 825 00:40:03,745 --> 00:40:05,620 to look at, because it will be the first time 826 00:40:05,620 --> 00:40:08,220 it shows you the full interrelated structure 827 00:40:08,220 --> 00:40:09,150 of your data. 828 00:40:09,150 --> 00:40:13,210 And you can quickly identify dense rows, dense columns, 829 00:40:13,210 --> 00:40:16,360 blocks, chunks, the things that you want to look at, 830 00:40:16,360 --> 00:40:18,110 chunks that aren't going to be interesting 831 00:40:18,110 --> 00:40:20,690 because they're very, very dense or very, very sparse. 832 00:40:20,690 --> 00:40:24,160 Again, just a very basic sort of survey type of tool. 833 00:40:27,310 --> 00:40:29,880 A little shout out here to our colleagues 834 00:40:29,880 --> 00:40:32,130 in the other group, who developed Structured Knowledge 835 00:40:32,130 --> 00:40:32,630 Space. 836 00:40:32,630 --> 00:40:35,560 Structured Knowledge Space is a very powerful analytic 837 00:40:35,560 --> 00:40:38,240 system. 838 00:40:38,240 --> 00:40:41,410 I kind of tell people it's like the Google search 839 00:40:41,410 --> 00:40:43,850 where when you type a search key, 840 00:40:43,850 --> 00:40:47,120 it starts guessing what your next answer is. 841 00:40:47,120 --> 00:40:50,307 But Google does that based on sort of a long sort of analysis 842 00:40:50,307 --> 00:40:52,640 of all the different types of searches people have done, 843 00:40:52,640 --> 00:40:54,250 and it can make a good guess. 
844 00:40:54,250 --> 00:40:58,650 SKS does that much more deeply, in that it maintains a database 845 00:40:58,650 --> 00:41:00,400 on a collection of documents. 846 00:41:00,400 --> 00:41:05,620 And in this case, if you type Afghanistan, 847 00:41:05,620 --> 00:41:08,397 it will then actually go and count the occurrences 848 00:41:08,397 --> 00:41:10,480 of different types of entities in those documents, 849 00:41:10,480 --> 00:41:13,400 and then show you what the possible next key word 850 00:41:13,400 --> 00:41:16,800 choices can be, based on the actual data itself, 851 00:41:16,800 --> 00:41:19,980 not on a set of cached queries 852 00:41:19,980 --> 00:41:21,924 and other types of things. 853 00:41:21,924 --> 00:41:23,840 I don't really know what Google actually does. 854 00:41:23,840 --> 00:41:24,710 I'm just guessing. 855 00:41:24,710 --> 00:41:27,450 I'm sure they do use some data for guiding their search, 856 00:41:27,450 --> 00:41:30,460 but this is a fairly sophisticated analytic. 857 00:41:30,460 --> 00:41:32,954 And one of the big sort of wins, where we 858 00:41:32,954 --> 00:41:34,620 knew we were on the right track with D4M, 859 00:41:34,620 --> 00:41:36,530 is that this had been implemented 860 00:41:36,530 --> 00:41:38,836 in hundreds or thousands of lines of Java and SQL. 861 00:41:38,836 --> 00:41:40,710 And I'm going to show you how we implement it 862 00:41:40,710 --> 00:41:44,450 in one line in D4M. 863 00:41:44,450 --> 00:41:48,750 So let's go over that algorithm. 864 00:41:48,750 --> 00:41:52,240 So here's an example of what my data might look like. 865 00:41:52,240 --> 00:41:54,460 I have a bunch of documents here. 866 00:41:54,460 --> 00:41:57,280 I have a bunch of entities, and wherever an entity appears 867 00:41:57,280 --> 00:41:58,600 in a document, 868 00:41:58,600 --> 00:42:00,580 I have a dot. 
869 00:42:00,580 --> 00:42:03,800 I'm going to have my associative array over here, 870 00:42:03,800 --> 00:42:07,140 which is matching from the strings to the reals. 871 00:42:07,140 --> 00:42:10,430 And my two facets are just going to be two column names here, 872 00:42:10,430 --> 00:42:14,010 so I'm going to pick a facet y1 and y2. 873 00:42:14,010 --> 00:42:16,150 So I'm going to pick these two columns. 874 00:42:16,150 --> 00:42:20,050 All right, so basically that's what this does. 875 00:42:20,050 --> 00:42:23,980 This says get me y1 and get me y2. 876 00:42:23,980 --> 00:42:26,010 I'm using this bar here to say I'm 877 00:42:26,010 --> 00:42:28,800 going to knock away the facet name afterwards 878 00:42:28,800 --> 00:42:30,350 so I can AND them together. 879 00:42:30,350 --> 00:42:34,330 So that gets me all the documents that basically 880 00:42:34,330 --> 00:42:36,970 contain both UN and Carl, and then I 881 00:42:36,970 --> 00:42:38,850 can go and compute the counts that you 882 00:42:38,850 --> 00:42:41,770 saw on the previous page just 883 00:42:41,770 --> 00:42:43,260 by doing this matrix multiply. 884 00:42:43,260 --> 00:42:45,340 So I AND them, I transpose, and I 885 00:42:45,340 --> 00:42:48,690 matrix multiply to get them together, and then there we go. 886 00:42:52,030 --> 00:42:54,960 All right, so now I'm going to get into the demo, 887 00:42:54,960 --> 00:42:57,970 and I think we have plenty of time for that. 888 00:42:57,970 --> 00:43:00,650 I'll sort of set it up, and then we'll take a short break. 889 00:43:00,650 --> 00:43:04,379 People can have cookies, and then we 890 00:43:04,379 --> 00:43:06,670 will get into the demo, which really begins to show you 891 00:43:06,670 --> 00:43:11,054 a way to use D4M on real data. 892 00:43:11,054 --> 00:43:12,970 So the data that we're going to work with here 893 00:43:12,970 --> 00:43:14,320 is the Reuters Corpus. 
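[The facet-search step just described, grab the two entity columns, AND them to find the documents containing both, then transpose and matrix-multiply to count co-occurring entities, can be sketched in plain Python/NumPy. This is not the D4M one-liner itself; the tiny incidence matrix and entity names are invented for illustration:]

```python
import numpy as np

# Toy entity/document incidence matrix E: rows = documents,
# columns = entities; E[d, e] = 1 if entity e appears in document d.
entities = ["UN", "Carl", "DC", "Al"]
E = np.array([
    [1, 1, 0, 1],   # doc1 mentions UN, Carl, Al
    [1, 0, 1, 0],   # doc2 mentions UN, DC
    [1, 1, 1, 0],   # doc3 mentions UN, Carl, DC
    [0, 1, 0, 1],   # doc4 mentions Carl, Al
])
col = {name: i for i, name in enumerate(entities)}

# Facets y1 = "UN", y2 = "Carl": AND the two columns to get the
# documents containing both entities ...
both = E[:, col["UN"]] & E[:, col["Carl"]]   # 1 where a doc has both

# ... then transpose and matrix-multiply to count, for every entity,
# how often it co-occurs with the chosen facet pair.
counts = E.T @ both
for name, c in zip(entities, counts):
    print(name, c)
```

[Here docs 1 and 3 contain both UN and Carl, so the counts are taken over just those two rows.]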
894 00:43:14,320 --> 00:43:20,340 So this was a corpus of data released in 2000 to help spur 895 00:43:20,340 --> 00:43:21,480 research in this community. 896 00:43:21,480 --> 00:43:23,670 We're very grateful to Reuters for doing this. 897 00:43:23,670 --> 00:43:27,030 They gave the data to NIST, who actually 898 00:43:27,030 --> 00:43:29,860 stewards the data sets. 899 00:43:29,860 --> 00:43:32,370 It's a set of Reuters news reports 900 00:43:32,370 --> 00:43:37,290 over this particular time frame, the mid 90s, 901 00:43:37,290 --> 00:43:40,630 and it's like 800,000 total. 902 00:43:40,630 --> 00:43:42,515 And what we have done-- we're not actually 903 00:43:42,515 --> 00:43:44,140 working with the straight Reuters data; 904 00:43:44,140 --> 00:43:46,560 we've run it through various parsers that 905 00:43:46,560 --> 00:43:48,770 have extracted just the entities, so 906 00:43:48,770 --> 00:43:51,540 the people, the places, and the organizations. 907 00:43:51,540 --> 00:43:53,240 So it's sort of a summary. 908 00:43:53,240 --> 00:43:57,980 It's a very terse summarization of the data, if that. 909 00:43:57,980 --> 00:44:01,510 It's really sort of a derived product. 910 00:44:01,510 --> 00:44:03,770 The data is power law, so if we look 911 00:44:03,770 --> 00:44:07,740 at the documents per entity, you see here 912 00:44:07,740 --> 00:44:13,070 that a few people, places, and organizations 913 00:44:13,070 --> 00:44:16,490 appear in lots of places, and most of them 914 00:44:16,490 --> 00:44:20,390 appear in just a few documents, and you see the same thing 915 00:44:20,390 --> 00:44:22,590 when you look at the entities per document. 916 00:44:22,590 --> 00:44:24,130 So essentially, you could view 917 00:44:24,130 --> 00:44:26,690 this as computing the in degrees and the out 918 00:44:26,690 --> 00:44:28,820 degrees of this bipartite graph. 
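[The documents-per-entity and entities-per-document counts the lecture plots are just the column sums and row sums of the document-entity incidence matrix. A Python/NumPy sketch on synthetic, Zipf-weighted data, not the actual Reuters corpus, with sizes chosen only for illustration:]

```python
import numpy as np

# Hypothetical entity/document incidence matrix standing in for the
# Reuters entity data: rows = documents, columns = entities.
rng = np.random.default_rng(0)
n_docs, n_ents = 1000, 200

# Zipf-like column weights: a few "popular" entities, a long tail.
popularity = 1.0 / np.arange(1, n_ents + 1)
E = (rng.random((n_docs, n_ents)) < popularity).astype(int)

# Degrees of the bipartite document-entity graph:
docs_per_entity = E.sum(axis=0)  # how many documents each entity is in
ents_per_doc = E.sum(axis=1)     # how many entities each document has

# Histogram of documents-per-entity; on log-log axes this shows the
# heavy-tailed, power-law-like shape the lecture describes.
hist = np.bincount(docs_per_entity)
```

[Both degree sums count the same set of dots in the matrix, so they must agree; that is a handy sanity check on real data too.]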
919 00:44:28,820 --> 00:44:31,910 And it has the classic power law shape 920 00:44:31,910 --> 00:44:34,450 that we expect to see in our data. 921 00:44:34,450 --> 00:44:37,360 So with that, we're going to go into the demo. 922 00:44:37,360 --> 00:44:41,430 And just to summarize, so just recall, 923 00:44:41,430 --> 00:44:43,430 the evolution of the web has really 924 00:44:43,430 --> 00:44:45,940 created a new class of technologies. 925 00:44:45,940 --> 00:44:50,070 We really see the web moving towards game-style interfaces, 926 00:44:50,070 --> 00:44:52,270 triple store databases, and technologies 927 00:44:52,270 --> 00:44:56,320 like D4M for doing analysis-- very new technology. 928 00:44:56,320 --> 00:44:59,460 And just for the record, 929 00:44:59,460 --> 00:45:01,377 there is no assignment this week. 930 00:45:01,377 --> 00:45:03,710 So for those of you who are still doing the assignments, 931 00:45:03,710 --> 00:45:05,400 yay. 932 00:45:05,400 --> 00:45:08,307 For the example, we actually won't do both of these today; 933 00:45:08,307 --> 00:45:09,390 we'll just do one of them. 934 00:45:09,390 --> 00:45:11,670 Oh, and I keep making this mistake. 935 00:45:11,670 --> 00:45:12,410 I'll fix it. 936 00:45:12,410 --> 00:45:13,430 I'll fix it later. 937 00:45:13,430 --> 00:45:17,020 But it's in the examples directory. 938 00:45:17,020 --> 00:45:19,850 We've now moved on to the second subfolder, apps, 939 00:45:19,850 --> 00:45:21,410 entity analysis. 940 00:45:21,410 --> 00:45:24,120 And you guys are encouraged to run those examples, which 941 00:45:24,120 --> 00:45:25,680 we will now do shortly. 942 00:45:25,680 --> 00:45:27,870 So we will take a short five-minute break, 943 00:45:27,870 --> 00:45:31,870 and then continue on with the demo.