The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JEREMY KEPNER: All right, well, I want to thank you all for coming to what I have advertised as the penultimate course in this lecture series. Everything else we've done up to this point has sort of been building up to actually, finally, really using databases. And hopefully you haven't been too disappointed at how long I've led you along here to get to this point. But the point being that if you have certain abstract concepts in your mind, then once we get to the database part, it just feels very straightforward. If we do the database piece and you don't have those concepts, then you can easily get distracted by extraneous information.

So today there are no viewgraphs, and I'm sure you're all thrilled about that. It's going to be all demos showing interaction with the actual technologies that we have here. And everything I'm showing is stuff that you can use. So I'm just going to kind of get into this, and let's start with step one. We're going to be using the Accumulo databases that we've set up. We have a clearinghouse of these databases on our LLGrid system, and you can get to that list by going to this web page here, dbstatusllgrid.ll.mit.edu.

And when you go there, it will prompt you for your password. And then it will show you the databases that you have access to. Now, I have access to all the databases, but you should only see these class databases if you log into that. And so as you see here, we have five databases that are set up. These are five independent instances of Accumulo. And I started a couple already, and we can even take a look at them. So this is what a running Accumulo instance looks like.
This is its main page here. And it shows you how much disk is used, and the number of tables, and all that type of stuff. And it gives you a nice history that shows ingest rate over the last few [INAUDIBLE], and scan rate. This is all in entries per second, ingest in megabytes, all different kinds of really useful information here. And you'll see that this has got the URL of classdb01.cloud.llgrid. When I started it, there was an actual machine allocated to it.

In fact, just for fun here, I could turn one of these on. You are free to start them. I wouldn't encourage you to hit the Stop button, because if someone else is using it and you hit Stop, that may not be something you want to do. But it's the same setup. For instance, if you have a project, everyone in the class can see this, because we've made you all a part of the class group. But you can see here there are other classes. We have a bioinformatics group, and they have a couple of databases. Those are there; they're not running right now. There's a very large graph database group, and it's running now. And just to show that it's running, you see this has about 200 gigabytes of data. And if we look in the tables here, we see we have a few tables. Here are some tables with a few billion entries that have been put in there. And this is really what Accumulo does very, very well. But I'm going to start one just for fun here, if that works. And so it will be starting that, and all that happens, and you can see it's starting, and all that type of stuff.

So we're going to get going here now with the specific examples. And I have these. Just so you know, today's examples are in the Examples directory, in the Scaling directory, in 2ParallelDatabase. So this is the directory we're going to be going through today.
And we have a lot of examples to get through, because we're going to be covering a lot of ground here about how you can take advantage of D4M and Accumulo together.

So the first thing I'm going to do is go here. I'm going to run these. I have, essentially, two versions of the code: one that's going to do fairly small databases on my laptop, and another version sitting in my LLGrid account that I can do some bigger things with.

So to get started, we're going to do this first example, pDB01 DataTest. In order to do database work, and to test data, we need to generate some data. And so I'm using a built-in data generator that we have called the Kronecker graph. It's basically borrowed from a benchmark called the Graph500 benchmark. There's actually a list called Graph 500, and I helped write that benchmark. In fact, the Matlab code on that website is stuff that I originally wrote, and other people have since modified.

And so this is a graph generator. It generates a very large power law graph using a Kronecker product approach. And it has a few parameters here: a scale parameter, which basically sets the number of vertices. So 2 to this scale parameter is approximately the number of vertices; 2 to the 12th gets you about 4,000 vertices. It then creates a certain number of edges per vertex, so 16 edges per vertex. And so this computes n max as 2 to the scale, and then the number of edges is edges per vertex times n max. This is the maximum number of edges. And then it generates this, and it comes back with two vectors: the first vector is a list of starting vertices, and the second vector is a list of ending vertices. And we're not really using any D4M here. We're just creating a sparse adjacency matrix of that data, showing it, and then plotting the degree distribution.
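In sketch form, this generation step looks something like the following, assuming the Graph500-style Kronecker generator is exposed as KronGraph500NoPerm, as in the D4M example code (adjust the name if your copy differs):

    SCALE = 12;                           % 2^SCALE is about 4,000 vertices
    EdgesPerVertex = 16;
    Nmax = 2^SCALE;                       % maximum number of vertices
    M = EdgesPerVertex .* Nmax;           % maximum number of edges
    [StartVertex, EndVertex] = KronGraph500NoPerm(SCALE, EdgesPerVertex);
    A = sparse(StartVertex, EndVertex, 1, Nmax, Nmax);  % sparse adjacency matrix
    spy(A);                               % shows the recursive Kronecker structure
    outDeg = full(sum(A, 2));             % out degree of every vertex
    counts = hist(outDeg, 1:max(outDeg)); % tally vertices at each degree
    loglog(1:max(outDeg), counts, 'o');   % power law degree distribution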
So if we look at that figure, this shows the adjacency matrix of this graph, start vertex to end vertex. These Kronecker graphs have this sort of recursive structure, and if you kept zooming in, you would see that the graph looks like itself in a recursive way. That's what gives us this power law distribution. And this is a relatively small graph. This particular data generator is chosen because you can make enormous graphs in parallel very easily. If we had to pass around large data sets every single time we wanted to test our software, it would be prohibitive, because we'd be passing around gigabytes and terabytes. And I think the largest this has ever been run is at a scale of 2 to the 37, so that's on the order of a hundred billion vertices, getting towards a trillion.

And then we do the degree distribution of this. And you see here, it creates a power law distribution. We have a few vertices with only one connection, and we always have a super node with a lot of connections. And you can actually see here the Kronecker structure in this data, which creates this characteristic sawtooth pattern. There are ways to get rid of that if you want, but for our purposes, having that structure there is no problem. So this is exactly what the degree distribution looks like.

So that's just a small version to show you what the data looks like. Now we're going to create a bigger version. So this program, which I'll now show you, creates essentially the same Kronecker graph, but it's going to do it eight times. And one of the nice things about this generator is that if you just keep calling it, it gives you more independent samples from the same graph. So we're just creating a graph that's got eight times as many edges as the previous one by calling it over and over again, just from the random number generator. So I'm going to do this eight times.
And I'm going to save each one of those to a separate file. So I create a file name. I'm actually setting the random number seed from the file name, so that, if I want, the seventh file will always have essentially the same random sequence regardless of when I run it. And so I create my vertices, and I'm going to convert these to strings, and then write these out to files. And that's all this does here.

And one of the things I do throughout this process, as you will see, is I keep track of how many edges per second I'm generating. So here, I'm generating about 150,000. It varies, between 30,000 and 150,000 or 180,000 edges per second. When you're creating a whole data processing pipeline, that's essentially the kind of metric you're looking at. Some steps might process your edges extremely quickly, and other steps might process your edges more slowly. And the slow ones are, obviously, the ones where you want to put more energy and effort.

So we can actually now go and look. It stuck it in this data directory here, and we just created that. And so, basically, we write it out in three files. Essentially, each one of these holds one part of a triple: a row, a column, and a value. So if we look at the row, you can just see it's a sequence of strings separated by commas; same with the column, just a separate sequence of strings separated by commas. And then in this case, the values we just made all ones, nothing fancy there. So now we have eight files. That's great. We generated those very quickly. And now we want to do a little processing on them.
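Here is a sketch of that generate-and-write loop (StrFileWrite is assumed here as the D4M string-file helper; the file layout follows the description above):

    Nfile = 8;
    for i = 1:Nfile
      fname = ['data/' num2str(i)];
      rand('seed', i);                        % seed follows the file number
      [v1, v2] = KronGraph500NoPerm(SCALE, EdgesPerVertex);
      rowStr = sprintf('%d,', v1);            % start vertices, comma separated
      colStr = sprintf('%d,', v2);            % end vertices, comma separated
      valStr = repmat('1,', 1, numel(v1));    % every value is 1
      StrFileWrite(rowStr, [fname '.row']);
      StrFileWrite(colStr, [fname '.col']);
      StrFileWrite(valStr, [fname '.val']);
    end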
So if we go to pDB03, the first thing we're going to do is read those files back in and construct associative arrays, because the associative array construction takes a little time, and we're going to want to use the result over and over again. So we might as well take those triples, construct them into associative arrays, and save them out as Matlab binary files. And then that will be something we can work with very quickly.

So we're going to do that. So there you go. It read them in, and it shows you the rate at which it reads them in, and then essentially writes them out, and then gives us another example of the edges per second. And now you see we have Matlab files for each one of those. And, not surprisingly, the Matlab file is smaller than the three input triple files that produced it. So this is a 24 kilobyte Matlab file, and it was probably about 80 kilobytes of input data. And that's just because we've compressed all the row keys into single vectors, and we have the sparse adjacency matrix, which stores things compactly. And so that makes it a little bit better there.

If we actually look at that program, we can see we're basically reading the files in, and then what we're doing is creating an associative array. We read in each set of triples, and then the constructor takes the list of row strings and column strings. Since we knew the values were all one, we just let that be one. And then there's this optional fourth argument that tells us what to do if we put in two triples with the same row and column. The default is that it will just do the min. So if I have a collision, it will just do a min. But I can give it this optional fourth argument, @sum. In fact, you can put essentially any binary operation there, but @sum will just add them together. So now, in the associative array, a particular row and column will hold how many times that edge occurred. And so we're summing up as we go here. And then after we create the associative array, we save it out to a file. And so we have that step done.
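In sketch form, the construction-and-save step (StrFileRead assumed as the matching read helper):

    for i = 1:Nfile
      fname = ['data/' num2str(i)];
      rowStr = StrFileRead([fname '.row']);
      colStr = StrFileRead([fname '.col']);
      % Fourth argument is the collision function: the default is min,
      % @sum tallies repeated (row, column) edges instead.
      A = Assoc(rowStr, colStr, 1, @sum);
      save([fname '.A.mat'], 'A');            % Matlab binary file for fast reload
    end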
Now, the whole reason I showed you this process is because now I'm actually in a position where I can start doing computation just on the files. As I said before, I don't have to use the database. If I'm going to do any kind of calculation that's going to involve traversing all the data, it's going to be faster just to read in those Matlab files and do my processing on that. It's also very easy to make parallel. I have a lot of files, so if I launch a parallel job, I can just have different processes reading separate files. It will scale very well, and the file read rates will be very fast. Reading these files in parallel will take much less time than trying to pull all the data out of the database again.

So we're going to do a little analytics here: pDB04. I'm going to take those eight files, read them all in, and accumulate the results as we go. And there we go. We get the in degree distribution and the out degree distribution of this result.

If you look at that program, you can see all we did is loop over all the files, load them in Matlab, and then basically sum the rows and add that to a temp variable, and sum the columns, and then plot them out. So we just sort of accumulated them as we went; a sketch of this loop follows below.

This method of just summing on top of an associative array is something that you can certainly do; it's a very convenient way to do it. I should say, though, and you can kind of see it here a little bit: you notice that the time is beginning to grow. It's not so clear here, because this took so little time. But on a larger example, what you would see is that every single time we did that, because we're building and then adding, we're basically redoing the construction process. And so, eventually, this will take longer, and longer, and longer. And so it's OK to do that for small stuff, or if you're only going to do it a few times.
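A minimal sketch of that accumulate-as-you-go loop, with hypothetical file names:

    AdegOut = Assoc('', '', '');              % empty associative arrays
    AdegIn  = Assoc('', '', '');
    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads A
      AdegOut = AdegOut + sum(A, 2);          % running out degree tally
      AdegIn  = AdegIn  + sum(A, 1);          % running in degree tally
    end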
But if you're going to be accumulating an enormous amount of data, then what we can actually do is use another version of this program, pDB04cat DegreeTest. And you can tell that was a little bit faster; you see here it's all in milliseconds. And this is a little bit longer program.

What we're doing here is basically the exact same thing. We're loading our Matlab file, and we're doing the sum. And then, since I know something about the structure of the result (basically, I'm summing the rows), I can just append that to a longer list, and then at the end do one large sum. And that's, obviously, much faster.

And these are the kinds of tricks you just need to be aware of. For a small amount of data, you just do the simple sum; that will be OK. But if you're accumulating over a large list, the simple approach essentially becomes almost an n squared operation in the loop variable.

And you can make it even faster still, because we are doing this concatenation here, and when you do a concatenation in Matlab, you're doing a malloc. If you want to make it even faster, you can pre-allocate a large buffer, append into that buffer, and then, when you hit the end of the buffer, do a sum at that point. And that's the fastest you can do. So with these tricks, very, very large sums can be done very quickly, and all with files. You don't need a database. And this is the way to go if you're doing an analytic where you really want to traverse most of your data set. If you just need to get pieces of the data, then the database will be a better tool.
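And here is a sketch of that concatenate-then-sum variant: append each file's partial-sum triples to growing lists, and build one associative array with @sum at the end (find on an associative array returns its triples):

    rowCat = ''; valCat = zeros(0, 1);
    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads A
      [r, c, v] = find(sum(A, 2));            % this file's partial out degrees
      rowCat = [rowCat r];                    % append comma-separated row keys
      valCat = [valCat; v(:)];                % append the matching counts
    end
    % One big construction; @sum collapses every repeated row key at once.
    AdegOut = Assoc(rowCat, 'OutDeg,', valCat, @sum);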
We did that. So those show how we work with files, and that's always a good place to start. Even if you are working with the database, if you find that you're doing one query over and over again, and you're going to keep working with that data, a lot of times it's better to just do that query once, save those results to a file, and then work with that file. And, again, this is something that people often do in our business.

Now we're going to get to the actual database part of it. So the first thing we have to do is set up our database. We're going to create some tables in Accumulo, and we want to create those first so they're created properly. And so I'm going to show you the program that does that. The first thing it's going to do is call a DBsetup command, and when you run these examples, you will have to modify this DBsetup program.

The first thing you'll notice is that each group with one of these databases is all using one user account. And you could say, well, that's not the best way to do it. Well, it's very consistent with the group structure, in that you're all users, and the database is there to share data amongst your group. And so it is not an uncommon practice to have a single user account in which you put that data.

So we have a bit of a namespace problem. If you all just ran this example together, you'd all create the exact same table, and all fill it up. So the first thing we're going to do is just prepend the table names. And I would suggest that instead of having my name there, you put your name there. And then we have a special command called DBsetupLLGrid, which basically creates a binding to a database just using the name of the database. So it's a special function, not a generic function. It only works with our LLGrid system, and it only works if you have mounted the LLGrid file system. So let me hide all this stuff here and get rid of that.
So as you see here, I have mounted the LLGrid file system. And you need to do that because the DBsetup command, when it binds to the database, actually goes and gets the keys from the LLGrid file system. And those keys are sitting in the group directory. So, basically, from a password management perspective, all we need to do is add you to the group, and then you have access to the database. Or if we remove you from the group, you no longer have access to the database. Otherwise, we'd have to distribute keys to every single user all the time. And so this is why we do that; it greatly simplifies things. But you will not be able to make a connection to one of these databases unless you are either logged into your LLGrid account, or, if you are on your own computer, you've mounted the file system, so that D4M knows where to look for the keys when you issue that setup command.

So if we look at that again, this is just a shorthand for the full DBserver command. If you were connecting directly to some database other than one of these (or you could even do it with these), you would have to pass in, essentially, a five argument call: the hostname of the computer and the port, the instance name (I guess there are a couple of instance names here), the name of the user, and then an actual password. And so that's the generic way to connect in general. But for those of you connected to LLGrid, we can just use this shorthand, which is very nice.
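In sketch form, the two binding styles (the host, instance, user, and password values here are placeholders, not real credentials, and the exact DBserver argument order may differ in your D4M version):

    % LLGrid shorthand: bind by database name alone; the keys come from the
    % group directory on the mounted LLGrid file system.
    DB = DBsetupLLGrid('classdb01');

    % Generic five-argument form: host and port, server type, instance
    % name, user, and password.
    DB = DBserver('classdb01.cloud.llgrid:2181', 'Accumulo', ...
                  'classdb01', 'AccumuloUser', 'MyPassword');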
Then we're going to build a couple of tables. The first thing we're going to do is create a table that's going to hold the adjacency matrix I just created with the files. And we're going to do that with a database pair. So if we have our database object here and we give it two string names, it will know to create two tables in the database, and return a binding to that table that's a transposed pair. So whenever we do an insert into that table, it will insert the row and the column in one table, and then flip those and insert the column and the row in the other table. And whenever you do a lookup, it will know, if it's doing a row lookup, to look in one table, and if it's doing a column lookup, to do it on the other table. And this allows you to do fast lookups of both rows and columns, and makes it all nice for you.

We're also going to want to take advantage of Accumulo's built-in ability to sum as we insert. And so we're going to create something that's going to hold the degree as we go. It's very useful to have these statistics, because a lot of times you want to look up something, but the first thing you want to do is see, well, how many of them are in there? And so if you create, essentially, a column vector with that information, it's very helpful. Later, we're going to do something where we actually store the raw edges that were created. When we create the adjacency matrix, we actually lose a little bit of information; when we create this edge matrix, we'll be able to preserve that information. And, likewise, we'll be doing the tallies of the edges in that as well. So that's what this does, and that's the setup. You'll need to modify that in this program here.

Basically, after the setup is done, it adds these accumulator things by designating certain columns to be what are called combiners. So in this adjacency degree table, I've said I want to create two new columns, an out degree column and an in degree column, and the operation that I want applied when there are collisions on those values is sum. It will then sum the values. And, likewise, with the edge degree table, since I only have one column there, it just gets a degree column with sum. So that's how we do that.
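A sketch of that setup (addColCombiner is assumed here as the D4M call for designating combiner columns; table names are prefixed as suggested above):

    myName = 'Kepner_';                                  % put YOUR name here
    Tadj     = DB([myName 'Tadj'], [myName 'TadjT']);    % transposed table pair
    TadjDeg  = DB([myName 'TadjDeg']);                   % degree tallies
    Tedge    = DB([myName 'Tedge'], [myName 'TedgeT']);  % raw edges, used later
    TedgeDeg = DB([myName 'TedgeDeg']);
    % Designate combiner columns: Accumulo itself sums colliding values.
    TadjDeg  = addColCombiner(TadjDeg,  'OutDeg,InDeg,', 'sum');
    TedgeDeg = addColCombiner(TedgeDeg, 'Degree,', 'sum');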
So if we now go to our database page here, you can see class DB3 started, and we can actually view the info. And this is what a nice, fresh, never before used Accumulo instance looks like. It has very little data. It has one table, a metadata table, and a trace table. And there are no errors or anything like that; a very clean instance.

So that's what one looks like. We're going to use one that we've already started, though, which is DB1. And if we look at the tables here, there are already some tables. Michelle ran a practice run on this, and [? Chansup ?] ran a practice run. I'm going to now create those tables by running the setup test live. There, it created all those tables. You can now see, did it actually work? So if I refresh that, you can see it's now created these six tables, which are empty.

We do have the ability to set the write and read permissions of these tables. Right now, everyone has the ability to read, write, and delete everyone else's tables in a class database. In a project, that's not such a difficult thing to manage; you all know that. But you could imagine a situation [INAUDIBLE] we had a big ingest, a corpus of data, and we don't want anybody to touch it. We can actually set the permissions: we can make it read only so that no one can delete it, or we can make it so it's still read and write, but it can't be deleted whole cloth. Those permissions exist.

A feature we will add to this database manager is a checkpoint feature. So, for instance, if you did a big ingest and have a bunch of data that you're very happy with, you can checkpoint it. You'll have to stop the database; then you can create a named checkpoint of that stopped database, and you can restart from it if for some reason your database gets corrupted. As I like to say, Accumulo, like all other databases, is stable for production, but it can be unstable for development. For new database users, the database will train you in terms of the things that you should not do to it.
And so over time, you will not do things that destroy data or cause your database to be very unhappy, and then you will have a nice production database, because you will only do things that make it happy. But in that phase where you're learning, or you're experimenting with things, as with any database, it's easy to issue commands that will put the database in a fairly unhappy state, even to the point of corrupting or losing the data. But for the most part, once it's up and running in production, it's solid. We have a database that's been running for almost two years continuously using a very old version of this software. It just continues to hum away. It's got billions of entries in it, and it's running on a single Mac. So we've seen that it has the same stability as just about any other database.

So now we've created these tables, and let's insert some data into them. So I do pDB06, and I'm now going to insert the adjacency matrix. And now it's basically reading in each file and ingesting that data. It's not a lot of data, so it doesn't take very long. And now you can see up here that data is getting ingested. And you see there: we just inserted about 62,000 entries in each of those two tables, and 25,000 more, which for Accumulo is a trivial amount. We just inserted 150,000 entries into a database in, essentially, the blink of an eye, which is pretty impressive. And that's really the real power of Accumulo: on a lot of other databases, 150,000 entries means a few minutes, and here you wouldn't even think twice about doing it.

So we can take a look at that program. So here we go. We had eight files here, and we basically loaded the data. One thing we have to remember is that our adjacency matrix has numeric values in an associative array, and Accumulo can only hold strings. So we have to call a num-to-string function, which will convert those numbers into strings to be stored.

So the first thing we do is load our adjacency matrix A. We convert the numeric values to strings, and we just do a put. So we can insert the associative array directly into the adjacency table. It pulls apart the triples, it knows how to take care of the fact that this is a transposed table pair, and it does that ingest for you. And, likewise, the same thing here for the degrees: we pulled the data out, summed it, and converted it to strings to do the out degree, and the same thing for the in degree.

And this is actually where the adjacency matrix comes in very handy, because when we're accumulating, if we put those raw triples into the insert without summing first, we'd essentially be redoing the complete insert. Pre-summing in our D4M program and then inserting the summed values usually saves an order of magnitude in the number of inserts; it's a nice way to save the database a little bit of trouble. And so we certainly recommend this type of approach.
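In sketch form, the ingest loop (num2str here stands in for the num-to-string conversion mentioned above, and putCol relabels the summed column; both are assumed D4M helpers):

    for i = 1:Nfile
      load(['data/' num2str(i) '.A.mat']);    % loads Assoc A, numeric values
      put(Tadj, num2str(A));                  % insert; the pair also writes the transpose
      % Pre-sum in D4M, then insert the tallies; the server-side 'sum'
      % combiner accumulates collisions across the eight files.
      put(TadjDeg, num2str(putCol(sum(A, 2), 'OutDeg,')));
      put(TadjDeg, num2str(putCol(transpose(sum(A, 1)), 'InDeg,')));
    end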
So the next one is pDB07. Now we're going to do a query; we're going to get some data out of that table. And so what we did here is we said, I want to pick 100 random vertices. So in this case, I randomly generate 100 random vertices over the range 1 to 1,000. I then convert those to strings, because these are numeric values and I will look up strings. And then the first thing I'm going to do is look up the degrees of those vertices. So I have my sum table here, called T adjacency degree. I'm going to pass those strings in, and then I'm going to get just the degrees. That's just a big, long, skinny vector. Looking things up from that takes much less time than looking up whole rows or columns, and it stores all those values for me.
So that's a great place to start. And then I want to say, you know what? I only care about degrees that are in a particular range. This is a very common thing to do. There will be vertices that are so common you're like, I don't care about those; I have their tally. These are sometimes called super nodes, and doing anything with them is a waste of my time, and forces me to end up traversing enormous swathes of the database. So I'll set an upper threshold here and a lower threshold. Basically, I take the degree (I'm just going to look at the out degree), and I say, I want things that are greater than the min and less than the max, and I want to get just those rows. So that's this query: a fairly complicated analytic done in just one line. And now I have a new set of vertices, which are just the rows that satisfy this requirement.

And then I'm going to look those up again. Before, I was just looking up their counts; now I'm going to get the whole rows of the ones that satisfied the thresholds, and I know that there are none with too many entries, or too few. And then I'm going to plot it.

And so if we look at the figure, there we see it. We ended up finding four rows that had (these should all be right) one, two, three, four, five, six, seven entries, so between five and 10. These should all be between five and 10. And then this shows what their actual columns were. And so that's a very quick example of that kind of analytic. And, again, this is real bread and butter, and it's basically standard from a signal processing perspective. The max is our clutter threshold: we don't want that, it's too bright. And then we'll have a noise threshold: we don't care about anything that's below a certain value. And this sort of narrows in on the signal. It's the same kind of processing we do in signal processing; we're doing it here on graphs. And Accumulo supports that very, very nicely for us.
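A sketch of that query sequence, with the thresholds from the demo (str2num converts the string values that come back from the table; variable names are hypothetical):

    Nquery = 100;
    v = ceil(1000 .* rand(Nquery, 1));        % 100 random vertices in 1 to 1,000
    vStr = sprintf('%d,', v);                 % comma separated row keys
    Adeg = str2num(TadjDeg(vStr, :));         % degree tallies only: a cheap lookup
    dmin = 5;  dmax = 10;                     % noise and clutter thresholds
    AdegOut = Adeg(:, 'OutDeg,');
    vGood = Row((AdegOut > dmin) & (AdegOut < dmax));  % rows inside the band
    AT = str2num(Tadj(vGood, :));             % now fetch the full rows
    spy(AT);                                  % plot what came back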
So let's move on to the next example. And, actually, if we look here, if we go back to the overview, you see here, that was that ingest I did. So very quick: it's getting about 5,000 inserts a second. That was over a very short period of time; it doesn't even have time to reach its full rise time there. And then you can see the scan rate, and it was very small, a very tiny amount. As we do larger data sets, you'll really see that. And this is a great tool here; it really shows you what's going on.

Before you use the database, always check this page. If you can't get to this page for the database, you're not going to be able to get to the database, and there's no point in doing anything with D4M if this page is not responding. Likewise, if you look at this and you see this thing is just hammering away, and you're seeing hundreds of thousands or millions of inserts a second, it means that, guess what? Someone is probably using your database. And you might want to either pick a different database, or find out who that is and say, hey, when are you going to be done? Likewise on the scan side. And likewise, when you do inserts: you'll get some examples here of the kinds of rates you should be seeing, and you want to make sure you're seeing those rates. If you're not seeing those rates, if you're basically holding the resource but only inserting at a low rate, then you're actually doing yourself and everybody else a disservice. You're using the database, but you're using it inefficiently. It's much better to have your inserts go fast, because then your work gets done quicker, and you're out of everybody else's way quicker too. So I highly recommend people look at this page to see what's going on.
So I think the last one was seven, so we'll move on to eight here. Now we're going to do another, slightly more sophisticated type of query. That last query used the degree tables to prune our query. Think about it: there was probably an edge in there that had something like 100 entries, and we were able to avoid it, never had to touch that edge. But if we had touched it, that could have been a much bigger query. You might say, well, 100 doesn't sound bad. But it's easy to get databases where you'll have some rows with a million entries, or columns with a million entries, or a billion entries. And you really don't want to have to query those rows or columns, because they will just send back so much data.

But you can still have situations where you're doing queries that are going to send back a lot of data, more data than you can really handle in your memory. So we have a thing called an iterator. We've created an iterator object that lets you set up a query and have it work through the results in chunks. Now, this table is so small that you won't really get to see the iteration happening, but I'll show you the setup. So we'll do that. That was very quick.

So, essentially, we're doing a similar thing. We're creating a random set of about 100 vertices here, created over the range 1 to 1,000. And we're creating an iterator here called Tadjacency iterator. It's this function here, Iterator. We pass it the table, and then we have a flag, 'elements', which sets the chunk size: how many entries do we want this iterator to return every single time we call it? And here, we've set max elements to be a pretty small number, 1,000. So what we're saying is, every single time we do this query, we want you to return 1,000 entries at a time.
778 00:39:36,870 --> 00:39:41,940 Now for those of you who are Matlab aficionados, 779 00:39:41,940 --> 00:39:44,860 you should be in awe, because Matlab is supposed 780 00:39:44,860 --> 00:39:46,566 to be a stateless language. 781 00:39:46,566 --> 00:39:48,690 And there's nothing more stateful than an iterator. 782 00:39:48,690 --> 00:39:52,060 And so we have tricked Matlab with a combination of Matlab 783 00:39:52,060 --> 00:39:55,580 on the surface and some hidden Java under the covers 784 00:39:55,580 --> 00:39:59,850 to make it have the feel of Matlab, yet hold state. 785 00:39:59,850 --> 00:40:02,000 So now what we're going to do is we're 786 00:40:02,000 --> 00:40:05,390 going to-- this just creates the iterator. 787 00:40:05,390 --> 00:40:09,460 We then initialize the query by actually passing in a query. 788 00:40:09,460 --> 00:40:14,960 So we now say, OK, the query we want is over this set of rows 789 00:40:14,960 --> 00:40:15,690 here. 790 00:40:15,690 --> 00:40:18,890 And so we're going to run the query the first time. 791 00:40:18,890 --> 00:40:21,600 And since that thing returns string values, 792 00:40:21,600 --> 00:40:23,970 and we want numbers, we have to do a string to num. 793 00:40:23,970 --> 00:40:26,640 And that's our first associative array that's 794 00:40:26,640 --> 00:40:28,770 the result of the first query. 795 00:40:28,770 --> 00:40:31,930 We then initialize our tally here. 796 00:40:31,930 --> 00:40:36,391 And then we just do a while on this result. 797 00:40:36,391 --> 00:40:38,640 So we're going to say, oh, if there's something there, 798 00:40:38,640 --> 00:40:42,320 then we want to sum that, and add it to our tally, 799 00:40:42,320 --> 00:40:43,770 to our in degree tally. 800 00:40:43,770 --> 00:40:45,350 And then we call it again. 801 00:40:45,350 --> 00:40:47,440 So we do the next round of the iteration 802 00:40:47,440 --> 00:40:51,970 by just calling the object with no argument. 803 00:40:51,970 --> 00:40:55,380 And it will just run it until the query is empty. 804 00:40:55,380 --> 00:40:58,270 If we put in an argument, it would re-initialize 805 00:40:58,270 --> 00:41:00,140 the iterator to that new query. 806 00:41:00,140 --> 00:41:02,370 And so you don't have to create a new iterator 807 00:41:02,370 --> 00:41:04,770 every single time you want to put in a different query. 808 00:41:04,770 --> 00:41:07,320 You can reuse the object. 809 00:41:07,320 --> 00:41:09,280 Not that it really matters, but this 810 00:41:09,280 --> 00:41:10,720 is how you get it to do it again. 811 00:41:10,720 --> 00:41:14,160 So it's a pretty elegant syntax for basically doing 812 00:41:14,160 --> 00:41:15,010 an iterator. 813 00:41:15,010 --> 00:41:19,100 And it allows you to deal with things very nicely. 814 00:41:19,100 --> 00:41:23,180 And as we see here, then we did the calculation here, 815 00:41:23,180 --> 00:41:24,440 which is all right. 816 00:41:24,440 --> 00:41:30,170 I want it to then return the value with the largest 817 00:41:30,170 --> 00:41:31,500 maximum degree. 818 00:41:31,500 --> 00:41:33,610 So, basically, I compute the max. 819 00:41:33,610 --> 00:41:36,870 I get the adjacency matrix of the degree. 820 00:41:36,870 --> 00:41:38,580 And I compute its maximum. 821 00:41:38,580 --> 00:41:40,630 And then I set that equal to the in degree. 822 00:41:40,630 --> 00:41:45,025 And it tells me that the first vertex had a degree of 14 823 00:41:45,025 --> 00:41:46,400 in this query, which makes sense.
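For reference, the whole pattern we just walked through looks roughly like this in D4M's Matlab syntax. This is a minimal sketch, not the exact demo code-- the table binding Tadj, the vertex query string v, and the variable names are all assumptions:

    maxElem = 1000;                             % chunk size: entries returned per call
    Tit = Iterator(Tadj, 'elements', maxElem);  % create the iterator on the table
    A = str2num(Tit(v, :));                     % initialize with a row query; values to numeric
    myDeg = Assoc('', '', '');                  % empty associative array for the tally
    while nnz(A)                                % loop until the query is exhausted
        myDeg = myDeg + sum(A, 1);              % add this chunk to the degree tally
        A = str2num(Tit());                     % no argument: fetch the next 1,000 entries
    end

Calling Tit with a new query string instead of no argument is what re-initializes it, so the same object can be reused.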
824 00:41:46,400 --> 00:41:50,190 In the Kronecker matrix, 1, 1 is always 825 00:41:50,190 --> 00:41:55,942 the largest and densest value. 826 00:41:55,942 --> 00:41:57,442 So that's just how we use iterators. 827 00:42:00,220 --> 00:42:01,680 Let us continue on here. 828 00:42:06,010 --> 00:42:08,720 Now I'm going to use iterators in a more complicated way 829 00:42:08,720 --> 00:42:12,180 to do a join. 830 00:42:12,180 --> 00:42:16,690 So a join is where I want to basically-- maybe there's 831 00:42:16,690 --> 00:42:18,150 a row in the table. 832 00:42:18,150 --> 00:42:22,210 And I only want the rows that have 833 00:42:22,210 --> 00:42:24,810 both of certain values in them. 834 00:42:24,810 --> 00:42:28,180 So, for instance, if I had a table of network records, 835 00:42:28,180 --> 00:42:30,640 I might say, look, please only return records that 836 00:42:30,640 --> 00:42:36,500 have this source IP, and are talking to this domain name. 837 00:42:36,500 --> 00:42:40,750 So we have to figure out how we do joins in this technology. 838 00:42:40,750 --> 00:42:42,040 So I'm going to do that. 839 00:42:42,040 --> 00:42:43,408 So let me run that. 840 00:42:48,080 --> 00:42:50,810 So join-- so a little bit more complicated, 841 00:42:50,810 --> 00:42:53,460 obviously, we're building up fairly complicated analytics 842 00:42:53,460 --> 00:42:54,500 here. 843 00:42:54,500 --> 00:43:00,590 So I create my iterator limit here. 844 00:43:00,590 --> 00:43:05,330 I'm going to pick two columns to join, column 1 and column 2. 845 00:43:05,330 --> 00:43:08,260 And this just does a simple join over those columns. 846 00:43:08,260 --> 00:43:11,130 So basically what I'm doing is I'm 847 00:43:11,130 --> 00:43:15,590 saying, please return all the columns that 848 00:43:15,590 --> 00:43:20,860 contain-- this is an or, basically-- either column 1 or 2. 849 00:43:20,860 --> 00:43:24,430 I'm going to then convert those values, which 850 00:43:24,430 --> 00:43:29,000 would have been string values of numbers, to just 0s and 1s. 851 00:43:29,000 --> 00:43:30,920 Then I'm going to sum that. 852 00:43:30,920 --> 00:43:33,550 So, basically, I got the two columns. 853 00:43:33,550 --> 00:43:35,085 I converted all the values to 1. 854 00:43:35,085 --> 00:43:38,840 Now I'm going to sum them together. 855 00:43:38,840 --> 00:43:41,650 And then I'm going to ask the question, 856 00:43:41,650 --> 00:43:44,490 where were those equal to 2? 857 00:43:44,490 --> 00:43:46,210 Those show me the records. 858 00:43:46,210 --> 00:43:47,540 So that's what I'm doing here. 859 00:43:47,540 --> 00:43:49,080 I'm saying, those equal to 2. 860 00:43:49,080 --> 00:43:53,270 So that shows me all the records where that value is equal to 2. 861 00:43:53,270 --> 00:43:57,260 And I can then get the row of those things, 862 00:43:57,260 --> 00:44:02,550 and then pass that back in to the original matrix. 863 00:44:02,550 --> 00:44:08,390 And now I will get back those rows with those things. 864 00:44:08,390 --> 00:44:09,540 And so we can see that. 865 00:44:09,540 --> 00:44:13,150 I think that's the first figure that we did here. 866 00:44:13,150 --> 00:44:15,870 We go to figure one. 867 00:44:15,870 --> 00:44:21,500 And those show, basically-- what did we do? 868 00:44:21,500 --> 00:44:22,000 Right. 869 00:44:35,080 --> 00:44:40,690 This shows us the complete rows of all the records 870 00:44:40,690 --> 00:44:43,100 that had the value 1.
871 00:44:43,100 --> 00:44:45,660 So this is here, 1, and I think it 872 00:44:45,660 --> 00:44:49,960 was 100, which is somewhere right probably in there. 873 00:44:49,960 --> 00:44:55,111 So basically every single one of these records had 1 and 100 874 00:44:55,111 --> 00:44:55,610 in it. 875 00:44:55,610 --> 00:44:57,790 And then it shows us all the rest of the values that 876 00:44:57,790 --> 00:44:59,860 are also in that record. 877 00:44:59,860 --> 00:45:02,980 If we just wanted those columns and those values, 878 00:45:02,980 --> 00:45:05,170 when we just summed them equal to 2, we were done. 879 00:45:05,170 --> 00:45:07,440 But this allowed us to then go back into the record, 880 00:45:07,440 --> 00:45:11,690 and now into the full set, and just get those records. 881 00:45:11,690 --> 00:45:14,260 So this is a way to do a join if you 882 00:45:14,260 --> 00:45:19,260 can hold the complete results of both of those things, 883 00:45:19,260 --> 00:45:23,460 like give me the whole column 1, and the whole column 2. 884 00:45:23,460 --> 00:45:26,700 That's one way to do a join, perfectly reasonable way 885 00:45:26,700 --> 00:45:28,930 to do a join. 886 00:45:28,930 --> 00:45:32,590 I'm going to now do this again, but with a column range. 887 00:45:32,590 --> 00:45:37,600 So I'm going to say, I want to do a join of all columns that 888 00:45:37,600 --> 00:45:42,070 begin with 111, and all columns that begin with 222. 889 00:45:42,070 --> 00:45:43,500 So that would return more. 890 00:45:43,500 --> 00:45:46,710 I'm going to create two iterators now to do that. 891 00:45:46,710 --> 00:45:49,295 So I have initialized my iterator, iterator one 892 00:45:49,295 --> 00:45:51,740 and iterator two. 893 00:45:51,740 --> 00:45:55,120 And so now I'm going to start the first query iterator here 894 00:45:55,120 --> 00:45:57,830 by giving it its column range that initializes it. 895 00:45:57,830 --> 00:45:59,560 And I get an A1. 896 00:45:59,560 --> 00:46:03,460 And then I check to see if A1 is not empty. 897 00:46:03,460 --> 00:46:06,640 If it isn't empty, I'm going to then sum it, and then call it 898 00:46:06,640 --> 00:46:08,810 again until it completes. 899 00:46:08,810 --> 00:46:10,620 And since it's such a small thing, 900 00:46:10,620 --> 00:46:13,729 it only went through that once. 901 00:46:13,729 --> 00:46:15,770 And then now I'm going to do the same thing again 902 00:46:15,770 --> 00:46:19,250 with the other iterator and sum them together again. 903 00:46:26,140 --> 00:46:33,064 And now I'm going to join those two columns. 904 00:46:33,064 --> 00:46:34,480 And that's a way of doing the join 905 00:46:34,480 --> 00:46:36,900 with essentially two nested iterators, 906 00:46:36,900 --> 00:46:38,480 and doing the join that way. 907 00:46:38,480 --> 00:46:41,400 So that's something you can do if you couldn't hold 908 00:46:41,400 --> 00:46:45,580 the whole column in memory and you wanted to build it up 909 00:46:45,580 --> 00:46:46,080 as you went. 910 00:46:46,080 --> 00:46:48,480 That's a way to do it with iterators. 911 00:46:48,480 --> 00:46:50,090 Then let's see here. 912 00:46:53,750 --> 00:46:56,350 There's an example of the results from that. 913 00:47:00,860 --> 00:47:03,240 And just so you know, when you do an SQL query 914 00:47:03,240 --> 00:47:07,470 in an SQL database, this is what it's doing under the covers. 915 00:47:07,470 --> 00:47:10,250 It's trying to use whatever information it can.
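Backing up to that first join for a moment, here is a minimal sketch in the same D4M style-- again, the table binding Tadj and the column names are assumptions, not the exact demo code, and putVal stands in for any values-to-1 step:

    A = str2num(putVal(Tadj(:, 'v1,v2,'), '1,'));  % OR-query on the two columns; values forced to 1
    both = sum(A, 2) == 2;                         % rows where BOTH columns appear: the AND
    Aout = Tadj(Row(both), :);                     % pass the row keys back in for the full records

The two-iterator variant is the same idea, except each column sum is accumulated chunk by chunk before the final comparison.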
916 00:47:10,250 --> 00:47:13,560 It will [INAUDIBLE] hold a lot of internal statistical 917 00:47:13,560 --> 00:47:14,210 information. 918 00:47:14,210 --> 00:47:18,300 It's trying to figure out, can I hold the results in memory? 919 00:47:18,300 --> 00:47:18,884 Or can I not? 920 00:47:18,884 --> 00:47:20,300 Do I have to go through in chunks? 921 00:47:20,300 --> 00:47:21,590 Or do I not? 922 00:47:21,590 --> 00:47:24,910 Do I have information about oh-- you know, this query, 923 00:47:24,910 --> 00:47:29,230 do I have a sum table sitting around that will tell me, 924 00:47:29,230 --> 00:47:32,330 oh, which column should I query first, 925 00:47:32,330 --> 00:47:34,090 because it will return fewer results? 926 00:47:34,090 --> 00:47:36,790 And then I can go from there type of thing. 927 00:47:36,790 --> 00:47:39,220 So this is all going on under the covers. 928 00:47:39,220 --> 00:47:43,750 Here, you get the power to do that directly on the database. 929 00:47:43,750 --> 00:47:46,570 And it's pretty easy to do. 930 00:47:46,570 --> 00:47:49,270 But you do have to understand these concepts. 931 00:47:49,270 --> 00:47:52,760 So now we'll move on. 932 00:47:52,760 --> 00:47:54,830 So that was all with the adjacency matrix. 933 00:47:54,830 --> 00:47:57,390 And as I said before, when we formed the adjacency matrix, 934 00:47:57,390 --> 00:47:59,810 we threw away a little information, 935 00:47:59,810 --> 00:48:04,410 because if we had multiple-- if we had a collision of any kind, 936 00:48:04,410 --> 00:48:09,910 we lost the distinctness of that record, or that edge. 937 00:48:09,910 --> 00:48:13,760 And a lot of times, no, we want to keep these edges, 938 00:48:13,760 --> 00:48:17,190 because we'll have other information about those edges. 939 00:48:17,190 --> 00:48:19,340 And we want to keep things. 940 00:48:19,340 --> 00:48:22,530 So we want to store that. 941 00:48:22,530 --> 00:48:27,130 So we're going to now reinsert the data in the table, 942 00:48:27,130 --> 00:48:30,380 and use, essentially, instead of an adjacency matrix, 943 00:48:30,380 --> 00:48:31,510 an incidence matrix. 944 00:48:31,510 --> 00:48:35,770 In an incidence matrix, every single row is an edge. 945 00:48:35,770 --> 00:48:39,240 And every single column is a vertex. 946 00:48:39,240 --> 00:48:42,330 And so an edge can then connect multiple vertices. 947 00:48:42,330 --> 00:48:44,360 It also allows us to store essentially 948 00:48:44,360 --> 00:48:46,030 what we call hyperedges. 949 00:48:46,030 --> 00:48:48,590 So if you have an edge that connects multiple vertices 950 00:48:48,590 --> 00:48:50,700 at the same time, we can do that. 951 00:48:50,700 --> 00:48:53,700 So let's do that. 952 00:48:53,700 --> 00:48:56,960 This is inserting about twice as much data. 953 00:48:56,960 --> 00:49:02,410 So it naturally takes a little bit longer there. 954 00:49:02,410 --> 00:49:04,470 And you see the edge insertion rates 955 00:49:04,470 --> 00:49:08,960 that we're getting there, 30,000 edges per second. 956 00:49:08,960 --> 00:49:16,060 So let's go and see what it did to our data here. 957 00:49:16,060 --> 00:49:20,310 So if we look at our tables, we can see now 958 00:49:20,310 --> 00:49:24,850 that there's our edge data set. 959 00:49:24,850 --> 00:49:30,970 And you see we've inserted about 270,000 distinct entries 960 00:49:30,970 --> 00:49:31,830 in this data. 961 00:49:31,830 --> 00:49:33,320 So there's the edge table. 962 00:49:33,320 --> 00:49:34,820 And there's its transpose.
963 00:49:34,820 --> 00:49:37,660 And there's the degree count. 964 00:49:37,660 --> 00:49:41,050 And as you saw before, we had 53,000. 965 00:49:41,050 --> 00:49:45,310 So that just shows you the additional information. 966 00:49:45,310 --> 00:49:48,010 Let's look at that program here. 967 00:49:50,650 --> 00:49:55,880 So, again, we're looping over all our files here. 968 00:49:55,880 --> 00:49:56,810 We're reading them in. 969 00:49:56,810 --> 00:49:59,840 I should say, in this case we're reading in the raw text 970 00:49:59,840 --> 00:50:00,530 files again. 971 00:50:00,530 --> 00:50:03,020 We're not reading in the associative array 972 00:50:03,020 --> 00:50:05,430 because we just want to insert those edges. 973 00:50:05,430 --> 00:50:06,960 And then the only thing we've done 974 00:50:06,960 --> 00:50:09,800 is that, basically, to create our edge we 975 00:50:09,800 --> 00:50:14,160 had to create-- this data didn't come with a record label. 976 00:50:14,160 --> 00:50:15,340 So we don't have any. 977 00:50:15,340 --> 00:50:18,840 So we're constructing a unique record label for each edge 978 00:50:18,840 --> 00:50:21,310 here just so that we have it for the row key. 979 00:50:21,310 --> 00:50:26,800 And then we're prepending the word 'out' onto the row string 980 00:50:26,800 --> 00:50:28,660 and 'in' onto the column string. 981 00:50:28,660 --> 00:50:32,970 So we know when we create our record, 982 00:50:32,970 --> 00:50:35,300 'out' shows the vertex the edge departed from. 983 00:50:35,300 --> 00:50:37,920 And 'in' shows the vertex it arrived at. 984 00:50:37,920 --> 00:50:40,320 And so that's a way of creating the edge. 985 00:50:40,320 --> 00:50:46,340 And then, likewise, we do the degree count and such 986 00:50:46,340 --> 00:50:47,950 to preserve that information so we 987 00:50:47,950 --> 00:50:51,496 can sum the new total number of counts there. 988 00:50:57,410 --> 00:50:59,270 So let's do some queries on that. 989 00:51:05,600 --> 00:51:09,120 So I'm going to ask for 100 random vertices here. 990 00:51:09,120 --> 00:51:11,390 So I get my random vertices. 991 00:51:11,390 --> 00:51:15,880 And then I'm going to do my query of the strings. 992 00:51:15,880 --> 00:51:20,390 But I have to prepend this 'out', slash, onto the value 993 00:51:20,390 --> 00:51:26,820 so I know I'm looking for vertices from which the edge is 994 00:51:26,820 --> 00:51:28,060 departing. 995 00:51:28,060 --> 00:51:30,460 And I'm going to get those edges. 996 00:51:30,460 --> 00:51:31,375 So I get those edges. 997 00:51:31,375 --> 00:51:33,040 So I created the query. 998 00:51:33,040 --> 00:51:34,290 I get the edges. 999 00:51:34,290 --> 00:51:36,070 I'm going to do my thresholding again. 1000 00:51:36,070 --> 00:51:40,010 I want a certain min and max. 1001 00:51:40,010 --> 00:51:46,660 And then I'm going to do the threshold. 1002 00:51:46,660 --> 00:51:48,950 So this just gave me the degree counts. 1003 00:51:48,950 --> 00:51:52,150 And I thresholded between this range. 1004 00:51:52,150 --> 00:52:00,290 And now I do the same thing back with the-- I say, 1005 00:52:00,290 --> 00:52:02,040 give me everything greater than degree min 1006 00:52:02,040 --> 00:52:03,500 and less than degree max. 1007 00:52:03,500 --> 00:52:04,950 I get a new set of rows. 1008 00:52:04,950 --> 00:52:07,800 So that will just return the edges 1009 00:52:07,800 --> 00:52:14,130 that are a part of vertices with this degree range.
1010 00:52:14,130 --> 00:52:21,230 And then I'm going to get all those edges, all the records 1011 00:52:21,230 --> 00:52:26,480 that contain those vertices, through this nested query here. 1012 00:52:26,480 --> 00:52:28,455 The result is this. 1013 00:52:28,455 --> 00:52:32,890 So, basically, this shows me all the edges 1014 00:52:32,890 --> 00:52:40,310 that are a part of this random set of vertices 1015 00:52:40,310 --> 00:52:45,430 that have a degree range between five and 10. 1016 00:52:45,430 --> 00:52:47,780 This is a fairly sophisticated analytic. 1017 00:52:50,984 --> 00:52:52,650 We're doing about seven or eight queries 1018 00:52:52,650 --> 00:52:55,280 here, doing a lot of mathematics. 1019 00:52:55,280 --> 00:52:58,040 And you see how dense it is. 1020 00:52:58,040 --> 00:53:02,190 And, hopefully, from what we've learned in prior classes, 1021 00:53:02,190 --> 00:53:04,170 this has some intuition for you. 1022 00:53:07,080 --> 00:53:08,620 And we'll continue on. 1023 00:53:14,610 --> 00:53:17,240 So now I'm going to do a query with the iterator-- 1024 00:53:17,240 --> 00:53:18,752 again, same type of drill. 1025 00:53:18,752 --> 00:53:20,960 I set the maximum number of elements to the iterator. 1026 00:53:20,960 --> 00:53:22,750 I get my random set of things. 1027 00:53:22,750 --> 00:53:26,570 I create an iterator, again setting the number of elements. 1028 00:53:26,570 --> 00:53:31,450 I initialize the iterator to be over these vertices. 1029 00:53:31,450 --> 00:53:34,600 I then check to see if it returned anything. 1030 00:53:34,600 --> 00:53:39,710 If it did, I'm going to then actually pass 1031 00:53:39,710 --> 00:53:47,670 the rows of that back into it to get the edges containing 1032 00:53:47,670 --> 00:53:50,130 those vertices, and then do a sum 1033 00:53:50,130 --> 00:53:53,030 to compute the in degree, and so on and so forth. 1034 00:53:53,030 --> 00:53:56,750 And then I get here the maximum in degree 1035 00:53:56,750 --> 00:53:59,520 of that set of vertices was 25. 1036 00:53:59,520 --> 00:54:03,320 So that's just an example of that. 1037 00:54:03,320 --> 00:54:05,140 And that was 12. 1038 00:54:05,140 --> 00:54:07,090 And I think 13 is our last one. 1039 00:54:16,300 --> 00:54:20,840 All right, and again, a more complicated example 1040 00:54:20,840 --> 00:54:26,760 showing basically a join over this space creating, 1041 00:54:26,760 --> 00:54:33,420 essentially, a couple of sets of edges, 1042 00:54:33,420 --> 00:54:36,230 a couple of column ranges, iterators, 1043 00:54:36,230 --> 00:54:37,190 and so on and so forth. 1044 00:54:37,190 --> 00:54:38,490 And I won't belabor this point. 1045 00:54:38,490 --> 00:54:41,800 But this just shows how you can combine between using 1046 00:54:41,800 --> 00:54:46,155 the degree table and iterators. 1047 00:54:46,155 --> 00:54:49,470 You have all the tools at your disposal 1048 00:54:49,470 --> 00:54:54,990 that any type of query planning system would have inside it, 1049 00:54:54,990 --> 00:54:58,080 that it would use to make sure that you're not 1050 00:54:58,080 --> 00:55:02,871 over-taxing the system with the results that you're returning. 1051 00:55:02,871 --> 00:55:05,370 And that's a lot of times, if you do a query on any database, 1052 00:55:05,370 --> 00:55:08,100 you get the big spinning watch or whatever. 1053 00:55:08,100 --> 00:55:13,590 It's because the query you asked was simply too large.
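Pulling those incidence-matrix steps together, a rough sketch in D4M style-- every name here (Tedge, TedgeDeg, v1, v2, v, dmin, dmax, offset) is an assumption, and CatStr is assumed to replicate the single 'out,' or 'in,' prefix across the whole list:

    eID = sprintf('%08d,', (1:NumStr(v1)) + offset);   % unique record label per edge for the row key
    E = Assoc(eID, CatStr('out,', '/', v1), '1,') ...  % column 'out/v1': vertex the edge departs from
      + Assoc(eID, CatStr('in,', '/', v2), '1,');      % column 'in/v2': vertex it arrives at
    put(Tedge, E);                                     % the table pair stores the transpose as well

    deg  = str2num(TedgeDeg(CatStr('out,', '/', v), :));  % check the degree table first
    vUse = Row((deg > dmin) < dmax);                      % keep only the mid-degree vertices
    Aout = Tedge(Row(Tedge(:, vUse)), :);                 % nested query: full records of those edges

The degree table itself is kept up to date at insert time, the same way as in the adjacency examples.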
1054 00:55:13,590 --> 00:55:15,240 It also gives a lot of nice places 1055 00:55:15,240 --> 00:55:21,420 to-- if you're making a query system, to interrupt it. 1056 00:55:21,420 --> 00:55:24,290 So if you do the query against the counts, 1057 00:55:24,290 --> 00:55:27,440 you can quickly tell the user, look, you just did a query. 1058 00:55:27,440 --> 00:55:31,390 And this is going to return 10 million results. 1059 00:55:31,390 --> 00:55:33,312 Do you want to proceed? 1060 00:55:33,312 --> 00:55:34,770 And so you, of course, [INAUDIBLE]. 1061 00:55:34,770 --> 00:55:38,600 Or likewise, you can set a maximum number of iterations. 1062 00:55:38,600 --> 00:55:40,840 Like, say, OK, I want to get them back 1063 00:55:40,840 --> 00:55:44,070 in units of 100,000 entries. 1064 00:55:44,070 --> 00:55:46,330 But I only want to go up to a million. 1065 00:55:46,330 --> 00:55:49,220 And then I'm going to pause and get 1066 00:55:49,220 --> 00:55:52,380 some kind of additional guidance from the user to continue. 1067 00:55:52,380 --> 00:55:54,100 So those are the same tools and tricks 1068 00:55:54,100 --> 00:55:57,250 that are in any query planner very elegantly exposed 1069 00:55:57,250 --> 00:56:03,120 to you here for managing these types of queries. 1070 00:56:03,120 --> 00:56:05,750 With that, I want to do some stuff 1071 00:56:05,750 --> 00:56:08,081 where we do a little bit of bigger data sets. 1072 00:56:08,081 --> 00:56:09,580 So I've walked through the examples. 1073 00:56:09,580 --> 00:56:14,830 I want to show you what this is like running on bigger data. 1074 00:56:14,830 --> 00:56:17,560 So let's close all that. 1075 00:56:20,530 --> 00:56:21,510 Close that. 1076 00:56:34,070 --> 00:56:35,400 I want to do this one. 1077 00:56:35,400 --> 00:56:41,620 So now I'm logged into-- I just SSHed 1078 00:56:41,620 --> 00:56:48,180 into classdb02.cloud.llgrid.ll.mit.edu, which 1079 00:56:48,180 --> 00:56:50,710 happens-- it tells you which node it's actually mapped to, 1080 00:56:50,710 --> 00:56:55,030 which is node F-15-11 in our cluster. 1081 00:56:55,030 --> 00:56:57,000 And this is a fairly powerful compute node. 1082 00:56:57,000 --> 00:56:59,564 These are our next generation compute nodes for LLGrid. 1083 00:56:59,564 --> 00:57:01,230 So those of you who've been using LLGrid 1084 00:57:01,230 --> 00:57:02,910 for all these past years may have 1085 00:57:02,910 --> 00:57:05,290 noticed that they're getting a little long in the tooth. 1086 00:57:05,290 --> 00:57:08,180 These are the first set of the new nodes. 1087 00:57:08,180 --> 00:57:12,900 And we'll have about 500 of them total in various systems. 1088 00:57:12,900 --> 00:57:19,960 And so here, I am doing something a little bit larger. 1089 00:57:19,960 --> 00:57:21,245 So let me see here. 1090 00:57:26,060 --> 00:57:28,780 Examples-- so I'm on my LLGrid account here. 1091 00:57:28,780 --> 00:57:33,810 And I'm going to go to 3 and then 2. 1092 00:57:33,810 --> 00:57:37,500 Then I do open dots. 1093 00:57:37,500 --> 00:57:40,030 So that's the directory. 1094 00:57:40,030 --> 00:57:45,520 And so the first thing I did is in my DB setup here, 1095 00:57:45,520 --> 00:57:53,150 you'll notice that I have done class DB 0. 1096 00:57:53,150 --> 00:57:56,050 And also, we don't need to do 1. 1097 00:57:56,050 --> 00:57:58,200 But I will do 2 here. 1098 00:57:58,200 --> 00:57:59,510 I've made this bigger.
1099 00:57:59,510 --> 00:58:03,320 So I have made this now 2 to the 18th vertices, 1100 00:58:03,320 --> 00:58:06,740 instead of what I had before. 1101 00:58:06,740 --> 00:58:11,940 So let's go [INAUDIBLE] anymore. 1102 00:58:11,940 --> 00:58:18,936 So if I do PDB02, it's going to now generate these things. 1103 00:58:18,936 --> 00:58:20,560 And so it's generating a lot more data. 1104 00:58:20,560 --> 00:58:24,684 And you see it's doing about 200,000 vertices per second. 1105 00:58:24,684 --> 00:58:26,850 Just shows you the difference between the capability 1106 00:58:26,850 --> 00:58:32,900 of my laptop and one of these servers here. 1107 00:58:32,900 --> 00:58:36,850 And this will also-- I'm logged onto this system. 1108 00:58:36,850 --> 00:58:38,390 It has 32 cores. 1109 00:58:38,390 --> 00:58:40,340 I can do things in parallel. 1110 00:58:40,340 --> 00:58:46,310 And so, for instance, if I did eval pRUN, 1111 00:58:46,310 --> 00:58:52,690 for those of you who have had the parallel Matlab training, 1112 00:58:52,690 --> 00:58:54,880 as I said before. 1113 00:58:54,880 --> 00:58:57,140 And since I'm logged into this node, 1114 00:58:57,140 --> 00:58:59,120 and I just do curly brackets, it just 1115 00:58:59,120 --> 00:59:00,710 says launch locally on that node. 1116 00:59:00,710 --> 00:59:01,890 Don't launch onto the grid. 1117 00:59:05,600 --> 00:59:10,660 And now it's launching that in parallel on this node-- data one, 1118 00:59:10,660 --> 00:59:11,210 did that. 1119 00:59:11,210 --> 00:59:12,610 Data two, did that. 1120 00:59:12,610 --> 00:59:13,360 Now it's done. 1121 00:59:13,360 --> 00:59:17,170 And the others finished their work too, probably right 1122 00:59:17,170 --> 00:59:18,010 about now. 1123 00:59:18,010 --> 00:59:19,600 So that's just an example of being 1124 00:59:19,600 --> 00:59:21,320 able to do things in parallel. 1125 00:59:21,320 --> 00:59:23,580 We've created here-- I mean, you look at it. 1126 00:59:23,580 --> 00:59:28,000 We did eight times 300,000. 1127 00:59:28,000 --> 00:59:31,880 We did 2.5 million edges in that, essentially, 1128 00:59:31,880 --> 00:59:34,330 five seconds type of thing. 1129 00:59:34,330 --> 00:59:37,420 So multiply this by 4. 1130 00:59:37,420 --> 00:59:39,860 You're doing like a million edges a second just 1131 00:59:39,860 --> 00:59:44,400 in that one type of calculation there. 1132 00:59:44,400 --> 00:59:47,970 Moving on to PDB03. 1133 00:59:47,970 --> 00:59:54,200 And-- oh, I should say, I did modify that program slightly. 1134 00:59:54,200 --> 00:59:54,958 Let's see here. 1135 00:59:59,040 --> 01:00:07,400 So if I look at the line labeled in big capital letters 1136 01:00:07,400 --> 01:00:10,490 PARALLEL, I uncommented it. 1137 01:00:10,490 --> 01:00:13,450 That's what allows each one of the processors, when 1138 01:00:13,450 --> 01:00:16,290 they launched, to then work on different files. 1139 01:00:16,290 --> 01:00:19,650 Otherwise, they all would have worked on the same files. 1140 01:00:19,650 --> 01:00:22,610 So by uncommenting that PARALLEL, 1141 01:00:22,610 --> 01:00:24,730 this now becomes a parallel program 1142 01:00:24,730 --> 01:00:26,420 that I can run with the eval pRUN command. 1143 01:00:26,420 --> 01:00:28,670 Of course, you have to have parallel Matlab installed, 1144 01:00:28,670 --> 01:00:30,753 which of course you all do since you're on LLGrid.
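Concretely, that launch line in pMatlab terms is roughly this-- the script name is a stand-in:

    eval(pRUN('pDB02_DataTEST', 4, {}));  % 4 processes; {} means launch locally on this node

Swapping the {} for a machine specification is what would send the job out onto the grid instead.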
1145 01:00:30,753 --> 01:00:33,650 But for anyone seeing this on the internet, 1146 01:00:33,650 --> 01:00:36,470 they would need to have that software installed too, which 1147 01:00:36,470 --> 01:00:39,554 is also available on the web and installable there. 1148 01:00:39,554 --> 01:00:40,970 So that's all we needed to do, was 1149 01:00:40,970 --> 01:00:44,320 uncomment that one line to make that program parallel, 1150 01:00:44,320 --> 01:00:46,806 and it did the right thing for us as well. 1151 01:00:46,806 --> 01:00:48,680 And we're going to go on to the next example. 1152 01:00:48,680 --> 01:00:51,090 And we did the same thing there. 1153 01:00:51,090 --> 01:00:52,570 We just uncommented PARALLEL. 1154 01:00:52,570 --> 01:00:55,410 And it's now going to create these associative arrays 1155 01:00:55,410 --> 01:00:56,490 in parallel. 1156 01:00:56,490 --> 01:01:07,410 So if I do PDB03-- so it's now actually constructing 1157 01:01:07,410 --> 01:01:08,840 these associative arrays. 1158 01:01:08,840 --> 01:01:10,800 You see it takes about four seconds 1159 01:01:10,800 --> 01:01:11,890 to do each one of those. 1160 01:01:11,890 --> 01:01:17,740 It's doing about 120,000, 130,000 entries per second. 1161 01:01:17,740 --> 01:01:22,041 So this thing will take about 25 seconds to do the whole thing. 1162 01:01:35,730 --> 01:01:50,380 And, again, if we ran that in parallel, it automatically 1163 01:01:50,380 --> 01:01:52,090 tries to kill the last job you ran 1164 01:01:52,090 --> 01:01:54,173 if you're in the same directory so that you're not 1165 01:01:54,173 --> 01:01:57,330 [INAUDIBLE] on top of yourself. 1166 01:01:57,330 --> 01:02:01,180 And now you see it's doing that again. 1167 01:02:01,180 --> 01:02:03,900 And now it's done. 1168 01:02:03,900 --> 01:02:06,780 And the other one is finished as well. 1169 01:02:06,780 --> 01:02:10,310 You can actually check that, if you really want to. 1170 01:02:10,310 --> 01:02:18,062 Just hit Control Z, if I do more [INAUDIBLE] out. 1171 01:02:18,062 --> 01:02:20,520 You can see those are the output files of each one of them. 1172 01:02:20,520 --> 01:02:21,520 I'm not lying. 1173 01:02:21,520 --> 01:02:23,394 They didn't take a ridiculous amount of time. 1174 01:02:23,394 --> 01:02:24,920 They all completed. 1175 01:02:24,920 --> 01:02:26,470 We always encourage people to check 1176 01:02:26,470 --> 01:02:28,845 those .out files, in that [INAUDIBLE] directory. 1177 01:02:28,845 --> 01:02:30,660 That's where it sends all the standard out 1178 01:02:30,660 --> 01:02:32,060 from all the other nodes. 1179 01:02:32,060 --> 01:02:33,720 It's probably the number one feedback 1180 01:02:33,720 --> 01:02:36,570 we get from a user who says, my job didn't run. 1181 01:02:36,570 --> 01:02:38,820 We're like, what does it say in your .out files? 1182 01:02:38,820 --> 01:02:41,270 And usually it's like, oh, there's an error on node 3 1183 01:02:41,270 --> 01:02:43,700 because this calculation is wrong on that particular node, 1184 01:02:43,700 --> 01:02:48,800 or something like that-- so just reminding folks of that. 1185 01:02:48,800 --> 01:02:52,360 Moving on-- so what did we just do? 1186 01:02:52,360 --> 01:02:54,130 We did three. 1187 01:02:54,130 --> 01:02:59,380 So we'll do PDB04 next. 1188 01:03:02,860 --> 01:03:05,200 And this is doing a little bit bigger calculation. 1189 01:03:05,200 --> 01:03:07,840 And so you can see here-- I told you 1190 01:03:07,840 --> 01:03:11,040 it does begin to get bigger here.
1191 01:03:11,040 --> 01:03:13,050 So it started out-- the first iteration 1192 01:03:13,050 --> 01:03:14,760 was about 0.6 seconds. 1193 01:03:14,760 --> 01:03:18,390 And then it goes on to 0.8 seconds. 1194 01:03:18,390 --> 01:03:22,340 If we did that cat approach, it would do it faster. 1195 01:03:25,210 --> 01:03:26,870 I'll show you a little neat trick. 1196 01:03:26,870 --> 01:03:30,600 This is also a parallel program when I run it. 1197 01:03:30,600 --> 01:03:37,600 And, basically, I loop over each file here. 1198 01:03:37,600 --> 01:03:39,540 I'm just doing this little agg thing just 1199 01:03:39,540 --> 01:03:42,460 to sync the processors just for fun so I don't 1200 01:03:42,460 --> 01:03:44,840 have to wait for them to start. 1201 01:03:44,840 --> 01:03:46,000 And then it's going to go. 1202 01:03:46,000 --> 01:03:47,510 And they're going to do their sums. 1203 01:03:47,510 --> 01:03:49,350 And then when they're all done, they're 1204 01:03:49,350 --> 01:03:51,903 going to call this-- so each one will have a local sum. 1205 01:03:51,903 --> 01:03:53,660 And it needs to be pulled together. 1206 01:03:53,660 --> 01:03:56,860 So we have this function called gagg, which basically takes 1207 01:03:56,860 --> 01:04:00,280 associative arrays and will just sum them all together-- a very 1208 01:04:00,280 --> 01:04:02,849 nice tool for doing that. 1209 01:04:02,849 --> 01:04:04,890 And, of course, we had to uncomment that in order 1210 01:04:04,890 --> 01:04:06,000 for that to work. 1211 01:04:06,000 --> 01:04:08,490 And so let's go give that a try. 1212 01:04:08,490 --> 01:04:25,380 And so if we do eval pRUN, so it's launching them. 1213 01:04:25,380 --> 01:04:26,240 And it's reading. 1214 01:04:26,240 --> 01:04:28,281 And then now it's going to have to wait a second. 1215 01:04:28,281 --> 01:04:31,480 OK, so it took about two seconds to pull them all together 1216 01:04:31,480 --> 01:04:33,370 and do the sum across those processors. 1217 01:04:33,370 --> 01:04:36,420 So that's a parallel computation, 1218 01:04:36,420 --> 01:04:40,160 a classic example of what people do with LLGrid 1219 01:04:40,160 --> 01:04:42,320 and can do with D4M is they have a bunch of files. 1220 01:04:42,320 --> 01:04:44,880 Each processor processes them independently. 1221 01:04:44,880 --> 01:04:50,190 And at the end, they pull something all together using 1222 01:04:50,190 --> 01:04:53,960 this gagg command. 1223 01:04:53,960 --> 01:05:04,485 All right, moving on-- so now I'm on database 2 here. 1224 01:05:04,485 --> 01:05:05,610 So I'm going to go to that. 1225 01:05:05,610 --> 01:05:07,862 And let's look at our tables. 1226 01:05:07,862 --> 01:05:10,320 Very little activity-- and you see we have no tables there. 1227 01:05:10,320 --> 01:05:11,400 So I have to create them. 1228 01:05:11,400 --> 01:05:14,590 So I'm going to run PDB [INAUDIBLE] setup 05. 1229 01:05:14,590 --> 01:05:17,616 That's going to create those tables on that database. 1230 01:05:23,600 --> 01:05:28,040 We can now look, see. 1231 01:05:28,040 --> 01:05:30,090 And it created all my tables. 1232 01:05:30,090 --> 01:05:31,765 So now I'm ready to go. 1233 01:05:31,765 --> 01:05:35,915 And now we're going to do the insert again, PDB06. 1234 01:05:35,915 --> 01:05:38,220 I'm going to insert the adjacency matrix. 1235 01:05:49,620 --> 01:05:52,290 And this obviously takes a little bit longer.
1236 01:05:52,290 --> 01:05:55,110 Each one of these, there's a parameter associated 1237 01:05:55,110 --> 01:05:57,860 with the table, which is-- you would think, 1238 01:05:57,860 --> 01:05:59,870 normally, it should just send all its data 1239 01:05:59,870 --> 01:06:01,432 to the database at once. 1240 01:06:01,432 --> 01:06:02,890 But it turns out one thing we found 1241 01:06:02,890 --> 01:06:05,440 is that actually the database prefers 1242 01:06:05,440 --> 01:06:08,990 to receive the data in a smaller increment, typically 1243 01:06:08,990 --> 01:06:11,200 around half a megabyte chunk. 1244 01:06:11,200 --> 01:06:13,300 So every single time it's calling this, 1245 01:06:13,300 --> 01:06:15,734 it's sending half a megabyte, waiting to get the all clear 1246 01:06:15,734 --> 01:06:17,275 again, and then sending the next one. 1247 01:06:17,275 --> 01:06:19,800 And we've actually found that makes a fairly significant-- 1248 01:06:19,800 --> 01:06:23,600 so you can see here, we're getting about 60,000 inserts 1249 01:06:23,600 --> 01:06:24,660 per second. 1250 01:06:24,660 --> 01:06:28,050 This is from one processor. 1251 01:06:28,050 --> 01:06:30,205 And this takes a little while. 1252 01:06:30,205 --> 01:06:33,580 Let's see if we can go and look at it while it's going on. 1253 01:06:33,580 --> 01:06:36,280 If we go here, we should begin to see some. 1254 01:06:36,280 --> 01:06:37,654 So there you go. 1255 01:06:37,654 --> 01:06:39,820 That's what a real insert is beginning to look like. 1256 01:06:39,820 --> 01:06:43,910 It just changes its axis for you dynamically here. 1257 01:06:43,910 --> 01:06:49,050 But we're getting about 60,000 inserts a second there. 1258 01:06:49,050 --> 01:06:51,560 Let me just go along here. 1259 01:06:51,560 --> 01:06:53,793 It'll start leveling off a little bit. 1260 01:07:01,192 --> 01:07:03,775 And then it will show you what's going on in the tables there. 1261 01:07:14,180 --> 01:07:16,750 And, I mean, not too many of you are probably 1262 01:07:16,750 --> 01:07:17,990 database aficionados. 1263 01:07:17,990 --> 01:07:24,340 But 60,000 inserts a second on a single-node database 1264 01:07:24,340 --> 01:07:26,549 is pretty darn amazing. 1265 01:07:26,549 --> 01:07:29,090 I mean, you would mostly have had to use a parallel database. 1266 01:07:29,090 --> 01:07:30,680 And that's actually one of the great powers of Accumulo: 1267 01:07:30,680 --> 01:07:32,300 even though it's 1268 01:07:32,300 --> 01:07:34,927 a parallel database, there's a lot of problems you can now 1269 01:07:34,927 --> 01:07:37,510 do with a single-node database that you would have had to have 1270 01:07:37,510 --> 01:07:39,310 a parallel system to do before. 1271 01:07:39,310 --> 01:07:41,400 And that's really-- because parallel computing 1272 01:07:41,400 --> 01:07:43,180 is a real pain. 1273 01:07:43,180 --> 01:07:43,790 I should know. 1274 01:07:47,079 --> 01:07:49,120 If you can make your problem fast enough 1275 01:07:49,120 --> 01:07:50,830 to now work on a single node, that's 1276 01:07:50,830 --> 01:07:53,864 really a tremendously convenient capability. 1277 01:08:02,460 --> 01:08:09,230 So we inserted there, what, eight million entries-- so 1278 01:08:09,230 --> 01:08:11,210 pretty impressive. 1279 01:08:11,210 --> 01:08:13,340 But that did take a while. 1280 01:08:13,340 --> 01:08:16,330 And so maybe I want to do that in parallel. 1281 01:08:16,330 --> 01:08:24,452 So if I just do eval pRUN, let's try four and see what happens.
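While that parallel run spins up, the chunking knob just described looks roughly like this on the D4M side. A sketch, assuming a DB binding, the table names, and the putBytes property that controls the put chunk size:

    Tadj = DB('Tadj', 'TadjT');   % bind to the adjacency table pair
    Tadj.putBytes = 0.5e6;        % send about half a megabyte per round trip
    put(Tadj, num2str(An));       % the insert then goes out in chunks of that size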
1282 01:08:27,620 --> 01:08:29,040 Now I would expect, actually, this 1283 01:08:29,040 --> 01:08:30,660 to begin to top this thing out. 1284 01:08:30,660 --> 01:08:33,564 And so the individual insert rates on each process 1285 01:08:33,564 --> 01:08:34,813 probably go down a little bit. 1286 01:08:38,930 --> 01:08:41,200 And you'll see it will get a little bit noisy here, 1287 01:08:41,200 --> 01:08:44,520 because now we have four separate processes all doing 1288 01:08:44,520 --> 01:08:45,214 inserts. 1289 01:08:45,214 --> 01:08:46,130 You see there was one. 1290 01:08:46,130 --> 01:08:47,713 It took a little bit, almost a second. 1291 01:08:47,713 --> 01:08:51,180 And you get this noise beginning to happen here. 1292 01:08:51,180 --> 01:08:53,720 But we're getting 50,000 edges per second 1293 01:08:53,720 --> 01:08:56,460 from each process, which means we should be getting 1294 01:08:56,460 --> 01:08:58,819 close to four times that. 1295 01:08:58,819 --> 01:08:59,810 So let's go check. 1296 01:08:59,810 --> 01:09:01,540 What's it seeing here? 1297 01:09:01,540 --> 01:09:04,270 So if we update that-- and there you see, 1298 01:09:04,270 --> 01:09:08,630 we're sort of now climbing the hill, well over 100,000. 1299 01:09:08,630 --> 01:09:10,470 That was our first insert there. 1300 01:09:10,470 --> 01:09:14,224 And now we're entering the second one here. 1301 01:09:14,224 --> 01:09:16,414 Whoops, don't want to check my email. 1302 01:09:25,770 --> 01:09:26,630 Let's see here. 1303 01:09:26,630 --> 01:09:28,040 So how are we doing there? 1304 01:09:28,040 --> 01:09:28,720 Oh, it's done. 1305 01:09:31,750 --> 01:09:34,979 We may not have even gotten the full rise time of that. 1306 01:09:34,979 --> 01:09:38,240 Yeah, so it basically finished before we could even really hit 1307 01:09:38,240 --> 01:09:41,914 the-- it has a little filter here, has a certain resolution. 1308 01:09:41,914 --> 01:09:44,080 You really need to do an insert for about 10 minutes 1309 01:09:44,080 --> 01:09:46,060 before you can really get a sense of that. 1310 01:09:46,060 --> 01:09:52,120 But there you see, we probably got over 200,000 inserts 1311 01:09:52,120 --> 01:09:57,220 a second using four processes on one node. 1312 01:09:57,220 --> 01:10:01,110 And we could probably keep on going up this ramp. 1313 01:10:01,110 --> 01:10:02,760 For this data set, I'd expect we'd 1314 01:10:02,760 --> 01:10:04,980 be able to get maybe 500,000 inserts a second 1315 01:10:04,980 --> 01:10:08,540 if I kept adding processors and stuff like that. 1316 01:10:08,540 --> 01:10:13,710 And if you look at our data, what do we got here? 1317 01:10:13,710 --> 01:10:18,220 We got like 15 million entries now in our database. 1318 01:10:18,220 --> 01:10:21,010 Again, one of the nice things is, for a typical database, 1319 01:10:21,010 --> 01:10:24,770 a lot of times if you have to re-ingest the whole database, 1320 01:10:24,770 --> 01:10:25,480 that's fine. 1321 01:10:25,480 --> 01:10:28,040 In our cyber program, we have a month of data. 1322 01:10:28,040 --> 01:10:30,410 And we can re-ingest it in a couple hours. 1323 01:10:30,410 --> 01:10:32,640 And that's a very powerful tool to be able to say, 1324 01:10:32,640 --> 01:10:33,390 oh, you know what? 1325 01:10:33,390 --> 01:10:34,860 I didn't like the ingestion. 1326 01:10:34,860 --> 01:10:35,500 That's fine. 1327 01:10:35,500 --> 01:10:37,440 I'll just rewrite the ingestion and redo it.
1328 01:10:37,440 --> 01:10:39,810 And this gives you a very powerful tool 1329 01:10:39,810 --> 01:10:41,900 for exploring your data here. 1330 01:10:41,900 --> 01:10:44,920 So that's kind of what I want to do with that. 1331 01:10:44,920 --> 01:10:49,090 One of our students here very generously gave me 1332 01:10:49,090 --> 01:10:50,440 some Twitter data. 1333 01:10:50,440 --> 01:10:53,870 And so I wanted to show you a little bit with that Twitter 1334 01:10:53,870 --> 01:10:56,180 data, because it's probably maybe a hair more 1335 01:10:56,180 --> 01:11:00,610 meaningful than this abstract Kronecker graph data. 1336 01:11:00,610 --> 01:11:04,230 And by definition, Twitter data is about the most public data 1337 01:11:04,230 --> 01:11:05,130 that one can imagine. 1338 01:11:05,130 --> 01:11:07,450 I think no one who posts to Twitter 1339 01:11:07,450 --> 01:11:10,310 is expecting any sense of privacy there. 1340 01:11:10,310 --> 01:11:14,410 So I think we can use that OK. 1341 01:11:14,410 --> 01:11:18,027 So let's see here. 1342 01:11:18,027 --> 01:11:20,236 Let me exit out of that. 1343 01:11:28,270 --> 01:11:37,027 [INAUDIBLE] desktop, [INAUDIBLE], Twitter. 1344 01:11:40,800 --> 01:11:44,800 And so, basically, just a few examples here-- the first thing 1345 01:11:44,800 --> 01:11:47,640 we did is construct the associative array. 1346 01:11:47,640 --> 01:11:49,507 So let's start up here. 1347 01:11:52,130 --> 01:11:54,660 And I think it was like a million Twitter entries. 1348 01:11:54,660 --> 01:11:56,300 Is that right? 1349 01:11:56,300 --> 01:12:02,289 A million entries-- and how many tweets do you think that was? 1350 01:12:02,289 --> 01:12:03,830 We should be able to find out, right? 1351 01:12:03,830 --> 01:12:05,080 We should be able to find out. 1352 01:12:10,150 --> 01:12:12,302 So let's do the first thing here. 1353 01:12:14,950 --> 01:12:17,744 So it's reading these in, in chunks, 1354 01:12:17,744 --> 01:12:19,535 and writing them out to associative arrays. 1355 01:12:23,234 --> 01:12:24,400 That's going to be annoying, 1356 01:12:24,400 --> 01:12:25,100 isn't it? 1357 01:12:25,100 --> 01:12:26,290 Let's go to a faster system. 1358 01:12:40,610 --> 01:12:42,923 So this is running on the database system. 1359 01:12:49,890 --> 01:12:51,610 And I broke it up into chunks of 10. 1360 01:12:51,610 --> 01:12:54,564 I couldn't quite fit the whole thing on my laptop 1361 01:12:54,564 --> 01:12:55,605 in one associative array. 1362 01:12:55,605 --> 01:12:58,699 So I broke it up into chunks of 10. 1363 01:13:07,169 --> 01:13:08,710 Yeah, see, we're still cruising there. 1364 01:13:12,320 --> 01:13:14,140 So that just reads it all in. 1365 01:13:14,140 --> 01:13:16,190 In fact, we can take a look at that file. 1366 01:13:24,530 --> 01:13:26,600 So I just took that exact same example 1367 01:13:26,600 --> 01:13:28,050 and just put his data in-- so just 1368 01:13:28,050 --> 01:13:30,400 to take a look at what that involved, 1369 01:13:30,400 --> 01:13:33,370 pretty much all the same. 1370 01:13:33,370 --> 01:13:34,970 It was one big file. 1371 01:13:34,970 --> 01:13:37,890 But I couldn't process it. 1372 01:13:37,890 --> 01:13:40,530 I mean, I could read it in. 1373 01:13:40,530 --> 01:13:42,810 He did a great job of turning it into triples. 1374 01:13:42,810 --> 01:13:44,890 And I could easily hold those triples. 1375 01:13:44,890 --> 01:13:48,350 But I couldn't quite construct the associative array.
1376 01:13:48,350 --> 01:13:51,470 And so I basically just go through 1377 01:13:51,470 --> 01:13:56,780 and find all the separators, and then basically take them out 1378 01:13:56,780 --> 01:13:59,160 of the vector in memory. 1379 01:13:59,160 --> 01:14:03,960 And so that's what I'm doing here, is I'm looping over here. 1380 01:14:03,960 --> 01:14:06,480 So, basically, I read in all the data. 1381 01:14:06,480 --> 01:14:08,574 I find all the separators. 1382 01:14:08,574 --> 01:14:09,490 And then I go through. 1383 01:14:09,490 --> 01:14:11,540 And it's a little bit of a messy calculation 1384 01:14:11,540 --> 01:14:14,324 to basically do them in these particular blocks. 1385 01:14:14,324 --> 01:14:16,240 And then I can construct the associative array 1386 01:14:16,240 --> 01:14:18,363 and save those out, no problem. 1387 01:14:22,409 --> 01:14:23,200 And let's see here. 1388 01:14:23,200 --> 01:14:25,870 So what else [INAUDIBLE]. 1389 01:14:25,870 --> 01:14:31,247 We can do a little degree calculation from that data. 1390 01:14:31,247 --> 01:14:32,830 So it's now reading each one of those, 1391 01:14:32,830 --> 01:14:34,480 and computing the degrees. 1392 01:14:40,840 --> 01:14:44,444 [INAUDIBLE] do the same thing on this system. 1393 01:14:44,444 --> 01:14:45,887 It's pretty fast. 1394 01:15:05,530 --> 01:15:11,720 Proceed then to-- let's create some tables. 1395 01:15:11,720 --> 01:15:14,845 So I created a special class of tables for that. 1396 01:15:14,845 --> 01:15:18,100 [INAUDIBLE] modify that [INAUDIBLE] that. 1397 01:15:18,100 --> 01:15:20,896 If you go over here, I think it was on this one. 1398 01:15:23,710 --> 01:15:32,300 Nope, I did it on the other database-- database 1, 1399 01:15:32,300 --> 01:15:35,950 got tables there. 1400 01:15:35,950 --> 01:15:39,810 So all I was doing there was plotting the degree 1401 01:15:39,810 --> 01:15:41,260 distribution. 1402 01:15:41,260 --> 01:15:53,400 So this shows us-- so, for each tweet-- 1403 01:15:53,400 --> 01:16:01,790 let's go to figure one-- are you done? 1404 01:16:01,790 --> 01:16:03,130 It's very Twitter-like. 1405 01:16:03,130 --> 01:16:06,640 No, no one is ever done on Twitter. 1406 01:16:06,640 --> 01:16:09,776 So-- wow, what did I do? 1407 01:16:13,640 --> 01:16:16,440 I went to town, didn't I? 1408 01:16:16,440 --> 01:16:17,610 Done now? 1409 01:16:17,610 --> 01:16:20,130 So we go to figure 1. 1410 01:16:20,130 --> 01:16:23,480 You see, for each tweet, this shows us how much information 1411 01:16:23,480 --> 01:16:24,340 was in each tweet. 1412 01:16:24,340 --> 01:16:26,830 And you see that, on average-- this is because he basically 1413 01:16:26,830 --> 01:16:29,340 parsed out each word uniquely. 1414 01:16:29,340 --> 01:16:33,450 So this shows there is about 1,000 pieces of information 1415 01:16:33,450 --> 01:16:39,030 associated with each tweet, which seems a little high. 1416 01:16:39,030 --> 01:16:40,765 So we should probably double-check that. 1417 01:16:43,990 --> 01:16:45,780 And then what did I do? 1418 01:16:50,304 --> 01:16:51,470 So we loaded all of them up. 1419 01:16:51,470 --> 01:16:53,090 We summed the degrees. 1420 01:16:53,090 --> 01:16:58,090 And then-- oh, I said, show me all the locations with counts 1421 01:16:58,090 --> 01:17:04,790 greater than 100, and then all the words with at signs 1422 01:17:04,790 --> 01:17:07,120 greater than 100, and all the hashtags greater than 50. 1423 01:17:07,120 --> 01:17:09,710 So that's what these other guys are.
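Just to make that blocked construction concrete, a rough sketch-- here rt, ct, and vt are assumed to be the row, column, and value triple lists already in memory, each one a char vector of separator-terminated values:

    sep = rt(end);                                 % the list separator in use
    rs = find(rt == sep);  cs = find(ct == sep);  vs = find(vt == sep);
    cut = @(s, ix, i0, i1) s((1 + (i0 > 1) * ix(max(i0 - 1, 1))) : ix(i1));  % records i0..i1
    Nblk = 10;                                     % ten blocks fit in memory; one big one did not
    bnd = round(linspace(0, numel(rs), Nblk + 1));
    for k = 1:Nblk
        i0 = bnd(k) + 1;  i1 = bnd(k + 1);
        A = Assoc(cut(rt, rs, i0, i1), cut(ct, cs, i0, i1), cut(vt, vs, i0, i1));
        save(sprintf('Atweet_%02d.mat', k), 'A');  % write each chunk out
    end

Each saved block is then small enough to load, degree-sum, and insert on its own.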
1424 01:17:09,710 --> 01:17:11,880 So this was the-- essentially, for each tweet, 1425 01:17:11,880 --> 01:17:14,310 how many do you have? 1426 01:17:14,310 --> 01:17:17,970 If we go to figure 2, this shows the degree distribution 1427 01:17:17,970 --> 01:17:21,144 of all the words and other things in there. 1428 01:17:21,144 --> 01:17:23,310 So there's some guy here who is really, really high. 1429 01:17:23,310 --> 01:17:26,090 In fact, we can find him out. 1430 01:17:26,090 --> 01:17:29,590 But, of course, most things occur only once-- like, 1431 01:17:29,590 --> 01:17:31,499 there's a lot of unique keys. 1432 01:17:31,499 --> 01:17:34,040 There's the message ID, which of course is probably something 1433 01:17:34,040 --> 01:17:35,990 that only appears once. 1434 01:17:35,990 --> 01:17:40,000 If we go to figure 3-- so this just shows the locations. 1435 01:17:40,000 --> 01:17:42,910 And this was tweets from the day before the hurricane, 1436 01:17:42,910 --> 01:17:46,370 or the Wednesday before Hurricane Sandy. 1437 01:17:46,370 --> 01:17:51,690 And so this shows us-- but limited to the New York 1438 01:17:51,690 --> 01:17:53,074 area or something like that? 1439 01:17:53,074 --> 01:17:54,990 AUDIENCE: Yeah, 40 miles around New York City. 1440 01:17:54,990 --> 01:17:57,180 JEREMY KEPNER: 40 miles around New York City. 1441 01:17:57,180 --> 01:18:00,760 Basically, this shows all the locations here. 1442 01:18:00,760 --> 01:18:02,619 So this is a classic example of the kind 1443 01:18:02,619 --> 01:18:04,660 of things you want to do, because the first thing 1444 01:18:04,660 --> 01:18:08,660 that we see is that we have some problems with our data, which 1445 01:18:08,660 --> 01:18:14,420 is 'New York City' and 'New York City' with a trailing space-- 1446 01:18:14,420 --> 01:18:15,910 got to go in and correct all those. 1447 01:18:15,910 --> 01:18:17,610 So that's a classic example. 1448 01:18:17,610 --> 01:18:21,590 This is really what D4M-- it's the number one 1449 01:18:21,590 --> 01:18:23,732 thing that people do with D4M, is 1450 01:18:23,732 --> 01:18:25,190 it's the first time that you really 1451 01:18:25,190 --> 01:18:28,620 can do sums and tallies over the entire data. 1452 01:18:28,620 --> 01:18:30,400 And these things just pop out. 1453 01:18:30,400 --> 01:18:32,980 They stick out like a sore thumb. 1454 01:18:32,980 --> 01:18:34,321 Like, oh, got to correct that. 1455 01:18:34,321 --> 01:18:36,070 You can either correct it in the database, 1456 01:18:36,070 --> 01:18:38,110 or correct it afterwards in the query. 1457 01:18:38,110 --> 01:18:40,580 But that just immediately improves the quality 1458 01:18:40,580 --> 01:18:42,270 of everything else you have. 1459 01:18:42,270 --> 01:18:46,170 And then there's this clutter one, like location 'none'. 1460 01:18:46,170 --> 01:18:48,730 Well, clearly, you'd want to just get rid of that, 1461 01:18:48,730 --> 01:18:51,040 or do something with that, and then just 1462 01:18:51,040 --> 01:18:53,530 plain old normally spelled New York. 1463 01:18:53,530 --> 01:18:55,400 So most people can spell correctly. 1464 01:18:55,400 --> 01:18:57,290 And so we're very good. 1465 01:18:57,290 --> 01:18:59,520 But location, iPhone, what's that about? 1466 01:18:59,520 --> 01:19:05,420 Jersey City-- well, we don't care about that. 1467 01:19:05,420 --> 01:19:09,960 South Jersey-- well, what's-- South Jersey people don't know 1468 01:19:09,960 --> 01:19:12,840 that they're 40 miles from New York, I guess?
1469 01:19:12,840 --> 01:19:14,840 AUDIENCE: It's whatever they have. 1470 01:19:14,840 --> 01:19:16,180 JEREMY KEPNER: In their profile. 1471 01:19:16,180 --> 01:19:19,580 So a lot of people in South Jersey say they're 1472 01:19:19,580 --> 01:19:21,850 from New York. 1473 01:19:21,850 --> 01:19:29,070 So what's her name from Jersey Shore? 1474 01:19:29,070 --> 01:19:31,890 Snooki, right? Snooki says she's actually from New York 1475 01:19:31,890 --> 01:19:35,900 City, not South Jersey. 1476 01:19:35,900 --> 01:19:37,830 So there's a great example of that. 1477 01:19:37,830 --> 01:19:40,890 And then let's see here, figure 4. 1478 01:19:40,890 --> 01:19:44,410 So this just shows all the at signs. 1479 01:19:44,410 --> 01:19:50,290 So you see, basically, damnitstrue 1480 01:19:50,290 --> 01:19:55,310 is like the most-- is this like a retweeted thing or something? 1481 01:19:55,310 --> 01:19:56,150 I don't know. 1482 01:19:56,150 --> 01:19:58,870 What does the at sign mean again? 1483 01:19:58,870 --> 01:20:01,119 AUDIENCE: It's to another user. 1484 01:20:01,119 --> 01:20:02,160 JEREMY KEPNER: To a user. 1485 01:20:02,160 --> 01:20:05,070 So most people tweet to damnitstrue in New York. 1486 01:20:05,070 --> 01:20:06,320 There you go. 1487 01:20:06,320 --> 01:20:12,900 Funny fact, relatable quote, and then Donald Trump, 1488 01:20:12,900 --> 01:20:15,620 the real Donald Trump, and then just word at sign. 1489 01:20:15,620 --> 01:20:19,820 So those are another example-- another here 1490 01:20:19,820 --> 01:20:27,320 is a hilarious idiot, badgalv, a Marilyn Monroe ID. 1491 01:20:27,320 --> 01:20:32,569 So there you go, a lot of fun stuff there on Twitter. 1492 01:20:32,569 --> 01:20:35,110 But this is sort of-- he's going to establish this background, 1493 01:20:35,110 --> 01:20:37,193 and then go back and look at what happened during the hurricane. 1494 01:20:37,193 --> 01:20:42,790 So this is all basically the normal situation, very clearly. 1495 01:20:42,790 --> 01:20:47,160 And then the hashtags-- so we can look at the hashtags. 1496 01:20:47,160 --> 01:20:49,030 So what do we got here? 1497 01:20:49,030 --> 01:20:50,205 Favorite movie quotes. 1498 01:20:50,205 --> 01:20:51,997 AUDIENCE: Favorite movie quotes misspelled. 1499 01:20:51,997 --> 01:20:53,663 JEREMY KEPNER: And favorite movie quotes 1500 01:20:53,663 --> 01:20:55,310 misspelled right up there. 1501 01:20:55,310 --> 01:20:58,840 The Knicks, and then what I love the most, 1502 01:20:58,840 --> 01:21:02,880 and all this type-- team follow back. 1503 01:21:02,880 --> 01:21:06,070 I don't know, team auto-- no, what's this one? 1504 01:21:06,070 --> 01:21:06,670 What's TFB? 1505 01:21:09,260 --> 01:21:12,920 Maybe we don't want to know. 1506 01:21:12,920 --> 01:21:16,460 You can always look it up in Urban Dictionary. 1507 01:21:16,460 --> 01:21:17,500 It's a bad one? 1508 01:21:17,500 --> 01:21:20,100 All right, OK, good, we'll leave it at that, 1509 01:21:20,100 --> 01:21:21,270 won't add that to the video. 1510 01:21:24,870 --> 01:21:28,220 So continuing on here, let's see. 1511 01:21:32,580 --> 01:21:33,560 Well, you get the idea. 1512 01:21:33,560 --> 01:21:36,660 And so all these examples, they work in parallel too; 1513 01:21:36,660 --> 01:21:39,910 you get a lot of speedup, lots of interesting stuff like that. 1514 01:21:39,910 --> 01:21:42,310 But that's the very classic kind of thing you do. 1515 01:21:42,310 --> 01:21:43,670 You get data. 1516 01:21:43,670 --> 01:21:44,600 You parse it.
1517 01:21:44,600 --> 01:21:46,210 You maybe stick it in Matlab files 1518 01:21:46,210 --> 01:21:48,547 to do your initial sweep through it. 1519 01:21:48,547 --> 01:21:50,130 But then if it gets really, really big 1520 01:21:50,130 --> 01:21:52,129 and you want to do more detailed things, then you 1521 01:21:52,129 --> 01:21:54,200 insert it in the database, and can do queries there. 1522 01:21:54,200 --> 01:21:57,827 Leverage using your counts, so that you don't accidentally-- 1523 01:21:57,827 --> 01:22:00,035 you can imagine if we put all the tweets in the world 1524 01:22:00,035 --> 01:22:01,535 in and you had location, New York City. 1525 01:22:01,535 --> 01:22:03,160 And you looked at-- you had to say, give me 1526 01:22:03,160 --> 01:22:04,274 all this set of locations. 1527 01:22:04,274 --> 01:22:05,690 And one of them was New York City. 1528 01:22:05,690 --> 01:22:08,360 You'd be like, oh my God, I've just 1529 01:22:08,360 --> 01:22:11,120 done a query that's going to give me 5% of all the data 1530 01:22:11,120 --> 01:22:11,800 back. 1531 01:22:11,800 --> 01:22:14,044 That's going to just flood your system. 1532 01:22:14,044 --> 01:22:15,960 But if you can just do the count, and be like, 1533 01:22:15,960 --> 01:22:18,050 oh, New York City has got a million entries. 1534 01:22:18,050 --> 01:22:19,480 Don't touch that one. 1535 01:22:19,480 --> 01:22:24,570 Or put an iterator on that one so that I only 1536 01:22:24,570 --> 01:22:27,050 handle it in manageable chunks. 1537 01:22:27,050 --> 01:22:28,440 So I want to thank you. 1538 01:22:28,440 --> 01:22:30,810 So hopefully this was worth it. 1539 01:22:30,810 --> 01:22:32,530 We have one more class, which deals 1540 01:22:32,530 --> 01:22:34,490 with a little bit of wrapping up some 1541 01:22:34,490 --> 01:22:38,590 of the theory on this stuff, and some stuff on performance 1542 01:22:38,590 --> 01:22:39,510 metrics. 1543 01:22:39,510 --> 01:22:42,040 And then in two weeks, for those of you who signed up, 1544 01:22:42,040 --> 01:22:44,760 we have the Accumulo folks coming in showing you 1545 01:22:44,760 --> 01:22:49,130 how to run your own database all day 1546 01:22:49,130 --> 01:22:54,250 on just-- we're setting up a database for you guys 1547 01:22:54,250 --> 01:22:55,450 on LLGrid. 1548 01:22:55,450 --> 01:22:58,130 But you're definitely going to run into Accumulo 1549 01:22:58,130 --> 01:22:59,430 instances with your customers. 1550 01:22:59,430 --> 01:23:01,725 It's good to know some basics about that, 1551 01:23:01,725 --> 01:23:03,100 because a lot of times you're not 1552 01:23:03,100 --> 01:23:06,320 going to have all the nice stuff that we've provided. 1553 01:23:06,320 --> 01:23:08,770 And it's good to know how to set up your own Accumulo 1554 01:23:08,770 --> 01:23:10,460 and interact with that in the field. 1555 01:23:10,460 --> 01:23:12,950 So with that, that brings the lecture to an end. 1556 01:23:12,950 --> 01:23:16,940 And happy to stay for any questions, if anybody has them.