The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I want to thank you all for coming. As the holiday season comes along, I'm guessing that our people are getting distracted with other things, so it's a little bit more of an intimate setting today. I urge people to take advantage of that and ask questions. It's often a lot easier to do when there aren't as many people in the room, so I really want to encourage you to do that. So let's get right into this.

All right. Fortunately the mic doesn't actually pick that up at all for the most part, but you can probably all hear the drilling.

All right, so this is Lecture 4. I went a little out of order to get our special Halloween lecture in last time. But we're going to be talking a little bit more about the analysis of what we call structured data.
And we're doing more structured types of analyses on unstructured data. And of course this is signal processing on databases: this is where we take the ideas of detection theory and apply them to the kinds of data we see in databases, strings, and other types of things.

And so this lecture is going to get into somewhat more sophisticated analytics. I think up to this point we've done fairly simple things: basic queries using the technology, pretty simple correlations, pretty simple stuff. Now we're going to begin to get a little bit more sophisticated in some of the things we do. And again, I think as things get more complicated, the benefits of the D4M technology become even more apparent.

So we're just going to get right into this. I'm going to show you our very generic schema, and talk a little bit more about some of the issues that we encounter when we deal with particular databases. The database we're going to be using for this course is called Accumulo, and so I'll be talking about some issues that are more Accumulo specific.
So Accumulo is a triple store. And the way we read data into our triple store is using what we call this exploded transpose pair schema. So we have-- [INAUDIBLE]. We have an input data set that might look like a table like this, where maybe our row key is going to be somehow based on time, and then we have various columns here, which may or may not contain various data.

And the first thing we do is we basically explode it, because our triple stores can hold an arbitrary number of columns; they can add columns dynamically without any cost. We explode this schema by appending our Column 1 and its value together in this way. The value can then be anything, like a 1, anything that we wouldn't want to search on.

And then we have a row key here. I'll get back a little later into why we've done the row key in this way. By itself this doesn't give us any real advantage. But in Accumulo, or any other triple store, or just in D4M associative arrays, we can store the transpose of the table, which means that all these columns are now rows.
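A minimal sketch of the explode-and-transpose idea may help. The lecture does this with D4M associative arrays bound to Accumulo; here plain Python dicts stand in for the table and its transpose pair, and all names are illustrative:

```python
# Sketch (not D4M itself) of the exploded transpose pair schema using
# plain Python dicts: each dense record is exploded into
# (row, "col|val", "1") triples, and every triple is stored in both
# a forward table and its transpose.

def explode(row_key, record, sep="|"):
    """Turn a dense {col: val} record into exploded triples."""
    return [(row_key, f"{col}{sep}{val}", "1")
            for col, val in record.items() if val]

def insert(table, table_t, triples):
    for row, col, val in triples:
        table.setdefault(row, {})[col] = val    # fast row lookups
        table_t.setdefault(col, {})[row] = val  # fast column lookups

T, Tt = {}, {}
insert(T, Tt, explode("01.02.03-a", {"Col1": "aa", "Col2": "bb"}))
print(T["01.02.03-a"])   # all exploded columns in this row
print(Tt["Col1|aa"])     # all rows containing this column
```

Because every triple lands in both dicts, a lookup by row key and a lookup by exploded column are both single-key accesses, which is exactly the effect the transpose pair gives you in the database.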
And this particular database is a row-oriented database, which means it can look up any row key with constant access. So it can look up row keys very quickly. And in D4M a lot of this is hidden from you, so that whenever you do inserts, if you want it to happen, it will store the transpose for you. And so it looks like you have a giant table where you can look up any row key or any column very, very quickly, which makes it very easy to do very complicated analytics.

One little twist here that you may have noticed is that I have flipped this time field here, taking it from essentially little endian and making it big endian. I've always been an advocate of having row keys that have some meaning in them. Sometimes in databases they just create arbitrary random hashes. I think we have enough random data in the world that we don't necessarily need to create more, and so if we can have meaning in our row keys, I think it's very useful, because it makes it that much easier to debug data that actually has meaning.

And so I've often advocated for having the row key be a timelike key.
I think it's good to have timelike keys. And by itself there's nothing wrong with having a little endian row key, except when you go parallel. If this database is actually running on a parallel system, you can run into some issues. People have run into these issues, which is why I now advocate doing something else with the row key, more like this.

In particular, Accumulo, like many databases, when it goes parallel takes the tables and splits them up, and it splits them up by row keys. So it creates contiguous blocks of row keys on different processors. If you have a little endian time row key, it means every single insert will go to the same processor, and that will create a bottleneck. And then what will happen over time is it will migrate that data to the other processors. So it can be a real bottleneck if you have essentially a little endian row key. If you have a big endian row key, it will break up these things, and then when you insert your data, that will naturally cause it to spread out over all the systems.
That is, if your data is coming in in some kind of time order, which happens. We definitely see data coming in that way: today's data comes in today and tomorrow's data comes in tomorrow. You don't want all that data hitting just one processor or one compute node in your parallel database.

Other databases have more sophisticated distribution schemes they can use, sort of a round-robin or modulo type of thing: they'll create a hash that does a modulo so that it eliminates that type of hotspot. But for now we just recommend that people use this.

This does make it difficult to use the row key as your actual time value. And so what you would want to do is also have a column called time that has the actual time in it. And then you could directly look up a time in that way. So that's just a little nuance there, a good sort of design feature.
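To make the flip concrete, here is a small Python sketch. The field layout and separator are illustrative; the point is only that what the lecture calls the "little endian" key puts the most significant time unit first, so time-ordered inserts share a long prefix and all land on one split, while the flipped key leads with the fast-changing unit:

```python
# Flip a timestamp row key so that the fast-changing field comes first.
# A range-partitioned store (like Accumulo) assigns contiguous key
# ranges to processors, so keys sharing a long prefix hotspot one node.

def flip_time_key(key, sep="."):
    """'2014.11.05.10.30.59' -> '59.30.10.05.11.2014'"""
    return sep.join(reversed(key.split(sep)))

burst = ["2014.11.05.10.30.57",
         "2014.11.05.10.30.58",
         "2014.11.05.10.30.59"]
flipped = [flip_time_key(k) for k in burst]
# The originals share a long common prefix (same split); the flipped
# keys differ in their very first field, so they spread across splits.
print(flipped)
```

The trade-off the lecture mentions follows directly: the flipped key no longer sorts in time order, so you keep the actual time in a separate column for range queries.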
Probably something that you wouldn't run into for quite a while, but we've definitely had people say, hey, I went off and implemented it exactly the way you suggested, but now I'm going parallel and seeing this bottleneck. And I'm like, oh. So I'm going to try and correct that right now and tell people that this is the right way to do it.

So starting out simple, we're just going to talk about one of the simplest analytics you can do, which is getting basic statistical information from your data set. So if this is our data set, now we have a bunch of row keys here, timelike with some additional stuff to make them unique rows. And then we have these various columns here, some of which-- the gray ones, I'm saying-- are filled in, and the white ones are empty.

And so what I want to do is just grab a chunk of rows, and then I'm going to compute basically how many times each column appears, essentially a sum. I'll sum by type.
So I'll show you how we can compute just how many entries there were in column 1 or column 2; computing the covariance, and computing covariances by type and by pair, are all different things that we can do.

So I'm going to do all those statistics on the next slide. All those analytics, if you were to do them in another environment, would actually take a fair amount of code. But here we can do them, and each one of them is essentially a one-liner.

So here's my set of row keys. I've just created a list of row keys that have essentially a comma as the separator here. And this is our table T. In the notation we often just refer to table T. This is a binding to an Accumulo table, or it could be any table really, any database that we support.

And so this little bit of code here says, return me all the rows given by this list of rows here, and all the columns. We're using the MATLAB notation: colon means return all the columns. This will then return these results in the form of an associative array.
Now, since the values of that are strings, in this case maybe string values of 1, we have this little function here called dblLogi, which will basically say: ignore whatever the value is; if it's got an entry, give it a 1, and otherwise ignore it. So this is a shorthand. It basically does a logical, and then it does a double, so we can do math on it. So this is our query that gets us our rows and returns them as an associative array with numeric values.

We then compute the column counts. That's just the MATLAB sum command, where the argument tells you the dimension that's being compressed. So it's compressing the first dimension, the rows; it's basically collapsing it into a row vector. So it's just summing. That tells us, for all those rows, how many occurrences of each unique column, of each column type, there were.

And then we can get the covariance, the type-type covariance, by just doing A transpose A, or sqIn; these do the same thing. This one is usually slightly faster, but not a lot.
And so that just does the column type by column type covariance, very simple. And then finally we have this function, which essentially undoes our exploded schema. Let's say we wanted to return it back to the original dense format, where we have essentially four columns, column 1, 2, 3, and 4, and we want the value put back into the value position. So we have this function, col2type-- and I don't know if the name makes sense or not, but that's what we call it-- and it basically just says, oh, this is the delimiter. So it takes each one of those, splits it back out, and stuffs it back in.

And so then you now have that associative array. You can then do a sum again on that to get, say, just how many column 1 instances, column 2 instances, and column 3 instances there are. And likewise, just doing A transpose A or sqIn would then do the covariance of that, and that would tell you, of those higher-level types, how many there were. So this is a lot of high-level information.
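These one-liners have direct counterparts outside MATLAB/D4M. Here is a Python sketch of the same three analytics on a tiny exploded associative array; the dict-of-dicts layout and the sample data are illustrative, and the comments name the D4M operations each step mirrors (sum, A'*A via sqIn, col2type):

```python
# Python stand-in for the one-liner D4M analytics on an exploded
# associative array A (rows x exploded columns, numeric 0/1 values).
from collections import Counter
from itertools import product

A = {  # what dblLogi(T(rows, :)) would give: 1 wherever an entry exists
    "row1": {"Col1|aa": 1, "Col2|bb": 1},
    "row2": {"Col1|aa": 1, "Col3|cc": 1},
}

# sum(A, 1): collapse the row dimension -> count of each exploded column
col_counts = Counter(c for cols in A.values() for c in cols)

# A' * A (sqIn): column/column co-occurrence counts
cov = Counter()
for cols in A.values():
    for c1, c2 in product(cols, cols):
        cov[(c1, c2)] += 1

# col2type: fold "Col1|aa" back to the base column type "Col1"
type_counts = Counter(c.split("|")[0] for cols in A.values() for c in cols)

print(col_counts["Col1|aa"], cov[("Col1|aa", "Col3|cc")], type_counts["Col1"])
```

The off-diagonal entries of `cov` are exactly the "these two column types appear together" counts the lecture uses to spot bad data: a pair that should co-occur but has a zero count is an immediate red flag.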
I really highly recommend that people do this when they first get their data, because it uncovers a lot of bad data right from the get-go. You'll discover columns that are just like, wow, these two column types never appear together, and that doesn't make sense; there's something wrong there. Or, why is this exploded column so common, when that doesn't make sense? So again, it's very, very useful, and we always recommend people start with this analytic.

So again, that's very simple stuff; we've talked about it before. Let's talk about some more sophisticated types of analytics that we can do here.

So I'm going to build what I call a data graph. This is just a graph that is in the data. It may not actually be a real graph; you may have some other graph in mind, but this is what the data supports as a particular kind of graph.

So what we're going to do here is we're going to set a couple of starting columns here: C0, that'll be our set of starting columns. And we're going to set a set of allowed column types.
So we're going to be interested in certain column types. And we're going to create a set of clutter columns. These are columns that we want to ignore; they're either very, very large or whatever.

And so the basic algorithm is that we're going to get all the columns. Our result is going to be called C1 here, and that's going to be the columns of all rows containing C0 that are of type CT, excluding the clutter columns CL. So this is a rather complicated join type of thing that people often want to do. They want to say: look, I want to get all these rows, but I only care about these particular types of columns, and I want to expressly eliminate certain clutter columns that I know are just pointing everywhere.

So let's go look through the D4M code that does this sort of complicated join.

I'm going to specify my C0 here, and this could be a whole list, a whole list of starting columns. I'm going to specify my column types. In this case I'm going to have two column types.
I create a string using StartsWith, which essentially creates column ranges: one column range around column 1, and one column range around column 3. And then I'm going to specify a clutter column here, which is just this A, and again, it could be a whole list as well.

All right, so step one: I'm assuming that this table is bound to one of these exploded transpose pairs, so it will know, when I give it columns to look up, to point to the correct table. So we have a C0 here, and it will say, all right, please return all the data that contains C0, basically all those rows.

I'm then going to say, now, I don't care about their values, I just want the rows. So this is an associative array, and this command, Row, just says, give me those rows. So basically I had a column, I returned all the rows, but now I just want the row keys. I'm now going to take those row keys and pass them back into the row position. So now I will have gotten the entire row.
So basically I got a column, I took those rows, and now I'm going back in and getting the whole rows. So now I have every single row. And since I don't care about the actual values, I just want them to be numeric, I just use the dblLogi command. So basically I've done a rather complicated little piece of query here, in terms of: get me all rows that contain a certain column.

And so that's now an associative array. I'm then going to reduce to the specific allowed columns. So I'm going to pass in: please give me just the ones of these particular column types. I got the whole row, but now I just want to eliminate the others.

I could probably actually have put this in here, but whether it's more efficient to do it here or here, it's six of one, half a dozen of the other. I try to make my table queries either only columns or only rows; it tends to make things simpler.

So now we do that, and we have just those types. And now I want to eliminate the clutter columns. So I have A, which is just of these types.
And I want to then eliminate any of the columns that are in this, basically, column 1. So I had column 1 as one of my types, but I don't care about that one. And so I can just do subtraction: I can basically say, go get the clutter columns and subtract them from the data that I have. And now I'm just going to get those columns, and I have the set of columns; I now have C1.

So I've completed the analytic I just described in about four lines, a rather sophisticated analytic. I could then proceed to look for additional clutter. For instance, I could then query: please give me those, stick the C1 back in, and then sum it up and look for things that had a lot of values, and continue this process. That's just an example of something I might want to do when I have this data set. So these show you the kinds of more sophisticated analytics that you can do.

I want to talk a little bit about data graphs in terms of what are the things you can do here, what is supported, the topology of your data. Remember, this edge list has direction.
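Before going on to direction, the four-step analytic just completed (rows containing C0, keep allowed types CT, subtract clutter CL) can be sketched in Python. Plain dicts again stand in for the exploded transpose pair, and all table contents and names here are made up for illustration:

```python
# Sketch of the four-step data-graph analytic with plain dicts.

Tt = {  # transpose table: exploded column -> rows containing it
    "Col1|a": {"r1": 1, "r2": 1},
    "Col1|b": {"r1": 1},
    "Col3|x": {"r2": 1},
    "Col2|z": {"r1": 1},
}
T = {}  # forward table: row -> exploded columns
for col, rows_in in Tt.items():
    for r in rows_in:
        T.setdefault(r, {})[col] = 1

C0 = ["Col1|a"]            # starting columns
CT = ("Col1|", "Col3|")    # allowed column-type prefixes (StartsWith)
CL = {"Col1|a"}            # clutter columns to exclude

rows = {r for c in C0 for r in Tt.get(c, {})}   # rows containing C0
A = {r: {c: 1 for c in T[r] if c.startswith(CT)} for r in rows}
C1 = {c for cols in A.values() for c in cols} - CL
print(sorted(C1))
```

The set subtraction at the end plays the role of subtracting the clutter associative array in D4M, and feeding `C1` back in as the next `C0` gives you the iterated search described above.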
When you have graphs, they can very well have edges that have direction. And a lot of times people will say, well, I want to get the whole graph. And I'm like, well, if you're doing things that are essentially based on breadth-first search, you may not be able to get the full graph, because you're never going at it from the correct direction. So you are definitely limited by the natural topology of your data.

So for example here, let's say I start with C0 as column 1A. So I now have essentially this vertex; let's call it A. And now here I say, OK, give me all the rows that contain that. All right, so then these two guys pop up. So I get another A, and I get a B. So I got a B here; that's good.

And then I'm going to proceed, go down again. I'm like, all right, I'm going to now say, give me all the rows that contain those columns. And I go down again, and did I get a C? No, I never got a C. I never got a C in any one of these, even though it's all in the data and probably all connected. I never actually got a C.
There, I got the C. When I did it the second time, I got the C, so there we go. So this is an example of a series of breadth-first searches that result in getting the whole graph, but the graph had this topology and wouldn't naturally admit that right away. So certainly in this particular case the data and the queries were good for this; this is what we would call a star, because essentially it's a vertex with everything going into it.

Let's take a different graph. This is what we call a cycle. So we see we have a little cycle here going like this. We start again with A. We get our columns; we get C1s across here. And that's kind of the end of the game. We get A's, we get a B, but we're not going to get anything else when we come back up. We're not going to get anything else. And so we're basically stuck here at B; we weren't able to get to C or D.

So these are the kind of slightly more subtle things that everyone has to worry about. And once you see it, it's kind of, well, of course I'm not going to get the whole graph.
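A toy model can make the star-versus-cycle behavior concrete. This is a simplified sketch, not the actual slide data: each row is an edge, direction is encoded in made-up exploded column types like `out|A` and `in|A`, and a breadth-first step finds rows containing any current column and then collects every column of those rows:

```python
# Toy model of repeated breadth-first steps over an exploded edge table.
# A column such as "in|B" never matches the "out|B" column of the next
# edge, which is how edge direction limits what a search can reach.

def bfs_columns(table, start_cols, steps=4):
    """table: {row_key: set of exploded columns}. Returns columns found."""
    found = set(start_cols)
    for _ in range(steps):
        rows = {r for r, cols in table.items() if found & cols}
        found |= {c for r in rows for c in table[r]}
    return found

# A "star": every edge row touches the hub vertex A, so one query
# from A's column pulls in the whole graph.
star = {"e1": {"out|B", "in|A"}, "e2": {"out|C", "in|A"}}

# A "cycle" A->B->C->D->A: from A we reach B, then get stuck, because
# the column we got back ("in|B") is not the "out|B" of the next edge.
cycle = {"e1": {"out|A", "in|B"}, "e2": {"out|B", "in|C"},
         "e3": {"out|C", "in|D"}, "e4": {"out|D", "in|A"}}

print(sorted(bfs_columns(star, {"in|A"})))    # reaches B and C
print(sorted(bfs_columns(cycle, {"out|A"})))  # stuck after B
```

Running more steps does not help the cycle case; only querying the other column type (the other edge direction) would let the search continue around the loop.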
430 00:23:10,410 --> 00:23:12,430 But you'd be amazed how many teams were like, 431 00:23:12,430 --> 00:23:14,630 I wanted the whole graph, and I just can't do it. 432 00:23:14,630 --> 00:23:17,930 It's like, well, you don't have the edges going 433 00:23:17,930 --> 00:23:20,181 in the right direction for you to do that. 434 00:23:20,181 --> 00:23:21,680 You're going to have to think things 435 00:23:21,680 --> 00:23:23,410 through a little bit more. 436 00:23:23,410 --> 00:23:27,894 So I just want to-- it's kind of a little catch 437 00:23:27,894 --> 00:23:30,310 that I want to point out to people, because it's something 438 00:23:30,310 --> 00:23:31,351 that people can run into. 439 00:23:35,422 --> 00:23:37,630 We're going to do a little different type of analytic 440 00:23:37,630 --> 00:23:38,130 here. 441 00:23:38,130 --> 00:23:40,270 I've changed some of my columns here. 442 00:23:40,270 --> 00:23:41,232 I have some-- 443 00:23:41,232 --> 00:23:42,440 let's call these coordinates. 444 00:23:42,440 --> 00:23:43,856 I'm going to have now with my data 445 00:23:43,856 --> 00:23:47,680 set an x and a y-coordinate that I'm storing in different rows 446 00:23:47,680 --> 00:23:48,270 and columns. 447 00:23:48,270 --> 00:23:50,800 I want to do some kind of space windowing. 448 00:23:50,800 --> 00:23:55,752 I want to find all data within a particular x and y-coordinate. 449 00:23:58,140 --> 00:23:59,640 So what I'm going to do is I'm going 450 00:23:59,640 --> 00:24:03,340 to select a set of data here, a set of rows. 451 00:24:03,340 --> 00:24:07,140 And I'm going to give a space polygon. 452 00:24:07,140 --> 00:24:11,310 And I'm going to query, get the data. 453 00:24:11,310 --> 00:24:13,930 And then I'm going to extract the space coordinates 454 00:24:13,930 --> 00:24:16,320 from the values there, and I'm gonna 455 00:24:16,320 --> 00:24:23,240 return all columns that are within my space window here. 
456 00:24:23,240 --> 00:24:25,375 And again, this is good for finding columns 457 00:24:25,375 --> 00:24:26,583 within your space window. 458 00:24:29,230 --> 00:24:31,930 If you're concerned that you're going 459 00:24:31,930 --> 00:24:36,150 to be getting an awful lot of extra data-- if you have, 460 00:24:36,150 --> 00:24:38,620 let's say you have a coordinate that goes through New York. 461 00:24:38,620 --> 00:24:40,334 And you're concerned that's just going 462 00:24:40,334 --> 00:24:42,500 to-- you don't want New York, 463 00:24:42,500 --> 00:24:44,450 but you happen to be on the same latitude 464 00:24:44,450 --> 00:24:46,340 and longitude as New York. 465 00:24:46,340 --> 00:24:50,670 You can do something called Mortonization, which basically 466 00:24:50,670 --> 00:24:53,200 is essentially imagine taking your strings 467 00:24:53,200 --> 00:24:56,970 of your coordinates and interleaving them. 468 00:24:56,970 --> 00:25:02,840 And now you've essentially created an ASCII-based grid 469 00:25:02,840 --> 00:25:05,240 of the entire planet. 470 00:25:05,240 --> 00:25:08,980 And so that's a way of, if you want to quickly filter down, 471 00:25:08,980 --> 00:25:11,000 you can get a box and then go back 472 00:25:11,000 --> 00:25:13,200 and do the detailed coordinates, to prevent yourself 473 00:25:13,200 --> 00:25:14,780 from having to do that over everything. 474 00:25:14,780 --> 00:25:16,720 So that's a standard trick. 475 00:25:16,720 --> 00:25:18,960 And there's a variety of Mortonization schemes 476 00:25:18,960 --> 00:25:22,570 that people use for interleaving coordinates in this way. 477 00:25:22,570 --> 00:25:25,574 I think Google Earth has a standard box now as well. 478 00:25:25,574 --> 00:25:27,740 I find this the simplest, because you literally just 479 00:25:27,740 --> 00:25:31,444 take the two strings and interleave them. 
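A minimal sketch of that interleaving idea, in Python rather than D4M. The digit strings are hypothetical coordinates, and note that production Morton schemes usually interleave bits rather than characters:

```python
from itertools import zip_longest

def morton_interleave(x, y, pad="0"):
    """Interleave two coordinate strings character by character,
    x first: x0 y0 x1 y1 ...  Padding the shorter string with a fill
    character is what lets the two precisions vary independently."""
    return "".join(a + b for a, b in zip_longest(x, y, fillvalue=pad))

# Hypothetical digit strings for a latitude and a longitude.
print(morton_interleave("4071", "7400"))  # "47047010"
```

Because nearby points share leading digits, they also share a key prefix after interleaving, which is exactly what turns the coarse bounding-box filter into a cheap range query on the sorted keys.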
480 00:25:31,444 --> 00:25:33,360 And if you have a space, you can actually then 481 00:25:33,360 --> 00:25:35,950 do it with variable precision, because if you just, like, 482 00:25:35,950 --> 00:25:37,360 leave a space-- I don't know. 483 00:25:37,360 --> 00:25:39,290 And it all kind of works out pretty nicely. 484 00:25:39,290 --> 00:25:41,510 And you can read the coordinate right there. 485 00:25:41,510 --> 00:25:42,960 Like, the first one is the first coordinate. 486 00:25:42,960 --> 00:25:45,418 And you can even include the plus and minus signs for lat 487 00:25:45,418 --> 00:25:46,380 and lon if you wanted to. 488 00:25:46,380 --> 00:25:51,831 So maybe not the most efficient scheme, but one that's human 489 00:25:51,831 --> 00:25:52,330 readable. 490 00:25:52,330 --> 00:25:57,670 And I'm a big advocate of human readable types of data schemes. 491 00:25:57,670 --> 00:26:00,180 All right, so let's actually do that analytic. 492 00:26:00,180 --> 00:26:05,340 So again, I created my-- selected my row, 493 00:26:05,340 --> 00:26:10,480 got my x and y-coordinates in those rows, 494 00:26:10,480 --> 00:26:15,540 and then figured out which columns they were that 495 00:26:15,540 --> 00:26:17,990 satisfied that. 496 00:26:17,990 --> 00:26:19,360 So let's do that now in code. 497 00:26:22,480 --> 00:26:24,750 Let's see here. 498 00:26:24,750 --> 00:26:37,960 So we have the-- all right, so we have, 499 00:26:37,960 --> 00:26:39,860 in this case I gave it a row range. 500 00:26:39,860 --> 00:26:42,890 So you can do-- this is what a range query looks like. 501 00:26:42,890 --> 00:26:48,470 If you give either an associative array or a table 502 00:26:48,470 --> 00:26:55,040 something that is a triple, essentially that's a string, 503 00:26:55,040 --> 00:26:57,030 colon, and another string, it will 504 00:26:57,030 --> 00:26:59,030 treat that as a range query. 
505 00:26:59,030 --> 00:27:00,770 And we actually support doing-- if you 506 00:27:00,770 --> 00:27:03,310 have multiple sets of triples, it 507 00:27:03,310 --> 00:27:07,220 should handle that, which is good. 508 00:27:07,220 --> 00:27:13,380 I'm going to specify my bounding box here, essentially a box. 509 00:27:13,380 --> 00:27:16,990 And I happen to do it with complex numbers 510 00:27:16,990 --> 00:27:20,310 just for fun, just because complex numbers are a nice way 511 00:27:20,310 --> 00:27:25,700 to store coordinates on a two dimensional plane. 512 00:27:25,700 --> 00:27:28,010 And Matlab supports them very nicely. 513 00:27:28,010 --> 00:27:30,550 Complex numbers are our friends, so there you go. 514 00:27:30,550 --> 00:27:35,030 But I could have just as easily had a set of x and y vectors. 515 00:27:35,030 --> 00:27:36,920 So I'm going to get all the rows. 516 00:27:36,920 --> 00:27:38,200 So I query that. 517 00:27:38,200 --> 00:27:40,810 Very good, we have that. 518 00:27:40,810 --> 00:27:43,360 And then, so that just gives me that set 519 00:27:43,360 --> 00:27:45,760 of data, all those rows. 520 00:27:45,760 --> 00:27:52,470 I then use my StartsWith with my x and y to get just the columns 521 00:27:52,470 --> 00:27:54,820 of those x's and y's. 522 00:27:54,820 --> 00:27:58,690 And I'm now kind of going to convert those exploded back 523 00:27:58,690 --> 00:28:03,220 into a regular table with this col2type function. 524 00:28:03,220 --> 00:28:06,080 So that basically takes those values. 525 00:28:06,080 --> 00:28:08,640 So it takes those coordinate values, so like we saw. 526 00:28:08,640 --> 00:28:12,930 It takes these coordinate values here like the 0 1, 527 00:28:12,930 --> 00:28:15,205 and it puts it back into the value position. 528 00:28:20,270 --> 00:28:21,810 So now I have, though, it will still 529 00:28:21,810 --> 00:28:23,979 be a string in the value position, which 530 00:28:23,979 --> 00:28:25,520 our associative arrays can handle fine. 
531 00:28:25,520 --> 00:28:28,030 But I now want to really treat it like a number. 532 00:28:28,030 --> 00:28:31,200 So we just have overloaded the standard Matlab str2num 533 00:28:31,200 --> 00:28:34,760 function, which will convert those strings 534 00:28:34,760 --> 00:28:36,090 and will store them back. 535 00:28:36,090 --> 00:28:39,440 You now have an associative array with numbers in it. 536 00:28:39,440 --> 00:28:41,850 So we call this Axy. 537 00:28:41,850 --> 00:28:45,610 And now we can do something. 538 00:28:45,610 --> 00:28:48,252 We basically can extract the x values here. 539 00:28:48,252 --> 00:28:49,710 So we have Axy, and say, all right, 540 00:28:49,710 --> 00:28:54,050 give me the x column, and then Axy and give me the y column. 541 00:28:54,050 --> 00:28:59,300 And Matlab has a built in function called inpolygon, 542 00:28:59,300 --> 00:29:00,920 to which you give a polygon. 543 00:29:00,920 --> 00:29:03,250 So I give it the real and the imaginary parts 544 00:29:03,250 --> 00:29:06,570 of my polygon here S and the x and y-coordinates. 545 00:29:06,570 --> 00:29:10,300 And it will return essentially the value 546 00:29:10,300 --> 00:29:16,240 of whether each point is in there, which is great, 547 00:29:16,240 --> 00:29:19,380 because there are many dissertations written 548 00:29:19,380 --> 00:29:21,070 on the point-in-polygon problem. 549 00:29:21,070 --> 00:29:23,710 And it's nice that we have a nice built in Matlab function 550 00:29:23,710 --> 00:29:25,680 to do that. 551 00:29:25,680 --> 00:29:28,400 And then now I have that, and I can just pass that back 552 00:29:28,400 --> 00:29:31,550 into the original A. So I do find. 553 00:29:31,550 --> 00:29:33,910 This actually returns a logical of zeros and ones. 554 00:29:33,910 --> 00:29:36,970 If I do find, then that will return a set of indices. 555 00:29:36,970 --> 00:29:39,165 And I just pass those indices back into A. 
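For readers without Matlab's inpolygon at hand, the windowing step can be sketched with a standalone even-odd ray-casting test. The box and the point list here are hypothetical, not the lecture's data:

```python
def in_polygon(px, py, poly):
    """Ray-casting point-in-polygon test (even-odd rule), a rough
    analogue of Matlab's inpolygon. `poly` is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            # x-coordinate where this edge crosses the horizontal ray at py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

box = [(0, 0), (4, 0), (4, 4), (0, 4)]        # hypothetical space window
points = [(1, 1), (5, 2), (3, 3.5), (-1, 0)]  # hypothetical extracted x/y pairs
kept = [p for p in points if in_polygon(*p, box)]
print(kept)  # [(1, 1), (3, 3.5)]
```

The list of booleans plays the role of the logical vector in the lecture: indices where it is true are the ones passed back into the original array A.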
556 00:29:39,165 --> 00:29:41,690 And then I can get those columns and there 557 00:29:41,690 --> 00:29:45,990 we go, all very standard Matlab-like syntax. 558 00:29:45,990 --> 00:29:48,550 Again, this is a fairly complicated analytic. 559 00:29:48,550 --> 00:29:51,010 If you were doing this using other technologies, 560 00:29:51,010 --> 00:29:52,950 I mean, I'm sure all of us would be 561 00:29:52,950 --> 00:29:54,200 writing a fair amount of code. 562 00:29:54,200 --> 00:29:55,230 And this is just the kind of thing 563 00:29:55,230 --> 00:29:56,646 that we can do very easily in D4M. 564 00:30:02,290 --> 00:30:04,910 Another analytic, which is probably a bit of a stretch, 565 00:30:04,910 --> 00:30:08,710 but I was just having fun here, is doing convolution 566 00:30:08,710 --> 00:30:12,820 on strings, which is a little odd. 567 00:30:12,820 --> 00:30:15,530 But I gave it a whirl. 568 00:30:15,530 --> 00:30:20,410 So what we want to do is we want to convolve some of our data 569 00:30:20,410 --> 00:30:21,040 with a filter. 570 00:30:21,040 --> 00:30:24,050 I mean, convolving with filters is a standard type of thing. 571 00:30:24,050 --> 00:30:28,300 It's the standard way we do detection here. 572 00:30:28,300 --> 00:30:34,650 And so the way we do that is once again, 573 00:30:34,650 --> 00:30:37,940 I give a list of rows that I want here. 574 00:30:37,940 --> 00:30:40,270 I'm going to create a filter, which is just essentially 575 00:30:40,270 --> 00:30:46,950 a 4-wide box. 576 00:30:46,950 --> 00:30:47,870 So I get my rows. 577 00:30:50,580 --> 00:30:59,500 And then I convert them to numeric. 578 00:30:59,500 --> 00:31:05,162 And I'm going to do my convolution on the x columns. 579 00:31:08,360 --> 00:31:09,730 So let's see here. 580 00:31:09,730 --> 00:31:11,410 So I'm going to get these. 581 00:31:11,410 --> 00:31:13,398 I'm basically getting all the x-coordinates. 
582 00:31:16,270 --> 00:31:21,780 I'm going to sum all of those, so I basically now 583 00:31:21,780 --> 00:31:24,450 have all those. 584 00:31:24,450 --> 00:31:28,840 And now I'm going to pop those back into their values. 585 00:31:28,840 --> 00:31:30,710 And now I can do a convolution. 586 00:31:30,710 --> 00:31:36,890 And this convolution works if one of the axes 587 00:31:36,890 --> 00:31:40,160 is sort of like an integer sequence type of thing. 588 00:31:40,160 --> 00:31:44,310 So you can do-- it tries to extend that naturally. 589 00:31:44,310 --> 00:31:47,890 So something to play around with if you want to do convolutions. 590 00:31:47,890 --> 00:31:51,590 We sort of support it. 591 00:31:51,590 --> 00:31:53,870 And I'm sure if any of you do play around with it, 592 00:31:53,870 --> 00:31:55,700 we would be glad to hear your experiences, 593 00:31:55,700 --> 00:31:58,150 think about how we should extend it. 594 00:31:58,150 --> 00:32:00,690 So these are all sort of basic standard first order 595 00:32:00,690 --> 00:32:03,570 statistical analytics that one can do on data sets. 596 00:32:03,570 --> 00:32:05,460 And we can support them very, very well. 597 00:32:05,460 --> 00:32:07,770 Let's do some more complicated, what I would call second order, 598 00:32:07,770 --> 00:32:08,270 analytics. 599 00:32:12,870 --> 00:32:15,260 So I'm going to do something called-- 600 00:32:15,260 --> 00:32:18,480 it's a complicated join essentially-- 601 00:32:18,480 --> 00:32:20,810 what I call a type pair. 602 00:32:20,810 --> 00:32:23,810 So what I want to do here is I want 603 00:32:23,810 --> 00:32:29,840 to find all rows that contain values-- 604 00:32:29,840 --> 00:32:34,370 I want to find rows that have both a value of type 1 605 00:32:34,370 --> 00:32:36,050 and of type 2. 606 00:32:36,050 --> 00:32:40,290 So I'm going to specify this to be, basically x to be type 1 607 00:32:40,290 --> 00:32:41,950 and y to be type 2. 
608 00:32:41,950 --> 00:32:45,400 And I want to find all data that has 609 00:32:45,400 --> 00:32:47,480 entries in both those very, very standard type 610 00:32:47,480 --> 00:32:48,700 of join type of thing. 611 00:32:51,590 --> 00:32:54,350 And this is done a little bit more complicated 612 00:32:54,350 --> 00:32:56,332 than we need it to be just to show you kind 613 00:32:56,332 --> 00:32:57,540 of some of the richness here. 614 00:32:57,540 --> 00:32:59,750 You can kind of take a fork in any way. 615 00:32:59,750 --> 00:33:02,330 We could probably do this whole thing in about two lines, 616 00:33:02,330 --> 00:33:06,550 but I'm kind of showing you some additional features of D4M 617 00:33:06,550 --> 00:33:09,140 in the spirit of this analytic. 618 00:33:09,140 --> 00:33:11,910 So again, I'm just going to use a range query here. 619 00:33:11,910 --> 00:33:13,680 So I have this range. 620 00:33:13,680 --> 00:33:16,940 I'm going to have my type 1 be starts with x, and my type 2 621 00:33:16,940 --> 00:33:18,760 be starts with y. 622 00:33:18,760 --> 00:33:20,050 So I do my query. 623 00:33:20,050 --> 00:33:23,610 I convert all the string 1's to numeric 1's. 624 00:33:23,610 --> 00:33:25,660 And then what I'm going to do is I'm 625 00:33:25,660 --> 00:33:30,600 going to basically, all right, get me all the columns of type 626 00:33:30,600 --> 00:33:39,490 1, sum them all together, find everything that equals-- 627 00:33:39,490 --> 00:33:42,600 and I only care about the 1's that are exactly equal 1. 628 00:33:42,600 --> 00:33:45,910 So like if I had two x's, I'm like no. 629 00:33:45,910 --> 00:33:46,760 I don't want those. 630 00:33:46,760 --> 00:33:52,220 I want exactly one x in this row. 631 00:33:58,830 --> 00:34:03,240 And then I'm going to take those rows that have exactly one x. 632 00:34:03,240 --> 00:34:05,350 I'm going to pass them back into A. 633 00:34:05,350 --> 00:34:10,469 So I now just get the rows that have exactly one x. 
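The exactly-one-of-each filtering can be sketched in plain Python. The row dictionary below is a hypothetical stand-in for the exploded table, with column names like "x|3" standing for type-1 entries and "y|7" for type-2:

```python
# Hypothetical {row_key: [exploded column names]} view of the table.
rows = {
    "r1": ["x|3", "y|7"],
    "r2": ["x|1", "x|2", "y|5"],   # two x's: rejected
    "r3": ["x|4"],                 # no y: rejected
    "r4": ["x|9", "y|2"],
}

def type_pairs(rows):
    """Keep rows with exactly one type-1 ('x|...') and exactly one
    type-2 ('y|...') column, mirroring the sum-then-filter steps
    described in the lecture."""
    keep = {}
    for r, cols in rows.items():
        xs = [c for c in cols if c.startswith("x|")]
        ys = [c for c in cols if c.startswith("y|")]
        if len(xs) == 1 and len(ys) == 1:  # x count is 1 and total is 2
            keep[r] = (xs[0], ys[0])
    return keep

print(type_pairs(rows))  # {'r1': ('x|3', 'y|7'), 'r4': ('x|9', 'y|2')}
```

The two `len(...) == 1` checks play the same role as the "sum equals exactly 1" and then "sum equals exactly 2" filters in the D4M version.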
634 00:34:10,469 --> 00:34:13,840 I'm going to filter it again with ct1 and ct2, 635 00:34:13,840 --> 00:34:17,400 although I don't need to do that. 636 00:34:17,400 --> 00:34:20,030 Then I'm going to sum it again. 637 00:34:20,030 --> 00:34:22,330 And now I'm going to say, all right, 638 00:34:22,330 --> 00:34:24,380 give me only the ones that are exactly 2. 639 00:34:24,380 --> 00:34:26,650 So I know my x is exactly 1. 640 00:34:26,650 --> 00:34:30,320 So in order for it to be exactly 2, that means my y also 641 00:34:30,320 --> 00:34:32,580 had to have only exactly one entry in it. 642 00:34:32,580 --> 00:34:35,719 So now I have the data that just has 643 00:34:35,719 --> 00:34:38,162 exactly one of each of those. 644 00:34:43,050 --> 00:34:45,670 Now I want to create sort of like a cross-correlation pair 645 00:34:45,670 --> 00:34:48,520 mapping of this. 646 00:34:48,520 --> 00:34:51,656 So I'm actually going to look for x's across columns-- 647 00:34:51,656 --> 00:34:53,280 say, I want to look for x's 648 00:34:53,280 --> 00:34:57,980 that appear with more than one y or a y that 649 00:34:57,980 --> 00:35:00,220 appears with more than one x. 650 00:35:00,220 --> 00:35:02,510 So there's a variety of ways to do that. 651 00:35:02,510 --> 00:35:06,200 Here what I'm doing is-- so I have gotten the rows of A that 652 00:35:06,200 --> 00:35:08,800 have exactly one x and y. 653 00:35:08,800 --> 00:35:13,810 I now pass that back again to get my C, to get the x's again. 654 00:35:13,810 --> 00:35:17,400 And one of the things that we've done that's kind of nice 655 00:35:17,400 --> 00:35:21,190 is we've overloaded the syntax of this query 656 00:35:21,190 --> 00:35:24,030 on our associative arrays and on our table queries 657 00:35:24,030 --> 00:35:26,350 such that if it only has one output argument, 658 00:35:26,350 --> 00:35:28,490 it will return an associative array. 
659 00:35:28,490 --> 00:35:30,650 But if you give it three output arguments, 660 00:35:30,650 --> 00:35:35,067 it will return the triple, so in this case, the row, the column, 661 00:35:35,067 --> 00:35:35,650 and the value. 662 00:35:35,650 --> 00:35:38,060 Now I don't care about the row and the value, 663 00:35:38,060 --> 00:35:39,660 I just care about the column. 664 00:35:39,660 --> 00:35:40,640 But that's a nice way. 665 00:35:40,640 --> 00:35:42,262 We're often going to, in certain cases, 666 00:35:42,262 --> 00:35:43,720 want to bump back and forth between 667 00:35:43,720 --> 00:35:49,336 the triples representation and the associative array 668 00:35:49,336 --> 00:35:49,960 implementation. 669 00:35:49,960 --> 00:35:52,060 Now you can always use the find command 670 00:35:52,060 --> 00:35:56,480 around any associative array, just as you can on normal Matlab 671 00:35:56,480 --> 00:35:59,100 matrices, to return the triples. 672 00:35:59,100 --> 00:36:01,740 The advantage of doing it here is 673 00:36:01,740 --> 00:36:07,220 that it's faster, because what we actually 674 00:36:07,220 --> 00:36:10,020 do when we do the query internally 675 00:36:10,020 --> 00:36:12,940 is we actually get the triples and then convert 676 00:36:12,940 --> 00:36:14,180 to an associative array. 677 00:36:14,180 --> 00:36:15,500 And if you just say I want the triples, 678 00:36:15,500 --> 00:36:17,625 we can just shortcut that and give you the triples 679 00:36:17,625 --> 00:36:18,250 right away. 680 00:36:18,250 --> 00:36:20,250 So sometimes if you're dealing with very large 681 00:36:20,250 --> 00:36:24,960 associative arrays or some operation where you 682 00:36:24,960 --> 00:36:28,080 just want to get some more performance back-- 
683 00:36:28,080 --> 00:36:31,350 Especially if you're like, well, I only care about one thing, 684 00:36:31,350 --> 00:36:33,360 I don't care about all of the values, 685 00:36:33,360 --> 00:36:35,600 it doesn't really need to be a full associative array, 686 00:36:35,600 --> 00:36:39,000 then that's a great way to sort of short circuit that. 687 00:36:39,000 --> 00:36:39,890 So we do that here. 688 00:36:39,890 --> 00:36:42,620 And now we can construct a new associative array, 689 00:36:42,620 --> 00:36:47,000 which is just taking the x's and the y's and creating 690 00:36:47,000 --> 00:36:50,140 a new associative array with those. 691 00:36:50,140 --> 00:36:54,210 And that just shows me the correlations between 692 00:36:54,210 --> 00:36:55,590 the x's and the y's. 693 00:36:55,590 --> 00:37:01,110 And I can then find, in ct, basically x's that have more 694 00:37:01,110 --> 00:37:04,070 than one y-- so I've just summed them there-- 695 00:37:04,070 --> 00:37:06,430 or y's with more than one x. 696 00:37:06,430 --> 00:37:09,600 Again, these are very similar to analytics 697 00:37:09,600 --> 00:37:10,780 that people want to do. 698 00:37:10,780 --> 00:37:12,310 And again, very simple to do. 699 00:37:12,310 --> 00:37:15,740 And again, showing you some of the different types of syntax 700 00:37:15,740 --> 00:37:18,120 that are available to you in D4M. 701 00:37:18,120 --> 00:37:21,340 Again, if you're used to using Matlab, these types of tricks 702 00:37:21,340 --> 00:37:22,770 are very natural. 703 00:37:22,770 --> 00:37:26,210 We're just showing you that they also exist within D4M. 704 00:37:30,380 --> 00:37:32,220 So here's another one. 705 00:37:32,220 --> 00:37:37,310 So I want to find a column pair set C1 and C2, 706 00:37:37,310 --> 00:37:42,090 get all columns C1 and C2, and find the rows that have 707 00:37:42,090 --> 00:37:43,950 just one entry in C1 and C2. 
708 00:37:46,570 --> 00:37:48,620 And it basically checks to see if data pairs are 709 00:37:48,620 --> 00:37:50,110 present in the same row. 710 00:37:50,110 --> 00:37:52,497 Again, something that people often want to do. 711 00:37:52,497 --> 00:37:54,080 You've got a complicated type of join. 712 00:37:54,080 --> 00:37:58,440 So here we have a set of columns C1, a set of columns C2. 713 00:38:01,830 --> 00:38:05,150 I want to create this-- I want to sort of interleave these 714 00:38:05,150 --> 00:38:08,270 together into a pair. 715 00:38:08,270 --> 00:38:10,240 So I want to create some concept of a pair. 716 00:38:10,240 --> 00:38:12,170 And so we have this function here 717 00:38:12,170 --> 00:38:14,530 called CatStr, which basically 718 00:38:14,530 --> 00:38:18,380 will take two strings and another delimiter, 719 00:38:18,380 --> 00:38:21,000 and basically, if they are of the same number of strings, 720 00:38:21,000 --> 00:38:22,950 will just glue them together. 721 00:38:22,950 --> 00:38:24,700 If one of these is just a single string, 722 00:38:24,700 --> 00:38:27,900 it will just essentially prepend or append that. 723 00:38:27,900 --> 00:38:30,120 So for instance, if you are wondering how we actually 724 00:38:30,120 --> 00:38:34,250 create these exploded values like Col1|b, 725 00:38:34,250 --> 00:38:36,290 that's just basically using this function here. 726 00:38:36,290 --> 00:38:39,110 We get the values, we get the columns, 727 00:38:39,110 --> 00:38:41,560 we put essentially the pipe thing in the middle, 728 00:38:41,560 --> 00:38:43,530 and it just merges them together. 729 00:38:43,530 --> 00:38:46,010 So we now sort of interleave these two together. 730 00:38:46,010 --> 00:38:51,100 So we'll now have something like Col1|b comma Col3|b, 731 00:38:51,100 --> 00:38:55,500 with comma as the separator. 732 00:38:55,500 --> 00:38:57,540 So now I can create a set of pair mappings 733 00:38:57,540 --> 00:39:00,020 from C1 to its pairs. 
734 00:39:00,020 --> 00:39:04,090 OK, that's A1 to its pairs and A2 to its pairs. 735 00:39:04,090 --> 00:39:07,830 I can get the columns of those A1 and A2. 736 00:39:07,830 --> 00:39:10,040 And then I can find all the pairs by essentially 737 00:39:10,040 --> 00:39:14,010 going through this combination of matrix multiplies and additions 738 00:39:14,010 --> 00:39:15,010 and so on. 739 00:39:15,010 --> 00:39:18,741 So a very sort of complicated analytic done very nicely. 740 00:39:18,741 --> 00:39:20,740 And then there's a whole bunch of different ones 741 00:39:20,740 --> 00:39:21,650 you can do here. 742 00:39:21,650 --> 00:39:23,687 These are almost semantic extensions. 743 00:39:23,687 --> 00:39:25,770 The columns may have several different types, 744 00:39:25,770 --> 00:39:26,940 and you want to act on that. 745 00:39:26,940 --> 00:39:29,930 So for instance, if I have a pair of columns here, 746 00:39:29,930 --> 00:39:33,260 column 1 and column 3, I could say, well, 747 00:39:33,260 --> 00:39:34,800 that also implies this: 748 00:39:34,800 --> 00:39:36,740 column 3 equals column 1. 749 00:39:36,740 --> 00:39:39,530 That's one kind of sort of pair reversal type of thing. 750 00:39:39,530 --> 00:39:40,550 You'll have extensions. 751 00:39:40,550 --> 00:39:42,133 You might say, look, if I have a column 752 00:39:42,133 --> 00:39:44,210 1A, that also implies that really there 753 00:39:44,210 --> 00:39:47,220 should also be a column 2A, and other types 754 00:39:47,220 --> 00:39:48,130 of things like that. 755 00:39:48,130 --> 00:39:49,505 So these are just types of things 756 00:39:49,505 --> 00:39:50,770 that people do with pairs. 757 00:39:50,770 --> 00:39:53,990 They're often very useful. 758 00:39:53,990 --> 00:39:56,610 And I think that basically brings us 759 00:39:56,610 --> 00:40:01,060 to the end of the lecture portion of class. 
760 00:40:01,060 --> 00:40:02,670 So again, just the exploded schema 761 00:40:02,670 --> 00:40:05,820 really allows you to do this very rapidly with your data. 762 00:40:05,820 --> 00:40:10,530 And you can implement very efficient graph analytics 763 00:40:10,530 --> 00:40:13,670 as a sequence of essentially row and column queries, 764 00:40:13,670 --> 00:40:21,250 because we use this very special exploded transpose pair schema. 765 00:40:21,250 --> 00:40:24,860 And increasingly as you become more and more skilled 766 00:40:24,860 --> 00:40:27,240 with this, you will discover that many, many, many 767 00:40:27,240 --> 00:40:31,400 of your analytics really reduce to matrix-matrix multiplies. 768 00:40:31,400 --> 00:40:33,910 That matrix-matrix multiply really 769 00:40:33,910 --> 00:40:38,030 captures sort of all the correlation 770 00:40:38,030 --> 00:40:43,480 that you want to do without having to kind of figure things 771 00:40:43,480 --> 00:40:45,020 out. 772 00:40:45,020 --> 00:40:47,030 All right, so I'm now going to go and show some-- 773 00:40:47,030 --> 00:40:49,260 not these specific analytics, but some analytics 774 00:40:49,260 --> 00:40:52,290 that are more sophisticated based on the Reuters data set. 775 00:40:52,290 --> 00:40:54,720 If you remember a few weeks ago, we 776 00:40:54,720 --> 00:40:58,490 worked with the Reuters data set. 777 00:40:58,490 --> 00:41:00,980 And so let's see here. 778 00:41:00,980 --> 00:41:05,120 So we already did the entity analysis application 779 00:41:05,120 --> 00:41:05,790 a few weeks ago. 780 00:41:05,790 --> 00:41:07,940 I'm going to now do basically what 781 00:41:07,940 --> 00:41:09,430 happens when you construct tracks, 782 00:41:09,430 --> 00:41:12,470 which is a more sophisticated structured analytic. 783 00:41:12,470 --> 00:41:17,690 And the assignment I'll send out is basically 784 00:41:17,690 --> 00:41:19,660 doing more cross correlations. 
785 00:41:19,660 --> 00:41:22,626 For those of you who have kept it going here this far 786 00:41:22,626 --> 00:41:24,000 and continue to do the homeworks, 787 00:41:24,000 --> 00:41:25,830 I'll send this homework out to you, 788 00:41:25,830 --> 00:41:27,920 which is basically just cross correlating 789 00:41:27,920 --> 00:41:29,200 the data sets that you have. 790 00:41:29,200 --> 00:41:33,720 Again, it's not an assignment that really 791 00:41:33,720 --> 00:41:36,930 requires you having done the previous assignments. 792 00:41:36,930 --> 00:41:40,530 Just take any data set, pull it into an associative array, 793 00:41:40,530 --> 00:41:43,960 and then do matrix multiplies to figure out 794 00:41:43,960 --> 00:41:46,719 the cross correlations and what they mean. 795 00:41:46,719 --> 00:41:48,260 All right, so with that, why don't we 796 00:41:48,260 --> 00:41:50,530 take a short five minute break. 797 00:41:50,530 --> 00:41:53,450 And then I'll come back and show you the demo.
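The cross-correlation pattern the homework asks for, multiplying a 0/1 incidence array against its transpose, can be sketched in plain Python with a sparse co-occurrence count. The document and entity names below are made up for illustration:

```python
# Hypothetical 0/1 incidence "matrix": rows are documents,
# columns are the entities each document mentions.
A = {
    "doc1": {"alice", "bob"},
    "doc2": {"alice", "carol"},
    "doc3": {"alice", "bob"},
}

def cross_correlate(A):
    """Sparse analogue of the transpose-times-A matrix-matrix multiply:
    counts how many rows each pair of columns co-occurs in.  The diagonal
    entries (c, c) are just each column's total row count."""
    counts = {}
    for cols in A.values():
        for c1 in cols:
            for c2 in cols:
                counts[(c1, c2)] = counts.get((c1, c2), 0) + 1
    return counts

cc = cross_correlate(A)
print(cc[("alice", "bob")])  # 2: alice and bob co-occur in doc1 and doc3
```

Pairs that never share a row simply never appear in the result, which is the same sparsity the associative-array multiply exploits.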