1 00:00:00,040 --> 00:00:02,410 The following content is provided under a Creative 2 00:00:02,410 --> 00:00:03,790 Commons license. 3 00:00:03,790 --> 00:00:06,030 Your support will help MIT OpenCourseWare 4 00:00:06,030 --> 00:00:10,110 continue to offer high quality educational resources for free. 5 00:00:10,110 --> 00:00:12,680 To make a donation, or to view additional materials 6 00:00:12,680 --> 00:00:16,496 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,496 --> 00:00:17,120 at ocw.mit.edu. 8 00:00:21,362 --> 00:00:27,030 JEREMY KEPNER: All right, so we're doing lecture 06 9 00:00:27,030 --> 00:00:28,120 in the course today. 10 00:00:28,120 --> 00:00:31,240 That's in the-- just remind people it's 11 00:00:31,240 --> 00:00:35,350 in the docs directory here. 12 00:00:35,350 --> 00:00:39,660 And I'm going to be spending a lot of time talking 13 00:00:39,660 --> 00:00:43,060 about a particular application, which I think is actually 14 00:00:43,060 --> 00:00:45,690 representative of a variety of applications 15 00:00:45,690 --> 00:00:51,460 that we do-- a lot of things have statistical properties similar 16 00:00:51,460 --> 00:00:56,826 to a sequence cross correlation. 17 00:00:56,826 --> 00:00:58,130 OK. 18 00:00:58,130 --> 00:01:02,834 So diving right in. 19 00:01:02,834 --> 00:01:04,250 Just going to give an introduction 20 00:01:04,250 --> 00:01:07,650 to this particular problem of genetic sequence analysis 21 00:01:07,650 --> 00:01:14,870 from a computational perspective and how the D4M technology can 22 00:01:14,870 --> 00:01:19,160 really make it pretty easy to do the kinds of things 23 00:01:19,160 --> 00:01:20,890 that people like to do. 24 00:01:20,890 --> 00:01:23,260 And then just talk about the pipeline 25 00:01:23,260 --> 00:01:26,930 that we built to implement this system because, as I've 26 00:01:26,930 --> 00:01:32,360 said before, when you're dealing with this type of large data, 27 00:01:32,360 --> 00:01:34,240 the D4M technology is one piece. 28 00:01:34,240 --> 00:01:37,270 It's often the piece you use to prototype your algorithms. 29 00:01:37,270 --> 00:01:40,730 But to be a part of a system, you usually 30 00:01:40,730 --> 00:01:42,824 have to stitch together a variety of technologies, 31 00:01:42,824 --> 00:01:44,990 databases obviously being an important part of that. 32 00:01:47,860 --> 00:01:50,400 So this is a great chart. 33 00:01:50,400 --> 00:01:58,090 This is the relative cost per DNA sequence over time 34 00:01:58,090 --> 00:02:03,117 here, over the last 20 years. 35 00:02:03,117 --> 00:02:04,700 So we're getting a little cut off here 36 00:02:04,700 --> 00:02:06,760 at the bottom of the screen. 37 00:02:06,760 --> 00:02:09,930 So I think I'm-- hmm. 38 00:02:09,930 --> 00:02:14,780 So you know, this just shows Moore's law, 39 00:02:14,780 --> 00:02:17,410 so we all know that that technology has been increasing 40 00:02:17,410 --> 00:02:18,840 at an incredible rate. 41 00:02:18,840 --> 00:02:24,010 And as we've seen, the cost of DNA sequencing 42 00:02:24,010 --> 00:02:25,630 is going down dramatically. 43 00:02:25,630 --> 00:02:29,790 So for the first DNA sequence, people nominally 44 00:02:29,790 --> 00:02:32,440 say that to sequence the first human genome 45 00:02:32,440 --> 00:02:34,710 was around a billion dollars. 46 00:02:34,710 --> 00:02:45,540 And they're expecting it to be $100 within the next few years.
47 00:02:45,540 --> 00:02:49,620 So having your DNA sequenced will 48 00:02:49,620 --> 00:02:55,010 become probably a fairly routine medical activity 49 00:02:55,010 --> 00:02:59,080 in the next decade. 50 00:02:59,080 --> 00:03:04,100 And so the data generated, you know, 51 00:03:04,100 --> 00:03:08,460 typically a human genome will have billions 52 00:03:08,460 --> 00:03:11,690 of DNA sequences in it. 53 00:03:11,690 --> 00:03:12,996 And that's a lot of data. 54 00:03:15,770 --> 00:03:20,510 What's actually perhaps even more interesting 55 00:03:20,510 --> 00:03:24,001 than sequencing your DNA is sequencing 56 00:03:24,001 --> 00:03:25,500 the DNA of all the other things that 57 00:03:25,500 --> 00:03:29,420 are in you, which is sometimes called the metagenome. 58 00:03:29,420 --> 00:03:35,700 So take a swab and not just get the DNA-- 59 00:03:35,700 --> 00:03:38,280 your DNA, but also of all the other things that 60 00:03:38,280 --> 00:03:43,260 are a part of you, which can be ten times larger than your DNA. 61 00:03:43,260 --> 00:03:46,420 So depending on that, so they now 62 00:03:46,420 --> 00:03:48,690 have developed these high volume sequencers. 63 00:03:48,690 --> 00:03:53,818 Here's an example of one that I believe can do 600. 64 00:03:53,818 --> 00:03:54,990 AUDIENCE: [INAUDIBLE] 65 00:03:54,990 --> 00:03:55,770 JEREMY KEPNER: OK. 66 00:03:55,770 --> 00:03:56,590 No problem. 67 00:03:56,590 --> 00:04:03,960 That 600 billion base pairs a day, 68 00:04:03,960 --> 00:04:07,850 so like 600 gigabytes of data a day. 69 00:04:07,850 --> 00:04:11,610 And this is all data that you want to cross correlate. 70 00:04:11,610 --> 00:04:15,100 I mean, it's your-- it's a-- so that's 71 00:04:15,100 --> 00:04:20,470 what this-- this is a table top, a table top apparatus here 72 00:04:20,470 --> 00:04:23,150 that sells for a few hundred thousand dollars. 73 00:04:23,150 --> 00:04:26,770 And they are even getting into portable sequencers 74 00:04:26,770 --> 00:04:30,100 that you can plug in with a USB connection 75 00:04:30,100 --> 00:04:32,340 into a laptop or something like that. 76 00:04:32,340 --> 00:04:36,370 It-- to do more in the field types of things. 77 00:04:36,370 --> 00:04:39,010 So why would you want to do this? 78 00:04:39,010 --> 00:04:44,250 I think abstractly to understand all that would be good. 79 00:04:44,250 --> 00:04:47,900 Computation plays a huge role because this data is collected. 80 00:04:47,900 --> 00:04:52,630 And it's just sort of abstract snippets of DNA, you know? 81 00:04:52,630 --> 00:04:56,030 Just even assembling them into a-- just your DNA 82 00:04:56,030 --> 00:05:00,050 into a whole process can take a fair amount of computation. 83 00:05:00,050 --> 00:05:02,060 And right now, that is actually something that 84 00:05:02,060 --> 00:05:04,280 takes a fair amount of time. 85 00:05:04,280 --> 00:05:09,060 And so to give you an example, here's a great use case. 86 00:05:09,060 --> 00:05:12,110 This shows, if you recall, in the summer of 2011 87 00:05:12,110 --> 00:05:16,610 there was a virulent E. coli outbreak in Germany. 88 00:05:16,610 --> 00:05:18,162 And not to single out the Germans. 89 00:05:18,162 --> 00:05:19,620 We've certainly had the same things 90 00:05:19,620 --> 00:05:20,840 occur in the United States. 91 00:05:20,840 --> 00:05:24,150 And these occur all across the world. 92 00:05:24,150 --> 00:05:29,880 And so, you know, this shows kind of the time course in May. 93 00:05:29,880 --> 00:05:32,540 You know, the first cases starting appear. 
94 00:05:32,540 --> 00:05:35,990 And then those lead to the first deaths. 95 00:05:35,990 --> 00:05:37,990 And then it spikes. 96 00:05:37,990 --> 00:05:40,340 And then that's just kind of when 97 00:05:40,340 --> 00:05:43,105 you hit this peak is when they really identify the outbreak. 98 00:05:43,105 --> 00:05:44,480 And then they finally figured out 99 00:05:44,480 --> 00:05:48,960 what the-- what it is that's causing people 100 00:05:48,960 --> 00:05:50,340 and begin to remediate it. 101 00:05:50,340 --> 00:05:53,920 But, you know, until you kind of really have this portion, 102 00:05:53,920 --> 00:05:58,430 people are still getting exposed to the thing, usually 103 00:05:58,430 --> 00:06:01,790 before they actually nail it down. 104 00:06:01,790 --> 00:06:03,900 There's lots of rumors flying around. 105 00:06:03,900 --> 00:06:08,920 All other parts of the food chain are disrupted. 106 00:06:08,920 --> 00:06:13,430 You know, any single time a particular product 107 00:06:13,430 --> 00:06:15,750 is implicated, that's hundreds of millions 108 00:06:15,750 --> 00:06:18,910 of dollars of lost business as people just 109 00:06:18,910 --> 00:06:21,810 basically-- you know, they say it's spinach. 110 00:06:21,810 --> 00:06:24,530 Then everyone stops buying spinach for a while. 111 00:06:24,530 --> 00:06:26,200 And, oh, it wasn't spinach. 112 00:06:26,200 --> 00:06:26,910 Sorry. 113 00:06:26,910 --> 00:06:28,650 It was something else. 114 00:06:28,650 --> 00:06:33,180 And so that's-- so there's a dual-- you know, 115 00:06:33,180 --> 00:06:35,960 so they started by implicating this, the cucumbers, 116 00:06:35,960 --> 00:06:37,320 but that wasn't quite right. 117 00:06:37,320 --> 00:06:38,810 They've then sequenced the stuff. 118 00:06:38,810 --> 00:06:41,952 And then they correctly identified it was the sprouts. 119 00:06:41,952 --> 00:06:44,410 At least I believe that was the time course of events here. 120 00:06:44,410 --> 00:06:49,900 So-- and this is sort of the integrated number of deaths. 121 00:06:49,900 --> 00:06:53,060 And so, you know, the story here is obviously 122 00:06:53,060 --> 00:06:56,190 the thing we want to do most is when a person gets sick 123 00:06:56,190 --> 00:06:59,130 here or here, wouldn't it be great to sequence them 124 00:06:59,130 --> 00:07:02,050 immediately, get that information, 125 00:07:02,050 --> 00:07:04,810 know exactly what's causing the problem, 126 00:07:04,810 --> 00:07:09,670 and then be able to start testing the food supply channel 127 00:07:09,670 --> 00:07:12,570 so that you can make a real impact on the mortality? 128 00:07:12,570 --> 00:07:15,060 And then likewise, not have the economic-- 129 00:07:15,060 --> 00:07:18,060 you know, obviously the loss of life 130 00:07:18,060 --> 00:07:23,720 is the preeminent issue here. 131 00:07:23,720 --> 00:07:25,780 But there's also the economic impact, 132 00:07:25,780 --> 00:07:28,410 which certainly the people who are in those businesses 133 00:07:28,410 --> 00:07:30,110 would want to address. 134 00:07:30,110 --> 00:07:31,920 So as you can see, there was a really sort 135 00:07:31,920 --> 00:07:34,280 of a rather long delay here between sort 136 00:07:34,280 --> 00:07:39,200 of when the outbreak started and the DNA sequence released. 
137 00:07:39,200 --> 00:07:43,230 And this was actually a big step forward in the sense 138 00:07:43,230 --> 00:07:45,915 that DNA sequencing really did play-- ended up 139 00:07:45,915 --> 00:07:49,370 playing a role in this process as opposed 140 00:07:49,370 --> 00:07:52,410 to previously where it may not have. 141 00:07:52,410 --> 00:07:56,450 And in the-- and people see that now 142 00:07:56,450 --> 00:08:00,570 and they would love to move this earlier. 143 00:08:00,570 --> 00:08:03,420 So, you know, and obviously in our business 144 00:08:03,420 --> 00:08:05,800 and across, it's not just this type of example. 145 00:08:05,800 --> 00:08:07,550 But there's other types of examples 146 00:08:07,550 --> 00:08:12,870 where rapid sequencing and identification would 147 00:08:12,870 --> 00:08:13,700 be very important. 148 00:08:13,700 --> 00:08:15,158 And there are certainly investments 149 00:08:15,158 --> 00:08:18,170 being made to try and make that more possible. 150 00:08:18,170 --> 00:08:21,610 So an example of what the processing timeline looks 151 00:08:21,610 --> 00:08:23,410 like now, I mean, you're basically starting 152 00:08:23,410 --> 00:08:24,860 with the human infection. 153 00:08:24,860 --> 00:08:26,690 And it could be a natural disease. 154 00:08:26,690 --> 00:08:29,430 Obviously in the DOD, they're very concerned about bioweapons 155 00:08:29,430 --> 00:08:31,070 as well. 156 00:08:31,070 --> 00:08:34,740 So there's the collection of the sample, the preparation, 157 00:08:34,740 --> 00:08:37,470 the analysis, and sort of the overall time 158 00:08:37,470 --> 00:08:39,120 to actionable data. 159 00:08:39,120 --> 00:08:42,470 And really, it's not to say processing is everything 160 00:08:42,470 --> 00:08:44,450 on this, but as part of a whole system, 161 00:08:44,450 --> 00:08:48,540 you could imagine if you could do on-site collection, 162 00:08:48,540 --> 00:08:53,210 automatic preparation, and then very quick analysis, 163 00:08:53,210 --> 00:08:55,250 you could imagine shortening this cycle down 164 00:08:55,250 --> 00:08:57,960 to one day, which would be something that would really 165 00:08:57,960 --> 00:09:01,330 make a huge impact. 166 00:09:01,330 --> 00:09:06,090 Some of the other useful sequences, useful-- I'm sorry, 167 00:09:06,090 --> 00:09:11,690 roles for DNA sequence matching are quickly comparing two data-- 168 00:09:11,690 --> 00:09:12,646 two sets of DNA. 169 00:09:16,715 --> 00:09:22,190 Identification, that is, who is it? 170 00:09:22,190 --> 00:09:25,360 Analysis of mixtures, you know, what type of things 171 00:09:25,360 --> 00:09:29,350 could you determine if someone was related to somebody else? 172 00:09:29,350 --> 00:09:31,085 Ancestry analysis, which can be used 173 00:09:31,085 --> 00:09:32,960 in disease outbreaks, criminal investigation, 174 00:09:32,960 --> 00:09:34,460 and personal medicine. 175 00:09:34,460 --> 00:09:39,930 You know, the set of things is pretty large here. 176 00:09:39,930 --> 00:09:43,250 So I'm now going to explain to you kind of fundamentally what 177 00:09:43,250 --> 00:09:45,810 is the algorithm that we use for doing this matching, 178 00:09:45,810 --> 00:09:48,394 but we're going to explain it in terms of the mathematics 179 00:09:48,394 --> 00:09:49,810 that we've described before, which 180 00:09:49,810 --> 00:09:51,770 is these associative arrays, which actually 181 00:09:51,770 --> 00:09:55,310 make it very, very easy to describe what's going on here.
182 00:09:55,310 --> 00:09:58,350 If I was to describe to you the traditional approaches for how 183 00:09:58,350 --> 00:10:00,150 we do DNA sequence matching, 184 00:10:00,150 --> 00:10:01,690 it would actually be-- that would 185 00:10:01,690 --> 00:10:05,690 be a whole lecture in itself. 186 00:10:05,690 --> 00:10:07,520 So let me get into that algorithm. 187 00:10:07,520 --> 00:10:09,700 And so basically this is it. 188 00:10:09,700 --> 00:10:13,740 On one slide is how we do DNA sequence matching. 189 00:10:13,740 --> 00:10:18,210 So we have a reference sequence here. 190 00:10:18,210 --> 00:10:20,680 This is something that we know, a database 191 00:10:20,680 --> 00:10:22,190 of data that we know. 192 00:10:22,190 --> 00:10:27,990 And it consists of a sequence ID and then a whole bunch 193 00:10:27,990 --> 00:10:30,930 of what are called base pairs. 194 00:10:30,930 --> 00:10:36,430 And this can usually be several hundred long. 195 00:10:36,430 --> 00:10:38,840 And you'll have thousands of these, 196 00:10:38,840 --> 00:10:44,730 each that are a few hundred, maybe 1,000 base pairs long. 197 00:10:44,730 --> 00:10:51,820 And so the standard approach to this is to take these sequences 198 00:10:51,820 --> 00:10:53,970 and break them up into smaller units, which 199 00:10:53,970 --> 00:10:57,430 are called words or mers. 200 00:10:57,430 --> 00:11:00,419 And a standard number is called a 10mer. 201 00:11:00,419 --> 00:11:01,960 So they're basically-- what you would 202 00:11:01,960 --> 00:11:05,070 say is you take the first ten letters, and you say, 203 00:11:05,070 --> 00:11:05,920 "All right. 204 00:11:05,920 --> 00:11:08,100 That's one 10mer." 205 00:11:08,100 --> 00:11:10,460 Then you move it over one, and you 206 00:11:10,460 --> 00:11:14,080 say that's another 10mer, and so on and so forth. 207 00:11:14,080 --> 00:11:19,780 So, if this was a sequence 400 long, 208 00:11:19,780 --> 00:11:22,940 you would have 400 10mers, OK? 209 00:11:22,940 --> 00:11:25,560 And then you're obviously multiplying the total data 210 00:11:25,560 --> 00:11:29,120 volume by a factor of ten because of this thing. 211 00:11:29,120 --> 00:11:31,310 And so for those of us who know signal processing, 212 00:11:31,310 --> 00:11:34,160 this is just standard filtering. 213 00:11:34,160 --> 00:11:35,280 Nothing new here. 214 00:11:35,280 --> 00:11:38,020 Very, very standard type of filtering approach. 215 00:11:38,020 --> 00:11:45,250 So then what we do is for each sequence ID, OK, 216 00:11:45,250 --> 00:11:50,850 this forms the row key of an associative array. 217 00:11:50,850 --> 00:11:55,980 And each 10mer of that sequence forms 218 00:11:55,980 --> 00:11:58,450 a column key of that associative array. 219 00:11:58,450 --> 00:12:01,680 So you can see here each of these rows 220 00:12:01,680 --> 00:12:08,680 shows you all the unique 10mers that appeared in that sequence. 221 00:12:08,680 --> 00:12:11,900 And then a column is a particular 10mer. 222 00:12:11,900 --> 00:12:16,940 So as we can see, certain 10mers appear very commonly. 223 00:12:16,940 --> 00:12:20,400 And some appear in a not so common way. 224 00:12:20,400 --> 00:12:22,510 So that gives us an associative array, 225 00:12:22,510 --> 00:12:26,690 which is a sparse matrix where we have sequence ID by 10mer. 226 00:12:26,690 --> 00:12:28,850 And then we do the same thing with the collection. 227 00:12:28,850 --> 00:12:31,890 So we've collected, in this case, some unknown bacteria 228 00:12:31,890 --> 00:12:32,870 sample.
229 00:12:32,870 --> 00:12:35,550 And we have a similar set of sequences here. 230 00:12:35,550 --> 00:12:36,890 And we do the exact same thing. 231 00:12:36,890 --> 00:12:39,820 We have a sequence ID and then the 10mer. 232 00:12:39,820 --> 00:12:42,270 And then what we want to do is cross correlate, 233 00:12:42,270 --> 00:12:44,710 find all the matches between them. 234 00:12:44,710 --> 00:12:48,560 And so that's just done with matrix multiply. 235 00:12:48,560 --> 00:12:52,280 So A1 times A2 transpose will then result 236 00:12:52,280 --> 00:12:54,740 in a new matrix, which is reference sequence 237 00:12:54,740 --> 00:12:57,580 ID by unknown sequence ID. 238 00:12:57,580 --> 00:12:59,850 And then it will then-- the value in here 239 00:12:59,850 --> 00:13:03,090 will be how many matches they had. 240 00:13:03,090 --> 00:13:07,280 And so generally, if, say, 400 was the maximum possible match, 241 00:13:07,280 --> 00:13:11,880 then you would be looking for things well above 30 or 40 242 00:13:11,880 --> 00:13:13,920 matches between them. 243 00:13:13,920 --> 00:13:20,010 For a true match, maybe 50%, 60%, 70% of them match. 244 00:13:20,010 --> 00:13:23,750 And so then you can just apply a threshold to this 245 00:13:23,750 --> 00:13:26,110 to deter-- to find the true, true matches. 246 00:13:26,110 --> 00:13:28,040 So very simple. 247 00:13:28,040 --> 00:13:31,430 And, you know, there are large software packages out there 248 00:13:31,430 --> 00:13:32,170 that do this. 249 00:13:32,170 --> 00:13:33,670 They essentially do this algorithm 250 00:13:33,670 --> 00:13:36,450 with various twists and variations and stuff like that. 251 00:13:36,450 --> 00:13:38,750 But here we can explain the whole thing 252 00:13:38,750 --> 00:13:41,120 knowing the mathematics of associative arrays 253 00:13:41,120 --> 00:13:43,890 on one slide. 254 00:13:43,890 --> 00:13:50,670 So this calculation is what we call a direct match 255 00:13:50,670 --> 00:13:51,340 calculation. 256 00:13:51,340 --> 00:13:54,970 We're literally comparing every single sequence's 10mers 257 00:13:54,970 --> 00:13:55,995 with all the other ones. 258 00:13:58,870 --> 00:14:03,660 And this is essentially what sequence comparison does. 259 00:14:03,660 --> 00:14:07,930 And it takes a lot of computation to do this. 260 00:14:07,930 --> 00:14:12,710 If you have millions of sequence IDs on one side and millions 261 00:14:12,710 --> 00:14:15,120 of sequence IDs on the other, this very quickly 262 00:14:15,120 --> 00:14:20,890 becomes a large amount of computational effort. 263 00:14:20,890 --> 00:14:24,750 So we are, of course, interested in other techniques that 264 00:14:24,750 --> 00:14:27,140 could possibly accelerate this. 265 00:14:27,140 --> 00:14:31,110 So one of the things we're able to do using the Accumulo 266 00:14:31,110 --> 00:14:34,260 database is ingest this entire set 267 00:14:34,260 --> 00:14:38,850 of data as an associative array into the database. 268 00:14:38,850 --> 00:14:43,790 And using Accumulo's tally features, 269 00:14:43,790 --> 00:14:45,730 have it essentially, as we do the ingestion, 270 00:14:45,730 --> 00:14:50,180 automatically tally the counts for each 10mer. 271 00:14:50,180 --> 00:14:51,720 So we can then essentially construct 272 00:14:51,720 --> 00:14:54,680 a histogram of all the 10mers. 273 00:14:54,680 --> 00:14:59,070 Some 10mers will appear in a very large number of sequences, 274 00:14:59,070 --> 00:15:03,080 and some 10mers will appear in not very many.
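Before getting to that histogram, here is a minimal D4M-style sketch of the direct match calculation described above. The sequence, the sequence ID, and the thresholds are illustrative values only, and A2 (the sample array) would be built the same way from the collected data.

seq = 'ATGCCGTAATGCCGTAATGC';       % one toy reference sequence
k   = 10;                           % 10mer window length
nw  = length(seq) - k + 1;          % number of sliding windows
mer = cell(1, nw);
for i = 1:nw
  mer{i} = seq(i:i+k-1);            % slide the window over by one base
end
r  = repmat('refSeq001,', 1, nw);   % row key: the sequence ID, repeated
c  = sprintf('%s,', mer{:});        % column keys: the 10mers themselves
A1 = Assoc(r, c, 1);                % reference array: sequence ID x 10mer
% With A2 built the same way from the sample, the correlation and the
% threshold are one line each:
%   AA     = A1 * A2.';             % counts of shared 10mers per ID pair
%   strong = AA > 40;               % keep only pairs well above the clutter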
275 00:15:03,080 --> 00:15:07,980 So this is that histogram, or essentially 276 00:15:07,980 --> 00:15:09,040 a degree distribution. 277 00:15:09,040 --> 00:15:11,470 We talked about degree distributions in other lectures. 278 00:15:11,470 --> 00:15:15,060 This tells us that there's one 10mer that 279 00:15:15,060 --> 00:15:21,310 appears in, you know, 80% or 90% of all the sequences. 280 00:15:21,310 --> 00:15:30,320 And then there's 20 or 30 10mers that appear in just a few. 281 00:15:30,320 --> 00:15:33,210 And then there's a distribution, in this case 282 00:15:33,210 --> 00:15:35,520 of sort of almost a log normal curve, that 283 00:15:35,520 --> 00:15:42,070 shows you that really, most of the 10mers 284 00:15:42,070 --> 00:15:49,850 seem to have-- appear in a few hundred sequences. 285 00:15:49,850 --> 00:15:53,100 And so now one thing I've done here 286 00:15:53,100 --> 00:15:59,160 is create certain thresholds to say, all right. 287 00:15:59,160 --> 00:16:05,680 If I only wanted-- if I looked at that large, sparse matrix 288 00:16:05,680 --> 00:16:11,810 of the data, and I wanted to threshold, what-- how many-- 289 00:16:11,810 --> 00:16:13,560 how much of the data would I eliminate? 290 00:16:13,560 --> 00:16:17,460 So if I wanted to eliminate 50% of the data, 291 00:16:17,460 --> 00:16:19,920 I could set a threshold, let's only 292 00:16:19,920 --> 00:16:24,207 look at things that are less than a degree of 10,000. 293 00:16:24,207 --> 00:16:26,790 You might say, well, why would I want to eliminate these very, 294 00:16:26,790 --> 00:16:28,870 very popular things? 295 00:16:28,870 --> 00:16:31,190 Well because they appear everywhere, 296 00:16:31,190 --> 00:16:33,020 a true match is-- they're not going 297 00:16:33,020 --> 00:16:35,390 to give me any information that really tells me 298 00:16:35,390 --> 00:16:37,400 about true matches. 299 00:16:37,400 --> 00:16:38,830 Those are going to be clutter. 300 00:16:38,830 --> 00:16:40,220 Everything has them. 301 00:16:40,220 --> 00:16:44,770 That two sequences share that particular 10mer doesn't really 302 00:16:44,770 --> 00:16:49,000 give me a lot of power in selecting which one it really 303 00:16:49,000 --> 00:16:51,310 belongs to. 304 00:16:51,310 --> 00:16:53,350 So like-- so I can do that. 305 00:16:53,350 --> 00:16:57,300 If I wanted to go down to only 5% of the data, I could say, 306 00:16:57,300 --> 00:17:00,850 you know, I only want to look at 10mers that are 100, 307 00:17:00,850 --> 00:17:04,690 you know, or that have-- that appear in 100 308 00:17:04,690 --> 00:17:06,300 of these sequences or less. 309 00:17:06,300 --> 00:17:08,690 And if I wanted to go even further, you know, 310 00:17:08,690 --> 00:17:11,609 I could go down to 20, 30, 40, 50, 311 00:17:11,609 --> 00:17:14,540 and I would only have one half percent of the data. 312 00:17:14,540 --> 00:17:18,500 I would have eliminated from consideration 99.5% 313 00:17:18,500 --> 00:17:20,310 of the data. 314 00:17:20,310 --> 00:17:25,910 And if I can do that, then 315 00:17:25,910 --> 00:17:31,160 that's very powerful because I can quickly take my sample data 316 00:17:31,160 --> 00:17:31,660 set. 317 00:17:31,660 --> 00:17:33,650 I know all the 10mers it has. 318 00:17:33,650 --> 00:17:35,780 And I can quickly look it up against this histogram 319 00:17:35,780 --> 00:17:36,770 and say, "No. 320 00:17:36,770 --> 00:17:38,030 I don't want to do any.
321 00:17:38,030 --> 00:17:40,820 I only care about this very small section of data, 322 00:17:40,820 --> 00:17:43,580 and I only need to do correlations from that." 323 00:17:43,580 --> 00:17:46,130 So let's see how that works out. 324 00:17:46,130 --> 00:17:49,950 And I should say, this technique is very generic. 325 00:17:49,950 --> 00:17:51,702 You could do it for text matching. 326 00:17:51,702 --> 00:17:53,660 If you have documents, you have the same issue. 327 00:17:53,660 --> 00:17:57,740 Very popular words are not going to tell you really anything 328 00:17:57,740 --> 00:17:59,794 meaningful about whether two documents are 329 00:17:59,794 --> 00:18:00,710 related to each other. 330 00:18:00,710 --> 00:18:02,100 It's going to be clutter. 331 00:18:02,100 --> 00:18:05,220 And other types of records of that thing, you know? 332 00:18:05,220 --> 00:18:07,920 So in the graph theory perspective, 333 00:18:07,920 --> 00:18:09,800 we call these super nodes. 334 00:18:09,800 --> 00:18:11,390 These 10mers are super nodes. 335 00:18:11,390 --> 00:18:13,840 They have connections to many, many things, 336 00:18:13,840 --> 00:18:17,490 and therefore, if you try and connect through them, 337 00:18:17,490 --> 00:18:21,650 it's just going to not give you very useful information. 338 00:18:21,650 --> 00:18:27,140 You know, it's like people visiting Google. 339 00:18:27,140 --> 00:18:30,060 Looking for all the records where people connect-- 340 00:18:30,060 --> 00:18:31,670 visited Google is not really going 341 00:18:31,670 --> 00:18:36,010 to tell you much unless you have more information. 342 00:18:36,010 --> 00:18:41,000 And so it's not a very big distinguishing factor. 343 00:18:41,000 --> 00:18:45,120 So here's an example of the results 344 00:18:45,120 --> 00:18:50,470 of doing-- of selecting this low threshold, 345 00:18:50,470 --> 00:18:55,650 eliminating 99.5% of the data, and then comparing our matches 346 00:18:55,650 --> 00:18:57,590 that we got with what happens when 347 00:18:57,590 --> 00:19:00,060 we've used 100% of the data. 348 00:19:00,060 --> 00:19:05,700 And so what we have here is the true 10mer match and then 349 00:19:05,700 --> 00:19:08,900 the measured sub-sampled match here. 350 00:19:08,900 --> 00:19:15,210 And what you see here is that we get a very, very high success 351 00:19:15,210 --> 00:19:24,870 rate in terms of we basically detect all strong matches using 352 00:19:24,870 --> 00:19:26,920 only half percent of the data. 353 00:19:26,920 --> 00:19:31,560 And, you know, the number of false positives 354 00:19:31,560 --> 00:19:33,600 is extremely low. 355 00:19:33,600 --> 00:19:35,280 In fact, a better way to look at that 356 00:19:35,280 --> 00:19:38,420 is if we look at the cumular-- cumulative probability 357 00:19:38,420 --> 00:19:40,470 of detection. 358 00:19:40,470 --> 00:19:43,890 This shows this that if the actual match, if there 359 00:19:43,890 --> 00:19:48,410 was actually 100 matches between two sequences, 360 00:19:48,410 --> 00:19:55,830 we detect all of those using only 1/20 of the data. 361 00:19:55,830 --> 00:19:59,480 And likewise, in our probability false alarm rate, 362 00:19:59,480 --> 00:20:03,360 we see that if you see more than a match of ten 363 00:20:03,360 --> 00:20:07,610 in the sub-sample data, that is going to be a true match 364 00:20:07,610 --> 00:20:10,070 essentially 100% of the time. 
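Here is a rough sketch of that popularity filter, done in memory with D4M (in the real pipeline the 10mer tallies come out of the Accumulo accumulator table instead). A1 and A2 are the sequence ID by 10mer associative arrays from the earlier sketch, and the cutoffs are illustrative, not the real operating points.

deg    = sum(A1, 1);                   % in how many reference sequences each 10mer appears
rare   = Col(deg < 100);               % keys of the unpopular, non-super-node 10mers
AA     = A1(:, rare) * A2(:, rare).';  % correlate using only a small fraction of the data
strong = AA > 10;                      % matches above this level were essentially always true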
365 00:20:10,070 --> 00:20:16,030 And so this technique allows us to dramatically speed up 366 00:20:16,030 --> 00:20:19,222 the rate at which we can do these comparisons. 367 00:20:19,222 --> 00:20:20,930 And you do have the price to pay that you 368 00:20:20,930 --> 00:20:23,920 have to pre-load your reference data 369 00:20:23,920 --> 00:20:26,190 into this special database. 370 00:20:26,190 --> 00:20:28,950 But since you tend to be more concerned 371 00:20:28,950 --> 00:20:30,990 about comparing with respect to it, 372 00:20:30,990 --> 00:20:34,650 that is a worthwhile investment. 373 00:20:34,650 --> 00:20:38,940 So that's just an example of how you can use these techniques, 374 00:20:38,940 --> 00:20:41,610 use the mathematics of associative arrays, 375 00:20:41,610 --> 00:20:46,490 and these databases together in a coherent way. 376 00:20:49,090 --> 00:20:54,600 So we can do more than just find the number of matches. 377 00:20:54,600 --> 00:20:57,280 So the matrix multiply I showed you 378 00:20:57,280 --> 00:21:02,540 before, A1 times A2 transposed showed us 379 00:21:02,540 --> 00:21:05,360 the counts, the number of matches. 380 00:21:05,360 --> 00:21:07,940 But sometimes we want to know more than that. 381 00:21:07,940 --> 00:21:10,610 We want to know not just the number of matches, 382 00:21:10,610 --> 00:21:14,990 but please show me the exact set of 10mers 383 00:21:14,990 --> 00:21:17,540 that caused the match, OK? 384 00:21:17,540 --> 00:21:18,980 And so this is where, and I think 385 00:21:18,980 --> 00:21:21,110 we talked about this in a previous lecture, 386 00:21:21,110 --> 00:21:24,500 we have these special matrix multiplies that will actually 387 00:21:24,500 --> 00:21:30,710 take the intersecting keys in the matrix multiply and 388 00:21:30,710 --> 00:21:33,567 assign those to the value field or [INAUDIBLE]. 389 00:21:33,567 --> 00:21:35,400 And so that's why we have the special matrix 390 00:21:35,400 --> 00:21:37,860 multiply called CatKeyMul. 391 00:21:37,860 --> 00:21:39,860 And so, for instance here, if we look 392 00:21:39,860 --> 00:21:43,380 at the result of that, which is AK, and we say, 393 00:21:43,380 --> 00:21:48,090 "Show me all the value matches that are greater than six 394 00:21:48,090 --> 00:21:50,560 in their rows and their columns together," 395 00:21:50,560 --> 00:21:53,590 now we can see that this sequence ID 396 00:21:53,590 --> 00:21:55,260 matched with this sequence ID. 397 00:21:55,260 --> 00:21:58,790 And these were the actual 10mers that they 398 00:21:58,790 --> 00:22:00,620 had in common that generated the match. 399 00:22:00,620 --> 00:22:03,740 Now clearly six is not a true match 400 00:22:03,740 --> 00:22:05,790 in this little sample data set. 401 00:22:05,790 --> 00:22:07,450 We don't have any true matches. 402 00:22:07,450 --> 00:22:10,760 But this just shows you what that is like. 403 00:22:10,760 --> 00:22:14,860 And so this is what we call a pedigree preserving 404 00:22:14,860 --> 00:22:15,680 correlation. 405 00:22:15,680 --> 00:22:18,119 That is, it shows you the-- it doesn't just 406 00:22:18,119 --> 00:22:18,910 give you the count. 407 00:22:18,910 --> 00:22:21,840 It shows you where that evidence came from. 408 00:22:21,840 --> 00:22:23,300 And you can track it back. 409 00:22:23,300 --> 00:22:25,670 And this is something we do want to do all the time. 410 00:22:25,670 --> 00:22:27,800 If you imagined two documents that you 411 00:22:27,800 --> 00:22:30,740 wanted to cross correlate, you might say, all right.
412 00:22:30,740 --> 00:22:32,690 I have these two documents, and now I've 413 00:22:32,690 --> 00:22:34,970 cross correlated their word matches. 414 00:22:34,970 --> 00:22:39,220 Well, now I want to know the actual words that matched, 415 00:22:39,220 --> 00:22:40,280 not just the counts. 416 00:22:40,280 --> 00:22:44,150 And you would use the exact same multiply to do that. 417 00:22:44,150 --> 00:22:48,560 Likewise, you could do the word/word correlation 418 00:22:48,560 --> 00:22:49,490 of a document. 419 00:22:49,490 --> 00:22:52,170 So that would be A transpose times 420 00:22:52,170 --> 00:22:55,190 A instead of A, A transpose. 421 00:22:55,190 --> 00:22:58,110 And then it would show you two words. 422 00:22:58,110 --> 00:22:59,750 It would show you the list of documents 423 00:22:59,750 --> 00:23:02,020 that they actually had in common. 424 00:23:02,020 --> 00:23:06,030 So again, this is a powerful-- a powerful tool. 425 00:23:06,030 --> 00:23:08,580 Again, I should remind people when using this, 426 00:23:08,580 --> 00:23:15,390 you do have to be careful when you do, say, CatKeyMul A1 427 00:23:15,390 --> 00:23:20,180 times A1 transpose if you do a square because you will then 428 00:23:20,180 --> 00:23:23,690 end up with this very dense diagonal, 429 00:23:23,690 --> 00:23:27,090 and these lists will get extremely long. 430 00:23:27,090 --> 00:23:30,810 And you can often run out of memory when that happens. 431 00:23:30,810 --> 00:23:34,230 So you do have to be careful when you do correlations, 432 00:23:34,230 --> 00:23:37,180 these CatKeyMul correlations on things 433 00:23:37,180 --> 00:23:40,917 that are going to have very large, overlapping matches. 434 00:23:40,917 --> 00:23:42,500 The regular matrix multiply, you don't 435 00:23:42,500 --> 00:23:44,560 have to worry about that, creating dense. 436 00:23:44,560 --> 00:23:46,370 You know, that's not a problem. 437 00:23:46,370 --> 00:23:48,640 But that's just a little caveat to be aware of. 438 00:23:48,640 --> 00:23:50,390 And there's really nothing we can do about 439 00:23:50,390 --> 00:23:53,015 that if you do square them, you know? 440 00:23:53,015 --> 00:23:55,410 And we've thought about creating a new function, which 441 00:23:55,410 --> 00:23:57,620 is basically squaring with and basically 442 00:23:57,620 --> 00:23:59,250 not doing the diagonal. 443 00:23:59,250 --> 00:24:05,940 And we may end up making that if we can figure that one out. 444 00:24:05,940 --> 00:24:11,440 So once you have those actual specific matches here, 445 00:24:11,440 --> 00:24:15,480 so for example, we have our two reference samples. 446 00:24:15,480 --> 00:24:19,030 And we looked at the ones that were larger. 447 00:24:19,030 --> 00:24:20,420 So here's our two sequences. 448 00:24:20,420 --> 00:24:26,440 This, if we look back at the original data, which actually 449 00:24:26,440 --> 00:24:28,704 stored the locations of that. 450 00:24:28,704 --> 00:24:30,120 So now we're saying, oh, these two 451 00:24:30,120 --> 00:24:32,680 have six matches between them. 452 00:24:32,680 --> 00:24:36,690 Let me look them up through this one line statement here. 453 00:24:36,690 --> 00:24:40,570 Now I can see the actual 10mers that match 454 00:24:40,570 --> 00:24:44,130 and their actual locations in the real data. 
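A small sketch of this pedigree preserving correlation, again assuming A1 and A2 are the sequence ID by 10mer associative arrays from before; the threshold of six just mirrors the toy example on the slide.

AA  = A1 * A2.';                 % numeric correlation: counts of shared 10mers
AAk = CatKeyMul(A1, A2.');       % same pairs, but the values list the shared 10mers
hit = AA > 6;                    % sequence ID pairs with enough overlap
AAk(Row(hit), Col(hit))          % show which 10mers those pairs had in common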
455 00:24:44,130 --> 00:24:48,670 And from that, I can then deduce that, oh, actually this is not 456 00:24:48,670 --> 00:24:51,560 six separate 10mer matches, but it's really 457 00:24:51,560 --> 00:24:57,374 two sort of-- what is this, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. 458 00:24:57,374 --> 00:24:59,700 One 10mer match, and then I guess-- 459 00:24:59,700 --> 00:25:03,580 I think this is like a 17 match, right? 460 00:25:03,580 --> 00:25:04,860 And likewise over here. 461 00:25:04,860 --> 00:25:06,620 You get a similar type of thing. 462 00:25:06,620 --> 00:25:08,470 So that's basically stitching it back 463 00:25:08,470 --> 00:25:10,220 together because that's really what you're 464 00:25:10,220 --> 00:25:13,130 trying to do is find these longer sequences where 465 00:25:13,130 --> 00:25:14,490 it's identical. 466 00:25:14,490 --> 00:25:16,760 And then people can then use those 467 00:25:16,760 --> 00:25:20,430 to determine functionality and other types of things 468 00:25:20,430 --> 00:25:22,150 that have to do with that. 469 00:25:22,150 --> 00:25:25,030 So this is just part of the process of doing that. 470 00:25:25,030 --> 00:25:28,090 And it just shows by having this pedigree-preserving matrix 471 00:25:28,090 --> 00:25:31,350 multiply and the ability to store string values 472 00:25:31,350 --> 00:25:33,570 in the value, that we can actually preserve 473 00:25:33,570 --> 00:25:38,290 that information and do that. 474 00:25:38,290 --> 00:25:40,930 So that just shows you, I think, a really great example. 475 00:25:40,930 --> 00:25:42,679 It's kind of one of the most recent things 476 00:25:42,679 --> 00:25:46,320 we did this summer in terms of the power of D4M coupled 477 00:25:46,320 --> 00:25:52,508 with databases to really do algorithms that are completely 478 00:25:52,508 --> 00:25:55,990 kind of new and interesting. 479 00:25:55,990 --> 00:25:57,490 And as algorithm developers, I think 480 00:25:57,490 --> 00:26:00,095 that's something that we are all very excited about. 481 00:26:00,095 --> 00:26:01,720 So now I'm going to talk about how this 482 00:26:01,720 --> 00:26:03,700 fits into an overall pipeline. 483 00:26:03,700 --> 00:26:06,030 So once again, just for reminders, 484 00:26:06,030 --> 00:26:10,610 you know, D4M, we've talked mostly about working over here 485 00:26:10,610 --> 00:26:13,610 in this space, in the Matlab space, 486 00:26:13,610 --> 00:26:16,400 for doing your analytics with associative arrays. 487 00:26:16,400 --> 00:26:20,660 But they also have ways-- we have very nice bindings too. 488 00:26:20,660 --> 00:26:22,260 In particular, the Accumulo database, 489 00:26:22,260 --> 00:26:25,560 but we can bind to just about any database. 490 00:26:25,560 --> 00:26:26,720 And here's an example. 491 00:26:26,720 --> 00:26:29,810 If I have a table that shows all this data, 492 00:26:29,810 --> 00:26:32,340 and I just wanted to get, please give me 493 00:26:32,340 --> 00:26:35,400 the column of this 10mer sequence, 494 00:26:35,400 --> 00:26:37,730 I would just type that query and it would return 495 00:26:37,730 --> 00:26:39,560 that column for me very nicely. 496 00:26:39,560 --> 00:26:42,890 So it's a very powerful binding to databases. 497 00:26:42,890 --> 00:26:45,290 It's funny because a lot of people, when I talk to them, 498 00:26:45,290 --> 00:26:49,510 are like-- they think D4M is either just about 499 00:26:49,510 --> 00:26:54,760 databases or just about this associative array mathematics. 
500 00:26:54,760 --> 00:26:57,180 And because they usually kind of-- people take sort of-- 501 00:26:57,180 --> 00:27:01,540 they usually see it sort of more used in one context or another. 502 00:27:01,540 --> 00:27:02,940 And it's really about both. 503 00:27:02,940 --> 00:27:06,660 It's really about connecting the two worlds together. 504 00:27:06,660 --> 00:27:08,940 That particular application that we showed you, 505 00:27:08,940 --> 00:27:13,830 this genetic sequence comparison application 506 00:27:13,830 --> 00:27:16,290 is like you-- is a part of a pipeline, 507 00:27:16,290 --> 00:27:20,000 like you would see in any real system. 508 00:27:20,000 --> 00:27:23,380 And so I'm going to show you that pipeline here. 509 00:27:23,380 --> 00:27:30,140 And so we have here the raw data. 510 00:27:30,140 --> 00:27:33,250 So Fasta is just the name of the file format 511 00:27:33,250 --> 00:27:35,900 that all of this DNA sequence comes in, 512 00:27:35,900 --> 00:27:38,030 which basically looks pretty much like a CSV 513 00:27:38,030 --> 00:27:40,080 file of what I just showed you. 514 00:27:40,080 --> 00:27:44,200 I guess that deserves a name as a file format, but, you know. 515 00:27:44,200 --> 00:27:47,210 So it's basically that. 516 00:27:47,210 --> 00:27:56,670 And then one thing we do is we parse that data from the Fasta 517 00:27:56,670 --> 00:28:02,240 data into a triples format, which 518 00:28:02,240 --> 00:28:09,040 allows us to then really work with it as associative arrays. 519 00:28:09,040 --> 00:28:11,240 So it basically creates the 10mers 520 00:28:11,240 --> 00:28:14,500 and puts them into a series of triple files, 521 00:28:14,500 --> 00:28:16,830 or a row triple file, which holds the sequence 522 00:28:16,830 --> 00:28:21,460 ID, the actual 10mer itself, a list of those, 523 00:28:21,460 --> 00:28:25,750 and then the position of where that 10mer appeared 524 00:28:25,750 --> 00:28:28,660 in the sequence. 525 00:28:28,660 --> 00:28:31,280 And then typically what we do then 526 00:28:31,280 --> 00:28:36,480 is we read-- we write a program. 527 00:28:36,480 --> 00:28:38,060 And this can be a parallel program 528 00:28:38,060 --> 00:28:41,170 if we want, usually in Matlab 529 00:28:41,170 --> 00:28:44,410 using D4M, that will read these sequences in 530 00:28:44,410 --> 00:28:48,130 and will often just directly insert them 531 00:28:48,130 --> 00:28:50,040 without any additional formatting. 532 00:28:50,040 --> 00:28:53,410 Just-- we have a way of just inserting triples directly 533 00:28:53,410 --> 00:28:56,000 into the Accumulo database, which is the fastest way 534 00:28:56,000 --> 00:28:59,030 that we can do inserts. 535 00:28:59,030 --> 00:29:05,130 And then it will also convert these to associative arrays 536 00:29:05,130 --> 00:29:09,360 in Matlab and will save them out as mat files 537 00:29:09,360 --> 00:29:11,460 because this will take a little time to convert 538 00:29:11,460 --> 00:29:13,625 all of these things into files. 539 00:29:13,625 --> 00:29:15,250 You're like, well why would we do that? 540 00:29:15,250 --> 00:29:18,520 Well, there's two very good reasons for doing that. 541 00:29:18,520 --> 00:29:21,280 One, the Accumulo database, or any database, 542 00:29:21,280 --> 00:29:23,200 is very good if we want to look up 543 00:29:23,200 --> 00:29:27,880 a small part of a large data set, OK?
544 00:29:27,880 --> 00:29:29,910 So if we have billions of records, 545 00:29:29,910 --> 00:29:32,310 and we want millions of those records, 546 00:29:32,310 --> 00:29:34,180 that's a great use of a database. 547 00:29:34,180 --> 00:29:36,180 However, there are certain times we're like, no. 548 00:29:36,180 --> 00:29:39,710 We want to traverse the entire data set. 549 00:29:39,710 --> 00:29:42,140 Well, databases are actually bad at doing that. 550 00:29:42,140 --> 00:29:45,180 If you say, "I want to scan over the entire database," 551 00:29:45,180 --> 00:29:48,060 it's like doing a billion queries, you know? 552 00:29:48,060 --> 00:29:50,300 And so there's overheads associated with that. 553 00:29:50,300 --> 00:29:54,970 In that instance, it's far more efficient to just save the data 554 00:29:54,970 --> 00:29:56,940 into these associative array files, which 555 00:29:56,940 --> 00:29:58,805 will read in very quickly. 556 00:29:58,805 --> 00:30:00,430 And then you can just-- if you're like, 557 00:30:00,430 --> 00:30:03,902 "I want to do an analysis of the data 558 00:30:03,902 --> 00:30:06,360 in that way where I'm going to want-- I really want to work 559 00:30:06,360 --> 00:30:08,820 with 10% or 20% of the data." 560 00:30:08,820 --> 00:30:12,800 Then having this data already parsed 561 00:30:12,800 --> 00:30:15,490 into this binary format is a very efficient way 562 00:30:15,490 --> 00:30:18,730 to run an application or an analytic that will run over 563 00:30:18,730 --> 00:30:21,030 all of those files. 564 00:30:21,030 --> 00:30:24,080 It also makes parallelism very easy. 565 00:30:24,080 --> 00:30:27,180 You just get-- let different processors 566 00:30:27,180 --> 00:30:28,890 process different files. 567 00:30:28,890 --> 00:30:30,730 You know that's very easy to do. 568 00:30:30,730 --> 00:30:33,510 We have lots of support for that type of model. 569 00:30:33,510 --> 00:30:36,790 And so that's a good reason to do that. 570 00:30:36,790 --> 00:30:39,520 And at worst, you're doubling the size 571 00:30:39,520 --> 00:30:41,506 of the data and your database. 572 00:30:41,506 --> 00:30:42,380 Don't worry about it. 573 00:30:42,380 --> 00:30:45,180 We double data and databases all the time. 574 00:30:45,180 --> 00:30:47,090 Databases are notorious-- if you 575 00:30:47,090 --> 00:30:50,880 put a certain amount of data in, they make it much larger. 576 00:30:50,880 --> 00:30:53,000 Accumulo actually does a very good job of that. 577 00:30:53,000 --> 00:30:55,060 It won't really be that much larger. 578 00:30:55,060 --> 00:30:58,850 But there's no reason not to save those files 579 00:30:58,850 --> 00:31:00,580 as you're doing the ingest. 580 00:31:00,580 --> 00:31:04,920 And then you can do various comparisons. 581 00:31:04,920 --> 00:31:09,700 So for example, we then can-- if this-- 582 00:31:09,700 --> 00:31:13,930 if we save the sample data as a Matlab file, we 583 00:31:13,930 --> 00:31:15,260 could read that in. 584 00:31:15,260 --> 00:31:21,100 And then do our comparison with the reference data 585 00:31:21,100 --> 00:31:22,790 that's sitting inside the database 586 00:31:22,790 --> 00:31:24,280 to get our top matches. 587 00:31:24,280 --> 00:31:26,860 And that's exactly how this application actually works. 588 00:31:26,860 --> 00:31:32,190 So this pipeline goes from raw data to an intermediate format 589 00:31:32,190 --> 00:31:37,830 to a sort of efficient binary format and insertion 590 00:31:37,830 --> 00:31:41,160 to then doing the analytics and comparisons.
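A sketch of one ingest step of that pipeline, under a few assumptions: the Fasta parser has already produced the triple files, r, c, and v hold those triples as delimiter-separated strings (sequence ID, 10mer, position), and Tref is a D4M table binding created elsewhere. The file name is made up.

putTriple(Tref, r, c, v);          % push the raw triples straight into Accumulo -- fastest insert path
A = Assoc(r, c, v);                % also build the associative array in memory
save('fasta_part01_A.mat', 'A');   % keep a binary copy for whole-data-set analytics
% Later, an analytic that needs to sweep 10% or 20% of the data just loads
% these .mat files, while targeted lookups still go through the database.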
591 00:31:41,160 --> 00:31:43,080 And eventually, you'll usually 592 00:31:43,080 --> 00:31:45,760 have another layer, which is you might create a web 593 00:31:45,760 --> 00:31:47,796 interface to something. 594 00:31:47,796 --> 00:31:48,420 We're like, OK. 595 00:31:48,420 --> 00:31:50,760 This is now a full system. 596 00:31:50,760 --> 00:31:53,670 A person can type in your-- or put in their own data. 597 00:31:53,670 --> 00:31:56,180 And then it will then call a function. 598 00:31:56,180 --> 00:32:00,080 So again, this is very standard how this stuff ends up fitting 599 00:32:00,080 --> 00:32:03,440 into an overall pipeline. 600 00:32:03,440 --> 00:32:05,880 And it's certainly a situation where 601 00:32:05,880 --> 00:32:08,107 if you're going to deploy a system, you might decide, 602 00:32:08,107 --> 00:32:08,690 you know what? 603 00:32:08,690 --> 00:32:12,160 I don't want Matlab to be a part of my deployed system. 604 00:32:12,160 --> 00:32:15,420 I want to do that, say, in Java or something else that's 605 00:32:15,420 --> 00:32:18,980 sort of more universal, which we certainly see people do. 606 00:32:18,980 --> 00:32:22,730 It still makes sense to do it with this approach 607 00:32:22,730 --> 00:32:25,900 because the algorithm development and testing 608 00:32:25,900 --> 00:32:30,850 is just much, much easier to do in an environment like D4M. 609 00:32:30,850 --> 00:32:33,000 And then once you have the algorithm correct, 610 00:32:33,000 --> 00:32:34,810 it's now much easier to give that 611 00:32:34,810 --> 00:32:37,570 to someone else who is going to do 612 00:32:37,570 --> 00:32:40,570 the implementation in another environment and deal 613 00:32:40,570 --> 00:32:43,960 with all the issues that are associated with maybe doing 614 00:32:43,960 --> 00:32:45,830 a deployment type of thing. 615 00:32:45,830 --> 00:32:51,090 So one certainly could use the Matlab or the new octave code 616 00:32:51,090 --> 00:32:52,390 in a production environment. 617 00:32:52,390 --> 00:32:54,280 We certainly have seen that. 618 00:32:54,280 --> 00:32:56,650 But often the case is, one has limitations 619 00:32:56,650 --> 00:32:58,410 about what one can deploy. 620 00:32:58,410 --> 00:33:01,880 And so it is still better to do the algorithm development 621 00:33:01,880 --> 00:33:05,500 in this type of environment than to try and do 622 00:33:05,500 --> 00:33:08,460 the algorithm in a deployment language like Java. 623 00:33:13,820 --> 00:33:19,660 One of the things that was very important for this database, 624 00:33:19,660 --> 00:33:22,940 and it's true of most parallel databases, is this: 625 00:33:22,940 --> 00:33:28,200 if we want to get the highest performance insert, that 626 00:33:28,200 --> 00:33:31,645 is, we want to read the data and insert it 627 00:33:31,645 --> 00:33:33,610 as quickly as possible in the database, 628 00:33:33,610 --> 00:33:36,780 typically we'll need to have some kind of parallel program 629 00:33:36,780 --> 00:33:39,420 running, in this case, maybe each reading 630 00:33:39,420 --> 00:33:42,020 different sets of input files and all inserting them 631 00:33:42,020 --> 00:33:44,280 into the parallel database. 632 00:33:44,280 --> 00:33:48,370 And so in Accumulo, they divide your table 633 00:33:48,370 --> 00:33:53,200 amongst the different computers, which they call tablet servers.
634 00:33:53,200 --> 00:33:56,400 And it's very important to avoid the situation 635 00:33:56,400 --> 00:33:59,310 where everyone is inserting and all the data 636 00:33:59,310 --> 00:34:01,990 is being inserted into the same tablet server. 637 00:34:01,990 --> 00:34:05,450 You're not going to get really very good performance. 638 00:34:05,450 --> 00:34:08,350 Now, the way Accumulo splits up its data 639 00:34:08,350 --> 00:34:11,030 is similar to many other databases. 640 00:34:11,030 --> 00:34:13,386 It's sometimes called sharding. 641 00:34:13,386 --> 00:34:14,969 It just means they split up the table, 642 00:34:14,969 --> 00:34:16,679 but the term the database community 643 00:34:16,679 --> 00:34:18,800 uses-- they call it sharding. 644 00:34:18,800 --> 00:34:21,380 What they'll do is they'll basically take the table 645 00:34:21,380 --> 00:34:24,850 and they'll assign certain row keys, in this c-- in Accumulo's 646 00:34:24,850 --> 00:34:27,830 case, certain contiguous sets of row keys 647 00:34:27,830 --> 00:34:31,929 to particular tablet servers. 648 00:34:31,929 --> 00:34:38,170 So, you know, if you had a data set that was uniformly 649 00:34:38,170 --> 00:34:43,699 split over the alphabet, and you were going to split it in two, 650 00:34:43,699 --> 00:34:47,050 the first split would be between m and n. 651 00:34:47,050 --> 00:34:49,560 And so this is called splitting. 652 00:34:49,560 --> 00:34:52,989 And it's very important to try and get good splits 653 00:34:52,989 --> 00:34:56,929 and choose your splits so that you get good performance. 654 00:34:56,929 --> 00:35:00,990 Now D4M has a native interface that allows you to just say, 655 00:35:00,990 --> 00:35:04,110 here are the-- I want these to be the splits of this table. 656 00:35:04,110 --> 00:35:06,690 You can actually compute those and assign them 657 00:35:06,690 --> 00:35:09,000 if you have a parallel instance. 658 00:35:09,000 --> 00:35:12,080 In the class, you will only be working on the databases that 659 00:35:12,080 --> 00:35:13,410 will be set up for you. 660 00:35:13,410 --> 00:35:17,930 The ones we have set up for you are all single node instances. 661 00:35:17,930 --> 00:35:19,700 They do not have multiple tablet servers, 662 00:35:19,700 --> 00:35:22,360 so you don't really have to do-- deal with splitting. 663 00:35:22,360 --> 00:35:24,500 It's only an issue in parallel, but it's 664 00:35:24,500 --> 00:35:29,782 certainly something to be aware of and is often the key. 665 00:35:29,782 --> 00:35:32,820 People will often have a very large Accumulo instance. 666 00:35:32,820 --> 00:35:35,940 And they may only be getting the performance 667 00:35:35,940 --> 00:35:39,934 they would get on just a two node instance, 668 00:35:39,934 --> 00:35:41,850 and usually it's because their splitting needs 669 00:35:41,850 --> 00:35:44,730 to be done in this proper way. 670 00:35:44,730 --> 00:35:47,050 And this is true of all databases, not just Accumulo. 671 00:35:47,050 --> 00:35:49,080 But other databases have the exact same issue.
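A sketch of pre-splitting a table, assuming T is a D4M table binding to a parallel Accumulo instance and that your D4M release exposes a putSplits call for this (check your version); the split points themselves are illustrative.

splits = 'g,n,u,';                 % row-key boundaries between tablet servers
putSplits(T, splits);              % ask Accumulo to shard the table at these keys
% Each parallel ingester can then land its rows on different tablet
% servers instead of everyone piling onto the same one.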
672 00:35:52,740 --> 00:35:57,110 Accumulo is called Accumulo because it has something called 673 00:35:57,110 --> 00:35:59,900 accumulators, which is what actually makes 674 00:35:59,900 --> 00:36:04,960 this whole bioinformatics application really go: 675 00:36:04,960 --> 00:36:12,840 I can denote a column in a table 676 00:36:12,840 --> 00:36:14,770 to be an accumulator column, which 677 00:36:14,770 --> 00:36:18,160 means whenever a new entry is hit, 678 00:36:18,160 --> 00:36:19,950 some action will be performed. 679 00:36:19,950 --> 00:36:24,170 In this case, the default is addition. 680 00:36:24,170 --> 00:36:27,930 And so a standard thing we'll do in our schema, 681 00:36:27,930 --> 00:36:29,940 as we've already talked about, with these 682 00:36:29,940 --> 00:36:34,500 exploded transpose pair schemas that allow 683 00:36:34,500 --> 00:36:36,820 fast lookups in rows and columns, 684 00:36:36,820 --> 00:36:40,840 is we'll create two additional tables, one of which 685 00:36:40,840 --> 00:36:44,010 holds the sums of the rows, and one of which 686 00:36:44,010 --> 00:36:47,470 holds the sums of the columns, which then allows us to do 687 00:36:47,470 --> 00:36:51,790 these very fast lookups of the statistics, which is very 688 00:36:51,790 --> 00:36:56,090 useful for knowing how to avoid accidentally looking up columns 689 00:36:56,090 --> 00:36:59,410 that are present in essentially the whole database. 690 00:36:59,410 --> 00:37:01,110 An issue that happens all the time 691 00:37:01,110 --> 00:37:03,400 is that you'll have a column, and it's essentially 692 00:37:03,400 --> 00:37:05,860 almost a dense column in the database, and you really, 693 00:37:05,860 --> 00:37:08,040 really, really don't want to look that up 694 00:37:08,040 --> 00:37:10,850 because it's basically giving you the whole database. 695 00:37:10,850 --> 00:37:16,400 Well, with this accumulator, you can look up the column first, 696 00:37:16,400 --> 00:37:18,821 and it will give you a count and be like, oh, yeah. 697 00:37:18,821 --> 00:37:20,070 And then you can just say, no. 698 00:37:20,070 --> 00:37:22,810 I want to look up everything but that column. 699 00:37:22,810 --> 00:37:27,010 So very powerful, very powerful tool for doing that. 700 00:37:27,010 --> 00:37:30,810 That's also another reason why we construct 701 00:37:30,810 --> 00:37:34,780 the associative array when we load it for insert 702 00:37:34,780 --> 00:37:36,990 because when we construct the associative array, 703 00:37:36,990 --> 00:37:41,270 we can actually do a quick sum right then and there 704 00:37:41,270 --> 00:37:42,970 of whatever piece we've read in. 705 00:37:42,970 --> 00:37:46,060 And so then when we send that into the database, 706 00:37:46,060 --> 00:37:49,180 we've dramatically reduced the number of-- the amount 707 00:37:49,180 --> 00:37:50,560 of tallying it has to do. 708 00:37:50,560 --> 00:37:52,630 And this can be a huge time saver 709 00:37:52,630 --> 00:37:57,940 because if you don't do that, you're essentially reinserting 710 00:37:57,940 --> 00:38:01,400 the data two more times because, you know, 711 00:38:01,400 --> 00:38:06,000 when you've inserted it into the table and its transpose, 712 00:38:06,000 --> 00:38:07,660 to get the accumulation effect, 713 00:38:07,660 --> 00:38:09,920 you would have to directly insert it. 714 00:38:09,920 --> 00:38:13,330 But if we can do a pre-sum, that reduces 715 00:38:13,330 --> 00:38:18,120 the amount of work on that accumulator table dramatically.
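A sketch of that pre-sum trick, with several assumptions: A is the chunk this ingester just read (a sequence ID by 10mer Assoc), Tdeg is a tally table whose count column has a summing combiner configured on the Accumulo side (that setup is not shown), and putCol and num2str behave as assumed here (putCol renaming the column key, num2str converting the numeric tallies to the string values the database stores).

degChunk = putCol(sum(A, 1).', 'Degree,');  % one partial tally per 10mer in this chunk
put(Tdeg, num2str(degChunk));               % ship a few thousand tallies, not every entry
% The database-side combiner adds these partial sums to the running totals,
% so each chunk touches the accumulator table only once.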
716 00:38:18,120 --> 00:38:20,300 And so again, that's another reason why we do that. 717 00:38:23,780 --> 00:38:26,780 So let's just talk about some of the results that we see. 718 00:38:26,780 --> 00:38:30,000 So here's an example on a-- let's see, 719 00:38:30,000 --> 00:38:32,270 this is an eight node database. 720 00:38:32,270 --> 00:38:34,740 And this shows us the ingest rate 721 00:38:34,740 --> 00:38:37,940 that we get using different numbers of ingesters. 722 00:38:37,940 --> 00:38:40,420 So this is different programs all 723 00:38:40,420 --> 00:38:43,430 trying to insert into this database simultaneously. 724 00:38:43,430 --> 00:38:49,140 And this just shows the difference between having 0 725 00:38:49,140 --> 00:38:55,470 splits and, in this case, 35 splits was the optimal number 726 00:38:55,470 --> 00:38:56,400 that we had. 727 00:38:56,400 --> 00:39:00,370 And you see, it's a rather large difference. 728 00:39:00,370 --> 00:39:02,900 And you don't get-- you get some benefit 729 00:39:02,900 --> 00:39:06,090 with multiple inserters, but that sort of ends 730 00:39:06,090 --> 00:39:08,000 when you don't do this. 731 00:39:08,000 --> 00:39:12,440 So this is just an example of why you want to do that. 732 00:39:12,440 --> 00:39:16,930 Another example is with the actual human DNA database. 733 00:39:16,930 --> 00:39:21,610 This shows us our insert rate on doing-- using 734 00:39:21,610 --> 00:39:29,740 different numbers of inserters. 735 00:39:29,740 --> 00:39:34,750 And yeah, so this is eight tablet servers. 736 00:39:34,750 --> 00:39:37,480 And this just shows the different number of ingesters 737 00:39:37,480 --> 00:39:38,000 here. 738 00:39:38,000 --> 00:39:39,890 And since we're doing proper splitting, 739 00:39:39,890 --> 00:39:42,470 we're getting very nice scaling. 740 00:39:42,470 --> 00:39:45,330 And this is actually the actual output from Accumulo. 741 00:39:45,330 --> 00:39:48,030 It actually has a nice little meter here 742 00:39:48,030 --> 00:39:51,320 that shows you what it says you're actually getting, 743 00:39:51,320 --> 00:39:54,420 which is very, very nice to be able to verify 744 00:39:54,420 --> 00:39:57,300 that you and the database both agree 745 00:39:57,300 --> 00:40:00,500 that your insert rate is-- so this is insert right here. 746 00:40:00,500 --> 00:40:03,510 So you see, we're getting about 4,000-- 400,000 entries 747 00:40:03,510 --> 00:40:06,780 per second, which is an extraordinarily high number 748 00:40:06,780 --> 00:40:09,800 in the database community. 749 00:40:09,800 --> 00:40:11,920 Just to give you a comparison, it 750 00:40:11,920 --> 00:40:13,770 would not be uncommon if you were 751 00:40:13,770 --> 00:40:17,240 to set up a MySQL instance on a single computer 752 00:40:17,240 --> 00:40:21,840 and have one inserter going into it to get maybe 1,000 inserts 753 00:40:21,840 --> 00:40:23,980 per second on a good day. 754 00:40:23,980 --> 00:40:26,100 And so, you know, this is essentially 755 00:40:26,100 --> 00:40:29,090 the core reason why people use Accumulo 756 00:40:29,090 --> 00:40:35,220 is because of this very high insert and query performance. 757 00:40:35,220 --> 00:40:37,820 And this just shows our extrapolated run time. 
758 00:40:37,820 --> 00:40:43,440 If we wanted to, for instance, ingest the entire human FASTA 759 00:40:43,440 --> 00:40:47,450 database here of 4.5 billion entries, 760 00:40:47,450 --> 00:40:51,280 we could do that in eight hours, which 761 00:40:51,280 --> 00:40:53,950 would be a very reasonable amount of time to do that. 762 00:40:53,950 --> 00:40:56,280 So to summarize what we were able to achieve 763 00:40:56,280 --> 00:41:01,290 with this application, this shows one of these diagrams. 764 00:41:01,290 --> 00:41:03,270 I think I showed one in the first lecture. 765 00:41:03,270 --> 00:41:06,630 This is a way of sort of measuring our productivity 766 00:41:06,630 --> 00:41:08,020 and our performance. 767 00:41:08,020 --> 00:41:11,240 So this shows the volume of code that we wrote. 768 00:41:11,240 --> 00:41:12,830 And this shows the run time. 769 00:41:12,830 --> 00:41:16,060 So obviously it is better to be down here. 770 00:41:16,060 --> 00:41:17,900 And this shows BLAST. 771 00:41:17,900 --> 00:41:22,390 So BLAST is the industry standard application 772 00:41:22,390 --> 00:41:24,380 for doing sequence matching. 773 00:41:24,380 --> 00:41:27,650 And I don't want, in any way, to say that we are 774 00:41:27,650 --> 00:41:29,160 doing everything BLAST does. 775 00:41:29,160 --> 00:41:32,990 BLAST is almost a million lines of code. 776 00:41:32,990 --> 00:41:34,840 It does lots and lots of different things. 777 00:41:34,840 --> 00:41:36,780 It handles all different types of file formats 778 00:41:36,780 --> 00:41:39,780 and little nuances and other types of things. 779 00:41:39,780 --> 00:41:44,370 But at its core, it does what we did. 780 00:41:44,370 --> 00:41:51,210 And we were able to basically do that in, you know, 781 00:41:51,210 --> 00:41:55,140 150 lines of D4M code. 782 00:41:55,140 --> 00:41:58,730 And using the database, we were able to do 783 00:41:58,730 --> 00:42:02,480 that 100 times faster. 784 00:42:02,480 --> 00:42:05,990 And so the real power of this technology 785 00:42:05,990 --> 00:42:09,690 is to allow you to develop these algorithms 786 00:42:09,690 --> 00:42:12,650 and implement them and, in this case, 787 00:42:12,650 --> 00:42:16,890 actually leverage the database to essentially replace lookups 788 00:42:16,890 --> 00:42:19,220 with computations in a very intelligent way, 789 00:42:19,220 --> 00:42:21,800 in a way that's knowledgeable about the statistics 790 00:42:21,800 --> 00:42:23,000 of your data. 791 00:42:23,000 --> 00:42:24,290 And that's really the power. 792 00:42:24,290 --> 00:42:26,820 And these are the types of results that people can get. 793 00:42:26,820 --> 00:42:31,600 And this whole result was done with one summer student, a very 794 00:42:31,600 --> 00:42:33,970 smart summer student, and we were 795 00:42:33,970 --> 00:42:36,740 able to put this whole system together in about a couple 796 00:42:36,740 --> 00:42:37,440 of months. 797 00:42:37,440 --> 00:42:40,840 And this is from a person who knew nothing about Accumulo 798 00:42:40,840 --> 00:42:42,400 or D4M or anything like that. 799 00:42:42,400 --> 00:42:44,890 They were smart, 800 00:42:44,890 --> 00:42:49,024 with a good, solid Java background, and very energetic. 801 00:42:49,024 --> 00:42:50,440 But, you know, these are the kinds 802 00:42:50,440 --> 00:42:53,710 of things we've been able to see.
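To give a sense of what "knowledgeable about the statistics of your data" can look like in practice, here is a hedged D4M-style sketch of a statistics-aware match query. It reuses the same illustrative tables as above (sequence IDs as rows, DNA words as columns), assumes the Assoc operations str2num, find, and Row along with the table-subscript query style, and the threshold and variable names are made up:

    % sampleWords is an illustrative list of the 10-mer keys found in a sample.
    maxDegree = 10000;                          % illustrative cutoff for "too common"

    % Ask the accumulator (degree) table how common each word is, and keep
    % only the uncommon ones so we never pull back a near-dense column.
    Adeg = str2num(TdegCol(sampleWords, :));
    rare = Row(Adeg < maxDegree);

    % Look up only those rare words: which reference sequences contain them,
    % and how many rare words each reference sequence shares with the sample.
    Ahits = Tedge(:, rare);
    [seqID, word, val] = find(Ahits);
    Amatch = Assoc(seqID, 'matches,', 1, @sum);

This is the sense in which the sum tables let the database do the heavy filtering for you: a couple of cheap lookups against the statistics take the place of a brute-force pass over everything.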
803 00:42:53,710 --> 00:42:55,680 So just to summarize, again, we 804 00:42:55,680 --> 00:43:02,080 see that these types of techniques 805 00:43:02,080 --> 00:43:04,950 are useful for document analysis, network analysis, and DNA 806 00:43:04,950 --> 00:43:06,570 sequencing. 807 00:43:06,570 --> 00:43:08,070 You know, this is really summarizing 808 00:43:08,070 --> 00:43:10,750 all the applications that we've talked about. 809 00:43:10,750 --> 00:43:14,350 We think there's a pretty big gap between the data 810 00:43:14,350 --> 00:43:17,560 analysis tools we have and what our algorithm developers really need. 811 00:43:17,560 --> 00:43:20,210 And we think D4M is really allowing 812 00:43:20,210 --> 00:43:24,600 us to use a tool like Matlab for its traditional role 813 00:43:24,600 --> 00:43:27,730 in this new domain, which is algorithm development: 814 00:43:27,730 --> 00:43:29,610 figuring things out and getting things right 815 00:43:29,610 --> 00:43:34,490 before you then hand it on to someone else to implement 816 00:43:34,490 --> 00:43:38,159 and actually get into a production environment. 817 00:43:38,159 --> 00:43:40,200 So with that, that brings the lecture to an end. 818 00:43:40,200 --> 00:43:41,700 And then there are some examples where 819 00:43:41,700 --> 00:43:44,290 we show you this stuff. 820 00:43:44,290 --> 00:43:48,100 And there's no homework other 821 00:43:48,100 --> 00:43:52,680 than that I sent you all that link to see 822 00:43:52,680 --> 00:43:55,330 if your access to the database works. 823 00:43:55,330 --> 00:43:57,060 Did everyone-- raise your hand. 824 00:43:57,060 --> 00:43:59,001 Did you check the link out? 825 00:43:59,001 --> 00:43:59,500 Yes? 826 00:43:59,500 --> 00:44:00,260 You logged in? 827 00:44:00,260 --> 00:44:01,470 It worked? 828 00:44:01,470 --> 00:44:03,430 You saw a bunch of databases there? 829 00:44:03,430 --> 00:44:04,190 OK. 830 00:44:04,190 --> 00:44:06,260 If it doesn't work, because the next class is 831 00:44:06,260 --> 00:44:08,435 sort of the penultimate class, I'm 832 00:44:08,435 --> 00:44:10,060 going to go through all these examples. 833 00:44:10,060 --> 00:44:12,480 And the assignment will be to run those examples 834 00:44:12,480 --> 00:44:13,630 on those databases. 835 00:44:13,630 --> 00:44:17,710 So you will be really using D4M, touching a database, 836 00:44:17,710 --> 00:44:20,740 doing real analytics, doing very complicated things. 837 00:44:20,740 --> 00:44:23,480 The whole class is just going to be a demonstration of that. 838 00:44:23,480 --> 00:44:25,990 Everything we've been doing has been leading up to that. 839 00:44:25,990 --> 00:44:28,460 So the concepts are in place so that you can understand 840 00:44:28,460 --> 00:44:30,080 those examples fairly easily. 841 00:44:30,080 --> 00:44:32,490 All right, so we will take a short break here. 842 00:44:32,490 --> 00:44:34,530 And then we will come back, and I will show 843 00:44:34,530 --> 00:44:37,720 you the demo, this week's demo.