The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right, so we're back. We're going to do the examples we just described in the previous notes. Once again, if you have your d4m directory, these are in Examples/Applications/BioBlast. This shows you the two programs we're going to run; we just have two examples. And then we're going to be comparing two pieces of actual data.

So if you look at this, it shows you what that data looks like: a sequence ID, and then very long sequences. It would be nice if they were all the same length, but they certainly are not; they vary dramatically, as you can see. Look at that one, it's super long. So this is a reference bacteria dataset; this is known data. And this is a sequence of some data taken from somebody's palm. It's open source data, so this is just an example. You can imagine one would want to compare the DNA sequences of bacteria on a person's hand with some reference dataset. Obviously, we're not going to be using a large dataset; this is a very small one. We wouldn't expect to have any matches, and we're not going to have any, but this just shows you how that works.

So let me go over here. The first one was this BB1 program. All right, what we've done is we set our word size. We're going to use 10 base pairs, so we're basically dealing with 10-mers. And we wrote a little program that, given a CSV file of FASTA data and a word size, creates the associative array as we described.
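In rough terms, the windowing step inside that reader looks something like the following. This is only a sketch of the idea, not the course's actual FASTA reader, and the sequence string is made up:

    % Sketch: split one sequence string into overlapping 10-mers with a sliding window.
    seq  = 'GCCACTTAAAGCCTTGGAT';   % toy sequence, purely illustrative
    w    = 10;                      % word size (10-mer)
    nmer = length(seq) - w + 1;     % number of overlapping windows
    mers = cell(1, nmer);
    for i = 1:nmer
      mers{i} = seq(i:i+w-1);       % each window becomes one 10-mer column key
    end
    mers = unique(mers);            % the associative array keeps each 10-mer once per row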
So basically each sequence will be a row; we'll get into that. And likewise, each 10-mer will be a column. We do this for the two datasets. Here is our matrix multiply: A1 times A2 transpose gives us A1A2. And here is our pedigreed matrix multiply, CatKeyMul of A1 and A2 transpose, which gives us the A1A2 key array. And so we can now look at that. A1 has 198 sequences, that's the bacteria, with 8,001 unique 10-mers. A2 had 1,000 sequences with 2,365 unique 10-mers. And when you matrix multiply them, you see that 85 of the 195 in the reference had a match with 415 of the 999 in the sample. So we're losing about half in each case because the associative array does not store empty values; there's going to be no key which is a completely empty row or a completely empty column.

So we can now look at that data. If we look at Figure 1, this is A1. Why don't we zoom in? That gives you a better sense of the structure, and you can actually click on it. So here's a 10-mer that appears in an awful lot of things. And these are examples of a row corresponding to a particular sequence ID. So that's A1, the reference. A2, the sample, is pretty much the same type of thing, although with a little bit different structure, maybe more of these dense blobs. Why don't we zoom in on one of these blobs? We've got this one; it begins with GCCAC and GCCTT and other things like that.

And this is the cross correlation. If we go now to Figure 3, this is the cross correlation of all of those, and it holds the count. You can see most of these are just 1, and unless I did some kind of thresholding, I probably couldn't even find the ones that were larger. And then Figure 4 is our pedigreed multiplication, so now it shows you what the actual matching 10-mers were.
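As a rough sketch of those two products, here is a tiny hand-built example. The row and column keys are made up, and D4M Assoc strings use a trailing separator (a comma here); the real BB1 arrays are built from the FASTA files:

    % Sketch only: two tiny associative arrays correlated the same way BB1 does.
    A1 = Assoc('seq1,seq1,seq2,', 'GCCACTTAAA,CCACTTAAAG,GCCACTTAAA,', 1);
    A2 = Assoc('samp7,samp9,',    'GCCACTTAAA,TTTTTTTTTT,',            1);

    A1A2    = A1 * A2.';            % value = number of shared 10-mers per sequence pair
    A1A2key = CatKeyMul(A1, A2.');  % pedigreed product: values list the shared 10-mers
    displayFull(A1A2key)            % shows which 10-mer produced each match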
So this tells us that this reference sequence ID and this sample sequence ID had a match at this 10-mer.

So we're back here. And as you see, when I clicked on that, it printed out the full sequence, so you could just cut and paste it into something else. Likewise, if I wanted to look at this row, I could do A1 of that sequence ID, and that gives me the row. And I can do other things: I could sum the columns up, and that tells me there are 138 unique 10-mers in that sequence. So, those types of things.

All right, I think we pretty much covered that, so now we'll go on to the next example. And I apologize, this takes a little bit longer to run, because now I'm doing two things when I read in the data. I'm doing the same thing I did before, in terms of [INAUDIBLE] I want 10-mers, and then I'm going to read the data again with a slightly different function, which is going to return two associative arrays. One associative array is the same as before, which holds the actual counts, how many times each 10-mer appears. The other gives us a concatenated list of the positions, so if a 10-mer appears five times, it will hold a list of the five locations within the larger sequence, which then allows us to do those things I talked about before. So this takes a minute. Probably with some optimization, I have no doubt this could be made faster. And also, with the video recording going on, we have a lot of pressure on this personal computer.

Just some additional information here in terms of the whos command: we basically did whos on the A variables to show all the different associative arrays I have, as you can see. And this tallies up the bytes associated with each one. So our first associative array, as you saw, was about 700 kilobytes. That basically adds up all the data stored in all the data structures inside it, and it usually does a pretty good job of that.
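In sketch form, that kind of interactive poking around looks roughly like this. The row key is hypothetical (real keys come from the FASTA sequence IDs), and it assumes the values are 1's as in this first read:

    Arow = A1('AB000106.1,', :);   % pull one reference sequence's row (hypothetical key)
    nnz(Arow)                      % how many distinct 10-mers that sequence contains
    whos A*                        % list the A* associative arrays and their bytes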
And you can see there's a significant reduction in the total size. And in this case, we didn't really add that much by adding the keys, so this is a very nice dataset for us to do our CatKeyMul on. There was no dense diagonal or anything like that to get in our way, so that was a very useful thing to have.

At least we're now onto the second read, so almost there. And being able to click on an entry and have it print out works pretty well here when we do the spy plot, because our keys are relatively short. But we do truncate: if you have a row key or a column key that's very long, when we do the plots we truncate it, and then you would never actually know what the true full string was. I think we're just about finished there.

All right, so again, we read in the data, and now I want to make a degree distribution of the data. Just like the degree distributions we did earlier, it's a great way to get a sense of the overall statistics. So I took one of them, A1N. The way I do that is I have the associative array, A1. If I do Adj, the adjacency matrix, that basically just pops out the internal sparse matrix that's stored, and now I can do regular math on that. And we happen to have a little built-in function called OutDegree, which essentially does the appropriate sums in the out direction and returns a sparse matrix, which we can then plot in this log-log form (a rough sketch of this computation appears below).

So this is the reference data. You can see this distribution here, which is very power law-like, although, as we saw in the earlier classes, unless we properly fit and bin it, we wouldn't really know if it was a power law or something like it. But a first-order linear fit of a power law would at least be a good place to start, as it usually is with this data. There's the other one, the sample. It would appear to be even more power law-like, but again, until we bin it, we don't really know what happens.
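Here is the rough sketch of that out-degree computation, done with plain sparse operations rather than the course's OutDegree helper; which direction you take the degree in (rows or columns) is a choice, and this version counts how many sequences each 10-mer appears in:

    A1sp = Adj(A1);                         % pop out the internal sparse matrix
    deg  = full(sum(A1sp > 0, 1));          % for each 10-mer column, how many sequences contain it
    cnt  = accumarray(deg(deg > 0).', 1);   % how many 10-mers have each degree value
    [d, ~, c] = find(cnt);                  % nonzero (degree, count) pairs
    loglog(d, c, 'o');                      % degree distribution on log-log axes
    xlabel('degree'); ylabel('number of 10-mers');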
When we start binning this stuff up here, maybe it will bow up in some way, maybe it won't. Probably it would; probably it would look more like this.

A good question is: I showed you that other dataset, and it didn't look like this. You saw an almost log-normal, Gaussian-looking distribution. It's the same kind of data, still 10-mer data, so the only difference is the size. When you have 10-mers, since you have four choices of letter, you essentially have 4 to the 10th, about a million, total possible 10-mers. And this dataset, in fact, we can count: that's A1, and if we do nnz, we only have 274,000 entries, so we're not fully sampling those possibilities here. In the other dataset, the database was actually 500 million entries, so we really are fully sampling. Basically what's happening is that the rare 10-mers, the ones that don't normally show up, you're hitting them once, and eventually you begin to create that more Gaussian, bell-curve look when you start fully sampling.

I don't know if that's generally true or not, but one could hypothesize, and this is just a guess, that a lot of the data we see is power law because we're not fully sampling the total space, or that total space is growing as we sample it. So domain names, for example: we haven't fully sampled the entire space of domain names, and the set of domain names continues to grow at a fairly significant rate. Now, maybe one day there will stop being new domain names, and then when we start doing our sampling, maybe our distributions of domain names won't look so power law anymore, but more like a log-normal distribution with a kind of bell shape. Just a hypothesis, but certainly many of the datasets we have don't do that.

Now, of course, the easiest way for me to change this, instead of doing 10-mers, would be to do 20-mers.
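As a quick back-of-the-envelope sketch of why the word size matters so much, using the counts quoted above:

    % Sketch of the sampling argument: observed entries versus the size of the k-mer space.
    observed = nnz(A1);      % stored entries in the reference, about 274,000 here
    space10  = 4^10;         % possible 10-mers: 1,048,576, about a million
    space20  = 4^20;         % possible 20-mers: about 1.1e12
    observed / space10       % a noticeable fraction of the 10-mer space is touched
    observed / space20       % essentially zero: chance 20-mer collisions become very rare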
Now I've dramatically increased the space of possibilities, and so the odds of purely statistical matches piling up are very, very low. So that is certainly something people do, too. And in the case of the calculation we were doing, we just got lucky. We actually did not think a 10-mer would be the right sampling for that data; we thought it would be a higher number, like a 15-mer or a 20-mer. It turns out the 10-mer was actually spot on in terms of giving us a nice balance between sampling and statistical power, so that we could eliminate the more common occurrences.

All right, so we plotted the out degree again, a very useful technique. One should always understand the overall statistics of one's data. We're doing the same cross correlation here that we did before, except that in the previous dataset the value didn't store the count (that was a very fast read, and all the values were 1), whereas here we actually have a count. And in this particular correlation, we don't care that, well, this 10-mer appeared 10 times in this one sequence and 3 times in this other sequence, so when we multiply them together we get a 30. We're really more interested in how many unique 10-mers they each have, how many distinct ones they share. So we use the dblLogi (double logical) function to convert all our numbers back to just 0's and 1's, and then do that correlation again. And since the values don't matter when we do the CatKeyMul, we don't need to do that there; we just do the same correlation.

And so here we are: we're finding the sequence pairs that have more than an 8 match, more than eight shared 10-mers. We're actually doing some fairly inside tricks here to do that. You might ask, what is this putVal? Well, the first part is fairly clear: I'm taking the count and saying, please return the sub-associative array of all things with a value greater than 8, more than eight matches. All right, that's good.
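In sketch form, the distinct-match correlation and that greater-than-8 selection look roughly like this; the variable names are illustrative, not necessarily the ones in the BB2 script:

    A1b = dblLogi(A1);        % collapse counts to 0/1 so only distinct 10-mers count
    A2b = dblLogi(A2);
    AA  = A1b * A2b.';        % each value = number of distinct shared 10-mers
    AAbig = AA > 8;           % keep only the sequence pairs with more than 8 matches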
Then I'm flooring that and saying, make those all just 1's. And then I'm doing this trick here, which is called putVal, and I'm giving it this odd string. Right now this is just an associative array without string values. When I put that string in there, since all my entries have a value of 1 and I hand it one string, it makes all of those values a tilde. You might ask, why on earth would we do that? Well, it's because when I AND them together, we have to decide what happens mathematically. When we AND two associative arrays together, if you recall the lecture where we talked about collision functions, we have to pick a collision function for the two strings. This key array stores the pedigreed list, which I really want to keep, and this other one just shows me which pairs are the high matches, and I don't really care about its values. Well, the default collision function throughout d4m, when in doubt, is min. And tilde is lexicographically after every other printable character in the alphabet, so when I do a min here, I'm always going to get back the pedigree values. So there was only one entry greater than 8, and I ANDed them together. It did a min of tilde with this whole string here and said, aha, C is less than tilde, I'm just going to give you the C, and there we go. So this is just a cute trick, but it's a way of using the collision functions and the group theory we talked about in earlier lectures to very cleanly select the value that I want. And so that's what that does.

And then you can see that this sequence ID and this sequence ID had more than eight matches, and these were the exact 10-mers that matched. And then we can go back and look up those two sequences, compare them side by side in the original data, and we see these were the actual positions. And you would probably look at these and say, well, what do we think here? Well, 137 and 138, then 139, 140, 141, and then jumping all the way to 149.
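A tiny sketch of turning that eyeballing into code; the position list here is just the one read off the screen, used for illustration:

    % Group match positions into contiguous runs, so isolated hits stand out.
    pos  = [60 130 137 138 139 140 141 149 172];  % illustrative positions from above
    runs = cumsum([1, diff(pos) ~= 1]);           % new run id whenever the gap exceeds 1
    for r = unique(runs)
      fprintf('run %d: %s\n', r, mat2str(pos(runs == r)));
    end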
So it looks like those are all part of one long group, but 60 and 172 and 130 are probably isolated 10-mers. And then over here, what do we have? We have 111... no, but we do have 119, 120, 121, 122, 123, so that's probably also a bigger group match. In fact, those all line up with this, except for the 149 here. Well, actually, no, some of them match some of the time. So they're each part of a larger piece, but they aren't contiguous with each other. And that's the kind of thing that people who do this then really look at. Of course, a real match would be something with many more sequences; this is all just randomness and things like that.

All right, so that's the demo. I want to thank you. And again, next week, tell your friends who weren't here today that that is the one lecture. If you had to come to one lecture in this entire class, that's it. I didn't tell you that at the beginning, you notice, because then you would have skipped the other seven. But what we've done before will really make that one a lot easier. That's the real one, and when you leave that lecture you will be able to go run all those examples on the test databases we have there from your LLGrid accounts, and have all that kind of fun. And we really want you to do that, because we've released d4m on the web and we need people to test it. It's much better if you guys find the issues, and then we'll make it better. And likewise, it's a step toward you actually using the technology for your own applications. So thank you very much.