1
00:00:12,910 --> 00:00:17,290
JASON KU: Welcome to the
fourth lecture of 6.006.

2
00:00:17,290 --> 00:00:20,350
Today we are going to be
talking about hashing.

3
00:00:20,350 --> 00:00:24,850
Last lecture, on Tuesday,
Professor Solomon

4
00:00:24,850 --> 00:00:29,080
was talking about
set data structures,

5
00:00:29,080 --> 00:00:33,250
storing things so that
you can query items

6
00:00:33,250 --> 00:00:37,510
by their key, right, by what
they intrinsically are--

7
00:00:37,510 --> 00:00:39,610
versus what Professor
Demaine was talking

8
00:00:39,610 --> 00:00:42,280
about last week, which was
sequence data structures, where

9
00:00:42,280 --> 00:00:46,060
we impose an external
order on these items

10
00:00:46,060 --> 00:00:49,600
and we want you
to maintain those.

11
00:00:49,600 --> 00:00:52,780
I'm not supporting operations
where I'm looking stuff up

12
00:00:52,780 --> 00:00:54,290
based on what they are.

13
00:00:54,290 --> 00:00:56,910
That's what the set
interface is for.

14
00:00:56,910 --> 00:00:59,410
So we're going to be talking a
little bit more about the set

15
00:00:59,410 --> 00:01:01,810
interface today.

16
00:01:01,810 --> 00:01:05,680
On Tuesday, you saw two
ways of implementing the set

17
00:01:05,680 --> 00:01:07,210
interface--

18
00:01:07,210 --> 00:01:09,970
one using just an
unsorted array-- just,

19
00:01:09,970 --> 00:01:12,430
I threw these things
in an array and I

20
00:01:12,430 --> 00:01:14,650
could do a linear
scan of my items

21
00:01:14,650 --> 00:01:17,140
to support basically
any of these operations.

22
00:01:17,140 --> 00:01:19,090
It's a little exercise
you can go through.

23
00:01:19,090 --> 00:01:21,640
I think they show it to you
in the recitation notes,

24
00:01:21,640 --> 00:01:26,620
but if you'd like to implement
it for yourself, that's fine.
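A minimal sketch of the unsorted-array approach described above, assuming each item is a small record with a `key` field. The names `Item` and `UnsortedArraySet` are illustrative and are not taken from the recitation notes. Every operation is a linear scan.

```python
from collections import namedtuple

# Hypothetical item type: anything with a .key attribute would work.
Item = namedtuple("Item", ["key", "value"])

class UnsortedArraySet:
    """Set interface backed by an unsorted Python list (illustrative only)."""

    def __init__(self):
        self.items = []

    def find(self, key):
        # Linear scan: up to n key comparisons in the worst case.
        for item in self.items:
            if item.key == key:
                return item
        return None

    def insert(self, item):
        # Set semantics: an item with an existing key replaces the old one.
        for i, existing in enumerate(self.items):
            if existing.key == item.key:
                self.items[i] = item
                return
        self.items.append(item)

    def delete(self, key):
        # Scan for the key, then remove and return the matching item.
        for i, existing in enumerate(self.items):
            if existing.key == key:
                return self.items.pop(i)
        return None
```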

25
00:01:26,620 --> 00:01:30,100
And then we saw a slightly
better data structure, at least

26
00:01:30,100 --> 00:01:31,780
for the find operations.

27
00:01:31,780 --> 00:01:34,660
Can I look something
up, whether this key

28
00:01:34,660 --> 00:01:38,750
is in my set interface?

29
00:01:38,750 --> 00:01:39,730
We can do that faster.

30
00:01:39,730 --> 00:01:43,450
We can do that in log n
time with a build overhead

31
00:01:43,450 --> 00:01:49,180
that's about n log n, because we
showed you three ways to sort.

32
00:01:49,180 --> 00:01:51,010
Two of them were n squared.

33
00:01:51,010 --> 00:01:55,480
One of them was n log n, which
is as good as we showed you

34
00:01:55,480 --> 00:01:57,500
how to do yesterday.
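A rough sketch of the sorted-array idea, assuming items are `(key, value)` pairs with distinct keys. The class name `SortedArraySet` is hypothetical. Building sorts once (about n log n), and `find` uses binary search from Python's standard `bisect` module (about log n comparisons).

```python
from bisect import bisect_left

class SortedArraySet:
    """Static set backed by a sorted list of (key, value) pairs (illustrative only)."""

    def __init__(self, items):
        # Build: one sort, O(n log n).
        self.items = sorted(items, key=lambda kv: kv[0])
        # Keep the keys in a parallel list so bisect can search them directly.
        self.keys = [k for k, _ in self.items]

    def find(self, key):
        # Binary search: O(log n) comparisons.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.items[i]
        return None
```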

35
00:01:57,500 --> 00:02:00,500
So the question then becomes,
can I build that data structure

36
00:02:00,500 --> 00:02:01,080
faster?

37
00:02:01,080 --> 00:02:04,580
That'll be a subject of next
week's Thursday lecture.

38
00:02:04,580 --> 00:02:08,210
But this week we're going to
concentrate on this static

39
00:02:08,210 --> 00:02:09,530
find.

40
00:02:09,530 --> 00:02:11,900
We got log n, which is an
exponential improvement

41
00:02:11,900 --> 00:02:17,990
over linear, right, but
the question now becomes,

42
00:02:17,990 --> 00:02:21,870
can I do faster than log n time?

43
00:02:21,870 --> 00:02:24,370
And what we're going to do at
the first part of this lecture

44
00:02:24,370 --> 00:02:26,320
is show you that, no, you--

45
00:02:26,320 --> 00:02:27,340
AUDIENCE: [INAUDIBLE]

46
00:02:27,340 --> 00:02:28,850
JASON KU: What's up?

47
00:02:28,850 --> 00:02:29,350
No?

48
00:02:29,350 --> 00:02:35,230
OK-- that you can't do
faster than log n time,

49
00:02:35,230 --> 00:02:38,920
with the caveat that we are in a
slightly more restricted model

50
00:02:38,920 --> 00:02:43,060
of computation that we
were-- than what we introduced

51
00:02:43,060 --> 00:02:46,460
to you a couple of weeks ago.

52
00:02:46,460 --> 00:02:50,210
And then so if we're not in
that more constrained model

53
00:02:50,210 --> 00:02:52,070
of computation, we can
actually do faster.

54
00:02:55,490 --> 00:02:57,320
Log n's already pretty good.

55
00:02:57,320 --> 00:03:03,460
Log n is not going to be larger
than like 30 for any problem

56
00:03:03,460 --> 00:03:08,350
that you're going to be
talking about in the real world

57
00:03:08,350 --> 00:03:13,720
on real computers, but a
factor of 30 is still bad.

58
00:03:13,720 --> 00:03:17,530
I would prefer to do faster with
those constant factors, when

59
00:03:17,530 --> 00:03:18,235
I can.

60
00:03:18,235 --> 00:03:19,360
It's not a constant factor.

61
00:03:19,360 --> 00:03:22,090
It's a logarithmic factor,
but you get what I'm saying.

62
00:03:22,090 --> 00:03:24,690
OK, so what we're
going to do is first

63
00:03:24,690 --> 00:03:27,810
prove that you can't
do faster for--

64
00:03:27,810 --> 00:03:32,330
does everyone understand--
remember what find key meant?

65
00:03:32,330 --> 00:03:36,120
I have a key, I have a bunch of
items that have keys associated

66
00:03:36,120 --> 00:03:39,390
with them, and I want to see
if one of the items that I'm

67
00:03:39,390 --> 00:03:42,480
storing contains a key
that is the same as the one

68
00:03:42,480 --> 00:03:43,890
that I searched for.

69
00:03:43,890 --> 00:03:46,680
The item might
contain other things,

70
00:03:46,680 --> 00:03:49,080
but in particular,
it has a search key

71
00:03:49,080 --> 00:03:52,620
that I'm maintaining the
set on so that it supports

72
00:03:52,620 --> 00:03:56,010
find operations, search
operations based on that key

73
00:03:56,010 --> 00:03:56,700
quickly.

74
00:03:56,700 --> 00:03:58,790
Does that make sense?

75
00:03:58,790 --> 00:04:00,970
So there's the find one
that we want to improve,

76
00:04:00,970 --> 00:04:03,310
and we also want to
improve this insert delete.

77
00:04:03,310 --> 00:04:08,410
We want to be-- make this data
structure dynamic, because we

78
00:04:08,410 --> 00:04:11,840
might do those
operations quite a bit.

79
00:04:11,840 --> 00:04:15,410
And so this lecture's about
optimizing those three things.

80
00:04:15,410 --> 00:04:17,740
OK, so first, I'm
going to show you

81
00:04:17,740 --> 00:04:22,150
that we can't do faster
than log n for find, which

82
00:04:22,150 --> 00:04:23,650
is a little weird.

83
00:04:23,650 --> 00:04:26,290
OK, the model of
computation I'm going

84
00:04:26,290 --> 00:04:28,600
to be proving this
lower bound on--

85
00:04:31,168 --> 00:04:33,460
how I'm going to approach
this is I'm going to say that

86
00:04:33,460 --> 00:04:37,390
any way that I store these--

87
00:04:37,390 --> 00:04:42,380
the items that I'm storing
in this data structure--

88
00:04:42,380 --> 00:04:45,350
for any way I store these
things, any algorithm

89
00:04:45,350 --> 00:04:48,500
of this certain type
is going to require

90
00:04:48,500 --> 00:04:50,090
at least logarithmic time.

91
00:04:50,090 --> 00:04:52,580
That's what we're
going to try to prove.

92
00:04:52,580 --> 00:04:55,340
And the model of
computation that's

93
00:04:55,340 --> 00:04:58,370
weaker than what we've been
talking about previously

94
00:04:58,370 --> 00:05:00,440
is what I'm going to call
the comparison model.

95
00:05:04,220 --> 00:05:07,730
And a comparison model
means-- is that the items,

96
00:05:07,730 --> 00:05:10,010
the objects I'm storing--

97
00:05:10,010 --> 00:05:12,050
I can kind of think of
them as black boxes.

98
00:05:12,050 --> 00:05:15,380
I don't get to touch these
things, except the only way

99
00:05:15,380 --> 00:05:20,060
that I can distinguish
between them is to say,

100
00:05:20,060 --> 00:05:27,820
given a key and an item, or two
items, I can do a comparison

101
00:05:27,820 --> 00:05:28,960
on those keys.

102
00:05:28,960 --> 00:05:31,660
Are these keys the same?

103
00:05:31,660 --> 00:05:34,060
Is this key bigger
than this one?

104
00:05:34,060 --> 00:05:35,710
Is it smaller than this one?

105
00:05:35,710 --> 00:05:40,450
Those are the only operations
I get to do with them.

106
00:05:40,450 --> 00:05:42,430
Say, if the keys are
numbers, I don't get

107
00:05:42,430 --> 00:05:44,740
to look at what number that is.

108
00:05:44,740 --> 00:05:46,810
I just get to take two
keys and compare them.

109
00:05:46,810 --> 00:05:49,660
And actually, all of
the search algorithms

110
00:05:49,660 --> 00:05:53,620
that we saw on Tuesday were
comparison sort algorithms.

111
00:05:53,620 --> 00:05:56,830
What you did was stepped
through the program.

112
00:05:56,830 --> 00:05:59,290
At some point, you
came to a branch

113
00:05:59,290 --> 00:06:01,990
and you looked at
two keys, and you

114
00:06:01,990 --> 00:06:06,340
branched based on whether one
key was bigger than another.

115
00:06:06,340 --> 00:06:07,540
That was a comparison.

116
00:06:07,540 --> 00:06:09,280
And then you move
some stuff around,

117
00:06:09,280 --> 00:06:11,440
but that was the
general paradigm.

118
00:06:11,440 --> 00:06:17,540
Those three sorting operations
lived in this comparison model.

119
00:06:17,540 --> 00:06:20,560
You've got
comparison operations,

120
00:06:20,560 --> 00:06:25,180
like are they equal,
less than, greater than,

121
00:06:25,180 --> 00:06:28,900
maybe greater than or
equal, less than or equal?

122
00:06:28,900 --> 00:06:30,580
Generally, you have
all these operations

123
00:06:30,580 --> 00:06:32,080
that you could do--
maybe not equal.

124
00:06:35,020 --> 00:06:38,200
But the key thing here
is that there are only

125
00:06:38,200 --> 00:06:40,930
two possible outputs to
each of these comparators.

126
00:06:44,010 --> 00:06:46,740
There's only one thing
that I can branch on.

127
00:06:46,740 --> 00:06:49,830
It's going to branch
into two different lines.

128
00:06:49,830 --> 00:06:52,920
It's either true and I do
some other computation,

129
00:06:52,920 --> 00:06:56,640
or it's false and I'll do a
different set of computation.

130
00:06:56,640 --> 00:06:58,310
That makes sense?

131
00:06:58,310 --> 00:06:59,810
So what I'm going
to do is I'm going

132
00:06:59,810 --> 00:07:02,770
to give you a comparison--

133
00:07:02,770 --> 00:07:05,090
an algorithm in the
comparison model

134
00:07:05,090 --> 00:07:08,210
as what I like to
call a decision tree.

135
00:07:08,210 --> 00:07:10,620
So if I specify an
algorithm to you,

136
00:07:10,620 --> 00:07:13,160
the first thing it's going to
do-- if I don't compare items

137
00:07:13,160 --> 00:07:15,890
at all, I'm kind of
screwed, because I'll never

138
00:07:15,890 --> 00:07:17,990
be able to tell if my
key's in there or not.

139
00:07:17,990 --> 00:07:21,120
So I have to do
some comparisons.

140
00:07:21,120 --> 00:07:23,690
So I'll do some computation.

141
00:07:23,690 --> 00:07:25,430
Maybe I find out the
length of the array

142
00:07:25,430 --> 00:07:28,040
and I do some constant time
stuff, but at some point,

143
00:07:28,040 --> 00:07:31,880
I'll do a comparison,
and I'll branch.

144
00:07:31,880 --> 00:07:35,600
I'll come to this node,
and if the comparison--

145
00:07:35,600 --> 00:07:37,550
maybe a less than--

146
00:07:37,550 --> 00:07:41,240
if it's true, I'm going to go
this way in my computation,

147
00:07:41,240 --> 00:07:45,110
and if it's false, I'm going to
go this way in my computation.

148
00:07:45,110 --> 00:07:51,860
And I'm going to keep doing
that with various comparisons--

149
00:07:51,860 --> 00:08:02,710
sure-- until I get down
here to some leaf in which

150
00:08:02,710 --> 00:08:04,160
I'm not branching.

151
00:08:04,160 --> 00:08:07,860
The internal nodes here are
representing comparisons,

152
00:08:07,860 --> 00:08:09,470
but the leaves
are representing--

153
00:08:09,470 --> 00:08:11,360
I stopped my computation.

154
00:08:11,360 --> 00:08:13,680
I'm outputting something.

155
00:08:13,680 --> 00:08:16,510
Does that make sense,
what I'm trying to do?

156
00:08:16,510 --> 00:08:20,700
I'm changing my
algorithm to be put

157
00:08:20,700 --> 00:08:24,210
in this kind of graphical
way, where I'm branching what

158
00:08:24,210 --> 00:08:28,650
my program could possibly
do based on the comparisons

159
00:08:28,650 --> 00:08:30,000
that I do.

160
00:08:30,000 --> 00:08:33,030
I'm not actually counting
the rest of the work

161
00:08:33,030 --> 00:08:35,010
that the program does.

162
00:08:35,010 --> 00:08:37,860
I'm really only looking
at the comparisons,

163
00:08:37,860 --> 00:08:41,880
because I know that I need to
compare some things eventually

164
00:08:41,880 --> 00:08:44,970
to figure out what my items are.

165
00:08:44,970 --> 00:08:47,340
And if that's the only way
I can distinguish items,

166
00:08:47,340 --> 00:08:49,890
then I have to do those
comparisons to find out.

167
00:08:49,890 --> 00:08:51,540
Does that make sense?

168
00:08:51,540 --> 00:08:56,220
All right, so what I
have is a binary tree

169
00:08:56,220 --> 00:08:58,530
that's representing
the comparisons done

170
00:08:58,530 --> 00:08:59,430
by the algorithm.

171
00:08:59,430 --> 00:09:01,510
OK.

172
00:09:01,510 --> 00:09:04,570
So it starts at one comparison
and then it branches.

173
00:09:04,570 --> 00:09:07,110
How many leaves must
I have in my tree?

174
00:09:10,510 --> 00:09:15,106
What does that question mean,
in terms of the program?

175
00:09:15,106 --> 00:09:16,477
AUDIENCE: [INAUDIBLE]

176
00:09:16,477 --> 00:09:17,310
JASON KU: What's up?

177
00:09:17,310 --> 00:09:18,720
AUDIENCE: The number
of comparisons--

178
00:09:18,720 --> 00:09:20,160
JASON KU: The number
of comparisons-- no,

179
00:09:20,160 --> 00:09:21,618
that's the number
of internal nodes

180
00:09:21,618 --> 00:09:23,310
that I have in the algorithm.

181
00:09:23,310 --> 00:09:25,620
And actually, the
number of comparisons

182
00:09:25,620 --> 00:09:27,510
that I do in an execution
of the algorithm

183
00:09:27,510 --> 00:09:32,942
is just along a path from
here to the-- to a leaf.

184
00:09:32,942 --> 00:09:34,650
So what do the leaves
actually represent?

185
00:09:34,650 --> 00:09:36,290
Those represent outputs.

186
00:09:36,290 --> 00:09:39,470
I'm going to output
something here.

187
00:09:39,470 --> 00:09:40,366
Yep?

188
00:09:40,366 --> 00:09:41,300
AUDIENCE: [INAUDIBLE]

189
00:09:41,300 --> 00:09:42,050
JASON KU: The number of--

190
00:09:42,050 --> 00:09:42,550
OK.

191
00:09:45,440 --> 00:09:47,570
So what is the output
to my search algorithm?

192
00:09:47,570 --> 00:09:52,660
Maybe it's the-- an index of
an item that contains this key.

193
00:09:52,660 --> 00:09:58,000
Or maybe I return the
item is the output--

194
00:09:58,000 --> 00:09:59,830
the item of the
thing I'm storing.

195
00:09:59,830 --> 00:10:04,720
And I'm storing n things, so
I need at least n outputs,

196
00:10:04,720 --> 00:10:07,870
because I need to be able
to return any of the items

197
00:10:07,870 --> 00:10:11,080
that I'm storing based on a
different search parameter,

198
00:10:11,080 --> 00:10:12,520
if it's going to be correct.

199
00:10:12,520 --> 00:10:13,900
I actually need one more output.

200
00:10:13,900 --> 00:10:15,150
Why do I need one more output?

201
00:10:17,700 --> 00:10:20,620
If it's not in there--

202
00:10:20,620 --> 00:10:26,710
so any correct comparison
searching algorithm--

203
00:10:26,710 --> 00:10:30,010
I'm doing some comparisons
to find this thing--

204
00:10:30,010 --> 00:10:34,375
needs to have at
least n plus 1 leaves.

205
00:10:38,280 --> 00:10:41,280
Otherwise, it can't be correct,
because I could look up

206
00:10:41,280 --> 00:10:44,880
the one that I'm not
returning in that set

207
00:10:44,880 --> 00:10:47,950
and it would never be
able to return that value.

208
00:10:47,950 --> 00:10:50,020
Does that make sense?

209
00:10:50,020 --> 00:10:50,755
Yeah?

210
00:10:50,755 --> 00:10:51,630
AUDIENCE: [INAUDIBLE]

211
00:10:51,630 --> 00:10:53,730
JASON KU: What's n?

212
00:10:53,730 --> 00:10:55,320
For a data structure,
n is the number

213
00:10:55,320 --> 00:10:58,720
of things stored in that
data structure at that time--

214
00:10:58,720 --> 00:11:00,660
so the number of items
in the data structure.

215
00:11:00,660 --> 00:11:03,060
That's what it means
in all of these tables.

216
00:11:03,060 --> 00:11:05,630
Any other questions?

217
00:11:05,630 --> 00:11:09,610
OK, so now we get
to the fun part.

218
00:11:09,610 --> 00:11:13,540
How many comparisons does
this algorithm have to do?

219
00:11:16,660 --> 00:11:17,806
Yeah, up there--

220
00:11:17,806 --> 00:11:19,750
AUDIENCE: [INAUDIBLE]

221
00:11:19,750 --> 00:11:22,310
JASON KU: What's up?

222
00:11:22,310 --> 00:11:25,460
All right, your colleague is
jumping ahead for a second,

223
00:11:25,460 --> 00:11:30,140
but really, I have to do as many
comparisons in the worst case

224
00:11:30,140 --> 00:11:35,340
as the longest root-to-leaf
path in this tree--

225
00:11:35,340 --> 00:11:37,470
because as I'm executing
this algorithm,

226
00:11:37,470 --> 00:11:42,860
I'll go down this thing,
always branching down,

227
00:11:42,860 --> 00:11:44,840
and at some point,
I'll get to a leaf.

228
00:11:44,840 --> 00:11:47,180
And in the worst
case, if I happen

229
00:11:47,180 --> 00:11:51,200
to need to return this
particular output,

230
00:11:51,200 --> 00:11:55,550
then I'll have to walk down the
longest thing, just the longest

231
00:11:55,550 --> 00:11:57,180
path.

232
00:11:57,180 --> 00:12:01,580
So then the longest path is the
same as the height of the tree,

233
00:12:01,580 --> 00:12:04,040
so the question
then becomes, what

234
00:12:04,040 --> 00:12:10,790
is the minimum height of any
binary tree that has at least n

235
00:12:10,790 --> 00:12:13,784
plus 1 leaves?

236
00:12:13,784 --> 00:12:18,830
Does everyone understand why
we're asking that question?

237
00:12:18,830 --> 00:12:19,643
Yeah?

238
00:12:19,643 --> 00:12:22,433
AUDIENCE: Could you go over again
why it needs n plus 1 leaves?

239
00:12:22,433 --> 00:12:24,100
JASON KU: Why it needs
n plus 1 leaves--

240
00:12:24,100 --> 00:12:27,640
if it's a correct algorithm,
it needs to return--

241
00:12:27,640 --> 00:12:30,220
it needs to be able to
return any of the n items

242
00:12:30,220 --> 00:12:33,640
that I'm storing or say that
the key that I'm looking for

243
00:12:33,640 --> 00:12:35,740
is not there--

244
00:12:35,740 --> 00:12:37,720
great question.

245
00:12:37,720 --> 00:12:40,300
OK, so what is
the minimum height

246
00:12:40,300 --> 00:12:44,590
of any binary tree
that has n plus 1--

247
00:12:44,590 --> 00:12:48,247
at least n plus 1 leaves?

248
00:12:48,247 --> 00:12:50,080
You can actually state
a recurrence for that

249
00:12:50,080 --> 00:12:50,710
and solve that.

250
00:12:50,710 --> 00:12:52,502
You're going to do that
in your recitation.

251
00:12:52,502 --> 00:12:53,980
But it's log n.

252
00:12:53,980 --> 00:12:57,950
The best you can do is if this
is a balanced binary tree.

253
00:12:57,950 --> 00:13:10,600
So the min height is going
to be at least log n height.

254
00:13:14,760 --> 00:13:17,500
Or the min height
is logarithmic,

255
00:13:17,500 --> 00:13:19,080
so it's actually
theta right here.

256
00:13:19,080 --> 00:13:21,810
But if I just said
height here, I

257
00:13:21,810 --> 00:13:24,630
would be lower
bounding the height.

258
00:13:24,630 --> 00:13:28,800
I could have a linear height,
if I just chained comparisons

259
00:13:28,800 --> 00:13:34,050
down one by one, if I was doing
a linear search, for example.

260
00:13:34,050 --> 00:13:36,990
All right, so this is saying
that, if I'm just restricting

261
00:13:36,990 --> 00:13:40,680
to comparisons, I have to
spend at least logarithmic time

262
00:13:40,680 --> 00:13:43,680
to be able to find whether
this key is in my set.
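The counting step behind that bound can be written out as follows. This is a sketch of the standard argument, where $h$ denotes the height of the decision tree (a symbol introduced here for illustration): a binary tree of height $h$ has at most $2^h$ leaves, and a correct search needs at least $n + 1$ of them.

```latex
2^{h} \;\ge\; \#\text{leaves} \;\ge\; n + 1
\quad\Longrightarrow\quad
h \;\ge\; \log_2 (n + 1) \;=\; \Omega(\log n)
```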

263
00:13:46,613 --> 00:13:48,030
But I don't want
logarithmic time.

264
00:13:48,030 --> 00:13:49,930
I want faster.

265
00:13:49,930 --> 00:13:51,085
So how can I do that?

266
00:13:51,085 --> 00:13:51,960
AUDIENCE: [INAUDIBLE]

267
00:13:51,960 --> 00:13:54,660
JASON KU: I have one operation
in my model of computation

268
00:13:54,660 --> 00:13:56,970
I presented a
couple of weeks ago

269
00:13:56,970 --> 00:14:00,480
that allows me to do faster,
which allows me to do something

270
00:14:00,480 --> 00:14:03,240
stronger than comparisons.

271
00:14:03,240 --> 00:14:06,720
Comparisons have a
constant branching factor.

272
00:14:06,720 --> 00:14:08,850
In particular, I can--

273
00:14:08,850 --> 00:14:11,730
if I do this operation-- this
constant time operation--

274
00:14:11,730 --> 00:14:17,350
I can branch to two
different locations.

275
00:14:17,350 --> 00:14:21,630
It's like an if kind of
situation-- if, or else.

276
00:14:21,630 --> 00:14:24,540
And in fact, if I had
constant branching factor

277
00:14:24,540 --> 00:14:28,360
for any constant here--

278
00:14:28,360 --> 00:14:31,210
if I had three or four, if
it was bounded by a constant,

279
00:14:31,210 --> 00:14:32,800
the height of this
tree would still

280
00:14:32,800 --> 00:14:36,070
be bounded by a log
base the constant

281
00:14:36,070 --> 00:14:39,490
of that number of leaves.
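The same count with a general branching factor, sketched under the assumption that every node has at most $b$ children and the tree has height $h$ (both symbols are introduced here for illustration). For constant $b$ the bound is still logarithmic; only a branching factor that grows with $n$ can bring the height down to a constant.

```latex
b^{h} \;\ge\; n + 1
\quad\Longrightarrow\quad
h \;\ge\; \log_b (n + 1) \;=\; \frac{\log_2 (n + 1)}{\log_2 b}
\;=\; \Omega(\log n) \quad \text{for constant } b,
\qquad\text{while } b \ge n + 1 \text{ allows } h = 1 .
```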

282
00:14:39,490 --> 00:14:42,220
So I need, in some sense,
to be able to branch

283
00:14:42,220 --> 00:14:45,860
a non-constant amount.

284
00:14:45,860 --> 00:14:49,530
So how can I branch a
non-constant amount?

285
00:14:49,530 --> 00:14:51,870
This is a little tricky.

286
00:14:51,870 --> 00:14:57,390
We had this really neat
operation in the random access

287
00:14:57,390 --> 00:15:01,440
machine that we
could randomly go

288
00:15:01,440 --> 00:15:03,540
to any place in memory
in constant time

289
00:15:03,540 --> 00:15:04,350
based on a number.

290
00:15:08,250 --> 00:15:10,020
That was a super
powerful thing, because

291
00:15:10,020 --> 00:15:12,450
within a single
constant time operation,

292
00:15:12,450 --> 00:15:15,450
I could go to any
space in memory.

293
00:15:15,450 --> 00:15:19,050
That's potentially much larger
than linear branching factor,

294
00:15:19,050 --> 00:15:20,490
depending on the
size of my model

295
00:15:20,490 --> 00:15:22,620
and the size of my machine.

296
00:15:22,620 --> 00:15:24,270
So that's a very
powerful operation.

297
00:15:24,270 --> 00:15:27,328
Can we use that to find quicker?

298
00:15:27,328 --> 00:15:28,245
Anyone have any ideas?

299
00:15:31,420 --> 00:15:32,210
Sure.

300
00:15:32,210 --> 00:15:33,103
AUDIENCE: [INAUDIBLE]

301
00:15:33,103 --> 00:15:35,270
JASON KU: We're going to
get to hashing in a second,

302
00:15:35,270 --> 00:15:40,640
but this is a simpler
concept than hashing--

303
00:15:40,640 --> 00:15:44,280
something you probably
are familiar with already.

304
00:15:44,280 --> 00:15:46,350
We've kind of been
using it implicitly

305
00:15:46,350 --> 00:15:50,240
in some of our sequence
data structure things.

306
00:15:50,240 --> 00:15:57,860
What we're going to do is, if
I have an item that has key 10,

307
00:15:57,860 --> 00:16:04,110
I'm going to keep an array and
store that item 10 spaces away

308
00:16:04,110 --> 00:16:07,860
from the front of the
array, right at index 9,

309
00:16:07,860 --> 00:16:09,840
or the 10th index.

310
00:16:09,840 --> 00:16:11,400
Does that make sense?

311
00:16:11,400 --> 00:16:14,770
If I store that item at
that location in memory,

312
00:16:14,770 --> 00:16:19,602
I can use this random
access to that location

313
00:16:19,602 --> 00:16:21,060
and see if there's
something there.

314
00:16:21,060 --> 00:16:23,018
If there's something
there, I return that item.

315
00:16:23,018 --> 00:16:24,930
Does that make sense?

316
00:16:24,930 --> 00:16:26,930
This is what I call a
direct access array.

317
00:16:29,700 --> 00:16:32,160
It's really no different
than the arrays

318
00:16:32,160 --> 00:16:38,170
that we've been talking
about earlier in the class.

319
00:16:38,170 --> 00:16:43,690
We got an array, and
if I have an item here

320
00:16:43,690 --> 00:16:50,080
with key equals 10, I'll stick
it here in the 10th place.

321
00:16:50,080 --> 00:16:56,210
Now, I can only store
one item with the key 10

322
00:16:56,210 --> 00:16:58,940
in my thing, and that's
one of the stipulations we

323
00:16:58,940 --> 00:17:00,500
had on our set data structures.

324
00:17:00,500 --> 00:17:03,073
If we tried to insert
something with the same key

325
00:17:03,073 --> 00:17:04,490
as something already
stored there,

326
00:17:04,490 --> 00:17:06,380
we're going to replace the item.

327
00:17:06,380 --> 00:17:09,530
That's what the semantics
of our set interface was.

328
00:17:09,530 --> 00:17:10,220
But that's OK.

329
00:17:10,220 --> 00:17:14,859
That's satisfying the
conditions of our set interface.

330
00:17:14,859 --> 00:17:17,670
So if we store it
there, that's fantastic.

331
00:17:17,670 --> 00:17:19,589
How long does it
take to find, if we

332
00:17:19,589 --> 00:17:23,240
have an item with the key 10?

333
00:17:23,240 --> 00:17:25,700
It takes constant
time, worst case--

334
00:17:25,700 --> 00:17:27,150
great.

335
00:17:27,150 --> 00:17:29,385
How about inserting
or deleting something?

336
00:17:29,385 --> 00:17:30,673
AUDIENCE: [INAUDIBLE]

337
00:17:30,673 --> 00:17:31,590
JASON KU: What's that?

338
00:17:31,590 --> 00:17:32,590
AUDIENCE: [INAUDIBLE]

339
00:17:32,590 --> 00:17:34,300
JASON KU: Again, constant time--

340
00:17:34,300 --> 00:17:36,273
we've solved all our problems.

341
00:17:36,273 --> 00:17:36,940
This is amazing.
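A minimal sketch of a direct access array, assuming integer keys in the range 0 to u - 1. The class name and the `(key, value)` item layout are hypothetical choices for illustration. Each operation is a single index into the array, so find, insert, and delete all take constant time.

```python
class DirectAccessArray:
    """Set keyed by integers 0..u-1, one slot per possible key (illustrative only)."""

    def __init__(self, u):
        # Allocates one slot for every possible key: Theta(u) space.
        self.slots = [None] * u

    def find(self, key):
        # One random-access lookup; assumes 0 <= key < u.
        return self.slots[key]

    def insert(self, key, value):
        # Overwrites any item already stored under this key.
        self.slots[key] = (key, value)

    def delete(self, key):
        item = self.slots[key]
        self.slots[key] = None
        return item
```

For example, `d = DirectAccessArray(16)` followed by `d.insert(10, "ten")` stores the item at index 10, and `d.find(10)` reads it back with one lookup.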

342
00:17:39,540 --> 00:17:40,620
OK.

343
00:17:40,620 --> 00:17:42,090
What's not amazing about this?

344
00:17:42,090 --> 00:17:43,800
Why don't we just do
this all the time?

345
00:17:47,550 --> 00:17:50,190
Yeah?

346
00:17:50,190 --> 00:17:53,450
AUDIENCE: You don't know
how high the numbers go.

347
00:17:53,450 --> 00:17:56,820
JASON KU: I don't know
how high the numbers go.

348
00:17:56,820 --> 00:17:59,430
So let's say I'm
storing, I don't know,

349
00:17:59,430 --> 00:18:03,720
a number associated with
the 300 or 400 of you

350
00:18:03,720 --> 00:18:05,220
that are in this classroom.

351
00:18:08,290 --> 00:18:10,330
But I'm storing your MIT IDs.

352
00:18:10,330 --> 00:18:12,080
How big are those numbers?

353
00:18:12,080 --> 00:18:15,190
Those are like
nine-digit numbers--

354
00:18:15,190 --> 00:18:17,370
pretty long numbers.

355
00:18:17,370 --> 00:18:21,380
So what I would need to do--
and if I was storing your keys

356
00:18:21,380 --> 00:18:25,460
as MIT IDs, I
would need an array

357
00:18:25,460 --> 00:18:28,380
that has indices
that span the entire

358
00:18:28,380 --> 00:18:33,030
space of nine-digit numbers.

359
00:18:33,030 --> 00:18:37,110
That's like 10 to the--

360
00:18:37,110 --> 00:18:37,860
10 to the 9.

361
00:18:37,860 --> 00:18:38,640
Thank you.

362
00:18:38,640 --> 00:18:43,500
10 to the 9 is the size of
a direct access array I'd have

363
00:18:43,500 --> 00:18:50,010
to build to be able
to use this technique

364
00:18:50,010 --> 00:18:54,480
to create a direct access array
to search on your MIT IDs,

365
00:18:54,480 --> 00:18:57,570
when there's only really
300 of you in here.

366
00:18:57,570 --> 00:19:00,870
So 300 or 400 is
an n that's much

367
00:19:00,870 --> 00:19:03,030
smaller than the
size of the numbers

368
00:19:03,030 --> 00:19:04,330
that I'm trying to store.

369
00:19:04,330 --> 00:19:06,000
What I'm going to
use as a variable

370
00:19:06,000 --> 00:19:09,030
to talk about the size of
the numbers I'm storing--

371
00:19:09,030 --> 00:19:12,210
I'm going to say u is the
maximum size of any number

372
00:19:12,210 --> 00:19:13,770
that I'm storing.

373
00:19:13,770 --> 00:19:17,910
It's the size of the universe of
space of keys that I'm storing.

374
00:19:17,910 --> 00:19:19,320
Does that make sense?

375
00:19:19,320 --> 00:19:24,330
OK, so to instantiate a direct
access array of that size,

376
00:19:24,330 --> 00:19:26,920
I have to allocate
that amount of space.

377
00:19:26,920 --> 00:19:31,140
And so if that is
much bigger than n,

378
00:19:31,140 --> 00:19:34,020
then I'm kind of
screwed, because I'm

379
00:19:34,020 --> 00:19:36,030
using much more space.
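A back-of-the-envelope version of the space argument, using the numbers from the lecture (about 400 students, nine-digit IDs). The 8 bytes per slot is an assumption about a 64-bit machine, not something stated in the lecture.

```python
n = 400                # items actually stored (students in the room)
u = 10**9              # size of the key universe (nine-digit IDs)
BYTES_PER_SLOT = 8     # assumption: one 64-bit pointer per slot

direct_access_bytes = u * BYTES_PER_SLOT   # space for a direct access array
compact_bytes = n * BYTES_PER_SLOT         # space for an array of just n slots

print(direct_access_bytes // 10**9, "GB for the direct access array")
print(compact_bytes, "bytes for n slots")
print(u // n, "slots allocated per item actually stored")
```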

380
00:19:36,030 --> 00:19:40,530
And these order operations are
bad also, because essentially,

381
00:19:40,530 --> 00:19:46,020
if I am storing these
things non-contiguously,

382
00:19:46,020 --> 00:19:48,450
I kind of just have
to scan down the thing

383
00:19:48,450 --> 00:19:52,930
to find the next
element, for example.
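A small sketch of why order operations are slow here, assuming the direct access array is a plain Python list with `None` in the empty slots. The function name `find_next` is illustrative. In the worst case the scan touches on the order of u slots, regardless of how few items are stored.

```python
def find_next(slots, key):
    """Return the smallest stored key strictly greater than `key`, or None."""
    # No way to jump to the next occupied slot, so scan one index at a time.
    for k in range(key + 1, len(slots)):
        if slots[k] is not None:
            return k
    return None
```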

384
00:19:52,930 --> 00:19:53,972
OK, what's your question?

385
00:19:53,972 --> 00:19:55,388
AUDIENCE: Is a
direct access array

386
00:19:55,388 --> 00:19:56,890
a sequence data structure?

387
00:19:56,890 --> 00:19:59,532
JASON KU: A direct access
array is a set data structure.

388
00:19:59,532 --> 00:20:01,240
That's why it's a set
interface up there.

389
00:20:05,670 --> 00:20:09,390
Your colleague is asking whether
you can use a direct access array

390
00:20:09,390 --> 00:20:10,320
to implement a set--

391
00:20:10,320 --> 00:20:11,580
I mean a sequence.

392
00:20:11,580 --> 00:20:14,790
And actually, I think you'll
see in your recitation notes,

393
00:20:14,790 --> 00:20:19,140
you have code that can
take a set data structure

394
00:20:19,140 --> 00:20:20,850
and implement sequence
data structure,

395
00:20:20,850 --> 00:20:23,370
and take sequence data structure
and implement a set data

396
00:20:23,370 --> 00:20:24,723
structure.

397
00:20:24,723 --> 00:20:26,890
They just won't necessarily
have very good run time.

398
00:20:26,890 --> 00:20:29,440
So this direct access
array semantics

399
00:20:29,440 --> 00:20:34,055
is really just good for these
specific set operations.

400
00:20:34,055 --> 00:20:35,100
Does that make sense?

401
00:20:35,100 --> 00:20:35,600
Yeah?

402
00:20:35,600 --> 00:20:36,980
AUDIENCE: What is u?

403
00:20:36,980 --> 00:20:39,410
JASON KU: u is the
size of the largest key

404
00:20:39,410 --> 00:20:40,910
that I'm allowed to store.

405
00:20:40,910 --> 00:20:42,920
That makes sense?

406
00:20:42,920 --> 00:20:47,375
The direct access array is
supporting up to u size keys.

407
00:20:47,375 --> 00:20:48,920
Does that make sense?

408
00:20:48,920 --> 00:20:51,750
OK, we're going to
move on for a second.

409
00:20:51,750 --> 00:20:52,860
That's the problem, right?

410
00:20:52,860 --> 00:20:59,475
When u, the largest key--

411
00:21:01,980 --> 00:21:04,650
we're assuming integers here--

412
00:21:04,650 --> 00:21:10,200
integer keys-- so in
the comparison model,

413
00:21:10,200 --> 00:21:12,600
we could store any
arbitrary objects

414
00:21:12,600 --> 00:21:14,460
that supported a comparison.

415
00:21:14,460 --> 00:21:17,850
Here we really need
to have integer keys,

416
00:21:17,850 --> 00:21:21,810
or else we're not going to be
able to use those as addresses.

417
00:21:21,810 --> 00:21:25,740
So we're making an
assumption on the inputs

418
00:21:25,740 --> 00:21:27,600
that I can only
store integers now.

419
00:21:27,600 --> 00:21:29,820
I can't store
arbitrary objects--

420
00:21:29,820 --> 00:21:31,530
items with keys.

421
00:21:31,530 --> 00:21:34,800
And in particular, I also
need to-- this is a subtlety

422
00:21:34,800 --> 00:21:36,870
that's in the word RAM model--

423
00:21:36,870 --> 00:21:39,960
how can I be assured
that these keys can

424
00:21:39,960 --> 00:21:41,400
be looked up in constant time?

425
00:21:44,130 --> 00:21:46,190
I have this little CPU.

426
00:21:46,190 --> 00:21:49,310
It's got some number of
registers it can act upon.

427
00:21:49,310 --> 00:21:52,604
How big are those registers?

428
00:21:52,604 --> 00:21:53,580
AUDIENCE: [INAUDIBLE]

429
00:21:53,580 --> 00:21:54,205
JASON KU: What?

430
00:21:56,360 --> 00:21:59,390
Right now, they're 64 bits,
but in general, they're w.

431
00:21:59,390 --> 00:22:04,790
They're the size of your
word on your machine.

432
00:22:04,790 --> 00:22:09,290
2 to the w is the number
of addresses I can access.

433
00:22:09,290 --> 00:22:11,930
If I'm going to be able to
use this direct access array,

434
00:22:11,930 --> 00:22:19,220
I need to make sure that the
u is less than 2 to the w,

435
00:22:19,220 --> 00:22:22,970
if I want these operations
to run in constant time.

436
00:22:22,970 --> 00:22:25,670
If I have keys that are
much larger than this,

437
00:22:25,670 --> 00:22:28,730
I'm going to need to
do something else,

438
00:22:28,730 --> 00:22:30,410
but this is kind
of the assumption.

439
00:22:30,410 --> 00:22:34,010
In this class, when we give
you an array of integers,

440
00:22:34,010 --> 00:22:35,570
or an array of
strings, or something

441
00:22:35,570 --> 00:22:38,780
like that on your
problem or on an exam,

442
00:22:38,780 --> 00:22:41,990
the assumption is,
unless we give you bounds

443
00:22:41,990 --> 00:22:45,530
on the size of those things--

444
00:22:45,530 --> 00:22:47,690
like the number of
characters in your string

445
00:22:47,690 --> 00:22:49,370
or the size of the
number in the--

446
00:22:49,370 --> 00:22:53,690
you can assume that those things
will fit in one word of memory.

447
00:22:58,690 --> 00:23:04,040
w is the word size of your
machine, the number of bits

448
00:23:04,040 --> 00:23:08,960
that your machine can do
operations on in constant time.

449
00:23:08,960 --> 00:23:10,560
Any other questions?

450
00:23:10,560 --> 00:23:12,390
OK, so we have this problem.

451
00:23:12,390 --> 00:23:15,710
We're using way too
much space, when we

452
00:23:15,710 --> 00:23:18,320
have a large universe of keys.

453
00:23:18,320 --> 00:23:24,140
So how do we get around
that problem? Any ideas?

454
00:23:28,930 --> 00:23:29,928
Sure.

455
00:23:29,928 --> 00:23:31,345
AUDIENCE: Instead
of [INAUDIBLE]..

456
00:23:36,180 --> 00:23:39,210
JASON KU: OK, so what
your colleague is saying--

457
00:23:39,210 --> 00:23:43,110
instead of just storing
one value at each place,

458
00:23:43,110 --> 00:23:47,170
maybe store more than one value.

459
00:23:47,170 --> 00:23:50,590
If we're using
this idea, where I

460
00:23:50,590 --> 00:23:53,920
am storing my key at
the index of the key,

461
00:23:53,920 --> 00:23:55,750
that's getting
around us having

462
00:23:55,750 --> 00:23:58,480
to have unique keys
in our data structure.

463
00:23:58,480 --> 00:24:02,590
It's not getting around
this space usage problem.

464
00:24:02,590 --> 00:24:04,740
Does that make sense?

465
00:24:04,740 --> 00:24:09,000
We will end up storing
multiple things at indices,

466
00:24:09,000 --> 00:24:13,290
but there's another trick that
I'm looking for right now.

467
00:24:13,290 --> 00:24:16,230
We have a lot of
space that we would

468
00:24:16,230 --> 00:24:19,710
need to allocate for
this data structure.

469
00:24:19,710 --> 00:24:22,870
What's an alternative?

470
00:24:22,870 --> 00:24:25,880
Instead of allocating a
lot of space, we allocate--

471
00:24:28,460 --> 00:24:30,785
less space.

472
00:24:30,785 --> 00:24:31,910
Let's allocate less space.

473
00:24:31,910 --> 00:24:32,410
All right.

474
00:24:36,270 --> 00:24:38,145
This is our space of keys, u.

475
00:24:40,920 --> 00:24:47,040
But instead, I want to store
those things in a direct access

476
00:24:47,040 --> 00:24:53,902
array of maybe size n, something
like the order of the things

477
00:24:53,902 --> 00:24:55,110
that I'm going to be storing.

478
00:24:55,110 --> 00:24:57,480
I'm going to relax
that and say we're

479
00:24:57,480 --> 00:25:00,780
going to make this
a length m that's

480
00:25:00,780 --> 00:25:04,950
around the size of the
things I'm storing.

481
00:25:07,757 --> 00:25:09,590
And what I'm going to
do is I'm going to try

482
00:25:09,590 --> 00:25:12,110
to map this space of keys--

483
00:25:12,110 --> 00:25:16,160
this large space of
keys, from 0 to u minus 1

484
00:25:16,160 --> 00:25:18,620
or something like that--

485
00:25:18,620 --> 00:25:21,870
down to a range
that's 0 to m minus 1.

486
00:25:24,570 --> 00:25:26,860
I'm going to want a function--

487
00:25:26,860 --> 00:25:29,100
this is what I'm
going to call h--

488
00:25:29,100 --> 00:25:37,150
which maps this range
down to a smaller range.

489
00:25:40,700 --> 00:25:41,630
Does that make sense?

490
00:25:41,630 --> 00:25:43,130
I'm going to have
some function that

491
00:25:43,130 --> 00:25:44,960
takes that large space of keys--

492
00:25:44,960 --> 00:25:46,130
sticks them down here.

493
00:25:48,760 --> 00:25:55,130
And instead of storing
at an index of the key,

494
00:25:55,130 --> 00:25:58,810
I'm going to put the key through
this function, the key space,

495
00:25:58,810 --> 00:26:02,510
into a compressed
space and store it

496
00:26:02,510 --> 00:26:05,190
at that index location.

497
00:26:05,190 --> 00:26:06,670
Does that make sense?

498
00:26:06,670 --> 00:26:07,527
Sure.

499
00:26:07,527 --> 00:26:10,330
AUDIENCE: [INAUDIBLE]

500
00:26:10,330 --> 00:26:12,020
JASON KU: Your colleague is--

501
00:26:12,020 --> 00:26:15,910
comes up with the question I
was going to ask right away,

502
00:26:15,910 --> 00:26:17,930
which was, what's
the problem here?

503
00:26:17,930 --> 00:26:21,250
The problem is there's the
potential that we might be--

504
00:26:21,250 --> 00:26:26,130
have to store more than
one thing at the same index

505
00:26:26,130 --> 00:26:27,930
location.

506
00:26:27,930 --> 00:26:31,560
If I have a function that
maps this big space down

507
00:26:31,560 --> 00:26:36,450
to this small
space, I got to have

508
00:26:36,450 --> 00:26:40,530
multiple of these things going
to the same places here, right?

509
00:26:40,530 --> 00:26:44,130
It can't be injective.

510
00:26:44,130 --> 00:26:45,870
But just based on
pigeonhole principle,

511
00:26:45,870 --> 00:26:47,490
I have more of these things.

512
00:26:47,490 --> 00:26:50,140
At least two of them have to
go to something over here.

513
00:26:50,140 --> 00:26:54,930
In fact, if I have, say, u
is bigger than n squared,

514
00:26:54,930 --> 00:26:58,180
for example, there--

515
00:26:58,180 --> 00:27:00,430
for any function I
give you that maps

516
00:27:00,430 --> 00:27:05,110
this large space down to the
small space, n of these things

517
00:27:05,110 --> 00:27:08,050
will map to the same place.

518
00:27:08,050 --> 00:27:11,770
So if I choose a
bad function here,

519
00:27:11,770 --> 00:27:16,130
then I'll have to store n things
at the same index location.

520
00:27:16,130 --> 00:27:19,330
And if I go there,
I have to check

521
00:27:19,330 --> 00:27:21,125
to see whether any of
those are the things

522
00:27:21,125 --> 00:27:22,000
that I'm looking for.

523
00:27:22,000 --> 00:27:23,680
I haven't gained anything.

524
00:27:23,680 --> 00:27:27,160
I really want a hash function
that will evenly distribute

525
00:27:27,160 --> 00:27:29,140
keys over this space.

526
00:27:32,430 --> 00:27:34,320
Does that make sense?

527
00:27:34,320 --> 00:27:35,800
But we have a problem here.

528
00:27:35,800 --> 00:27:37,980
If we need to store
multiple things

529
00:27:37,980 --> 00:27:41,610
at a given location in memory--

530
00:27:41,610 --> 00:27:42,270
can't do that.

531
00:27:42,270 --> 00:27:44,470
I have one thing
I can put there.

532
00:27:44,470 --> 00:27:46,650
So I have two options
on how to deal--

533
00:27:46,650 --> 00:27:49,140
what I call collisions.

534
00:27:49,140 --> 00:27:52,860
If I have two items
here, like a and b,

535
00:27:52,860 --> 00:27:58,750
these are different keys
in my universe of space.

536
00:27:58,750 --> 00:28:02,140
But it's possible that
they both map down

537
00:28:02,140 --> 00:28:07,240
to some hash that
has the same value.

538
00:28:10,950 --> 00:28:14,020
If I first hash a, and a is--

539
00:28:14,020 --> 00:28:17,470
I put a there, where do I put b?

540
00:28:22,170 --> 00:28:25,640
There are two options.

541
00:28:25,640 --> 00:28:28,920
AUDIENCE: Is the second
data structure [INAUDIBLE]

542
00:28:28,920 --> 00:28:31,770
so that it can
store [INAUDIBLE]??

543
00:28:31,770 --> 00:28:34,410
JASON KU: OK, so what
your colleague is saying--

544
00:28:34,410 --> 00:28:36,200
can I store this one
is a linked list,

545
00:28:36,200 --> 00:28:40,930
and then I can just insert a
guy right next to where it was?

546
00:28:40,930 --> 00:28:43,630
What's the problem there?

547
00:28:43,630 --> 00:28:48,550
Are linked lists good with
direct accessing by an index?

548
00:28:48,550 --> 00:28:51,040
No, they're terrible
with get_at and set_at.

549
00:28:51,040 --> 00:28:53,140
They take linear time there.

550
00:28:53,140 --> 00:28:55,570
So really, the whole
point of a direct access array

551
00:28:55,570 --> 00:28:57,370
is that there is an
array underneath,

552
00:28:57,370 --> 00:28:59,830
and I can do this
index arithmetic

553
00:28:59,830 --> 00:29:01,590
and go down to the next thing.

554
00:29:01,590 --> 00:29:03,340
So I really don't want
to replace a linked

555
00:29:03,340 --> 00:29:07,460
list as this data structure.

556
00:29:07,460 --> 00:29:07,960
Yeah?

557
00:29:10,510 --> 00:29:11,556
What's up?

558
00:29:11,556 --> 00:29:13,990
AUDIENCE: [INAUDIBLE]

559
00:29:13,990 --> 00:29:15,730
JASON KU: We can make
it really unlikely.

560
00:29:15,730 --> 00:29:17,380
Sure.

561
00:29:17,380 --> 00:29:19,840
I don't know what likely
means, because I'm

562
00:29:19,840 --> 00:29:22,438
giving you a hash function--
one hash function.

563
00:29:22,438 --> 00:29:23,980
And I don't know
what the inputs are.

564
00:29:23,980 --> 00:29:26,200
Yeah?

565
00:29:26,200 --> 00:29:26,890
Go ahead.

566
00:29:26,890 --> 00:29:31,630
AUDIENCE: [INAUDIBLE]

567
00:29:31,630 --> 00:29:32,740
JASON KU: OK, right.

568
00:29:32,740 --> 00:29:36,770
So there are actually
two solutions here.

569
00:29:36,770 --> 00:29:42,520
One is I-- maybe, if I
choose m to be larger than n,

570
00:29:42,520 --> 00:29:45,210
there's going to be
extra space in here.

571
00:29:45,210 --> 00:29:49,460
I'll just stick it somewhere
else in the existing array.

572
00:29:49,460 --> 00:29:52,740
How I find an open space
is a little complicated,

573
00:29:52,740 --> 00:29:57,710
but this is a technique
called open addressing, which

574
00:29:57,710 --> 00:30:00,350
is much more common
than the technique

575
00:30:00,350 --> 00:30:04,250
we're going to be talking
about today in implementations.

576
00:30:04,250 --> 00:30:07,640
Python uses an open addressing
scheme, which is essentially,

577
00:30:07,640 --> 00:30:12,680
find another place in the
array to put this collision.

578
00:30:12,680 --> 00:30:15,320
Open addressing is notoriously
difficult to analyze,

579
00:30:15,320 --> 00:30:17,153
so we're not going to
do that in this class.

580
00:30:17,153 --> 00:30:19,160
There's a much easier
technique that-- we

581
00:30:19,160 --> 00:30:23,290
have an implementation for you
in the recitation handouts.

582
00:30:23,290 --> 00:30:26,080
It's what your
colleague up here--

583
00:30:26,080 --> 00:30:27,730
I can't find him--

584
00:30:27,730 --> 00:30:29,620
over there was saying--

585
00:30:29,620 --> 00:30:31,360
was, instead of storing
it somewhere else

586
00:30:31,360 --> 00:30:35,657
in the existing direct
access array down here,

587
00:30:35,657 --> 00:30:37,240
which we usually
call the hash table--

588
00:30:41,110 --> 00:30:43,690
instead of storing it somewhere
else in that hash table,

589
00:30:43,690 --> 00:30:47,170
we'll instead, at that
key, store a pointer

590
00:30:47,170 --> 00:30:51,790
to another data structure,
some other data structure that

591
00:30:51,790 --> 00:30:54,370
can store a bunch of things--
just like any sequence data

592
00:30:54,370 --> 00:30:56,410
structure, like a dynamic
array, or linked list,

593
00:30:56,410 --> 00:30:57,610
or anything, right?

594
00:30:57,610 --> 00:30:59,980
All I need to do is be able
to stick a bunch of things

595
00:30:59,980 --> 00:31:03,490
on there when there
are collisions,

596
00:31:03,490 --> 00:31:05,890
and then, when I go up
to look for that thing,

597
00:31:05,890 --> 00:31:09,400
I'll just look through all
of the things in that data

598
00:31:09,400 --> 00:31:11,780
structure and see
if my key exists.

599
00:31:11,780 --> 00:31:13,060
Does that make sense?

600
00:31:13,060 --> 00:31:16,210
Now, we want to make sure
that those additional data

601
00:31:16,210 --> 00:31:19,870
structures, which
I'll call chains--

602
00:31:19,870 --> 00:31:24,880
we want to make sure that
those chains are short.

603
00:31:24,880 --> 00:31:27,070
I don't want them to be long.

604
00:31:27,070 --> 00:31:29,680
So what I'm going to do is,
when I have this collision here,

605
00:31:29,680 --> 00:31:31,595
instead I'll have
a pointer to some--

606
00:31:31,595 --> 00:31:33,970
I don't know-- maybe make it
a dynamic array, or a linked

607
00:31:33,970 --> 00:31:35,720
list, or something like that.

608
00:31:35,720 --> 00:31:38,590
And I'll put a here
and I'll put b here.

609
00:31:38,590 --> 00:31:46,470
And then later, when I look
up key K, or look up a or b--

610
00:31:46,470 --> 00:31:48,120
let's look up b--

611
00:31:48,120 --> 00:31:51,187
I'll go to this hash value here.

612
00:31:51,187 --> 00:31:52,770
I'll put it through
the hash function.

613
00:31:52,770 --> 00:31:54,000
I'll go to this index.

614
00:31:54,000 --> 00:31:56,820
I'll go to the data structure,
the chain associated

615
00:31:56,820 --> 00:31:59,620
to that index, and I'll
look at all of these items.

616
00:31:59,620 --> 00:32:01,170
I'm just going to
do a linear find.

617
00:32:01,170 --> 00:32:01,920
I'm going to look.

618
00:32:04,470 --> 00:32:06,000
I could put any
data structure here,

619
00:32:06,000 --> 00:32:08,760
but I'm going to look at
this one, see if it's b.

620
00:32:08,760 --> 00:32:09,960
It's not b.

621
00:32:09,960 --> 00:32:11,550
Look at this one-- it is b.

622
00:32:11,550 --> 00:32:12,600
I return yes.

623
00:32:12,600 --> 00:32:13,540
Does that make sense?

624
00:32:13,540 --> 00:32:15,120
So this is an idea
called chaining.
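
The chaining idea just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the recitation handouts: the class name `ChainedHashSet`, the default table size, and the placeholder hash `k % m` are all assumptions made for the example, and each chain is a plain Python list.

```python
class ChainedHashSet:
    """Set of integer keys using chaining (illustrative sketch only)."""

    def __init__(self, m=8):
        self.m = m
        # Every slot starts out pointing at its own empty chain.
        self.table = [[] for _ in range(m)]

    def _h(self, k):
        # Placeholder hash function (an assumption for this sketch).
        return k % self.m

    def insert(self, k):
        chain = self.table[self._h(k)]
        if k not in chain:        # linear scan of the chain
            chain.append(k)

    def find(self, k):
        # Go to the hashed index, then scan that one chain.
        return k in self.table[self._h(k)]


s = ChainedHashSet(m=8)
s.insert(3)     # plays the role of key a
s.insert(11)    # plays the role of key b; 11 % 8 == 3, so it collides with a
```

Both keys end up in the chain at index 3, and `s.find(11)` walks that chain until it reaches the matching key. The cost of a lookup is the length of the chain it has to scan.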

625
00:32:15,120 --> 00:32:16,800
I can put anything I want there.

626
00:32:16,800 --> 00:32:20,280
Commonly, we talk about
putting a linked list there,

627
00:32:20,280 --> 00:32:24,240
but you can put a
dynamic array there.

628
00:32:24,240 --> 00:32:27,773
You can put a sorted array
there to make it easier

629
00:32:27,773 --> 00:32:29,190
to check whether
the key is there.

630
00:32:29,190 --> 00:32:30,690
You can put anything
you want there.

631
00:32:30,690 --> 00:32:32,640
The point of this
lecture is going

632
00:32:32,640 --> 00:32:35,880
to try to show that there's
a choice of hash function

633
00:32:35,880 --> 00:32:42,240
I can make that makes sure
that these chains are small so

634
00:32:42,240 --> 00:32:45,150
that it really doesn't
matter how I store them there,

635
00:32:45,150 --> 00:32:46,853
because I can just--

636
00:32:46,853 --> 00:32:49,020
if there's a constant number
of things stored there,

637
00:32:49,020 --> 00:32:52,220
I can just look at all of
them and do whatever I want,

638
00:32:52,220 --> 00:32:53,510
and still get constant time.

639
00:32:53,510 --> 00:32:54,010
Yeah?

640
00:32:54,010 --> 00:33:01,920
AUDIENCE: So does that mean
that, when you have [INAUDIBLE]

641
00:33:01,920 --> 00:33:05,680
let's just say, for some
reason, the number of things

642
00:33:05,680 --> 00:33:10,808
[INAUDIBLE] is that most of
them get multiple [INAUDIBLE]..

643
00:33:10,808 --> 00:33:13,295
Is it just a data structure
that only holds one thing?

644
00:33:13,295 --> 00:33:13,920
JASON KU: Yeah.

645
00:33:13,920 --> 00:33:16,530
So what your colleague
is saying is,

646
00:33:16,530 --> 00:33:19,950
at initialization,
what is stored here?

647
00:33:19,950 --> 00:33:22,590
Initially, it points to
an empty data structure.

648
00:33:22,590 --> 00:33:25,530
I'm just going to initialize
all of these things to have--

649
00:33:25,530 --> 00:33:27,287
now, you get some overhead here.

650
00:33:27,287 --> 00:33:29,370
We're paying something for
this-- some extra space

651
00:33:29,370 --> 00:33:31,770
and having a pointer and
another data structure

652
00:33:31,770 --> 00:33:32,760
at all of these things.

653
00:33:32,760 --> 00:33:34,590
Or you could have
the semantics where,

654
00:33:34,590 --> 00:33:36,420
if I only have one
thing here, I'm

655
00:33:36,420 --> 00:33:38,880
going to store that
thing at this location,

656
00:33:38,880 --> 00:33:41,320
but if I have multiple, it
points to a data structure.

657
00:33:41,320 --> 00:33:44,100
These are kind of complicated
implementation details,

658
00:33:44,100 --> 00:33:46,740
but you get the basic idea.

659
00:33:46,740 --> 00:33:49,260
If I just have a 0
size data structure

660
00:33:49,260 --> 00:33:50,760
at all of these
things, I'm still

661
00:33:50,760 --> 00:33:54,510
going to have a constant
factor overhead.

662
00:33:54,510 --> 00:33:57,150
It's still going to be a
linear size data structure,

663
00:33:57,150 --> 00:33:59,950
as long as m is linear in n.

664
00:33:59,950 --> 00:34:01,700
Does that make sense?

665
00:34:01,700 --> 00:34:02,360
OK.

666
00:34:02,360 --> 00:34:05,030
So how do we pick a
good hash function?

667
00:34:05,030 --> 00:34:08,719
I already told you
that any fixed hash

668
00:34:08,719 --> 00:34:12,560
function I give you is going
to experience collisions.

669
00:34:12,560 --> 00:34:20,190
And if u is large, then there's
the possibility that I--

670
00:34:20,190 --> 00:34:23,800
for some input, all of
the things in my set

671
00:34:23,800 --> 00:34:27,790
go directly to the same
hashed index value.

672
00:34:27,790 --> 00:34:29,040
So that ain't great.

673
00:34:29,040 --> 00:34:30,510
Let's ignore that for a second.

674
00:34:30,510 --> 00:34:33,900
What's the easiest
way to get down

675
00:34:33,900 --> 00:34:36,677
from this large space of
keys down to a small one?

676
00:34:36,677 --> 00:34:38,260
What's the easiest
thing you could do?

677
00:34:38,260 --> 00:34:38,489
Yeah?

678
00:34:38,489 --> 00:34:38,969
AUDIENCE: [INAUDIBLE]

679
00:34:38,969 --> 00:34:40,322
JASON KU: Modulus-- great.

680
00:34:40,322 --> 00:34:41,780
This is called the
division method.

681
00:34:51,239 --> 00:34:54,000
And what its function
is is essentially,

682
00:34:54,000 --> 00:34:56,340
it's going to take
a key and it's

683
00:34:56,340 --> 00:35:04,500
going to set h of K equal
to K mod m.

684
00:35:04,500 --> 00:35:06,570
I'm going to take
something of a large space,

685
00:35:06,570 --> 00:35:09,690
and I'm going to mod it so
that it just wraps around--

686
00:35:13,520 --> 00:35:15,110
perfectly valid thing to do.

687
00:35:15,110 --> 00:35:18,610
It satisfies what we're
doing in a hash table.
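
As a small illustration of the division method, here is a sketch in Python. The function name `h_division` and the sample keys are assumptions chosen for the example, not something from the lecture.

```python
def h_division(k, m):
    """Division method: wrap the key around a table of length m."""
    return k % m


m = 10
# Ten consecutive keys land in ten different slots.
spread = [h_division(k, m) for k in range(20, 30)]
# Ten keys that are all multiples of m land in the same slot.
clumped = [h_division(k, m) for k in range(0, 100, 10)]
```

The second list shows how a fixed hash function can be defeated by the choice of keys: every key in `clumped` hashes to slot 0.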

688
00:35:18,610 --> 00:35:24,965
And if my keys are completely
uniformly distributed--

689
00:35:24,965 --> 00:35:28,830
if, when I use my hash
function, all of the keys

690
00:35:28,830 --> 00:35:35,130
here are uniformly distributed
over this larger space, then

691
00:35:35,130 --> 00:35:38,250
actually, this isn't
such a bad thing.

692
00:35:38,250 --> 00:35:42,420
But that's imposing some kind
of distribution requirements

693
00:35:42,420 --> 00:35:43,860
on the type of
inputs I'm allowed

694
00:35:43,860 --> 00:35:45,402
to use with this
hash function for it

695
00:35:45,402 --> 00:35:48,030
to have good performance.

696
00:35:48,030 --> 00:35:53,340
But this plus a little bit
of extra mixing and bit

697
00:35:53,340 --> 00:35:58,230
manipulation is essentially
what Python does.

698
00:35:58,230 --> 00:36:00,720
Essentially, all it
does is jumbles up

699
00:36:00,720 --> 00:36:05,760
that key for some fixed
amount of jumbling,

700
00:36:05,760 --> 00:36:11,590
and then mods it by m,
and sticks it there.

701
00:36:11,590 --> 00:36:15,570
It's hard coded in the Python
library, what this hash

702
00:36:15,570 --> 00:36:21,480
function is, and so there
exist some sequences of inserts

703
00:36:21,480 --> 00:36:24,540
into a hash table
in Python which

704
00:36:24,540 --> 00:36:26,850
will be really bad in
terms of performance,

705
00:36:26,850 --> 00:36:30,030
because these chain lengths, or
the number of collisions

706
00:36:30,030 --> 00:36:35,030
that I'll get at a single
hash is going to be large.

707
00:36:35,030 --> 00:36:36,830
But they do that
for other reasons.

708
00:36:36,830 --> 00:36:38,810
They want a deterministic
hash function.

709
00:36:38,810 --> 00:36:41,750
They want something that
if I run the program again--

710
00:36:41,750 --> 00:36:45,080
it's going to do the
same thing underneath.

711
00:36:45,080 --> 00:36:47,360
But sometimes Python
gets it wrong.

712
00:36:47,360 --> 00:36:50,150
But if your data
that you're storing

713
00:36:50,150 --> 00:36:53,210
is sufficiently uncorrelated
to the hash function

714
00:36:53,210 --> 00:36:54,380
that they've chosen--

715
00:36:54,380 --> 00:36:56,270
which, usually, it is--

716
00:36:56,270 --> 00:36:58,490
this is a pretty
good performance.

717
00:36:58,490 --> 00:37:03,090
But this is not a
practical class.

718
00:37:03,090 --> 00:37:05,890
Well, it is a practical
class, but one of the things

719
00:37:05,890 --> 00:37:07,540
that we are--

720
00:37:07,540 --> 00:37:09,280
that's the emphasis
of this class

721
00:37:09,280 --> 00:37:13,690
is making sure we can prove that
this is good in theory as well.

722
00:37:13,690 --> 00:37:17,590
I don't want to know that
sometimes this will be good.

723
00:37:17,590 --> 00:37:21,400
I really want to know
that, if I choose--

724
00:37:21,400 --> 00:37:26,830
if I make this data structure
and I put some inputs on it,

725
00:37:26,830 --> 00:37:30,790
I want a running time that
is independent of what

726
00:37:30,790 --> 00:37:34,300
inputs I decided to use,
independent of what keys

727
00:37:34,300 --> 00:37:35,993
I decided to store.

728
00:37:35,993 --> 00:37:36,910
Does that make sense?

729
00:37:40,990 --> 00:37:44,250
But it's impossible for me to
pick a fixed hash function that

730
00:37:44,250 --> 00:37:45,750
will achieve this,
because I just

731
00:37:45,750 --> 00:37:48,420
told you that, if u is large--

732
00:37:48,420 --> 00:37:52,380
this is u-- if u is
large, then there

733
00:37:52,380 --> 00:37:55,185
exist inputs that map
everything to one place.

734
00:37:57,930 --> 00:37:58,998
I'm screwed, right?

735
00:37:58,998 --> 00:38:00,540
There's no way to
solve this problem.

736
00:38:03,120 --> 00:38:06,180
That's true if I want a
deterministic hash function--

737
00:38:06,180 --> 00:38:07,710
I want the thing
to be repeatable,

738
00:38:07,710 --> 00:38:09,870
to do the same thing
over and over again

739
00:38:09,870 --> 00:38:12,430
for any set of inputs.

740
00:38:12,430 --> 00:38:14,680
What can I do instead?

741
00:38:14,680 --> 00:38:18,480
Weaken my notion of what
constant time is to do better--

742
00:38:22,260 --> 00:38:24,570
OK, use a non-deterministic--

743
00:38:24,570 --> 00:38:26,610
what does
non-deterministic mean?

744
00:38:26,610 --> 00:38:31,560
It means don't choose a
hash function up front--

745
00:38:31,560 --> 00:38:34,230
choose one randomly later.

746
00:38:34,230 --> 00:38:35,880
So have the user--

747
00:38:35,880 --> 00:38:38,370
they pick whatever inputs
they're going to do,

748
00:38:38,370 --> 00:38:40,660
and then I'm going to pick
a hash function randomly.

749
00:38:40,660 --> 00:38:42,910
They don't know which hash
function I'm going to pick,

750
00:38:42,910 --> 00:38:45,930
so it's hard for them to
give me an input that's bad.

751
00:38:49,020 --> 00:38:52,530
I'm going to choose a
random hash function.

752
00:38:52,530 --> 00:38:55,620
Can I choose a hash
function from the space

753
00:38:55,620 --> 00:38:58,160
of all hash functions?

754
00:38:58,160 --> 00:39:00,490
What is the space of all
hash functions of this form?

755
00:39:03,380 --> 00:39:06,865
For every one of these values,
I give a value in here.

756
00:39:10,640 --> 00:39:12,860
For each one of these, an
independent random number

757
00:39:12,860 --> 00:39:15,920
between this range, how many
such hash functions are there?

758
00:39:19,140 --> 00:39:25,050
m to the power of this number--
that's a lot of things.

759
00:39:25,050 --> 00:39:26,780
So I can't do that.

760
00:39:26,780 --> 00:39:29,070
What I can do is fix a
family of hash functions

761
00:39:29,070 --> 00:39:32,390
where, if I choose
one from-- randomly,

762
00:39:32,390 --> 00:39:33,780
I get good performance.

763
00:39:33,780 --> 00:39:36,530
And so the hash function
I'm going to use,

764
00:39:36,530 --> 00:39:39,620
and we're going to spend
the rest of the time on,

765
00:39:39,620 --> 00:39:43,490
is what I call a
universal hash function.

766
00:39:43,490 --> 00:39:47,390
It satisfies what we call
a universal hash property--

767
00:39:47,390 --> 00:39:53,930
so universal hash function.

768
00:39:53,930 --> 00:39:56,960
And this is a little bit
of a weird nomenclature,

769
00:39:56,960 --> 00:40:01,310
because I'm defining this to you
as the universal hash function,

770
00:40:01,310 --> 00:40:05,960
but actually, universal
is a descriptor.

771
00:40:05,960 --> 00:40:09,720
There exist many
universal hash functions.

772
00:40:09,720 --> 00:40:12,000
This just happens to be
an example of one of them.

773
00:40:12,000 --> 00:40:12,500
OK?

774
00:40:23,420 --> 00:40:27,920
So here's the hash function--

775
00:40:27,920 --> 00:40:32,640
doesn't look actually
all that different.

776
00:40:32,640 --> 00:40:36,910
Goodness gracious-- how
many parentheses are there--

777
00:40:36,910 --> 00:40:41,470
mod p, mod m.

778
00:40:41,470 --> 00:40:41,970
OK.

779
00:40:41,970 --> 00:40:44,850
So it's kind of doing the same
thing as what's happening up

780
00:40:44,850 --> 00:40:52,640
here, but before modding by m,
I'm multiplying it by a number,

781
00:40:52,640 --> 00:40:55,780
I'm adding a number, I'm
taking it mod another number,

782
00:40:55,780 --> 00:40:57,640
and then I'm modding by m.

783
00:40:57,640 --> 00:40:58,790
This is a little weird.

784
00:40:58,790 --> 00:41:02,020
And not only that-- this is
still a fixed hash function.

785
00:41:02,020 --> 00:41:03,170
I don't want that.

786
00:41:03,170 --> 00:41:10,480
I want to generalize this to
be a family of hash functions,

787
00:41:10,480 --> 00:41:21,160
which are this h sub ab of K for
some random choice of a,

788
00:41:21,160 --> 00:41:26,765
b in this larger range.

789
00:41:29,790 --> 00:41:34,590
All right, this is a
lot of notation here.

790
00:41:34,590 --> 00:41:40,350
Essentially what this is
saying is, I have a hash family.

791
00:41:40,350 --> 00:41:43,530
It's parameterized by the
length of my hash table

792
00:41:43,530 --> 00:41:48,840
and some fixed large random
prime that's bigger than u.

793
00:41:48,840 --> 00:41:52,410
I'm going to pick some
large prime number,

794
00:41:52,410 --> 00:41:55,500
and that's going to be fixed
when I make the hash table.

795
00:41:58,940 --> 00:42:02,660
And then, when I
instantiate the hash table,

796
00:42:02,660 --> 00:42:06,320
I'm going to choose
randomly one of these things

797
00:42:06,320 --> 00:42:10,490
by choosing a random a and
a random b from this range.

798
00:42:10,490 --> 00:42:12,284
Does that make sense?

799
00:42:12,284 --> 00:42:16,450
AUDIENCE: [INAUDIBLE]

800
00:42:16,450 --> 00:42:19,590
JASON KU: This is
a not equal to 0.

801
00:42:19,590 --> 00:42:22,568
If I had 0 here, I lose
the key information,

802
00:42:22,568 --> 00:42:23,360
and that's no good.

803
00:42:26,940 --> 00:42:27,830
Does this make sense?

804
00:42:27,830 --> 00:42:30,300
So what this is doing
is multiplying this key

805
00:42:30,300 --> 00:42:34,080
by some random number,
adding some random number,

806
00:42:34,080 --> 00:42:37,790
modding by this prime,
and then modding

807
00:42:37,790 --> 00:42:39,680
by the size of my thing.

808
00:42:39,680 --> 00:42:41,810
So it's doing a
bunch of jumbling,

809
00:42:41,810 --> 00:42:43,610
and there's some
randomness involved here.

810
00:42:43,610 --> 00:42:46,250
I'm choosing the hash
function by choosing an a,

811
00:42:46,250 --> 00:42:47,670
b randomly from this thing.

812
00:42:47,670 --> 00:42:53,170
So when I start
up my program, I'm

813
00:42:53,170 --> 00:42:56,320
going to instantiate this
thing with some random a and b,

814
00:42:56,320 --> 00:42:58,340
not deterministically.

815
00:42:58,340 --> 00:43:01,930
The user, when they're
using this thing,

816
00:43:01,930 --> 00:43:04,420
doesn't know which
a and b I picked,

817
00:43:04,420 --> 00:43:07,930
so it's really hard for them
to give me a bad example.
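
A minimal sketch of drawing one function from this family might look like the following. The helper name `make_hash` is an assumption, and so is the exact range for a and b: the board is not part of the transcript, so the code assumes a is drawn from 1 to p - 1 (the "a not equal to 0" condition) and b from 0 to p - 1. The prime 2**31 - 1 is just a convenient known prime, assumed here to be larger than the key universe u.

```python
import random


def make_hash(m, p, rng=random):
    """Pick h(k) = ((a*k + b) mod p) mod m with random a and b."""
    a = rng.randrange(1, p)   # a != 0, so the key information is kept
    b = rng.randrange(0, p)

    def h(k):
        return ((a * k + b) % p) % m

    return h


p = 2**31 - 1                 # a prime, assumed to exceed u
h = make_hash(m=16, p=p)      # chosen once, when the table is created
```

Once a and b are chosen the function is fixed, so the same key always hashes to the same slot for the lifetime of the table. Only the choice of a and b changes from one run to the next.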

818
00:43:07,930 --> 00:43:11,020
And this universal
hash function--

819
00:43:11,020 --> 00:43:13,870
this universal hash family,
shall we say-- really,

820
00:43:13,870 --> 00:43:17,050
this is a family of functions,
and I'm choosing one randomly

821
00:43:17,050 --> 00:43:20,130
within that family--

822
00:43:20,130 --> 00:43:21,120
is universal.

823
00:43:21,120 --> 00:43:26,910
And universality says that--

824
00:43:26,910 --> 00:43:30,420
what is the property
of universality?

825
00:43:30,420 --> 00:43:34,830
It means that the probability,
by choosing a hash function

826
00:43:34,830 --> 00:43:43,530
from this hash family,
that a certain key collides

827
00:43:43,530 --> 00:43:52,500
with another key is less than
or equal to 1/m for all--

828
00:43:52,500 --> 00:43:57,870
any different two
keys in my universe.

829
00:44:02,145 --> 00:44:03,020
Does that make sense?

830
00:44:05,730 --> 00:44:10,170
Basically, this thing has the
property that, if I randomly--

831
00:44:10,170 --> 00:44:16,900
for any two keys that I
pick in my universe space,

832
00:44:16,900 --> 00:44:19,180
if I randomly choose
a hash function,

833
00:44:19,180 --> 00:44:22,030
the probability that
these things collide

834
00:44:22,030 --> 00:44:23,830
is less than 1/m.

835
00:44:23,830 --> 00:44:25,000
Why is that good?

836
00:44:25,000 --> 00:44:26,830
This is, in some
sense, a measure

837
00:44:26,830 --> 00:44:30,250
of how well distributed
these things are.

838
00:44:30,250 --> 00:44:35,560
I want these things to
collide with 1/m probability

839
00:44:35,560 --> 00:44:39,100
so that these things
don't collide very--

840
00:44:39,100 --> 00:44:41,470
it's not very likely for
these things to collide.

841
00:44:41,470 --> 00:44:43,220
Does that make sense?
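
One way to get a feel for this property is to fix two different keys, draw many hash functions from the family, and count how often the two keys collide. The sketch below does that. The function name `collision_rate`, the two sample keys, the seed, and the ranges for a and b are assumptions made for the example, and the result is an empirical estimate, not a proof.

```python
import random


def collision_rate(k1, k2, m, p, trials, seed=0):
    """Estimate Pr[h(k1) == h(k2)] over random choices of a and b."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        a = rng.randrange(1, p)        # a != 0
        b = rng.randrange(0, p)
        if ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m:
            hits += 1
    return hits / trials


# Two arbitrary distinct keys, a table of length 10, and a prime 2**31 - 1.
rate = collision_rate(12, 345678, m=10, p=2**31 - 1, trials=20000)
```

With m = 10 the universality bound is 1/m = 0.1, so the estimate should come out near or below that value, up to sampling noise.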

842
00:44:43,220 --> 00:44:46,990
So we won't prove that
this hash family satisfies

843
00:44:46,990 --> 00:44:48,370
this universality property.

844
00:44:48,370 --> 00:44:50,080
You'll do that in 046.

845
00:44:50,080 --> 00:44:54,460
But we can use this
result to show that,

846
00:44:54,460 --> 00:44:58,840
if we use a universal--
this universal hash family,

847
00:44:58,840 --> 00:45:01,510
that the length of our change--

848
00:45:01,510 --> 00:45:06,450
chains is expected to
be constant length.

849
00:45:06,450 --> 00:45:10,050
So we're going to use this
property to prove that.

850
00:45:10,050 --> 00:45:11,310
How do we prove that?

851
00:45:11,310 --> 00:45:15,180
We're going to do a
little probability.

852
00:45:15,180 --> 00:45:16,710
So how are we going
to prove that?

853
00:45:16,710 --> 00:45:20,040
I'm going to define a random
variable, an indicator

854
00:45:20,040 --> 00:45:20,760
random variable.

855
00:45:20,760 --> 00:45:23,790
Does anyone remember what an
indicator random variable is?

856
00:45:23,790 --> 00:45:28,350
Yeah, it's a variable that,
with some amount of probability,

857
00:45:28,350 --> 00:45:33,230
is 1, and 1 minus
that probability is 0.

858
00:45:33,230 --> 00:45:35,120
So I'm going to
define this indicator

859
00:45:35,120 --> 00:45:44,310
random variable xij is a random
variable over my choice--

860
00:45:44,310 --> 00:45:50,610
over choice of a hash
function in my hash family.

861
00:45:50,610 --> 00:45:52,120
And what does this mean?

862
00:45:52,120 --> 00:46:04,750
It means xij equals 1,
if h of Ki equals h of Kj--

863
00:46:04,750 --> 00:46:09,850
these things collide--
and 0 otherwise.

864
00:46:13,410 --> 00:46:18,380
So I'm choosing randomly
over this hash family.

865
00:46:18,380 --> 00:46:22,520
If, for two keys--

866
00:46:22,520 --> 00:46:24,200
key i and key j--

867
00:46:24,200 --> 00:46:27,650
if these things collide,
that's going to be 1.

868
00:46:27,650 --> 00:46:29,580
If they don't, then it's 0.

869
00:46:29,580 --> 00:46:30,530
OK?

870
00:46:30,530 --> 00:46:34,070
Then, how can we write
a formula for the length

871
00:46:34,070 --> 00:46:37,490
of a chain in this model?

872
00:46:37,490 --> 00:46:39,335
So the size of a chain--

873
00:46:43,440 --> 00:46:46,470
or let's put it here--

874
00:46:46,470 --> 00:46:55,350
the size of the chain at i--

875
00:46:55,350 --> 00:46:58,360
at i in my hash table--

876
00:46:58,360 --> 00:47:00,310
is going to equal--

877
00:47:00,310 --> 00:47:03,010
I'm going to call that
the random variable xi--

878
00:47:03,010 --> 00:47:07,352
that's going to equal the
sum over j equals 0 to--

879
00:47:10,000 --> 00:47:17,140
what is it-- over, I think,
u minus 1 of summation--

880
00:47:17,140 --> 00:47:20,960
or sorry-- of xij.

881
00:47:20,960 --> 00:47:33,410
So basically, if I
fix this location i,

882
00:47:33,410 --> 00:47:35,270
this is where this key goes.

883
00:47:38,490 --> 00:47:38,990
Sorry.

884
00:47:38,990 --> 00:47:44,480
This is the size of
chain at h of Ki.

885
00:47:44,480 --> 00:47:45,230
Sorry.

886
00:47:45,230 --> 00:47:49,310
So I look at wherever
Ki is hashed,

887
00:47:49,310 --> 00:47:52,010
and I see how many
things collide with it.

888
00:47:52,010 --> 00:47:55,070
I'm just summing over
all of these things,

889
00:47:55,070 --> 00:47:58,750
because this is 1 if there's a
collision and 0 if there's not.

890
00:47:58,750 --> 00:48:00,490
Does that make sense?

891
00:48:00,490 --> 00:48:04,470
So this is the size of the chain
at the index location mapped

892
00:48:04,470 --> 00:48:06,220
to by Ki.

893
00:48:09,940 --> 00:48:13,030
So here's where your
probability comes in.

894
00:48:13,030 --> 00:48:15,160
What's the expected
value of this chain

895
00:48:15,160 --> 00:48:18,610
length over my random choice?

896
00:48:18,610 --> 00:48:22,300
Expected value of
choosing a hash function

897
00:48:22,300 --> 00:48:25,810
from this universal hash
family of this chain length--

898
00:48:29,230 --> 00:48:31,090
I can put in my definition here.

899
00:48:31,090 --> 00:48:38,330
That's the expected value of
the summation over j of xij.

900
00:48:45,690 --> 00:48:49,670
What do I know about
expectations and summations?

901
00:48:53,220 --> 00:48:56,952
If these variables are
independent from each other--

902
00:48:56,952 --> 00:48:58,840
AUDIENCE: [INAUDIBLE]

903
00:48:58,840 --> 00:49:00,544
JASON KU: Say what?

904
00:49:00,544 --> 00:49:02,710
AUDIENCE: [INAUDIBLE]

905
00:49:02,710 --> 00:49:05,500
JASON KU: Linearity
of expectation--

906
00:49:05,500 --> 00:49:08,710
basically, the expectation of
a sum of these random

907
00:49:08,710 --> 00:49:10,420
variables is the
same as the summation

908
00:49:10,420 --> 00:49:12,520
of their expectations.

909
00:49:12,520 --> 00:49:14,710
So this is equal
to the summation

910
00:49:14,710 --> 00:49:18,655
over j of the expectations
of these individual ones.

911
00:49:26,870 --> 00:49:32,480
One of these j's
is the same as i.

912
00:49:32,480 --> 00:49:37,400
j loops over all of the
things from 0 to u minus 1.

913
00:49:37,400 --> 00:49:47,520
One of them is i, so when j is i,
what is the expected value

914
00:49:47,520 --> 00:49:49,440
that they collide?

915
00:49:49,440 --> 00:49:52,800
1-- so I'm going
to refactor this

916
00:49:52,800 --> 00:49:59,460
as being this, where j
does not equal i, plus 1.

917
00:49:59,460 --> 00:50:00,780
Are people OK with that?

918
00:50:00,780 --> 00:50:04,200
Because if i equals--

919
00:50:04,200 --> 00:50:08,040
if j and i are equal,
they definitely collide.

920
00:50:08,040 --> 00:50:10,410
They're the same key.

921
00:50:10,410 --> 00:50:13,650
So I'm expected to have
one guy there, which

922
00:50:13,650 --> 00:50:16,680
was the original key, Ki.

923
00:50:16,680 --> 00:50:22,920
But otherwise, we can use
this universal property

924
00:50:22,920 --> 00:50:27,420
that says, if they're not
equal and they collide--

925
00:50:27,420 --> 00:50:30,330
which is exactly this case--

926
00:50:30,330 --> 00:50:35,340
the probability that
that happens is 1/m.

927
00:50:35,340 --> 00:50:38,340
And since it's an
indicator random variable,

928
00:50:38,340 --> 00:50:41,370
the expectation is
the outcomes

929
00:50:41,370 --> 00:50:45,300
times their probabilities--
so 1 times that probability

930
00:50:45,300 --> 00:50:51,060
plus 0 times 1 minus that
probability, which is just 1/m.

931
00:50:51,060 --> 00:50:58,960
So now we get the
summation of 1/m for j

932
00:50:58,960 --> 00:51:02,395
not equal to i plus 1.

933
00:51:08,130 --> 00:51:10,590
Oh, and this-- sorry.

934
00:51:10,590 --> 00:51:11,970
I did this wrong.

935
00:51:11,970 --> 00:51:12,810
This isn't u.

936
00:51:12,810 --> 00:51:13,980
This is n.

937
00:51:13,980 --> 00:51:17,760
We're storing n keys.

938
00:51:17,760 --> 00:51:20,490
OK, so now I'm looping over j--

939
00:51:20,490 --> 00:51:22,240
this over all of those things.

940
00:51:22,240 --> 00:51:23,430
How many things are there?

941
00:51:23,430 --> 00:51:26,210
n minus 1 things, right?

942
00:51:26,210 --> 00:51:32,720
So this should equal 1
plus (n minus 1) over m.

943
00:51:32,720 --> 00:51:35,900
So that's what
universality gives us.
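
Written out, the board derivation looks roughly like this. The sum runs over the n stored keys (the corrected range, not u), and the inequality reflects the usual statement of universality, a collision probability of at most 1/m for distinct keys. That is an assumption here, since the spoken version treats it as exactly 1/m.

```latex
\begin{align*}
\mathbb{E}_{h \in \mathcal{H}}\left[X_i\right]
  &= \mathbb{E}_{h \in \mathcal{H}}\Big[\sum_{j=0}^{n-1} X_{ij}\Big]
   = \sum_{j=0}^{n-1} \mathbb{E}_{h \in \mathcal{H}}\left[X_{ij}\right]
   && \text{(linearity of expectation)} \\
  &= 1 + \sum_{j \neq i} \mathbb{E}_{h \in \mathcal{H}}\left[X_{ij}\right]
   && \text{(the } j = i \text{ term always collides)} \\
  &\le 1 + \sum_{j \neq i} \frac{1}{m}
   = 1 + \frac{n-1}{m}
   && \text{(universality)}
\end{align*}
```

Linearity of expectation holds whether or not the xij are independent, so no independence assumption is needed for the second equality.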

944
00:51:35,900 --> 00:51:41,980
So as long as we choose
m to be larger than n,

945
00:51:41,980 --> 00:51:44,980
or at least linear
in n, then we're

946
00:51:44,980 --> 00:51:49,720
expected to have our
chain lengths be constant,

947
00:51:49,720 --> 00:51:54,900
because this thing becomes a
constant if m is at least order

948
00:51:54,900 --> 00:51:55,400
n.
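
One way to sanity-check this bound is a small simulation. The sketch below assumes the hash family is h(k) = ((a*k + b) mod p) mod m with a and b chosen at random and p a prime larger than every key. The prime, the key range, the trial count, and the names `random_hash` and `average_chain_length` are all illustrative choices, not something fixed by the lecture.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1, assumed larger than every key


def random_hash(m, rng):
    """Pick one function from the assumed family h(k) = ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)   # a is nonzero
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def average_chain_length(keys, m, trials, rng):
    """Estimate the expected size of the chain that keys[0] lands in."""
    k_i = keys[0]
    total = 0
    for _ in range(trials):
        h = random_hash(m, rng)          # fresh random choice of hash function
        slot = h(k_i)
        # Sum of the indicators xij: count every key (including k_i) in that slot.
        total += sum(1 for k in keys if h(k) == slot)
    return total / trials


rng = random.Random(6006)
n, m = 200, 200
keys = rng.sample(range(10**6), n)       # n distinct keys
estimate = average_chain_length(keys, m, 2000, rng)
bound = 1 + (n - 1) / m
print(f"estimated chain length: {estimate:.3f}, bound 1 + (n-1)/m: {bound:.3f}")
```

With m equal to n the bound is just under 2, and the estimate should land at or slightly below it, up to sampling noise.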

949
00:51:55,400 --> 00:51:57,750
Does that make sense?

950
00:51:57,750 --> 00:51:58,610
OK.

951
00:51:58,610 --> 00:52:00,360
The last thing I'm
going to leave you with

952
00:52:00,360 --> 00:52:02,400
is, how do we make
this thing dynamic?

953
00:52:02,400 --> 00:52:05,400
If we're growing
the number of things

954
00:52:05,400 --> 00:52:07,590
we're storing in
this thing, it's

955
00:52:07,590 --> 00:52:10,920
possible that, as we
grow n for a fixed m,

956
00:52:10,920 --> 00:52:13,140
this thing will stop being--

957
00:52:13,140 --> 00:52:15,990
m will stop being
linear in n, right?

958
00:52:15,990 --> 00:52:20,040
Well, then all we have to
do is, if we get too far,

959
00:52:20,040 --> 00:52:22,620
we rebuild the entire thing--

960
00:52:22,620 --> 00:52:24,540
the entire hash
table with the new m,

961
00:52:24,540 --> 00:52:27,330
just like we did
with a dynamic array.

962
00:52:27,330 --> 00:52:28,830
And you can prove--

963
00:52:28,830 --> 00:52:31,260
we're not going to
do that here, but you

964
00:52:31,260 --> 00:52:35,970
can prove that you won't do that
operation too often, if you're

965
00:52:35,970 --> 00:52:37,660
resizing in the right way.

966
00:52:37,660 --> 00:52:40,020
And so you just
rebuild completely

967
00:52:40,020 --> 00:52:42,210
after a certain
number of operations.
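
A minimal sketch of that rebuild idea, assuming chaining with Python lists, the same ((a*k + b) mod p) mod m family, and table doubling as the growth policy (by analogy with the dynamic array). The class name `ChainedHashSet` and its methods are hypothetical, and shrinking on deletion is left out.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; keys are assumed to be integers in [0, P)


class ChainedHashSet:
    def __init__(self, m=8, seed=None):
        self._rng = random.Random(seed)
        self.n = 0
        self._build(m, [])

    def _build(self, m, keys):
        """Rebuild the whole table with m slots and a freshly chosen hash function."""
        self.m = m
        self.a = self._rng.randrange(1, P)
        self.b = self._rng.randrange(0, P)
        self.chains = [[] for _ in range(m)]
        for k in keys:
            self.chains[self._hash(k)].append(k)

    def _hash(self, k):
        return ((self.a * k + self.b) % P) % self.m

    def find(self, k):
        return k in self.chains[self._hash(k)]

    def insert(self, k):
        if self.find(k):
            return
        if self.n + 1 > self.m:
            # m would stop being at least n: rebuild with double the slots.
            keys = [x for chain in self.chains for x in chain]
            self._build(2 * self.m, keys)
        self.chains[self._hash(k)].append(k)
        self.n += 1


table = ChainedHashSet(m=8, seed=1)
for key in range(0, 1000, 7):
    table.insert(key)
print(table.n, table.m)
```

Because each rebuild doubles m, a rebuild that costs order n only happens after order n inserts, which is the sense in which the operation is not done too often.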

968
00:52:42,210 --> 00:52:44,010
OK, so that's hashing.

969
00:52:44,010 --> 00:52:45,510
Next week, we're
going to be talking

970
00:52:45,510 --> 00:52:48,890
about doing a faster sort.