1
00:00:09,000 --> 00:00:10,000
Hashing.
2
00:00:15,000 --> 00:00:19,000
Today we're going to do some
amazing stuff with hashing.
3
00:00:19,000 --> 00:00:21,000
And, really,
this is such neat stuff,
4
00:00:21,000 --> 00:00:24,000
it's amazing.
We're going to start by
5
00:00:24,000 --> 00:00:28,000
addressing a fundamental
weakness of hashing.
6
00:00:34,000 --> 00:00:37,000
And that is that for any choice
of hash function
7
00:00:49,000 --> 00:01:04,000
There exists a bad set of keys
that all hash to the same slot.
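A small sketch of the weakness just stated, assuming a fixed division-method hash, key mod m. The names `fixed_hash` and `bad_keys` and the table size 13 are illustrative choices, not from the lecture. Anyone who knows m can produce keys that all land in one slot.

```python
def fixed_hash(key: int, m: int) -> int:
    """A deterministic hash: the division method, key mod m."""
    return key % m


def bad_keys(m: int, count: int, slot: int = 0) -> list:
    """Keys an adversary who knows m could pick: slot, slot + m, slot + 2m, ..."""
    return [slot + i * m for i in range(count)]


m = 13
keys = bad_keys(m, count=5, slot=4)
slots = {fixed_hash(k, m) for k in keys}
print(keys, slots)  # every key lands in slot 4
```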
8
00:01:09,000 --> 00:01:11,000
OK.
So you pick a hash function.
9
00:01:11,000 --> 00:01:15,000
We looked at some that seem to
work well in practice,
10
00:01:15,000 --> 00:01:18,000
that are easy to put into your
code.
11
00:01:18,000 --> 00:01:23,000
But whichever one you pick,
there's always some bad set of
12
00:01:23,000 --> 00:01:25,000
keys.
So you can imagine,
13
00:01:25,000 --> 00:01:30,000
just to drive this point home a
little bit.
14
00:01:30,000 --> 00:01:35,000
Imagine that you're building a
compiler for a customer and you
15
00:01:35,000 --> 00:01:40,000
have a symbol table in your
compiler and one of the things
16
00:01:40,000 --> 00:01:46,000
that the customer is demanding
is that compilations go fast.
17
00:01:46,000 --> 00:01:50,000
They don't want to sit around
waiting for compilations.
18
00:01:50,000 --> 00:01:56,000
And you have a competitor who's
also building a compiler and
19
00:01:56,000 --> 00:02:01,000
they're going to test the
compiler, both of your compilers
20
00:02:01,000 --> 00:02:07,000
and sort of have a run-off.
And one of the things in the
21
00:02:07,000 --> 00:02:12,000
test that they're going to allow
you to do is not only will the
22
00:02:12,000 --> 00:02:16,000
customer run his own benchmarks,
but he'll let you make up
23
00:02:16,000 --> 00:02:20,000
benchmarks for the other
program, for your competitor.
24
00:02:20,000 --> 00:02:24,000
And your competitor gets to
make up benchmarks for you.
25
00:02:24,000 --> 00:02:28,000
So and not only that,
but you're actually sharing
26
00:02:28,000 --> 00:02:32,000
code.
So you get to look at what the
27
00:02:32,000 --> 00:02:37,000
competitor is actually doing and
what hash function they're
28
00:02:37,000 --> 00:02:40,000
actually using.
So it's pretty clear that in
29
00:02:40,000 --> 00:02:44,000
this circumstance,
you have an adversary who is
30
00:02:44,000 --> 00:02:49,000
going to look at whatever hash
function you have and figure out
31
00:02:49,000 --> 00:02:53,000
OK, what's a set of variable
names and so forth that are
32
00:02:53,000 --> 00:02:58,000
going to all hash to the same
slot so that essentially you're
33
00:02:58,000 --> 00:03:03,000
just chasing through a linked
list whenever it comes to
34
00:03:03,000 --> 00:03:07,000
looking something up.
Slowing down your program
35
00:03:07,000 --> 00:03:12,000
enormously compared to if in
fact they got distributed nicely
36
00:03:12,000 --> 00:03:15,000
across the hash table which is,
what after all,
37
00:03:15,000 --> 00:03:19,000
you have a hash table in there
to do in the first place.
38
00:03:19,000 --> 00:03:22,000
And so the question is,
how do you defeat this
39
00:03:22,000 --> 00:03:26,000
adversary?
And the answer is one word.
40
00:03:31,000 --> 00:03:33,000
One word.
How do you achieve?
41
00:03:33,000 --> 00:03:37,000
How do you defeat any adversary
in this class?
42
00:03:37,000 --> 00:03:38,000
Randomness.
OK.
43
00:03:38,000 --> 00:03:39,000
Randomness.
OK.
44
00:03:39,000 --> 00:03:42,000
You make it so that he can't
guess.
45
00:03:42,000 --> 00:03:47,000
And the idea is that you choose
a hash function at random.
46
00:03:47,000 --> 00:03:50,000
Independent.
So he can look at the code,
47
00:03:50,000 --> 00:03:55,000
but when it actually runs,
it's going to use a random hash
48
00:03:55,000 --> 00:04:00,000
function that he has no way of
predicting what the hash
49
00:04:00,000 --> 00:04:05,000
function is that will actually
be used.
50
00:04:05,000 --> 00:04:07,000
OK.
So that's the game and that way
51
00:04:07,000 --> 00:04:11,000
he can provide an input,
but he can't provide an input
52
00:04:11,000 --> 00:04:15,000
that's guaranteed to force you
to run slowly.
53
00:04:15,000 --> 00:04:19,000
You might get unlucky in your
choice of hash function,
54
00:04:19,000 --> 00:04:23,000
but it's not going to be
because of the adversary.
55
00:04:23,000 --> 00:04:28,000
So the idea is to choose a hash
function --
56
00:04:34,000 --> 00:04:38,000
-- at random,
independently from the keys
57
00:04:38,000 --> 00:04:42,000
that you're, that are going to
be fed to it.
58
00:04:42,000 --> 00:04:47,000
So even if your adversary can
see your code,
59
00:04:47,000 --> 00:04:53,000
he can't tell which hash
function is going to be actually
60
00:04:53,000 --> 00:04:58,000
used at run time.
Doesn't get to see the output
61
00:04:58,000 --> 00:05:04,000
of the random numbers.
And so it turns out you can
62
00:05:04,000 --> 00:05:11,000
make this scheme work and the
name of the scheme is universal
63
00:05:11,000 --> 00:05:17,000
hashing, OK, is one way of
making this scheme work.
64
00:05:22,000 --> 00:05:34,000
So let's do some math.
So let U be a universe of keys.
65
00:05:34,000 --> 00:05:41,000
And let H be a finite
collection --
66
00:05:48,000 --> 00:05:49,000
-- of hash functions --
67
00:05:56,000 --> 00:06:04,000
-- mapping U to what are going
to be the slots in our hash
68
00:06:04,000 --> 00:06:06,000
table.
OK.
69
00:06:06,000 --> 00:06:11,000
So we just have H as some
finite collection.
70
00:06:11,000 --> 00:06:15,000
We say that H is universal --
71
00:06:22,000 --> 00:06:30,000
-- if for all pairs of the
keys, distinct keys --
72
00:06:36,000 --> 00:06:41,000
-- so the keys are distinct,
the following is true.
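The condition written on the board is not captured in the captions. Reconstructed from the spoken description, and assuming m denotes the number of slots, it is presumably:

```latex
\[
  \bigl|\{\, h \in \mathcal{H} : h(x) = h(y) \,\}\bigr| \;=\; \frac{|\mathcal{H}|}{m}
  \qquad \text{for all } x, y \in U,\ x \neq y .
\]
```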
73
00:07:03,000 --> 00:07:08,000
So if the set of keys,
if for any pair of keys I pick,
74
00:07:08,000 --> 00:07:15,000
the number of hash functions
that hash those two keys to the
75
00:07:15,000 --> 00:07:21,000
same place is a one over m
fraction of the total set of
76
00:07:21,000 --> 00:07:23,000
hash functions.
So let m just,
77
00:07:23,000 --> 00:07:28,000
so to view that,
another way of viewing that is
78
00:07:28,000 --> 00:07:33,000
if h is chosen randomly --
79
00:07:39,000 --> 00:07:51,000
-- from the set of hash functions H,
the probability of collision
80
00:07:51,000 --> 00:07:58,000
between x and y is what?
81
00:08:12,000 --> 00:08:17,000
What's the probability if the
fraction of hash functions,
82
00:08:17,000 --> 00:08:22,000
OK, if the number of hash
functions is the size of H over m,
83
00:08:22,000 --> 00:08:27,000
what's the probability of a
collision between x and y?
84
00:08:27,000 --> 00:08:32,000
If I pick a hash function at
random.
85
00:08:32,000 --> 00:08:39,000
So I pick a hash function at
random, what's the odds they
86
00:08:39,000 --> 00:08:42,000
collide?
One over m.
87
00:08:42,000 --> 00:08:49,000
Now let's draw a picture for
that, help people see that
88
00:08:49,000 --> 00:08:56,000
that's in fact the case.
So imagine this is our set of
89
00:08:56,000 --> 00:09:00,000
all hash functions.
OK.
90
00:09:00,000 --> 00:09:08,000
And then if I pick a particular
x and y, let's say that this is
91
00:09:08,000 --> 00:09:16,000
the set of hash functions such
that h of x is equal to h of y.
92
00:09:16,000 --> 00:09:23,000
And so what we're saying is
that the cardinality of that set
93
00:09:23,000 --> 00:09:30,000
is one over m times the
cardinality of H.
94
00:09:30,000 --> 00:09:33,000
So if I throw a dart and pick
one hash function at random,
95
00:09:33,000 --> 00:09:37,000
the odds are one in m that the
hash function falls into this
96
00:09:37,000 --> 00:09:39,000
particular set.
And of course,
97
00:09:39,000 --> 00:09:43,000
this has to be true of every x
and y that I can pick.
98
00:09:43,000 --> 00:09:45,000
Of course, it will be a
different set,
99
00:09:45,000 --> 00:09:49,000
a different x and y will
somehow map the hash functions
100
00:09:49,000 --> 00:09:52,000
differently, but the odds that
for any x and y that I pick,
101
00:09:52,000 --> 00:09:55,000
the odds that if I have a
random hash function,
102
00:09:55,000 --> 00:10:00,000
it hashes it to the same place,
is one over m.
103
00:10:00,000 --> 00:10:03,000
Now this is a little bit hard
sometimes for people to get
104
00:10:03,000 --> 00:10:07,000
their head around because we're
used to thinking of perhaps
105
00:10:07,000 --> 00:10:09,000
picking keys at random or
something.
106
00:10:09,000 --> 00:10:11,000
OK, that's not what's going on
here.
107
00:10:11,000 --> 00:10:14,000
We're picking hash functions at
random.
108
00:10:14,000 --> 00:10:18,000
So our probability space is
defined over the hash functions,
109
00:10:18,000 --> 00:10:21,000
not over the keys.
And this has to be true now for
110
00:10:21,000 --> 00:10:24,000
any particular two keys that I
pick that are distinct.
111
00:10:24,000 --> 00:10:28,000
That the places that they hash,
this set of hash functions,
112
00:10:28,000 --> 00:10:34,000
I mean this is like a marvelous
property if you think about it.
113
00:10:34,000 --> 00:10:39,000
OK, that you can actually find
ones where no matter what two
114
00:10:39,000 --> 00:10:43,000
elements I pick,
the odds are exactly one in m
115
00:10:43,000 --> 00:10:48,000
that a random hash function from
this set is going to hash them
116
00:10:48,000 --> 00:10:51,000
to the same place.
So very neat.
117
00:10:51,000 --> 00:10:56,000
Very, very neat property and
we'll see the mathematics
118
00:10:56,000 --> 00:11:00,000
associated with this is very
cool.
119
00:11:00,000 --> 00:11:14,000
So our theorem is that if we
choose h randomly from the set
120
00:11:14,000 --> 00:11:25,000
of hash functions H,
and then we suppose we're
121
00:11:25,000 --> 00:11:37,000
hashing n keys into m slots in
Table T --
122
00:11:44,000 --> 00:11:46,000
-- then for given key x --
123
00:11:52,000 --> 00:11:56,000
-- the expected number of
collisions with x --
124
00:12:03,000 --> 00:12:12,000
-- is less than n over m.
And who remembers what we call
125
00:12:12,000 --> 00:12:16,000
n over m?
Alpha, which is the,
126
00:12:16,000 --> 00:12:22,000
what's the term that we use
there?
127
00:12:22,000 --> 00:12:30,000
Load factor.
The load factor of the table.
128
00:12:30,000 --> 00:12:36,000
OK, load factor alpha.
So the average number of keys
129
00:12:36,000 --> 00:12:42,000
per slot is the load factor of
the table.
130
00:12:42,000 --> 00:12:48,000
So we're saying,
so what is this theorem saying?
131
00:12:48,000 --> 00:12:55,000
It's saying that in fact,
if we have one of these
132
00:12:55,000 --> 00:13:02,000
universal sets of hash
functions, then things perform
133
00:13:02,000 --> 00:13:10,000
exactly the way we want them to.
Things get distributed evenly.
134
00:13:10,000 --> 00:13:15,000
The number of things that are
going to collide with any
135
00:13:15,000 --> 00:13:19,000
particular key that I pick is
going to be n over m.
136
00:13:19,000 --> 00:13:22,000
So that's a really good
property to have.
137
00:13:22,000 --> 00:13:27,000
Now I haven't shown you,
the construction of U is going,
138
00:13:27,000 --> 00:13:31,000
sorry, of the set of hash
functions H, that that
139
00:13:31,000 --> 00:13:36,000
construction will take us a
little bit of effort.
140
00:13:36,000 --> 00:13:39,000
But first I want to show you
why this is such a great
141
00:13:39,000 --> 00:13:42,000
property.
Basically it's this theorem.
142
00:13:42,000 --> 00:13:46,000
So let's prove this theorem.
So any questions about what the
143
00:13:46,000 --> 00:13:50,000
statement of the theorem is?
So we're going to go actually
144
00:13:50,000 --> 00:13:54,000
kind of fast today.
We've got a lot of good stuff
145
00:13:54,000 --> 00:13:57,000
today.
So I want to make sure people
146
00:13:57,000 --> 00:14:03,000
are onboard as we go through.
So if there are any questions,
147
00:14:03,000 --> 00:14:07,000
make sure, you know,
statement of theorem or
148
00:14:07,000 --> 00:14:13,000
whatever, best to get them out
early so that you're not
149
00:14:13,000 --> 00:14:19,000
confused later on when the going
gets a little more exciting.
150
00:14:19,000 --> 00:14:21,000
OK?
OK, good.
151
00:14:21,000 --> 00:14:26,000
So to prove this,
let's let C sub x be the random
152
00:14:26,000 --> 00:14:33,000
variable denoting the total
number of collisions --
153
00:14:38,000 --> 00:14:44,000
-- of keys in T with x.
So this is a total number and
154
00:14:44,000 --> 00:14:51,000
one of the techniques that you
use a lot in probabilistic
155
00:14:51,000 --> 00:14:57,000
analysis of randomized
algorithms is recognizing that C
156
00:14:57,000 --> 00:15:05,000
sub x is in fact a sum of
indicator random variables.
157
00:15:05,000 --> 00:15:11,000
If you can decompose things
into indicator random variables,
158
00:15:11,000 --> 00:15:17,000
the analysis goes much more
easily than if you're left with
159
00:15:17,000 --> 00:15:22,000
aggregate variables.
So here we're going to let our
160
00:15:22,000 --> 00:15:27,000
indicator random variable be
little c_xy,
161
00:15:27,000 --> 00:15:32,000
which is going to be one if h
of x equals h of y and 0
162
00:15:32,000 --> 00:15:35,000
otherwise.
163
00:15:40,000 --> 00:15:49,000
And so we can note two things.
First, what is the expectation
164
00:15:49,000 --> 00:15:52,000
of c_xy?
165
00:15:57,000 --> 00:16:00,000
OK, if I have a process which
is picking a hash function at
166
00:16:00,000 --> 00:16:04,000
random, what's the expectation
of c_xy?
167
00:16:04,000 --> 00:16:07,000
One over m.
Because that's basically this
168
00:16:07,000 --> 00:16:11,000
definition here.
Now in other words I pick a
169
00:16:11,000 --> 00:16:16,000
hash function at random,
what's the odds that the hash
170
00:16:16,000 --> 00:16:19,000
is the same?
It's one over m.
171
00:16:19,000 --> 00:16:24,000
And then the other thing is,
and the reason we pick this
172
00:16:24,000 --> 00:16:28,000
thing is that I can express
capital C sub x,
173
00:16:28,000 --> 00:16:33,000
the random variable denoting
the total number of collisions
174
00:16:33,000 --> 00:16:39,000
as being just the sum over all
the keys in the table except x
175
00:16:39,000 --> 00:16:46,000
of c_xy.
So for each one that would
176
00:16:46,000 --> 00:16:53,000
cause me a collision,
with x, I add one and if it
177
00:16:53,000 --> 00:17:00,000
wouldn't cause me a collision,
I add 0.
178
00:17:00,000 --> 00:17:06,000
And that adds up all of the
collisions that I would have in
179
00:17:06,000 --> 00:17:09,000
the table with x.
180
00:17:17,000 --> 00:17:20,000
Are there any questions so far?
Because this is the set-up.
181
00:17:20,000 --> 00:17:24,000
The set-up in most of these
things, the set-up is where most
182
00:17:24,000 --> 00:17:27,000
students make mistakes and most
practicing researchers make
183
00:17:27,000 --> 00:17:30,000
mistakes as well,
let me tell you.
184
00:17:30,000 --> 00:17:32,000
And then once you get the
set-up right,
185
00:17:32,000 --> 00:17:36,000
then working out the math is
fine, but it's often that set-up
186
00:17:36,000 --> 00:17:40,000
of how do you actually translate
the situation into the math.
187
00:17:40,000 --> 00:17:43,000
That's the hard part.
Once you get that right,
188
00:17:43,000 --> 00:17:46,000
well, then, algebra,
we can all do algebra.
189
00:17:46,000 --> 00:17:49,000
Of course, we can also all make
mistakes doing algebra,
190
00:17:49,000 --> 00:17:53,000
but at least those mistakes are
much more easy to check than the
191
00:17:53,000 --> 00:17:57,000
one that does the translation.
So I want to make sure people
192
00:17:57,000 --> 00:18:00,000
are sort of understanding of how
that's set up.
193
00:18:00,000 --> 00:18:05,000
So now we just have to use our
math skills.
194
00:18:05,000 --> 00:18:12,000
So the expectation then of the
number of collisions is the
195
00:18:12,000 --> 00:18:18,000
expectation of C sub x and
that's just the expectation of
196
00:18:18,000 --> 00:18:26,000
just plugging in the sum over y in T
minus the element x of c_xy.
197
00:18:26,000 --> 00:18:33,000
So that's just definition.
And that's equal to the sum of
198
00:18:33,000 --> 00:18:39,000
y in T minus x of the expectation
of c_xy.
199
00:18:39,000 --> 00:18:44,000
So why is that?
Yeah, that's linearity.
200
00:18:52,000 --> 00:18:56,000
Linearity of expectation,
doesn't require independence.
201
00:18:56,000 --> 00:19:00,000
It's true of all random
variables.
202
00:19:00,000 --> 00:19:07,000
And that's equal to,
and now the math gets easier.
203
00:19:07,000 --> 00:19:10,000
So what is that?
One over m.
204
00:19:10,000 --> 00:19:16,000
That makes the summation easy
to evaluate.
205
00:19:16,000 --> 00:19:22,000
That's just n minus one over m.
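The chain of equalities just spoken, written out in one place. This is a reconstruction of the board work, assuming x is one of the n keys in T, so that n minus one other keys remain:

```latex
\begin{align*}
E[C_x] &= E\Bigl[\sum_{y \in T \setminus \{x\}} c_{xy}\Bigr] \\
       &= \sum_{y \in T \setminus \{x\}} E[c_{xy}] && \text{(linearity of expectation)} \\
       &= \sum_{y \in T \setminus \{x\}} \frac{1}{m} \\
       &= \frac{n-1}{m} \;<\; \frac{n}{m} = \alpha .
\end{align*}
```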
206
00:19:30,000 --> 00:19:35,000
So fairly simple analysis and
shows you why we would love to
207
00:19:35,000 --> 00:19:41,000
have one of these sets of
universal hash functions because
208
00:19:41,000 --> 00:19:45,000
if you have them,
then they behave exactly the
209
00:19:45,000 --> 00:19:51,000
way you would want it to behave.
And you defeat your adversary
210
00:19:51,000 --> 00:19:55,000
by just picking up the hash
function at random.
211
00:19:55,000 --> 00:20:00,000
There's nothing he can do.
Or she.
212
00:20:00,000 --> 00:20:02,000
OK, any questions about that
proof?
213
00:20:02,000 --> 00:20:04,000
OK, now we get into the fun
math.
214
00:20:04,000 --> 00:20:07,000
Constructing one of these
babies.
215
00:20:07,000 --> 00:20:08,000
OK.
216
00:20:20,000 --> 00:20:23,000
This is not the only
construction.
217
00:20:23,000 --> 00:20:31,000
This is a construction of a
classic universal hash function.
218
00:20:31,000 --> 00:20:37,000
And there are other
constructions in the literature
219
00:20:37,000 --> 00:20:42,000
and I think there's one on the
practice quiz.
220
00:20:42,000 --> 00:20:47,000
So let's see.
So this one works when m is
221
00:20:47,000 --> 00:20:51,000
prime.
So it works when the set of
222
00:20:51,000 --> 00:20:57,000
slots is a prime number.
Number of slots is a prime
223
00:20:57,000 --> 00:21:05,000
number.
So the idea here is we're going
224
00:21:05,000 --> 00:21:16,000
to decompose any key k in our
universe into r plus 1 digits.
225
00:21:16,000 --> 00:21:25,000
So k, we're going to look at as
being k_0, k_1, up to
226
00:21:25,000 --> 00:21:33,000
k_r where 0 is less than or
equal to k sub i,
227
00:21:33,000 --> 00:21:41,000
is less than or equal to m
minus one.
228
00:21:41,000 --> 00:21:47,000
So the idea is in some sense
we're looking at what the
229
00:21:47,000 --> 00:21:52,000
representation would be of k
base m.
230
00:21:52,000 --> 00:21:58,000
So if it were base two,
it would be just one bit at a
231
00:21:58,000 --> 00:22:01,000
time.
These would just be the bits.
232
00:22:01,000 --> 00:22:05,000
I'm not going to do base two.
We're going to do base m in
233
00:22:05,000 --> 00:22:09,000
general and so each of these
represents one of the digits.
234
00:22:09,000 --> 00:22:13,000
And the way I've done it is
I've done low order digit first.
235
00:22:13,000 --> 00:22:16,000
It actually doesn't matter.
We're not actually going to
236
00:22:16,000 --> 00:22:20,000
care really about what the order
is, but basically we're just
237
00:22:20,000 --> 00:22:24,000
looking at busting it into a
tuple represented by each of
238
00:22:24,000 --> 00:22:27,000
those digits.
So one algorithm for computing
239
00:22:27,000 --> 00:22:31,000
this out of k is take the
remainder mod m.
240
00:22:31,000 --> 00:22:34,000
That's the low order one.
OK, take what's left.
241
00:22:34,000 --> 00:22:37,000
Take the remainder of that mod
m.
242
00:22:37,000 --> 00:22:39,000
Take whatever's left,
etc.
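The repeated-remainder procedure just described can be sketched as follows. The function name `digits_base_m` and the range check are assumptions added for illustration.

```python
def digits_base_m(k: int, m: int, r: int) -> list:
    """Return [k_0, k_1, ..., k_r], the r + 1 base-m digits of k, low order first."""
    if k < 0 or k >= m ** (r + 1):
        raise ValueError("k does not fit in r + 1 base-m digits")
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)  # the remainder mod m is the next digit
        k //= m               # "what's left" for the next round
    return digits


print(digits_base_m(100, 7, 2))  # 100 = 2*1 + 0*7 + 2*49 -> [2, 0, 2]
```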
243
00:22:39,000 --> 00:22:42,000
So you're familiar with the
conversion to a base
244
00:22:42,000 --> 00:22:46,000
representation.
That's exactly how we're
245
00:22:46,000 --> 00:22:49,000
getting this representation.
So we treat,
246
00:22:49,000 --> 00:22:53,000
this is just a question of
taking the data that we've got
247
00:22:53,000 --> 00:22:57,000
and treating it as an r plus one
base m number.
248
00:22:57,000 --> 00:23:02,000
And now we invoke our
randomized strategy.
249
00:23:02,000 --> 00:23:05,000
The randomized strategy is
going to be able to have a class
250
00:23:05,000 --> 00:23:09,000
of hash functions that's
dependent essentially on random
251
00:23:09,000 --> 00:23:11,000
numbers.
And the random numbers we're
252
00:23:11,000 --> 00:23:15,000
going to pick is we're going to
pick an a at random --
253
00:23:28,000 --> 00:23:33,000
-- which we're also going to
look at as a base m number.
254
00:23:33,000 --> 00:23:38,000
Each a_i is chosen randomly
--
255
00:23:49,000 --> 00:23:50,000
-- from --
256
00:23:55,000 --> 00:23:58,000
-- 0 to m minus one.
So one of our,
257
00:23:58,000 --> 00:24:03,000
it's a random if you will,
it's a random base m digit.
258
00:24:03,000 --> 00:24:06,000
Random base m digit.
So each one of these is picked
259
00:24:06,000 --> 00:24:09,000
at random.
And for each one we,
260
00:24:09,000 --> 00:24:13,000
possible value of a,
we're going to get a different
261
00:24:13,000 --> 00:24:16,000
hash function.
So we're going to index our
262
00:24:16,000 --> 00:24:19,000
hash functions by this random
number.
263
00:24:19,000 --> 00:24:23,000
So this is where the randomness
is going to come in.
264
00:24:23,000 --> 00:24:28,000
Everybody with me?
And here's the hash function.
265
00:24:56,000 --> 00:25:06,000
So what we do is we dot product
this vector with this vector and
266
00:25:06,000 --> 00:25:11,000
take the result,
mod m.
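A minimal sketch of this dot-product hash, assuming keys are nonnegative integers smaller than m to the r plus one. The helper names `digits_base_m` and `make_hash` are hypothetical, and the seed is there only to make the run repeatable.

```python
import random


def digits_base_m(k, m, r):
    """Low-order-first base-m digits k_0..k_r of a nonnegative key k < m**(r+1)."""
    return [(k // m ** i) % m for i in range(r + 1)]


def make_hash(m, r, rng=random):
    """Pick a = (a_0..a_r) with each a_i uniform in 0..m-1 and return (a, h_a)."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        # Dot product of the two digit vectors, reduced mod m.
        return sum(ai * ki for ai, ki in zip(a, digits_base_m(k, m, r))) % m

    return a, h


a, h = make_hash(m=7, r=2, rng=random.Random(0))  # seeded only for repeatability
print(a, h(100))
```

Against a real adversary the seed would of course not be fixed. A source such as `random.SystemRandom` may be a better fit there, though that choice is outside what the lecture covers.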
267
00:25:11,000 --> 00:25:18,000
So each digit of k of our key
gets multiplied by a random
268
00:25:18,000 --> 00:25:25,000
other digit.
We add all those up and we take
269
00:25:25,000 --> 00:25:29,000
that mod m.
So that's a dot product
270
00:25:29,000 --> 00:25:34,000
operator.
And this is what we're going to
271
00:25:34,000 --> 00:25:37,000
show is universal,
that this set of h sub a,
272
00:25:37,000 --> 00:25:39,000
where I look over that whole
set.
273
00:25:39,000 --> 00:25:44,000
So one of the things we need to
know is how big is the set of
274
00:25:44,000 --> 00:25:46,000
hash functions here.
275
00:25:59,000 --> 00:26:01,000
So how big is this set of hash
functions?
276
00:26:01,000 --> 00:26:07,000
How many different hash
functions do I have in this set?
277
00:26:24,000 --> 00:26:31,000
It's basic 6.042 material.
It's basically how many vectors
278
00:26:31,000 --> 00:26:38,000
of length r plus one where each
element of the vector is a
279
00:26:38,000 --> 00:26:45,000
number of 0 to m minus one,
has m different values.
280
00:26:45,000 --> 00:26:50,000
So how many?
m minus one to the r.
281
00:26:50,000 --> 00:26:51,000
No.
Close.
282
00:26:51,000 --> 00:26:56,000
It's up there.
It's a big number.
283
00:26:56,000 --> 00:27:01,000
m to the r plus one.
Good.
284
00:27:01,000 --> 00:27:06,000
It's m, so the size of H is
equal to m to the r plus one.
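The count can be confirmed on a tiny instance by enumerating every vector a. The values m = 3 and r = 2 are arbitrary illustrative choices.

```python
from itertools import product

m, r = 3, 2
all_a = list(product(range(m), repeat=r + 1))  # every vector (a_0, ..., a_r)
print(len(all_a), m ** (r + 1))  # both are 27
```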
285
00:27:06,000 --> 00:27:10,000
So we're going to want to
remember that.
286
00:27:10,000 --> 00:27:13,000
OK, so let's just understand
why that is.
287
00:27:13,000 --> 00:27:17,000
I have m choices for the first
value of a.
288
00:27:17,000 --> 00:27:19,000
m for the second,
etc.
289
00:27:19,000 --> 00:27:23,000
m for the r th.
And since there are r plus one
290
00:27:23,000 --> 00:27:28,000
things here, for each choice
here, I have this many same
291
00:27:28,000 --> 00:27:34,000
number of choices here,
so it's a product.
292
00:27:34,000 --> 00:27:39,000
OK, so this is the product rule
in counting.
293
00:27:39,000 --> 00:27:45,000
So if you haven't reviewed your
6.042 notes for counting,
294
00:27:45,000 --> 00:27:52,000
this is going to be a good idea
to go back and review that
295
00:27:52,000 --> 00:27:57,000
because we're doing stuff of
that nature.
296
00:27:57,000 --> 00:28:01,000
This is just the product rule.
Good.
297
00:28:01,000 --> 00:28:10,000
So then the theorem we want to
prove is that H is universal.
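Before the proof, the claim can be sanity-checked by brute force on a tiny instance. This is only a sketch with the illustrative choices m = 5 and r = 1, where keys are represented directly as digit vectors. It checks one small case and proves nothing in general.

```python
from itertools import combinations, product

m, r = 5, 1  # m must be prime for this construction
vectors = list(product(range(m), repeat=r + 1))  # serves as both the keys and the a's


def h(a, key):
    """h_a(key): dot product of a with the key's digits, mod m."""
    return sum(ai * ki for ai, ki in zip(a, key)) % m


# For each pair of distinct keys, count the a's under which they collide.
collision_counts = {
    sum(h(a, x) == h(a, y) for a in vectors)
    for x, y in combinations(vectors, 2)
}
print(collision_counts, len(vectors) // m)  # every pair collides for |H|/m = 5 choices of a
```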
298
00:28:10,000 --> 00:28:14,000
And this is going to involve a
little bit of number theory,
299
00:28:14,000 --> 00:28:19,000
so it gets kind of interesting.
And it's a non-trivial proof,
300
00:28:19,000 --> 00:28:23,000
so this is where if there's any
questions as I'm going along,
301
00:28:23,000 --> 00:28:28,000
please ask because the argument
is not as simple as other
302
00:28:28,000 --> 00:28:33,000
arguments we've seen so far.
OK, not that the ones we've seen so
303
00:28:33,000 --> 00:28:38,000
far have been simple,
but this is definitely a more
304
00:28:38,000 --> 00:28:43,000
involved mathematical argument.
So here's a proof.
305
00:28:43,000 --> 00:28:46,000
So let's let,
so we have two keys.
306
00:28:46,000 --> 00:28:50,000
What are we trying to show if
it's universal,
307
00:28:50,000 --> 00:28:55,000
that if I pick any two keys,
the number of hash functions
308
00:28:55,000 --> 00:29:01,000
for which they hash to the same
thing is the size of set of hash
309
00:29:01,000 --> 00:29:08,000
functions divided by m.
OK, so I'm going to look at two
310
00:29:08,000 --> 00:29:11,000
keys.
So let's pick two keys
311
00:29:11,000 --> 00:29:16,000
arbitrarily.
So x, and we'll decompose it
312
00:29:16,000 --> 00:29:23,000
into our base m representation
and y, y_0, y_1 --
313
00:29:33,000 --> 00:29:39,000
So these are two distinct keys.
So if these are two distinct
314
00:29:39,000 --> 00:29:45,000
keys, so they're different,
then this base representation
315
00:29:45,000 --> 00:29:50,000
has the property that they've
got to differ somewhere.
316
00:29:50,000 --> 00:29:54,000
Right?
OK, they differ in at least one
317
00:29:54,000 --> 00:29:56,000
digit.
318
00:30:08,000 --> 00:30:12,000
OK, and this is where most
people get lost because I'm
319
00:30:12,000 --> 00:30:16,000
going to make a simplification.
They could differ in any one of
320
00:30:16,000 --> 00:30:20,000
these digits.
I'm going to say they differ in
321
00:30:20,000 --> 00:30:24,000
position 0 because it doesn't
matter which one I do,
322
00:30:24,000 --> 00:30:28,000
the math is the same,
but it keeps things simple. If I
323
00:30:28,000 --> 00:30:31,000
instead said they differ in
some position i,
324
00:30:31,000 --> 00:30:35,000
I would have to be taking
summations as you'll see over
325
00:30:35,000 --> 00:30:41,000
the elements that are not i,
and that's complicated.
326
00:30:41,000 --> 00:30:44,000
If I do it in position 0,
then I can just sum for the
327
00:30:44,000 --> 00:30:46,000
rest of them.
So the math is going to be
328
00:30:46,000 --> 00:30:50,000
identical if I were to do it for
any position because it's
329
00:30:50,000 --> 00:30:52,000
symmetric.
All the digits are symmetric.
330
00:30:52,000 --> 00:30:56,000
So let's say they differ in
position 0, but the same
331
00:30:56,000 --> 00:30:59,000
argument is going to be true if
they differed in some other
332
00:30:59,000 --> 00:31:02,000
position.
So let's say,
333
00:31:02,000 --> 00:31:05,000
so we're saying without loss of
generality.
334
00:31:05,000 --> 00:31:08,000
So that's without loss of
generality.
335
00:31:08,000 --> 00:31:12,000
Position 0.
Because all the positions are
336
00:31:12,000 --> 00:31:16,000
symmetric here.
And so, now we need to ask the
337
00:31:16,000 --> 00:31:19,000
question for how many --
338
00:31:24,000 --> 00:31:30,000
-- hash functions in our
universal, purportedly universal
339
00:31:30,000 --> 00:31:34,000
set do x and y collide?
340
00:31:39,000 --> 00:31:42,000
OK, we've got to count them up.
So how often do they collide?
341
00:31:42,000 --> 00:31:46,000
This is where we're going to
pull out some heavy duty number
342
00:31:46,000 --> 00:31:48,000
theory.
So we must have,
343
00:31:48,000 --> 00:31:50,000
if they collide --
344
00:31:56,000 --> 00:32:03,000
-- that h sub a of x is equal
to h sub a of y.
345
00:32:03,000 --> 00:32:09,000
That's what it means for them
to collide.
346
00:32:09,000 --> 00:32:20,000
So that implies that the sum of
i equal 0 to r of a sub i x sub
347
00:32:20,000 --> 00:32:30,000
i is equal to the sum of i
equals 0 to r of a sub i y sub i
348
00:32:30,000 --> 00:32:35,000
mod m.
Actually this is congruent mod
349
00:32:35,000 --> 00:32:38,000
m.
So congruence for those people
350
00:32:38,000 --> 00:32:43,000
who haven't seen much number
theory, is basically the way of
351
00:32:43,000 --> 00:32:48,000
essentially, rather than having
to say mod everywhere in here
352
00:32:48,000 --> 00:32:52,000
and mod everywhere in here,
we just at the end say OK,
353
00:32:52,000 --> 00:32:56,000
do a mod at the end.
Everything is being done mod,
354
00:32:56,000 --> 00:32:59,000
modulo m.
And then typically we use a
355
00:32:59,000 --> 00:33:06,000
congruence sign.
OK, there's a more mathematical
356
00:33:06,000 --> 00:33:13,000
definition but this will work
for us engineers.
357
00:33:13,000 --> 00:33:18,000
OK, so everybody with me so
far?
358
00:33:18,000 --> 00:33:23,000
This is just applying the
definition.
359
00:33:23,000 --> 00:33:32,000
So that implies that the sum of
i equals 0 to r of a_i times (x_i minus
360
00:33:32,000 --> 00:33:41,000
y_i) is congruent to zero mod m.
OK, just threw it on the other
361
00:33:41,000 --> 00:33:45,000
side and applied the
distributive law.
362
00:33:45,000 --> 00:33:49,000
Now what I'm going to do is
pull out the 0-th position
363
00:33:49,000 --> 00:33:53,000
because that's the one that I
care about.
364
00:33:53,000 --> 00:33:58,000
And this is where it saves me
on the math, compared to if I
365
00:33:58,000 --> 00:34:03,000
didn't say that it was 0.
I'd have to pull out x_i.
366
00:34:03,000 --> 00:34:05,000
It wouldn't matter,
but it just would make the math
367
00:34:05,000 --> 00:34:06,000
a little bit cruftier.
368
00:34:23,000 --> 00:34:30,000
OK, so now we've just pulled
out one term.
369
00:34:30,000 --> 00:34:41,000
That implies that a_0 times (x_0 minus y_0)
is congruent to minus the sum from i equals 1 to r --
370
00:34:54,000 --> 00:34:58,000
-- of a_i times (x_i minus y_i), mod m.
Now remember that when I have a
371
00:34:58,000 --> 00:35:02,000
minus number mod m,
I just map it into whatever,
372
00:35:02,000 --> 00:35:07,000
into that range from 0 to m
minus one.
373
00:35:07,000 --> 00:35:12,000
So for example,
minus five mod seven is two.
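A quick check of that example. In Python the `%` operator already returns a value in the range 0 to m minus one for positive m. In C, by contrast, `-5 % 7` evaluates to -5, so the renormalization has to be done by hand there.

```python
m = 7
print(-5 % m)        # 2
print((-5 + m) % m)  # 2: adding a multiple of m does not change the residue
```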
374
00:35:12,000 --> 00:35:19,000
So if any of these things are
negative, we simply translate
375
00:35:19,000 --> 00:35:27,000
them into that range by adding multiples of
m because adding multiples of m
376
00:35:27,000 --> 00:35:32,000
doesn't affect the congruence.
377
00:35:39,000 --> 00:35:41,000
OK.
And now for the next step,
378
00:35:41,000 --> 00:35:44,000
we need to use a number theory
fact.
379
00:35:44,000 --> 00:35:48,000
So let's pull out our number
theory --
380
00:35:57,000 --> 00:36:05,000
-- textbook and take a little
digression.
381
00:36:10,000 --> 00:36:14,000
So this comes from the theory
of finite fields.
382
00:36:14,000 --> 00:36:17,000
So for people who are
knowledgeable,
383
00:36:17,000 --> 00:36:21,000
that's where you're plugging
your knowledge in.
384
00:36:21,000 --> 00:36:26,000
If you're not knowledgeable,
this is a great area of math to
385
00:36:26,000 --> 00:36:30,000
learn about.
So here's the fact.
386
00:36:30,000 --> 00:36:34,000
So let m be prime.
Then for any z,
387
00:36:34,000 --> 00:36:41,000
little z, an element of Z sub m,
and Z sub m is the integers mod
388
00:36:41,000 --> 00:36:46,000
m.
So this is essentially numbers
389
00:36:46,000 --> 00:36:51,000
from 0 to m minus one with all
the operations,
390
00:36:51,000 --> 00:36:57,000
times, minus,
plus, etc., defined on that
391
00:36:57,000 --> 00:37:04,000
such that if you end up outside
of the range of 0 to m minus
392
00:37:04,000 --> 00:37:11,000
one, you re-normalize by
subtracting or adding multiples
393
00:37:11,000 --> 00:37:21,000
of m to get back within the
range from 0 to m minus one.
394
00:37:21,000 --> 00:37:30,000
So it's the standard thing of
just doing things modulo m.
395
00:37:30,000 --> 00:37:38,000
So for any z such that z is not
congruent to 0,
396
00:37:38,000 --> 00:37:47,000
there exists a unique z inverse
in Z sub m, such that if I
397
00:37:47,000 --> 00:37:57,000
multiply z times the inverse,
it produces something congruent
398
00:37:57,000 --> 00:38:04,000
to one mod m.
So for any number it says,
399
00:38:04,000 --> 00:38:11,000
I can find another number that
when multiplied by it gives me
400
00:38:11,000 --> 00:38:15,000
one.
So let's just do an example for
401
00:38:15,000 --> 00:38:18,000
m equals seven.
So here we have,
402
00:38:18,000 --> 00:38:24,000
we'll make a little table.
So z is not equal to 0,
403
00:38:24,000 --> 00:38:29,000
so I just write down the other
numbers.
404
00:38:29,000 --> 00:38:35,000
And let's figure out what z
inverse is.
405
00:38:35,000 --> 00:38:41,000
So what's the inverse of one?
What number when multiplied by
406
00:38:41,000 --> 00:38:43,000
one gives me one?
One.
407
00:38:43,000 --> 00:38:45,000
Good.
How about two?
408
00:38:45,000 --> 00:38:51,000
What number when I multiply it
by two gives me one?
409
00:38:51,000 --> 00:38:55,000
Four.
Because two times four is eight
410
00:38:55,000 --> 00:39:01,000
and eight is congruent to one
mod seven.
411
00:39:01,000 --> 00:39:04,000
So I've re-normalized it.
What about three?
412
00:39:12,000 --> 00:39:13,000
Five.
Good.
413
00:39:13,000 --> 00:39:16,000
Five.
Three times five is 15.
414
00:39:16,000 --> 00:39:22,000
That's congruent to one mod
seven because 15 divided by
415
00:39:22,000 --> 00:39:28,000
seven is two remainder of one.
So that's the key thing.
416
00:39:28,000 --> 00:39:32,000
What about four?
Two.
417
00:39:32,000 --> 00:39:36,000
Five? Three. And six.
418
00:39:43,000 --> 00:39:43,000
Yeah.
Six.
Yeah, six it turns out.
OK, six times six is 36.
419
00:39:48,000 --> 00:39:52,000
OK, mod seven.
Basically subtract off the 35,
420
00:39:52,000 --> 00:39:56,000
gives me one.
So people have observed some
421
00:39:56,000 --> 00:40:02,000
interesting facts that if one
number's an inverse of another,
422
00:40:02,000 --> 00:40:08,000
then that other is an inverse
of the one.
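The table for m equals seven can be reproduced with a brute-force search. This is only a sketch: the function name inverse_table is an assumption, and the search is the slow, obvious way to do it, not an efficient algorithm.

```python
def inverse_table(m):
    """For each nonzero z in Z_m, find w with z * w congruent to 1 mod m."""
    table = {}
    for z in range(1, m):
        for w in range(1, m):
            if (z * w) % m == 1:
                table[z] = w
                break
    return table

table = inverse_table(7)
print(table)

# If w is the inverse of z, then z is the inverse of w.
assert all(table[table[z]] == z for z in table)
```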
423
00:40:08,000 --> 00:40:12,000
So that's actually one of these
things that you prove when you
424
00:40:12,000 --> 00:40:16,000
do group theory and field theory
and so forth.
425
00:40:16,000 --> 00:40:21,000
There are all sorts of other
great properties of this kind of
426
00:40:21,000 --> 00:40:23,000
math.
But the main thing is,
427
00:40:23,000 --> 00:40:27,000
and this turns out not to be
true if m is not a prime.
428
00:40:27,000 --> 00:40:31,000
So can somebody think of,
imagine we're doing something
429
00:40:31,000 --> 00:40:36,000
mod 10.
Can somebody think of a number
430
00:40:36,000 --> 00:40:39,000
that doesn't have an inverse mod
10?
431
00:40:39,000 --> 00:40:40,000
Yeah.
Two.
432
00:40:40,000 --> 00:40:45,000
Another one is five.
OK, it turns out the divisors
433
00:40:45,000 --> 00:40:49,000
in fact actually,
more generally,
434
00:40:49,000 --> 00:40:53,000
something that is not
relatively prime,
435
00:40:53,000 --> 00:40:58,000
meaning that it has a common
factor, so the GCD is not one
436
00:40:58,000 --> 00:41:04,000
between that number and the
modulus.
437
00:41:04,000 --> 00:41:08,000
OK, those numbers do not have
an inverse mod m.
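A small check of the mod 10 case, as a sketch. It uses math.gcd from the Python standard library; the function names are made up for illustration.

```python
from math import gcd

def has_inverse(z, m):
    """True if some w in 1..m-1 satisfies z * w congruent to 1 mod m."""
    return any((z * w) % m == 1 for w in range(1, m))

def non_invertible(m):
    """Nonzero residues mod m that have no multiplicative inverse."""
    return [z for z in range(1, m) if not has_inverse(z, m)]

bad = non_invertible(10)
print(bad)

# These are exactly the residues that share a factor with the modulus.
assert bad == [z for z in range(1, 10) if gcd(z, 10) != 1]
# With a prime modulus, every nonzero residue has an inverse.
assert non_invertible(7) == []
```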
438
00:41:08,000 --> 00:41:13,000
OK, but if it's prime,
every number is relatively
439
00:41:13,000 --> 00:41:17,000
prime to the modulus.
And that's the property that
440
00:41:17,000 --> 00:41:22,000
we're taking advantage of.
So this is our fact and so,
441
00:41:22,000 --> 00:41:28,000
in this case what I'm after is
I want to divide by x_0 minus
442
00:41:28,000 --> 00:41:31,000
y_0.
That's what I want to do at
443
00:41:31,000 --> 00:41:34,000
this point.
But I can't do that if x_0,
444
00:41:34,000 --> 00:41:36,000
first of all,
if m isn't prime,
445
00:41:36,000 --> 00:41:40,000
I can't necessarily do that.
I might be able to,
446
00:41:40,000 --> 00:41:43,000
but I can't necessarily.
But if m is prime,
447
00:41:43,000 --> 00:41:46,000
I can definitely divide by x_0
minus y_0.
448
00:41:46,000 --> 00:41:49,000
I can find that inverse.
And the other thing I have to
449
00:41:49,000 --> 00:41:52,000
do is make sure x_0 minus y_0 is
not 0.
450
00:41:52,000 --> 00:41:57,000
OK, it would be 0 if these two
were equal, but our supposition
451
00:41:57,000 --> 00:42:01,000
was they weren't equal.
And once again,
452
00:42:01,000 --> 00:42:05,000
just bringing it back to the
without loss of generality,
453
00:42:05,000 --> 00:42:08,000
if it were some other position
that we were off,
454
00:42:08,000 --> 00:42:13,000
I would be doing exactly the
same thing with that position.
455
00:42:13,000 --> 00:42:16,000
So now we're going to be able
to divide.
456
00:42:16,000 --> 00:42:19,000
So we continue with our --
457
00:42:24,000 --> 00:42:33,000
-- continue with our proof.
So since x_0 is not equal to
458
00:42:33,000 --> 00:42:42,000
y_0, there exists an inverse for
x_0 minus y_0.
459
00:42:42,000 --> 00:42:48,000
And that implies,
just continue on from over
460
00:42:48,000 --> 00:42:56,000
there, that a_0 is congruent
therefore to minus the sum of i
461
00:42:56,000 --> 00:43:04,000
equal one to r of a_i,
x_i minus y_i times x_0 minus
462
00:43:04,000 --> 00:43:10,000
y_0 inverse.
So let's just go back to the
463
00:43:10,000 --> 00:43:15,000
beginning of our proof and see
what we've derived.
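Written out, and assuming the dot-product hash function defined earlier in the lecture, the chain of steps is roughly:

```latex
\begin{align*}
h_a(x) = h_a(y)
  &\iff \sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m} \\
  &\iff a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m} \\
  &\iff a_0 \equiv \Bigl( -\sum_{i=1}^{r} a_i (x_i - y_i) \Bigr) (x_0 - y_0)^{-1} \pmod{m}
\end{align*}
```

The last step is where the number theory fact is used: the inverse of x_0 minus y_0 exists because m is prime and x_0 is not equal to y_0.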
464
00:43:15,000 --> 00:43:19,000
If we're saying we have two
distinct keys,
465
00:43:19,000 --> 00:43:24,000
and we've picked all of these
a_i randomly,
466
00:43:24,000 --> 00:43:30,000
and we're saying that these two
distinct keys hash to the same
467
00:43:30,000 --> 00:43:34,000
place.
If they hash to the same place,
468
00:43:34,000 --> 00:43:41,000
it says that a_0 essentially
had to have a particular value
469
00:43:41,000 --> 00:43:47,000
as a function of the other a_i.
Because in other words,
470
00:43:47,000 --> 00:43:51,000
once I've picked each of these
a_i from one to r,
471
00:43:51,000 --> 00:43:54,000
if I did them in that order,
for example,
472
00:43:54,000 --> 00:43:58,000
then I don't have a choice for
how I pick a_0 to make it
473
00:43:58,000 --> 00:44:00,000
collide.
Exactly one value allows it to
474
00:44:00,000 --> 00:44:05,000
collide, namely the value of a_0
given by this.
475
00:44:05,000 --> 00:44:10,000
If I picked a different value
of a_0, they wouldn't collide.
476
00:44:10,000 --> 00:44:16,000
So let me write that down.
Thus, while you think about it
477
00:45:12,000 --> 00:45:18,000
So for any choice of these a_i,
there's exactly one of the
478
00:45:18,000 --> 00:45:24,000
m possible choices of a_0 that
causes a collision.
479
00:45:24,000 --> 00:45:29,000
And for all the other choices I
might make of a_0,
480
00:45:29,000 --> 00:45:36,000
there's no collision.
So essentially I don't have,
481
00:45:36,000 --> 00:45:42,000
if they're going to collide,
I've reduced essentially the
482
00:45:42,000 --> 00:45:49,000
number of degrees of freedom of
my randomness by a factor of m.
483
00:45:49,000 --> 00:45:55,000
So if I count up the number of
h_a's that cause x and y to
484
00:45:55,000 --> 00:46:01,000
collide, that's equal to,
well, there's m choices,
485
00:46:01,000 --> 00:46:06,000
just using the product rule
again.
486
00:46:06,000 --> 00:46:13,000
There's m choices for a_1 times
m choices for a_2,
487
00:46:13,000 --> 00:46:21,000
up to m choices for a_r and
then only one choice for a_0.
488
00:46:21,000 --> 00:46:28,000
So this is choices for a_1,
a_2, a_r and only one choice
489
00:46:28,000 --> 00:46:35,000
for a_0 if they're going to
collide.
490
00:46:35,000 --> 00:46:40,000
If they're not going to
collide, I've got more choices
491
00:46:40,000 --> 00:46:43,000
for a_0.
But if I want them to collide,
492
00:46:43,000 --> 00:46:48,000
there's only one value I can
pick, namely this value.
493
00:46:48,000 --> 00:46:53,000
That's the only value for which
they will collide.
494
00:46:53,000 --> 00:46:58,000
And that's equal to m to the r,
which is just the size of H
495
00:46:58,000 --> 00:47:03,000
divided by m.
And that completes the proof.
496
00:47:11,000 --> 00:47:14,000
So there are other universal
constructions,
497
00:47:14,000 --> 00:47:18,000
but this is a particularly
elegant one.
498
00:47:18,000 --> 00:47:22,000
So the point is that I have m
plus one, sorry,
499
00:47:22,000 --> 00:47:27,000
r plus one degrees of freedom
where each degree of freedom I
500
00:47:27,000 --> 00:47:33,000
have m choices.
But if I want them to collide,
501
00:47:33,000 --> 00:47:40,000
once I've picked any of the,
once I've picked r of those
502
00:47:40,000 --> 00:47:45,000
possible choices,
the last one is forced if I
503
00:47:45,000 --> 00:47:48,000
want it to collide.
So therefore,
504
00:47:48,000 --> 00:47:55,000
the set of functions for which
it collides is only one in m.
505
00:47:55,000 --> 00:48:01,000
A very slick construction.
Very slick.
506
00:48:01,000 --> 00:48:03,000
OK.
Everybody with me here?
507
00:48:03,000 --> 00:48:07,000
Didn't lose too many people?
Yeah, question.
508
00:48:07,000 --> 00:48:12,000
Well, part of it is,
actually this is a quite common
509
00:48:12,000 --> 00:48:15,000
type of thing to be doing
actually.
510
00:48:15,000 --> 00:48:19,000
If you take a class,
so we have follow on classes in
511
00:48:19,000 --> 00:48:24,000
cryptography and so forth,
and this kind of thing of
512
00:48:24,000 --> 00:48:29,000
taking dot products,
modulo m and also Galois fields
513
00:48:29,000 --> 00:48:34,000
which are particularly simple
finite fields and things like
514
00:48:34,000 --> 00:48:40,000
that, people play with these all
the time.
515
00:48:40,000 --> 00:48:43,000
So Galois fields are like using
XORs as your,
516
00:48:43,000 --> 00:48:46,000
same sort of thing as this
except base two.
517
00:48:46,000 --> 00:48:49,000
And so there's a lot of study
of this sort of thing.
518
00:48:49,000 --> 00:48:53,000
So people understand these kind
of properties.
519
00:48:53,000 --> 00:48:57,000
But yeah, it's like what's the
algorithm for having a brilliant
520
00:48:57,000 --> 00:49:01,000
insight into algorithms?
It's like OK.
521
00:49:01,000 --> 00:49:05,000
Wish I knew.
Then I'd just turn the crank.
522
00:49:05,000 --> 00:49:11,000
[LAUGHTER] But if it were that
easy, I wouldn't be standing up
523
00:49:11,000 --> 00:49:13,000
here today.
[LAUGHTER] Good.
524
00:49:13,000 --> 00:49:19,000
OK, so now I want to take on
another topic which is also I
525
00:49:19,000 --> 00:49:22,000
find, I think this is
astounding.
526
00:49:22,000 --> 00:49:27,000
It's just beautiful,
beautiful mathematics and a big
527
00:49:27,000 --> 00:49:34,000
impact on your ability to build
good hash functions.
528
00:49:34,000 --> 00:49:37,000
Now I want to talk about
another topic,
529
00:49:37,000 --> 00:49:41,000
which is related,
which is the topic of perfect
530
00:49:41,000 --> 00:49:42,000
hashing.
531
00:49:54,000 --> 00:49:59,000
So everything we've done so far
does expected time performance.
532
00:49:59,000 --> 00:50:03,000
Hashing is good in the expected
sense.
533
00:50:03,000 --> 00:50:08,000
Perfect hashing addresses the
following question.
534
00:50:08,000 --> 00:50:14,000
Suppose that I gave you a set
of keys, and I said just build
535
00:50:14,000 --> 00:50:20,000
me a static table so I can look
up whether the key is in the
536
00:50:20,000 --> 00:50:25,000
table with worst case time.
Good worst case time.
537
00:50:25,000 --> 00:50:31,000
So I have a fixed set of keys.
They might be something like
538
00:50:31,000 --> 00:50:37,000
for example, the hundred most
common or thousand most common
539
00:50:37,000 --> 00:50:42,000
words in English.
And when I get a word I want to
540
00:50:42,000 --> 00:50:47,000
check quickly in this table,
is the word that I've got one
541
00:50:47,000 --> 00:50:49,000
of the most common words in
English.
542
00:50:49,000 --> 00:50:54,000
I would like to do that not
with expected performance,
543
00:50:54,000 --> 00:50:57,000
but guaranteed worst case
performance.
544
00:50:57,000 --> 00:51:03,000
Is there a way of building it
so that I can find this quickly?
545
00:51:03,000 --> 00:51:06,000
So the problem is given n keys
--
546
00:51:12,000 --> 00:51:14,000
-- construct a static hash
table.
547
00:51:14,000 --> 00:51:17,000
In other words,
no insertion and deletion.
548
00:51:17,000 --> 00:51:20,000
We're just going to put the
elements in there.
549
00:51:20,000 --> 00:51:22,000
A size --
550
00:51:30,000 --> 00:51:37,000
-- m equal Order n.
So I don't want it to be a huge
551
00:51:37,000 --> 00:51:42,000
table.
I want it to be a table that is
552
00:51:42,000 --> 00:51:50,000
the size of my keys.
Table of size m equals Order n,
553
00:51:50,000 --> 00:51:59,000
such that search takes O(1)
time in the worst case.
554
00:52:06,000 --> 00:52:10,000
So there's no place in the
table where I'm going to have,
555
00:52:10,000 --> 00:52:14,000
I know in the average case,
that's not hard to do.
556
00:52:14,000 --> 00:52:18,000
But in the worst case,
I want to make sure that
557
00:52:18,000 --> 00:52:22,000
there's no particular spot where
the number of keys piles up to
558
00:52:22,000 --> 00:52:26,000
be a large number.
OK, in no spot should that
559
00:52:26,000 --> 00:52:29,000
happen.
Every single search I do should
560
00:52:29,000 --> 00:52:33,000
take Order one time.
There shouldn't be any
561
00:52:33,000 --> 00:52:37,000
statistical variation in terms
of how long it takes me to get
562
00:52:37,000 --> 00:52:39,000
something.
Does everybody understand what
563
00:52:39,000 --> 00:52:42,000
the puzzle is?
So this is a great,
564
00:52:42,000 --> 00:52:45,000
because this actually ends up
having a lot of uses.
565
00:52:45,000 --> 00:52:49,000
You know, you want to build a
table for something and you know
566
00:52:49,000 --> 00:52:52,000
what the values are that you're
going to look up in it.
567
00:52:52,000 --> 00:52:56,000
But you don't want to spend a
lot of space on it and so forth.
568
00:52:56,000 --> 00:53:00,000
So the idea here is actually
going to be to use a two-level
569
00:53:00,000 --> 00:53:02,000
scheme.
570
00:53:09,000 --> 00:53:22,000
So the idea is we're going to
use a two-level scheme with
571
00:53:22,000 --> 00:53:31,000
universal hashing at both
levels.
572
00:53:31,000 --> 00:53:36,000
So the idea is we're going to
hash, we're going to have a hash
573
00:53:36,000 --> 00:53:41,000
table, we're going to hash into
slots, but rather than using
574
00:53:41,000 --> 00:53:46,000
chaining, we're going to have
another hash table there.
575
00:53:46,000 --> 00:53:51,000
We're going to do a second hash
into the second hash table.
576
00:53:51,000 --> 00:53:56,000
And the idea is that we're
going to do it in such a way
577
00:53:56,000 --> 00:54:01,000
that we have no collisions at
level two.
578
00:54:01,000 --> 00:54:03,000
So we may have collisions at
level one.
579
00:54:03,000 --> 00:54:08,000
We'll take anything that
collides at level one and put
580
00:54:08,000 --> 00:54:12,000
them into a hash table and then
our second level hash table,
581
00:54:12,000 --> 00:54:15,000
but that hash table,
no collisions.
582
00:54:15,000 --> 00:54:17,000
Boom.
We're just going to hash right
583
00:54:17,000 --> 00:54:20,000
in there.
And it'll just go boom to its
584
00:54:20,000 --> 00:54:23,000
thing.
So let's draw a picture of this
585
00:54:23,000 --> 00:54:28,000
to illustrate the scheme.
OK, so we have --
586
00:54:34,000 --> 00:54:37,000
-- 0, one, let's say six,
m minus one.
587
00:54:37,000 --> 00:54:42,000
So here's our hash table.
And what we're going to do is
588
00:54:42,000 --> 00:54:47,000
we're going to use universal
hashing at the first level,
589
00:54:47,000 --> 00:54:49,000
OK.
So we find a universal hash
590
00:54:49,000 --> 00:54:52,000
function.
We pick a hash function at
591
00:54:52,000 --> 00:54:56,000
random.
And what we'll do is we'll hash
592
00:54:56,000 --> 00:55:00,000
into that level.
And then what we'll do is we'll
593
00:55:00,000 --> 00:55:05,000
keep track of two things.
One is what the size of the
594
00:55:05,000 --> 00:55:09,000
hash table is at the next level.
So in this case,
595
00:55:09,000 --> 00:55:13,000
the size of the hash table, we'll
only use the number of slots.
596
00:55:13,000 --> 00:55:17,000
There's going to be four.
And we're also going to keep a
597
00:55:17,000 --> 00:55:19,000
separate hash key for the second
level.
598
00:55:19,000 --> 00:55:23,000
So each slot will have its own
hash function for the second
599
00:55:23,000 --> 00:55:25,000
level.
So for example,
600
00:55:25,000 --> 00:55:30,000
this one might have a key of 31
that is a random number.
601
00:55:30,000 --> 00:55:32,000
The a's here.
a's up there.
602
00:55:32,000 --> 00:55:34,000
There we go,
a's up there.
603
00:55:34,000 --> 00:55:39,000
So that's going to be the basis
of my hash function,
604
00:55:39,000 --> 00:55:42,000
the key with which I'm going to
hash.
605
00:55:42,000 --> 00:55:46,000
This one say has 86.
And let's say that this,
606
00:55:46,000 --> 00:55:50,000
and then we have a pointer to
the hash table.
607
00:55:50,000 --> 00:55:55,000
This is say S_1.
And it's got four slots and we
608
00:55:55,000 --> 00:56:01,000
stored up 14 and 27.
And these two slots are empty.
609
00:56:01,000 --> 00:56:09,000
And this one for example,
had what?
610
00:56:09,000 --> 00:56:12,000
Two nines.
611
00:56:28,000 --> 00:56:34,000
So the idea here is that in
this case if we look over all
612
00:56:34,000 --> 00:56:40,000
our top level hash function,
which I'll just call H,
613
00:56:40,000 --> 00:56:47,000
has that H of 14 is equal to H
of 27 is equal to one.
614
00:56:47,000 --> 00:56:53,000
Because we're in slot one.
OK, so these two both hash to
615
00:56:53,000 --> 00:56:57,000
the same slot in the level one
hash table.
616
00:56:57,000 --> 00:57:02,000
This is level one.
And this is level two over
617
00:57:02,000 --> 00:57:06,000
here.
So level one hashing,
618
00:57:06,000 --> 00:57:11,000
14 and 27 collided.
They went into the same slot
619
00:57:11,000 --> 00:57:13,000
here.
But at level two,
620
00:57:13,000 --> 00:57:20,000
they got hashed to different
places and the hash function I
621
00:57:20,000 --> 00:57:26,000
use is going to be indexed by
whatever the random numbers are
622
00:57:26,000 --> 00:57:33,000
that I chose and found for those
and I'll show you how we find
623
00:57:33,000 --> 00:57:36,000
those.
We have then h sub 31 of 14 is
624
00:57:36,000 --> 00:57:43,000
equal to one, and h sub 31 of 27 is
equal to two.
625
00:57:43,000 --> 00:57:46,000
For level two.
So I go, hash in here,
626
00:57:46,000 --> 00:57:51,000
find the, use this as the basis
of my hash function to hash into
627
00:57:51,000 --> 00:57:55,000
whatever table I've got here.
And so, if there are no,
628
00:57:55,000 --> 00:58:00,000
if I can guarantee that there
are no collisions at level two,
629
00:58:00,000 --> 00:58:05,000
this is going to cost me Order
one time in the worst case to
630
00:58:05,000 --> 00:58:09,000
look something up.
How do I look it up?
631
00:58:09,000 --> 00:58:12,000
Take the value.
I apply h to it.
632
00:58:12,000 --> 00:58:16,000
That takes me to some slot.
Then I look to see what the key
633
00:58:16,000 --> 00:58:21,000
is for this hash function.
I apply that hash function and
634
00:58:21,000 --> 00:58:24,000
that takes me to another slot.
Then I go there.
635
00:58:24,000 --> 00:58:29,000
And that took me basically two
applications of hash functions
636
00:58:29,000 --> 00:58:33,000
plus some look-up,
plus who knows what minor
637
00:58:33,000 --> 00:58:41,000
amount of bookkeeping.
So the reason we're going to
638
00:58:41,000 --> 00:58:50,000
have no collisions at this level
is the following.
639
00:58:50,000 --> 00:59:01,000
If there are n sub i items that
hash to a level one slot i,
640
00:59:01,000 --> 00:59:11,000
then we're going to use m sub
i, which is equal to n sub i
641
00:59:11,000 --> 00:59:21,000
squared slots in the level two
hash table.
642
00:59:29,000 --> 00:59:33,000
OK, so I should have mentioned
here this is going to be m sub
643
00:59:33,000 --> 00:59:37,000
i, the size of the hash table
and this is going to be my a sub
644
00:59:37,000 --> 00:59:39,000
i essentially.
645
00:59:45,000 --> 00:59:50,000
So I'm going to use,
so basically I'm going to hash
646
00:59:50,000 --> 00:59:55,000
n sub i things into n sub i
squared locations here.
647
00:59:55,000 --> 01:00:00,000
So this is going to be
incredibly sparse.
648
01:00:00,000 --> 01:00:02,480
OK, it's going to be quadratic
in size.
649
01:00:02,480 --> 01:00:05,612
And so what I'm going to show
is that under those
650
01:00:05,612 --> 01:00:08,418
circumstances,
it's easy for me to find hash
651
01:00:08,418 --> 01:00:11,159
functions such that there are no
collisions.
652
01:00:11,159 --> 01:00:15,010
That's the name of the game.
Figure out how can I make these
653
01:00:15,010 --> 01:00:18,012
hash functions so that there are
no collisions.
654
01:00:18,012 --> 01:00:21,341
So that's why I draw this with
so few elements here.
655
01:00:21,341 --> 01:00:24,604
So here for example,
I have two elements and I have
656
01:00:24,604 --> 01:00:27,867
a hash table size four here.
I have three elements.
657
01:00:27,867 --> 01:00:32,520
I need a hash table size nine.
OK, if there are a hundred
658
01:00:32,520 --> 01:00:34,918
elements, I need a hash table
size 10,000.
659
01:00:34,918 --> 01:00:38,485
I'm not going to pick something
so there's likely that there's
660
01:00:38,485 --> 01:00:41,350
anything of that size.
And then the fact that this
661
01:00:41,350 --> 01:00:44,801
actually works and gives us all
the properties that we want,
662
01:00:44,801 --> 01:00:48,251
that's part of the analysis.
So does everybody see that this
663
01:00:48,251 --> 01:00:51,877
takes Order one worst case time
and what the basic structure of
664
01:00:51,877 --> 01:00:52,988
it is?
These things,
665
01:00:52,988 --> 01:00:55,210
by the way, are not in this
case prime.
666
01:00:55,210 --> 01:00:58,134
I could always pick primes that
were close to this.
667
01:00:58,134 --> 01:01:03,730
I didn't do that in this case.
Or I could use a universal hash
668
01:01:03,730 --> 01:01:09,103
function that in fact would work
for things other than primes.
669
01:01:09,103 --> 01:01:12,362
But I didn't do that for this
example.
670
01:01:12,362 --> 01:01:16,943
We all ready for analysis?
OK, let's do some analysis
671
01:01:16,943 --> 01:01:18,000
then.
672
01:01:29,000 --> 01:01:31,000
And this is really pretty
analysis.
673
01:01:31,000 --> 01:01:33,528
Partly as you'll see because
we've already done some of this
674
01:01:33,528 --> 01:01:34,000
analysis.
675
01:01:50,000 --> 01:01:53,238
So the trick is analyzing level
two.
676
01:01:53,238 --> 01:01:57,309
That's the main thing that I
want to analyze,
677
01:01:57,309 --> 01:02:02,583
to show that I can find hash
functions here that are going
678
01:02:02,583 --> 01:02:06,192
to, when I map them into,
very sparsely,
679
01:02:06,192 --> 01:02:09,523
into these arrays here,
that in fact,
680
01:02:09,523 --> 01:02:16,000
such hash functions exist and I
can compute them in advance.
681
01:02:16,000 --> 01:02:23,344
So that I have a good way of
storing those.
682
01:02:23,344 --> 01:02:30,338
So here's the theorem we're
going to use.
683
01:02:30,338 --> 01:02:40,830
If I hash n keys into m equals
n squared slots using a random
684
01:02:40,830 --> 01:02:48,000
hash function in a universal set
H.
685
01:02:48,000 --> 01:03:00,393
Then the expected number of
collisions is less than one
686
01:03:00,393 --> 01:03:02,502
half.
OK.
687
01:03:02,502 --> 01:03:11,372
The expected number of
collisions I don't expect there
688
01:03:11,372 --> 01:03:20,577
to be even one collision.
I expect there to be less than
689
01:03:20,577 --> 01:03:29,447
half a collision on average.
And so, let's prove this,
690
01:03:29,447 --> 01:03:39,154
so that the probability that
two given keys collide under h
691
01:03:39,154 --> 01:03:45,216
is what?
What's the probability that two
692
01:03:45,216 --> 01:03:51,443
given keys collide under h when
h is chosen randomly from the
693
01:03:51,443 --> 01:03:54,037
universal set?
One over m.
694
01:03:54,037 --> 01:03:56,943
Right?
That's the definition,
695
01:03:56,943 --> 01:04:02,235
right, of, which is in this
case equal to one over n
696
01:04:02,235 --> 01:04:06,210
squared.
So now how many keys,
697
01:04:06,210 --> 01:04:11,052
how many pairs of keys do I
have in this table?
698
01:04:11,052 --> 01:04:16,526
How many keys could possibly
collide with each other?
699
01:04:16,526 --> 01:04:19,368
OK.
So that's basically just
700
01:04:19,368 --> 01:04:25,157
looking at how many different
pairs of keys do I have to
701
01:04:25,157 --> 01:04:30,315
evaluate this for.
So that's n choose two pairs of
702
01:04:30,315 --> 01:04:36,654
keys.
n choose two pairs of keys.
703
01:04:36,654 --> 01:04:42,689
So therefore,
the expected number of
704
01:04:42,689 --> 01:04:52,172
collisions is, well, for each of
these n, not n over two.
705
01:04:52,172 --> 01:05:00,793
n choose two pairs of keys.
The probability that it
706
01:05:00,793 --> 01:05:08,923
collides is one in n squared.
So that's equal to n times n
707
01:05:08,923 --> 01:05:12,221
minus one over two,
if you remember your formula,
708
01:05:12,221 --> 01:05:16,000
times one in n squared.
And that's less than a half.
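As a worked equation, with m equal to n squared and a collision probability of at most one over m for each pair:

```latex
\begin{align*}
\mathrm{E}[\text{number of collisions}]
  &\le \binom{n}{2} \cdot \frac{1}{n^2}
   = \frac{n(n-1)}{2} \cdot \frac{1}{n^2}
   = \frac{n-1}{2n}
   < \frac{1}{2}
\end{align*}
```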
709
01:05:24,000 --> 01:05:28,183
So for every pair of keys,
so those of you who remember
710
01:05:28,183 --> 01:05:33,063
from 6.042 the birthday paradox,
this is related to the birthday
711
01:05:33,063 --> 01:05:36,800
paradox a little bit.
But here I basically have a
712
01:05:36,800 --> 01:05:40,333
large set, and I'm looking at
all pairs, but my set is
713
01:05:40,333 --> 01:05:44,000
sufficiently big that the odds
that I get a collision is
714
01:05:44,000 --> 01:05:47,199
relatively small.
If I start increasing it beyond
715
01:05:47,199 --> 01:05:50,400
the square root of m,
OK, the number of elements,
716
01:05:50,400 --> 01:05:54,466
if it starts getting bigger than the
square root of m, then the odds
717
01:05:54,466 --> 01:05:57,733
of a collision go up
dramatically as you know from
718
01:05:57,733 --> 01:06:01,532
the birthday paradox.
But if I'm less than,
719
01:06:01,532 --> 01:06:05,401
if I'm really sparse in there,
I don't get collisions.
720
01:06:05,401 --> 01:06:09,197
Or at least I get a relatively
small number expected.
721
01:06:09,197 --> 01:06:13,430
Now I want to remind you of
something which actually in the
722
01:06:13,430 --> 01:06:17,080
past I have just assumed,
but I want to actually go
723
01:06:17,080 --> 01:06:20,291
through it briefly.
It's Markov's inequality.
724
01:06:20,291 --> 01:06:22,919
So who remembers Markov's
inequality?
725
01:06:22,919 --> 01:06:25,839
Don't everybody raise their
hand at once.
726
01:06:25,839 --> 01:06:30,000
So Markov's inequality says the
following.
727
01:06:30,000 --> 01:06:34,145
This is one of these great
probability facts.
728
01:06:34,145 --> 01:06:38,762
For random variable x which is
bounded below by 0,
729
01:06:38,762 --> 01:06:44,227
says the probability that x is
bigger than, greater than or
730
01:06:44,227 --> 01:06:49,316
equal to any given value T is
less than or equal to the
731
01:06:49,316 --> 01:06:53,838
expectation of x divided by T.
It's a great fact.
732
01:06:53,838 --> 01:06:57,796
Doesn't hold if x isn't bounded
below by 0.
733
01:06:57,796 --> 01:07:03,230
But it's a great fact.
It allows me to relate the
734
01:07:03,230 --> 01:07:06,833
probability of an event to its
expectation.
735
01:07:06,833 --> 01:07:12,066
And the idea is in general that
if the expectation is going to
736
01:07:12,066 --> 01:07:17,213
be small, then I can't have a
high probability that the value
737
01:07:17,213 --> 01:07:21,845
of the random variable is large.
It doesn't make sense.
738
01:07:21,845 --> 01:07:26,649
How could you have a high
probability that it's a million
739
01:07:26,649 --> 01:07:31,968
when my expectation is one or in
this case we're going to apply
740
01:07:31,968 --> 01:07:36,000
it when the expectation is a
half?
741
01:07:36,000 --> 01:07:39,676
Couldn't happen.
And the proof follows just
742
01:07:39,676 --> 01:07:44,666
directly on the definition of
expectation, and so I'm doing
743
01:07:44,666 --> 01:07:47,730
this for a discrete random
variable.
744
01:07:47,730 --> 01:07:52,282
So the expectation by
definition is just the sum from
745
01:07:52,282 --> 01:07:57,622
little x equals 0 to infinity
of x times the probability that
746
01:07:57,622 --> 01:08:02,000
my random variable takes on the
value x.
747
01:08:02,000 --> 01:08:06,560
That's the definition.
And now it's just a question of
748
01:08:06,560 --> 01:08:11,120
doing like the coarsest
approximation you can imagine.
749
01:08:11,120 --> 01:08:14,734
First of all,
let me just simply throw away
750
01:08:14,734 --> 01:08:19,725
all small terms, so that's greater than
or equal to the sum from x equals
751
01:08:19,725 --> 01:08:24,716
T to infinity of x times the
probability that x is equal to
752
01:08:24,716 --> 01:08:28,072
little x.
So just throw away all the low
753
01:08:28,072 --> 01:08:31,426
order terms.
Now what I'm going to do is
754
01:08:31,426 --> 01:08:36,848
observe that every one of these terms
is lower bounded by the value x
755
01:08:36,848 --> 01:08:42,875
equals T.
So that's just the summation of
756
01:08:42,875 --> 01:08:49,750
x equals T to infinity of T
times the probability that x
757
01:08:49,750 --> 01:08:51,250
equals x.
OK.
758
01:08:51,250 --> 01:08:58,250
Over x going from T larger.
Because these are only bigger
759
01:08:58,250 --> 01:09:02,009
values.
And that's just equal then to
760
01:09:02,009 --> 01:09:06,306
T, because I can pull that out,
and the summation of x equals T
761
01:09:06,306 --> 01:09:10,256
to infinity of the probability
that x equals x is just the
762
01:09:10,256 --> 01:09:14,000
probability that x is greater
than or equal to T.
763
01:09:20,000 --> 01:09:26,000
And that's done because I just
divide by T.
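The whole argument, for a discrete random variable X taking nonnegative integer values and a threshold t greater than 0, fits in one line of inequalities:

```latex
\begin{align*}
\mathrm{E}[X]
  &= \sum_{x=0}^{\infty} x \Pr\{X = x\}
   \ge \sum_{x=t}^{\infty} x \Pr\{X = x\}
   \ge \sum_{x=t}^{\infty} t \Pr\{X = x\}
   = t \Pr\{X \ge t\}
\end{align*}
```

Dividing both sides by t gives the probability that X is at least t is at most E[X] over t.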
764
01:09:31,000 --> 01:09:34,379
So that's Markov's inequality.
Really dumb.
765
01:09:34,379 --> 01:09:37,919
Really simple.
There are much stronger things
766
01:09:37,919 --> 01:09:42,264
like Chebyshev bounds and
Chernoff bounds and things of
767
01:09:42,264 --> 01:09:44,839
that nature.
But Markov's is like
768
01:09:44,839 --> 01:09:49,586
unbelievably simple and useful.
So we're going to just apply
769
01:09:49,586 --> 01:09:52,000
that as a corollary.
770
01:10:06,000 --> 01:10:13,059
So the probability now of no
collisions, when I hash n keys
771
01:10:13,059 --> 01:10:19,391
into n squared slots using a
universal hash function,
772
01:10:19,391 --> 01:10:26,817
I claim is the probability of
no collisions is greater than or
773
01:10:26,817 --> 01:10:32,173
equal to a half.
So I pick a hash function at
774
01:10:32,173 --> 01:10:36,409
random.
What are the odds that I got no
775
01:10:36,409 --> 01:10:40,917
collisions when I hashed those n
keys into n squared slots?
776
01:10:40,917 --> 01:10:43,326
Answer.
Probability is I have no
777
01:10:43,326 --> 01:10:47,834
collisions is at least a half.
Half the time I'm guaranteed
778
01:10:47,834 --> 01:10:51,409
that there won't be a collision.
And the proof,
779
01:10:51,409 --> 01:10:54,129
pretty simple.
The probability of no
780
01:10:54,129 --> 01:10:57,549
collisions is the same as the
probability as,
781
01:10:57,549 --> 01:11:01,746
sorry, is one minus the
probability that I have at least
782
01:11:01,746 --> 01:11:05,850
one collision.
So the odds that I have at
783
01:11:05,850 --> 01:11:09,337
least one collision,
the odds that I have at least
784
01:11:09,337 --> 01:11:12,254
one collision,
probability greater than or
785
01:11:12,254 --> 01:11:15,599
equal to one collision is less
than or equal to,
786
01:11:15,599 --> 01:11:18,872
now I just apply Markov's
inequality with this.
787
01:11:18,872 --> 01:11:23,000
So it's just the expected
number of collisions --
788
01:11:29,000 --> 01:11:33,090
-- divided by one.
And that is by Markov's
789
01:11:33,090 --> 01:11:36,272
inequality less than,
by definition,
790
01:11:36,272 --> 01:11:40,181
excuse me, of expected number
of collisions,
791
01:11:40,181 --> 01:11:44,363
which we've already shown,
is less than a half.
792
01:11:44,363 --> 01:11:49,636
So the probability of at least
one collision is less than a
793
01:11:49,636 --> 01:11:52,909
half.
The probability of 0 collisions
794
01:11:52,909 --> 01:11:56,363
is at least a half.
So we're done here.
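Written out, the corollary might look like the following sketch. Here X is the number of colliding pairs among the n keys, and the bound on E[X] assumes the universal-hashing guarantee that any two distinct keys collide with probability at most 1 over the number of slots, which is n squared here.

```latex
\begin{align*}
E[X] &\le \binom{n}{2} \cdot \frac{1}{n^2} = \frac{n-1}{2n} < \frac{1}{2}, \\
\Pr\{X \ge 1\} &\le \frac{E[X]}{1} < \frac{1}{2}
  \quad \text{(Markov's inequality with } t = 1\text{)}, \\
\Pr\{X = 0\} &= 1 - \Pr\{X \ge 1\} > \frac{1}{2}.
\end{align*}
```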
795
01:11:56,363 --> 01:12:02,000
So to find a good level-two hash
function is easy.
796
01:12:02,000 --> 01:12:06,562
I just test a few at random.
Most of them out there,
797
01:12:06,562 --> 01:12:10,856
OK, half of them,
at least half of them are going
798
01:12:10,856 --> 01:12:13,808
to work.
So this is in some sense,
799
01:12:13,808 --> 01:12:18,102
if you think about it,
a randomized construction,
800
01:12:18,102 --> 01:12:22,664
because I can't tell you which
one it's going to be.
801
01:12:22,664 --> 01:12:27,763
It's non-constructive in that
sense, but it's a randomized
802
01:12:27,763 --> 01:12:32,485
construction.
But they have to exist because
803
01:12:32,485 --> 01:12:36,297
most of them out there have this
good property.
804
01:12:36,297 --> 01:12:40,605
So I'm going to be able to find
for each one of these,
805
01:12:40,605 --> 01:12:44,168
I just test a few at random,
and I find one.
806
01:12:44,168 --> 01:12:47,068
Test a few at random,
find one, etc.
807
01:12:47,068 --> 01:12:50,548
Fill in my table there.
Because all that is
808
01:12:50,548 --> 01:12:53,945
pre-computation.
And I'm going to find them
809
01:12:53,945 --> 01:12:57,342
because the odds are good that
one exists.
810
01:12:57,342 --> 01:12:59,000
So --
811
01:13:13,000 --> 01:13:14,000
-- we just test a few at random.
812
01:13:24,000 --> 01:13:25,000
And we'll find one quickly --
813
01:13:32,000 --> 01:13:34,300
-- since at least half will
work.
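Here is a minimal sketch of that "test a few at random" loop. It is an illustration, not the construction from the lecture: the family h(k) = ((a*k + b) mod P) mod m is a standard universal family swapped in for convenience, and the names P, random_hash and find_collision_free are made up for this example. Keys are assumed to be integers in the range 0 to P - 1.

```python
import random

# Assumption: the Mersenne prime 2**31 - 1; every key is an int in [0, P).
P = 2_147_483_647

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (hypothetical helper)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

def find_collision_free(keys, rng):
    """Retry random hash functions into n**2 slots until no two keys collide."""
    m = len(keys) ** 2
    tries = 0
    while True:
        tries += 1
        h = random_hash(m, rng)
        # No collisions exactly when the n keys land in n distinct slots.
        if len({h(k) for k in keys}) == len(keys):
            return h, m, tries

rng = random.Random(0)
keys = rng.sample(range(P), 50)
h, m, tries = find_collision_free(keys, rng)
print("slots:", m, "tries needed:", tries)
```

Since each try succeeds with probability at least a half, the expected number of tries is at most two, which is why this pre-computation is cheap.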
814
01:13:34,300 --> 01:13:37,679
I just want to show that there
exist good ones.
815
01:13:37,679 --> 01:13:41,777
All I have to prove is that at
least one works for each of
816
01:13:41,777 --> 01:13:44,366
these cases.
In fact, I've shown that
817
01:13:44,366 --> 01:13:46,954
there's a huge number that will
work.
818
01:13:46,954 --> 01:13:50,189
Half of them will work.
But to show it exists,
819
01:13:50,189 --> 01:13:54,647
I would just have to show that
the probability was greater than
820
01:13:54,647 --> 01:13:55,941
zero. So to finish up,
821
01:13:55,941 --> 01:14:00,254
we need to still analyze the
storage because I promised in my
822
01:14:00,254 --> 01:14:05,000
theorem that the table would be
of size order n.
823
01:14:05,000 --> 01:14:12,702
And yet now I've said there's
all of these quadratic-sized
824
01:14:12,702 --> 01:14:18,378
slots here.
So I'm going to show that that's
825
01:14:18,378 --> 01:14:20,000
order n.
826
01:14:31,000 --> 01:14:35,605
So for level one,
that's easy.
827
01:14:35,605 --> 01:14:45,450
We'll just choose the number of
slots to be equal to the number
828
01:14:45,450 --> 01:14:51,008
of keys.
And that way the storage at
829
01:14:51,008 --> 01:14:59,583
level one is just order n.
And now let's let n sub i be
830
01:14:59,583 --> 01:15:08,000
the random variable for the
number of keys --
831
01:15:13,000 --> 01:15:21,712
-- that hash to slot i in T.
OK, so n sub i is just what
832
01:15:21,712 --> 01:15:28,683
we've called it.
The number of elements in that slot
833
01:15:28,683 --> 01:15:34,386
there.
And we're going to use m sub i
834
01:15:34,386 --> 01:15:45,000
equals n sub i squared slots in
each level two table S sub i.
835
01:15:45,000 --> 01:15:47,000
So the expected total storage --
836
01:15:54,000 --> 01:16:01,085
-- is just n for level one,
order n if you want,
837
01:16:01,085 --> 01:16:09,979
but basically n slots for level
one plus the expected value,
838
01:16:09,979 --> 01:16:19,326
whatever I expect the sum of i
equals 0 to m minus one of theta
839
01:16:19,326 --> 01:16:24,000
of n sub i squared to be.
840
01:16:30,000 --> 01:16:36,048
Because I basically have to add
up the square for every element
841
01:16:36,048 --> 01:16:40,731
that applies here,
the square of what's in there.
842
01:16:40,731 --> 01:16:46,682
Who recognizes this summation?
Where have we seen that before?
843
01:16:46,682 --> 01:16:51,951
Who attends recitation?
Where have we seen this before?
844
01:16:51,951 --> 01:16:54,000
What's the --
845
01:17:03,000 --> 01:17:06,000
We're summing the expected
value of a bunch of --
846
01:17:11,000 --> 01:17:14,959
Yeah, what was that algorithm?
We did the sorting algorithm,
847
01:17:14,959 --> 01:17:17,375
right?
What was the sorting algorithm
848
01:17:17,375 --> 01:17:21,000
for which this was an important
thing to evaluate?
849
01:17:26,000 --> 01:17:29,272
Don't everybody shout it out at
once.
850
01:17:29,272 --> 01:17:33,000
What was that sorting algorithm
called?
851
01:17:33,000 --> 01:17:35,397
Bucket sort.
Good.
852
01:17:35,397 --> 01:17:37,794
Bucket sort.
Yeah.
853
01:17:37,794 --> 01:17:46,397
We showed that the sum of the
squares of random variables when
854
01:17:46,397 --> 01:17:53,025
they're falling randomly into n
bins is order n.
855
01:17:53,025 --> 01:17:55,000
Right?
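To see the order-n claim numerically, here is a small simulation sketch. It hashes n keys into n level-one slots and adds up n sub i squared, the total number of level-two slots. The hash family and the names P, random_hash and level_two_storage are assumptions for this example, not the lecture's own construction. With a universal family the expected sum is at most 2n - 1, so the printed ratio should come out near 2.

```python
import random

# Assumption: the Mersenne prime 2**31 - 1; every key is an int in [0, P).
P = 2_147_483_647

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (hypothetical helper)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

def level_two_storage(keys, rng):
    """Hash n keys into n level-one slots; return the sum of n_i squared."""
    n = len(keys)
    h = random_hash(n, rng)
    counts = [0] * n
    for k in keys:
        counts[h(k)] += 1
    return sum(c * c for c in counts)

rng = random.Random(1)
n = 1000
keys = rng.sample(range(P), n)
trials = [level_two_storage(keys, rng) for _ in range(200)]
average = sum(trials) / len(trials)
print("average level-two slots per key:", average / n)
```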
856
01:18:16,000 --> 01:18:20,105
And you can also out of this
get a, as we did before,
857
01:18:20,105 --> 01:18:24,131
get a probability bound.
What's the probability that
858
01:18:24,131 --> 01:18:28,315
it's more than a certain amount
times n using Markov's
859
01:18:28,315 --> 01:18:31,394
inequality.
But this is the key thing is
860
01:18:31,394 --> 01:18:36,109
we've seen this analysis.
OK, we used it there in time,
861
01:18:36,109 --> 01:18:39,963
so there's a little bit,
but that's one of the reasons
862
01:18:39,963 --> 01:18:43,963
we study sorting at the
beginning of the term is because
863
01:18:43,963 --> 01:18:47,890
the techniques of sorting,
they just propagate into all
864
01:18:47,890 --> 01:18:52,327
these other areas of analysis.
You see a lot of the same kinds
865
01:18:52,327 --> 01:18:55,309
of things.
And so now that you know bucket
866
01:18:55,309 --> 01:18:59,018
sort clearly so well,
now you know this without
867
01:18:59,018 --> 01:19:04,610
having to do any extra work.
So you might want to go back
868
01:19:04,610 --> 01:19:09,925
and review your bucket sort
analysis, because it's applied
869
01:19:09,925 --> 01:19:11,604
now.
Same analysis.
870
01:19:11,604 --> 01:19:12,909
Two places.
OK.
871
01:19:12,909 --> 01:19:18,411
Good. Recitation this Friday,
which will be a quiz review and
872
01:19:18,411 --> 01:19:22,794
we have a quiz next,
there's no class on Monday,
873
01:19:22,794 --> 01:19:26,151
but we have a quiz on next
Wednesday.
874
01:19:26,151 --> 01:19:31,000
OK, so good luck everybody on
the quiz.
875
01:19:31,000 --> 01:19:34,000
Make sure you get plenty of
sleep.