WEBVTT

00:00:00.080 --> 00:00:02.500
The following content is
provided under a Creative

00:00:02.500 --> 00:00:04.019
Commons license.

00:00:04.019 --> 00:00:06.360
Your support will help
MIT OpenCourseWare

00:00:06.360 --> 00:00:10.730
continue to offer high quality
educational resources for free.

00:00:10.730 --> 00:00:13.340
To make a donation or
view additional materials

00:00:13.340 --> 00:00:17.236
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.236 --> 00:00:17.861
at ocw.mit.edu.

00:00:21.190 --> 00:00:23.410
PROFESSOR: All right,
let's get started.

00:00:23.410 --> 00:00:26.010
Today we have another
data structures topic

00:00:26.010 --> 00:00:30.360
which is, Data
Structure Augmentation.

00:00:30.360 --> 00:00:33.230
The idea here is we're going
to take some existing data

00:00:33.230 --> 00:00:36.715
structure and augment it
to do extra cool things.

00:00:46.390 --> 00:00:49.560
Take some other data
structure that we've covered.

00:00:49.560 --> 00:00:53.640
Typically, that'll be
a balanced search tree,

00:00:53.640 --> 00:00:55.870
like an AVL tree or a 2-3 tree.

00:00:59.480 --> 00:01:04.300
And then we'll modify it to
store extra information which

00:01:04.300 --> 00:01:09.270
will enable additional kinds
of searches, typically,

00:01:09.270 --> 00:01:11.240
and sometimes to
do updates better.

00:01:21.120 --> 00:01:24.040
And in 006, you've seen an
example of this where you took

00:01:24.040 --> 00:01:28.200
AVL trees and augmented AVL
trees so that every node knew

00:01:28.200 --> 00:01:31.290
the number of nodes in
that rooted subtree.

00:01:31.290 --> 00:01:33.630
Today we're going to see
that example but also

00:01:33.630 --> 00:01:36.650
a bunch of other examples,
different types of augmentation

00:01:36.650 --> 00:01:37.719
you could do.

00:01:37.719 --> 00:01:39.760
And we'll start out with
a very simple one, which

00:01:39.760 --> 00:01:49.400
I call easy tree augmentation,
which will include

00:01:49.400 --> 00:01:51.235
subtree size as a special case.

00:01:57.760 --> 00:02:02.220
So with easy tree
augmentation, the idea

00:02:02.220 --> 00:02:06.240
is you have a tree, like
an AVL tree, or 2-3 tree,

00:02:06.240 --> 00:02:07.480
or something like that.

00:02:07.480 --> 00:02:09.620
And you'd like to
store, for every node

00:02:09.620 --> 00:02:13.480
x, some function of the
subtree, rooted at x.

00:02:13.480 --> 00:02:15.220
Such as the number
of nodes in there,

00:02:15.220 --> 00:02:17.382
or the sum of the
weights of the nodes,

00:02:17.382 --> 00:02:19.090
or the sum of the
squares of the weights,

00:02:19.090 --> 00:02:26.610
or the min, or the max, or the
median maybe, I'm not sure.

00:02:26.610 --> 00:02:32.140
Some function f of x which
is a function of that.

00:02:32.140 --> 00:02:35.710
Maybe not f of x,
but we want to store

00:02:35.710 --> 00:02:37.430
some function of that subtree.

00:02:42.120 --> 00:02:49.325
Say the goal is to store
f of the subtree rooted

00:02:49.325 --> 00:03:07.584
at x at each node x in a
field which I'll call x.f.

00:03:07.584 --> 00:03:11.530
So, normally nodes have a left
child, right child, parent.

00:03:11.530 --> 00:03:13.770
But we're going to
store an extra field

00:03:13.770 --> 00:03:18.270
x.f for some function
that you define.

00:03:18.270 --> 00:03:22.590
This is not always
possible, but here's

00:03:22.590 --> 00:03:25.350
a case where it is possible.

00:03:25.350 --> 00:03:28.730
That's going to
be the easy case.

00:03:28.730 --> 00:03:34.710
Suppose x.f can be
computed locally

00:03:34.710 --> 00:03:37.905
using lower information,
lower nodes.

00:03:47.410 --> 00:03:48.880
And we'll say,
let's suppose it can

00:03:48.880 --> 00:03:52.530
be computed in constant time
from information in the node

00:03:52.530 --> 00:04:02.500
x from x's children
and from the f value

00:04:02.500 --> 00:04:04.500
that's stored in the children.

00:04:04.500 --> 00:04:05.750
I'll call that children.f.

00:04:05.750 --> 00:04:08.560
But really, I mean left
child.f, right child.f,

00:04:08.560 --> 00:04:10.290
or if you have a
2-3 tree you have

00:04:10.290 --> 00:04:12.100
three children, potentially.

00:04:12.100 --> 00:04:14.250
And the .f of each of them.

00:04:14.250 --> 00:04:14.750
OK.

00:04:14.750 --> 00:04:19.050
So suppose you can
compute x.f locally just

00:04:19.050 --> 00:04:23.020
using one level down
in constant time.

00:04:23.020 --> 00:04:26.420
Then, as you might expect,
you can update whenever

00:04:26.420 --> 00:04:30.510
a node ends up changing.

00:04:30.510 --> 00:04:33.560
So more formally.

00:04:33.560 --> 00:04:48.510
If some set of nodes
change-- call this set S.

00:05:18.970 --> 00:05:22.540
So I'm stating a very
general theorem here.

00:05:22.540 --> 00:05:28.920
If there is some set of nodes,
which we changed something

00:05:28.920 --> 00:05:30.310
about them.

00:05:30.310 --> 00:05:33.070
We change either
their f field, we

00:05:33.070 --> 00:05:35.220
change some of the data
that's in the node,

00:05:35.220 --> 00:05:39.420
or we do a rotation,
moving those around.

00:05:39.420 --> 00:05:44.840
Then we count the total number
of ancestors of these nodes.

00:05:44.840 --> 00:05:47.100
So this subtree.

00:05:47.100 --> 00:05:49.080
Those are the nodes
that need to be updated

00:05:49.080 --> 00:05:52.870
because we're assuming we
can compute x.f just given

00:05:52.870 --> 00:05:54.080
the children data.

00:05:54.080 --> 00:05:56.880
So if this data is
changing, we have

00:05:56.880 --> 00:05:58.850
to update its
parent's value of f

00:05:58.850 --> 00:06:01.350
because it depends
on this child value.

00:06:01.350 --> 00:06:03.540
We have to update
all those parents,

00:06:03.540 --> 00:06:06.820
all the way up to the root.

00:06:06.820 --> 00:06:10.880
So however many nodes there are
there, that's the total cost.
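
NOTE
A minimal sketch of the update rule just described, assuming a binary
tree with parent pointers and taking f to be the sum of weights in the
subtree. The names Node, weight, recompute and update_ancestors are
illustrative and do not come from the lecture.
```python
class Node:
    """Binary tree node; `weight` and `f` are illustrative names."""
    def __init__(self, weight, left=None, right=None):
        self.weight = weight
        self.left = left
        self.right = right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self
        recompute(self)
def recompute(x):
    """O(1): rebuild x.f from x's own data and its children's stored f."""
    x.f = x.weight + sum(c.f for c in (x.left, x.right) if c is not None)
def update_ancestors(x):
    """After x changes, refresh f on x and every ancestor up to the root."""
    steps = 0
    while x is not None:
        recompute(x)
        x = x.parent
        steps += 1
    return steps
leaf = Node(5)
root = Node(1, Node(2, leaf, Node(3)), Node(4))
leaf.weight = 50                    # one node changes...
touched = update_ancestors(leaf)    # ...so only its path to the root is redone
```
The work is one constant-time recompute per ancestor, which is where
the log n bound for a balanced tree comes from.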

00:06:10.880 --> 00:06:17.010
Now, luckily, in an AVL tree, or
2-3 tree, most balanced search

00:06:17.010 --> 00:06:20.610
structures, the updates
you do are very localized.

00:06:20.610 --> 00:06:23.520
When we do splits in
a 2-3 tree we only

00:06:23.520 --> 00:06:27.590
do it up a single
path to the root.

00:06:27.590 --> 00:06:31.130
So the number of ancestors
here is just going to be log n.

00:06:31.130 --> 00:06:32.360
Same thing with an AVL tree.

00:06:32.360 --> 00:06:34.340
If you look at the
rotations you do,

00:06:34.340 --> 00:06:39.640
they are up a single
leaf to root path.

00:06:39.640 --> 00:06:42.080
And so the number
of ancestors that

00:06:42.080 --> 00:06:44.960
need to be updated is
always order log n.

00:06:44.960 --> 00:06:49.820
Things change, and there are
order log n ancestors of them.

00:06:49.820 --> 00:06:52.830
So this is a little more
general than we need,

00:06:52.830 --> 00:06:56.770
but it's just to point out if
we did log n rotations spread out

00:06:56.770 --> 00:06:59.550
somewhere in the tree,
that would actually be bad

00:06:59.550 --> 00:07:03.490
because the total number of
ancestors could be log squared.

00:07:03.490 --> 00:07:07.860
But because in the
structures we've seen,

00:07:07.860 --> 00:07:12.000
we just work on a single path
to the root, we get log n.

00:07:33.330 --> 00:07:38.790
So in a little more
detail here, whenever we

00:07:38.790 --> 00:07:41.060
do a rotation in an AVL tree.

00:07:49.610 --> 00:07:53.350
Let's say A, B, C, x, y.

00:07:56.160 --> 00:07:57.050
Remember rotations?

00:07:57.050 --> 00:08:00.680
Been a while since
we've done rotations.

00:08:00.680 --> 00:08:04.750
So we haven't changed any
of the nodes in A, B, C,

00:08:04.750 --> 00:08:08.160
but we have changed
the nodes x and y.

00:08:08.160 --> 00:08:13.010
So we're going to have to
trigger an update of y.

00:08:13.010 --> 00:08:16.540
First, we'd want to
update y.f and then

00:08:16.540 --> 00:08:21.420
we're going to trigger
the update to x.f.

00:08:21.420 --> 00:08:24.940
And as long as this one can
be computed from its children,

00:08:24.940 --> 00:08:29.810
then we compute y.f, then we
can compute x from its children.

00:08:29.810 --> 00:08:30.310
All right.

00:08:30.310 --> 00:08:31.810
So a constant number
of extra things

00:08:31.810 --> 00:08:33.850
we need to do whenever
we do a rotation.

00:08:33.850 --> 00:08:38.610
And because the rotations lie
on a single path, total cost

00:08:38.610 --> 00:08:43.659
that-- once we stop doing the
rotations, in AVL insert say,

00:08:43.659 --> 00:08:46.100
then we still have to keep
updating up to the root.

00:08:46.100 --> 00:08:50.440
But there are at most
log n nodes to do that.
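
NOTE
A hedged sketch of the rotation step, assuming f is subtree size and
that y starts as the upper node with left child x. The orientation and
all names are assumptions, and parent pointers are left out for
brevity. The point is the order: the node that ends up lower is
recomputed first.
```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        update_size(self)
def size(node):
    return node.size if node is not None else 0
def update_size(node):
    node.size = 1 + size(node.left) + size(node.right)
def rotate_right(y):
    """y with left child x becomes x with right child y; returns x."""
    x = y.left
    y.left = x.right      # subtree B moves from x to y
    x.right = y
    update_size(y)        # y is now the lower node, so fix it first
    update_size(x)        # x depends on y's fresh value
    return x
A, B, C = Node(1), Node(3), Node(5)
y = Node(4, Node(2, A, B), C)
x = rotate_right(y)
```
Subtrees A, B and C keep their stored sizes; only the two rotated
nodes are recomputed.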

00:08:50.440 --> 00:08:50.940
OK.

00:08:50.940 --> 00:08:53.590
Same thing with 2-3 trees.

00:08:53.590 --> 00:08:55.860
We have a node split.

00:08:55.860 --> 00:08:58.730
So we have, I guess,
three keys, four children.

00:08:58.730 --> 00:09:00.190
That's too many.

00:09:00.190 --> 00:09:06.860
So we split to two nodes
and an extra node up here.

00:09:09.810 --> 00:09:12.380
Then we just trigger an
update of this f value,

00:09:12.380 --> 00:09:15.650
an update of this f value,
and an update of that f value.

00:09:15.650 --> 00:09:21.510
And because that just follows a
single path everything's log n.

00:09:21.510 --> 00:09:23.820
So this is a general
theorem about augmentation.

00:09:23.820 --> 00:09:26.850
Any function that's well
behaved in this sense,

00:09:26.850 --> 00:09:32.260
we can maintain in AVL
trees and 2-3 trees.

00:09:32.260 --> 00:09:37.690
And I'll remind you and state,
a little more generally,

00:09:37.690 --> 00:09:41.674
what you did in 006, which are
called order statistic trees

00:09:41.674 --> 00:09:42.340
in the textbook.

00:09:50.540 --> 00:09:54.850
So here we're going to--
let me first tell you

00:09:54.850 --> 00:09:56.220
what we're trying to achieve.

00:09:56.220 --> 00:09:58.939
This is the abstract data
type, or the interface

00:09:58.939 --> 00:09:59.855
of the data structure.

00:10:03.360 --> 00:10:13.610
We want to do insert, delete,
and say, successor searches.

00:10:16.590 --> 00:10:19.230
It's the usual thing we want
out of a binary search tree.

00:10:19.230 --> 00:10:22.210
Predecessor too, sure.

00:10:22.210 --> 00:10:29.800
We want to do rank of a
given key which is, tell me

00:10:29.800 --> 00:10:38.210
what is the index of that key
in the overall sorted order

00:10:38.210 --> 00:10:40.770
of the items, of the keys?

00:10:40.770 --> 00:10:43.500
We've talked about rank a few
times already in this class.

00:10:46.170 --> 00:10:48.220
Depends whether you
start at 0 or 1,

00:10:48.220 --> 00:10:52.890
but let's say we start at one.

00:10:52.890 --> 00:10:56.017
So if you say rank of the key
that happens to be the minimum,

00:10:56.017 --> 00:10:56.850
you want to get one.

00:10:56.850 --> 00:10:59.224
If you say rank of the key
that happens to be the median,

00:10:59.224 --> 00:11:05.430
you want to get n over
2 plus 1, and so on.

00:11:05.430 --> 00:11:08.960
So it's a natural thing
you might want to find out.

00:11:08.960 --> 00:11:14.100
And the converse
operation is select,

00:11:14.100 --> 00:11:19.280
let's say of i, which is,
give me the key of rank i.

00:11:26.710 --> 00:11:31.620
We've talked about select
as an offline operation.

00:11:31.620 --> 00:11:34.720
Given an array,
find me the median.

00:11:34.720 --> 00:11:38.960
Or find me the n over
seventh rank item.

00:11:38.960 --> 00:11:42.810
And we can do that in linear
time given no data structure.

00:11:42.810 --> 00:11:44.720
Here, we want a
data structure so

00:11:44.720 --> 00:11:48.900
that we can find the
median, or the seventh item,

00:11:48.900 --> 00:11:54.630
or the n over seventh key,
whatever in log n time.

00:11:54.630 --> 00:11:58.995
We want to do all of these
in log n per operation.

00:12:05.241 --> 00:12:05.740
OK.

00:12:05.740 --> 00:12:10.120
So in particular, rank of
select of i should equal i.

00:12:10.120 --> 00:12:12.210
We're trying to find
the item of that rank.

00:12:14.970 --> 00:12:17.520
So far, so good.

00:12:17.520 --> 00:12:22.410
And just to plug these
two parts together.

00:12:22.410 --> 00:12:25.290
We have this data structure
augmentation tool,

00:12:25.290 --> 00:12:27.170
we have this goal
we want to achieve,

00:12:27.170 --> 00:12:29.710
we're going to achieve
this goal by applying

00:12:29.710 --> 00:12:33.125
this technique where f
is just the subtree size.

00:12:33.125 --> 00:12:36.350
It's the number of
nodes in that subtree

00:12:36.350 --> 00:12:39.880
because that will
let us compute rank.

00:12:39.880 --> 00:13:03.890
So we're going to use easy tree
augmentation with f of subtree

00:13:03.890 --> 00:13:06.220
equal to the number of
nodes in the subtree.

00:13:11.230 --> 00:13:13.310
So in order for
this to apply, we

00:13:13.310 --> 00:13:18.090
need to check that given a node
x we can compute x.f just using

00:13:18.090 --> 00:13:19.160
its children.

00:13:19.160 --> 00:13:19.890
This is easy.

00:13:23.330 --> 00:13:24.780
We just add everything up.

00:13:24.780 --> 00:13:29.092
So x.f would be equal to 1.

00:13:29.092 --> 00:13:30.770
That's for x.

00:13:30.770 --> 00:13:37.640
Plus the sum of c.f
for every child c.

00:13:41.980 --> 00:13:46.250
I'll write this as a
Python iteration

00:13:46.250 --> 00:13:48.690
so it looks a little
more like an algorithm.

00:13:48.690 --> 00:13:50.330
I'm trying to be generic here.

00:13:50.330 --> 00:13:52.000
If it's a binary
search tree you just

00:13:52.000 --> 00:13:56.840
do x.left.f, plus x.right.f.

00:13:56.840 --> 00:13:59.240
But this will work
also for 2-3 trees.

00:13:59.240 --> 00:14:02.320
Pick your favorite
data structure.

00:14:02.320 --> 00:14:05.660
As long as there's a constant
number of children then

00:14:05.660 --> 00:14:07.420
this will take constant time.

00:14:07.420 --> 00:14:09.440
So we satisfied this condition.

00:14:09.440 --> 00:14:11.710
So we can do easy
tree augmentation.

00:14:11.710 --> 00:14:13.600
And now we know we
have subtree sizes.

00:14:13.600 --> 00:14:14.520
So given any node.

00:14:14.520 --> 00:14:19.930
We know the number of
descendants below that node.
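
NOTE
A small sketch of the generic size formula, assuming each node keeps
its children in a list so the same line covers binary trees and 2-3
trees. The Node class here is illustrative and counts nodes, not keys.
```python
class Node:
    def __init__(self, children=()):
        self.children = list(children)
        # x.f = 1 for x itself, plus c.f for every child c
        self.f = 1 + sum(c.f for c in self.children)
leaves = [Node() for _ in range(3)]
inner = Node(leaves)            # a node with three children
root = Node([inner, Node()])    # a node with two children
```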

00:14:19.930 --> 00:14:20.790
So that's cool.

00:14:20.790 --> 00:14:24.240
It lets us compute
rank and select.

00:14:24.240 --> 00:14:29.130
I'll just give you those
algorithms, quickly.

00:14:29.130 --> 00:14:31.540
We can check that
they're log n time.

00:14:39.131 --> 00:14:39.630
Yeah.

00:14:39.630 --> 00:14:44.670
So the idea is pretty simple.

00:14:44.670 --> 00:14:47.630
You have some key-- let's
think about binary trees

00:14:47.630 --> 00:14:50.360
now, because it's a
little bit easier.

00:14:50.360 --> 00:14:55.250
We have some item x.

00:14:55.250 --> 00:14:58.810
It has a left subtree,
right subtree.

00:14:58.810 --> 00:15:02.160
And now let's look up from x.

00:15:02.160 --> 00:15:05.670
Just keep calling x.parent.

00:15:05.670 --> 00:15:08.950
So sometimes the parent
is to the right of us

00:15:08.950 --> 00:15:11.890
and sometimes the parent
is to the left of us.

00:15:11.890 --> 00:15:14.600
I'm going to draw this
in a, kind of, funny way.

00:15:18.530 --> 00:15:22.810
But this funny way has
a very special property,

00:15:22.810 --> 00:15:25.890
which is that the
x-coordinate in this diagram

00:15:25.890 --> 00:15:27.220
is the key value.

00:15:27.220 --> 00:15:29.450
Or is the sorted order
of the keys, right?

00:15:29.450 --> 00:15:34.040
Everything in the left subtree
of x has a value less than x.

00:15:34.040 --> 00:15:35.940
If we say all the
keys are different.

00:15:35.940 --> 00:15:38.760
Everything to the right of x
has a value greater than x.

00:15:38.760 --> 00:15:42.130
If x was the left
child of its parent,

00:15:42.130 --> 00:15:45.860
that means this thing
is also greater than x.

00:15:45.860 --> 00:15:48.740
And if we follow a
parent and this was

00:15:48.740 --> 00:15:50.990
the right child of
that parent, that

00:15:50.990 --> 00:15:52.910
means this thing is less than x.

00:15:52.910 --> 00:15:55.330
So that's why I drew it all
the way over to the left.

00:15:55.330 --> 00:15:58.010
This thing is also less
than x because it was a,

00:15:58.010 --> 00:15:59.720
I'll call it a left parent.

00:15:59.720 --> 00:16:01.270
Here we have a right
parent, so that

00:16:01.270 --> 00:16:04.060
means this is something
greater than x.

00:16:04.060 --> 00:16:05.980
And over here we have
a left parent, so this

00:16:05.980 --> 00:16:07.021
is something less than x.

00:16:07.021 --> 00:16:08.152
Let's say that's the root.

00:16:08.152 --> 00:16:09.860
In general, there's
going to be some left

00:16:09.860 --> 00:16:13.530
edges and some right
edges as we go up.

00:16:13.530 --> 00:16:18.170
These arrows will go either
left or right in a binary tree.

00:16:18.170 --> 00:16:23.060
So the rank of x is just
1 plus the number of nodes

00:16:23.060 --> 00:16:24.200
that are less than x.

00:16:24.200 --> 00:16:26.410
Number of keys that
are less than x.

00:16:26.410 --> 00:16:30.370
So there's these guys,
there's these guys,

00:16:30.370 --> 00:16:33.390
and there's whatever's
hanging off-- OK.

00:16:33.390 --> 00:16:36.477
Here I've almost violated
my x-coordinate rule.

00:16:36.477 --> 00:16:38.310
If I make these really
narrow, that's right.

00:16:40.940 --> 00:16:44.300
All of these things,
all of these nodes

00:16:44.300 --> 00:16:46.470
in the left subtrees of
these less than x nodes

00:16:46.470 --> 00:16:48.644
will also be less than x.

00:16:48.644 --> 00:16:50.310
If you think about
these other subtrees,

00:16:50.310 --> 00:16:51.726
they're going to
be bigger than x.

00:16:51.726 --> 00:16:54.760
So we don't really
care about them.

00:16:54.760 --> 00:17:02.420
So we just want to count
up all these nodes and all

00:17:02.420 --> 00:17:03.050
of these nodes.

00:17:06.660 --> 00:17:08.845
So the algorithm to do
that is pretty simple.

00:17:16.030 --> 00:17:20.000
We're just going
to start out with--

00:17:33.550 --> 00:17:37.512
I'm going to switch from
this f notation to size.

00:17:37.512 --> 00:17:38.720
That's a little more natural.

00:17:38.720 --> 00:17:42.210
In general, you might
have many functions.

00:17:42.210 --> 00:17:48.600
Size is the usual
notation for subtree size.

00:17:48.600 --> 00:17:52.100
So we start out by counting
up how many items are here.

00:17:52.100 --> 00:17:54.690
And if we want to
start at a rank of 1,

00:17:54.690 --> 00:17:56.460
if the min has rank
1, then I should also

00:17:56.460 --> 00:17:58.820
do plus 1 for x itself.

00:17:58.820 --> 00:18:02.490
If you wanted to start at zero
you just omit that plus 1.

00:18:02.490 --> 00:18:09.940
And then, all I do is walk up
from x to the root of the tree.

00:18:09.940 --> 00:18:22.340
And whenever we go left
from, say x to x prime.

00:18:22.340 --> 00:18:25.160
So that means we
have an x prime.

00:18:25.160 --> 00:18:27.150
Its right child is x.

00:18:27.150 --> 00:18:31.280
And so when we went from x to
its parent we went to the left.

00:18:31.280 --> 00:18:44.910
Then we say rank plus
equals x prime.left.size

00:18:44.910 --> 00:18:48.620
plus 1 for x prime itself.

00:18:48.620 --> 00:18:52.074
And maybe x
prime.left.size is zero.

00:18:52.074 --> 00:18:53.490
Maybe there's no
nodes over there.

00:18:53.490 --> 00:18:57.650
But at the very least we have
to count those nodes that

00:18:57.650 --> 00:18:59.290
are to the left of us.

00:18:59.290 --> 00:19:00.960
And if there's
anything down here

00:19:00.960 --> 00:19:02.550
we add up all those things.

00:19:02.550 --> 00:19:04.310
So that lets us compute rank.

00:19:04.310 --> 00:19:05.840
How long does it take?

00:19:05.840 --> 00:19:09.770
Well, we're just walking
up one path from a leaf

00:19:09.770 --> 00:19:12.590
to a root-- or not necessarily
a leaf, but from some node x

00:19:12.590 --> 00:19:13.540
to the root.

00:19:13.540 --> 00:19:16.050
And as long as we're using
a balanced structure

00:19:16.050 --> 00:19:18.340
like AVL trees.

00:19:18.340 --> 00:19:22.390
I guess I want binary here,
so let's say AVL trees.

00:19:22.390 --> 00:19:25.190
Then this will take log n time.

00:19:25.190 --> 00:19:28.270
So I'm spending
constant work per step,

00:19:28.270 --> 00:19:30.620
and there's log n steps.

00:19:30.620 --> 00:19:32.140
Clear?

00:19:32.140 --> 00:19:34.717
So that's good old rank.

00:19:34.717 --> 00:19:36.300
Easy to do once you
have subtree size.
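
NOTE
The rank walk above can be sketched as follows, assuming distinct keys
and 1-based ranks. For brevity the helper insert is a plain,
unbalanced BST insert that only maintains sizes. An AVL tree would
add rotations, which is an assumption left out here. All names are
illustrative.
```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None
        self.size = 1
def size(node):
    return node.size if node is not None else 0
def insert(root, key):
    """Plain BST insert that bumps size along the search path."""
    new = Node(key)
    if root is None:
        return new, new
    x = root
    while True:
        x.size += 1
        side = "left" if key < x.key else "right"
        child = getattr(x, side)
        if child is None:
            setattr(x, side, new)
            new.parent = x
            return root, new
        x = child
def rank(x):
    """1-based rank of node x among all keys in its tree."""
    r = size(x.left) + 1
    while x.parent is not None:
        if x.parent.right is x:          # we moved up and to the left
            r += size(x.parent.left) + 1
        x = x.parent
    return r
root, nodes = None, {}
for k in [50, 30, 70, 20, 40, 60, 80]:
    root, nodes[k] = insert(root, k)
ranks = {k: rank(n) for k, n in nodes.items()}
```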

00:19:36.300 --> 00:19:37.650
Let's do select for fun.

00:19:54.980 --> 00:19:57.360
This may seem like review,
but I drew out this picture

00:19:57.360 --> 00:20:00.110
explicitly because we're
going to do it a lot today.

00:20:00.110 --> 00:20:02.436
We'll have pictures like
this a bunch of times.

00:20:02.436 --> 00:20:03.810
Really helps to
think about where

00:20:03.810 --> 00:20:05.750
the nodes are, which
ones are less than x,

00:20:05.750 --> 00:20:08.970
which ones are greater than x.

00:20:08.970 --> 00:20:10.380
Let's do select first.

00:20:13.500 --> 00:20:19.375
This you may not
have seen in 006.

00:20:19.375 --> 00:20:20.750
So we're going to
do the reverse.

00:20:20.750 --> 00:20:25.020
We're going to start at the root
and we're going to walk down.

00:20:25.020 --> 00:20:26.570
Sounds easy enough.

00:20:26.570 --> 00:20:29.784
But now walking down is
kind of like doing a search

00:20:29.784 --> 00:20:31.950
but we don't have a key
we're searching for, we have

00:20:31.950 --> 00:20:34.380
a rank we're searching for.

00:20:34.380 --> 00:20:37.420
So what is that rank?

00:20:40.410 --> 00:20:41.550
Rank is i.

00:20:41.550 --> 00:20:42.300
OK.

00:20:42.300 --> 00:20:44.910
So on the other hand,
we have the node x.

00:20:44.910 --> 00:20:47.616
We'd like to know the rank
of x and compare that to i.

00:20:47.616 --> 00:20:49.990
That will tell us whether we
should go left, or go right,

00:20:49.990 --> 00:20:52.260
or whether we happen
to find the item.

00:20:52.260 --> 00:20:54.250
Now one possibility
is we call rank

00:20:54.250 --> 00:20:56.620
of x to find the rank of x.

00:20:56.620 --> 00:21:02.190
But that's dangerous because I'm
going to have a for loop here

00:21:02.190 --> 00:21:05.500
and it's going to
take log n iterations.

00:21:05.500 --> 00:21:07.460
If at every iteration
I'm computing rank

00:21:07.460 --> 00:21:11.310
of x, and rank costs
log n, then overall cost

00:21:11.310 --> 00:21:13.750
might be log squared n.

00:21:13.750 --> 00:21:17.510
So I can't afford to-- I want
to know what the rank of x

00:21:17.510 --> 00:21:22.470
is but I can't afford to
say rank, open paren, x.

00:21:22.470 --> 00:21:25.110
Because that recursive
call will be too expensive.

00:21:25.110 --> 00:21:27.040
So what is the rank
of x in this case?

00:21:27.040 --> 00:21:30.290
This is a little special.

00:21:30.290 --> 00:21:31.230
What's that?

00:21:31.230 --> 00:21:32.947
AUDIENCE: Number of
left children plus 1.

00:21:32.947 --> 00:21:34.530
PROFESSOR: Number
of left, or the size

00:21:34.530 --> 00:21:35.980
of the left subtree plus 1.

00:21:35.980 --> 00:21:36.480
Yep.

00:21:41.200 --> 00:21:44.900
Plus 1 if we're counting,
starting at one.

00:21:44.900 --> 00:21:45.400
Very good.

00:21:48.529 --> 00:21:51.600
I'm slowly getting better.

00:21:51.600 --> 00:21:52.990
Didn't hit anyone this time.

00:21:52.990 --> 00:21:53.490
OK.

00:21:53.490 --> 00:21:55.680
So at least for the
root, this is the rank,

00:21:55.680 --> 00:21:58.160
and that only takes us constant
time in the special case.

00:21:58.160 --> 00:21:59.320
So we'll have to
check that it still

00:21:59.320 --> 00:22:00.580
holds after I do the loop.

00:22:00.580 --> 00:22:02.270
But it will.

00:22:02.270 --> 00:22:02.920
So, cool.

00:22:02.920 --> 00:22:05.040
Now there are three cases.

00:22:05.040 --> 00:22:06.569
If i equals rank.

00:22:06.569 --> 00:22:08.360
If the rank we're
searching for is the rank

00:22:08.360 --> 00:22:10.318
that we happen to have,
then we're done, right?

00:22:10.318 --> 00:22:13.000
We just return x.

00:22:13.000 --> 00:22:14.900
That's the easy case.

00:22:14.900 --> 00:22:18.510
More likely is that i will be
either less than or greater

00:22:18.510 --> 00:22:20.790
than the rank of x.

00:22:28.710 --> 00:22:29.600
OK.

00:22:29.600 --> 00:22:33.970
So if i is less than the
rank, this is fairly easy.

00:22:33.970 --> 00:22:36.490
We just say x equals x.left.

00:22:40.756 --> 00:22:41.630
Did I get that right?

00:22:41.630 --> 00:22:43.460
Yep.

00:22:43.460 --> 00:22:44.710
In this case, the rank.

00:22:44.710 --> 00:22:46.090
So here we have x.

00:22:46.090 --> 00:22:48.120
It's at rank, rank.

00:22:48.120 --> 00:22:52.476
And then we have the left
subtree and the right subtree.

00:22:52.476 --> 00:22:53.850
And so if the rank we're searching
were searching

00:22:53.850 --> 00:22:56.750
for is less than rank, that
means we know it's in here.

00:22:56.750 --> 00:22:57.900
So we should go left.

00:22:57.900 --> 00:23:01.230
And if we just said x
equals x.left you might ask,

00:23:01.230 --> 00:23:03.420
well what rank are we
searching for in here?

00:23:03.420 --> 00:23:06.560
Well, exactly the same rank.

00:23:06.560 --> 00:23:07.370
Fine.

00:23:07.370 --> 00:23:09.160
That's the easy case.

00:23:09.160 --> 00:23:11.520
In the other situation, if
we're searching in here,

00:23:11.520 --> 00:23:15.090
we're searching for
rank greater than rank.

00:23:15.090 --> 00:23:19.110
Then I want to go
right but the new rank

00:23:19.110 --> 00:23:23.750
that I'm searching for
is local to this subtree.

00:23:23.750 --> 00:23:29.190
I'm searching for
i minus this stuff.

00:23:29.190 --> 00:23:31.380
This stuff is rank.

00:23:31.380 --> 00:23:37.290
So I'm going to let
i be i minus rank.

00:23:37.290 --> 00:23:39.200
Make sure I don't have
any off by 1 errors.

00:23:39.200 --> 00:23:42.541
That seems to be right.

00:23:42.541 --> 00:23:43.040
OK.

00:23:43.040 --> 00:23:44.110
And then I do a loop.

00:23:44.110 --> 00:23:45.140
So I'll write repeat.

00:23:50.570 --> 00:23:53.860
So then I'm going to
go up here and say, OK.

00:23:53.860 --> 00:23:56.150
Now relative to this thing.

00:23:56.150 --> 00:23:59.480
What is the rank of the
root of this subtree?

00:23:59.480 --> 00:24:04.020
Well, it's again going to be
that node .left.size plus 1.

00:24:04.020 --> 00:24:07.410
And now I have the new
rank I'm searching for, i.

00:24:07.410 --> 00:24:08.610
And I just keep going.

00:24:08.610 --> 00:24:12.040
You could write this recursively
if you like, but here's

00:24:12.040 --> 00:24:14.620
an iterative version.

00:24:14.620 --> 00:24:18.230
So it's actually very similar
to the select algorithm

00:24:18.230 --> 00:24:22.370
that we had, like when we
did deterministic linear time

00:24:22.370 --> 00:24:25.550
median finding or
randomized median finding.

00:24:25.550 --> 00:24:29.140
They had a very similar
kind of recursion.

00:24:29.140 --> 00:24:31.150
But in that case, they
were spending linear time

00:24:31.150 --> 00:24:33.970
to do the partition
and that was expensive.

00:24:33.970 --> 00:24:36.540
Here, we're just spending
constant time at each node

00:24:36.540 --> 00:24:39.520
and so the overall
cost is log n.
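
NOTE
A sketch of the iterative select just described, assuming 1-based
ranks and a binary tree whose nodes store subtree size. The build
helper only produces a balanced test tree from sorted keys and is not
part of the lecture. All names are illustrative.
```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.size = 1 + size(left) + size(right)
def size(node):
    return node.size if node is not None else 0
def build(keys):
    """Balanced BST from an already sorted list (test helper)."""
    if not keys:
        return None
    mid = len(keys) // 2
    return Node(keys[mid], build(keys[:mid]), build(keys[mid + 1:]))
def select(root, i):
    """Return the node of 1-based rank i, or None if i is out of range."""
    x = root
    while x is not None:
        rank = size(x.left) + 1          # rank of x within its own subtree
        if i == rank:
            return x
        if i < rank:
            x = x.left                   # same i, smaller subtree
        else:
            i -= rank                    # skip x and its left subtree
            x = x.right
    return None
keys = [3, 8, 15, 16, 23, 42, 57]
root = build(keys)
picked = [select(root, i).key for i in range(1, len(keys) + 1)]
```
Each loop iteration does constant work and moves one level down, so
the cost is proportional to the height of the tree.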

00:24:39.520 --> 00:24:40.280
So that's nice.

00:24:40.280 --> 00:24:41.440
Any questions about that?

00:24:44.480 --> 00:24:45.570
OK.

00:24:45.570 --> 00:24:47.330
I have a note here.

00:24:47.330 --> 00:24:51.840
Subtree size is obvious once you
know that's what you should do.

00:24:51.840 --> 00:24:53.560
Another natural
thing to try to do

00:24:53.560 --> 00:24:55.910
would be to augment,
for each node,

00:24:55.910 --> 00:24:57.370
what is the rank of that node?

00:24:57.370 --> 00:24:59.820
Because then rank is
really easy to find.

00:24:59.820 --> 00:25:02.200
And then select would
basically be a regular search.

00:25:02.200 --> 00:25:03.827
I just look at the
rank of the root,

00:25:03.827 --> 00:25:05.410
I see whether the
rank I'm looking for

00:25:05.410 --> 00:25:08.380
is too big, or too small, and I
go left or right, accordingly.

00:25:08.380 --> 00:25:11.890
What would be bad about
augmenting with rank of a node?

00:25:14.430 --> 00:25:15.400
Updates.

00:25:15.400 --> 00:25:15.900
Why?

00:25:19.300 --> 00:25:22.395
What's a bad example
for an update?

00:25:22.395 --> 00:25:25.132
AUDIENCE: If you add
a new minimum element.

00:25:25.132 --> 00:25:25.840
PROFESSOR: Right.

00:25:25.840 --> 00:25:27.650
Say we insert a new
minimum element.

00:25:31.810 --> 00:25:33.810
Good catch, cameraman.

00:25:33.810 --> 00:25:36.510
That was for the
camera, obviously.

00:25:36.510 --> 00:25:37.610
So, right.

00:25:37.610 --> 00:25:41.560
If we insert, this is off to
the side, but say we insert,

00:25:41.560 --> 00:25:43.600
I'll call it minus infinity.

00:25:43.600 --> 00:25:46.170
A new key that is smaller
than all other keys,

00:25:46.170 --> 00:25:49.140
then the rank of
every node changes.

00:25:49.140 --> 00:25:53.280
So that's bad.

00:25:53.280 --> 00:25:55.695
It means that easy tree
augmentation, in particular,

00:25:55.695 --> 00:25:56.570
isn't going to apply.

00:25:56.570 --> 00:25:59.807
And furthermore, it would
take linear time to do this.

00:25:59.807 --> 00:26:02.390
And you could keep inserting,
if you insert keys in decreasing

00:26:02.390 --> 00:26:04.850
order from there, every
time you do an insert,

00:26:04.850 --> 00:26:06.730
all the ranks increase by one.

00:26:06.730 --> 00:26:09.360
Maintaining that's going to
cost linear time per update.
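
NOTE
A rough illustration of why storing ranks is costly, assuming 1024
distinct keys and a balanced search tree modeled by binary-search
midpoints over a sorted list. It counts how many stored values would
change when a new minimum is inserted, ignoring rebalancing. The
helper names are illustrative.
```python
def stored_ranks(keys):
    """Rank every key would have to store, 1-based."""
    return {k: i + 1 for i, k in enumerate(sorted(keys))}
def search_path(sorted_keys, key):
    """Keys visited when searching a balanced BST over sorted_keys."""
    path, lo, hi = [], 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        path.append(sorted_keys[mid])
        if key < sorted_keys[mid]:
            hi = mid
        else:
            lo = mid + 1
    return path
keys = list(range(10, 1034))            # 1024 distinct keys
before = stored_ranks(keys)
after = stored_ranks(keys + [0])        # insert a new minimum
ranks_changed = sum(1 for k in keys if before[k] != after[k])
sizes_changed = len(search_path(keys, 0))
```
With stored ranks every existing node changes, while with stored
sizes only the nodes on the search path do.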

00:26:09.360 --> 00:26:14.230
So you have to be really
careful that the function you

00:26:14.230 --> 00:26:16.630
want to store actually
can be maintained.

00:26:16.630 --> 00:26:20.179
Be very careful about that,
say, on the quiz coming up,

00:26:20.179 --> 00:26:21.720
that when you're
augmenting something

00:26:21.720 --> 00:26:23.280
you can actually maintain it.

00:26:23.280 --> 00:26:27.010
For example, it's very hard to
maintain the depths of nodes

00:26:27.010 --> 00:26:31.410
because when you do a rotation
a whole lot of depths change.

00:26:31.410 --> 00:26:32.770
Depth is counting from the root.

00:26:32.770 --> 00:26:34.430
How deep am I?

00:26:34.430 --> 00:26:36.660
When I do a rotation
then this entire subtree

00:26:36.660 --> 00:26:37.560
went down by one.

00:26:37.560 --> 00:26:40.830
This entire subtree
went up by one.

00:26:40.830 --> 00:26:42.100
In this picture.

00:26:42.100 --> 00:26:44.820
But it's very easy to
maintain heights, for example.

00:26:44.820 --> 00:26:46.590
Height counting from
the bottom is OK,

00:26:46.590 --> 00:26:49.740
because I don't affect
the height of a, b, and c.

00:26:49.740 --> 00:26:52.090
I affect it for x and y
but that's just two nodes.

00:26:52.090 --> 00:26:53.890
That I can afford.

00:26:53.890 --> 00:26:56.984
So that's what you want to be
careful of in the easy tree

00:26:56.984 --> 00:26:57.525
augmentation.

00:27:00.470 --> 00:27:03.960
So most of the time easy tree
augmentation does the job.

00:27:03.960 --> 00:27:07.340
But in the remaining
two examples,

00:27:07.340 --> 00:27:10.747
I want to show you cooler
examples of augmentation.

00:27:10.747 --> 00:27:12.330
These are things you
probably wouldn't

00:27:12.330 --> 00:27:17.780
be expected to come up with
on your own, but they're cool.

00:27:17.780 --> 00:27:20.055
And they let us do more
sophisticated operations.

00:27:27.970 --> 00:27:31.220
So the first one is
called level linking.

00:27:36.490 --> 00:27:41.290
And here we're going to do it
in the context of 2-3 trees,

00:27:41.290 --> 00:27:42.255
partly for variety.

00:27:45.690 --> 00:27:49.410
So the idea of level
linking is very simple.

00:27:49.410 --> 00:27:50.450
Let me draw a 2-3 tree.

00:28:03.340 --> 00:28:05.040
Not a very impressive 2-3 tree.

00:28:05.040 --> 00:28:08.030
I guess I don't feel
like drawing too much.

00:28:08.030 --> 00:28:10.710
Level linking is the
idea of, in addition to

00:28:10.710 --> 00:28:13.540
these child and
parent pointers, we're

00:28:13.540 --> 00:28:15.815
going to add links
on all the levels.

00:28:19.557 --> 00:28:21.140
Horizontal links,
you might call them.

00:28:54.180 --> 00:28:54.820
OK.

00:28:54.820 --> 00:28:57.370
So that's nice.

00:28:57.370 --> 00:28:59.350
Two questions-- can we do this?

00:28:59.350 --> 00:29:00.710
And what's it good for?

00:29:00.710 --> 00:29:03.050
So let's start with
can we do this.

00:29:03.050 --> 00:29:05.420
Remember in 2-3 trees all
we have to think about

00:29:05.420 --> 00:29:07.280
are splits and merges.

00:29:07.280 --> 00:29:12.570
So in a split, we have,
for a brief period,

00:29:12.570 --> 00:29:14.950
let's say three
keys, four children.

00:29:14.950 --> 00:29:17.610
That's too many.

00:29:17.610 --> 00:29:19.130
So we change that to--

00:29:26.524 --> 00:29:28.512
I'm going to change
this in a moment.

00:29:28.512 --> 00:29:31.494
For now, this is the split
you know and love, maybe.

00:29:31.494 --> 00:29:32.500
At least know.

00:29:32.500 --> 00:29:35.610
And if we think about where
the level pointers are,

00:29:35.610 --> 00:29:38.910
we have one before.

00:29:38.910 --> 00:29:42.330
And then we just need to
distribute those pointers

00:29:42.330 --> 00:29:44.540
to the two resulting nodes.

00:29:44.540 --> 00:29:47.980
And then we have to create a
new pointer between the nodes

00:29:47.980 --> 00:29:49.100
that we just created.

00:29:49.100 --> 00:29:50.430
This is, of course, easy to do.

00:29:50.430 --> 00:29:50.890
We're here.

00:29:50.890 --> 00:29:51.848
We're taking this node.

00:29:51.848 --> 00:29:53.870
We're splitting it in half.

00:29:53.870 --> 00:29:55.860
So we have the nodes
right in our hands so just

00:29:55.860 --> 00:29:57.710
add pointers between them.

00:29:57.710 --> 00:30:00.840
And the key thing is, there's some
node over here on the left.

00:30:00.840 --> 00:30:02.360
It used to point
to this node, now

00:30:02.360 --> 00:30:04.760
we have to change it to
point to the left version.

00:30:04.760 --> 00:30:06.030
The left half of the node.

00:30:06.030 --> 00:30:07.696
And there's some node
over on the right.

00:30:07.696 --> 00:30:10.140
We have to change its
left pointer to point

00:30:10.140 --> 00:30:13.897
to this right half of the node.

00:30:13.897 --> 00:30:14.480
But that's it.

00:30:14.480 --> 00:30:16.670
Constant time.

00:30:16.670 --> 00:30:20.010
So this doesn't fall under
the category of easy tree

00:30:20.010 --> 00:30:23.870
augmentation because this is
not isolated to the subtree.

00:30:23.870 --> 00:30:27.380
We're also dealing with its
left and right subtrees.

00:30:27.380 --> 00:30:30.450
But still easy to
do in constant time.
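
NOTE
A rough sketch of the level-link bookkeeping during a split, under simplifying assumptions: the Node class, its items list (standing in for the keys or children being divided), and the level_left and level_right fields are hypothetical names, and the update to the parent is left out. The existing node is reused as the left half, so the left neighbor already points at the right place and only the right neighbor needs rewiring.
```python
class Node:
    """Tree node reduced to what the level links need."""
    def __init__(self, items):
        self.items = list(items)
        self.level_left = None   # neighbor to the left on the same level
        self.level_right = None  # neighbor to the right on the same level
#
def split(node):
    """Split an overfull node in half and repair the links around it."""
    mid = len(node.items) // 2
    right = Node(node.items[mid:])
    node.items = node.items[:mid]          # node itself becomes the left half
    right.level_right = node.level_right   # inherit the old right neighbor
    right.level_left = node                # new link between the two halves
    if node.level_right is not None:
        node.level_right.level_left = right  # old right neighbor now sees the right half
    node.level_right = right
    return right
```
Every step touches a constant number of pointers, which matches the constant overhead per split.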

00:30:36.599 --> 00:30:38.140
Merging nodes is
going to be similar.

00:30:48.130 --> 00:30:52.870
If we steal a key from our
parent or from a sibling,

00:30:52.870 --> 00:30:55.380
nothing happens in
terms of level links.

00:30:55.380 --> 00:31:00.040
But if we have, say, an empty
node and a node that cannot

00:31:00.040 --> 00:31:01.290
afford any stealing.

00:31:01.290 --> 00:31:03.880
So we have single child
here, two children,

00:31:03.880 --> 00:31:05.030
and we merge it into--

00:31:09.840 --> 00:31:12.140
We're taking something
from our parent.

00:31:12.140 --> 00:31:13.310
Bringing it down.

00:31:13.310 --> 00:31:15.550
Then we have three
children afterwards.

00:31:15.550 --> 00:31:20.070
Again, we used to have
these level pointers.

00:31:20.070 --> 00:31:22.100
Now we just have
these level pointers.

00:31:22.100 --> 00:31:23.255
It's easy to maintain.

00:31:23.255 --> 00:31:24.880
It's just a constant
size neighborhood.

00:31:24.880 --> 00:31:26.910
Because we have
the level links, we

00:31:26.910 --> 00:31:28.940
can get to our left
and right neighbors

00:31:28.940 --> 00:31:31.200
and change where
the links point to.

00:31:31.200 --> 00:31:34.730
So easy to maintain
in constant time.

00:31:43.390 --> 00:31:45.690
I'll call it constant overhead.

00:31:45.690 --> 00:31:47.300
Every time we do
a split or merge

00:31:47.300 --> 00:31:50.339
we spend additional
constant time to do it.

00:31:50.339 --> 00:31:51.880
We're already spending
constant time.

00:31:51.880 --> 00:31:56.410
So it just changes everything
by a constant factor.

00:31:56.410 --> 00:31:58.440
So far, so good.

00:31:58.440 --> 00:32:01.600
Now, I'm going to have
to tweak this data

00:32:01.600 --> 00:32:02.560
structure a little bit.

00:32:02.560 --> 00:32:03.810
But let me first tell you why.

00:32:03.810 --> 00:32:06.030
What am I trying to achieve
with this data structure?

00:32:16.680 --> 00:32:20.510
What I'm trying to achieve is
something called the finger

00:32:20.510 --> 00:32:21.270
search property.

00:32:34.164 --> 00:32:35.580
So let's just think
about the case

00:32:35.580 --> 00:32:38.220
where I'm doing a
successful search.

00:32:38.220 --> 00:32:41.840
I'm searching for key x and I
find it in the data structure.

00:32:41.840 --> 00:32:43.130
I find it in the tree.

00:32:45.890 --> 00:32:49.890
Suppose I found one-- I
search for x, I found it.

00:32:49.890 --> 00:32:52.089
And then I search
for another key y.

00:32:52.089 --> 00:32:53.630
Actually I think
I'll do the reverse.

00:32:53.630 --> 00:32:56.460
First I found y, now
I'm searching for x.

00:32:56.460 --> 00:32:59.380
If x and y are
nearby in the tree,

00:32:59.380 --> 00:33:02.440
I want this to run
especially fast.

00:33:02.440 --> 00:33:05.090
For example, if x is
the successor of y

00:33:05.090 --> 00:33:07.790
I want this to
take constant time.

00:33:07.790 --> 00:33:10.350
That would be nice.

00:33:10.350 --> 00:33:12.880
In the worst case, when x and y
are very far away from each other

00:33:12.880 --> 00:33:16.150
in the tree, I want
it to take log n time.

00:33:16.150 --> 00:33:19.570
So how could I interpolate
between constant time

00:33:19.570 --> 00:33:22.900
for finding the
successor and log n time

00:33:22.900 --> 00:33:28.080
for the
worst case search?

00:33:28.080 --> 00:33:35.040
So I'm going to call
this search of x from y.

00:33:35.040 --> 00:33:37.940
Meaning, this is a little
imprecise, but what

00:33:37.940 --> 00:33:41.720
I mean is when I call
search, I tell it

00:33:41.720 --> 00:33:43.330
where I've already found y.

00:33:43.330 --> 00:33:44.160
And here it is.

00:33:44.160 --> 00:33:46.690
Here's the node storing y.

00:33:46.690 --> 00:33:49.160
And now I'm given a key x.

00:33:49.160 --> 00:33:51.780
And I want to find
that key x given

00:33:51.780 --> 00:33:54.600
the node that stores key y.

00:33:54.600 --> 00:33:57.835
So how long should this take?

00:33:57.835 --> 00:33:59.210
What would be a good
way to interpolate

00:33:59.210 --> 00:34:01.420
between constant
time at one extreme,

00:34:01.420 --> 00:34:03.570
the good case, when
x and y are basically

00:34:03.570 --> 00:34:08.620
neighbors in sorted
order, versus

00:34:08.620 --> 00:34:12.535
log n time in the worst case?

00:34:12.535 --> 00:34:14.120
AUDIENCE: Distance
along the graph.

00:34:14.120 --> 00:34:15.620
PROFESSOR: Distance
along the graph.

00:34:15.620 --> 00:34:18.600
That would be one
reasonable definition.

00:34:18.600 --> 00:34:22.050
So I have a tree which you
could think of as a graph.

00:34:22.050 --> 00:34:25.920
Measure the shortest
path length from x to y.

00:34:25.920 --> 00:34:29.340
Or we have a more
sophisticated graph over here.

00:34:29.340 --> 00:34:30.717
Maybe that length.

00:34:30.717 --> 00:34:32.800
The trouble with the
distance in the graph, that's

00:34:32.800 --> 00:34:35.400
a reasonable suggestion,
but it's very data structure

00:34:35.400 --> 00:34:36.090
specific.

00:34:36.090 --> 00:34:38.840
If I use an AVL tree
without level links,

00:34:38.840 --> 00:34:42.230
then the distance
could be one thing,

00:34:42.230 --> 00:34:45.886
whereas if I use a 2-3 tree,
even without level links,

00:34:45.886 --> 00:34:47.469
it's going to be a
different distance.

00:34:47.469 --> 00:34:49.397
If I use a 2-3 tree
with level links

00:34:49.397 --> 00:34:50.980
it's going to be yet
another distance.

00:34:50.980 --> 00:34:53.350
So that's a little unsatisfying.

00:34:53.350 --> 00:34:56.236
I want this to be an
answer to a question.

00:34:56.236 --> 00:34:58.610
I don't want to phrase the
question in terms of that data

00:34:58.610 --> 00:34:59.464
structure.

00:34:59.464 --> 00:35:01.380
AUDIENCE: Difference
between ranks of x and y?

00:35:01.380 --> 00:35:03.720
PROFESSOR: Difference between
ranks of x and y.

00:35:03.720 --> 00:35:04.695
That's close.

00:35:09.830 --> 00:35:12.580
So I'm going to look at the
rank of x and rank of y.

00:35:12.580 --> 00:35:15.010
Let's say, take the
absolute difference.

00:35:15.010 --> 00:35:18.390
That's kind of how far away
they are in sorted order.

00:35:18.390 --> 00:35:20.506
Do you want to add anything?

00:35:20.506 --> 00:35:21.350
AUDIENCE: Log?

00:35:21.350 --> 00:35:21.705
PROFESSOR: Log.

00:35:21.705 --> 00:35:22.205
Yeah.

00:35:24.660 --> 00:35:26.990
Because in the worst case
the difference in ranks

00:35:26.990 --> 00:35:28.400
could be linear.

00:35:28.400 --> 00:35:31.950
So I want to add a log out
here to get log n in that worst

00:35:31.950 --> 00:35:32.450
case.

00:35:35.120 --> 00:35:37.660
Add a big o for safety.

00:35:37.660 --> 00:35:39.569
That's how much time
we want to achieve.

00:35:39.569 --> 00:35:41.360
So this would be the
finger search property

00:35:41.360 --> 00:35:44.850
that you can solve this
problem in this much time.

00:35:44.850 --> 00:35:47.820
Again, difference in
ranks is at most n.

00:35:47.820 --> 00:35:49.670
So this is at most log n.

00:35:49.670 --> 00:35:54.090
But if y is the successor of
x, the rank difference is only one,

00:35:54.090 --> 00:35:56.360
and this will be constant.

00:35:56.360 --> 00:35:59.050
So this is great if you're
doing lots of searches

00:35:59.050 --> 00:36:01.909
and you tend to search for
things that are nearby,

00:36:01.909 --> 00:36:03.950
but sometimes you search
for things that are far away.

00:36:03.950 --> 00:36:06.420
This gives you a nice bound.

00:36:12.100 --> 00:36:15.270
On the one hand, we
have, this is our goal.

00:36:15.270 --> 00:36:16.470
Log difference of ranks.

00:36:16.470 --> 00:36:18.200
On the other hand, we
have the suggestion

00:36:18.200 --> 00:36:20.070
that what we can
achieve is something

00:36:20.070 --> 00:36:21.900
like the distance in the graph.

00:36:25.080 --> 00:36:26.730
But we have a problem with this.

00:36:26.730 --> 00:36:28.650
I used to think that data
structure solved this problem,

00:36:28.650 --> 00:36:29.275
but it doesn't.

00:36:34.620 --> 00:36:37.918
Let me just draw-- actually
I have a tree right there.

00:36:37.918 --> 00:36:39.001
I'm going to use that one.

00:36:44.900 --> 00:36:51.530
Suppose x is here and y is here.

00:36:51.530 --> 00:36:52.030
OK.

00:36:52.030 --> 00:36:55.790
This is a bit of a small tree
but if you think about it

00:36:55.790 --> 00:37:00.670
long enough, this node is
the predecessor of this node.

00:37:00.670 --> 00:37:03.140
So their difference
in ranks should be 1.

00:37:06.290 --> 00:37:09.297
But the distance in
the graph here is two.

00:37:09.297 --> 00:37:10.130
Not very impressive.

00:37:10.130 --> 00:37:13.810
But in general, you have
a tree of height log n.

00:37:13.810 --> 00:37:18.495
If you look at the root, and
the predecessor of the root,

00:37:18.495 --> 00:37:20.120
they will have a rank
difference of one

00:37:20.120 --> 00:37:22.110
by definition of predecessor.

00:37:22.110 --> 00:37:25.561
But the graph distance
will be log n.

00:37:25.561 --> 00:37:28.060
So that's bad news, because if
we're only following pointers

00:37:28.060 --> 00:37:31.980
there's no way to get from
here to there in constant time.

00:37:31.980 --> 00:37:35.340
So we're not quite there.

00:37:35.340 --> 00:37:43.360
We're going to use another tweak
to that data structure, which is to

00:37:43.360 --> 00:37:44.910
store the data in the leaves.

00:37:52.529 --> 00:37:54.820
I tried to find a data structure
that didn't require this

00:37:54.820 --> 00:37:56.000
and still got finger search.

00:37:56.000 --> 00:37:58.000
But as far as I
know, there is none.

00:37:58.000 --> 00:37:58.980
No such data structure.

00:38:01.610 --> 00:38:04.427
If you look at, say,
Wikipedia about B-trees,

00:38:04.427 --> 00:38:06.510
you'll see there's a ton
of variations of B-trees.

00:38:06.510 --> 00:38:08.230
B+-trees, B*-trees.

00:38:08.230 --> 00:38:09.462
This is one of those.

00:38:09.462 --> 00:38:10.170
I think B+-trees.

00:38:12.990 --> 00:38:15.840
As you saw, B-trees or
2-3 trees, every node

00:38:15.840 --> 00:38:19.050
stored one or two keys.

00:38:19.050 --> 00:38:23.080
And each key only
existed in one spot.

00:38:23.080 --> 00:38:26.640
We're still only going to put
each key in one spot, kind of.

00:38:26.640 --> 00:38:30.030
But it's only going
to be the leaf spots.

00:38:30.030 --> 00:38:30.530
OK.

00:38:30.530 --> 00:38:32.411
Good news is most nodes
are leaves, right?

00:38:32.411 --> 00:38:34.660
Constant fraction of the
nodes are going to be leaves.

00:38:34.660 --> 00:38:38.720
So it doesn't change too
much from a space efficiency

00:38:38.720 --> 00:38:39.480
standpoint.

00:38:39.480 --> 00:38:41.970
If we just put data down
here and don't put--

00:38:41.970 --> 00:38:44.050
I'm not going to put any
keys up here for now.

00:38:47.690 --> 00:38:51.080
So this is a little weird.

00:38:51.080 --> 00:38:54.036
Let me draw an example
of such a tree.

00:38:54.036 --> 00:39:06.180
So maybe we have 2, and 5,
and 7, and 8, 9, let's say.

00:39:09.410 --> 00:39:11.040
Let's put 1 here.

00:39:11.040 --> 00:39:14.590
So I'm going to have a node
here with three children, a node

00:39:14.590 --> 00:39:16.600
here with two
children, and here's

00:39:16.600 --> 00:39:17.860
a node with two children.

00:39:17.860 --> 00:39:22.120
So I think this mimics
this tree, roughly.

00:39:22.120 --> 00:39:23.620
I got it exactly right.

00:39:23.620 --> 00:39:26.120
So here I've taken
this tree structure.

00:39:26.120 --> 00:39:27.330
I've redrawn it.

00:39:27.330 --> 00:39:30.197
There's now no keys
in these nodes.

00:39:30.197 --> 00:39:32.030
But everything else is
going to be the same.

00:39:32.030 --> 00:39:34.400
Every node is going
to have 0 children

00:39:34.400 --> 00:39:37.999
if it's a leaf, or two, or
three children otherwise.

00:39:37.999 --> 00:39:39.540
Never have one child
because then you

00:39:39.540 --> 00:39:40.924
wouldn't get logarithmic depth.

00:39:40.924 --> 00:39:42.965
All the leaves are going
to be at the same depth.

00:39:46.380 --> 00:39:47.980
And that's it.

00:39:47.980 --> 00:39:48.480
OK.

00:39:48.480 --> 00:39:52.410
That is a 2-3 tree with the
data stored in the leaves.

00:39:52.410 --> 00:39:54.470
It's a useful trick to know.

00:39:54.470 --> 00:39:56.600
Now we're going to do a
level linked 2-3 tree.

00:39:56.600 --> 00:39:58.620
So in addition to
that picture, we're

00:39:58.620 --> 00:40:00.910
going to have links like this.

00:40:06.455 --> 00:40:06.955
OK.

00:40:06.955 --> 00:40:09.540
And I should check that I can
still do insert and delete

00:40:09.540 --> 00:40:10.570
into these structures.

00:40:10.570 --> 00:40:12.960
It's actually not too hard.

00:40:12.960 --> 00:40:14.070
But let's think about it.

00:40:30.760 --> 00:40:32.730
I think, actually,
it might be easier.

00:40:32.730 --> 00:40:33.600
Let's see.

00:40:36.640 --> 00:40:43.562
So if I want to
do an insert-- OK.

00:40:43.562 --> 00:40:45.520
I have to first search
for where I'm inserting.

00:40:45.520 --> 00:40:48.600
I haven't told you
how to do search yet.

00:40:48.600 --> 00:40:49.150
OK.

00:40:49.150 --> 00:40:51.230
So let's first
think about search.

00:40:55.780 --> 00:41:01.380
What we're going to do is
data structure augmentation.

00:41:01.380 --> 00:41:04.880
We have simple
tree augmentation.

00:41:04.880 --> 00:41:08.080
So I'm going to do
it at each node,

00:41:08.080 --> 00:41:09.730
and the functions
I'm going to store

00:41:09.730 --> 00:41:12.060
are the minimum
key in the subtree,

00:41:12.060 --> 00:41:13.710
and the maximum
key in the subtree.

00:41:23.700 --> 00:41:26.072
There are many ways to
do this, but I think

00:41:26.072 --> 00:41:27.280
this is kind of the simplest.

00:41:33.450 --> 00:41:36.420
So what that means
is at this node,

00:41:36.420 --> 00:41:43.240
I'm going to store 1 as
the min and 7 as the max.

00:41:43.240 --> 00:41:46.400
And at this node it's
going to be 1 as the min

00:41:46.400 --> 00:41:47.650
and 9 as the max.

00:41:47.650 --> 00:41:52.010
And here we have 8 as
the min and 9 as the max.

00:41:52.010 --> 00:41:54.320
Again min and max of
subtrees are easy to store.

00:41:54.320 --> 00:41:57.980
If I ever change a
node I can update it

00:41:57.980 --> 00:41:59.790
based on its children,
just by looking

00:41:59.790 --> 00:42:03.070
at the min of the
leftmost child and the max

00:42:03.070 --> 00:42:04.690
of the rightmost child.

00:42:04.690 --> 00:42:06.980
If I didn't know 1
and 9, I could just

00:42:06.980 --> 00:42:09.300
look at this min and
that max and that's

00:42:09.300 --> 00:42:11.870
going to be the min and the
max of the overall tree.

00:42:11.870 --> 00:42:14.126
So in constant time
I can update the min

00:42:14.126 --> 00:42:16.100
and the max of a
node given the min

00:42:16.100 --> 00:42:18.680
and the max of its children.

00:42:18.680 --> 00:42:19.997
Special case is at the leaves.

00:42:19.997 --> 00:42:22.330
Then you have to actually
look at keys and compare them.

00:42:22.330 --> 00:42:24.510
But leaves only have,
at most, two keys.

00:42:24.510 --> 00:42:29.120
So pretty easy to compare
them in constant time.

00:42:29.120 --> 00:42:29.620
OK.

00:42:29.620 --> 00:42:31.310
So that's how I do
the augmentation.

00:42:31.310 --> 00:42:33.030
Now how do I do a search?

00:42:33.030 --> 00:42:37.405
Well, if I'm at a node and
I'm searching for a key.

00:42:37.405 --> 00:42:38.780
Well, let's say
I'm at this node.

00:42:38.780 --> 00:42:42.290
I'm searching for a key like 8.

00:42:42.290 --> 00:42:44.497
What I'm going to do is
look at all of the children.

00:42:44.497 --> 00:42:45.580
In this case, there's two.

00:42:45.580 --> 00:42:47.280
In the worst case there's three.

00:42:47.280 --> 00:42:50.810
I look at the min and max
and I see where does 8 fall?

00:42:50.810 --> 00:42:52.750
Well it falls in this interval.

00:42:52.750 --> 00:42:56.400
If I was searching for 7
1/2 I know it's not there.

00:42:56.400 --> 00:42:58.120
It's going to be
in between here.

00:42:58.120 --> 00:43:03.420
If I'm doing a successor
then I'll go to the right.

00:43:03.420 --> 00:43:05.510
If I'm doing predecessor
I'll go to the left.

00:43:05.510 --> 00:43:08.860
And then take either the maximum
item or the minimum item.

00:43:08.860 --> 00:43:10.734
If I'm searching
for 8 I see, oh.

00:43:10.734 --> 00:43:12.400
8 falls in the interval
between 8 and 9,

00:43:12.400 --> 00:43:13.970
so I should clearly
take the right child

00:43:13.970 --> 00:43:15.040
among those two children.

00:43:15.040 --> 00:43:16.498
In general, there's
three children.

00:43:16.498 --> 00:43:17.260
Three intervals.

00:43:17.260 --> 00:43:17.930
Constant time.

00:43:17.930 --> 00:43:20.830
I can find where my key
falls in the interval.

00:43:20.830 --> 00:43:21.330
OK.

00:43:21.330 --> 00:43:25.200
So search is going to take
log n time again, provided

00:43:25.200 --> 00:43:26.570
I have these mins and maxes.

00:43:29.075 --> 00:43:31.130
If you stare at it
long enough, this

00:43:31.130 --> 00:43:34.960
is pretty much the same thing
as regular search in a 2-3 tree.

00:43:34.960 --> 00:43:39.850
But I've put the data
just one level down.
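
NOTE
A small sketch of the min/max augmentation and the downward search it enables, assuming a leaf-stored 2-3 tree like the one drawn on the board with leaves 1 / 2,5 / 7 / 8 / 9. The class Node and the function search_down are hypothetical names chosen for illustration, and leaves are assumed to be non-empty.
```python
class Node:
    """Node of a 2-3 tree that keeps all keys in its leaves."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # non-empty only at a leaf
        self.children = children or []  # two or three nodes, or none at a leaf
        self.update()
    def update(self):
        """Recompute min and max in constant time from the outermost children."""
        if self.children:
            self.min = self.children[0].min
            self.max = self.children[-1].max
        else:
            self.min, self.max = min(self.keys), max(self.keys)
#
def search_down(v, x):
    """Return the leaf that holds x, or a leaf adjacent to x's position if x is absent."""
    while v.children:
        for child in v.children:
            if x <= child.max:
                break  # first child whose range ends at or after x
        v = child      # falls through to the last child when x exceeds every max
    return v
```
Each level inspects at most three intervals, so the whole descent costs time proportional to the height, which is log n.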

00:43:39.850 --> 00:43:40.350
OK.

00:43:43.180 --> 00:43:44.650
Good.

00:43:44.650 --> 00:43:46.430
That was regular search.

00:43:46.430 --> 00:43:49.149
I still need to do finger
search, but we'll get there.

00:43:49.149 --> 00:43:51.190
And now, if I want to do
an insert into this data

00:43:51.190 --> 00:43:54.330
structure, what happens.

00:43:54.330 --> 00:43:57.700
Well, I search for the key.
Let's say I'm inserting 6.

00:43:57.700 --> 00:43:59.180
So maybe I go here.

00:43:59.180 --> 00:44:00.760
I say, because 6

00:44:00.760 --> 00:44:02.380
is in this interval.

00:44:02.380 --> 00:44:04.550
6 is in neither of
these intervals.

00:44:04.550 --> 00:44:07.350
But it's closest to the interval
2, 5, or the interval 7.

00:44:07.350 --> 00:44:09.910
Let's say I go down to 2, 5.

00:44:09.910 --> 00:44:13.300
And well, to insert 6 I'll
just add a 6 on there.

00:44:13.300 --> 00:44:15.410
Of course, now that
node is too big.

00:44:15.410 --> 00:44:18.840
So there's still going to be a
split case at the leaves where

00:44:18.840 --> 00:44:24.460
I have let's say,
a,b,c, too many keys.

00:44:24.460 --> 00:44:29.140
I'm going to split
that into a,b and c.

00:44:29.140 --> 00:44:30.850
This is different from before.

00:44:30.850 --> 00:44:33.800
It used to be I would
promote b to the parent

00:44:33.800 --> 00:44:36.030
because the parent
needed the key there.

00:44:36.030 --> 00:44:37.810
Now parents don't have keys.

00:44:37.810 --> 00:44:42.000
So I'm just going to split
this thing, roughly, in half.

00:44:42.000 --> 00:44:44.510
It works.

00:44:44.510 --> 00:44:47.220
It's still the case that
whoever was the parent up

00:44:47.220 --> 00:44:50.460
here now has an
additional child.

00:44:50.460 --> 00:44:51.210
One more child.

00:44:51.210 --> 00:44:54.020
So maybe that node
now has four children

00:44:54.020 --> 00:44:56.390
but it's supposed
to be two or three.

00:44:56.390 --> 00:45:01.630
So if I have a node with four
children, what I'm going to do,

00:45:01.630 --> 00:45:05.140
I'm supposed to use
these fancy arrows.

00:45:05.140 --> 00:45:06.820
What do I do in this case?

00:45:06.820 --> 00:45:09.170
I'm just going to split
that into two nodes with two

00:45:09.170 --> 00:45:10.860
children.

00:45:10.860 --> 00:45:13.410
And again this used
to have a parent.

00:45:13.410 --> 00:45:16.120
Now that parent has
an additional child,

00:45:16.120 --> 00:45:17.830
and that may cause
another split.

00:45:17.830 --> 00:45:19.030
It's just like before.

00:45:19.030 --> 00:45:23.160
We just potentially split
all the way up to the root.

00:45:23.160 --> 00:45:26.610
If we split the root then
we get an additional level.

00:45:26.610 --> 00:45:32.040
But we could do all this and
we can still maintain our level

00:45:32.040 --> 00:45:32.970
links, if we want.

00:45:37.430 --> 00:45:38.770
But everything will take log n.

00:45:38.770 --> 00:45:41.940
I won't draw the
delete case, as delete

00:45:41.940 --> 00:45:43.660
is slightly more annoying.

00:45:43.660 --> 00:45:45.160
But I think, in
this case, you never

00:45:45.160 --> 00:45:47.290
have to worry about where
is the key coming from,

00:45:47.290 --> 00:45:49.740
your child or your parent?

00:45:49.740 --> 00:45:53.056
You're just merging nodes so
it's a little bit simpler.

00:45:53.056 --> 00:45:54.680
But you have to deal
with the leaf case

00:45:54.680 --> 00:45:56.880
separately from
the nonleaf case.

00:45:56.880 --> 00:45:57.380
OK.

00:45:57.380 --> 00:45:58.860
So all this was to
convince you that we

00:45:58.860 --> 00:46:00.068
can store data in the leaves.

00:46:00.068 --> 00:46:02.310
2-3 trees still work fine.

00:46:02.310 --> 00:46:06.650
Now I claim that the graph
distance in level link trees

00:46:06.650 --> 00:46:10.610
is within a constant factor
of the finger search bound.

00:46:10.610 --> 00:46:14.630
So I claim I can get the finger
search property in 2-3 trees,

00:46:14.630 --> 00:46:17.360
with data in the leaves,
with level links.

00:46:17.360 --> 00:46:19.490
So lots of changes here.

00:46:19.490 --> 00:46:24.210
But in the end, we're going
to get a finger search bound.

00:46:24.210 --> 00:46:25.970
Let's go over here.

00:46:45.805 --> 00:46:47.305
So here's a finger
search operation.

00:46:50.360 --> 00:46:53.580
First thing I want to
do is identify a node

00:46:53.580 --> 00:46:55.670
that I'm working with.

00:46:55.670 --> 00:46:57.690
I want to start from y's node.

00:46:57.690 --> 00:47:01.840
So we're supposing that we're
told the node, a leaf, that

00:47:01.840 --> 00:47:02.590
contains y.

00:47:02.590 --> 00:47:06.770
So I'm going to
let v be that leaf.

00:47:14.670 --> 00:47:15.170
OK.

00:47:15.170 --> 00:47:18.357
Because we're supposing
we've already found y,

00:47:18.357 --> 00:47:19.940
and now all the data
is in the leaves.

00:47:19.940 --> 00:47:22.934
So give me the leaf
that contains y.

00:47:22.934 --> 00:47:24.350
So that should
take constant time.

00:47:24.350 --> 00:47:26.790
That's just part of the input.

00:47:26.790 --> 00:47:29.550
Now I'm going to do a
combination of going up

00:47:29.550 --> 00:47:31.850
and horizontal.

00:47:31.850 --> 00:47:33.990
So starting at a leaf.

00:47:33.990 --> 00:47:37.430
And the first thing I'm
going to do is check,

00:47:37.430 --> 00:47:41.910
does this leaf
contain what I want?

00:47:41.910 --> 00:47:44.850
Does it contain the key I'm
searching for, which is x?

00:47:44.850 --> 00:47:46.595
So that's going to be the case.

00:47:46.595 --> 00:47:49.310
At every node I store
the min and the max.

00:47:49.310 --> 00:47:53.770
So if x happens to fall
between the min and the max,

00:47:53.770 --> 00:47:56.010
then I'm happy.

00:47:56.010 --> 00:48:06.320
Then I'm going to do a
regular search in v's subtree.

00:48:06.320 --> 00:48:08.007
This seems weird in
the case of a leaf.

00:48:08.007 --> 00:48:09.465
In the case of a
leaf, this is just

00:48:09.465 --> 00:48:11.580
to check the two
keys that are there.

00:48:11.580 --> 00:48:12.961
Which one is x.

00:48:12.961 --> 00:48:13.460
OK.

00:48:13.460 --> 00:48:16.460
But in general I gave
you this search algorithm

00:48:16.460 --> 00:48:20.830
which was, I decide
which child to take,

00:48:20.830 --> 00:48:23.760
according to the ranges,
that's a downward search.

00:48:23.760 --> 00:48:26.204
So that's what I'm calling
regular search here.

00:48:26.204 --> 00:48:27.870
Maybe downward would
be a little better.

00:48:33.710 --> 00:48:37.650
This is the usual
log n time thing.

00:48:37.650 --> 00:48:40.980
But we're going to claim
a bound better than log n.

00:48:40.980 --> 00:48:43.150
If this is not the
case, then I know

00:48:43.150 --> 00:48:46.260
x either falls before
v.min or after v.max.

00:48:49.260 --> 00:48:58.100
So if x is less than v.min
then I'm going to go left.

00:48:58.100 --> 00:49:04.710
v equals v dot-- I'll call it
level left to be clear.

00:49:04.710 --> 00:49:06.397
You might say left
is the left child.

00:49:06.397 --> 00:49:07.980
There's no left child
here, of course.

00:49:07.980 --> 00:49:09.490
But level left is clear.

00:49:09.490 --> 00:49:13.100
We take the horizontal
left pointer.

00:49:13.100 --> 00:49:18.000
And otherwise x is
greater than v.max.

00:49:18.000 --> 00:49:21.374
And in that case
I will go right.

00:49:21.374 --> 00:49:22.165
That seems logical.

00:49:25.580 --> 00:49:32.650
And in both cases
we're going to go up.

00:49:32.650 --> 00:49:36.250
x equals x.parent. Whoops.

00:49:36.250 --> 00:49:39.050
v equals v.parent.

00:49:39.050 --> 00:49:40.060
x is not changing here.

00:49:40.060 --> 00:49:43.470
x is the key we're searching
for. v is the node.

00:49:43.470 --> 00:49:46.020
V for vertex.

00:49:46.020 --> 00:49:48.990
So we're always going
to go up, and then

00:49:48.990 --> 00:49:51.329
we're going to go
either left or right,

00:49:51.329 --> 00:49:53.120
and we're going to keep
doing that until we

00:49:53.120 --> 00:49:56.690
find a subtree that contains
x in terms of key range.

00:49:56.690 --> 00:49:58.630
Then we're going
to stop this part

00:49:58.630 --> 00:50:00.780
and we're just going
to do downward search.

00:50:00.780 --> 00:50:03.894
I should say return
here or something.

00:50:03.894 --> 00:50:05.560
I'm going to do a
downward search, which

00:50:05.560 --> 00:50:07.860
was this regular algorithm.

00:50:07.860 --> 00:50:12.140
And then whatever it finds,
that's what I return.
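
NOTE
A hedged sketch of the finger search loop just described, assuming a leaf-stored 2-3 tree with min/max at every node plus parent and level links. The names Node, link_level, search_down, and finger_search are illustrative only, and the stop at the root is an added guard for keys outside the stored range, which the lecture does not spell out.
```python
class Node:
    """Leaf-stored 2-3 tree node with min/max, a parent pointer, and level links."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []
        self.parent = None
        self.level_left = None
        self.level_right = None
        for child in self.children:
            child.parent = self
        if self.children:
            self.min, self.max = self.children[0].min, self.children[-1].max
        else:
            self.min, self.max = min(self.keys), max(self.keys)
#
def link_level(nodes):
    """Connect the nodes of one level, given in left-to-right order."""
    for a, b in zip(nodes, nodes[1:]):
        a.level_right, b.level_left = b, a
#
def search_down(v, x):
    """Regular downward search guided by the children's max values."""
    while v.children:
        for child in v.children:
            if x <= child.max:
                break
        v = child
    return v
#
def finger_search(v, x):
    """Search for key x starting from leaf v, the leaf that holds y."""
    while not (v.min <= x <= v.max):
        step = v.level_left if x < v.min else v.level_right
        if step is not None:
            v = step          # one horizontal step toward x
        if v.parent is None:
            break             # at the root: x lies outside every stored key
        v = v.parent          # then one step up
    return search_down(v, x)
```
Starting from the leaf holding 7 and searching for 8 takes one step right, one step up, and a one-level descent, without ever visiting the root.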

00:50:12.140 --> 00:50:14.712
I claim the algorithm
should be clear.

00:50:14.712 --> 00:50:16.670
What's less clear is that
it achieves the bound

00:50:16.670 --> 00:50:17.840
that we want.

00:50:17.840 --> 00:50:20.390
But I claim that this will
achieve the finger search

00:50:20.390 --> 00:50:23.140
property.

00:50:23.140 --> 00:50:25.410
Let me draw a picture
of what this thing looks

00:50:25.410 --> 00:50:33.175
like kind of generically.

00:50:33.175 --> 00:50:35.880
On small examples it's hard
to see what's going on.

00:50:35.880 --> 00:50:38.925
So I'm going to draw a
piece of a large example.

00:50:44.160 --> 00:50:47.190
Let's say we start here.

00:50:47.190 --> 00:50:49.395
This is where y was.

00:50:49.395 --> 00:50:50.570
I'm searching for x.

00:50:50.570 --> 00:50:52.910
Let's suppose x is to the right.

00:50:52.910 --> 00:50:55.600
'Cause otherwise I go
to the other board.

00:50:55.600 --> 00:50:57.290
So x is to the right.

00:50:57.290 --> 00:51:00.740
I'll discover that the range
with just this node, this node

00:51:00.740 --> 00:51:02.440
maybe contains one other key.

00:51:02.440 --> 00:51:05.410
I'll find that
range is too small.

00:51:05.410 --> 00:51:08.660
So I'm going to go follow
the level right pointer,

00:51:08.660 --> 00:51:10.930
and I get to some other node.

00:51:10.930 --> 00:51:12.600
Then I'm going to
go to the parent.

00:51:12.600 --> 00:51:15.910
Maybe the parent was the
parent of those two children

00:51:15.910 --> 00:51:17.930
so I'm going to
draw it like that.

00:51:17.930 --> 00:51:21.340
Maybe I find this
range is still too low.

00:51:21.340 --> 00:51:24.810
I need to go right to get to x,
so I'm going to follow a level

00:51:24.810 --> 00:51:26.630
pointer to the right.

00:51:26.630 --> 00:51:29.340
I find a new subtree.

00:51:29.340 --> 00:51:31.490
I'll go to its parent.

00:51:31.490 --> 00:51:35.160
Maybe I find that this subtree,
still the max is too small.

00:51:35.160 --> 00:51:37.360
So I have to go to
the right again.

00:51:37.360 --> 00:51:38.740
And then I take the parent.

00:51:38.740 --> 00:51:41.496
So this was an example
of a rightward parent.

00:51:41.496 --> 00:51:43.120
Here's an example of
a leftward parent.

00:51:43.120 --> 00:51:46.860
This is maybe the parent of
both of these two children.

00:51:46.860 --> 00:51:49.860
Then maybe this subtree
is still too small,

00:51:49.860 --> 00:51:52.300
the max is still smaller than x.

00:51:52.300 --> 00:51:56.170
So then I go right
one more time.

00:51:56.170 --> 00:51:57.470
Then I follow the parent.

00:51:57.470 --> 00:51:59.950
Always alternating
between right and parent

00:51:59.950 --> 00:52:06.170
until I find a node
whose subtree contains x.

00:52:06.170 --> 00:52:08.660
It might have actually, x
may be down here, because I

00:52:08.660 --> 00:52:11.000
immediately went to the
parent without checking

00:52:11.000 --> 00:52:13.370
whether I found where x is.

00:52:13.370 --> 00:52:15.650
But if I know that x is
somewhere in here then

00:52:15.650 --> 00:52:18.180
I will do a downward search.

00:52:18.180 --> 00:52:21.050
It might go left and then down
here, or it might go right,

00:52:21.050 --> 00:52:23.590
or there's actually
potentially three children.

00:52:23.590 --> 00:52:27.030
One of these searches
will find the key

00:52:27.030 --> 00:52:31.210
x that I'm looking
for because I'm

00:52:31.210 --> 00:52:34.580
in the case where x is
between v.min and v.max,

00:52:34.580 --> 00:52:37.090
so I know it's in
there, somewhere.

00:52:37.090 --> 00:52:40.670
It could be x doesn't exist, but
its predecessor or successor

00:52:40.670 --> 00:52:42.710
is in there somewhere.

00:52:42.710 --> 00:52:45.320
And so one of these
three subtrees

00:52:45.320 --> 00:52:47.310
will contain the x range.

00:52:47.310 --> 00:52:49.720
And then I go follow that path.

00:52:49.720 --> 00:52:53.440
And keep going down until I
find x or its predecessor

00:52:53.440 --> 00:52:54.200
or successor.

00:52:54.200 --> 00:52:57.440
Once I find its predecessor I
can use a level right pointer

00:52:57.440 --> 00:53:01.270
to find its
successor, and so on.

00:53:01.270 --> 00:53:03.620
So that's kind of the general
picture what's going on.

00:53:03.620 --> 00:53:06.700
We keep going rightward
and we keep going up.

00:53:10.580 --> 00:53:15.780
Suppose we do k up steps.

00:53:15.780 --> 00:53:19.950
Let's look at this
last step here.

00:53:19.950 --> 00:53:20.490
Step k.

00:53:25.720 --> 00:53:27.080
How high am I in the tree?

00:53:27.080 --> 00:53:28.400
I started at the leaf level.

00:53:28.400 --> 00:53:31.720
Remember in a 2-3 tree all the
leaves have the same level.

00:53:31.720 --> 00:53:34.730
And I went up every step.

00:53:34.730 --> 00:53:36.270
Sorry.

00:53:36.270 --> 00:53:39.540
I don't know what this
is, like the 2-step dance

00:53:39.540 --> 00:53:44.140
where, let's say every
iteration of this loop I

00:53:44.140 --> 00:53:47.040
do one left or right step,
and then a parent step.

00:53:47.040 --> 00:53:52.270
So I should call
this iteration k.

00:53:52.270 --> 00:53:53.980
I guess there's
two k steps, then.

00:53:57.804 --> 00:53:59.850
Just to be clear.

00:53:59.850 --> 00:54:02.520
So in iteration k, that
means I've gone up k times

00:54:02.520 --> 00:54:05.100
and I've gone either
right or left k times.

00:54:05.100 --> 00:54:07.710
You can show if you start going
right you keep going right.

00:54:07.710 --> 00:54:10.750
If you initially go left
you'll keep going left.

00:54:10.750 --> 00:54:13.860
Doesn't matter too much.

00:54:13.860 --> 00:54:22.240
At iteration k I am at
height k, or k minus 1,

00:54:22.240 --> 00:54:24.000
or however you want to count.

00:54:24.000 --> 00:54:25.680
But let's call it k.

00:54:25.680 --> 00:54:31.250
So when I do this
right pointer here

00:54:31.250 --> 00:54:36.080
I know that, for
example, I am skipping

00:54:36.080 --> 00:54:42.660
over all of these keys.

00:54:42.660 --> 00:54:44.730
All the keys down-- the
keys are in the leaves,

00:54:44.730 --> 00:54:48.110
so all these things down
here, I'm jumping over them.

00:54:48.110 --> 00:54:51.130
How many keys are down there?

00:54:51.130 --> 00:54:53.900
Can you tell me,
roughly, how many keys

00:54:53.900 --> 00:54:56.860
I'm skipping over when I'm
moving right at height k?

00:54:59.970 --> 00:55:01.472
It's not a unique answer.

00:55:01.472 --> 00:55:02.805
But you can give me some bounds.

00:55:16.800 --> 00:55:18.940
Say again.

00:55:18.940 --> 00:55:20.840
Number of children
to the k power.

00:55:20.840 --> 00:55:22.010
Yeah.

00:55:22.010 --> 00:55:24.060
Except we don't know
the number of children.

00:55:24.060 --> 00:55:31.510
But it's between 2 and 3. Closer
one should be easy but I fail.

00:55:31.510 --> 00:55:33.350
So it's between two
and three children.

00:55:33.350 --> 00:55:41.140
So there's the number-- if
you look at a height k tree,

00:55:41.140 --> 00:55:42.880
how many leaves does it have?

00:55:42.880 --> 00:55:47.890
It's going to be between
2 to the k and 3 to the k.

00:55:47.890 --> 00:55:51.250
Because I have between 2 and
3 children at every node.

00:55:51.250 --> 00:55:53.175
And so it's exponential in k.

00:55:53.175 --> 00:55:54.050
That's all I'll need.

00:55:56.780 --> 00:55:57.280
OK.

00:55:57.280 --> 00:56:00.820
When I'm at height k here,
I'm skipping over a height

00:56:00.820 --> 00:56:03.440
k minus 1 tree or something.

00:56:03.440 --> 00:56:06.530
But it's going to be--

00:56:06.530 --> 00:56:13.165
So in iteration k I'm skipping,
at least, some constant times 2

00:56:13.165 --> 00:56:13.670
to the k.

00:56:13.670 --> 00:56:17.270
Maybe to the k minus
1, or to the k minus 2.

00:56:17.270 --> 00:56:18.300
I'm being very sloppy.

00:56:18.300 --> 00:56:19.130
Doesn't matter.

00:56:19.130 --> 00:56:21.870
As long as it's exponential
in k, I'm happy.

00:56:21.870 --> 00:56:27.190
Because I'm supposing that
x and y are somewhat close.

00:56:27.190 --> 00:56:30.510
Let's call this
rank difference d.

00:56:30.510 --> 00:56:33.290
Then I claim the
number of iterations

00:56:33.290 --> 00:56:37.340
I'll need to do in this loop
is, at most, order log d.

00:56:37.340 --> 00:56:41.100
Because if, when I get
to the k-th iteration,

00:56:41.100 --> 00:56:43.870
I'm jumping over 2
to the k elements.

00:56:43.870 --> 00:56:47.210
How large does k have
to be before 2 to the k

00:56:47.210 --> 00:56:50.230
is larger than d?

00:56:50.230 --> 00:56:52.960
Well, log d.

00:56:52.960 --> 00:57:09.120
Log base 2.

00:57:09.120 --> 00:57:15.950
The number of
iterations is order

00:57:15.950 --> 00:57:19.850
log d, where d is
the rank difference.

00:57:19.850 --> 00:57:25.390
d is the absolute difference between
rank of x and rank of y.

00:57:28.940 --> 00:57:31.940
And I'm being a
little sloppy here.

00:57:31.940 --> 00:57:33.840
You probably want
to use an induction.

00:57:33.840 --> 00:57:36.140
You need to show that
these items

00:57:36.140 --> 00:57:38.140
here that you're skipping
over really are strictly

00:57:38.140 --> 00:57:39.500
between x and y.

00:57:39.500 --> 00:57:41.970
But we know that there's
only d items between x and y.

00:57:41.970 --> 00:57:44.020
Actually d minus 1, I guess.

00:57:44.020 --> 00:57:49.360
So as soon as we've skipped over
all the items between x and y,

00:57:49.360 --> 00:57:52.652
then we'll find a
range that contains x,

00:57:52.652 --> 00:57:54.360
and then we'll go do
the downward search.

00:57:54.360 --> 00:57:56.740
Now how much does the
downward search cost?

00:57:56.740 --> 00:57:58.881
Whatever the height
of the tree is.

00:57:58.881 --> 00:58:00.130
What's the height of the tree?

00:58:00.130 --> 00:58:01.463
That's the number of iterations.

00:58:01.463 --> 00:58:03.230
So the total cost.

00:58:03.230 --> 00:58:05.110
The downward search
will cost the same

00:58:05.110 --> 00:58:07.480
as the rest of the search.

00:58:07.480 --> 00:58:12.302
And so the total cost is
going to be order log d.
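
NOTE
A compact version of the counting argument above, as a sketch only.
The constant c is an assumption standing in for the "very sloppy" factor
(2 to the k, k minus 1, or k minus 2); it is not a value given in the lecture.
```latex
\[
  \text{keys skipped in iteration } k \;\ge\; c \cdot 2^{k},
  \qquad
  \text{keys strictly between } x \text{ and } y \;\le\; d
\]
\[
  c \cdot 2^{k} \le d
  \;\Longrightarrow\;
  k \le \log_2 \frac{d}{c} = O(\log d)
\]
\[
  \underbrace{O(\log d)}_{\text{right/parent iterations}}
  \;+\;
  \underbrace{O(\log d)}_{\text{downward search}}
  \;=\; O(\log d)
\]
```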

00:58:12.302 --> 00:58:14.200
Clear?

00:58:14.200 --> 00:58:19.920
Any questions about finger
searching with level

00:58:19.920 --> 00:58:25.460
linked data at the
leaves, 2-3 trees?

00:58:25.460 --> 00:58:29.150
AUDIENCE: Sir, I'm not sure
why [INAUDIBLE] d, why is that?

00:58:29.150 --> 00:58:32.500
PROFESSOR: I'm defining d to be
the rank of x minus rank of y.

00:58:32.500 --> 00:58:34.570
My goal is to achieve
a log d bound.

00:58:34.570 --> 00:58:40.520
And I'm claiming that because
once I've skipped over d items,

00:58:40.520 --> 00:58:41.390
then I'm done.

00:58:41.390 --> 00:58:43.240
Then I've found x.

00:58:43.240 --> 00:58:48.250
And at step k I'm skipping
over 2 to the k items.

00:58:48.250 --> 00:58:50.010
So how big is k going to be?

00:58:50.010 --> 00:58:51.480
Log d.

00:58:51.480 --> 00:58:53.520
That's all.

00:58:53.520 --> 00:58:55.400
I used d for a notation here.

00:58:58.281 --> 00:58:58.780
Cool.

00:59:01.420 --> 00:59:02.600
Finger searching.

00:59:02.600 --> 00:59:03.100
It's nice.

00:59:03.100 --> 00:59:05.474
Especially if you're doing
many consecutive searches that

00:59:05.474 --> 00:59:09.110
are all relatively
close to each other.
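
NOTE
The same doubling argument can be tried out on a plain sorted array.
This is NOT the level-linked 2-3 tree from the lecture; it is a swapped-in
analogue (galloping, or exponential, search from a known index), and the name
finger_search and its parameters are hypothetical. Probes land at distance
1, 3, 7, ... from the finger, so about log d probes bracket a target d ranks
away, and a binary search inside the bracket costs about log d more.
```python
from bisect import bisect_left
#
def finger_search(keys, finger, x):
    """Return the index of the first key >= x in sorted `keys`, starting at `finger`."""
    n = len(keys)
    if finger < n and keys[finger] < x:
        # Gallop right. Invariant: keys[lo] < x.
        lo, step = finger, 1
        while lo + step < n and keys[lo + step] < x:
            lo += step
            step *= 2
        # The answer lies in (lo, lo + step], clipped to the array.
        return bisect_left(keys, x, lo + 1, min(lo + step, n))
    # Gallop left. Invariant: hi == n or keys[hi] >= x.
    hi, step = finger, 1
    while hi - step >= 0 and keys[hi - step] >= x:
        hi -= step
        step *= 2
    # The answer lies in [hi - step, hi], clipped to the array.
    return bisect_left(keys, x, max(hi - step, 0), hi)
```
Unlike the tree version, this array sketch is static: it gives the log d
search bound but not log n inserts and deletes.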

00:59:09.110 --> 00:59:09.960
But that was easy.

00:59:09.960 --> 00:59:13.800
Let's do a more
difficult augmentation.

00:59:13.800 --> 00:59:18.690
So the last topic for
today is range trees.

00:59:18.690 --> 00:59:20.970
This is probably the coolest
example of augmentation,

00:59:20.970 --> 00:59:22.726
at least, that you'll
see in this class.

00:59:22.726 --> 00:59:24.350
If you want to see
more you should take

00:59:24.350 --> 00:59:32.570
Advanced Data Structures, 6.851.

00:59:32.570 --> 00:59:34.970
And range trees solve
a problem called

00:59:34.970 --> 00:59:36.180
orthogonal range searching.

00:59:38.710 --> 00:59:41.910
Not orthogonal search ranging.

00:59:41.910 --> 00:59:46.130
Orthogonal range search.

00:59:51.840 --> 00:59:54.810
So what's the problem?

00:59:54.810 --> 00:59:57.810
I'm going to give you
a bunch of points.

00:59:57.810 --> 01:00:01.150
Draw them as fat dots so
you can actually see them.

01:00:01.150 --> 01:00:03.190
In some dimension.

01:00:03.190 --> 01:00:08.300
So this is, for
example, a 2D point set.

01:00:08.300 --> 01:00:08.800
OK.

01:00:08.800 --> 01:00:11.857
Over here I will
draw a 3D point set.

01:00:11.857 --> 01:00:13.440
You can tell the
difference, I'm sure.

01:00:18.860 --> 01:00:19.360
There.

01:00:19.360 --> 01:00:22.310
Now it's a 3D point set.

01:00:22.310 --> 01:00:25.221
And this is a static point set.

01:00:25.221 --> 01:00:26.970
You could make this
dynamic but let's just

01:00:26.970 --> 01:00:30.470
think about the static case.

01:00:30.470 --> 01:00:34.490
Don't want the 2D points
and the 3D points to mix.

01:00:34.490 --> 01:00:37.890
Now, you get to preprocess
this into a data structure.

01:00:37.890 --> 01:00:40.097
So this is a static
data structure problem.

01:00:40.097 --> 01:00:42.680
And now I'm going to come along
with a whole bunch of queries.

01:00:42.680 --> 01:00:45.770
A query will be a box.

01:00:45.770 --> 01:00:46.270
OK.

01:00:46.270 --> 01:00:48.445
In two dimensions, a
box is a rectangle.

01:00:51.370 --> 01:00:52.290
Something like this.

01:00:52.290 --> 01:00:53.580
Axis aligned.

01:00:53.580 --> 01:00:57.040
So I give you an x min, x
max, a y min, and a y max.

01:00:57.040 --> 01:00:59.490
I want to know what
are the points inside.

01:00:59.490 --> 01:01:00.920
Maybe I want you to list them.

01:01:00.920 --> 01:01:01.750
If there's a lot
of them it's going

01:01:01.750 --> 01:01:03.125
to take a long
time to list them.

01:01:03.125 --> 01:01:05.900
Maybe I just want to know
10 of them as examples.

01:01:05.900 --> 01:01:07.730
Maybe this is a Google
search or something.

01:01:07.730 --> 01:01:09.813
I just get the first 10
results in the first page,

01:01:09.813 --> 01:01:13.600
I hit next then want the
next 10, that kind of thing.

01:01:13.600 --> 01:01:16.730
Or maybe I want to know how
many search results there are.

01:01:16.730 --> 01:01:18.180
Number of points
in the rectangle.

01:01:18.180 --> 01:01:19.720
Bunch of different problems.

01:01:19.720 --> 01:01:23.370
In 3D, it's a 3D box.

01:01:23.370 --> 01:01:26.650
Which is a little
harder to draw.

01:01:26.650 --> 01:01:28.900
You can't really tell which
points are inside the box.

01:01:28.900 --> 01:01:30.900
Let's say these three points
are all inside the box.

01:01:30.900 --> 01:01:32.816
I give you an interval
in x, an interval in y,

01:01:32.816 --> 01:01:34.880
and an interval in
z, and I want to know

01:01:34.880 --> 01:01:36.060
what are the points inside.

01:01:36.060 --> 01:01:37.460
How many are there?

01:01:37.460 --> 01:01:38.220
List them all.

01:01:38.220 --> 01:01:40.941
List 10 of them, whatever.

01:01:40.941 --> 01:01:41.440
OK.

01:01:44.000 --> 01:01:47.050
I want to do this in
polylog time, let's say.

01:01:47.050 --> 01:01:50.830
I'm going to achieve today
log squared for the 2D problem

01:01:50.830 --> 01:01:52.920
and log cubed for
the 3D problem,

01:01:52.920 --> 01:01:54.830
plus whatever the
size output is.

01:02:01.630 --> 01:02:03.260
So let me just write that down.

01:02:14.970 --> 01:02:27.255
So the goal is to preprocess
n points in d dimensions.

01:02:30.400 --> 01:02:33.580
So you get to spend a
bunch of time preprocessing

01:02:33.580 --> 01:02:44.840
to support a query which is,
given a box, axis aligned box,

01:02:44.840 --> 01:02:52.020
find let's say the number
of points in the box.

01:02:56.310 --> 01:02:58.655
Find k points in the box.

01:03:03.927 --> 01:03:04.760
I think that's good.

01:03:04.760 --> 01:03:09.300
That includes a special case of
find all the points in the box.

01:03:09.300 --> 01:03:14.480
So this, of course, we have
to pay a penalty of order k

01:03:14.480 --> 01:03:15.230
for the output.

01:03:17.850 --> 01:03:20.470
No getting around that.

01:03:20.470 --> 01:03:24.894
But I want the rest of the
time to be log to the d.

01:03:28.282 --> 01:03:30.830
So we're going to
achieve log to the d n

01:03:30.830 --> 01:03:33.000
plus size of the output.

01:03:36.050 --> 01:03:38.550
And you get to control how
big you want the output to be.

01:03:38.550 --> 01:03:41.360
So it's a pretty
reasonable data structure.

01:03:41.360 --> 01:03:43.690
In a certain sense we will
understand what the output

01:03:43.690 --> 01:03:45.551
is in log to the d time.

01:03:45.551 --> 01:03:47.050
If you actually
want to list points,

01:03:47.050 --> 01:03:50.980
well, then you have to
spend the time to do it.

01:03:50.980 --> 01:03:51.480
All right.

01:03:51.480 --> 01:03:55.110
So 2D and 3D are great,
but let's start with 1D.

01:03:55.110 --> 01:03:57.510
First we should
understand 1D completely,

01:03:57.510 --> 01:03:58.710
then we can generalize.

01:04:06.590 --> 01:04:08.290
1D we already know how to do.

01:04:12.700 --> 01:04:15.430
1D I have a line.

01:04:15.430 --> 01:04:16.770
I have some points on the line.

01:04:22.370 --> 01:04:26.210
And I'm given, as a
query, some interval.

01:04:29.360 --> 01:04:32.220
And I want to know how many
points are in the interval,

01:04:32.220 --> 01:04:36.890
give me the points in
the interval, and so on.

01:04:36.890 --> 01:04:38.725
So how do I do this?

01:04:38.725 --> 01:04:39.225
Any ways?

01:04:48.600 --> 01:04:49.720
If d is 1.

01:04:49.720 --> 01:04:52.980
So I want to achieve
log d, sorry, log n,

01:04:52.980 --> 01:04:54.288
plus size of output.

01:04:58.294 --> 01:04:58.960
I hear whispers.

01:05:04.988 --> 01:05:05.876
Yeah?

01:05:05.876 --> 01:05:06.950
AUDIENCE: Segment trees?

01:05:06.950 --> 01:05:07.950
PROFESSOR: Segment tree?

01:05:07.950 --> 01:05:09.100
That's fancy.

01:05:09.100 --> 01:05:10.310
We won't cover segment trees.

01:05:10.310 --> 01:05:13.750
Probably segment trees do it.

01:05:13.750 --> 01:05:14.250
Yeah.

01:05:14.250 --> 01:05:17.546
We know lots of ways to do this.

01:05:17.546 --> 01:05:18.045
Yeah?

01:05:18.045 --> 01:05:18.960
AUDIENCE: Sorted array?

01:05:18.960 --> 01:05:21.020
PROFESSOR: Sorted array
is probably the simplest.

01:05:21.020 --> 01:05:24.380
If I store the items in a sorted
array and I have two values,

01:05:24.380 --> 01:05:28.040
I'll call them x1
and x2, because it's

01:05:28.040 --> 01:05:30.820
the x min and x max.

01:05:30.820 --> 01:05:32.500
Binary search for x1.

01:05:32.500 --> 01:05:34.110
Binary search for x2.

01:05:34.110 --> 01:05:36.710
Find the successor of x1
and the predecessor of x2.

01:05:36.710 --> 01:05:38.164
I'll find these two guys.

01:05:38.164 --> 01:05:39.830
And then I know all
the ones in between.

01:05:39.830 --> 01:05:41.520
That's the match.

01:05:41.520 --> 01:05:44.830
So that'll take log n
time to find those points

01:05:44.830 --> 01:05:47.720
and then we're good.
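
NOTE
A minimal sketch of the sorted-array answer just described, assuming a
closed interval [x1, x2]. The function name range_query_sorted is
hypothetical. It returns a pair of indices rather than a copied list, so
counting costs log n and listing costs one step per reported key.
```python
from bisect import bisect_left, bisect_right
#
def range_query_sorted(keys, x1, x2):
    """Return (lo, hi) such that keys[lo:hi] are exactly the keys in [x1, x2]."""
    lo = bisect_left(keys, x1)    # index of the successor of x1 (first key >= x1)
    hi = bisect_right(keys, x2)   # one past the predecessor of x2 (last key <= x2)
    return lo, max(lo, hi)        # max() keeps the slice empty when x1 > x2
```
Usage: with lo, hi = range_query_sorted(keys, x1, x2), the count is hi - lo
and the first k matches are keys[lo:min(hi, lo + k)].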

01:05:47.720 --> 01:05:50.680
So we could do a sorted array.

01:05:50.680 --> 01:05:53.950
Of course, sorted array is
a little hard to generalize.

01:05:53.950 --> 01:05:57.170
I don't want to do a 2D
array, that sounds bad.

01:05:57.170 --> 01:05:59.580
You could, of course,
do a binary search tree.

01:05:59.580 --> 01:06:01.580
Like an AVL tree.

01:06:01.580 --> 01:06:02.140
Same thing.

01:06:02.140 --> 01:06:04.102
Because we have log n
search, find successor,

01:06:04.102 --> 01:06:06.310
and predecessor. I guess
you could use van Emde Boas,

01:06:06.310 --> 01:06:08.070
but that's hard to
generalize to 2D.

01:06:10.940 --> 01:06:13.640
You could use level links.

01:06:13.640 --> 01:06:14.940
Here's a fancy version.

01:06:14.940 --> 01:06:19.310
We could use level linked 2-3
trees with data in the leaves.

01:06:19.310 --> 01:06:24.150
Then once I find x
min, I find this point,

01:06:24.150 --> 01:06:26.580
I can go to the successor
in constant time

01:06:26.580 --> 01:06:29.680
because that's a finger search
with a rank difference of 1.

01:06:29.680 --> 01:06:32.180
And I could just keep
calling successor

01:06:32.180 --> 01:06:36.239
and in constant time per item
I will find the next item.

01:06:36.239 --> 01:06:38.280
So we could do that easily
with the sorted array.

01:06:38.280 --> 01:06:40.870
BST is not so great
because successor

01:06:40.870 --> 01:06:44.154
might cost log n each time.

01:06:44.154 --> 01:06:45.570
But if I have the
level links then

01:06:45.570 --> 01:06:47.444
basically I'm just
walking down the linked list

01:06:47.444 --> 01:06:49.190
at the bottom of the tree.

01:06:49.190 --> 01:06:49.690
OK.

01:06:49.690 --> 01:06:52.920
So actually level
linked is even better.

01:06:55.660 --> 01:07:01.660
BST would achieve something
like log n plus k log n, where

01:07:01.660 --> 01:07:05.630
k is the size of the output.

01:07:05.630 --> 01:07:08.400
If I want k points in the
box I have to pay log n

01:07:08.400 --> 01:07:14.130
for each. Level linked, I'll
only pay log n plus k.

01:07:14.130 --> 01:07:17.310
Here I actually only need
the levels at the leaves.

01:07:17.310 --> 01:07:18.790
Level links.

01:07:18.790 --> 01:07:19.290
OK.

01:07:19.290 --> 01:07:20.352
All good.

01:07:20.352 --> 01:07:22.310
But I actually want to
tell you a different way

01:07:22.310 --> 01:07:24.560
to do it that will
generalize better.

01:07:30.552 --> 01:07:32.010
The pictures are
going to look just

01:07:32.010 --> 01:07:34.460
like the pictures
we've talked about.

01:07:55.040 --> 01:07:57.980
So these would actually
work dynamically.

01:07:57.980 --> 01:08:00.720
My goal here is just to achieve
a static data structure.

01:08:00.720 --> 01:08:05.320
I'm going to idealize this
solution a little bit.

01:08:05.320 --> 01:08:11.030
And just say, suppose I
have a perfectly balanced

01:08:11.030 --> 01:08:13.050
binary search tree.

01:08:13.050 --> 01:08:14.971
That's going to be
my data structure.

01:08:14.971 --> 01:08:15.470
OK.

01:08:15.470 --> 01:08:18.140
So the data structure is not
hard, but what's interesting

01:08:18.140 --> 01:08:21.859
is how I do a range search.

01:08:21.859 --> 01:08:32.569
So if I do range query of the
interval, I'll call it [a, b].

01:08:32.569 --> 01:08:37.380
Then what I'm going to do
is do a binary search for a,

01:08:37.380 --> 01:08:45.970
do a binary search for
b, trim the common prefix

01:08:45.970 --> 01:08:48.420
of those search paths.

01:08:48.420 --> 01:08:52.504
That's basically finding
the lowest common ancestor

01:08:52.504 --> 01:08:55.310
of a and b.

01:08:59.771 --> 01:09:01.270
And then I'm going
to do some stuff.

01:09:01.270 --> 01:09:02.670
Let me draw the picture.

01:09:02.670 --> 01:09:07.649
So here is, suppose here's
the node that contains a.

01:09:07.649 --> 01:09:09.380
Here's the node that contains b.

01:09:09.380 --> 01:09:12.670
They may not be at the
same depth, who knows.

01:09:12.670 --> 01:09:15.290
Then I'm going to look
at the parents of a.

01:09:15.290 --> 01:09:19.630
I just came down from some path
here, and some path down to b.

01:09:19.630 --> 01:09:21.439
I want to find this
branching point

01:09:21.439 --> 01:09:23.775
where the paths to a and
the paths to b diverge.

01:09:26.490 --> 01:09:28.500
So let's just look
at the parent of a.

01:09:28.500 --> 01:09:34.649
It could be a right
parent, in which case

01:09:34.649 --> 01:09:35.649
there's a subtree here.

01:09:35.649 --> 01:09:38.910
Could be a left parent in
which case, subtree here.

01:09:43.020 --> 01:09:46.050
I'm going to follow
my convention again.

01:09:46.050 --> 01:09:48.220
That x-coordinate
corresponds roughly to key.

01:09:53.206 --> 01:09:55.120
Left parent here.

01:09:55.120 --> 01:09:56.485
Maybe right parent here.

01:10:08.040 --> 01:10:09.090
Something like that.

01:10:23.220 --> 01:10:23.720
OK.

01:10:23.720 --> 01:10:25.060
Remember it's a perfect tree.

01:10:25.060 --> 01:10:30.005
So, actually, all the leaves
will be at the same level.

01:10:35.050 --> 01:10:39.500
And, roughly here, x-coordinate
corresponds to key.

01:10:39.500 --> 01:10:41.786
So here is a.

01:10:41.786 --> 01:10:44.910
And I want to return all the
keys that are between a and b.

01:10:44.910 --> 01:10:48.112
So that's everything
in this sweep line.

01:10:50.960 --> 01:10:53.880
The parents of the LCA don't
matter, because this parent's

01:10:53.880 --> 01:10:56.191
either going to be way over
to the right or way over

01:10:56.191 --> 01:10:56.690
to the left.

01:10:56.690 --> 01:10:59.399
In both cases, it's outside
the interval a to b.

01:10:59.399 --> 01:11:00.940
So what I've tried
to highlight here,

01:11:00.940 --> 01:11:06.700
and I will color it in
blue, is the relevant nodes

01:11:06.700 --> 01:11:08.350
for the search between a and b.

01:11:08.350 --> 01:11:11.570
So a is between a and b.

01:11:11.570 --> 01:11:14.800
This subtree is greater
than a and less than b.

01:11:14.800 --> 01:11:17.360
This node, and these nodes.

01:11:17.360 --> 01:11:20.520
This node, and these nodes.

01:11:20.520 --> 01:11:23.580
This node and these nodes.

01:11:23.580 --> 01:11:25.490
The common ancestor.

01:11:25.490 --> 01:11:27.350
And then the corresponding
thing over here.

01:11:31.350 --> 01:11:33.450
All the nodes in all
these blue subtrees,

01:11:33.450 --> 01:11:36.630
plus these individual
nodes, fall in the interval

01:11:36.630 --> 01:11:40.440
between a and b, and that's it.

01:11:40.440 --> 01:11:40.940
OK.

01:11:40.940 --> 01:11:42.400
This should look super familiar.

01:11:42.400 --> 01:11:44.397
It's just like when
we're computing rank.

01:11:44.397 --> 01:11:46.730
We're trying to figure out
how many guys are to our left

01:11:46.730 --> 01:11:47.740
or to our right.

01:11:47.740 --> 01:11:49.960
We're basically doing
a rightward rank

01:11:49.960 --> 01:11:52.680
from a and a
leftward rank from b.

01:11:52.680 --> 01:11:54.160
And that finds all the nodes.

01:11:54.160 --> 01:11:57.670
And stopping when those
two searches converge.

01:11:57.670 --> 01:12:00.040
And then we're finding all
the nodes between a and b.

01:12:00.040 --> 01:12:01.165
I'm not going to write down
the pseudocode because it's

01:12:01.165 --> 01:12:02.123
the same kind of thing.

01:12:02.123 --> 01:12:05.050
You look at right
parents and left parents.

01:12:05.050 --> 01:12:06.880
You just walk up from a.

01:12:06.880 --> 01:12:09.550
Whenever you get a
right parent then

01:12:09.550 --> 01:12:12.830
you want that node, and
the subtree to its right.

01:12:12.830 --> 01:12:15.070
And so that will
highlight these nodes.

01:12:15.070 --> 01:12:17.920
Same thing for b, but
you look at left parents.

01:12:17.920 --> 01:12:20.224
And then you stop when
those two searches converge.

01:12:20.224 --> 01:12:21.890
So you're going to
do them in lock step.

01:12:21.890 --> 01:12:23.530
You do one step for a and b.

01:12:23.530 --> 01:12:24.480
One step for a and b.

01:12:24.480 --> 01:12:26.980
And when they happen to hit the
same node, then you're done.

01:12:26.980 --> 01:12:29.990
You add that node to your list.

01:12:29.990 --> 01:12:35.630
And what you end
up with is a bunch

01:12:35.630 --> 01:12:39.510
of nodes and rooted subtrees.

01:12:43.820 --> 01:12:48.500
The things I circled in blue
are going to be my return value.

01:12:48.500 --> 01:12:52.110
So I'm going to return all
of these nodes, explicitly.

01:12:52.110 --> 01:12:54.012
And I'm also going to
return these subtrees.

01:12:54.012 --> 01:12:55.720
I'm not going to have
to write them down.

01:12:55.720 --> 01:12:58.130
I'm just going to return
the root of the subtree,

01:12:58.130 --> 01:12:58.880
and say, hey look.

01:12:58.880 --> 01:13:01.930
Here's an entire
subtree that contains

01:13:01.930 --> 01:13:04.520
points that are in the answer.

01:13:04.520 --> 01:13:06.380
Don't have to list
them explicitly,

01:13:06.380 --> 01:13:08.560
I can just give you the tree.

01:13:08.560 --> 01:13:12.750
Then if I want to know how
many results are in the answer,

01:13:12.750 --> 01:13:16.730
well, just augment to store
subtree size at the beginning.

01:13:16.730 --> 01:13:18.200
And then I can
count how many nodes

01:13:18.200 --> 01:13:20.280
are down here, how many
nodes are down here,

01:13:20.280 --> 01:13:22.670
add that up for
all the triangles,

01:13:22.670 --> 01:13:25.700
and then also add one for
each of the blue nodes,

01:13:25.700 --> 01:13:30.500
and then I've counted the size
of the answer in how much time?

01:13:30.500 --> 01:13:35.010
How many subtrees and how many
nodes am I returning here?

01:13:35.010 --> 01:13:35.510
Log.

01:13:40.490 --> 01:13:44.380
Log n nodes and log n rooted
subtrees because at each step,

01:13:44.380 --> 01:13:46.990
I'm going up by one for
a, and up by one for b.

01:13:46.990 --> 01:13:48.260
So it's like 2 log n.

01:13:48.260 --> 01:13:48.860
Log n.

01:13:51.880 --> 01:13:55.780
So I would call this an implicit
representation of the answer.

01:13:55.780 --> 01:13:57.810
From that implicit
representation

01:13:57.810 --> 01:13:59.920
you can use the subtree size

01:13:59.920 --> 01:14:02.240
augmentation to count
the size of the answer.

01:14:02.240 --> 01:14:05.700
You can just start walking
through one by one, do an in-order

01:14:05.700 --> 01:14:08.480
traversal of the trees, and
you'll get the first k points

01:14:08.480 --> 01:14:10.220
in the answer in order k time.
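
NOTE
A sketch of the implicit representation, under stated assumptions. The
lecture walks UP from a and b using parent pointers; this version walks DOWN
from the root (find the branching node, then follow the two search paths),
which collects the same O(log n) nodes and subtree roots without needing
parent pointers. Node, build, range_query, count, walk and report are
hypothetical names, and the interval [a, b] is taken as closed.
```python
from itertools import chain, islice
#
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # Subtree-size augmentation, computed bottom-up at construction.
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)
#
def build(keys, lo=0, hi=None):
    """Build a balanced BST from the sorted slice keys[lo:hi]."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    return Node(keys[mid], build(keys, lo, mid), build(keys, mid + 1, hi))
#
def range_query(root, a, b):
    """Return (nodes, subtrees) that together hold exactly the keys in [a, b]."""
    nodes, subtrees = [], []
    v = root
    # Branching node: the first node on the search path whose key is in [a, b].
    while v is not None and not (a <= v.key <= b):
        v = v.left if b < v.key else v.right
    if v is None:
        return nodes, subtrees
    nodes.append(v)
    u = v.left                      # path toward a: every key here is below v.key
    while u is not None:
        if a <= u.key:              # u and its whole right subtree are in range
            nodes.append(u)
            if u.right is not None:
                subtrees.append(u.right)
            u = u.left
        else:
            u = u.right
    u = v.right                     # path toward b: every key here is above v.key
    while u is not None:
        if u.key <= b:              # u and its whole left subtree are in range
            nodes.append(u)
            if u.left is not None:
                subtrees.append(u.left)
            u = u.right
        else:
            u = u.left
    return nodes, subtrees
#
def count(nodes, subtrees):
    """Size of the answer, one step per returned node or subtree root."""
    return len(nodes) + sum(t.size for t in subtrees)
#
def walk(t):
    """In-order traversal of one subtree, produced lazily."""
    if t is not None:
        yield from walk(t.left)
        yield t.key
        yield from walk(t.right)
#
def report(nodes, subtrees, k=None):
    """List up to k matching keys (all if k is None); not in sorted order."""
    keys = chain((v.key for v in nodes), chain.from_iterable(walk(t) for t in subtrees))
    return list(islice(keys, k))
```
Usage: nodes, subtrees = range_query(build(sorted_keys), a, b), then
count(nodes, subtrees) for the size or report(nodes, subtrees, 10) for ten
matches.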

01:14:10.220 --> 01:14:11.273
Question?

01:14:11.273 --> 01:14:12.564
AUDIENCE: Just a clarification.

01:14:12.564 --> 01:14:13.556
You said when we
were walking up,

01:14:13.556 --> 01:14:15.044
you want to get
all the ancestors

01:14:15.044 --> 01:14:17.524
and their right subtrees.

01:14:17.524 --> 01:14:20.020
But you don't do that for
the left parent, right?

01:14:20.020 --> 01:14:21.020
PROFESSOR: That's right.

01:14:21.020 --> 01:14:23.219
As I'm walking up the tree,
if it's a right parent

01:14:23.219 --> 01:14:25.760
then I take the right subtree
and include that in the answer.

01:14:25.760 --> 01:14:29.210
If it's a left parent
just forget about it.

01:14:29.210 --> 01:14:30.230
Don't do anything.

01:14:30.230 --> 01:14:31.640
Just keep following parents.

01:14:31.640 --> 01:14:34.020
Whenever I do right
parent then I also

01:14:34.020 --> 01:14:35.522
add that node and
the right subtree.

01:14:35.522 --> 01:14:37.480
If it's a left parent I
don't include the node,

01:14:37.480 --> 01:14:39.110
I don't include
the left subtree.

01:14:39.110 --> 01:14:40.420
I also don't include
the right subtree.

01:14:40.420 --> 01:14:41.711
That would have too much stuff.

01:14:44.072 --> 01:14:45.530
It's easy when you
see the picture,

01:14:45.530 --> 01:14:46.950
you would write
down the algorithm.

01:14:46.950 --> 01:14:47.450
It's clear.

01:14:47.450 --> 01:14:50.550
It's left versus right parents.

01:14:50.550 --> 01:14:53.321
AUDIENCE: Would you include
the left subtree of b?

01:14:53.321 --> 01:14:54.820
PROFESSOR: I would
also-- thank you.

01:14:54.820 --> 01:14:58.630
I should color the
left subtree of b.

01:14:58.630 --> 01:15:00.480
I didn't apply
symmetry perfectly.

01:15:00.480 --> 01:15:03.250
So we have the right subtree
of a and the left subtree of b.

01:15:03.250 --> 01:15:04.950
Thanks.

01:15:04.950 --> 01:15:10.110
I would also include b if
it's a closed interval.

01:15:10.110 --> 01:15:11.130
Slightly more general.

01:15:11.130 --> 01:15:12.990
If a and b are not
in the tree then this

01:15:12.990 --> 01:15:17.110
is really the successor of a and
this is the predecessor of b.

01:15:17.110 --> 01:15:19.790
So then a and b don't
have to be in there.

01:15:19.790 --> 01:15:22.940
This is still a well
defined range search.

01:15:22.940 --> 01:15:23.440
OK.

01:15:23.440 --> 01:15:25.540
Now we really understand 1D.

01:15:25.540 --> 01:15:30.190
I claim we've almost
solved all dimensions.

01:15:30.190 --> 01:15:33.340
All we need is a little
bit of augmentation.

01:15:33.340 --> 01:15:34.230
So let's do it.

01:15:51.560 --> 01:15:53.220
Let's start with 2D.

01:15:53.220 --> 01:15:59.090
But then 3D, and 4D,
and so on will be easy.

01:15:59.090 --> 01:16:00.990
Why do I care about
4D range trees?

01:16:00.990 --> 01:16:03.110
Because maybe I have a database.

01:16:03.110 --> 01:16:05.720
Each of these points
is actually just a row

01:16:05.720 --> 01:16:09.880
in the database which has
four columns, four values.

01:16:09.880 --> 01:16:12.590
And what I'm trying to do
here is find all the people

01:16:12.590 --> 01:16:15.000
in my database that have a
salary between this and this,

01:16:15.000 --> 01:16:17.180
and have an age
between this and that,

01:16:17.180 --> 01:16:19.370
and have a profession
between this and this.

01:16:19.370 --> 01:16:22.130
I don't know what that means.

01:16:22.130 --> 01:16:24.600
Number of degrees between
this and this, whatever.

01:16:24.600 --> 01:16:28.190
You have some numerical data
representing a person or thing

01:16:28.190 --> 01:16:31.280
in your database, then this
is a typical kind of search

01:16:31.280 --> 01:16:32.620
you want to do.

01:16:32.620 --> 01:16:34.850
And you want to know how
many answers you've got

01:16:34.850 --> 01:16:37.290
and then list the first
hundred of them, or whatever.

01:16:37.290 --> 01:16:40.150
So this is a practical
thing in databases.

01:16:40.150 --> 01:16:43.736
This is what you might call
an index in the database.

01:16:43.736 --> 01:16:44.360
So let's start.

01:16:44.360 --> 01:16:46.110
Suppose your data is
just two dimensional.

01:16:46.110 --> 01:16:48.910
You have two fields
for every item.

01:16:48.910 --> 01:17:02.800
What I'm going to do is store
a 1D range tree on all points

01:17:02.800 --> 01:17:05.430
by x.

01:17:05.430 --> 01:17:09.240
So this data structure makes
sense if you fix a dimension.

01:17:09.240 --> 01:17:10.990
Say x is all I care about.

01:17:10.990 --> 01:17:13.290
Forget about y.

01:17:13.290 --> 01:17:14.140
So my point set.

01:17:14.140 --> 01:17:15.020
Yeah.

01:17:15.020 --> 01:17:23.380
So what that corresponds to is
projecting each of these points

01:17:23.380 --> 01:17:24.475
onto the x-axis.

01:17:31.320 --> 01:17:33.220
And now also
projecting my query.

01:17:36.050 --> 01:17:38.750
So my new query is
from here to here in x.

01:17:41.420 --> 01:17:43.900
And so this data
structure will let

01:17:43.900 --> 01:17:46.370
me find all these
points that match in x.

01:17:46.370 --> 01:17:48.180
That's not good because
there's actually

01:17:48.180 --> 01:17:50.480
only two points that
I want, but I find

01:17:50.480 --> 01:17:53.210
four points in this picture.

01:17:53.210 --> 01:17:55.120
But it's half of the answer.

01:17:55.120 --> 01:17:57.520
It's all the x matches
forgetting about y.

01:18:00.540 --> 01:18:03.140
Now here's the fun part.

01:18:03.140 --> 01:18:08.280
So when I do a search
here I get log n nodes.

01:18:08.280 --> 01:18:11.210
Nodes are good because they
have a single key in them.

01:18:11.210 --> 01:18:13.980
So I'll just check for
each of those log n nodes.

01:18:13.980 --> 01:18:15.780
Do they also match in y?

01:18:15.780 --> 01:18:17.570
If they do, add
it to the answer.

01:18:17.570 --> 01:18:20.000
If they don't forget about it.

01:18:20.000 --> 01:18:21.610
OK.

01:18:21.610 --> 01:18:25.370
But the tricky part is I also
get log n subtrees representing

01:18:25.370 --> 01:18:26.760
parts of the answer.

01:18:26.760 --> 01:18:31.190
So potentially it could be that
your search, this rectangle,

01:18:31.190 --> 01:18:33.100
only has like five points.

01:18:33.100 --> 01:18:36.030
But if you look at this
whole vertical slab

01:18:36.030 --> 01:18:38.200
there's a billion points.

01:18:38.200 --> 01:18:39.740
Now, luckily, those
billion points

01:18:39.740 --> 01:18:41.050
are represented succinctly.

01:18:41.050 --> 01:18:42.706
There's just log
n subtrees saying,

01:18:42.706 --> 01:18:44.080
well there's half
a billion here,

01:18:44.080 --> 01:18:46.579
and a quarter billion here, and
an eighth of a billion here.

01:18:50.970 --> 01:18:56.100
Now for each of those
big chunks of output,

01:18:56.100 --> 01:18:59.050
I want to very quickly find
the ones that match in y.

01:18:59.050 --> 01:19:01.220
How would I find the
ones matching in y?

01:19:03.910 --> 01:19:05.342
A range tree.

01:19:05.342 --> 01:19:06.780
Yeah.

01:19:06.780 --> 01:19:07.280
OK.

01:19:07.280 --> 01:19:09.000
So here's what
we're going to do.

01:19:09.000 --> 01:19:14.690
For each node, call it x.

01:19:14.690 --> 01:19:16.240
x is overloaded.

01:19:16.240 --> 01:19:17.050
It's a coordinate.

01:19:17.050 --> 01:19:17.960
So many things.

01:19:17.960 --> 01:19:22.590
Let's call it v.
This thing

01:19:22.590 --> 01:19:24.580
I'm going to call the x-tree.

01:19:24.580 --> 01:19:27.110
So for every node
in the x-tree I'm

01:19:27.110 --> 01:19:30.810
going to store
another 1D range tree.

01:19:30.810 --> 01:19:42.460
But this time using
the y-coordinate on all

01:19:42.460 --> 01:19:50.360
points in v's rooted subtree.
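
NOTE
A minimal sketch of the structure just described, under stated assumptions:
a sorted Python list (by_y) stands in for each node's 1D range tree in y,
and the names Node, build, and by_y are hypothetical, not from the lecture.
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
#
Point = Tuple[float, float]
#
@dataclass
class Node:
    point: Point                 # the single key stored at this x-tree node
    left: Optional["Node"]
    right: Optional["Node"]
    by_y: List[Point]            # assumption: sorted list instead of a y-tree;
                                 # holds every point in this node's subtree
#
def build(points_by_x: List[Point]) -> Optional[Node]:
    """Build a balanced x-tree from points already sorted by x."""
    if not points_by_x:
        return None
    mid = len(points_by_x) // 2
    return Node(
        point=points_by_x[mid],
        left=build(points_by_x[:mid]),
        right=build(points_by_x[mid + 1:]),
        by_y=sorted(points_by_x, key=lambda p: p[1]),
    )
#
points = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 3)]
root = build(sorted(points))
```
Each node carries a second copy of its own subtree, ordered by y instead of x.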

01:19:52.899 --> 01:19:54.815
At this point I really
want to draw a diagram.

01:19:58.290 --> 01:20:00.580
So, rough picture.

01:20:13.740 --> 01:20:17.000
Forgive me for not
drawing this perfectly.

01:20:17.000 --> 01:20:18.970
This is roughly what
the answer looks

01:20:18.970 --> 01:20:22.430
like for the 1D range search.

01:20:22.430 --> 01:20:24.630
This is the x-tree.

01:20:24.630 --> 01:20:28.900
And here I've searched between
this value and this value

01:20:28.900 --> 01:20:29.950
in the x-coordinate.

01:20:29.950 --> 01:20:31.240
Basically I have log n nodes.

01:20:31.240 --> 01:20:33.020
I'm going to check
those separately.

01:20:33.020 --> 01:20:35.970
Then I also have
these log n subtrees.

01:20:35.970 --> 01:20:40.280
For each of those
log n subtrees

01:20:40.280 --> 01:20:42.510
I'm going to have
a pointer-- this

01:20:42.510 --> 01:20:49.080
is the augmentation-- to another
tree of exactly the same size.

01:20:49.080 --> 01:20:51.860
On exactly the same
data that's in here.

01:20:51.860 --> 01:20:53.410
It's also over here.

01:20:53.410 --> 01:20:55.900
But it's going to
be sorted by y.

01:20:55.900 --> 01:20:59.100
And it's a 1D range tree by y.

01:20:59.100 --> 01:21:00.670
Tons of data duplication here.

01:21:00.670 --> 01:21:03.740
I took all these points and I
copied them over here, but then

01:21:03.740 --> 01:21:05.420
built a 1D range tree in y.

01:21:05.420 --> 01:21:06.590
This is all preprocessing.

01:21:06.590 --> 01:21:08.595
So I don't have to pay for this.

01:21:08.595 --> 01:21:09.470
It's polynomial time.

01:21:09.470 --> 01:21:11.430
Don't worry too much.

01:21:11.430 --> 01:21:13.730
And then I'm going
to search in here.

01:21:13.730 --> 01:21:15.260
What does the search
in there look like?

01:21:15.260 --> 01:21:18.820
I'm going to get, you know,
some more trees and a couple

01:21:18.820 --> 01:21:20.540
more nodes.

01:21:20.540 --> 01:21:21.040
OK.

01:21:21.040 --> 01:21:25.710
But now those items, those
points, match in x and y

01:21:25.710 --> 01:21:28.410
because this whole
subtree matched in x

01:21:28.410 --> 01:21:31.530
and I just did a y search, so I
found things that matched in y.

01:21:31.530 --> 01:21:34.940
So I get here
another log n trees

01:21:34.940 --> 01:21:37.120
that are actually in my answer.

01:21:37.120 --> 01:21:41.400
And for each of these nodes I
have a corresponding other data

01:21:41.400 --> 01:21:45.130
structure where I
do a little search

01:21:45.130 --> 01:21:46.400
and I get part of the answer.

01:21:51.720 --> 01:21:53.500
Every one.

01:21:53.500 --> 01:21:54.970
Sounds huge.

01:21:54.970 --> 01:21:58.260
This data structure sounds
huge, but it's actually small.

01:21:58.260 --> 01:22:02.560
But one thing that's clear is
it takes log squared n time,

01:22:02.560 --> 01:22:05.050
because I have log n
triangles over here.

01:22:05.050 --> 01:22:08.520
For each of them I spend log
n to find triangles over here.

01:22:08.520 --> 01:22:12.960
The total output is log squared
n nodes, for each of them

01:22:12.960 --> 01:22:14.590
I have to check manually.

01:22:14.590 --> 01:22:18.602
Plus, so over here,
there's log n

01:22:18.602 --> 01:22:19.810
different searches I'm doing.

01:22:19.810 --> 01:22:21.160
Each one has size log n.

01:22:21.160 --> 01:22:23.870
So I get log squared n
little triangles that

01:22:23.870 --> 01:22:27.230
contain the results
that match in x and y.
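
NOTE
A rough sketch of the query described above, not a definitive implementation.
Assumptions: nodes are plain tuples (point, left, right, ys), and a sorted
list ys stands in for the y range tree; build, in_y, and query are
hypothetical names. Individual nodes on the search path are checked by hand,
and each fully covered subtree is handed to its y-structure.
```python
from bisect import bisect_left, bisect_right
#
def build(pts):
    """pts sorted by x. ys holds the subtree's points as sorted (y, x) pairs."""
    if not pts:
        return None
    mid = len(pts) // 2
    ys = sorted((y, x) for x, y in pts)
    return (pts[mid], build(pts[:mid]), build(pts[mid + 1:]), ys)
#
def in_y(node, y1, y2):
    """Points in node's subtree with y1 <= y <= y2, via binary search on ys."""
    if node is None:
        return []
    ys = node[3]
    lo = bisect_left(ys, (y1, float("-inf")))
    hi = bisect_right(ys, (y2, float("inf")))
    return [(x, y) for y, x in ys[lo:hi]]
#
def query(root, x1, x2, y1, y2):
    """Report points with x1 <= x <= x2 and y1 <= y <= y2."""
    out = []
    def check(node):                 # a single node: test both coordinates
        x, y = node[0]
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append((x, y))
    # Walk down to the node where the paths to x1 and x2 split.
    v = root
    while v is not None and not (x1 <= v[0][0] <= x2):
        v = v[1] if x2 < v[0][0] else v[2]
    if v is None:
        return out
    check(v)
    # Path toward x1: each time we step left, the right subtree matches in x.
    u = v[1]
    while u is not None:
        if x1 <= u[0][0]:
            check(u)
            out.extend(in_y(u[2], y1, y2))
            u = u[1]
        else:
            u = u[2]
    # Path toward x2, symmetric.
    u = v[2]
    while u is not None:
        if u[0][0] <= x2:
            check(u)
            out.extend(in_y(u[1], y1, y2))
            u = u[2]
        else:
            u = u[1]
    return out
#
points = [(1, 5), (2, 1), (3, 4), (4, 2), (5, 3), (6, 6), (7, 0)]
root = build(sorted(points))
hits = sorted(query(root, 2, 5, 2, 4))
```
The two walks touch about log n nodes and hand about log n subtrees to in_y,
each costing another log n, which is where the log squared n comes from.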

01:22:27.230 --> 01:22:30.639
How much space in
this data structure?

01:22:30.639 --> 01:22:31.930
That's the remaining challenge.

01:22:36.140 --> 01:22:45.600
Actually, it's not that hard,
because if you look at a key.

01:22:45.600 --> 01:22:48.642
So look at some
key in this x-tree.

01:22:48.642 --> 01:22:50.350
Let's look at a leaf
because that's maybe

01:22:50.350 --> 01:22:51.224
the most interesting.

01:22:57.440 --> 01:22:58.500
Here's the x-tree.

01:22:58.500 --> 01:22:59.810
x-tree has linear size.

01:22:59.810 --> 01:23:01.060
Just one tree.

01:23:01.060 --> 01:23:06.570
If I look at some key value,
well, it lives in this subtree.

01:23:06.570 --> 01:23:09.060
And so there's going to be a
corresponding blue structure

01:23:09.060 --> 01:23:11.150
of that size that
contains that key.

01:23:11.150 --> 01:23:12.620
And then there's the parent.

01:23:12.620 --> 01:23:14.340
So there's a structure here.

01:23:14.340 --> 01:23:16.790
That has a corresponding
blue triangle.

01:23:16.790 --> 01:23:20.050
And then its parent,
that's another triangle.

01:23:20.050 --> 01:23:24.990
That contains-- I'm
looking at a key k here.

01:23:24.990 --> 01:23:28.580
All of these triangles
contain the key k.

01:23:28.580 --> 01:23:32.800
And so key k will be
duplicated all this many times,

01:23:32.800 --> 01:23:37.260
but how many subtrees is k in?

01:23:37.260 --> 01:23:39.430
Log n.

01:23:39.430 --> 01:23:45.160
Each key, fundamental fact
about balanced binary search

01:23:45.160 --> 01:23:51.684
trees, each key lives
in log n subtrees.

01:23:51.684 --> 01:23:52.850
Namely the subtrees of all its ancestors.

01:24:00.000 --> 01:24:01.180
Awesome.

01:24:01.180 --> 01:24:05.430
Because that means the
total space is n log n.

01:24:05.430 --> 01:24:06.860
There's n keys.

01:24:06.860 --> 01:24:09.145
Each of them is duplicated
at most log n times.
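
NOTE
A small check of the space argument, assuming a perfectly balanced x-tree
in which every node's y-structure stores its whole subtree. The function
stored_copies is a hypothetical helper that only counts sizes.
```python
from functools import lru_cache
#
@lru_cache(maxsize=None)
def stored_copies(n: int) -> int:
    """Total size of all y-structures over a balanced x-tree on n keys."""
    if n == 0:
        return 0
    left = (n - 1) // 2          # the root takes one key; split the rest evenly
    right = n - 1 - left
    return n + stored_copies(left) + stored_copies(right)
#
for n in (7, 1023, 10**6):
    levels = n.bit_length()      # floor(log2 n) + 1 levels in the balanced tree
    print(n, stored_copies(n), n * levels)
```
Every key is counted once per ancestor, itself included, so the total never
exceeds n times the number of levels.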

01:24:12.060 --> 01:24:15.610
In general, n times log n
to the d minus 1.

01:24:15.610 --> 01:24:19.420
So if you do it in 3D,
each of the blue trees,

01:24:19.420 --> 01:24:21.430
every node in it has a
corresponding pointer

01:24:21.430 --> 01:24:25.990
to a red tree
that's sorted by z.

01:24:25.990 --> 01:24:29.050
And you just keep doing this,
sort of, nested searching,

01:24:29.050 --> 01:24:30.610
like super augmentation.

01:24:30.610 --> 01:24:34.520
But you're only losing a log
factor for each dimension you add.