WEBVTT

00:00:00.090 --> 00:00:02.490
The following content is
provided under a Creative

00:00:02.490 --> 00:00:04.059
Commons license.

00:00:04.059 --> 00:00:06.330
Your support will help
MIT OpenCourseWare

00:00:06.330 --> 00:00:10.720
continue to offer high quality
educational resources for free.

00:00:10.720 --> 00:00:13.320
To make a donation or
view additional materials

00:00:13.320 --> 00:00:17.280
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.280 --> 00:00:19.790
at ocw.mit.edu.

00:00:19.790 --> 00:00:20.790
ERIK DEMAINE: All right.

00:00:20.790 --> 00:00:24.270
Today's lecture's full of
tries and trays, and trees.

00:00:24.270 --> 00:00:25.170
Oh, my.

00:00:25.170 --> 00:00:30.150
Lots of different synonyms
all coming from trees.

00:00:30.150 --> 00:00:31.620
In particular,
we're going to cover

00:00:31.620 --> 00:00:34.320
suffix trees today and various
representations of them,

00:00:34.320 --> 00:00:36.387
and how to build
them in linear time.

00:00:36.387 --> 00:00:37.470
Now, they are good things.

00:00:37.470 --> 00:00:39.840
Some of you may have
seen suffix trees before,

00:00:39.840 --> 00:00:43.260
but hopefully, haven't actually
seen most of the things

00:00:43.260 --> 00:00:46.360
we're going to cover,
except for the very basics.

00:00:46.360 --> 00:00:50.580
So the general problem we're
interested in solving today

00:00:50.580 --> 00:00:51.480
is string matching.

00:00:57.660 --> 00:01:02.610
And in string matching
we are given two strings.

00:01:02.610 --> 00:01:07.290
One of them we call the
text T and the other one

00:01:07.290 --> 00:01:18.630
we call a pattern P.
These are both strings

00:01:18.630 --> 00:01:20.955
over some alphabet.

00:01:20.955 --> 00:01:25.950
And the alphabet we're going
to always call capital Sigma.

00:01:25.950 --> 00:01:26.947
Think of that.

00:01:26.947 --> 00:01:27.780
It could be binary--

00:01:27.780 --> 00:01:28.470
0 and 1.

00:01:28.470 --> 00:01:31.950
Could be ASCII, so there's
256 characters in there.

00:01:31.950 --> 00:01:35.280
Could be Unicode-- pick
your favorite alphabet.

00:01:35.280 --> 00:01:40.410
Then it could be ACGT for DNA.

00:01:40.410 --> 00:01:43.486
And the goal is to
find the occurrences

00:01:43.486 --> 00:01:44.610
of the pattern in the text.

00:01:52.960 --> 00:01:55.170
Could be we want to find
some of those occurrences

00:01:55.170 --> 00:01:56.860
or all of them, or count them.

00:02:09.870 --> 00:02:14.280
And in this lecture, we're
only interested in substring

00:02:14.280 --> 00:02:14.930
searches.

00:02:14.930 --> 00:02:17.580
So the pattern is just a string.

00:02:17.580 --> 00:02:25.000
You want to know all the
places where P occurs.

00:02:25.000 --> 00:02:29.910
P might appear multiple times,
even overlapping itself--

00:02:29.910 --> 00:02:31.740
in those two
positions, whatever.

00:02:31.740 --> 00:02:34.446
You want to find all the shifts
of P where it's identical to T.
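
NOTE
A minimal sketch of the substring-search problem as stated above, not a method from the lecture.
It is the naive O(|T| * |P|) scan, shown only to pin down what "all shifts" means.
The function name occurrences is a hypothetical choice for illustration.
```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return every shift s such that text[s:s+len(pattern)] == pattern."""
    m = len(pattern)
    # Try every shift in turn; overlapping matches are reported as well.
    return [s for s in range(len(text) - m + 1) if text[s:s + m] == pattern]
print(occurrences("banana", "ana"))  # the two matches overlap: shifts 1 and 3
```
The data structure version discussed next avoids rescanning the text on every query.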

00:02:34.446 --> 00:02:35.820
Now, there are
lots of variations

00:02:35.820 --> 00:02:37.278
on this problem
which we won't look

00:02:37.278 --> 00:02:41.700
at in this lecture, such as
when the pattern has wildcards

00:02:41.700 --> 00:02:44.152
in it, or you could imagine
it being a regular expression,

00:02:44.152 --> 00:02:45.610
or you don't want
to match exactly,

00:02:45.610 --> 00:02:48.930
you want to match approximately,
you could have some mismatches,

00:02:48.930 --> 00:02:52.170
or it could require
some edits to match

00:02:52.170 --> 00:02:54.045
T. We're not going to
look at those problems.

00:02:56.920 --> 00:02:59.460
This is both an algorithmic
problem and a data structures

00:02:59.460 --> 00:03:00.780
problem.

00:03:00.780 --> 00:03:02.730
If I give you the
text and the pattern,

00:03:02.730 --> 00:03:04.200
I just want to know the answer.

00:03:04.200 --> 00:03:05.880
You can do that in
linear time-- it's

00:03:05.880 --> 00:03:12.694
famous Knuth-Morris-Pratt, or
Boyer-Moore, or Rabin-Karp.

00:03:12.694 --> 00:03:14.610
Lots of linear time
algorithms for doing that.

00:03:14.610 --> 00:03:16.410
We're interested in
the data structure

00:03:16.410 --> 00:03:21.070
version of the problem,
static data structure.

00:03:21.070 --> 00:03:24.210
So we're given
the text up front,

00:03:24.210 --> 00:03:27.660
given T. We want
to preprocess T.

00:03:27.660 --> 00:03:34.110
And then the query
consists of the pattern.

00:03:34.110 --> 00:03:37.980
Imagine T being very
big, P being not so big.

00:03:37.980 --> 00:03:45.400
And we'd like to spend
something like order

00:03:45.400 --> 00:03:47.820
P time to do a query.

00:03:51.222 --> 00:03:53.430
That would be ideal because
you have to at least look

00:03:53.430 --> 00:03:55.170
at the query and
you don't really

00:03:55.170 --> 00:03:57.240
want to spend time
looking at the text.

00:03:57.240 --> 00:04:03.405
You'd also like something
like order T space.

00:04:03.405 --> 00:04:05.280
We don't want the space
of the data structure

00:04:05.280 --> 00:04:07.590
to be much bigger than
the original text.

00:04:07.590 --> 00:04:10.727
So these are goals which we will
more or less achieve, depending

00:04:10.727 --> 00:04:12.060
on exactly the problem you want.

00:04:12.060 --> 00:04:13.684
Sometimes we'll
achieve this, sometimes

00:04:13.684 --> 00:04:16.019
we'll achieve almost this.

00:04:16.019 --> 00:04:21.492
But these are really nice
running times and space.

00:04:21.492 --> 00:04:22.200
It's all optimal.

00:04:24.710 --> 00:04:26.660
Before we get to
that problem, I want

00:04:26.660 --> 00:04:32.400
to solve a simpler problem which
is necessary to solve this one.

00:04:32.400 --> 00:04:33.800
We'll call it a warm up.

00:04:36.460 --> 00:04:41.860
And that's a good friend--
the predecessor problem,

00:04:41.860 --> 00:04:43.220
but now among strings.

00:04:48.060 --> 00:04:50.970
Let's say we have k strings--

00:04:50.970 --> 00:04:54.410
k texts-- T1 to T k.

00:04:54.410 --> 00:04:57.200
And now the query is
you're given some pattern P

00:04:57.200 --> 00:04:59.090
and you want to
know where P fits

00:04:59.090 --> 00:05:02.070
among these strings
in lexical order.

00:05:02.070 --> 00:05:04.160
So a regular predecessor,
but now comparison

00:05:04.160 --> 00:05:06.080
is string comparison.

00:05:06.080 --> 00:05:08.964
Of course, you could try to
solve that using our existing

00:05:08.964 --> 00:05:10.130
predecessor data structures.

00:05:10.130 --> 00:05:11.457
But they won't do very well.

00:05:11.457 --> 00:05:13.040
Even a binary search
tree is not going

00:05:13.040 --> 00:05:15.350
to do well here because
comparing two strings

00:05:15.350 --> 00:05:18.350
could take a very long time
if those strings are long.

00:05:18.350 --> 00:05:19.980
So we don't want to do that.

00:05:19.980 --> 00:05:24.470
Instead, we're going
to build a trie.

00:05:24.470 --> 00:05:28.010
Now, tries we've
seen in fast sorting

00:05:28.010 --> 00:05:31.540
lecture, when w is at least
log to the two plus epsilon of n.

00:05:34.110 --> 00:05:36.740
We used tries in a
particular setting there.

00:05:36.740 --> 00:05:40.130
We're going to use them in
their more native setting

00:05:40.130 --> 00:05:41.810
today a lot.

00:05:45.080 --> 00:05:51.710
In this setting-- again,
a trie is a rooted tree.

00:05:51.710 --> 00:05:53.495
The children
branches are labeled.

00:06:00.000 --> 00:06:01.710
And in this case,
they're labeled

00:06:01.710 --> 00:06:04.440
with letters in the alphabet--

00:06:04.440 --> 00:06:06.030
Sigma.

00:06:06.030 --> 00:06:07.830
So you have a node.

00:06:07.830 --> 00:06:13.020
And let's say, we have the
English alphabet-- a, b,

00:06:13.020 --> 00:06:14.586
up to z.

00:06:14.586 --> 00:06:16.560
Those are your 26
possible children.

00:06:16.560 --> 00:06:19.200
Some of them may not exist,
they are null pointers.

00:06:19.200 --> 00:06:23.340
Others may point
to actual nodes.

00:06:23.340 --> 00:06:25.560
That is a trie in
its native setting,

00:06:25.560 --> 00:06:28.750
which is when the alphabet
is something you care about.

00:06:28.750 --> 00:06:31.260
Now, when we used tries
before, our alphabet

00:06:31.260 --> 00:06:33.240
just represented some
digit in some kind

00:06:33.240 --> 00:06:35.900
of arbitrary representation.

00:06:35.900 --> 00:06:38.160
The digit was made up of
log to the epsilon bits.

00:06:38.160 --> 00:06:39.839
We were just using it as a tool.

00:06:39.839 --> 00:06:41.380
But this is where
tries actually come

00:06:41.380 --> 00:06:45.660
from-- they come from trying
to retrieve strings out

00:06:45.660 --> 00:06:48.840
of some database, in this case.

00:06:48.840 --> 00:06:52.320
We're doing predecessor--
this is a practical problem.

00:06:52.320 --> 00:06:54.850
Like a lot of library
search engines,

00:06:54.850 --> 00:06:56.850
you type in the beginning
of the title of a book

00:06:56.850 --> 00:07:01.080
and you want to know what is the
preceding and succeeding book

00:07:01.080 --> 00:07:03.690
title of what you query for.

00:07:03.690 --> 00:07:06.202
So this is something
people care about.

00:07:06.202 --> 00:07:07.910
Although really they
want us-- typically,

00:07:07.910 --> 00:07:10.440
we want to solve this
problem because it's harder.

00:07:13.020 --> 00:07:15.390
So that's a trie.

00:07:15.390 --> 00:07:23.070
Now, to make this actually
work, what we'd like to do

00:07:23.070 --> 00:07:25.410
is represent our strings.

00:07:25.410 --> 00:07:29.510
So how do we use this structure
to represent strings T1 to T k?

00:07:33.090 --> 00:07:35.960
We're going to
represent those strings

00:07:35.960 --> 00:07:38.670
in the obvious way, which
we've done many times

00:07:38.670 --> 00:07:41.970
in the past when we were doing
integer data structures--

00:07:41.970 --> 00:07:43.365
as root to leaf paths.

00:07:48.790 --> 00:07:51.370
Because any root to leaf path
is just a sequence of letters,

00:07:51.370 --> 00:07:52.203
and that's a string.

00:07:52.203 --> 00:07:54.520
So we just throw them in there.

00:07:54.520 --> 00:07:59.360
Now, to do that, we need to
change things a little bit.

00:07:59.360 --> 00:08:07.390
We're going to add a new letter,
which we usually represent as a $

00:08:07.390 --> 00:08:10.060
sign, to the end
of every string.

00:08:21.550 --> 00:08:24.390
I have an example.

00:08:24.390 --> 00:08:33.039
We're going to do four strings--

00:08:37.679 --> 00:08:41.590
various spellings
of Anna and Ann.

00:08:41.590 --> 00:08:47.170
And say, we'd like to
throw these into a trie.

00:08:47.170 --> 00:08:48.490
They all start with a.

00:08:48.490 --> 00:08:52.600
So at the root, there's going to
be four branches corresponding

00:08:52.600 --> 00:08:55.940
to $ sign, a, e, and n.

00:08:55.940 --> 00:08:57.850
I'm supposing my
alphabet is just a,

00:08:57.850 --> 00:09:01.150
e, n because that's
all that appears here.

00:09:01.150 --> 00:09:04.570
But everything will
be on the a branch.

00:09:04.570 --> 00:09:08.420
And then from there
we're going to have--

00:09:08.420 --> 00:09:12.710
let's see-- they
all go to n next.

00:09:12.710 --> 00:09:16.780
So they all follow this branch.

00:09:16.780 --> 00:09:20.950
Then one of them goes to a.

00:09:20.950 --> 00:09:23.440
These all go to n afterwards.

00:09:23.440 --> 00:09:31.240
So we've got $ sign,
a we use, e, n we use.

00:09:31.240 --> 00:09:33.880
And on the a
branch, we are done.

00:09:33.880 --> 00:09:38.680
This corresponds to a, n, a.

00:09:38.680 --> 00:09:39.310
We're finished.

00:09:39.310 --> 00:09:42.670
And we imagine there being $
sign at the end of this string.

00:09:42.670 --> 00:09:48.940
So we follow the $ sign child.

00:09:48.940 --> 00:09:50.920
The others are blank.

00:09:50.920 --> 00:09:55.870
And this leaf here
corresponds to a, n, a.

00:09:55.870 --> 00:10:00.714
On the other hand, if
we go down a, n, n,

00:10:00.714 --> 00:10:01.630
there's three options.

00:10:01.630 --> 00:10:03.000
We could be done.

00:10:03.000 --> 00:10:05.750
Or there could be an
a or an e to follow.

00:10:05.750 --> 00:10:09.310
So if we're done, that would
correspond to the $ sign

00:10:09.310 --> 00:10:10.660
pointer.

00:10:10.660 --> 00:10:14.530
That's going to be a leaf
corresponding to this string

00:10:14.530 --> 00:10:16.840
here.

00:10:16.840 --> 00:10:22.465
Or it could be an a
and then we're done.

00:10:26.380 --> 00:10:28.920
And then we have a
leaf corresponding

00:10:28.920 --> 00:10:32.640
to Anna, a, n, n, a.

00:10:32.640 --> 00:10:39.230
Or could be we have an e
next and then we're done.

00:10:44.210 --> 00:10:44.970
OK.

00:10:44.970 --> 00:10:46.770
Not very exciting
but that is the trie

00:10:46.770 --> 00:10:52.260
representation of a, n, a; a,
n, n; a, n, n, a; a, n, n, e.

00:10:52.260 --> 00:10:57.930
And you can see there is exactly
one leaf per word down here.

00:10:57.930 --> 00:11:00.840
And furthermore, if you
take an in-order traversal

00:11:00.840 --> 00:11:03.934
of those leaves, you get
these strings in order.

00:11:03.934 --> 00:11:05.850
And typically, if you're
going to store a data

00:11:05.850 --> 00:11:08.379
structure like this, you would
store these actual pointers.

00:11:08.379 --> 00:11:10.670
So once you get to a leaf,
you know which word matched.
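
NOTE
A rough sketch of the trie just described, using the four example words.
Assumptions: the class Node and the helpers build_trie and leaves_in_order are hypothetical names,
children sit in a Python dict, and "$" is taken to sort before every letter (true in ASCII).
```python
END = "$"  # terminator appended to every string; assumed smaller than any letter
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.word = None    # set only at a leaf: pointer back to the stored word
def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word + END:
            node = node.children.setdefault(ch, Node())
        node.word = word
    return root
def leaves_in_order(node):
    """In-order traversal: visit children by increasing letter, yield leaf words."""
    if node.word is not None:
        yield node.word
    for ch in sorted(node.children):
        yield from leaves_in_order(node.children[ch])
trie = build_trie(["anna", "ana", "anne", "ann"])
print(list(leaves_in_order(trie)))
```
The traversal lists ana, ann, anna, anne, which is the sorted order, one leaf per word.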

00:11:13.760 --> 00:11:15.600
So that's a trie.

00:11:15.600 --> 00:11:18.885
Seems pretty trivial.

00:11:18.885 --> 00:11:19.385
Trievial?

00:11:23.860 --> 00:11:26.050
But it turns out there's
something already

00:11:26.050 --> 00:11:30.580
pretty interesting about
this data structure.

00:11:30.580 --> 00:11:32.820
How do you do a
predecessor search?

00:11:32.820 --> 00:11:37.600
If I'm searching for something
like, I don't know, a, n, e--

00:11:37.600 --> 00:11:39.850
because I made a typo--

00:11:39.850 --> 00:11:44.410
then I follow a, n, and then
I follow this e branch here

00:11:44.410 --> 00:11:45.490
and discover-- whoops--

00:11:45.490 --> 00:11:46.960
there's nothing here.

00:11:46.960 --> 00:11:49.590
But right at that node I see,
OK, well, my predecessor is

00:11:49.590 --> 00:11:51.610
going to be the max
in this subtree, which

00:11:51.610 --> 00:11:53.500
happens to be a, n, a.

00:11:53.500 --> 00:11:56.230
My successor is going to be
the min in this subtree, which

00:11:56.230 --> 00:11:57.790
happens to be a, n, n.

00:11:57.790 --> 00:11:59.170
And so I find what I want.
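
NOTE
A sketch of the predecessor/successor walk described above, assuming each node stores its subtree min and max.
The names Node, build and pred_succ are hypothetical. The per-node scan over children here costs O(Sigma);
how to do that step faster is the node-representation question the lecture turns to next.
```python
END = "$"  # terminator; assumed to sort before every real letter
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.lo = None      # smallest word in this subtree
        self.hi = None      # largest word in this subtree
def build(words):
    root = Node()
    for word in sorted(words):
        node = root
        for ch in word + END:
            node = node.children.setdefault(ch, Node())
            if node.lo is None:
                node.lo = word  # words arrive sorted, so the first one is the min
            node.hi = word      # and the most recent one is the max
    return root
def pred_succ(root, pattern):
    """Return (predecessor, successor); both equal pattern on an exact match."""
    pred = succ = None
    node = root
    for ch in pattern + END:
        smaller = [c for c in node.children if c < ch]
        larger = [c for c in node.children if c > ch]
        if smaller:
            pred = node.children[max(smaller)].hi  # max of nearest smaller subtree
        if larger:
            succ = node.children[min(larger)].lo   # min of nearest larger subtree
        if ch not in node.children:
            return pred, succ  # fell off the trie: the candidates are the answer
        node = node.children[ch]
    return pattern, pattern
root = build(["ana", "ann", "anna", "anne"])
print(pred_succ(root, "ane"))
```
For the typo "ane" this reports ana as the predecessor and ann as the successor, matching the walk above.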

00:11:59.170 --> 00:12:01.010
How long does it
take me to do that?

00:12:01.010 --> 00:12:03.970
Well, if I store
subtree mins and maxes,

00:12:03.970 --> 00:12:05.890
then I just have to
walk down the tree.

00:12:05.890 --> 00:12:09.034
That will take order
P time to walk down.

00:12:09.034 --> 00:12:10.450
And then, once I'm
at a node, I've

00:12:10.450 --> 00:12:14.740
got to do a predecessor
or successor in the node.

00:12:14.740 --> 00:12:15.740
So there are two issues.

00:12:15.740 --> 00:12:19.030
One is, given a node, how do
you know which way to walk down?

00:12:19.030 --> 00:12:21.130
And then, when you're
done, how do you

00:12:21.130 --> 00:12:22.270
do predecessor in a node?

00:12:22.270 --> 00:12:23.630
It's the fundamental question.

00:12:23.630 --> 00:12:25.505
Now, this is something
we spent a lot of time

00:12:25.505 --> 00:12:26.950
doing in, say, fusion trees.

00:12:26.950 --> 00:12:28.160
That was the big challenge.

00:12:28.160 --> 00:12:30.310
So this is not
really so trivial--

00:12:30.310 --> 00:12:32.560
how do I represent a node?

00:12:32.560 --> 00:12:35.470
One way to make it trivial is
to assume that the alphabet is

00:12:35.470 --> 00:12:37.145
constant size, like two.

00:12:37.145 --> 00:12:38.770
Then, of course,
there's nothing to do.

00:12:38.770 --> 00:12:40.370
It's a binary trie.

00:12:40.370 --> 00:12:41.890
You look at 0, you look at 1.

00:12:41.890 --> 00:12:44.560
You can figure out
anything you need to do

00:12:44.560 --> 00:12:45.780
if the alphabet is constant.

00:12:45.780 --> 00:12:47.530
But things get interesting
if you imagine,

00:12:47.530 --> 00:12:49.150
well, the alphabet
is some parameter,

00:12:49.150 --> 00:12:50.600
we don't know how big it is.

00:12:50.600 --> 00:12:51.950
It might be substantial.

00:12:51.950 --> 00:12:55.460
So let's think about
how you might represent

00:12:55.460 --> 00:12:58.390
a trie or the node of a trie.

00:13:03.790 --> 00:13:08.320
Let's call this trie
node representation.

00:13:11.620 --> 00:13:13.040
Any suggestions?

00:13:13.040 --> 00:13:15.550
What are the obvious ways to
represent the node of a trie?

00:13:15.550 --> 00:13:18.156
Nothing fancy.

00:13:18.156 --> 00:13:19.780
I have three obvious
answers, at least.

00:13:19.780 --> 00:13:20.547
AUDIENCE: Array.

00:13:20.547 --> 00:13:21.380
ERIK DEMAINE: Array.

00:13:21.380 --> 00:13:21.570
Good.

00:13:21.570 --> 00:13:22.570
That was my number one.

00:13:25.090 --> 00:13:26.740
Any more?

00:13:26.740 --> 00:13:28.365
That's I think the most obvious.

00:13:28.365 --> 00:13:28.990
AUDIENCE: Tree.

00:13:28.990 --> 00:13:29.915
ERIK DEMAINE: Tree.

00:13:29.915 --> 00:13:30.415
Good.

00:13:30.415 --> 00:13:32.350
Do a binary search tree.

00:13:32.350 --> 00:13:32.925
Or?

00:13:32.925 --> 00:13:33.800
AUDIENCE: Hash table.

00:13:33.800 --> 00:13:34.360
ERIK DEMAINE: Hash table.

00:13:34.360 --> 00:13:34.860
Good.

00:13:39.550 --> 00:13:44.950
So for each of them we
have query time and space.

00:13:48.590 --> 00:13:51.420
If I use an array,
meaning I have--

00:13:51.420 --> 00:13:53.320
let's say, for a through z--

00:13:53.320 --> 00:13:59.710
I have a pointer that either
is null or points to the child.

00:13:59.710 --> 00:14:02.170
This is going to be really
fast because, at a node,

00:14:02.170 --> 00:14:05.350
if I want to know, I just
look at that i-th letter

00:14:05.350 --> 00:14:08.560
in my pattern P. I
say, oh, it's a j.

00:14:08.560 --> 00:14:11.577
So I look at the j
position and I follow it.

00:14:11.577 --> 00:14:13.910
You might wonder, how do I
do predecessor and successor?

00:14:13.910 --> 00:14:15.493
Well, this is a
static data structure.

00:14:15.493 --> 00:14:17.470
So for every cell,
if it's null, I

00:14:17.470 --> 00:14:21.460
can store the predecessor
and successor in the node.

00:14:21.460 --> 00:14:23.910
With no more space.

00:14:23.910 --> 00:14:28.270
This is Sigma space per node.

00:14:28.270 --> 00:14:34.570
So the amount of space is T
Sigma, which is not so great.

00:14:34.570 --> 00:14:37.388
But the query is fast,
query is order P time.
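
NOTE
A sketch of the array representation of one trie node, under assumed details.
ALPHABET, slot and make_array_node are hypothetical names; the alphabet is "$" plus a to z.
Each empty cell records the nearest occupied letter on either side, as suggested above for the static case.
```python
ALPHABET = "$abcdefghijklmnopqrstuvwxyz"  # assumed alphabet, listed in sorted order
def slot(ch):
    """Array index of a letter: '$' is 0, 'a'..'z' are 1..26."""
    return 0 if ch == "$" else ord(ch) - ord("a") + 1
def make_array_node(children):
    """children maps letter to child; returns one cell per alphabet letter."""
    cells = [None] * len(ALPHABET)
    prev = None
    for i, ch in enumerate(ALPHABET):  # left-to-right pass records predecessors
        if ch in children:
            cells[i] = ["child", children[ch]]
            prev = ch
        else:
            cells[i] = ["gap", prev, None]
    nxt = None
    for i in range(len(ALPHABET) - 1, -1, -1):  # right-to-left pass records successors
        if cells[i][0] == "child":
            nxt = ALPHABET[i]
        else:
            cells[i][2] = nxt
    return cells
node = make_array_node({"a": "subtree-a", "n": "subtree-n"})
print(node[slot("e")])
```
Lookup is a single index operation, but every node pays for all Sigma cells, which is where the T Sigma space comes from.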

00:14:40.230 --> 00:14:40.850
BST.

00:14:40.850 --> 00:14:43.110
The idea of the BST
is instead of having

00:14:43.110 --> 00:14:47.270
a node that has some pointers,
some of which may be absent,

00:14:47.270 --> 00:14:54.050
let's expand it out into
something like this.

00:14:54.050 --> 00:14:55.610
Actually, I'll use colors.

00:14:55.610 --> 00:14:58.820
This will make life a little
bit cleaner in a moment.

00:14:58.820 --> 00:15:01.830
Because we are going to
modify this approach.

00:15:01.830 --> 00:15:06.470
So let's say that the pointers
you care about are red.

00:15:06.470 --> 00:15:08.670
Those are the actual letter
pointers you want to do.

00:15:08.670 --> 00:15:11.360
So the idea is to expand
out this high degree node

00:15:11.360 --> 00:15:12.950
into binary nodes.

00:15:12.950 --> 00:15:16.150
You put appropriate keys in here
so you can do a binary search.

00:15:16.150 --> 00:15:20.120
And then, eventually, you get
down to where you need to go.

00:15:20.120 --> 00:15:23.960
This structure has
height log Sigma.

00:15:23.960 --> 00:15:28.110
So the query time is
going to be P log Sigma.

00:15:28.110 --> 00:15:31.310
So that goes up a
little bit, not perfect.

00:15:31.310 --> 00:15:33.300
But the space now
becomes linear,

00:15:33.300 --> 00:15:35.210
so that's an improvement.
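
NOTE
A sketch of the search-tree idea for one node. As a stand-in for the balanced BST drawn in lecture,
it uses a sorted list with binary search, which gives the same O(log Sigma) per node in the static case.
SortedNode and find are hypothetical names.
```python
from bisect import bisect_left
class SortedNode:
    def __init__(self, children):
        self.keys = sorted(children)                  # only the letters that exist
        self.kids = [children[k] for k in self.keys]  # children in the same order
    def find(self, ch):
        """Return (child or None, predecessor letter, successor letter)."""
        i = bisect_left(self.keys, ch)
        hit = i < len(self.keys) and self.keys[i] == ch
        child = self.kids[i] if hit else None
        pred = self.keys[i - 1] if i > 0 else None
        j = i + 1 if hit else i
        succ = self.keys[j] if j < len(self.keys) else None
        return child, pred, succ
node = SortedNode({"a": "subtree-a", "n": "subtree-n"})
print(node.find("e"))
```
Space is proportional to the number of children actually present, so the whole trie stays linear.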

00:15:35.210 --> 00:15:38.630
Ideally, we'd like the
best of both of these--

00:15:38.630 --> 00:15:42.740
optimal query time, optimal
space, linear space.

00:15:42.740 --> 00:15:45.050
And hash tables achieve that.

00:15:45.050 --> 00:15:51.530
They give you order P
query and order T space.

00:15:51.530 --> 00:15:53.840
Again, the issue is some of
these cells are absent so

00:15:53.840 --> 00:15:54.740
don't use an array.

00:15:54.740 --> 00:15:56.840
That's like a direct
mapped hash table.

00:15:56.840 --> 00:15:58.700
Use a hash table, use hashing.

00:15:58.700 --> 00:16:01.880
That way you can use linear
space per node, however many

00:16:01.880 --> 00:16:04.380
occupied children there are.

00:16:04.380 --> 00:16:06.020
What is T here, by the way?

00:16:06.020 --> 00:16:11.120
T is the sum of the
lengths of the T i's--

00:16:11.120 --> 00:16:13.220
because here we're
storing multiple T i's.

00:16:13.220 --> 00:16:19.700
Or it's the number of
nodes in the tree, which,

00:16:19.700 --> 00:16:22.160
if your strings happen to
have a lot of common prefixes,

00:16:22.160 --> 00:16:24.159
the number of nodes in
the trie could be smaller

00:16:24.159 --> 00:16:28.980
than that, but not in general.

00:16:28.980 --> 00:16:30.841
What's the problem
with the hash table?

00:16:30.841 --> 00:16:31.340
Question.

00:16:31.340 --> 00:16:38.330
AUDIENCE: [INAUDIBLE]

00:16:38.330 --> 00:16:39.080
ERIK DEMAINE: Yes.

00:16:39.080 --> 00:16:41.570
For the BST, we need to
store some keys in this node.

00:16:41.570 --> 00:16:42.992
That lets you do
a binary search.

00:16:42.992 --> 00:16:44.450
For example, every
node could store

00:16:44.450 --> 00:16:46.220
the max in the left subtree--

00:16:46.220 --> 00:16:48.353
just within this
little tree, though.

00:16:48.353 --> 00:16:53.090
AUDIENCE: [INAUDIBLE]

00:16:53.090 --> 00:16:54.980
ERIK DEMAINE: It
is order T. Sorry,

00:16:54.980 --> 00:16:57.040
I see-- why is it
not O T Sigma space?

00:16:57.040 --> 00:16:58.580
You're only storing
one letter here,

00:16:58.580 --> 00:17:02.130
so that fits in a single
word, and two pointers.

00:17:02.130 --> 00:17:04.099
So every node only
takes constant space.

00:17:04.099 --> 00:17:08.730
It's only T space, not T Sigma.

00:17:08.730 --> 00:17:10.940
Other questions?

00:17:10.940 --> 00:17:11.450
Or answers?

00:17:11.450 --> 00:17:13.790
There's a problem with hashing--

00:17:13.790 --> 00:17:18.260
doesn't actually solve the
problem we want to solve.

00:17:18.260 --> 00:17:20.520
It doesn't solve predecessor.

00:17:20.520 --> 00:17:22.940
Because hashing mixes up
the order of the nodes.

00:17:22.940 --> 00:17:25.010
This is the problem
we had with--

00:17:25.010 --> 00:17:29.460
what's it called-- signature
sort, which hashed,

00:17:29.460 --> 00:17:32.790
it messed up, it permuted
all the things in the nodes

00:17:32.790 --> 00:17:34.950
and so you didn't know--

00:17:34.950 --> 00:17:37.410
I mean, in a hash table,
you can't solve predecessor.

00:17:37.410 --> 00:17:40.289
That's what the
predecessor problem is for.

00:17:40.289 --> 00:17:42.330
I guess you could try to
throw a predecessor data

00:17:42.330 --> 00:17:43.500
structure in here.

00:17:43.500 --> 00:17:46.960
Actually, I hadn't
thought of that before.

00:17:46.960 --> 00:17:49.680
So we could use y-fast
tries or something.

00:17:49.680 --> 00:17:55.650
And we would get
order T space and--

00:17:55.650 --> 00:17:58.510
I guess, with high
probability, this

00:17:58.510 --> 00:18:00.990
is also with high probability--

00:18:00.990 --> 00:18:06.180
we get order P log
log Sigma, I guess.

00:18:06.180 --> 00:18:09.270
Because I use Van Emde Boas.

00:18:09.270 --> 00:18:15.840
I'm going to have to call
it 3.5, Van Emde Boas.

00:18:15.840 --> 00:18:17.280
So that would be
another approach.

00:18:17.280 --> 00:18:22.409
So hashing will not
do a predecessor.

00:18:22.409 --> 00:18:24.950
It'll do exact search, which is
still an interesting problem.

00:18:24.950 --> 00:18:26.783
I might give you some
string and want to know--

00:18:26.783 --> 00:18:29.640
is this string in your set?

00:18:29.640 --> 00:18:32.734
But it won't solve the
predecessor problem.
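
NOTE
A sketch of hash-table nodes, with Python dicts standing in for the hash tables.
The names build and contains are hypothetical. It answers exact membership in O(P) expected time,
and nothing in it can report a neighbor of a missing letter, which is the limitation just described.
```python
END = "$"  # terminator appended to every stored string
def build(words):
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})  # one hash lookup per letter
    return root
def contains(root, pattern):
    """Exact search only: is pattern one of the stored strings?"""
    node = root
    for ch in pattern + END:
        if ch not in node:
            return False
        node = node[ch]
    return True
root = build(["ana", "ann", "anna", "anne"])
print(contains(root, "ann"), contains(root, "ane"))
```
A query for "ane" only learns that the e child is absent; the dict keeps no order to fall back on.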

00:18:32.734 --> 00:18:34.650
So this is an interesting
solution-- hashing--

00:18:34.650 --> 00:18:36.070
but not quite what we want.

00:18:36.070 --> 00:18:38.361
And Van Emde Boas doesn't
quite do what we want either.

00:18:38.361 --> 00:18:40.620
It improves over
the BST approach

00:18:40.620 --> 00:18:42.900
but we get another log in there.

00:18:42.900 --> 00:18:46.080
But it's still not order
P. I kind of like order P.

00:18:46.080 --> 00:18:49.360
Or at least, instead of
order P times log log Sigma,

00:18:49.360 --> 00:18:54.030
I kind of like order
P plus log Sigma.

00:18:54.030 --> 00:18:56.470
And order P plus
log Sigma is known.

00:18:56.470 --> 00:19:01.010
So that's what I want
to tell you about.

00:19:01.010 --> 00:19:03.710
And this is normally
done with a structure

00:19:03.710 --> 00:19:09.740
called trays, which is
a portmanteau, I guess,

00:19:09.740 --> 00:19:12.410
of tree and array.

00:19:12.410 --> 00:19:15.140
Somewhere in there there's
a tree and an array,

00:19:15.140 --> 00:19:18.630
so it's a bit of
an awkward word.

00:19:18.630 --> 00:19:23.690
Those were developed by
Kopelowitz and Lewenstein,

00:19:23.690 --> 00:19:28.460
in 2006, a fairly
recent innovation.

00:19:28.460 --> 00:19:31.240
I'll have this number 6--

00:19:31.240 --> 00:19:41.900
trays, achieve order P plus
log Sigma and order T space.

00:19:41.900 --> 00:19:42.870
So this is pretty good.

00:19:42.870 --> 00:19:45.680
And they will do predecessor
and successor-- definitely

00:19:45.680 --> 00:19:47.037
an improvement over the BST.

00:19:50.690 --> 00:19:55.460
It's open whether you could
do order P plus log log Sigma.

00:19:55.460 --> 00:19:58.530
This is as far as I can tell,
no one has worked on this.

00:19:58.530 --> 00:20:03.920
Maybe we will work on it today.

00:20:03.920 --> 00:20:05.360
So something to think about--

00:20:05.360 --> 00:20:06.901
whether you could
get the best of all

00:20:06.901 --> 00:20:09.174
of these worlds for predecessor.

00:20:09.174 --> 00:20:11.590
There's a lower bound-- you
need to spend at least log log

00:20:11.590 --> 00:20:12.170
Sigma time.

00:20:12.170 --> 00:20:14.319
Because even if your
trie is a single node,

00:20:14.319 --> 00:20:15.860
you have the
predecessor lower bound.

00:20:15.860 --> 00:20:18.470
And we know log log universe
size is the best you

00:20:18.470 --> 00:20:19.860
can do in this regime.

00:20:23.320 --> 00:20:26.560
So that's where we're going.

00:20:26.560 --> 00:20:28.270
Instead of describing
trays, though, I'm

00:20:28.270 --> 00:20:29.645
going to describe
a new way to do

00:20:29.645 --> 00:20:32.200
it, which has never
been seen before

00:20:32.200 --> 00:20:33.452
in any class or anywhere.

00:20:33.452 --> 00:20:34.410
Because it's brand new.

00:20:34.410 --> 00:20:37.720
It's developed by Martin
Farach-Colton, who

00:20:37.720 --> 00:20:41.290
did the LCA and the
level ancestor structures

00:20:41.290 --> 00:20:43.480
that we saw in last class.

00:20:43.480 --> 00:20:46.160
And he just told it to
me and it's really cool

00:20:46.160 --> 00:20:47.590
so we're going to cover it.

00:20:51.640 --> 00:20:54.755
A simpler way to get
the same bound as trays.

00:20:58.990 --> 00:21:00.670
And the first thing
we're going to do

00:21:00.670 --> 00:21:02.270
is use a weight balanced BST.

00:21:07.040 --> 00:21:16.670
This will achieve P plus log
k query and linear space.

00:21:20.650 --> 00:21:23.210
k, remember, is the
number of strings

00:21:23.210 --> 00:21:26.670
that we're storing, so it's the
number of leaves in the trie.

00:21:26.670 --> 00:21:28.789
So it's not quite as
good as P plus log Sigma

00:21:28.789 --> 00:21:30.080
but it's going to be a warm up.

00:21:30.080 --> 00:21:32.621
We're going to do this and
then we're going to improve it.

00:21:34.550 --> 00:21:39.470
Remember weight balanced trees,
we talked about them way back

00:21:39.470 --> 00:21:41.550
in lecture 3, I believe.

00:21:41.550 --> 00:21:44.690
There is an issue of
what is the weight.

00:21:44.690 --> 00:21:46.850
And typically, you say,
the weight of a subtree

00:21:46.850 --> 00:21:48.670
is the number of
nodes in the subtree.

00:21:48.670 --> 00:21:50.420
I'm going to change
that slightly and say,

00:21:50.420 --> 00:21:53.360
the weight of a subtree is
the number of descendant

00:21:53.360 --> 00:21:59.300
leaves in the subtree,
not the number of nodes,

00:21:59.300 --> 00:22:01.890
because it's log k.

00:22:01.890 --> 00:22:04.460
We really care about the
number of leaves down there.

00:22:04.460 --> 00:22:05.960
There could be long
paths here which

00:22:05.960 --> 00:22:08.690
we are not so excited about.

00:22:08.690 --> 00:22:11.150
We really care about how
many leaves are down there.

00:22:11.150 --> 00:22:14.881
Like the weight of this
node here is three--

00:22:14.881 --> 00:22:16.130
there's three leaves below it.

00:22:21.410 --> 00:22:23.800
You may recall
weight balanced BSTs

00:22:23.800 --> 00:22:25.910
try to make the weight
of the left subtree

00:22:25.910 --> 00:22:28.860
within a constant factor of the
weight of the right subtree.

00:22:28.860 --> 00:22:31.130
Because we're static,
we can be even simpler

00:22:31.130 --> 00:22:33.335
and say, find the
optimal partition.

00:22:37.190 --> 00:22:40.430
So we're thinking
about this approach--

00:22:40.430 --> 00:22:44.570
idea of expanding a large degree
node into some binary tree.

00:22:44.570 --> 00:22:46.610
We have a choice of
what binary tree to use.

00:22:46.610 --> 00:22:48.694
With three nodes it may
be not many choices-- that

00:22:48.694 --> 00:22:50.693
could be this or it could
be a straight line this way

00:22:50.693 --> 00:22:51.957
or a straight line that way.

00:22:51.957 --> 00:22:52.790
Those are different.

00:22:52.790 --> 00:22:55.167
And if one of these
guys is really heavy,

00:22:55.167 --> 00:22:56.750
one of these children
is really heavy,

00:22:56.750 --> 00:22:58.675
you want to put it
closer to the root.

00:22:58.675 --> 00:23:00.050
So that's what
we're going to do.

00:23:04.060 --> 00:23:06.720
Let me draw it this way.

00:23:06.720 --> 00:23:09.270
That's kind of an array.

00:23:09.270 --> 00:23:13.490
But what this array
represents is for a node--

00:23:13.490 --> 00:23:16.040
so here's my node, it
has lots of children.

00:23:16.040 --> 00:23:18.800
Some of these are
heavy, some of them

00:23:18.800 --> 00:23:21.567
are light, lighter than others.

00:23:21.567 --> 00:23:23.150
We don't know how
they're distributed.

00:23:23.150 --> 00:23:27.374
But they're ordered, we
have to preserve the order.

00:23:27.374 --> 00:23:28.790
What this is
supposed to represent

00:23:28.790 --> 00:23:31.890
is the total number of
leaves in this subtree.

00:23:31.890 --> 00:23:36.050
So the total number
of leaves here.

00:23:36.050 --> 00:23:40.490
And then I'm going to partition
this rectangle into groups

00:23:40.490 --> 00:23:43.430
corresponding to these sizes.

00:23:43.430 --> 00:23:48.230
So these are small, medium,
small, little less than medium,

00:23:48.230 --> 00:23:51.290
big, and then small.

00:23:51.290 --> 00:23:52.550
Something like that.

00:23:52.550 --> 00:23:54.950
So these horizontal
lengths correspond

00:23:54.950 --> 00:23:57.260
to the number of
leaves in these things,

00:23:57.260 --> 00:24:00.224
correspond to the
weight of my children.

00:24:00.224 --> 00:24:01.640
So I look at that
and I say, well,

00:24:01.640 --> 00:24:04.160
what I'd really like
to do is split this

00:24:04.160 --> 00:24:06.690
in the middle, which
is, maybe, here.

00:24:06.690 --> 00:24:08.730
I say, OK, well,
then I'll split here.

00:24:08.730 --> 00:24:11.090
That's pretty close
to the middle.

00:24:11.090 --> 00:24:13.547
So my left subtree will
consist of these guys,

00:24:13.547 --> 00:24:15.380
my right subtree will
consist of these guys.

00:24:15.380 --> 00:24:16.760
And then I recurse--

00:24:16.760 --> 00:24:19.490
over here I've
split at the middle,

00:24:19.490 --> 00:24:21.440
I find the thing that's
closest to the middle.

00:24:21.440 --> 00:24:22.898
Over here I've
split at the middle,

00:24:22.898 --> 00:24:26.724
I find the thing that's
closest to the middle.

00:24:26.724 --> 00:24:27.890
It's pretty much determined.

00:24:27.890 --> 00:24:30.642
So my root node
corresponds to this one.

00:24:30.642 --> 00:24:31.850
It's going to partition here.

00:24:31.850 --> 00:24:33.935
So over on the right,
there's going to be--

00:24:37.250 --> 00:24:39.470
here's going to be the
big tree and then here

00:24:39.470 --> 00:24:40.379
is the small tree.

00:24:40.379 --> 00:24:42.170
So this small tree
corresponds to this one.

00:24:42.170 --> 00:24:44.270
This big tree corresponds
to this interval.

00:24:44.270 --> 00:24:47.390
Then on the left we've got
four things we need to store.

00:24:47.390 --> 00:24:51.970
So these are the red
pointers that we had before.

00:24:51.970 --> 00:24:54.960
Then over on the left, we're
going to have a partition.

00:24:54.960 --> 00:24:57.250
And then there's going
to be two guys here.

00:24:57.250 --> 00:24:59.810
It doesn't really matter
how we store them.

00:24:59.810 --> 00:25:03.050
It's something like this.

00:25:03.050 --> 00:25:10.214
There is medium and small.

00:25:10.214 --> 00:25:12.255
And then over on the left,
we also have two guys.

00:25:12.255 --> 00:25:15.740
So it's going to be,
again, something like this.

00:25:20.440 --> 00:25:23.370
You got medium and small.

00:25:23.370 --> 00:25:25.880
So you see how that worked.

00:25:25.880 --> 00:25:29.270
Our main goal was to make this
big guy as close to the root

00:25:29.270 --> 00:25:30.141
as possible.

00:25:30.141 --> 00:25:32.390
It was the biggest and that's
basically what happened.

00:25:32.390 --> 00:25:34.206
This one is really big.

00:25:34.206 --> 00:25:36.330
And we couldn't quite put
it as a child of the root

00:25:36.330 --> 00:25:37.790
because it appeared
in the middle,

00:25:37.790 --> 00:25:41.480
but we could put it as a
grandchild of the root.

00:25:41.480 --> 00:25:43.400
In general, if you have
a super heavy child,

00:25:43.400 --> 00:25:47.150
it will always become a child
or grandchild of the root.

00:25:47.150 --> 00:25:50.149
So in constant number of
traversals you'll get there.
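
NOTE
Editor's sketch, not the lecturer's code: one way to build the weight-balanced tree just described, assuming each child's weight (its number of descendant leaves) is already known. The names build_weight_balanced and depth_of and the example weights are made up for illustration; the search keys stored at internal nodes are left out.
```python
from bisect import bisect_left
from itertools import accumulate
def build_weight_balanced(weights, lo=0, hi=None, prefix=None):
    """Tree over children lo..hi-1, order preserved. A leaf is a child index; an internal node is (left, right)."""
    if prefix is None:
        prefix = [0] + list(accumulate(weights))  # prefix[i] = total weight of children before i
        hi = len(weights)
    if hi - lo == 1:
        return lo
    middle = (prefix[lo] + prefix[hi]) / 2
    # Cuts are only allowed at child boundaries lo+1 .. hi-1; take the one closest to the middle.
    cut = bisect_left(prefix, middle, lo + 1, hi)
    cut = min(max(cut, lo + 1), hi - 1)
    if cut - 1 > lo and abs(prefix[cut - 1] - middle) <= abs(prefix[cut] - middle):
        cut -= 1
    return (build_weight_balanced(weights, lo, cut, prefix),
            build_weight_balanced(weights, cut, hi, prefix))
def depth_of(tree, child, depth=0):
    """Depth of a child's leaf in the tree, or None if it is absent."""
    if not isinstance(tree, tuple):
        return depth if tree == child else None
    for sub in tree:
        found = depth_of(sub, child, depth + 1)
        if found is not None:
            return found
    return None
# Made-up weights mimicking the board: small, medium, small, a little less than medium, big, small.
tree = build_weight_balanced([1, 4, 1, 3, 10, 1])
```
With these weights the big child (index 4) lands at depth 2, a grandchild of the root, with four children on the left and the big and small ones on the right.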

00:25:50.149 --> 00:25:52.190
Now again, you fill in
these nodes with some keys

00:25:52.190 --> 00:25:53.780
so you can do a binary search.

00:25:53.780 --> 00:25:57.950
But now the binary
search might go faster

00:25:57.950 --> 00:26:01.490
than log Sigma, which
is what we had before.

00:26:01.490 --> 00:26:04.340
And indeed, you can prove
that this really works well.

00:26:12.410 --> 00:26:13.610
So what's the claim?

00:26:17.030 --> 00:26:32.950
Claim is every two
edges you follow either

00:26:32.950 --> 00:26:35.654
advance one letter in P--

00:26:38.440 --> 00:26:41.177
these are the red edges
that we want to follow.

00:26:41.177 --> 00:26:42.760
So if we follow a
red edge, then we've

00:26:42.760 --> 00:26:45.020
made progress to the next node.

00:26:45.020 --> 00:26:48.730
So this would be
following a red edge.

00:26:48.730 --> 00:27:04.270
Or we reduce the number of
candidate T_i's by 2/3

00:27:04.270 --> 00:27:08.110
or, I guess, to 2/3
of its original value.

00:27:08.110 --> 00:27:10.210
So we lose a third
of the strings.

00:27:10.210 --> 00:27:12.100
That's what I'd like to claim.

00:27:12.100 --> 00:27:14.450
And it's not too
hard to see this.

00:27:14.450 --> 00:27:17.685
You have to imagine all of
these possible partitions.

00:27:17.685 --> 00:27:19.720
It's a little bit awkward.

00:27:19.720 --> 00:27:20.890
The idea is the following.

00:27:20.890 --> 00:27:23.170
If you take one
of these arrays--

00:27:23.170 --> 00:27:26.630
this view of all the leaves
just laid out on the line--

00:27:26.630 --> 00:27:30.800
you say, well, I'd like
to split in half and half.

00:27:30.800 --> 00:27:33.490
But that will never happen
unless I'm really lucky.

00:27:33.490 --> 00:27:37.540
So let's think about
this one third splitting.

00:27:37.540 --> 00:27:41.830
If I were able to cut anywhere
in here, then in one step,

00:27:41.830 --> 00:27:45.430
actually, I would achieve
this 2/3 reduction.

00:27:45.430 --> 00:27:46.690
I'd lose a third of the nodes.

00:27:50.530 --> 00:27:56.170
If I end up cutting here,
for example, then either I

00:27:56.170 --> 00:27:58.420
go to the left and I lost
almost 2/3 of the nodes,

00:27:58.420 --> 00:27:59.878
or I go to the
right and I at least

00:27:59.878 --> 00:28:02.410
lost this one third of the nodes
or one third of the leaves,

00:28:02.410 --> 00:28:04.300
I should say.

00:28:04.300 --> 00:28:06.250
So that would be
a good situation

00:28:06.250 --> 00:28:07.610
if I got some cut in here.

00:28:07.610 --> 00:28:10.570
But it might be there is no
possible cut I can make in here

00:28:10.570 --> 00:28:16.879
because there's a giant child
in here that has more than one

00:28:16.879 --> 00:28:17.670
third of the nodes.

00:28:17.670 --> 00:28:20.710
It would have to span
all the way across here.

00:28:20.710 --> 00:28:22.630
So I can't make any
cuts, I can only

00:28:22.630 --> 00:28:25.060
cut between child boundaries.

00:28:25.060 --> 00:28:29.880
In that situation,
you make this--

00:28:29.880 --> 00:28:34.180
well, this is when I need to
follow two edges, not one.

00:28:34.180 --> 00:28:36.550
When there's a super big
child like that, as we said,

00:28:36.550 --> 00:28:39.280
it will become a
grandchild of the root.

00:28:39.280 --> 00:28:40.720
So it will be--

00:28:40.720 --> 00:28:45.070
there's the root and then
here is the giant tree.

00:28:45.070 --> 00:28:50.270
And then there's going to be
the other stuff here and here.

00:28:50.270 --> 00:28:53.890
So after I go down to either
one step or two steps,

00:28:53.890 --> 00:28:57.390
I will either get here--

00:28:57.390 --> 00:29:03.020
sorry, more red chalk,
this was a red pointer.

00:29:03.020 --> 00:29:04.840
Now, this is going to a child.

00:29:04.840 --> 00:29:07.090
So if I went there,
I'm happy in two steps.

00:29:07.090 --> 00:29:12.260
I advance one letter in P.
Or in two steps, I went here

00:29:12.260 --> 00:29:13.060
or I went here.

00:29:13.060 --> 00:29:14.780
And this was a huge
amount of the nodes,

00:29:14.780 --> 00:29:16.363
this is at least a
third of the nodes.

00:29:16.363 --> 00:29:18.220
Again, if I end up
here or end up here,

00:29:18.220 --> 00:29:20.990
I lost 2/3 of the
candidate leaves.

00:29:20.990 --> 00:29:23.740
I mean, I lost one third
of the candidate leaves,

00:29:23.740 --> 00:29:25.570
leaving 2/3 of them.

00:29:28.840 --> 00:29:33.340
If this happens, I charge
to this order P term.

00:29:33.340 --> 00:29:34.810
And if the other
situation happens,

00:29:34.810 --> 00:29:37.360
I charge the log k term--
because I can only reduce k

00:29:37.360 --> 00:29:41.260
by a factor of 2/3--

00:29:41.260 --> 00:29:44.110
order log k times.

00:29:44.110 --> 00:29:50.720
This implies order
P plus log k search.
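
NOTE
Editor's sketch of the counting behind this bound, written out under the claim as stated: every two edges either advance one letter of P or shrink the candidate count k to at most 2/3 of its value. The index j (the number of shrinking steps) is notation introduced here.
```latex
\begin{aligned}
\#\{\text{pairs that advance a letter}\} &\le |P|, \\
k \left(\tfrac{2}{3}\right)^{j} \ge 1 \;&\Longrightarrow\; j \le \log_{3/2} k, \\
\#\{\text{edges followed}\} &\le 2\left(|P| + \log_{3/2} k\right) = O(|P| + \log k).
\end{aligned}
```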

00:29:50.720 --> 00:29:52.440
So a very simple idea.

00:29:52.440 --> 00:29:54.900
Just change the way we do BSTs.

00:29:54.900 --> 00:29:57.210
And we get, in some
cases, a better bound.

00:29:57.210 --> 00:29:59.940
But not in all
cases because maybe

00:29:59.940 --> 00:30:03.240
P plus log k might be bigger
than P times log Sigma.

00:30:03.240 --> 00:30:06.620
And k and Sigma are kind of
incomparable, so we don't know.

00:30:06.620 --> 00:30:13.350
That's where method
5 comes in, which

00:30:13.350 --> 00:30:15.620
is our good friend
from last class--

00:30:15.620 --> 00:30:18.752
leaf trimming and indirection.

00:30:22.200 --> 00:30:26.640
So we're going to use
this idea of finding--

00:30:26.640 --> 00:30:33.750
we're going to cut below
maximally deep nodes

00:30:33.750 --> 00:30:36.200
with the right number
of descendants in them.

00:30:43.820 --> 00:30:48.910
So we need at least
Sigma descendants.

00:30:53.820 --> 00:30:56.090
It could just be descendants
or descendant leaves,

00:30:56.090 --> 00:30:57.090
doesn't actually matter.

00:31:02.890 --> 00:31:06.416
Let me draw a picture, maybe.

00:31:06.416 --> 00:31:08.040
This is pretty much
what we did before,

00:31:08.040 --> 00:31:12.500
except before this magic number
was log n that we needed or 1/2

00:31:12.500 --> 00:31:13.820
log n or something.

00:31:13.820 --> 00:31:16.260
Now it's going to be
Sigma that we need.

00:31:16.260 --> 00:31:18.815
So it is we find these
maximally deep nodes--

00:31:18.815 --> 00:31:22.880
these dots-- that
have at least--

00:31:22.880 --> 00:31:25.310
I guess, there is really
multiple things hanging off

00:31:25.310 --> 00:31:25.970
here.

00:31:25.970 --> 00:31:29.990
In general, it could be
several things hanging off.

00:31:29.990 --> 00:31:31.490
But the total number
of descendants

00:31:31.490 --> 00:31:35.970
of each of these nodes
is at least Sigma.

00:31:35.970 --> 00:31:37.880
So what that implies
is that the number

00:31:37.880 --> 00:31:43.420
of these dots, the number of
the leaves in the top tree--

00:31:43.420 --> 00:31:51.890
so up here-- number of leaves
is at most T over Sigma.

00:31:51.890 --> 00:31:54.890
Because we can charge each
of these nodes to Sigma

00:31:54.890 --> 00:31:58.490
descendants in each of them.
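
NOTE
Editor's sketch, not from the lecture: finding the maximally deep nodes that still have at least Sigma descendant leaves, on a trie stored as nested dicts. The helper names (make_trie, count_leaves, cut_nodes) are hypothetical, and recounting leaves at every node is quadratic in the worst case; a real implementation would compute the counts in one pass.
```python
def make_trie(words):
    """Plain trie as nested dicts: letter -> subtree; a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root
def count_leaves(node):
    """Number of descendant leaves; a node without children counts as one leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())
def cut_nodes(node, sigma, path=""):
    """Yield the paths of maximally deep nodes with at least sigma descendant leaves."""
    if count_leaves(node) < sigma:
        return
    deeper = False
    for letter, child in node.items():
        for found in cut_nodes(child, sigma, path + letter):
            deeper = True
            yield found
    if not deeper:
        yield path  # no child qualifies, so this node is maximally deep
trie = make_trie(["ana$", "ann$", "anna$", "anne$"])
top_leaves = list(cut_nodes(trie, 3))
```
Here the only cut node is the one reached by a-n-n, so the top tree has a single leaf, consistent with the bound of (number of leaves) / Sigma = 4/3.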

00:31:58.490 --> 00:32:03.160
So that's good because it
says we can use method 1--

00:32:03.160 --> 00:32:06.830
the simple array method--
which is fast but spacious.

00:32:06.830 --> 00:32:11.870
But if our new size of the trie
gets divided by a Sigma factor,

00:32:11.870 --> 00:32:14.100
then this turns
out to be linear.

00:32:14.100 --> 00:32:15.860
So up here we use method 1.

00:32:18.980 --> 00:32:21.170
Now, you got to be a little
careful because we can't

00:32:21.170 --> 00:32:23.127
use method 1 on all the nodes.

00:32:23.127 --> 00:32:24.710
We can definitely
use it on the leaves

00:32:24.710 --> 00:32:26.600
because there aren't
too many leaves.

00:32:26.600 --> 00:32:32.910
That means we can also use it on
the number of branching nodes.

00:32:32.910 --> 00:32:35.000
Number of branching
nodes is also

00:32:35.000 --> 00:32:37.940
going to be, at
most, T over Sigma

00:32:37.940 --> 00:32:40.550
because it's actually
one fewer branching node

00:32:40.550 --> 00:32:43.310
than there are leaves.

00:32:43.310 --> 00:32:49.340
So great, I can use
arrays on the leaves,

00:32:49.340 --> 00:32:52.340
I can use arrays on
the branching nodes.

00:32:52.340 --> 00:32:54.950
I can't use it on the
non-branching nodes.

00:32:54.950 --> 00:32:58.220
Non-branching nodes are nodes
with a single descendant

00:32:58.220 --> 00:33:00.650
and everything else is null.

00:33:00.650 --> 00:33:03.490
What do I do for those nodes?

00:33:03.490 --> 00:33:06.470
Very difficult. I just store
that one pointer and store its

00:33:06.470 --> 00:33:07.340
label.

00:33:07.340 --> 00:33:09.670
I guess you could think
of that as method 2

00:33:09.670 --> 00:33:11.360
in a very trivial case.

00:33:11.360 --> 00:33:13.670
You see-- is this
the right label?

00:33:13.670 --> 00:33:15.310
Yes or no.

00:33:15.310 --> 00:33:17.975
So this is the
non-branching nodes.

00:33:22.730 --> 00:33:25.910
Non-branching top nodes--

00:33:25.910 --> 00:33:28.430
I will use method 2.

00:33:28.430 --> 00:33:30.260
So I guess this is really--

00:33:30.260 --> 00:33:32.510
well, for these
guys I use method 1,

00:33:32.510 --> 00:33:35.930
for these guys I use method 1.

00:33:35.930 --> 00:33:37.160
So I can afford all this.

00:33:37.160 --> 00:33:38.555
This will take order T space.

00:33:43.070 --> 00:33:46.280
And it will be fast because
either I'm using arrays

00:33:46.280 --> 00:33:48.230
or I really don't
have any work to do,

00:33:48.230 --> 00:33:50.440
and so it doesn't
really matter what I do.

00:33:50.440 --> 00:33:52.190
But except I can't use
arrays because they

00:33:52.190 --> 00:33:53.990
would be too spacious.

00:33:53.990 --> 00:33:55.170
So that handles the top.

00:33:55.170 --> 00:33:57.530
Now, the issue is, what about
these bottom structures?

00:33:57.530 --> 00:33:59.540
The bottom structures--
what do we know?

00:33:59.540 --> 00:34:03.450
They have to have
less than Sigma nodes,

00:34:03.450 --> 00:34:05.660
less than Sigma descendants.

00:34:05.660 --> 00:34:09.260
Also less than Sigma leaves.

00:34:09.260 --> 00:34:15.889
So in other words,
in these trees

00:34:15.889 --> 00:34:19.100
we have k less than Sigma.

00:34:19.100 --> 00:34:21.889
Well, then we can
afford to use method 4.

00:34:21.889 --> 00:34:25.530
Because our whole goal is to get
k down to Sigma in this bound.

00:34:25.530 --> 00:34:29.105
So in the bottom
trees, we use method 4.

00:34:31.730 --> 00:34:33.500
Method 4 was always
linear space.

00:34:33.500 --> 00:34:36.260
And the issue was we
paid P plus log k.

00:34:36.260 --> 00:34:44.239
But now in here, k is less
than Sigma in these trees.

00:34:44.239 --> 00:34:50.690
So that means we get order
P plus log Sigma query time.

00:34:53.659 --> 00:34:56.550
And that's the best we know how
to do if you want predecessor

00:34:56.550 --> 00:34:57.780
at the nodes.

00:34:57.780 --> 00:35:01.830
So it matches this tray
bound in a pretty easy way.

00:35:01.830 --> 00:35:04.790
Just apply weight balanced,
clean things up a little bit.

00:35:04.790 --> 00:35:07.800
But only do that at the
leaves and everywhere up

00:35:07.800 --> 00:35:09.120
here, basically.

00:35:09.120 --> 00:35:11.114
Except the non-branching
nodes use arrays.

00:35:11.114 --> 00:35:13.280
So for the most part arrays
and then, at the bottom,

00:35:13.280 --> 00:35:16.400
you use weight balance.

00:35:16.400 --> 00:35:19.340
This is how you ought
to represent a trie.

00:35:19.340 --> 00:35:22.172
If you want to preserve
the order of the children,

00:35:22.172 --> 00:35:23.630
this is the best
we know how to do.

00:35:23.630 --> 00:35:26.330
If you don't want to preserve
order, just use a hash table.

00:35:26.330 --> 00:35:28.145
So it depends on
the application.

00:35:32.110 --> 00:35:36.370
One fun application of
this is string sorting.

00:35:39.370 --> 00:35:40.930
It's not a data
structures problem

00:35:40.930 --> 00:35:42.804
so I don't want to spend
too much time on it.

00:35:42.804 --> 00:35:45.340
But you use this trie data
structure to sort strings.

00:35:45.340 --> 00:35:47.590
You just throw in a string
and then throw in a string.

00:35:47.590 --> 00:35:52.670
We didn't talk about dynamic
tries but it can be done.

00:35:52.670 --> 00:35:54.670
And if you throw it,
you just sort of find

00:35:54.670 --> 00:35:57.220
where you fall off and
then add the thing.

00:35:57.220 --> 00:35:59.770
Now, you have to maintain
all this funky stuff

00:35:59.770 --> 00:36:03.370
but weight balanced trees can
be made dynamic and indirection

00:36:03.370 --> 00:36:05.030
can be made dynamic.

00:36:05.030 --> 00:36:09.880
So you end up with this sort
of simple incremental scheme.

00:36:09.880 --> 00:36:14.410
You end up with T
plus k log Sigma

00:36:14.410 --> 00:36:21.790
to sort k strings of total size
T with alphabet size Sigma.

00:36:21.790 --> 00:36:22.720
This is good.

00:36:22.720 --> 00:36:26.077
If I used, for example,
merge sort to sort strings,

00:36:26.077 --> 00:36:27.160
it's going to be very bad.

00:36:27.160 --> 00:36:32.650
It's going to be something like
T times k times log something.

00:36:32.650 --> 00:36:34.150
We didn't really
care about the log.

00:36:34.150 --> 00:36:35.420
T times k is bad.

00:36:35.420 --> 00:36:39.400
That's because comparing strings
could potentially take T time.

00:36:39.400 --> 00:36:40.915
And then there's k of them.

00:36:40.915 --> 00:36:42.290
But this is linear.

00:36:42.290 --> 00:36:44.870
This is the sum of the
lengths of the strings.

00:36:44.870 --> 00:36:46.510
There's this extra little term.

00:36:46.510 --> 00:36:48.670
But most of the time that's
going to be dominated

00:36:48.670 --> 00:36:51.600
by the length of the strings.

00:36:51.600 --> 00:36:55.217
So that's a good way to
sort strings using tries.
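
NOTE
Editor's sketch of the incremental idea only: insert each string into a trie, then read the strings back by visiting children in letter order. It uses plain dicts and sorts the children at each node, so it does not achieve the T + k log Sigma bound quoted above, which needs the dynamic weight-balanced and indirection machinery. Node and trie_sort are names made up here.
```python
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.count = 0      # how many input strings end exactly here
def trie_sort(strings):
    root = Node()
    for s in strings:
        node = root
        for letter in s:
            node = node.children.setdefault(letter, Node())
        node.count += 1
    out = []
    def walk(node, prefix):
        # A string ending here is a prefix of everything below, so it comes first.
        out.extend(["".join(prefix)] * node.count)
        for letter in sorted(node.children):
            prefix.append(letter)
            walk(node.children[letter], prefix)
            prefix.pop()
    walk(root, [])
    return out
result = trie_sort(["anne", "ana", "ann", "anna", "ana"])
```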

00:36:55.217 --> 00:36:57.800
Tries by themselves, I mean this
is about all there is to say.

00:36:57.800 --> 00:37:02.870
So let's move on to suffix
trees and compressed tries.

00:37:02.870 --> 00:37:06.721
Now, we actually did compressed
tries in the signature sort

00:37:06.721 --> 00:37:07.220
lecture.

00:37:14.492 --> 00:37:15.950
Actually, why don't
I go over here?

00:37:25.210 --> 00:37:28.230
So tries-- branches were
labeled with letters.

00:37:28.230 --> 00:37:32.160
That's still going to be
true for a compressed trie.

00:37:32.160 --> 00:37:35.190
But as we saw in that
lecture, in compressed trie

00:37:35.190 --> 00:37:37.820
we're going to get rid of
the non-branching nodes.

00:37:41.650 --> 00:37:44.010
So the idea with the compressed
trie is very simple--

00:37:44.010 --> 00:37:49.500
just contract non-branching
paths into a single edge.

00:38:03.580 --> 00:38:05.800
This is our example of a trie.

00:38:05.800 --> 00:38:08.440
We're just going to modify
it to make a compressed trie.

00:38:14.890 --> 00:38:17.920
Here we have a
non-branching path.

00:38:17.920 --> 00:38:20.770
We have to follow an a, and
then we have to follow an n.

00:38:20.770 --> 00:38:22.330
There's no point in
having this node.

00:38:22.330 --> 00:38:24.038
You might as well just
have a single edge

00:38:24.038 --> 00:38:26.560
that says a-n on it.

00:38:26.560 --> 00:38:29.470
So we go from here,
from the root.

00:38:29.470 --> 00:38:33.370
We're going to have
an edge that says a-n.

00:38:37.230 --> 00:38:40.560
And in some sense, the
key of this child is a.

00:38:40.560 --> 00:38:42.540
If you're starting up
here and you want to know

00:38:42.540 --> 00:38:45.820
which way should I go, you
should only go this way

00:38:45.820 --> 00:38:47.700
if your first letter is a.

00:38:47.700 --> 00:38:49.410
After that, your next
letter better be n,

00:38:49.410 --> 00:38:51.420
otherwise you fell off the tree.

00:38:51.420 --> 00:38:53.280
So that's the
compression we're doing.

00:38:53.280 --> 00:38:55.310
Now, here we have-- this
is a branching node,

00:38:55.310 --> 00:38:56.700
so that node we keep intact.

00:39:00.490 --> 00:39:03.840
This is an n, this is an a here.

00:39:03.840 --> 00:39:06.370
But here it's non-branching.

00:39:06.370 --> 00:39:08.320
Let me draw this a
little bit longer.

00:39:08.320 --> 00:39:10.350
In reality, it's
just a single edge.

00:39:10.350 --> 00:39:13.000
And again, the key is a, and
then you must have a $ sign

00:39:13.000 --> 00:39:14.070
on afterwards.

00:39:14.070 --> 00:39:16.730
Then you reach a
leaf, the first leaf.

00:39:16.730 --> 00:39:18.420
If we follow the n branch--

00:39:18.420 --> 00:39:22.278
this is branching, so
that node is preserved.

00:39:25.560 --> 00:39:28.730
If I go this way, it's a
$ sign and I reach a leaf.

00:39:28.730 --> 00:39:33.030
If I go this way it's an a that
must be followed by a $ sign,

00:39:33.030 --> 00:39:34.020
so that's a leaf.

00:39:34.020 --> 00:39:37.635
And if I go this way, it must
be an e, followed by a $ sign,

00:39:37.635 --> 00:39:39.780
which is a leaf.

00:39:39.780 --> 00:39:43.080
Again, these four leaves
can point to these places.

00:39:43.080 --> 00:39:44.640
That's a compressed trie.

00:39:44.640 --> 00:39:45.967
Pretty obvious.
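
NOTE
Editor's sketch, not the lecture's code: contracting non-branching paths of a nested-dict trie into single edges. It assumes the four example strings are ana$, ann$, anna$ and anne$, which matches the edges read off the board here; make_trie and compress are hypothetical helper names.
```python
def make_trie(words):
    """Plain trie as nested dicts: letter -> subtree; a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root
def compress(node):
    """Return {edge_label: compressed_subtree}; the first letter of each label is the key."""
    out = {}
    for letter, child in node.items():
        label = letter
        while len(child) == 1:  # non-branching node: swallow its only outgoing letter
            (next_letter, grandchild), = child.items()
            label += next_letter
            child = grandchild
        out[label] = compress(child)
    return out
compressed = compress(make_trie(["ana$", "ann$", "anna$", "anne$"]))
```
The result has one a-n edge out of the root, then an a$ leaf edge and a branching n node with the edges $, a$ and e$.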

00:39:45.967 --> 00:39:48.300
The nice thing about the
compressed trie is the number--

00:39:48.300 --> 00:39:50.258
here we knew the number
of branching nodes,

00:39:50.258 --> 00:39:51.780
it was at most the
number of leaves.

00:39:51.780 --> 00:39:53.510
Over here, the number
of internal nodes

00:39:53.510 --> 00:39:54.843
is at most the number of leaves.

00:39:54.843 --> 00:39:59.540
So this structure has
order k nodes in total

00:39:59.540 --> 00:40:02.160
because we got rid of all
the non-branching nodes.

00:40:02.160 --> 00:40:04.852
I guess except the root, the
root might not be branching.

00:40:07.500 --> 00:40:09.330
We've got a big O
there to cover us.

00:40:12.000 --> 00:40:14.790
And all the things we said
about representing tries here,

00:40:14.790 --> 00:40:18.300
you can do the same thing
with a compressed trie.

00:40:18.300 --> 00:40:22.990
I need to write
down that 3.5 here.

00:40:33.980 --> 00:40:35.990
And in fact, these
results get better because

00:40:35.990 --> 00:40:40.880
before order T meant the
number of nodes in the trie.

00:40:40.880 --> 00:40:42.730
Now order T will be
the number of nodes

00:40:42.730 --> 00:40:45.770
in the compressed trie,
which is actually order k.

00:40:45.770 --> 00:40:50.902
So life gets really
good in this world.

00:40:50.902 --> 00:40:52.610
I did it in the trie
setting because it's

00:40:52.610 --> 00:40:53.760
just simpler to think about.

00:40:53.760 --> 00:40:55.968
But really, you would always
store a compressed trie.

00:40:55.968 --> 00:40:57.942
There's no point
in storing a trie.

00:40:57.942 --> 00:41:00.080
You can still do the
same kinds of searches.

00:41:04.010 --> 00:41:09.150
But really, compressed tries
are a warm-up for suffix trees.

00:41:09.150 --> 00:41:10.820
So let's talk
about suffix trees.

00:41:14.720 --> 00:41:18.910
Suffix trees are
a compressed trie.

00:41:18.910 --> 00:41:22.790
So really they should
be called suffix tries.

00:41:22.790 --> 00:41:27.050
And occasionally, people
will call them suffix tries.

00:41:27.050 --> 00:41:28.745
But most people call
them suffix trees,

00:41:28.745 --> 00:41:31.550
so for consistency I'll
call them trees as well.

00:41:31.550 --> 00:41:32.423
But they are tries.

00:41:42.542 --> 00:41:45.110
I'm going to introduce
some notation here.

00:41:53.457 --> 00:41:55.040
With tries, we are
thinking about lots

00:41:55.040 --> 00:41:56.240
of different strings.

00:41:56.240 --> 00:41:59.590
In this case, we're going back
to our string matching problem.

00:41:59.590 --> 00:42:02.940
We have a single text and we
want to preprocess that text.

00:42:02.940 --> 00:42:04.940
But we're going to turn
it into multiple strings

00:42:04.940 --> 00:42:07.970
by looking at all
suffixes of the string.

00:42:07.970 --> 00:42:09.860
This is Python
notation for everything

00:42:09.860 --> 00:42:12.590
from letter i onwards.

00:42:12.590 --> 00:42:15.440
And we do that for all i,
so that's a lot of strings.

00:42:15.440 --> 00:42:18.824
And we build the
compressed trie over them.

00:42:18.824 --> 00:42:19.490
That's the idea.

00:42:19.490 --> 00:42:22.340
And to make it work out--
because you remember,

00:42:22.340 --> 00:42:25.790
with tries we had to append
$ sign to every string.

00:42:25.790 --> 00:42:28.700
In this case, we'd just
have to append $ sign to T,

00:42:28.700 --> 00:42:31.220
and then all suffixes
will end with a $ sign.

00:42:31.220 --> 00:42:33.590
So that covers
us. $ sign, again,

00:42:33.590 --> 00:42:36.510
is a character not
appearing in the alphabet.

00:42:36.510 --> 00:42:37.372
And that's it.

00:42:37.372 --> 00:42:38.330
So that's a definition.

00:42:38.330 --> 00:42:39.163
Let's do an example.

00:42:49.240 --> 00:42:51.880
At this point, we're going for
this goal of order P query,

00:42:51.880 --> 00:42:54.010
order T space.

00:42:54.010 --> 00:42:57.130
Suffix trees will be a
way to achieve that goal.

00:43:03.820 --> 00:43:10.375
Let's do my favorite
example which is banana.

00:43:13.540 --> 00:43:17.650
I had a friend who said, I
know how to spell banana,

00:43:17.650 --> 00:43:19.900
I just don't know when to stop.

00:43:19.900 --> 00:43:22.990
There's a nice pattern to it
and a lot of repeated letters

00:43:22.990 --> 00:43:24.700
and so on.

00:43:24.700 --> 00:43:26.230
I've got to number
the characters.

00:43:26.230 --> 00:43:28.880
He said that when he was like
six, not when he was older.

00:43:31.277 --> 00:43:33.610
It's a little harder when
you're writing it on the board

00:43:33.610 --> 00:43:36.580
but we all know how to
spell banana, I hope.

00:43:36.580 --> 00:43:37.790
I got it right, right?

00:43:37.790 --> 00:43:40.600
It should be 7 letters,
including the $ sign.

00:43:43.395 --> 00:43:44.020
There they are.

00:43:44.020 --> 00:43:46.190
So there's a suffix which
is the whole string.

00:43:46.190 --> 00:43:48.670
There's a suffix which
is a, n, a, n, a, $ sign.

00:43:48.670 --> 00:43:50.710
There is a suffix which
is n, a, n, a, $ sign.

00:43:50.710 --> 00:43:52.459
There's a suffix which
is a, n, a, $ sign.

00:43:52.459 --> 00:43:53.890
Suffix n, a, $ sign. a, $ sign.

00:43:53.890 --> 00:43:55.140
And $ sign.

00:43:55.140 --> 00:43:58.340
And empty, I suppose, but we're
not going to store that one.

00:43:58.340 --> 00:44:01.210
You don't need to.

00:44:01.210 --> 00:44:02.422
Cool.

00:44:02.422 --> 00:44:04.630
I'm going to cheat a little
bit and look at my figure

00:44:04.630 --> 00:44:07.350
because it is a little
bit of thinking.

00:44:07.350 --> 00:44:09.400
The final challenge
of this lecture

00:44:09.400 --> 00:44:14.560
will be construct this
diagram in linear time.

00:44:14.560 --> 00:44:38.115
But I'm, just for
now, going to cheat

00:44:38.115 --> 00:44:39.990
because it's a little
tricky to do it and get

00:44:39.990 --> 00:44:41.340
all the nodes in sorted order.

00:44:57.980 --> 00:44:59.320
So that should give it to us.

00:44:59.320 --> 00:45:02.155
And then the suffixes.

00:45:02.155 --> 00:45:04.630
Here is another color.

00:45:04.630 --> 00:45:14.810
6, 5, 3, 1, 0, 4, 2.

00:45:14.810 --> 00:45:16.420
Cool.

00:45:16.420 --> 00:45:18.580
This I claim is a
suffix tree of banana.

00:45:18.580 --> 00:45:20.530
You see the banana substring.

00:45:20.530 --> 00:45:24.670
Then the next one is
a, n, a, n, a, $ sign.

00:45:24.670 --> 00:45:27.700
Then the next one is
n, a, n, a, $ sign.

00:45:27.700 --> 00:45:31.570
Then the next one
is a, n, a, $ sign.

00:45:31.570 --> 00:45:34.380
Next one is n, a, $ sign.

00:45:34.380 --> 00:45:35.840
Next one is a, $ sign.

00:45:35.840 --> 00:45:37.710
And then $ sign.

00:45:37.710 --> 00:45:39.917
So that's a nice,
clean representation

00:45:39.917 --> 00:45:40.750
of all the suffixes.
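
NOTE
Editor's sketch: the leaf labels can be double-checked by sorting the suffixes of banana$ directly, assuming $ sorts before every letter (true for ASCII). This is a quadratic-time check, not the linear-time construction promised for later in the lecture.
```python
text = "banana$"
# T[i:] in Python notation: everything from letter i onwards.
suffixes = [text[i:] for i in range(len(text))]
# Left-to-right leaf order of the suffix tree is the sorted order of the suffixes.
leaf_order = sorted(range(len(text)), key=lambda i: text[i:])
```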

00:45:40.750 --> 00:45:43.000
And you can see that if
you wanted to search from

00:45:43.000 --> 00:45:45.250
the middle of this string--
suppose I want to search

00:45:45.250 --> 00:45:46.510
for n-a-n--

00:45:46.510 --> 00:45:47.490
then it's right there.

00:45:47.490 --> 00:45:51.700
Just do n, a, n, then I'm done.

00:45:51.700 --> 00:45:54.130
This virtual node
in the middle here

00:45:54.130 --> 00:45:56.980
along the one third of
the way down the edge,

00:45:56.980 --> 00:46:00.100
that represents n-a-n.

00:46:00.100 --> 00:46:02.170
And indeed, if you look
at the descendant leaf,

00:46:02.170 --> 00:46:05.470
that corresponds to an
occurrence of n-a-n.

00:46:05.470 --> 00:46:08.830
If I was going to
look for a-n, I

00:46:08.830 --> 00:46:12.880
would do a, n, so
halfway down this edge.

00:46:12.880 --> 00:46:17.920
And then this subtree represents
all the occurrences of a-n.

00:46:17.920 --> 00:46:19.210
Think about it.

00:46:19.210 --> 00:46:21.220
There's two of them--

00:46:21.220 --> 00:46:25.450
One that starts at position 3,
one that starts at position 1.

00:46:25.450 --> 00:46:27.067
Here's one occurrence
of a-n, here's

00:46:27.067 --> 00:46:28.150
another occurrence of a-n.

00:46:28.150 --> 00:46:29.858
This works even when
they're overlapping.

00:46:29.858 --> 00:46:32.717
If I search for a-n-a,
I would get here.

00:46:32.717 --> 00:46:35.050
And then these are the two
occurrences of a-n-a and they

00:46:35.050 --> 00:46:36.400
actually overlap each other--

00:46:36.400 --> 00:46:38.764
this one and this one.

00:46:38.764 --> 00:46:40.180
So this is a great
data structure,

00:46:40.180 --> 00:46:43.940
it solves what we need.

00:46:43.940 --> 00:46:46.512
It's all substrings searching.
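
NOTE
Editor's sketch of the search just walked through, using a naive uncompressed suffix trie (quadratic space) instead of a real suffix tree so that it stays short. The function names and the use of the key None to hold a leaf's start position are choices made here, not the lecture's.
```python
def build_suffix_trie(text):
    """Nested dicts: letter -> subtree; the key None stores the start index of the suffix ending there."""
    root = {}
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.setdefault(letter, {})
        node[None] = start
    return root
def leaves(node):
    """Start positions below node, visiting children in sorted letter order."""
    if None in node:
        yield node[None]
    for letter in sorted(key for key in node if key is not None):
        yield from leaves(node[letter])
def occurrences(root, pattern):
    node = root
    for letter in pattern:
        if letter not in node:
            return []  # fell off the tree: no occurrence
        node = node[letter]
    return list(leaves(node))
root = build_suffix_trie("banana$")
```
Searching for a-n or a-n-a returns positions 3 and 1 (the overlapping occurrences), and n-a-n returns position 2.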

00:47:01.460 --> 00:47:03.350
Applications of suffix trees.

00:47:18.570 --> 00:47:21.860
Just do a search in the trie
for a particular pattern.

00:47:21.860 --> 00:47:42.800
We get subtree representing all
of the occurrences of P in T.

00:47:42.800 --> 00:47:44.150
So this is great.

00:47:44.150 --> 00:47:47.690
In order P time, walking
down this structure,

00:47:47.690 --> 00:47:49.820
I can figure out
all the occurrences.

00:47:49.820 --> 00:47:52.190
And then, if I want to
know how many there were,

00:47:52.190 --> 00:47:54.110
I could just store
subtree sizes--

00:47:54.110 --> 00:47:55.940
number of leaves
below every node.

00:47:55.940 --> 00:47:59.270
If I wanted to list
them, I could just

00:47:59.270 --> 00:48:00.890
do an in-order traversal.

00:48:00.890 --> 00:48:03.230
And I'll even get them in order.

00:48:03.230 --> 00:48:08.900
So in particular, if I wanted to
list the first 10 occurrences,

00:48:08.900 --> 00:48:12.800
I could store the left-most leaf
from every node, teleport down

00:48:12.800 --> 00:48:14.870
to the first occurrence
in constant time.

00:48:14.870 --> 00:48:17.600
And then I could just have a
linked list of all the leaves.

00:48:17.600 --> 00:48:19.760
So once I find the
first one, I can just

00:48:19.760 --> 00:48:22.880
follow until I find, oh,
that's not an occurrence of P.

00:48:22.880 --> 00:48:25.520
So I can list the first
k of them in order k time

00:48:25.520 --> 00:48:28.160
once I've done the
search of order P time.
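
NOTE
Editor's sketch of the bookkeeping described here, again on a naive uncompressed suffix trie. Each node stores the position of its leftmost leaf and its number of leaves; an array of leaves in left-to-right order stands in for the linked list. All names are hypothetical, and "first k" means first in leaf order.
```python
class Node:
    def __init__(self):
        self.children = {}  # letter -> Node
        self.first = 0      # index in leaf_list of the leftmost descendant leaf
        self.count = 0      # number of descendant leaves
def build(text):
    root, leaf_list = Node(), []
    for start in range(len(text)):
        node = root
        for letter in text[start:]:
            node = node.children.setdefault(letter, Node())
        node.start = start  # leaf: remember which suffix ends here
    def annotate(node):
        node.first = len(leaf_list)
        if not node.children:
            leaf_list.append(node.start)
        for letter in sorted(node.children):
            annotate(node.children[letter])
        node.count = len(leaf_list) - node.first
    annotate(root)
    return root, leaf_list
def find(root, pattern):
    """Walk down in time proportional to the pattern length; None if we fall off."""
    node = root
    for letter in pattern:
        node = node.children.get(letter)
        if node is None:
            return None
    return node
def count_occurrences(root, pattern):
    node = find(root, pattern)
    return node.count if node else 0
def first_k(root, leaf_list, pattern, k):
    node = find(root, pattern)
    if node is None:
        return []
    return leaf_list[node.first:node.first + min(k, node.count)]
root, leaf_list = build("banana$")
```
After the walk, the count is a single field lookup and listing k occurrences is a slice of length at most k.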

00:48:28.160 --> 00:48:30.110
So this is really
good searching.

00:48:30.110 --> 00:48:32.360
And it's the ideal situation.

00:48:32.360 --> 00:48:34.520
You can list any information
you want about all

00:48:34.520 --> 00:48:38.150
of the answers in the optimal
time and size of the output.

00:48:40.670 --> 00:48:43.640
How big is this data structure?

00:48:43.640 --> 00:48:51.008
Well, there are T suffixes,
so k is the size of T.

00:48:51.008 --> 00:48:53.630
And when we look at our
trie representations,

00:48:53.630 --> 00:48:55.730
our general goal was to get--

00:48:55.730 --> 00:48:59.817
here, capital T was
the sum of the lengths.

00:48:59.817 --> 00:49:01.400
Well, sum of the
lengths is not good--

00:49:01.400 --> 00:49:02.702
that would be quadratic--

00:49:02.702 --> 00:49:04.160
sum of the lengths
of the suffixes.

00:49:04.160 --> 00:49:08.420
But we also said, or the
number of nodes in the trie.

00:49:08.420 --> 00:49:10.745
And we know the number
of leaves in this trie

00:49:10.745 --> 00:49:15.054
is exactly the size of T. And so
because it's a compressed trie,

00:49:15.054 --> 00:49:16.470
the number of
internal nodes

00:49:16.470 --> 00:49:19.640
is also less than the size of
T. So the total number of nodes

00:49:19.640 --> 00:49:24.890
here is order T. And
so if we use any

00:49:24.890 --> 00:49:26.750
of the reasonable
representations,

00:49:26.750 --> 00:49:27.900
we get order T space.

00:49:33.020 --> 00:49:36.020
Now, there's one issue which
is, how long does a search for P

00:49:36.020 --> 00:49:36.950
cost?

00:49:36.950 --> 00:49:38.630
And it depends on
our representation,

00:49:38.630 --> 00:49:41.180
it depends how quickly
we can traverse a node.

00:49:41.180 --> 00:49:42.860
If we use hashing--

00:49:42.860 --> 00:49:51.740
method 3-- use hashing,
then we get order P time.

00:49:55.310 --> 00:49:58.100
But the trouble with
hashing is it permutes

00:49:58.100 --> 00:50:00.650
the children of every node.

00:50:00.650 --> 00:50:02.360
So in that situation,
the leaves will not

00:50:02.360 --> 00:50:05.799
be ordered in the same way that
they're ordered in the string.

00:50:05.799 --> 00:50:08.090
So if you really want to be
able to find the first five

00:50:08.090 --> 00:50:11.060
occurrences of the pattern
P, you can't use hashing.

00:50:11.060 --> 00:50:12.680
You can find some
five occurrences

00:50:12.680 --> 00:50:15.200
but you won't find the
first five in the usual ordering

00:50:15.200 --> 00:50:16.770
of the string.

00:50:16.770 --> 00:50:19.280
So if you really
want the first five

00:50:19.280 --> 00:50:23.750
and you want them in order,
then you should use trays--

00:50:23.750 --> 00:50:26.100
this method 6 that we used.

00:50:26.100 --> 00:50:26.780
6?

00:50:26.780 --> 00:50:28.220
5.

00:50:28.220 --> 00:50:35.230
If we use trays, then it will
be order P times log Sigma--

00:50:38.050 --> 00:50:40.640
sorry, order P plus log Sigma.

00:50:40.640 --> 00:50:43.720
That was our query time.

00:50:43.720 --> 00:50:47.030
Here, P plus log Sigma.

00:50:47.030 --> 00:50:50.240
Small penalty to pay but the
nice thing is then your answers

00:50:50.240 --> 00:50:52.310
are represented in order.

00:50:52.310 --> 00:50:56.840
No permutation, no
hashing, no randomization.

00:50:56.840 --> 00:50:58.940
This is the reason suffix
trees were invented--

00:50:58.940 --> 00:51:00.680
they let you do searches fast.

00:51:00.680 --> 00:51:03.500
But actually, they let you
do a ton of things fast.

00:51:03.500 --> 00:51:05.930
And I want to quickly
give you an overview

00:51:05.930 --> 00:51:08.999
of the zillions of things you
can do with the suffix tree.

00:51:08.999 --> 00:51:10.790
And then I want to get
to how to build them

00:51:10.790 --> 00:51:16.205
in linear time, which has some
interesting algorithms/data

00:51:16.205 --> 00:51:19.184
structures.

00:51:19.184 --> 00:51:20.600
I already talked
about if you want

00:51:20.600 --> 00:51:21.980
to find the first
k occurrences, you

00:51:21.980 --> 00:51:23.150
can do that in order k time.

00:51:23.150 --> 00:51:25.280
If you want to find the
number of occurrences,

00:51:25.280 --> 00:51:26.654
you can do that
in constant time,

00:51:26.654 --> 00:51:29.174
just by augmenting
the subtree sizes.
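
NOTE
A minimal sketch of the subtree-size augmentation just described, under simplifying assumptions: it builds an uncompressed suffix trie naively in quadratic time rather than the linear-size compressed tree from the lecture, and the names build_counted_trie and count_occurrences are hypothetical, chosen only for illustration.
```python
def build_counted_trie(text):
    """Naive suffix trie; each node records how many suffixes pass through it."""
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        node["count"] += 1
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1
    return root
# The stored count equals the number of descendant leaves of the node.
def count_occurrences(root, pattern):
    """Walk down the pattern, then read the precomputed subtree size."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return 0  # pattern does not occur in the text
    return node["count"]
trie = build_counted_trie("banana$")
print(count_occurrences(trie, "ana"))  # 2: the two occurrences overlap
```
Once the walk down P is done, the answer is a single field read, which is the constant-time step referred to above.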

00:51:29.174 --> 00:51:30.590
Here's another
thing you could do.

00:51:30.590 --> 00:51:32.990
Suppose you have a
very long string.

00:51:32.990 --> 00:51:35.160
I mean think of T as
an entire document.

00:51:35.160 --> 00:51:38.360
You know, it could be the
Merriam-Webster dictionary

00:51:38.360 --> 00:51:41.320
or it could be the web.

00:51:41.320 --> 00:51:44.069
We're imagining T to be
the huge data structure.

00:51:44.069 --> 00:51:46.610
And then we're able to search
for substrings within that data

00:51:46.610 --> 00:51:50.130
structure very fast.

00:51:50.130 --> 00:51:52.430
So that's cool.

00:51:52.430 --> 00:51:53.680
Here's an interesting puzzle.

00:51:53.680 --> 00:51:57.790
What is the longest substring--
what is the longest string that

00:51:57.790 --> 00:52:00.280
appears twice on the web?

00:52:00.280 --> 00:52:02.260
This is called the longest
repeated substring.

00:52:02.260 --> 00:52:04.610
Could be overlapping, maybe not.

00:52:04.610 --> 00:52:07.690
Well, you take the web, you
throw it in the suffix tree--

00:52:07.690 --> 00:52:09.500
not sure anyone could
actually do that--

00:52:09.500 --> 00:52:11.762
but small part of the web.

00:52:11.762 --> 00:52:13.345
Dictionary-- this
would be no problem.

00:52:17.260 --> 00:52:18.560
Wikipedia would be feasible.

00:52:18.560 --> 00:52:21.280
You take Wikipedia, you
throw it in the suffix tree.

00:52:21.280 --> 00:52:24.820
And what I'm interested
in is, basically,

00:52:24.820 --> 00:52:29.230
a node that has two, at
least two descendant leaves.

00:52:29.230 --> 00:52:31.749
And if I'm counting the number
of leaves at every node,

00:52:31.749 --> 00:52:33.790
I could just do one pass
over this data structure

00:52:33.790 --> 00:52:35.529
and find what are all
the nodes that have

00:52:35.529 --> 00:52:36.820
at least two descendant leaves.

00:52:36.820 --> 00:52:39.520
That's all the internal nodes.

00:52:39.520 --> 00:52:42.280
And then among them I'd also
like to know how deep is it.

00:52:42.280 --> 00:52:46.330
Because the depth corresponds
to how long the string is.

00:52:46.330 --> 00:52:48.280
This one is a-n-a
so this one has,

00:52:48.280 --> 00:52:51.036
I call it, a letter depth of 3.

00:52:51.036 --> 00:52:52.410
This one has a
letter depth of 1.

00:52:52.410 --> 00:52:53.785
This one has a
letter depth of 2.

00:52:53.785 --> 00:52:55.784
So I just want to find
the deepest node that has

00:52:55.784 --> 00:52:57.130
at least two descendant leaves.

00:52:57.130 --> 00:53:00.151
In linear time, I could find
the longest repeated substring.

00:53:00.151 --> 00:53:01.900
Or I could find the
longest substring that

00:53:01.900 --> 00:53:03.520
appears five times or whatever.

00:53:03.520 --> 00:53:05.530
I just do one pass
over this thing,

00:53:05.530 --> 00:53:08.087
find the deepest node that
has my threshold of leaves.

00:53:08.087 --> 00:53:09.670
So that's kind of a
neat thing you can

00:53:09.670 --> 00:53:11.440
do in linear time on a string.
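
NOTE
A rough sketch of the "deepest node with at least k descendant leaves" idea, assuming a naive uncompressed suffix trie (quadratic size, small inputs only) instead of the compressed suffix tree, and assuming "$" does not occur in the text. The function names are hypothetical.
```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (quadratic, for illustration)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root
def longest_repeated_substring(text, threshold=2):
    """Return a longest substring occurring at least `threshold` times."""
    root = build_suffix_trie(text + "$")  # assumes "$" does not occur in text
    best = ""
    def leaves_below(node, prefix):
        nonlocal best
        if not node:  # no children: this node is a leaf
            return 1
        total = sum(leaves_below(child, prefix + ch) for ch, child in node.items())
        # Keep the deepest node seen so far that has enough descendant leaves.
        if total >= threshold and len(prefix) > len(best):
            best = prefix
        return total
    leaves_below(root, "")
    return best
print(longest_repeated_substring("banana"))  # ana
```
In an uncompressed trie the node depth is the letter depth, so one pass that tracks depth and leaf count is enough; the threshold argument covers the "appears five times or whatever" variant.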

00:53:14.780 --> 00:53:16.580
Here's another fun one.

00:53:16.580 --> 00:53:18.920
Suppose I have
this giant string.

00:53:18.920 --> 00:53:21.930
And I just want to compare
two substrings in it.

00:53:21.930 --> 00:53:25.730
So here's my giant string.

00:53:25.730 --> 00:53:29.360
And suppose I want to measure
how long is the repeated

00:53:29.360 --> 00:53:30.290
substring.

00:53:30.290 --> 00:53:31.940
So I say, well,
I've got position i,

00:53:31.940 --> 00:53:32.944
I've got position j.

00:53:32.944 --> 00:53:35.360
Let's say I already know that
they match for a little bit.

00:53:35.360 --> 00:53:37.220
I want to know, how
long do they match?

00:53:37.220 --> 00:53:40.790
How far can I go to the right
and have them still match?

00:53:43.217 --> 00:53:44.050
How could I do that?

00:53:44.050 --> 00:53:46.580
Well, I could look at
the suffix starting at i.

00:53:46.580 --> 00:53:48.510
That corresponds to
a leaf over here.

00:53:48.510 --> 00:53:51.380
And I could look at the
suffix starting at j.

00:53:51.380 --> 00:53:55.100
That corresponds
to some other leaf.

00:53:55.100 --> 00:53:59.000
And what is the length of
the longest common prefix

00:53:59.000 --> 00:54:02.560
of those two suffixes
in the suffix tree?

00:54:07.040 --> 00:54:12.150
Three letters-- LCA.

00:54:12.150 --> 00:54:16.110
If I take the LCA of those
two leaves-- for example,

00:54:16.110 --> 00:54:19.270
I take these two leaves--

00:54:19.270 --> 00:54:21.970
the LCA gives me the
longest common prefix.

00:54:21.970 --> 00:54:23.500
Then they branch.

00:54:23.500 --> 00:54:25.780
So longest common prefix
of these two suffixes

00:54:25.780 --> 00:54:28.360
is the letter a, so
it's just length 1.

00:54:28.360 --> 00:54:31.030
And again, if I label every
node with the letter depth,

00:54:31.030 --> 00:54:33.340
I can figure out exactly
how long these guys match,

00:54:33.340 --> 00:54:35.450
even if they overlap.

00:54:35.450 --> 00:54:37.150
So in constant time--
because we already

00:54:37.150 --> 00:54:39.670
have a constant time LCA query.

00:54:39.670 --> 00:54:41.590
Linear space,
constant time query.

00:54:41.590 --> 00:54:43.031
Given any two
positions i and j, I

00:54:43.031 --> 00:54:45.280
can tell you how long they
match for in constant time.

00:54:45.280 --> 00:54:47.549
Boom-- instantaneously.

00:54:47.549 --> 00:54:48.340
It's kind of crazy.

00:54:48.340 --> 00:54:51.310
So you can do tons of
these queries instantly.

00:54:51.310 --> 00:54:53.770
That's one reason why
people care about LCAs,

00:54:53.770 --> 00:54:54.770
there are other reasons.

00:54:54.770 --> 00:54:58.630
But mostly LCAs were
developed for suffix trees

00:54:58.630 --> 00:54:59.800
to answer queries like that.

00:55:02.650 --> 00:55:03.310
Got some more.

00:55:08.940 --> 00:55:11.010
Why don't I just write--

00:55:11.010 --> 00:55:19.620
LCP of one suffix
and another suffix

00:55:19.620 --> 00:55:22.250
is equivalent to an LCA query.

00:55:22.250 --> 00:55:25.050
And so we can do
that in constant time

00:55:25.050 --> 00:55:26.462
after pre-processing.
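
NOTE
A small sketch of the LCP-as-LCA correspondence, under stated assumptions: the trie is uncompressed and built naively, and the LCA is found by walking parent pointers rather than by the constant-time LCA structure the lecture relies on. Node, build_leaves and lca_letter_depth are hypothetical names.
```python
class Node:
    def __init__(self, parent, depth):
        self.parent = parent
        self.depth = depth  # letter depth: number of characters from the root
        self.children = {}
def build_leaves(text):
    """Naive suffix trie of text; returns leaves[i] = leaf of the suffix starting at i."""
    root = Node(None, 0)
    leaves = []
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node, node.depth + 1)
            node = node.children[ch]
        leaves.append(node)
    return leaves
def lca_letter_depth(a, b):
    """Walk both nodes up to their lowest common ancestor; return its letter depth."""
    while a.depth > b.depth:
        a = a.parent
    while b.depth > a.depth:
        b = b.parent
    while a is not b:
        a, b = a.parent, b.parent
    return a.depth
leaves = build_leaves("banana$")  # the unique "$" makes every suffix end at its own leaf
print(lca_letter_depth(leaves[1], leaves[3]))  # 3: "anana$" and "ana$" share "ana"
```
The leaves list plays the role of the array that maps a suffix to its leaf; swapping the parent-pointer walk for a constant-time LCA query gives the bound stated above.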

00:55:38.600 --> 00:55:39.920
Here's another one.

00:55:39.920 --> 00:55:52.180
Suppose I want to find all
occurrences of T i to j.

00:55:55.810 --> 00:55:57.670
So I give you a
substring and I want

00:55:57.670 --> 00:56:00.070
to know where does that occur.

00:56:00.070 --> 00:56:03.800
The substring is restricted
to come from the text.

00:56:03.800 --> 00:56:05.080
Now, this is a little subtle.

00:56:05.080 --> 00:56:08.860
Of course, I could solve it
in j minus i plus 1 time.

00:56:08.860 --> 00:56:10.660
I just do the search.

00:56:10.660 --> 00:56:14.470
But what if I want to
do it in constant time?

00:56:14.470 --> 00:56:16.435
Maybe this is a
really big substring.

00:56:16.435 --> 00:56:18.740
But I still know it
appears multiple times.

00:56:18.740 --> 00:56:21.120
I want to know how many
times does it appear.

00:56:21.120 --> 00:56:24.100
I claim I can do this
in constant time.

00:56:24.100 --> 00:56:26.320
How?

00:56:26.320 --> 00:56:30.040
This is a level ancestor query.

00:56:30.040 --> 00:56:32.050
Why is it a level
ancestor query?

00:56:32.050 --> 00:56:35.380
If I look at the
suffix starting at i,

00:56:35.380 --> 00:56:38.470
and then I just want to
trim off, I want to stop.

00:56:38.470 --> 00:56:40.510
Or I don't care about
the entire suffix,

00:56:40.510 --> 00:56:43.330
I just want to go up to j.

00:56:43.330 --> 00:56:45.580
It's like saying, well,
suppose I'm looking

00:56:45.580 --> 00:56:47.590
for occurrences of a-n-a.

00:56:47.590 --> 00:56:51.520
So I go and I start at the
first occurrence of a-n-a,

00:56:51.520 --> 00:56:54.930
which is a-n-a-n-a-$, so this
is the leaf corresponding

00:56:54.930 --> 00:56:55.974
to a-n-a.

00:56:55.974 --> 00:56:58.140
And then if I want to find
all occurrences of a-n-a,

00:56:58.140 --> 00:57:03.910
I just need to go up to the
ancestor that represents a-n-a.

00:57:03.910 --> 00:57:09.787
This is what I call a
weighted level ancestor.

00:57:09.787 --> 00:57:11.370
That's not quite the
problem we solved

00:57:11.370 --> 00:57:18.490
in the last lecture, lecture
15, because now it's weighted.

00:57:18.490 --> 00:57:28.577
So it's level ancestor j minus
i of the T i suffix leaf.

00:57:28.577 --> 00:57:30.160
So I find this leaf,
which I just have

00:57:30.160 --> 00:57:31.450
an array of all the leaves.

00:57:31.450 --> 00:57:34.737
Given a suffix, tell me what
leaf it is in the suffix tree.

00:57:34.737 --> 00:57:36.820
And then I want to find
the j minus i-th ancestor,

00:57:36.820 --> 00:57:39.820
except the edges don't
just have unit length.

00:57:39.820 --> 00:57:42.160
So here I want to find
the third ancestor,

00:57:42.160 --> 00:57:45.190
except it's really the ancestor
in the compressed trie.

00:57:45.190 --> 00:57:47.240
I want to do the j
minus i-th ancestor

00:57:47.240 --> 00:57:49.900
in the suffix in
the trie, but what

00:57:49.900 --> 00:57:51.850
I have is a compressed trie.

00:57:51.850 --> 00:57:54.880
And so these edges are labeled
with how many characters

00:57:54.880 --> 00:57:58.000
are on them and I got
to deal with that.

00:57:58.000 --> 00:58:00.980
Fortunately, the data structure
we gave for a level ancestor--

00:58:00.980 --> 00:58:03.040
which was constant time
query, linear space--

00:58:03.040 --> 00:58:05.140
can be fairly easily
adapted to weights.

00:58:08.170 --> 00:58:10.710
Not quite in
constant time though.

00:58:10.710 --> 00:58:14.860
It can be solved
in log log n time.

00:58:14.860 --> 00:58:17.440
And I think that's optimal.

00:58:17.440 --> 00:58:23.710
Because if your thing is
a single path with maybe

00:58:23.710 --> 00:58:28.720
the occasional branch, then
finding your i-th ancestor here

00:58:28.720 --> 00:58:31.430
is like solving a
predecessor problem.

00:58:31.430 --> 00:58:36.190
Because you say, well,
from the i-th position up,

00:58:36.190 --> 00:58:40.150
I want to know what
is the previous--

00:58:40.150 --> 00:58:41.887
I want to round
up or round down.

00:58:41.887 --> 00:58:43.720
So I want to do a
predecessor or a successor

00:58:43.720 --> 00:58:45.600
on this straight line.

00:58:45.600 --> 00:58:47.200
And so for a
predecessor you need

00:58:47.200 --> 00:58:51.610
log log time for
the right parameters

00:58:51.610 --> 00:58:53.206
and this can be achieved.
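
NOTE
A sketch of why a weighted level ancestor on a single path is a predecessor/successor search, assuming the letter depths along the root-to-leaf path are already available as a sorted list. Plain binary search (logarithmic) stands in here for the faster predecessor structure the lecture has in mind, and weighted_ancestor is a hypothetical name.
```python
from bisect import bisect_left
def weighted_ancestor(path_depths, target):
    """path_depths: letter depths of the nodes from the root down to a leaf (strictly increasing).
    Returns the index of the shallowest node on the path whose letter depth is >= target."""
    return bisect_left(path_depths, target)
# Path to the leaf of "anana$" in the compressed suffix tree of "banana$":
# root (0), node "a" (1), node "ana" (3), leaf "anana$" (6).
path = [0, 1, 3, 6]
print(weighted_ancestor(path, 3))  # 2: the node "ana"
print(weighted_ancestor(path, 2))  # 2: "an" ends mid-edge, so round to the node "ana"
```
The second query shows the rounding: a target depth that falls in the middle of a compressed edge resolves to the node at the lower end of that edge, whose subtree holds exactly the matching suffixes.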

00:58:53.206 --> 00:58:55.330
And the basic idea is you
use ladder decomposition,

00:58:55.330 --> 00:58:56.440
just like before.

00:58:56.440 --> 00:58:58.840
But now a ladder can't be
represented by an array

00:58:58.840 --> 00:59:01.760
because there are lots of
absent places in the array.

00:59:01.760 --> 00:59:04.540
Now instead, use a predecessor,
use a van Emde Boas

00:59:04.540 --> 00:59:06.260
to represent a ladder.

00:59:06.260 --> 00:59:07.870
So that's basically all you do.

00:59:07.870 --> 00:59:13.630
Van Emde Boas
represents a ladder.

00:59:13.630 --> 00:59:15.517
That's what you do
in the top structure.

00:59:15.517 --> 00:59:17.350
Remember, we had
indirection, leaf trimming,

00:59:17.350 --> 00:59:19.058
top was this thing,
ladder decomposition.

00:59:19.058 --> 00:59:21.317
Bottom was lookup tables.

00:59:21.317 --> 00:59:23.650
The other problem is you can't
use lookup tables anymore

00:59:23.650 --> 00:59:26.530
because in one of
these tiny trees

00:59:26.530 --> 00:59:28.030
you could have a
super long path.

00:59:28.030 --> 00:59:30.040
It's non-branching,
they got compressed.

00:59:30.040 --> 00:59:31.420
And you can't
afford to enumerate

00:59:31.420 --> 00:59:33.410
all possible situations.

00:59:33.410 --> 00:59:35.092
It's kind of annoying.

00:59:35.092 --> 00:59:37.300
So instead of using lookup
tables-- this was actually

00:59:37.300 --> 00:59:40.960
an idea from some students
in this class last time

00:59:40.960 --> 00:59:43.750
I taught this material--
they said, oh, well, instead

00:59:43.750 --> 00:59:47.180
of using a lookup table, you
can use ladder decomposition.

00:59:47.180 --> 00:59:49.960
So down here, in
the compressed tree,

00:59:49.960 --> 00:59:52.240
we have log n different nodes.

00:59:52.240 --> 00:59:55.090
If you use ladder decomposition
on that thing-- but not

00:59:55.090 --> 00:59:56.030
the hybrid structure.

00:59:56.030 --> 00:59:58.360
Remember, we used jump
pointers plus ladders.

00:59:58.360 --> 00:59:59.890
Jump pointers still
work here, just

00:59:59.890 --> 01:00:03.160
you have to round them
to a different place.

01:00:03.160 --> 01:00:04.750
Down here, I'm not
going to try to do

01:00:04.750 --> 01:00:06.010
jump pointers plus ladders.

01:00:06.010 --> 01:00:07.210
I'll just do ladders.

01:00:07.210 --> 01:00:10.120
And remember, just ladders
gave us a log n query time.

01:00:10.120 --> 01:00:18.300
But now n is log T. And so
we get log log T query time.

01:00:18.300 --> 01:00:20.050
And that's, basically,
all you have to do.

01:00:22.960 --> 01:00:24.960
So you're always jumping
to the top of a ladder.

01:00:24.960 --> 01:00:27.262
You'll only have to
traverse log log T ladders.

01:00:27.262 --> 01:00:28.720
The very last ladder
you might have

01:00:28.720 --> 01:00:32.500
to do a predecessor query that
will cost you log log log T.

01:00:32.500 --> 01:00:35.050
But overall, it will
be log log T time just

01:00:35.050 --> 01:00:39.730
by this kind of tweak to our
level ancestor data structure.

01:00:39.730 --> 01:00:43.120
So I thought that was
kind of a fun connection.

01:00:43.120 --> 01:00:46.450
This is the reason, essentially,
level ancestors were developed.

01:00:46.450 --> 01:00:48.460
And people use them
because you can

01:00:48.460 --> 01:00:51.800
do these kinds of things
in nearly constant time,

01:00:51.800 --> 01:00:54.530
even if the substring is huge.

01:00:54.530 --> 01:00:57.760
So maybe I know ahead
of time all the queries

01:00:57.760 --> 01:00:59.860
I might want to do.

01:00:59.860 --> 01:01:03.310
I just throw them into the
text, just add them in there.

01:01:03.310 --> 01:01:05.770
Then I've got these
substrings, they're now

01:01:05.770 --> 01:01:07.200
represented in the suffix tree.

01:01:07.200 --> 01:01:10.480
Now any substring I want
to query in log log n time,

01:01:10.480 --> 01:01:13.960
I can find all the
occurrences of that string,

01:01:13.960 --> 01:01:16.670
even if the substring is huge.

01:01:16.670 --> 01:01:19.060
So if you know what
queries you want,

01:01:19.060 --> 01:01:20.980
you can preprocess
them and solve them

01:01:20.980 --> 01:01:24.430
even faster than order P time.

01:01:24.430 --> 01:01:25.900
Cool.

01:01:25.900 --> 01:01:32.480
Another thing you can do is
represent multiple documents.

01:01:32.480 --> 01:01:35.580
And that's what I was
sort of getting at there.

01:01:35.580 --> 01:01:37.270
If you have multiple
documents-- say,

01:01:37.270 --> 01:01:39.670
you're storing the
entire web or Wikipedia.

01:01:39.670 --> 01:01:41.560
Like there's multiple pages.

01:01:41.560 --> 01:01:43.480
You want to separate them.

01:01:43.480 --> 01:01:47.860
All you need to do is say,
OK, I'll take my first string

01:01:47.860 --> 01:01:49.990
and then put a special
$ sign after it.

01:01:49.990 --> 01:01:52.980
Then take my second string,
put a special $ sign after it.

01:01:52.980 --> 01:01:56.860
And take my k-th string and
put a special $ sign after it.

01:01:56.860 --> 01:01:59.710
Just concatenate them with
different $ signs in between

01:01:59.710 --> 01:02:00.460
them.

01:02:00.460 --> 01:02:03.885
Then build the suffix tree on
this thing which I'll call T.

01:02:03.885 --> 01:02:06.010
So you can use the same
suffix tree data structure,

01:02:06.010 --> 01:02:08.140
but now, in some sense,
you're representing

01:02:08.140 --> 01:02:11.964
all of these documents and
all the ways they interweave.

01:02:11.964 --> 01:02:13.630
Because there are
some shared substrings

01:02:13.630 --> 01:02:15.838
here that are shared by
this, and this, and whatever.

01:02:15.838 --> 01:02:18.070
And those will be represented
in the same structure.

01:02:18.070 --> 01:02:20.050
Or I can do a search and
then I've effectively

01:02:20.050 --> 01:02:23.500
found all the documents
that contain it.

01:02:23.500 --> 01:02:25.570
One issue, though.

01:02:25.570 --> 01:02:27.820
Suppose, I want to find all
the documents containing

01:02:27.820 --> 01:02:31.390
the word MIT or something.

01:02:31.390 --> 01:02:34.927
Maybe all k of them match,
maybe one document matches,

01:02:34.927 --> 01:02:36.010
maybe two documents match.

01:02:36.010 --> 01:02:37.930
Suppose, two documents match.

01:02:37.930 --> 01:02:40.990
The first document mentions
MIT a billion times.

01:02:40.990 --> 01:02:46.330
The second document
has MIT in it once.

01:02:46.330 --> 01:02:47.980
Then suffix trees
are kind of annoying

01:02:47.980 --> 01:02:50.980
because they will find that
billion and one matches

01:02:50.980 --> 01:02:51.907
as a subtree.

01:02:51.907 --> 01:02:54.490
But if I just want to know the
answer, oh, these two documents

01:02:54.490 --> 01:02:57.070
match, I'd like to do
that in order 2 time,

01:02:57.070 --> 01:03:02.230
not order billion time,
to use technical terms.

01:03:02.230 --> 01:03:08.080
And that is called the document
retrieval problem or a document

01:03:08.080 --> 01:03:09.830
retrieval data structure.

01:03:09.830 --> 01:03:14.320
This is a problem considered
by Muthukrishnan in 2002.

01:03:14.320 --> 01:03:22.510
Document retrieval you can
do in order P plus number

01:03:22.510 --> 01:03:26.150
of documents matching.

01:03:26.150 --> 01:03:30.449
So if I want to list all
the documents that match,

01:03:30.449 --> 01:03:32.740
I could do it in order the number
of documents that match,

01:03:32.740 --> 01:03:37.270
not order the number of
occurrences of the string.

01:03:37.270 --> 01:03:39.760
So I still got to do the
P search in the beginning,

01:03:39.760 --> 01:03:41.920
and then this is better.

01:03:41.920 --> 01:03:45.340
And the funny thing is the
solution to this data structure

01:03:45.340 --> 01:03:49.717
uses RMQ, range minimum
queries, from last lecture.

01:03:49.717 --> 01:03:51.050
So let me tell you how it works.

01:03:51.050 --> 01:03:52.133
It's actually very simple.

01:03:56.730 --> 01:04:01.460
And then I think we'll move on
to how to build a suffix tree.

01:04:01.460 --> 01:04:03.500
So document retrieval.

01:04:08.220 --> 01:04:09.470
Here's what we're going to do.

01:04:25.040 --> 01:04:27.530
Remember, these different $
sign i's represent different

01:04:27.530 --> 01:04:30.230
documents.

01:04:30.230 --> 01:04:32.450
I want to remember
which suffixes

01:04:32.450 --> 01:04:35.060
came from the same document.

01:04:35.060 --> 01:04:40.790
So at every $ sign i, I
want to store the number

01:04:40.790 --> 01:04:44.990
of the previous $ sign i.

01:04:44.990 --> 01:04:48.260
Let's suppose, the suffixes,
when they get to one of the $

01:04:48.260 --> 01:04:51.490
signs, I can just stop, I
don't have to store the rest,

01:04:51.490 --> 01:04:52.490
I'm going to throw away.

01:04:52.490 --> 01:04:55.490
Whenever I hit a $ sign, I
will stop the suffix tree.

01:04:55.490 --> 01:04:57.410
That way, the $ signs
really are leaves,

01:04:57.410 --> 01:04:59.741
all of them now become leaves.

01:04:59.741 --> 01:05:01.490
So I don't really care
about a suffix that

01:05:01.490 --> 01:05:02.480
goes all the way through here.

01:05:02.480 --> 01:05:04.040
I just want the
suffix to the $ sign,

01:05:04.040 --> 01:05:06.960
as it represents the
individual documents.

01:05:06.960 --> 01:05:08.810
So $ sign i's are leaves.

01:05:08.810 --> 01:05:11.600
And I want each of them just
to store a pointer, basically,

01:05:11.600 --> 01:05:14.930
to the previous one of the
same type, the same $ sign i.

01:05:14.930 --> 01:05:16.370
It came from the same document.

01:05:22.860 --> 01:05:26.990
Now, here's the idea.

01:05:26.990 --> 01:05:30.470
I did a search, I
got down to a node,

01:05:30.470 --> 01:05:32.570
and now there's this
big subtree here.

01:05:32.570 --> 01:05:36.400
And this subtree has a
bunch of leaves in it,

01:05:36.400 --> 01:05:40.460
those represent all the
occurrences of the pattern P.

01:05:40.460 --> 01:05:43.120
And let's suppose that
those leaves are numbered.

01:05:43.120 --> 01:05:48.620
I'm numbering the leaves
from 1 to n, I guess.

01:05:48.620 --> 01:05:51.440
Then in here, the leaves
will be an interval--

01:05:51.440 --> 01:05:54.710
interval l, comma, n.

01:05:54.710 --> 01:05:57.560
And the trouble is a lot of
these have the same label $

01:05:57.560 --> 01:05:58.640
sign i.

01:05:58.640 --> 01:06:01.370
And I just want to
find the unique ones.

01:06:01.370 --> 01:06:02.120
How do I do that?

01:06:07.760 --> 01:06:15.560
What we do is find the first
occurrence of $ sign i for each

01:06:15.560 --> 01:06:17.090
i.

01:06:17.090 --> 01:06:19.560
I could just find the first
occurrence of $ sign i for each

01:06:19.560 --> 01:06:20.060
i.

01:06:20.060 --> 01:06:24.370
I'd then only have to pay order
number of distinct documents,

01:06:24.370 --> 01:06:27.170
rather than having to pay for
every match within the document.

01:06:27.170 --> 01:06:30.860
Now, one way to define
the first $ sign i is--

01:06:30.860 --> 01:06:35.690
that's a $ sign i
whose stored value--

01:06:35.690 --> 01:06:38.620
we said we store the leaf number
of the previous $ sign i--

01:06:38.620 --> 01:06:45.301
whose stored value
is less than l.

01:06:45.301 --> 01:06:46.550
So we find some position here.

01:06:46.550 --> 01:06:48.300
If the previous
guy is less than l,

01:06:48.300 --> 01:06:51.950
that means it was the
first of that type.

01:06:51.950 --> 01:06:56.330
If we store this, that's
definition of being first.

01:06:56.330 --> 01:07:01.610
So in this interval, I want to
find $ sign i's that have very

01:07:01.610 --> 01:07:04.070
small stored values.

01:07:04.070 --> 01:07:06.050
How would I find
the very best one?

01:07:06.050 --> 01:07:07.430
Range minimum query.

01:07:07.430 --> 01:07:12.560
So we do a range minimum
query on l, comma, n.

01:07:12.560 --> 01:07:15.200
If there's any firsts in
there, this will find it.

01:07:18.580 --> 01:07:24.470
Find, let's say, a position
m with the smallest

01:07:24.470 --> 01:07:25.790
possible stored value.

01:07:37.430 --> 01:07:43.080
If the stored number
is less than l,

01:07:43.080 --> 01:07:44.570
then output that answer.

01:07:48.480 --> 01:07:53.890
And then recurse on the
remaining intervals.

01:07:53.890 --> 01:08:01.860
So there's going to be from l
to m minus 1 and m plus 1 to n.

01:08:01.860 --> 01:08:05.284
So we find the best
candidate, the minimum.

01:08:05.284 --> 01:08:06.450
That's minimum stored value.

01:08:06.450 --> 01:08:09.210
If anything is going to be
less than l, that would be it.

01:08:09.210 --> 01:08:12.459
If it is less than l, we output
it, then we recurse over here

01:08:12.459 --> 01:08:13.500
and we recurse over here.

01:08:13.500 --> 01:08:15.750
At some point this will
stop finding things.

01:08:15.750 --> 01:08:17.910
We're going to do
another RMQ over here.

01:08:17.910 --> 01:08:21.189
Might not find anything, then
we just stop that recursion.

01:08:21.189 --> 01:08:23.151
But the number of
recursions we have to do

01:08:23.151 --> 01:08:25.109
is going to be equal to
the number of documents

01:08:25.109 --> 01:08:27.810
that match, maybe plus 1.

01:08:27.810 --> 01:08:30.660
So we achieved this bound
using RMQ because RMQ

01:08:30.660 --> 01:08:33.689
we can do in constant time with
appropriate pre-processing.
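
NOTE
A sketch of the document-retrieval recursion above, under simplifying assumptions: the leaves are obtained by naively sorting every suffix of every document (each suffix stops at the end of its own document, so the separators stay implicit), the matching interval is found by a scan rather than a tree search, and the range-minimum query is a linear scan rather than the constant-time structure. All names are hypothetical.
```python
def build_leaf_arrays(docs):
    """Sort every suffix of every document; leaf k is the k-th suffix in sorted order."""
    leaves = sorted((doc[i:], d) for d, doc in enumerate(docs) for i in range(len(doc)))
    doc_of = [d for _, d in leaves]
    prev = []  # prev[k] = number of the previous leaf from the same document, or -1
    last_seen = {}
    for k, d in enumerate(doc_of):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = k
    return leaves, doc_of, prev
def matching_documents(docs, pattern):
    """List each document containing pattern once, however often it matches."""
    leaves, doc_of, prev = build_leaf_arrays(docs)
    hits = [k for k, (suffix, _) in enumerate(leaves) if suffix.startswith(pattern)]
    if not hits:
        return []
    left, right = hits[0], hits[-1]  # matching leaves form one contiguous interval
    found = []
    def report(lo, hi):
        if lo > hi:
            return
        m = min(range(lo, hi + 1), key=prev.__getitem__)  # naive range-minimum query
        if prev[m] >= left:
            return  # no first occurrence of any document in this range
        found.append(doc_of[m])
        report(lo, m - 1)
        report(m + 1, hi)
    report(left, right)
    return sorted(found)
print(matching_documents(["MIT MIT MIT", "at MIT", "Harvard"], "MIT"))  # [0, 1]
```
Document 0 matches three times but is reported once, because only its first leaf in the interval has a stored value below the left end of the interval.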

01:08:33.689 --> 01:08:35.770
Now, the RMQ is over an array.

01:08:35.770 --> 01:08:40.590
It's over this array of stored
values indexed by leaves.

01:08:40.590 --> 01:08:42.330
And this idea of
taking the leaves

01:08:42.330 --> 01:08:46.290
and writing them down in order
is actually something we need.

01:08:46.290 --> 01:08:48.180
It's called a suffix array.

01:08:56.970 --> 01:08:59.340
We're going to use this
alternate representation

01:08:59.340 --> 01:09:02.640
of suffix trees in
order to compute them.

01:09:02.640 --> 01:09:04.899
Suffix arrays in some sense
are easier to think about.

01:09:16.410 --> 01:09:18.750
The idea with the
suffix array is

01:09:18.750 --> 01:09:21.540
to write down all the
suffixes, sort them.

01:09:25.979 --> 01:09:27.090
This is conceptual.

01:09:27.090 --> 01:09:28.710
Imagine you take
all these suffixes.

01:09:28.710 --> 01:09:30.370
Their total size
is quadratic in T

01:09:30.370 --> 01:09:32.142
so you'd never actually
want to do this.

01:09:32.142 --> 01:09:33.600
But just imagine
writing them down,

01:09:33.600 --> 01:09:37.560
sorting them lexically using
our string sorting algorithms.

01:09:37.560 --> 01:09:40.560
And then we can't
represent them explicitly

01:09:40.560 --> 01:09:41.850
because it would be too big.

01:09:41.850 --> 01:09:48.254
Just write down their index,
just store the indices.

01:09:51.729 --> 01:09:52.770
Let's do this for banana.

01:09:55.350 --> 01:09:58.545
Banana's over here.

01:09:58.545 --> 01:10:00.060
It'll make my life
a little harder.

01:10:17.090 --> 01:10:19.640
Actually, they're already
here in sorted order.

01:10:19.640 --> 01:10:23.720
If dollar sign, I'm supposing,
is first, first suffix is $,

01:10:23.720 --> 01:10:28.460
then a-$, then a-n-a-$, then
a-n-a-n-a-$, then banana,

01:10:28.460 --> 01:10:31.880
then n-a-$, then n-a-n-a-$.

01:10:31.880 --> 01:10:33.980
I'll just write
that down over here.

01:10:33.980 --> 01:10:56.580
$, a-$, a-n-a-$, a-n-a-n-a-$,
then banana, then n-a-$,

01:10:56.580 --> 01:10:59.420
then n-a-n-a-$.

01:10:59.420 --> 01:11:02.220
If you look at these, they're
indeed in sorted order-- $,

01:11:02.220 --> 01:11:04.330
a's, b's, n's.

01:11:04.330 --> 01:11:06.740
Everything is sorted
here lexically.

01:11:06.740 --> 01:11:09.160
Now, I can't store this
because it's quadratic size.

01:11:09.160 --> 01:11:11.740
Instead, I just write down the
numbers that are down there.

01:11:11.740 --> 01:11:14.620
This was the sixth suffix, it
was starting at position 6.

01:11:14.620 --> 01:11:21.370
Then 5, then 3, then 1, then 0--

01:11:21.370 --> 01:11:27.010
that's everything--
then 4, then 2.

01:11:27.010 --> 01:11:31.810
This thing is the suffix array.

01:11:31.810 --> 01:11:33.550
It also has linear size.

01:11:33.550 --> 01:11:37.570
It's just a permutation on the
suffix labels, suffix indices.
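
NOTE
A minimal sketch of the definition just given, assuming "$" sorts before every letter (true for ASCII). It sorts the suffixes directly, which is the conceptual quadratic version and not a linear-time construction; suffix_array is a hypothetical name.
```python
def suffix_array(text):
    """Sort suffix start positions by the suffixes they denote (naive, not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])
print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```
Only the starting indices are kept, so the result has linear size even though the suffixes themselves total quadratic length.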

01:11:46.115 --> 01:11:48.699
I still want to
tell you about it.

01:11:48.699 --> 01:11:50.240
There's some other
information that's

01:11:50.240 --> 01:11:55.630
helpful to write down
about the suffix array.

01:11:55.630 --> 01:11:59.600
It's called longest common
prefix information, LCP.

01:11:59.600 --> 01:12:03.580
The idea is to look at adjacent
elements in the suffix array.

01:12:03.580 --> 01:12:06.080
In some sense, this represents
the same information, right?

01:12:06.080 --> 01:12:08.360
Our whole goal is to
sort the suffixes.

01:12:08.360 --> 01:12:12.200
If we could do this,
then, as we'll see,

01:12:12.200 --> 01:12:13.430
we can also build this.

01:12:13.430 --> 01:12:15.472
And this is sort of
what we really want.

01:12:15.472 --> 01:12:17.180
The suffix array by
itself is pretty good

01:12:17.180 --> 01:12:19.130
if you add in LCP information.

01:12:19.130 --> 01:12:21.500
LCP is-- what is the longest
common prefix of these two

01:12:21.500 --> 01:12:22.000
suffixes?

01:12:22.000 --> 01:12:23.960
In this case, 0.

01:12:23.960 --> 01:12:26.090
In this case, one letter.

01:12:26.090 --> 01:12:30.770
In this case, three
letters match.

01:12:30.770 --> 01:12:33.440
So here the value is 3.

01:12:33.440 --> 01:12:36.320
And the next one,
zero letters match.

01:12:36.320 --> 01:12:39.660
Next one, zero letters match.

01:12:39.660 --> 01:12:45.380
Next one, two letters match.

01:12:45.380 --> 01:12:47.720
So this is another array
you could store here--

01:12:47.720 --> 01:12:49.805
0, 1, 3, 0, 0, 2.
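
NOTE
A sketch of how the LCP array above can be computed by comparing neighbouring suffixes character by character. This is the naive method, used only to check the numbers; lcp_array is a hypothetical name and the suffix array is taken as given.
```python
def lcp_array(text, sa):
    """lcp[k] = length of the longest common prefix of suffixes sa[k] and sa[k + 1]."""
    def common_prefix(i, j):
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n
    return [common_prefix(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]
sa = [6, 5, 3, 1, 0, 4, 2]  # suffix array of "banana$"
print(lcp_array("banana$", sa))  # [0, 1, 3, 0, 0, 2]
```
The array has one entry fewer than the suffix array, since it describes the gaps between adjacent suffixes.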

01:12:49.805 --> 01:12:51.980
AUDIENCE: Longest common prefix?

01:12:51.980 --> 01:12:56.270
ERIK DEMAINE: Longest common
prefix of the suffixes.

01:12:56.270 --> 01:12:58.200
Because each of these
is a suffix but here

01:12:58.200 --> 01:13:01.400
we're interested in how
long they match for.

01:13:01.400 --> 01:13:05.360
I claim if you have this suffix
array and this LCP information,

01:13:05.360 --> 01:13:08.570
you can build this structure.

01:13:08.570 --> 01:13:12.680
Anyone wants to tell me how
to build this using this?

01:13:12.680 --> 01:13:17.930
It's a one word or two word
answer that we saw, I think,

01:13:17.930 --> 01:13:19.480
last class.

01:13:19.480 --> 01:13:21.860
But we saw a lot of things
last class, so it's maybe not

01:13:21.860 --> 01:13:22.360
obvious.

01:13:30.530 --> 01:13:32.730
Magic words are Cartesian tree.

01:13:35.570 --> 01:13:41.970
Cartesian tree was how we
converted RMQ into LCA,

01:13:41.970 --> 01:13:42.630
I think.

01:13:42.630 --> 01:13:43.250
Yeah?

01:13:43.250 --> 01:13:45.890
Which was you take the
minimum value in the array,

01:13:45.890 --> 01:13:50.480
make that the root, and then
recurse on the two sides.

01:13:50.480 --> 01:13:54.110
So a Cartesian tree
of the LCP array,

01:13:54.110 --> 01:13:57.800
basically, gives you
this transformation.

01:13:57.800 --> 01:13:59.930
The minimum values
here are the 0's.

01:13:59.930 --> 01:14:02.990
Now, before we just broke
ties, we picked an arbitrary 0,

01:14:02.990 --> 01:14:04.050
put it at the root.

01:14:04.050 --> 01:14:07.500
Now I want to take all the
0's, put them at the root.

01:14:07.500 --> 01:14:13.130
If I do that, I get
three 0's at the root

01:14:13.130 --> 01:14:16.170
and then I have
everything in between.

01:14:16.170 --> 01:14:19.430
So there's nothing
left of the first 0.

01:14:19.430 --> 01:14:21.740
Then next one,
there's these guys

01:14:21.740 --> 01:14:24.320
and the mins are going
to be 1 and then 3.

01:14:24.320 --> 01:14:27.800
So here I'm going to get a
1 when I recurse and then 3.

01:14:34.910 --> 01:14:36.740
There's nothing in
between these 0's.

01:14:36.740 --> 01:14:39.380
And after the last
0, there's a 2.

01:14:39.380 --> 01:14:41.840
So this would be the Cartesian
tree, a slightly different

01:14:41.840 --> 01:14:43.760
version where we
don't break ties,

01:14:43.760 --> 01:14:45.800
we take all the
mins simultaneously,

01:14:45.800 --> 01:14:47.670
put them at the root.

01:14:47.670 --> 01:14:50.300
Now, does that look
like this thing?

01:14:50.300 --> 01:14:50.877
Yeah.

01:14:50.877 --> 01:14:52.085
Everything except the leaves.

01:14:52.085 --> 01:14:53.960
[INAUDIBLE] are
missing at the leaves.

01:14:53.960 --> 01:14:57.170
The leaves are represented
by these values.

01:14:57.170 --> 01:15:00.410
Just visit them in order
here, do an in-order traversal

01:15:00.410 --> 01:15:02.270
of the missing pointers here.

01:15:02.270 --> 01:15:12.260
We're going to get 6, and
then 5, and then 3 and then 1,

01:15:12.260 --> 01:15:19.609
and then 0, and
then 4, and then 2.
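A rough sketch of this tie-merging Cartesian tree, assuming the suffix array and LCP array of banana$ as hard-coded inputs. The function names and the nested-tuple representation are hypothetical, and taking `min` over a slice at every level makes this quadratic in the worst case, so it only illustrates the shape of the tree, not a linear-time build.

```python
def lcp_cartesian_tree(sa, lcp, lo=0, hi=None):
    """Tree over leaves sa[lo..hi] (inclusive), using the gaps lcp[lo..hi-1].

    A leaf is a suffix start position (int). An internal node is a pair
    (letter_depth, children). All minima go into the same node, no tie-breaking.
    """
    if hi is None:
        hi = len(sa) - 1
    if lo == hi:
        return sa[lo]
    depth = min(lcp[lo:hi])
    children = []
    start = lo
    for k in range(lo, hi):
        if lcp[k] == depth:  # every occurrence of the minimum splits here
            children.append(lcp_cartesian_tree(sa, lcp, start, k))
            start = k + 1
    children.append(lcp_cartesian_tree(sa, lcp, start, hi))
    return (depth, children)


def leaves_in_order(node):
    """Left-to-right leaf labels, which should reproduce the suffix array."""
    if isinstance(node, int):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves_in_order(child)]


sa = [6, 5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
tree = lcp_cartesian_tree(sa, lcp)
print(tree)  # prints (0, [6, (1, [5, (3, [3, 1])]), 0, (2, [4, 2])])
print(leaves_in_order(tree))  # prints [6, 5, 3, 1, 0, 4, 2]
```

The first component of each internal node is the letter depth, so the root carries 0 and the internal nodes carry 1, 3, and 2.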

01:15:19.609 --> 01:15:21.900
Now, the meaning of these
values is slightly different.

01:15:21.900 --> 01:15:24.740
Maybe I should
circle them in red.

01:15:24.740 --> 01:15:27.570
These leaves are just
like these leaves.

01:15:27.570 --> 01:15:30.389
They're exactly the labels we
wrote down in the same order.

01:15:30.389 --> 01:15:31.930
These numbers are
slightly different.

01:15:31.930 --> 01:15:33.569
What they represent
are letter depths.

01:15:33.569 --> 01:15:36.110
The letter depth of this node
is 0, letter depth of this node

01:15:36.110 --> 01:15:37.670
is 1, letter depth
of this node is 3.

01:15:37.670 --> 01:15:38.753
That's what I wrote here--

01:15:38.753 --> 01:15:40.804
1, 3, 2.

01:15:40.804 --> 01:15:42.240
This one says, 2.

01:15:42.240 --> 01:15:44.000
These LCPs are exactly
the letter depth.

01:15:44.000 --> 01:15:45.967
That's how far down
the tree you are.

01:15:45.967 --> 01:15:48.050
Once you have this structure
and the letter depth,

01:15:48.050 --> 01:15:50.265
you can very easily
put in these labels.

01:15:50.265 --> 01:15:51.410
I won't say how to do that.

01:15:51.410 --> 01:15:53.210
But in linear time,
if I could build

01:15:53.210 --> 01:15:59.030
the suffix array plus the LCPs,
I could build the suffix tree.

01:15:59.030 --> 01:16:02.081
So our real goal is to build
this information, these two

01:16:02.081 --> 01:16:02.580
arrays.

01:16:02.580 --> 01:16:04.300
If we could do it
in linear time,

01:16:04.300 --> 01:16:07.470
we'd get a suffix
tree in linear time.

01:16:07.470 --> 01:16:10.100
So that is what
remains to be done.

01:16:17.379 --> 01:16:18.170
We're going to do--

01:16:24.565 --> 01:16:27.200
not quite linear time.

01:16:27.200 --> 01:16:30.620
If you want a
nicely sorted suffix

01:16:30.620 --> 01:16:33.830
tree where all the
children are labeled here--

01:16:33.830 --> 01:16:36.020
so in particular, if I
just had a single node,

01:16:36.020 --> 01:16:39.650
I have to be able to sort
the letters in the alphabet.

01:16:39.650 --> 01:16:41.090
However long that takes.

01:16:41.090 --> 01:16:42.560
Maybe it's a small
alphabet and you

01:16:42.560 --> 01:16:45.500
can do linear time sorting
by radix sort or whatever.

01:16:45.500 --> 01:16:47.260
However long that
takes, we do it once.

01:16:47.260 --> 01:16:50.377
Then the rest will
be order T time.

01:16:50.377 --> 01:16:51.210
Here's how we do it.

01:16:51.210 --> 01:16:54.217
First step-- sort the alphabet.

01:16:54.217 --> 01:16:56.300
This will turn out to be
more interesting than you

01:16:56.300 --> 01:16:57.170
might think.

01:16:57.170 --> 01:16:58.430
I'll come back to it.

01:16:58.430 --> 01:17:01.250
Second step--
replace each letter

01:17:01.250 --> 01:17:03.920
by its index in
the sorted order.

01:17:03.920 --> 01:17:07.130
This sounds boring but it
will be useful for later.
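As a small illustration of these first two steps, assuming the text is banana$ and using Python's built-in sort for the alphabet (the lecture leaves the sorting method open), `rank_letters` is a hypothetical helper name:

```python
def rank_letters(text):
    """Sort the distinct letters, then replace each letter by its sorted index."""
    alphabet = sorted(set(text))
    index = {letter: rank for rank, letter in enumerate(alphabet)}
    return [index[letter] for letter in text]


print(rank_letters("banana$"))  # prints [2, 1, 3, 1, 3, 1, 0]
```

After this step every letter is an integer between 0 and the alphabet size minus 1, which is what makes the later radix sorts possible.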

01:17:15.710 --> 01:17:19.670
Third step-- the big idea.

01:17:19.670 --> 01:17:23.180
This is an algorithm by
Kärkkäinen and Sanders,

01:17:23.180 --> 01:17:25.750
from 2003.

01:17:25.750 --> 01:17:27.950
The problem was first
solved in this running time

01:17:27.950 --> 01:17:31.640
by Martin Farach-Colton,
our good friend.

01:17:31.640 --> 01:17:33.260
But then it got simplified.

01:17:33.260 --> 01:17:36.170
So I'll tell you a little
bit about that in a moment.

01:17:38.750 --> 01:17:41.270
And there's going to be
a lot of writing here.

01:17:52.880 --> 01:17:55.410
The idea here is we're going
to take the 3i-th letter,

01:17:55.410 --> 01:17:57.440
the 3i plus 1st, and the 3i
plus 2nd letter,

01:17:57.440 --> 01:18:00.020
concatenate them into a
single triple letter--

01:18:00.020 --> 01:18:01.800
think of it as a single letter.

01:18:01.800 --> 01:18:03.402
And then just do that for all i.

01:18:03.402 --> 01:18:05.110
So it's like I take
these guys, make them

01:18:05.110 --> 01:18:07.070
one letter, these guys,
make them one letter.

01:18:07.070 --> 01:18:10.100
Now, I could start at 0,
or I could start at 1,

01:18:10.100 --> 01:18:12.230
or I could start at 2.

01:18:12.230 --> 01:18:14.260
Do them all.

01:18:14.260 --> 01:18:22.910
So this is going to be 3i
plus 1, 3i plus 2, 3i plus 3.

01:18:22.910 --> 01:18:32.390
And this one is going to be 3i
plus 2, 3i plus 3, 3i plus 4.

01:18:32.390 --> 01:18:33.860
We're going to do
this to recurse.

01:18:33.860 --> 01:18:36.320
But the point is, if
I want to represent

01:18:36.320 --> 01:18:38.750
all the suffixes
of T, suffix could

01:18:38.750 --> 01:18:41.600
start at a position 0 mod
3, or position 1 mod 3,

01:18:41.600 --> 01:18:43.800
or position 2 mod 3.

01:18:43.800 --> 01:18:46.250
So if I could sort all the
suffixes of these guys,

01:18:46.250 --> 01:18:49.120
I would effectively sort all
the suffixes of the original T.

01:18:49.120 --> 01:18:51.800
This tripling up doesn't
really change things,

01:18:51.800 --> 01:18:53.880
up to like plus 1 or 2.
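One way to picture the three triple strings, as a sketch only. It assumes the letters are already integer ranks and that positions past the end are padded with -1, a value smaller than every rank; the padding convention is an assumption, since the lecture only says things change by plus 1 or 2. The name `triple_strings` is hypothetical.

```python
def triple_strings(t, pad=-1):
    """Return [T0, T1, T2], where Tk[i] = (t[3i+k], t[3i+k+1], t[3i+k+2])."""
    n = len(t)
    padded = list(t) + [pad, pad]  # so the last triples are always full
    return [
        [tuple(padded[j:j + 3]) for j in range(k, n, 3)]
        for k in range(3)
    ]


t = [2, 1, 3, 1, 3, 1, 0]  # banana$ after replacing letters by ranks
t0, t1, t2 = triple_strings(t)
print(t0)  # prints [(2, 1, 3), (1, 3, 1), (0, -1, -1)]
print(t1)  # prints [(1, 3, 1), (3, 1, 0)]
print(t2)  # prints [(3, 1, 3), (1, 0, -1)]
```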

01:19:03.500 --> 01:19:05.180
Next, I believe, is recursion.

01:19:13.726 --> 01:19:18.320
I'm going to take T0 and
T1 and concatenate them.

01:19:18.320 --> 01:19:21.320
This thing has size 2/3 n.

01:19:21.320 --> 01:19:24.500
It has number of characters
2/3 n because each of them

01:19:24.500 --> 01:19:26.420
has a third of the
number of characters.

01:19:26.420 --> 01:19:28.020
Of course, all the
information is still there,

01:19:28.020 --> 01:19:28.978
which is kind of weird.

01:19:28.978 --> 01:19:31.680
But if we treat each triple as
a single character, each string

01:19:31.680 --> 01:19:35.435
then has 1/3 n characters. We can't
afford to recurse on all three.

01:19:35.435 --> 01:19:37.850
We can only afford to recurse
on two out of the three

01:19:37.850 --> 01:19:40.850
because then we're going to get
a recurrence of the form T of n

01:19:40.850 --> 01:19:46.370
is T of 2/3 n plus order n.

01:19:46.370 --> 01:19:48.682
And this is geometric,
so it's order n.

01:19:48.682 --> 01:19:50.390
That's how we're going
to get linear time

01:19:50.390 --> 01:19:53.630
after the first sort.
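Written out, with c standing for the constant hidden in the order n term (c is a symbol introduced here, not one from the board), the recurrence unrolls into a geometric series:

```latex
T(n) = T\!\left(\tfrac{2}{3}n\right) + cn
     \le cn \sum_{k=0}^{\infty} \left(\tfrac{2}{3}\right)^{k}
     = \frac{cn}{1 - \tfrac{2}{3}}
     = 3cn
     = O(n).
```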

01:19:53.630 --> 01:19:56.000
If this was 3/3 n, then
this would be n log n.

01:19:56.000 --> 01:19:58.310
We don't want to do that.

01:19:58.310 --> 01:19:59.832
So that's what I can afford.

01:19:59.832 --> 01:20:01.040
Now I've got to deal with it.

01:20:01.040 --> 01:20:03.020
What this tells me is,
the sorted order of all

01:20:03.020 --> 01:20:05.420
the suffixes of T0 and
T1, all the suffixes

01:20:05.420 --> 01:20:08.630
starting at positions
that are 0 or 1 mod 3.

01:20:12.140 --> 01:20:17.840
Next thing we'd like to do
is sort the suffixes of T2.

01:20:17.840 --> 01:20:20.150
We can do that, I
claim, by radix sort.

01:20:26.160 --> 01:20:27.190
How do we do that?

01:20:27.190 --> 01:20:30.240
Well, if you look
at a suffix T2 from i onwards,

01:20:30.240 --> 01:20:37.320
this is the same thing as
T from 3i plus 2 onwards.

01:20:37.320 --> 01:20:43.140
Which we can think of as
that first character, comma,

01:20:43.140 --> 01:20:44.265
the next character onwards.

01:20:48.840 --> 01:20:51.150
Sorry, that's the angle bracket.

01:20:51.150 --> 01:20:56.680
And this thing is, basically,
T0 of i plus 1 onwards.

01:20:56.680 --> 01:20:58.260
So if I strip off
the first letter,

01:20:58.260 --> 01:21:00.000
then I get a suffix
that I know about.

01:21:00.000 --> 01:21:02.830
I know the sorted order
of all the T0 suffixes.

01:21:02.830 --> 01:21:04.830
So this is really just
a-- you can think of this

01:21:04.830 --> 01:21:06.679
as a two character value.

01:21:06.679 --> 01:21:08.220
There's a single
character from Sigma

01:21:08.220 --> 01:21:12.660
here, which we've
already reduced down to--

01:21:12.660 --> 01:21:18.210
this is an integer between
0 and Sigma minus 1.

01:21:18.210 --> 01:21:21.150
For this thing you can do the same
thing using the ranks from the recursive

01:21:21.150 --> 01:21:22.110
sort.

01:21:22.110 --> 01:21:24.210
So you've just got two values.

01:21:24.210 --> 01:21:24.810
Small.

01:21:24.810 --> 01:21:27.540
You can radix sort
them in linear time.

01:21:27.540 --> 01:21:30.990
And then we will have sorted
T2 suffixes because we already

01:21:30.990 --> 01:21:32.220
knew the order of these guys.
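A sketch of this radix sort, under some loud assumptions: letters are integer ranks, `rank` maps each position that is 0 or 1 mod 3 to the order of its suffix, and a position past the end counts as rank -1. In the real algorithm `rank` would come from the recursive call; here a naive sort stands in for it so the example runs by itself. `bucket_pass` and `sort_t2_suffixes` are hypothetical names.

```python
def bucket_pass(items, digit, base):
    """One stable bucket pass; digit(x) must be an integer in range(base)."""
    buckets = [[] for _ in range(base)]
    for x in items:
        buckets[digit(x)].append(x)
    return [x for bucket in buckets for x in bucket]


def sort_t2_suffixes(t, rank):
    """Sort positions p = 2 mod 3 by the pair (t[p], rank of the suffix at p + 1)."""
    n = len(t)
    positions = list(range(2, n, 3))
    # Least significant digit first: the rank of the following suffix,
    # shifted by 1 so that "past the end" (-1) lands in bucket 0.
    positions = bucket_pass(positions, lambda p: rank.get(p + 1, -1) + 1, n + 1)
    # Most significant digit last: the first letter itself.
    positions = bucket_pass(positions, lambda p: t[p], max(t) + 1)
    return positions


t = [2, 1, 3, 1, 3, 1, 0]  # banana$ as ranks
# Stand-in for the recursive call: naive order of the 0 and 1 mod 3 suffixes.
known = sorted((p for p in range(len(t)) if p % 3 != 2), key=lambda p: t[p:])
rank = {p: r for r, p in enumerate(known)}
print(sort_t2_suffixes(t, rank))  # prints [5, 2]
```

Position 5 is the suffix a$ and position 2 is nana$, so the order 5 then 2 agrees with the full suffix array.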

01:21:34.800 --> 01:21:45.780
One more thing, which we have
to merge suffixes of T0 and T1

01:21:45.780 --> 01:21:52.470
with suffixes of T2.

01:21:52.470 --> 01:21:55.755
And this is where
we use the fact

01:21:55.755 --> 01:21:58.130
that there are three of these
things and not two of them.

01:21:58.130 --> 01:22:01.040
This is a weird case where three
way divide and conquer works.

01:22:01.040 --> 01:22:02.952
Two way divide and
conquer is what

01:22:02.952 --> 01:22:04.160
Farach-Colton did originally.

01:22:04.160 --> 01:22:07.070
It's much more complicated
because of this merge step.

01:22:07.070 --> 01:22:09.050
Merge gets painful.

01:22:09.050 --> 01:22:13.490
I claim this merging
is easy because merging

01:22:13.490 --> 01:22:16.500
is linear time, provided your
comparison is constant time.

01:22:16.500 --> 01:22:21.610
So if I need to compare a
T0 suffix with a T2 suffix,

01:22:21.610 --> 01:22:23.840
if I want to do
that comparison, I

01:22:23.840 --> 01:22:26.030
strip off the first
letter from this one.

01:22:26.030 --> 01:22:29.330
It turns into a T1 suffix,
the first character

01:22:29.330 --> 01:22:30.155
plus a T1 suffix.

01:22:30.155 --> 01:22:32.113
If I strip off the first
character of this one,

01:22:32.113 --> 01:22:35.810
it turns into the first
character and then a T0 suffix.

01:22:35.810 --> 01:22:38.150
And these things I know how
to compare because I already

01:22:38.150 --> 01:22:40.970
sorted T0, comma, T1.

01:22:40.970 --> 01:22:47.900
If I need to compare a T1
suffix with a T2 suffix,

01:22:47.900 --> 01:22:48.830
how do I do it?

01:22:48.830 --> 01:22:51.440
I strip off the first
two letters of this one,

01:22:51.440 --> 01:22:52.850
I get a T0 suffix.

01:22:52.850 --> 01:22:55.340
I strip off the first
two letters of this one,

01:22:55.340 --> 01:22:56.829
I get a T1 suffix.

01:22:56.829 --> 01:22:59.120
I can't strip off one letter
because this would turn it

01:22:59.120 --> 01:23:00.953
into a T2 and I don't
know how to compare T2

01:23:00.953 --> 01:23:02.960
to other things,
that's the whole point.

01:23:02.960 --> 01:23:05.420
I guess, it's a T2
versus a T0, if I

01:23:05.420 --> 01:23:07.330
did that, which is this case.

01:23:07.330 --> 01:23:09.200
But here, I strip
off two letters,

01:23:09.200 --> 01:23:10.850
I get something I
know how to compare.

01:23:10.850 --> 01:23:13.290
This technique does not work
if you only have two things.

01:23:13.290 --> 01:23:15.540
It only works if you have
three things because there are

01:23:15.540 --> 01:23:17.700
sort of these situations.

01:23:17.700 --> 01:23:18.950
So constant time.

01:23:18.950 --> 01:23:21.620
By comparing these little
tuples, the first character

01:23:21.620 --> 01:23:26.000
or two plus the remaining
suffix, I can do the comparator

01:23:26.000 --> 01:23:27.980
and merge.
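A sketch of that constant-time comparator and the merge, assuming letters are integer ranks, positions past the end read as -1, and the two sorted lists are given (they are hard-coded below for banana$ rather than computed). The names `merge_suffixes`, `letter`, `order`, and `before` are made up for illustration.

```python
def merge_suffixes(t, sorted01, sorted2):
    """Merge sorted 0/1 mod 3 suffix positions with sorted 2 mod 3 positions."""
    n = len(t)
    rank = {p: r for r, p in enumerate(sorted01)}

    def letter(p):
        return t[p] if p < n else -1  # padding sorts before every real letter

    def order(p):
        return rank.get(p, -1)  # a suffix past the end sorts first

    def before(a, b):
        """True if suffix a (0 or 1 mod 3) sorts before suffix b (2 mod 3)."""
        if a % 3 == 0:
            # Strip one letter: a + 1 is 1 mod 3 and b + 1 is 0 mod 3.
            return (letter(a), order(a + 1)) < (letter(b), order(b + 1))
        # Strip two letters: a + 2 is 0 mod 3 and b + 2 is 1 mod 3.
        return ((letter(a), letter(a + 1), order(a + 2))
                < (letter(b), letter(b + 1), order(b + 2)))

    merged, i, j = [], 0, 0
    while i < len(sorted01) and j < len(sorted2):
        if before(sorted01[i], sorted2[j]):
            merged.append(sorted01[i])
            i += 1
        else:
            merged.append(sorted2[j])
            j += 1
    return merged + sorted01[i:] + sorted2[j:]


t = [2, 1, 3, 1, 3, 1, 0]  # banana$ as ranks
print(merge_suffixes(t, [6, 3, 1, 0, 4], [5, 2]))  # prints [6, 5, 3, 1, 0, 4, 2]
```

Each comparison looks at a fixed number of letters and one precomputed rank, so the merge as a whole is linear.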

01:23:27.980 --> 01:23:31.010
And then if I can do that,
everything is linear time.

01:23:31.010 --> 01:23:33.710
The only interesting thing
is how do I sort the alphabet

01:23:33.710 --> 01:23:34.970
when I recurse?

01:23:34.970 --> 01:23:38.690
And for that, you
use radix sort.

01:23:38.690 --> 01:23:44.500
So the first time,
you pay the cost of sorting Sigma.

01:23:44.500 --> 01:23:46.250
We don't know how long
that takes, depends

01:23:46.250 --> 01:23:47.330
on your alphabet.

01:23:47.330 --> 01:23:49.359
But every following
recursion it's a radix

01:23:49.359 --> 01:23:51.650
sort because you have a triple
of values, each of which

01:23:51.650 --> 01:23:52.990
is small.

01:23:52.990 --> 01:23:54.650
And so you can do
it in linear time.

01:23:54.650 --> 01:23:57.860
Because there's only
three digits to the thing

01:23:57.860 --> 01:23:59.000
you're sorting.
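A sketch of how the triples could be renamed by rank before recursing, using three stable bucket passes. It assumes each digit is an integer rank below `base`, with -1 as padding (shifted by 1 so it fits in bucket 0); `rank_triples` is a hypothetical name, and the input below is T0 followed by T1 for banana$ under that padding assumption.

```python
def rank_triples(triples, base):
    """Radix sort triples of small integers, then rename each triple by its rank."""
    order = list(range(len(triples)))
    for digit in (2, 1, 0):  # least significant digit first
        buckets = [[] for _ in range(base + 1)]
        for idx in order:
            buckets[triples[idx][digit] + 1].append(idx)  # +1 shifts the -1 padding
        order = [idx for bucket in buckets for idx in bucket]

    names = [0] * len(triples)
    rank, previous = -1, None
    for idx in order:
        if triples[idx] != previous:  # equal triples share one name
            rank += 1
            previous = triples[idx]
        names[idx] = rank
    return names


triples = [(2, 1, 3), (1, 3, 1), (0, -1, -1), (1, 3, 1), (3, 1, 0)]
print(rank_triples(triples, base=4))  # prints [2, 1, 0, 1, 3]
```

The output is again a string of small integers, which is why the recursive call never needs a general-purpose sort of its alphabet.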

01:23:59.000 --> 01:24:01.880
So overall, this is a
recursive algorithm.

01:24:01.880 --> 01:24:05.000
It gives you linear
time because you're

01:24:05.000 --> 01:24:07.910
making one recursive
call of 2/3 the size.

01:24:07.910 --> 01:24:11.840
Pretty clever and simple.

01:24:11.840 --> 01:24:14.330
And that's suffix trees
and how you build them.

01:24:14.330 --> 01:24:17.270
First you get suffix arrays,
you can do the same thing

01:24:17.270 --> 01:24:20.330
and get LCP information
at the same time,

01:24:20.330 --> 01:24:22.220
it's written in the notes.

01:24:22.220 --> 01:24:23.330
Then you get suffix trees.

01:24:23.330 --> 01:24:25.480
And then you're done.