WEBVTT

00:00:00.090 --> 00:00:02.490
The following content is
provided under a Creative

00:00:02.490 --> 00:00:04.030
Commons license.

00:00:04.030 --> 00:00:06.360
Your support will help
MIT OpenCourseWare

00:00:06.360 --> 00:00:10.720
continue to offer high quality
educational resources for free.

00:00:10.720 --> 00:00:13.320
To make a donation, or
view additional materials

00:00:13.320 --> 00:00:17.280
from hundreds of MIT courses,
visit MIT OpenCourseWare

00:00:17.280 --> 00:00:18.450
at ocw.mit.edu.

00:00:21.139 --> 00:00:22.680
ERIK DEMAINE: All
right, welcome back

00:00:22.680 --> 00:00:26.580
to Succinct Data
Structures, part two of two.

00:00:26.580 --> 00:00:29.610
Today we're going to take all
the stuff we know about tries

00:00:29.610 --> 00:00:32.910
and apply them to the main
motivating application, which

00:00:32.910 --> 00:00:35.190
is suffix trees.

00:00:35.190 --> 00:00:37.320
And as we know, suffix
trees and suffix arrays

00:00:37.320 --> 00:00:39.510
are more or less equivalent.

00:00:39.510 --> 00:00:43.524
If you build one,
you can build the other.

00:00:43.524 --> 00:00:44.940
But what we're
going to show today

00:00:44.940 --> 00:00:47.500
is they're equivalent also
from a space perspective.

00:00:47.500 --> 00:00:49.130
That will be the last topic.

00:00:49.130 --> 00:00:52.980
If you can succinctly
represent a suffix array,

00:00:52.980 --> 00:00:56.700
then you can transform-- with
a little o of n extra space,

00:00:56.700 --> 00:00:59.430
you can make a
suffix tree as well

00:00:59.430 --> 00:01:02.920
and do searches in roughly
the time we're used to,

00:01:02.920 --> 00:01:05.844
which is p plus size of output.

00:01:05.844 --> 00:01:07.260
It's not going to
be exactly that.

00:01:07.260 --> 00:01:09.660
We're going to lose like log
to the epsilons and such.

00:01:09.660 --> 00:01:12.630
But that's mostly caused--
this transformation only

00:01:12.630 --> 00:01:17.240
incurs like an additive log,
log, log, log, log, log n.

00:01:17.240 --> 00:01:19.800
You could have as
many logs as you want.

00:01:19.800 --> 00:01:23.700
Take any arbitrarily,
slowly growing function,

00:01:23.700 --> 00:01:24.940
that it will--

00:01:24.940 --> 00:01:27.610
your space bound gets
closer and closer to linear.

00:01:27.610 --> 00:01:29.580
Anyway, that's what
we'll get to at the end.

00:01:29.580 --> 00:01:32.360
The bulk of the lecture will
be on building suffix arrays.

00:01:32.360 --> 00:01:34.290
Here we're going to lose
a log to the epsilon

00:01:34.290 --> 00:01:36.810
time in the query.

00:01:36.810 --> 00:01:40.901
And we're going to start out
improving down to T log log T

00:01:40.901 --> 00:01:41.400
space.

00:01:41.400 --> 00:01:43.500
Our baseline
is T log T space.

00:01:43.500 --> 00:01:46.020
That's a normal-- if you
just stored a suffix array

00:01:46.020 --> 00:01:47.850
as a bunch of numbers.

00:01:47.850 --> 00:01:51.600
First we'll add another log,
then we'll get down to linear.

00:01:51.600 --> 00:01:56.230
That gives us a compact suffix
tree, or sorry, suffix array.

00:01:56.230 --> 00:01:59.490
It's also known how to
do succinct suffix arrays.

00:01:59.490 --> 00:02:03.270
But there are dozens of papers
on this topic, it's kind

00:02:03.270 --> 00:02:05.970
of a big field all to itself.

00:02:05.970 --> 00:02:09.210
And a lot of the techniques
are pretty complicated.

00:02:09.210 --> 00:02:13.140
So I'm going to try to keep it
to the bare minimum we can do,

00:02:13.140 --> 00:02:15.222
that will give us linear--

00:02:15.222 --> 00:02:17.670
linear number of bits of
space, giving us a compact data

00:02:17.670 --> 00:02:18.720
structure.

00:02:18.720 --> 00:02:21.390
But before we go to
those data structures,

00:02:21.390 --> 00:02:23.700
I want to give you a little
survey of what's known.

00:02:26.680 --> 00:02:34.589
So compact suffix arrays,
and trees, start out with--

00:02:34.589 --> 00:02:36.630
I'm going to start out
with the original results.

00:02:36.630 --> 00:02:39.088
And then I'll jump to sort of
the latest results, which are

00:02:39.088 --> 00:02:40.515
getting things to be succinct.

00:02:43.810 --> 00:02:48.660
So the first result on this
topic that got a compact suffix

00:02:48.660 --> 00:02:52.620
array, is by Grossi and Vitter.

00:02:52.620 --> 00:02:55.710
This was in 2000,
spring of 2000.

00:02:55.710 --> 00:02:58.530
And let me tell you the
bounds that they achieve.

00:02:58.530 --> 00:03:01.560
This is actually the solution
that we're going to look at.

00:03:26.340 --> 00:03:28.830
So this is the
first space bound.

00:03:31.670 --> 00:03:35.000
I guess the big term
here is T log sigma.

00:03:35.000 --> 00:03:38.030
That's how many bits it takes
just to write down the text.

00:03:38.030 --> 00:03:41.480
So this is what you might
call optimal, in this world.

00:03:41.480 --> 00:03:42.980
I mean, if you have
random text, you

00:03:42.980 --> 00:03:44.720
need that many bits
to write it down.
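
NOTE
Editorial sketch, not part of the lecture: a quick size comparison between
writing down the text (T log sigma bits) and storing a suffix array as plain
numbers (T log T bits). The function names and the sample values of T and
sigma are made up for illustration.
```python
# Hypothetical helpers: T is the text length, sigma is the alphabet size.
def text_bits(T: int, sigma: int) -> int:
    """Bits to write the text with a fixed-width code per character."""
    return T * (sigma - 1).bit_length()  # ceil(log2(sigma)) bits per character
#
def plain_suffix_array_bits(T: int) -> int:
    """Bits to store the suffix array as T plain indices into the text."""
    return T * (T - 1).bit_length()  # ceil(log2(T)) bits per index
#
T, sigma = 1 << 20, 4  # about a million characters over a 4-letter alphabet
print(text_bits(T, sigma))          # 2097152
print(plain_suffix_array_bits(T))   # 20971520
```
With these sample values the plain suffix array is ten times larger than
the text itself, which is the gap the compact structures try to close.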

00:03:44.720 --> 00:03:46.610
So there's 1 times that.

00:03:46.610 --> 00:03:49.040
We're also going to have this
1 over epsilon times that.

00:03:49.040 --> 00:03:50.750
And this is actually
the data structure.

00:03:50.750 --> 00:03:52.541
It's going to store
the text, and then it's

00:03:52.541 --> 00:03:55.730
going to add on a data structure
of 1 over epsilon times that.

00:03:55.730 --> 00:04:01.326
So it's order-- order [? ops. ?]
There's some lower-order terms.

00:04:01.326 --> 00:04:03.200
We won't actually have
this lower-order term,

00:04:03.200 --> 00:04:06.860
because I'm going to focus
on binary alphabet here.

00:04:06.860 --> 00:04:07.887
Keep it simple.

00:04:07.887 --> 00:04:09.470
But if you have a
non-binary alphabet,

00:04:09.470 --> 00:04:12.410
they have another order
T bits, and so on.

00:04:12.410 --> 00:04:14.210
But you get to
control this constant.

00:04:14.210 --> 00:04:17.060
This will work for any
epsilon between 0 and 1.

00:04:19.850 --> 00:04:22.910
And why are you interested
in a small epsilon?

00:04:22.910 --> 00:04:27.060
Because if epsilon is small,
this space bound goes up.

00:04:27.060 --> 00:04:29.295
Well, that happens
in the query bound.

00:04:48.380 --> 00:04:51.590
So in the query bound, there's
this multiplicative log

00:04:51.590 --> 00:04:55.130
to the epsilon of T. So if you
really want queries to go fast,

00:04:55.130 --> 00:04:57.290
you don't want to pay
a big polylog here,

00:04:57.290 --> 00:04:59.490
then you're going to have
to pay for it in space.

00:04:59.490 --> 00:05:00.781
So those are the same epsilons.
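
NOTE
Editorial sketch, not part of the lecture: a small table of the trade-off
just described. It assumes the lead space constant is 1 + 1/epsilon and the
query slowdown is log^epsilon T, as read off the spoken description, with
lower-order terms ignored. The function names are made up for illustration.
```python
import math
# Assumed lead constant in front of T log sigma: one copy of the text
# plus a structure of size 1/eps times that.
def space_constant(eps: float) -> float:
    return 1.0 + 1.0 / eps
#
# Assumed multiplicative slowdown in the query: (log2 T) ** eps.
def query_slowdown(T: int, eps: float) -> float:
    return math.log2(T) ** eps
#
T = 1 << 30
for eps in (1.0, 0.5, 0.25):
    print(eps, space_constant(eps), round(query_slowdown(T, eps), 2))
```
Smaller epsilon means a smaller slowdown per query but a larger constant
in the space bound.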

00:05:05.150 --> 00:05:08.690
In the Grossi-Vitter paper,
they only multiply this

00:05:08.690 --> 00:05:09.920
by the size of the output.

00:05:09.920 --> 00:05:11.597
So if you want to
just output one guy,

00:05:11.597 --> 00:05:13.430
you only pay an additive
log to the epsilon.

00:05:13.430 --> 00:05:14.900
If you want to output
all the matches

00:05:14.900 --> 00:05:16.775
you have to pay a number
of matches times log

00:05:16.775 --> 00:05:17.690
to the epsilon.

00:05:17.690 --> 00:05:20.330
They achieve the P bound.

00:05:20.330 --> 00:05:23.050
In fact, they do a little bit
better than order P query.

00:05:23.050 --> 00:05:26.150
On a RAM, you can hope to do--

00:05:26.150 --> 00:05:33.170
save a log factor by reading log
base sigma of T, of the letters

00:05:33.170 --> 00:05:34.950
in one word operation.

00:05:34.950 --> 00:05:36.900
So I'm not going to go
into how to do this--

00:05:36.900 --> 00:05:41.257
I'm going to cover this paper
today, or a simplification

00:05:41.257 --> 00:05:41.840
of this paper.

00:05:41.840 --> 00:05:43.560
You might say, throw away.

00:05:43.560 --> 00:05:46.250
I'm going to get slightly
worse bounds than this.

00:05:46.250 --> 00:05:51.170
Space bound will be the
same, but I'm not going to--

00:05:51.170 --> 00:05:53.300
I'm not going to worry
about this log factor.

00:05:53.300 --> 00:05:55.490
And in fact, both
P and output are

00:05:55.490 --> 00:05:59.540
going to be multiplied
by log to the epsilon.

00:05:59.540 --> 00:06:01.700
So I won't achieve quite
the best query bound,

00:06:01.700 --> 00:06:03.590
but same space bound,
just to give you

00:06:03.590 --> 00:06:06.800
an idea of how it works.

00:06:06.800 --> 00:06:12.170
The next result-- yeah,
I'll go to another board.

00:06:12.170 --> 00:06:14.480
These bounds are a
bit big, as you see.

00:06:17.090 --> 00:06:19.860
The next result, which was
done later in the same year.

00:06:19.860 --> 00:06:21.830
So these are
probably discovered,

00:06:21.830 --> 00:06:25.690
basically at the same time.

00:06:25.690 --> 00:06:29.840
Because writing a paper
takes probably a year or so.

00:06:29.840 --> 00:06:31.550
So they were being
done in parallel,

00:06:31.550 --> 00:06:34.260
and then this was published
in the spring of 2000.

00:06:34.260 --> 00:06:36.899
This was published
in the fall of 2000.

00:06:36.899 --> 00:06:37.940
It's called the FM-index.

00:06:40.640 --> 00:06:43.754
And it achieves
this bound, which

00:06:43.754 --> 00:06:45.545
is going to take a
little while to explain.

00:07:09.270 --> 00:07:11.450
OK.

00:07:11.450 --> 00:07:16.160
Think of this right now,
as this is T log sigma.

00:07:16.160 --> 00:07:20.254
Ignore this H. This
is entropy stuff.

00:07:20.254 --> 00:07:21.920
But if you think of
this as T log sigma,

00:07:21.920 --> 00:07:24.470
we're getting 5
times T log sigma,

00:07:24.470 --> 00:07:27.800
plus some lower-order term.

00:07:27.800 --> 00:07:30.046
So it's a little less
flexible over here.

00:07:30.046 --> 00:07:31.670
We kind of got to
control the constant.

00:07:31.670 --> 00:07:35.600
Anything greater or equal
to 2 would be all right.

00:07:35.600 --> 00:07:37.580
Over here, it's
always at least 5.

00:07:37.580 --> 00:07:39.170
This has since been improved.

00:07:39.170 --> 00:07:41.750
I'm just telling
you the historical--

00:07:41.750 --> 00:07:45.410
these days people can get
down to at least 4 or so.

00:07:45.410 --> 00:07:46.470
Actually, get down to 1.

00:07:46.470 --> 00:07:49.730
We'll talk about it in a moment.

00:07:49.730 --> 00:07:51.380
Before I get to
the Hk part, I want

00:07:51.380 --> 00:07:53.180
to talk about the
lower-order term.

00:07:53.180 --> 00:07:55.340
There's some scary
parts like this.

00:07:55.340 --> 00:07:58.430
If sigma is at all
large, this is big trouble.

00:07:58.430 --> 00:08:00.320
Or even sigma log n--

00:08:00.320 --> 00:08:03.110
this is superpolynomial.

00:08:03.110 --> 00:08:05.960
So this cannot handle
very large sigma,

00:08:05.960 --> 00:08:07.580
whereas this solution can.

00:08:07.580 --> 00:08:11.930
And other structures can,
but this is an early result.

00:08:11.930 --> 00:08:14.990
This also gets bad when
sigma's very large.

00:08:14.990 --> 00:08:17.540
Even bigger-- when sigma's
bigger than log log T,

00:08:17.540 --> 00:08:20.660
then this starts to dominate.

00:08:20.660 --> 00:08:23.779
OK, but for sigma small, think
binary alphabets, whatever.

00:08:23.779 --> 00:08:25.820
This is good, and in many
ways is actually better

00:08:25.820 --> 00:08:26.750
than T log sigma.

00:08:26.750 --> 00:08:31.750
So let me tell you about
this Hk of T thing.

00:08:31.750 --> 00:08:34.985
This is what's called k-th
order empirical entropy.

00:08:45.770 --> 00:08:49.645
Maybe I should start with an
aside of 0-th order entropy,

00:08:49.645 --> 00:08:51.020
because we haven't
talked about--

00:08:51.020 --> 00:08:53.450
I guess we talked about entropy
in the context of binary search

00:08:53.450 --> 00:08:53.949
trees.

00:08:53.949 --> 00:08:55.820
We said, oh, if you've got--

00:08:55.820 --> 00:08:59.300
if you access item i
with probability P i,

00:08:59.300 --> 00:09:07.450
then there's this entropy bound,
which is sum of P log 1/P.

00:09:07.450 --> 00:09:11.630
So I don't know, let's
call this character x.

00:09:11.630 --> 00:09:23.420
So if you-- let's see,
you have H0 of a string s.

00:09:23.420 --> 00:09:30.030
You sum over all
characters in the alphabet,

00:09:30.030 --> 00:09:33.380
of the probability-- this
is not really a probability.

00:09:33.380 --> 00:09:40.490
This is going to be the
number of x's in s, divided

00:09:40.490 --> 00:09:41.900
by the length of s.

00:09:41.900 --> 00:09:43.810
This is what's called
empirical probability.

00:09:43.810 --> 00:09:46.167
It's what you observe
from this string.

00:09:46.167 --> 00:09:48.000
There's this many
occurrences in the string.

00:09:48.000 --> 00:09:49.220
You divide by the
length of the string.

00:09:49.220 --> 00:09:50.660
That's kind of
like a probability.

00:09:50.660 --> 00:09:52.390
It's scaled to be
like a probability.

00:09:52.390 --> 00:09:53.760
It's between 0 and 1.

00:09:53.760 --> 00:09:57.410
And if you take sum of P log
1/P, that gives you a bound.
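
NOTE
Editorial sketch, not part of the lecture: the 0-th order empirical entropy
as just defined, with the empirical probability of x taken as
(number of x's in s) / (length of s). The function name h0 is made up
for illustration.
```python
import math
from collections import Counter
#
def h0(s: str) -> float:
    """0-th order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    # Sum of p * log2(1/p) over the characters that occur, with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())
#
print(h0("aaaa"))  # 0.0, one symbol carries no information
print(h0("abab"))  # 1.0, two equally frequent symbols need one bit each
print(h0("aaab"))  # between 0 and 1, the skew makes it compressible
```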

00:09:57.410 --> 00:10:01.950
And this is the bound achieved
by say, Huffman coding,

00:10:01.950 --> 00:10:04.542
or the optimal code.

00:10:04.542 --> 00:10:06.500
If all you're allowed to
do is give a code word

00:10:06.500 --> 00:10:08.660
for each letter of the
alphabet, and then you

00:10:08.660 --> 00:10:10.820
write down a binary code
word for each letter

00:10:10.820 --> 00:10:11.750
of the alphabet.

00:10:11.750 --> 00:10:14.870
And you write that down for each
letter in s, then you achieve--

00:10:14.870 --> 00:10:17.480
I guess Huffman codes
achieve ceiling of this.

00:10:17.480 --> 00:10:19.610
If you want to achieve
exactly that bound,

00:10:19.610 --> 00:10:23.660
you can use arithmetic
coding, but we're not

00:10:23.660 --> 00:10:25.850
going to get into
those kinds of details.
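
NOTE
Editorial sketch, not part of the lecture: comparing the total length of a
Huffman encoding against |s| times H0(s). A Huffman code is known to land
within one bit per character of H0. The function names and the sample
string are made up for illustration.
```python
import heapq
import math
from collections import Counter
#
def h0(s: str) -> float:
    """0-th order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())
#
def huffman_total_bits(s: str) -> int:
    """Total encoded length of s under a Huffman code built from its own counts."""
    counts = Counter(s)
    if len(counts) == 1:
        return len(s)  # a lone symbol still gets a one-bit codeword
    heap = list(counts.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # this merge adds one bit to every character beneath it
        heapq.heappush(heap, a + b)
    return total
#
s = "abracadabra"
print(huffman_total_bits(s))  # 23
print(len(s) * h0(s))         # a bit under 23, the entropy bound
```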

00:10:25.850 --> 00:10:30.740
So if you used what's called a
0-th order code, where you just

00:10:30.740 --> 00:10:34.220
have a code for each
character of the alphabet,

00:10:34.220 --> 00:10:37.400
then the space bound you
would achieve is H0 of s,

00:10:37.400 --> 00:10:41.460
times the number
of characters in s.

00:10:41.460 --> 00:10:45.800
So that would be if you
substituted k equals 0 here.

00:10:45.800 --> 00:10:46.800
So that's kind of neat.

00:10:46.800 --> 00:10:50.030
This is a compressed
representation of the string.

00:10:50.030 --> 00:10:53.150
Over here, we just
wrote down the string.

00:10:53.150 --> 00:10:54.950
And if the string
is incompressible,

00:10:54.950 --> 00:10:56.954
yeah, T log sigma is optimal.

00:10:56.954 --> 00:10:59.120
But if the string is
compressible, like many strings

00:10:59.120 --> 00:11:01.328
we want to store-- you're
storing English, whatever--

00:11:01.328 --> 00:11:04.610
you should save somewhere
between a factor of 2 and 10.

00:11:04.610 --> 00:11:06.110
This will try to save it.

00:11:06.110 --> 00:11:09.530
Of course, factor
between 2 and 10 is not--

00:11:09.530 --> 00:11:12.020
is a little scary, when
there's this factor 5 out here.

00:11:12.020 --> 00:11:14.420
That might dominate
whatever savings you get.

00:11:14.420 --> 00:11:17.420
But in theory, this
could be a lot better.

00:11:17.420 --> 00:11:20.090
And this is just the first
result in this series.

00:11:20.090 --> 00:11:23.240
Now we can get 1 times
Hk of T, and then it's

00:11:23.240 --> 00:11:27.960
a lot more interesting.

00:11:27.960 --> 00:11:30.020
OK, so that was
0-th order entropy.

00:11:30.020 --> 00:11:32.210
What's this k-th order
entropy business?

00:11:32.210 --> 00:11:34.790
Essentially, it's about taking--
instead of writing a code

00:11:34.790 --> 00:11:37.520
word for a single letter,
you can write a code word

00:11:37.520 --> 00:11:43.670
for a letter that depends on
the previous k characters.

00:11:43.670 --> 00:11:47.170
So I'm going to write
down a definition.

00:11:47.170 --> 00:11:54.970
Hk of T is going to be the sum
over all words of length k.

00:11:54.970 --> 00:11:58.970
This is going to be our
context of the probability,

00:11:58.970 --> 00:12:09.740
or empirical probability of w
occurring times the 0-th order

00:12:09.740 --> 00:12:24.005
entropy of the string of
successor characters of w.

00:12:26.510 --> 00:12:29.660
So again, the empirical
probability of w occurring

00:12:29.660 --> 00:12:37.730
is the number of occurrences
of w, divided by T, basically.

00:12:37.730 --> 00:12:44.570
So the idea is, now you get to
encode a character depending

00:12:44.570 --> 00:12:47.180
on the context of the
last k characters.

00:12:47.180 --> 00:12:50.150
So we're summing over
all possible contexts

00:12:50.150 --> 00:12:54.260
of k characters, and we're
taking the expectation

00:12:54.260 --> 00:12:58.040
over all possible context w.

00:12:58.040 --> 00:13:01.070
That's the sum of the
probabilities times something.

00:13:01.070 --> 00:13:05.540
And then condition on w
being the context, the last k

00:13:05.540 --> 00:13:06.440
characters.

00:13:06.440 --> 00:13:10.790
We want to measure what
characters follow that.

00:13:10.790 --> 00:13:13.130
And there, we can use
a 0-th order encoding.

00:13:13.130 --> 00:13:15.140
I mean, we've
already conditioned

00:13:15.140 --> 00:13:17.940
on w being right there.

00:13:17.940 --> 00:13:21.170
So for all occurrences of w,
you look at the next character

00:13:21.170 --> 00:13:23.660
right after it, and you
take 0-th order entropy

00:13:23.660 --> 00:13:26.600
of that, that's called
k-th order entropy.
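
NOTE
Editorial sketch, not part of the lecture: the k-th order empirical entropy
as just described. For each context w of length k, collect the characters
that follow w, take their 0-th order entropy, and weight it by how often w
occurs. One assumption: the weight used here is (number of occurrences of w
that have a following character) / |T|. The names h0 and hk are made up
for illustration.
```python
import math
from collections import Counter, defaultdict
from typing import Sequence
#
def h0(s: Sequence[str]) -> float:
    """0-th order empirical entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())
#
def hk(t: str, k: int) -> float:
    """k-th order empirical entropy of t, in bits per character."""
    n = len(t)
    if n == 0:
        return 0.0
    followers = defaultdict(list)
    for i in range(k, n):
        followers[t[i - k:i]].append(t[i])  # t[i] follows the context t[i-k:i]
    return sum((len(f) / n) * h0(f) for f in followers.values())
#
t = "abababab"
print(hk(t, 0))  # 1.0, without context a and b look like coin flips
print(hk(t, 1))  # 0.0, one character of context predicts the next exactly
```
With k = 0 the only context is the empty string, so hk(t, 0) equals h0(t).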

00:13:26.600 --> 00:13:31.230
OK, you have to think
about it for a while, too.

00:13:31.230 --> 00:13:34.520
But this essentially
means the best,

00:13:34.520 --> 00:13:37.400
you can prove this is the
best encoding you can do,

00:13:37.400 --> 00:13:40.520
if the codeword of a letter
can depend on the previous k

00:13:40.520 --> 00:13:41.540
characters.

00:13:41.540 --> 00:13:44.000
Of course, if you have such a
code it's easy to decompress,

00:13:44.000 --> 00:13:46.620
because as you're decompressing,
you know what the previous k

00:13:46.620 --> 00:13:48.780
characters were.

00:13:48.780 --> 00:13:51.980
OK, interesting thing about this
index or this data structure,

00:13:51.980 --> 00:13:54.620
is it's independent of k.

00:13:54.620 --> 00:13:57.080
The data structure
doesn't know what k is.

00:13:57.080 --> 00:14:00.200
This works for all k.

00:14:00.200 --> 00:14:05.032
For any fixed k-- k has
to be constant here.

00:14:05.032 --> 00:14:07.490
There are other data structures
like [? KB, ?] logarithmic,

00:14:07.490 --> 00:14:08.870
or so.

00:14:08.870 --> 00:14:11.020
But here, we'll think
of k as a constant.

00:14:11.020 --> 00:14:14.150
And so this is really a neat
thing about compression.

00:14:14.150 --> 00:14:17.750
There's a technique called
the Burrows-Wheeler transform.

00:14:17.750 --> 00:14:20.431
And Lempel-Ziv does
similar things.

00:14:20.431 --> 00:14:22.430
You may have heard of
those compression schemes.

00:14:22.430 --> 00:14:24.430
They're used in bzip,
and things like-- bzip

00:14:24.430 --> 00:14:28.190
is named after
Burrows-Wheeler, I believe.

00:14:28.190 --> 00:14:32.150
And those compression
schemes achieve

00:14:32.150 --> 00:14:35.330
Hk of T bits per character--

00:14:35.330 --> 00:14:37.220
so Hk of T times T--

00:14:37.220 --> 00:14:39.720
for all k.
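
NOTE
Editorial sketch, not part of the lecture: a naive Burrows-Wheeler
transform, built by sorting all rotations of the text. This is only meant
to show what the transform computes; real compressors build it far more
efficiently. The "$" end marker and the function name bwt are assumptions
made for illustration.
```python
def bwt(t: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (quadratic space)."""
    if sentinel in t:
        raise ValueError("sentinel must not occur in the text")
    s = t + sentinel  # the sentinel sorts before the letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)  # last column of the sorted rotations
#
print(bwt("banana"))  # annb$aa
```
Characters that share the same following context end up next to each
other in the output, which is why a simple coder can then exploit
higher-order structure.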

00:14:39.720 --> 00:14:42.410
So if your text is
really good, given

00:14:42.410 --> 00:14:45.320
the context of the last five
letters, or three letters.

00:14:45.320 --> 00:14:49.110
In some sense, the compression
scheme adapts to that.

00:14:49.110 --> 00:14:53.360
So this is what we call a
self index, in that this also

00:14:53.360 --> 00:14:54.260
stores the string.

00:14:54.260 --> 00:14:56.690
You can read the
data of the string.

00:14:56.690 --> 00:14:59.360
And so whereas
over here, we just

00:14:59.360 --> 00:15:00.920
stored the string uncompressed.

00:15:00.920 --> 00:15:02.930
Here we're effectively
storing the string

00:15:02.930 --> 00:15:05.450
in a compressed form,
and the data structure

00:15:05.450 --> 00:15:06.920
is similarly compressed.

00:15:06.920 --> 00:15:08.990
So if your string is
compressible by more

00:15:08.990 --> 00:15:13.600
than a factor of 5, this
will be really good.

00:15:13.600 --> 00:15:17.390
And that's the FM-index bound.

00:15:17.390 --> 00:15:21.530
Now that you have that Hk
stuff, it's a lot easier

00:15:21.530 --> 00:15:25.010
to state all other results.

00:15:25.010 --> 00:15:28.370
So we have-- oh, I didn't
give a query bound.

00:15:28.370 --> 00:15:30.410
That was the space.

00:15:30.410 --> 00:15:46.070
Query is P plus size of output
times log to the epsilon T.

00:15:46.070 --> 00:15:49.520
So, similar to this
one, but we don't

00:15:49.520 --> 00:15:54.470
have this trick over here.

00:15:54.470 --> 00:15:56.880
Another early result
is by Sadakane.

00:16:00.205 --> 00:16:10.510
I think also, maybe 2001, I have
the journal reference as 2003.

00:16:10.510 --> 00:16:13.737
This is in some ways
better, some ways worse,

00:16:13.737 --> 00:16:15.695
it's kind of incomparable
to the other results.

00:16:32.380 --> 00:16:40.435
This is bits, and then the
query has an extra log factor.

00:16:48.840 --> 00:16:50.439
This is again,
another early result

00:16:50.439 --> 00:16:51.480
that I want to highlight.

00:16:51.480 --> 00:16:53.520
Now I'm going to start
skipping results.

00:16:53.520 --> 00:16:55.320
The main innovation
here, is that it

00:16:55.320 --> 00:16:57.690
works well for large alphabets.

00:16:57.690 --> 00:17:00.030
This is a very small
dependence on sigma,

00:17:00.030 --> 00:17:02.841
whereas-- as I mentioned,
this structure really

00:17:02.841 --> 00:17:04.424
doesn't work well
for large alphabets.

00:17:04.424 --> 00:17:06.636
Here we're getting-- not
getting k-th order entropy,

00:17:06.636 --> 00:17:08.010
we're getting 0-th
order entropy.

00:17:08.010 --> 00:17:12.270
It's a somewhat weaker result.
The dependence on epsilon

00:17:12.270 --> 00:17:13.349
is more like this one.

00:17:16.680 --> 00:17:19.079
But if you just want
a log factor here,

00:17:19.079 --> 00:17:21.690
then this is a 1 plus
epsilon times H0.

00:17:21.690 --> 00:17:23.490
So in that sense,
we're doing better--

00:17:23.490 --> 00:17:26.880
only a 1 plus epsilon,
which is better.

00:17:26.880 --> 00:17:30.300
This thing was
always at least 2.

00:17:30.300 --> 00:17:33.000
This thing was always at least
5, the lead constant.

00:17:33.000 --> 00:17:34.980
Here the lead constant
can be 1 plus epsilon.

00:17:34.980 --> 00:17:38.010
This is almost
succinct, but not quite.

00:17:38.010 --> 00:17:40.050
It doesn't quite
compress as well--

00:17:40.050 --> 00:17:41.520
it only uses 0-th
order entropy--

00:17:41.520 --> 00:17:43.140
but that's still not bad.

00:17:43.140 --> 00:17:44.610
And then the other
big innovation

00:17:44.610 --> 00:17:47.670
is that the dependence
on sigma is small.

00:17:47.670 --> 00:17:49.800
The query is a little bit worse.

00:17:53.050 --> 00:17:55.710
OK, now fast forward
a little bit.

00:17:58.620 --> 00:18:00.750
I want to talk about
succinct data structures

00:18:00.750 --> 00:18:04.785
for suffix-tree-like queries.

00:18:07.350 --> 00:18:10.050
So there's two succinct
data structures out there,

00:18:10.050 --> 00:18:13.620
with more or less the same
authors as the first two

00:18:13.620 --> 00:18:15.480
results I talked about.

00:18:15.480 --> 00:18:19.050
So Grossi and Vitter,
together with Gupta,

00:18:19.050 --> 00:18:27.510
can get Hk of T times
T, which is optimal even

00:18:27.510 --> 00:18:33.390
with compression, with
k-th order compression.

00:18:33.390 --> 00:18:34.830
And a good dependence on sigma.

00:18:38.440 --> 00:18:39.170
Yeah, I guess--

00:18:39.170 --> 00:18:42.330
T log sigma is the
uncompressed bound.

00:18:42.330 --> 00:18:43.607
So you have to worry about--

00:18:43.607 --> 00:18:45.190
when you're talking
about compression,

00:18:45.190 --> 00:18:47.070
so here we have
the optimal bound

00:18:47.070 --> 00:18:50.670
using k-th order entropy
with a lead constant of 1,

00:18:50.670 --> 00:18:51.690
so that's great.

00:18:51.690 --> 00:18:53.220
That's what makes it succinct.

00:18:53.220 --> 00:18:54.840
As long as this is
little o of that.

00:18:54.840 --> 00:18:58.170
This is going to be a little
o of that, as long as Hk of T

00:18:58.170 --> 00:19:00.940
is not too small.

00:19:00.940 --> 00:19:06.420
If it's like 1 over log T, then
actually this term dominates.

00:19:06.420 --> 00:19:11.080
But as long as it's bigger
than log log T over log T,

00:19:11.080 --> 00:19:13.810
this thing, then you're fine.

00:19:13.810 --> 00:19:16.420
Just as long as you're not
compressing a huge amount,

00:19:16.420 --> 00:19:18.541
then this will be lower-order.

00:19:21.940 --> 00:19:23.010
Sorry, query time.

00:19:23.010 --> 00:19:25.860
Query's a little
bit worse, though.

00:19:25.860 --> 00:19:30.240
We have a log term with
a P, only a log sigma,

00:19:30.240 --> 00:19:35.050
but then we also have this
log squared over log log.

00:19:35.050 --> 00:19:41.280
Times log sigma,
and here I haven't--

00:19:41.280 --> 00:19:44.180
there isn't a clear dependence
on the size of the output.

00:19:44.180 --> 00:19:46.140
So this is-- let's say
size of output is 1.

00:19:46.140 --> 00:19:49.265
You just want to find one match.

00:19:49.265 --> 00:19:51.690
I won't write this dependence
on the size of the output.

00:19:51.690 --> 00:19:54.065
My guess is this is multiplied
by the size of the output,

00:19:54.065 --> 00:19:56.440
but it's not stated
explicitly in the paper,

00:19:56.440 --> 00:19:58.180
so I want to be careful.

00:19:58.180 --> 00:20:02.460
So we have a polylog
additive slowdown here.

00:20:02.460 --> 00:20:04.670
So it's a little
bit worse in time,

00:20:04.670 --> 00:20:06.420
but this space is
obviously, a lot better.

00:20:06.420 --> 00:20:12.840
We've improved our constant
factor from 5, over here, to 1.

00:20:12.840 --> 00:20:16.890
OK, and then there's
one more paper

00:20:16.890 --> 00:20:29.075
I want to mention, by Ferragina,
Manzini, Makinen, and Navarro.

00:20:32.160 --> 00:20:38.880
This is from just five
years ago now, 2007.

00:20:38.880 --> 00:20:47.890
They also achieved 1 times Hk
of T times T as the lead term.

00:20:47.890 --> 00:20:53.160
And they get T divided by log
to the epsilon n, so this is--

00:20:56.040 --> 00:20:59.370
yes, it's slight, there's
probably a log sigma here, too.

00:20:59.370 --> 00:21:03.190
I'm not sure, it might just be
T. Probably just T, actually.

00:21:03.190 --> 00:21:07.210
So we get rid of the log sigma,
but this log log over log

00:21:07.210 --> 00:21:08.140
gets slightly smaller.

00:21:08.140 --> 00:21:12.040
It's only a log to
the epsilon now.

00:21:12.040 --> 00:21:15.550
But the query bound is
a little bit better.

00:21:15.550 --> 00:21:18.120
So the P plus--

00:21:18.120 --> 00:21:27.380
size of the output times log to
the 1 plus epsilon T query.

00:21:30.490 --> 00:21:33.010
So instead of basically log
squared, we have log to 1

00:21:33.010 --> 00:21:35.290
plus epsilon, slightly better.

00:21:35.290 --> 00:21:39.350
They also have an
order P counting query.

00:21:39.350 --> 00:21:41.710
So if you just want to know
how many matches are there,

00:21:41.710 --> 00:21:47.530
they can do that really fast in
kind of regular time order P.

00:21:47.530 --> 00:21:49.039
And this is
obviously very small.

00:21:49.039 --> 00:21:50.830
So this is probably
the best result so far,

00:21:50.830 --> 00:21:57.040
still obviously, lots of
open problems in this world.

00:21:57.040 --> 00:21:58.990
Still an active
area of research.

00:21:58.990 --> 00:22:01.769
There are papers
since these, but they

00:22:01.769 --> 00:22:03.310
don't achieve-- the
space bounds they

00:22:03.310 --> 00:22:04.390
achieve are not quite as good.

00:22:04.390 --> 00:22:06.250
There may be like 2
times Hk, and then they

00:22:06.250 --> 00:22:07.770
can get better query bounds.

00:22:07.770 --> 00:22:09.620
A lot of papers that
I'm not talking about,

00:22:09.620 --> 00:22:12.190
there's just a few too many.

00:22:12.190 --> 00:22:15.550
But if you just care about
space, this is the best so far.

00:22:15.550 --> 00:22:21.612
Or one of these two, depending
on exactly how big sigma is.

00:22:21.612 --> 00:22:24.070
Just to mention, there's some
other cool things you can do.

00:22:24.070 --> 00:22:28.240
So these are small space
static data structures.

00:22:28.240 --> 00:22:30.250
Some of them can
be made dynamic.

00:22:30.250 --> 00:22:32.440
But in particular,
there's work on,

00:22:32.440 --> 00:22:36.100
how do you actually build these
data structures with low space?

00:22:36.100 --> 00:22:39.070
Because you don't really want
to build a huge suffix tree

00:22:39.070 --> 00:22:40.180
and then compress it.

00:22:40.180 --> 00:22:42.138
Because the whole point
is you have a hard time

00:22:42.138 --> 00:22:43.550
storing this data structure.

00:22:43.550 --> 00:22:47.080
So in fact, there's
some papers--

00:22:47.080 --> 00:22:49.480
I think more along the lines
of these original results--

00:22:49.480 --> 00:22:52.630
the Grossi-Vitter, Ferragina,
Manzini, and Sadakane--

00:22:52.630 --> 00:22:54.820
building those data structures.

00:22:54.820 --> 00:22:57.250
And while you're building
the amount of working space

00:22:57.250 --> 00:23:01.540
is at least proportional to
the size of the final data

00:23:01.540 --> 00:23:02.170
structure.

00:23:02.170 --> 00:23:04.510
So that can be done.

00:23:04.510 --> 00:23:06.380
We're not going to
go into it here.

00:23:06.380 --> 00:23:08.537
There are other papers about--

00:23:08.537 --> 00:23:10.120
all of these papers
are focused on how

00:23:10.120 --> 00:23:12.130
do I do a search, how do
I search for a pattern,

00:23:12.130 --> 00:23:13.449
find all the matches.

00:23:13.449 --> 00:23:15.490
There's other things you
can do with suffix trees

00:23:15.490 --> 00:23:19.180
like, given two suffixes,
you can find the longest

00:23:19.180 --> 00:23:21.100
common prefix of them.

00:23:21.100 --> 00:23:23.200
So there's papers on how
to do that kind of stuff

00:23:23.200 --> 00:23:26.200
in the compressed regime.

00:23:26.200 --> 00:23:28.990
There's papers on-- or there is
a paper on how to do document

00:23:28.990 --> 00:23:32.290
retrieval, which is a problem
we looked at two lectures ago,

00:23:32.290 --> 00:23:33.670
in the string lecture.

00:23:33.670 --> 00:23:35.350
You want to find--
not all the matches,

00:23:35.350 --> 00:23:36.974
you want to find all
the documents that

00:23:36.974 --> 00:23:40.360
have this substring in them.

00:23:40.360 --> 00:23:42.400
So that can be--
that reduces the size

00:23:42.400 --> 00:23:46.150
of the output in these bounds.

00:23:46.150 --> 00:23:49.240
That can also be done, Sadakane
wrote a paper about that.

00:23:49.240 --> 00:23:50.316
Some work on dynamic--

00:23:50.316 --> 00:23:52.690
there's actually a lot of work
in implementing these data

00:23:52.690 --> 00:23:57.360
structures, definitely
FM-index, and I believe,

00:23:57.360 --> 00:23:58.850
maybe the Sadakane one.

00:23:58.850 --> 00:24:00.910
And maybe this--
versions of this one.

00:24:00.910 --> 00:24:03.160
I don't think the succinct
ones have been implemented,

00:24:03.160 --> 00:24:04.760
although I don't know for sure.

00:24:04.760 --> 00:24:06.718
But there's a lot of work
in implementing this,

00:24:06.718 --> 00:24:08.710
because people care,
and indeed they're

00:24:08.710 --> 00:24:11.760
small and reasonably fast.

00:24:11.760 --> 00:24:14.170
So if you need a
text index, there's

00:24:14.170 --> 00:24:18.490
freely available implementations
of at least some of these.

00:24:18.490 --> 00:24:21.670
So this is one of--

00:24:21.670 --> 00:24:25.350
I mean this is
practical stuff, too.

00:24:25.350 --> 00:24:26.020
Cool.

00:24:26.020 --> 00:24:29.590
But as I said, I'm going to
focus on the simplest I know,

00:24:29.590 --> 00:24:32.934
which is Grossi and Vitter.

00:24:32.934 --> 00:24:34.600
If you look at the
paper, there are sort

00:24:34.600 --> 00:24:36.340
of successive improvements.

00:24:36.340 --> 00:24:39.280
And we're going to
cover up to the point

00:24:39.280 --> 00:24:41.290
where we get a good space
bound, and the query

00:24:41.290 --> 00:24:44.310
won't be quite as good.

00:24:44.310 --> 00:24:47.710
So that's going to be
the bulk of the lecture.

00:24:47.710 --> 00:24:49.200
It's how to get
that space bound.

00:24:52.230 --> 00:24:54.120
And as I mentioned,
we're going to start out

00:24:54.120 --> 00:24:56.910
with a weaker bound, which
is getting T log log T bits,

00:24:56.910 --> 00:25:01.050
and then we'll see how
to improve that to T.

00:25:01.050 --> 00:25:04.440
And then we'll see how to
improve it to 1 over epsilon

00:25:04.440 --> 00:25:05.580
times T.

00:25:05.580 --> 00:25:07.759
So it will be a series
of improvements.

00:25:11.799 --> 00:25:13.590
And we're going to
start just with thinking

00:25:13.590 --> 00:25:16.390
about suffix arrays.

00:25:16.390 --> 00:25:19.110
So what is the compressed
suffix array problem?

00:25:19.110 --> 00:25:22.500
Well, it's just that I have--

00:25:22.500 --> 00:25:27.830
I want to be able to do
queries of the form SA of k.

00:25:27.830 --> 00:25:29.580
If I imagine the
suffixes in sorted order,

00:25:29.580 --> 00:25:30.930
what is the k-th suffix?

00:25:30.930 --> 00:25:32.520
Where does it begin?

00:25:32.520 --> 00:25:34.940
So I want to be able to
represent that array.

00:25:34.940 --> 00:25:36.750
And using that, you
could do searches,

00:25:36.750 --> 00:25:40.590
and later we'll see how to use
that to make a suffix tree.

00:25:40.590 --> 00:25:44.730
But for now, that's just our
goal, is to compute SA of k.
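
NOTE
A minimal sketch of the query being described, assuming 0-indexed positions and a plain sorted list standing in for the compressed structure. The name suffix_array and the example text are illustrative, not from the lecture.
```python
def suffix_array(text):
    # Sort the starting positions 0..n-1 by the suffix that begins there.
    # This is the plain, uncompressed array that the lecture wants to shrink.
    return sorted(range(len(text)), key=lambda i: text[i:])
text = "banana$"
sa = suffix_array(text)
# sa[k] answers "where does the k-th suffix in sorted order begin?"
for k, start in enumerate(sa):
    print(k, start, text[start:])
```
Sorting whole suffixes like this is only meant to pin down the definition; it is not how you would build the array for a large text.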

00:25:44.730 --> 00:25:47.400
OK, well, the idea is actually
going to be very familiar.

00:25:47.400 --> 00:25:49.920
We saw it two lectures ago,
when we did this divide

00:25:49.920 --> 00:25:52.200
and conquer for
building a suffix array.

00:25:52.200 --> 00:25:55.770
We did this-- we divided
the letters in our string

00:25:55.770 --> 00:25:58.170
by 0, 1, and 2, mod 3.

00:25:58.170 --> 00:26:00.050
We won't need mod 3.

00:26:00.050 --> 00:26:01.555
We'll just do mod 2 here.

00:26:01.555 --> 00:26:05.760
It won't actually matter
what constant we use.

00:26:05.760 --> 00:26:07.830
But we're going to
follow that recursion

00:26:07.830 --> 00:26:09.930
and use it to represent
the suffix array,

00:26:09.930 --> 00:26:12.720
instead of using it to build it.

00:26:12.720 --> 00:26:16.020
So the base case, and
set up some notation.

00:26:16.020 --> 00:26:20.730
T0 is going to represent T.
The length of that string I'm

00:26:20.730 --> 00:26:22.290
going to call n0 or n.

00:26:27.090 --> 00:26:31.590
And we have a
suffix array, which

00:26:31.590 --> 00:26:37.530
I'm going to call SA 0, which is
the suffix array of that text.

00:26:37.530 --> 00:26:38.820
So that's just notation.

00:26:38.820 --> 00:26:40.778
We're not actually storing
all of those things.

00:26:43.230 --> 00:26:49.410
Now, the recursion
is T k plus 1.

00:26:49.410 --> 00:26:52.770
That's going to be the next
level, which is, we write--

00:26:52.770 --> 00:26:55.880
we combine two letters, Tk--

00:26:55.880 --> 00:26:57.580
sorry, square bracket--

00:26:57.580 --> 00:27:02.970
2i comma Tk square
bracket 2i plus 1.

00:27:02.970 --> 00:27:06.480
Combine two adjacent
letters into one letter,

00:27:06.480 --> 00:27:12.480
and we do that for i
equals 0, 1, up to n/2.

00:27:15.510 --> 00:27:17.820
That's our new string.

00:27:17.820 --> 00:27:19.440
I'm not going to
sort these letters

00:27:19.440 --> 00:27:21.537
and remap the letters to
compress the alphabet.

00:27:21.537 --> 00:27:23.370
I'm just going to leave
those letters alone,

00:27:23.370 --> 00:27:24.750
as an ordered pair.

00:27:24.750 --> 00:27:29.370
In general, at level
Tk, a single letter

00:27:29.370 --> 00:27:32.027
is actually 2 to the k letters.

00:27:32.027 --> 00:27:34.110
But still, this is a useful
way to think about it,

00:27:34.110 --> 00:27:36.190
because it lets me think
about fewer suffixes.

00:27:36.190 --> 00:27:38.880
Here, we only have
the even suffixes,

00:27:38.880 --> 00:27:46.650
suffixes that begin at even
positions relative to Tk.

00:27:46.650 --> 00:27:49.500
The size of this string, in
terms of number of letters,

00:27:49.500 --> 00:27:51.360
is 1/2 of the original.

00:27:51.360 --> 00:27:56.430
So in general, this is going
to be n over 2 to the k.

00:27:56.430 --> 00:28:01.680
And then we're interested in
the suffix array SA k plus 1.

00:28:01.680 --> 00:28:07.630
This is going to be just
looking at the even values.

00:28:07.630 --> 00:28:17.340
So if we extract even
entries from sorry, SA k.

00:28:17.340 --> 00:28:19.770
So if we already
have SA k, we just

00:28:19.770 --> 00:28:22.380
take the even values
that are in there.

00:28:22.380 --> 00:28:25.620
Those are the ones that
are existing suffixes.

00:28:25.620 --> 00:28:28.050
Extract those, divide by 2.

00:28:28.050 --> 00:28:32.220
That will be the suffix
array of this text.
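
NOTE
The relation between SA k and SA k plus 1 can be checked on a small example. This is a sketch under stated assumptions: 0-indexed positions, an even-length text, and hypothetical helper names suffix_array and next_level.
```python
def suffix_array(letters):
    # Positions sorted by the suffix starting there; works for strings and lists.
    return sorted(range(len(letters)), key=lambda i: letters[i:])
def next_level(letters):
    # T_{k+1}[i] = (T_k[2i], T_k[2i+1]); assumes an even number of letters.
    return [(letters[2 * i], letters[2 * i + 1]) for i in range(len(letters) // 2)]
t0 = "mississippi$"          # 12 letters, so pairing is exact
sa0 = suffix_array(t0)
t1 = next_level(t0)
sa1 = suffix_array(t1)
# Keep the even entries of SA_0, in the same order, and halve them.
extracted = [v // 2 for v in sa0 if v % 2 == 0]
print(sa1 == extracted)
```
The pairs are left as ordered pairs, as in the lecture, rather than being renamed to a smaller alphabet.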

00:28:32.220 --> 00:28:33.810
This is kind of
backwards from how

00:28:33.810 --> 00:28:35.820
you would construct the thing.

00:28:35.820 --> 00:28:37.290
You would construct
it bottom up.

00:28:37.290 --> 00:28:38.370
Here, we're
imagining-- we already

00:28:38.370 --> 00:28:40.590
know the suffix arrays; this is
just about representation.

00:28:40.590 --> 00:28:43.440
So this is a top-down
kind of definition

00:28:43.440 --> 00:28:45.650
of what we're trying to store.

00:28:45.650 --> 00:28:48.399
OK, so this is
what we want to do.

00:28:48.399 --> 00:28:50.190
Now we are going to
build things bottom up.

00:28:50.190 --> 00:28:51.689
We're going to
imagine we've already

00:28:51.689 --> 00:28:54.210
represented SA k plus 1.

00:28:54.210 --> 00:28:57.060
And now we need
to represent SA k.

00:28:57.060 --> 00:29:04.050
If we can represent SA k
in terms of SA k plus 1

00:29:04.050 --> 00:29:06.420
with not too many
bits, then you add up

00:29:06.420 --> 00:29:08.114
all of the levels of recursion.

00:29:08.114 --> 00:29:10.530
We'll have to talk about how
many levels of this recursion

00:29:10.530 --> 00:29:11.113
we need to do.

00:29:11.113 --> 00:29:13.470
We're not going to go
down to constant size.

00:29:13.470 --> 00:29:17.310
We'll just go log log n levels.

00:29:17.310 --> 00:29:20.940
But we just add up
all those costs,

00:29:20.940 --> 00:29:23.826
and we'll get the overall
size of our data structure.

00:29:27.560 --> 00:29:30.080
So how do we do
this representation?

00:29:30.080 --> 00:29:34.190
I need to define two
kind of weird things,

00:29:34.190 --> 00:29:37.280
and then we'll see why
they're interesting.

00:29:37.280 --> 00:29:44.920
OK, the first thing is called
even successor sub k of i.

00:29:44.920 --> 00:29:47.330
So let me define it.

00:29:47.330 --> 00:29:55.700
It's going to be i if
the i-th suffix starts

00:29:55.700 --> 00:29:58.010
in an even position.

00:29:58.010 --> 00:30:00.754
So it doesn't do anything
for the even guys.

00:30:00.754 --> 00:30:02.420
The interesting thing
is when the suffix

00:30:02.420 --> 00:30:04.370
starts in an odd position.

00:30:04.370 --> 00:30:06.730
Then we're going to write
down a different number j.

00:30:09.860 --> 00:30:14.120
This is going to look kind
of weird, but it's actually--

00:30:14.120 --> 00:30:19.330
it's simple after you think
about it for 10 minutes.

00:30:19.330 --> 00:30:23.880
This one is odd.

00:30:23.880 --> 00:30:26.330
OK, so the other
situation is that SA k--

00:30:26.330 --> 00:30:29.550
the i-th suffix starts
at an even position.

00:30:29.550 --> 00:30:31.910
So let me draw a little picture.

00:30:31.910 --> 00:30:36.814
So here is SA of i.

00:30:39.600 --> 00:30:42.920
OK, if this happens to
be odd, this position

00:30:42.920 --> 00:30:44.645
in the text-- this is Tk.

00:30:47.300 --> 00:30:50.540
Then I want to go here.

00:30:50.540 --> 00:30:51.050
OK?

00:30:51.050 --> 00:30:53.008
Because that's an even
position, it's a suffix,

00:30:53.008 --> 00:30:55.080
it's right next to the
suffix I care about.

00:30:55.080 --> 00:30:57.710
It is what we call the
even successor suffix.

00:30:57.710 --> 00:30:59.810
But I don't want to
know the index of that.

00:30:59.810 --> 00:31:03.590
The index of that would
just be SA k of i plus 1.

00:31:03.590 --> 00:31:07.580
I want to map backwards
through SA inverse.

00:31:07.580 --> 00:31:12.320
I want to know, what is
the rank of that suffix?

00:31:12.320 --> 00:31:17.120
Which suffix j
starts right there?

00:31:17.120 --> 00:31:19.790
I want to know that the
j-th suffix starts right

00:31:19.790 --> 00:31:23.480
after the i-th suffix, and
I want to write down j.

00:31:23.480 --> 00:31:26.590
We'll see why this is the
right thing in a moment.

00:31:26.590 --> 00:31:29.270
We're just mapping
through SA, adding 1,

00:31:29.270 --> 00:31:32.350
and then mapping
backwards through SA.
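
NOTE
A sketch of even successor as "through SA, add 1, back through SA inverse". One assumption here is not from the lecture: the empty suffix at position n is included in the array, so that the last odd position still has a neighbour to its right. The function names are illustrative.
```python
def suffix_array(text):
    # Assumption of this sketch: the empty suffix at position n is included,
    # so every odd position has a neighbour to its right when n is even.
    return sorted(range(len(text) + 1), key=lambda i: text[i:])
def even_successor(sa):
    inverse = [0] * len(sa)          # inverse[p] = rank of the suffix starting at p
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    # Even suffixes map to themselves; odd ones go through SA, add 1, and come back.
    return [i if sa[i] % 2 == 0 else inverse[sa[i] + 1] for i in range(len(sa))]
text = "mississippi$"                # even length (12)
sa = suffix_array(text)
succ = even_successor(sa)
for i, j in enumerate(succ):
    print(i, sa[i], j, sa[j])
```
Each printed row shows i, where suffix i starts, the answer j, and where suffix j starts; the last column is always even.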

00:31:32.350 --> 00:31:33.584
So that's a function.

00:31:33.584 --> 00:31:35.750
We're going to store that
function in a particular--

00:31:35.750 --> 00:31:40.670
in a very weird way, which
we'll get to in a moment.

00:31:40.670 --> 00:31:45.420
OK, next thing we need
is called even rank.

00:31:45.420 --> 00:31:47.910
This is going to be
like our rank function.

00:31:47.910 --> 00:31:49.850
We've had it before.

00:31:49.850 --> 00:31:55.220
This is going to be the
number of even suffixes--

00:31:55.220 --> 00:31:59.090
even suffixes are suffixes
starting at even positions--

00:31:59.090 --> 00:32:05.600
preceding the i-th suffix.

00:32:05.600 --> 00:32:09.570
i-th suffix meaning the
i-th one in sorted order.

00:32:09.570 --> 00:32:11.180
So the suffix SA of i.

00:32:14.600 --> 00:32:15.980
Yes, so this is--

00:32:15.980 --> 00:32:18.170
let me be more precise.

00:32:18.170 --> 00:32:28.460
This is the number of even
values in SA k up to i.

00:32:28.460 --> 00:32:30.890
So we're looking--
so this was the text.

00:32:30.890 --> 00:32:33.680
Now we're looking at the
suffix array, which has

00:32:33.680 --> 00:32:35.660
the suffixes in sorted order.

00:32:35.660 --> 00:32:38.440
We're looking at position i
here, and we want to know,

00:32:38.440 --> 00:32:41.270
of all of these values,
which ones are even?

00:32:41.270 --> 00:32:42.540
Or how many are even--

00:32:42.540 --> 00:32:44.390
that's the even rank.
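
NOTE
A sketch of even rank as a prefix count over the suffix array. Assumptions: 0-indexed positions, the count excludes index i itself (which makes it line up with 0-indexed positions in the extracted array), and the empty suffix at position n is included. The table is a plain list, not a succinct rank structure.
```python
from itertools import accumulate
def suffix_array(text):
    # Sketch convention: positions 0..n, with the empty suffix at n included.
    return sorted(range(len(text) + 1), key=lambda i: text[i:])
def even_rank_table(sa):
    is_even = [1 if v % 2 == 0 else 0 for v in sa]
    # table[i] = number of even values among sa[0], ..., sa[i-1] (exclusive of i).
    return [0] + list(accumulate(is_even))
text = "mississippi$"
sa = suffix_array(text)
rank = even_rank_table(sa)
evens = [v for v in sa if v % 2 == 0]
# An even entry at index i of SA_k lands at index rank[i] of the extracted list.
print(all(evens[rank[i]] == sa[i] for i in range(len(sa)) if sa[i] % 2 == 0))
```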

00:32:44.390 --> 00:32:46.250
Again, a weird thing,
we'll see why it's

00:32:46.250 --> 00:32:47.530
the right thing in a moment.

00:32:55.760 --> 00:32:56.690
Right now, in fact.

00:33:02.440 --> 00:33:07.900
So here is observation 3,
putting these together.

00:33:07.900 --> 00:33:09.295
This is a rather long equation.

00:33:12.040 --> 00:33:13.510
Ultimately, I want to know--

00:33:13.510 --> 00:33:16.260
I want to represent SA k of i.

00:33:16.260 --> 00:33:17.830
I'm trying to represent that.

00:33:17.830 --> 00:33:22.090
And I want the right-hand side
to only refer to SA k plus 1.

00:33:22.090 --> 00:33:24.280
So here's the claim.

00:33:24.280 --> 00:33:28.540
Take 2 times SA k plus 1 of--

00:33:35.187 --> 00:33:36.520
I'm going to need another board.

00:33:50.240 --> 00:33:51.830
Not of i.

00:33:51.830 --> 00:34:03.110
Even rank of even successor
of i, minus the quantity 1 minus

00:34:03.110 --> 00:34:14.060
is even suffix of i.
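
NOTE
Written out, the spoken formula reads as follows. The subscripted notation is a reconstruction from the definitions given so far, not a copy of the board.
$$SA_k[i] \;=\; 2 \cdot SA_{k+1}\bigl[\mathrm{even\text{-}rank}_k\bigl(\mathrm{even\text{-}succ}_k(i)\bigr)\bigr] \;-\; \bigl(1 - \mathrm{is\text{-}even\text{-}suffix}_k(i)\bigr)$$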

00:34:14.060 --> 00:34:16.320
OK, so that's the equation.

00:34:16.320 --> 00:34:19.580
Let me unpack this a little bit.

00:34:19.580 --> 00:34:22.240
The idea is, we want to
know about a suffix i.

00:34:22.240 --> 00:34:24.080
If i happens to be even--

00:34:24.080 --> 00:34:26.929
sorry, not if i happens
to be even-- if SA of i

00:34:26.929 --> 00:34:29.510
happens to be
even, we're golden.

00:34:29.510 --> 00:34:33.260
Because that suffix is
represented by SA k plus 1,

00:34:33.260 --> 00:34:34.429
but it might not be even.

00:34:34.429 --> 00:34:38.120
So we want to round
it to an even suffix.

00:34:38.120 --> 00:34:41.150
Knowing about this odd
suffix is just about as good

00:34:41.150 --> 00:34:44.340
as knowing about the suffix
that starts right after it.

00:34:44.340 --> 00:34:46.310
So that's what even
successor does.

00:34:46.310 --> 00:34:51.620
This is rounding
to an even suffix,

00:34:51.620 --> 00:34:54.949
meaning a suffix starting
at an even position.

00:34:59.870 --> 00:35:04.820
Now there's this issue
that over here, we

00:35:04.820 --> 00:35:08.630
have this relation between
SA k and SA k plus 1,

00:35:08.630 --> 00:35:11.970
but it extracts
the even entries.

00:35:11.970 --> 00:35:14.300
So if you think about the
suffix array, which now I'm

00:35:14.300 --> 00:35:16.508
going to draw a vertical,
because that's more normal.

00:35:19.580 --> 00:35:21.964
Some of these values
are going to be even,

00:35:21.964 --> 00:35:24.380
but you don't really know which
ones are going to be even.

00:35:24.380 --> 00:35:26.530
An arbitrary subset of--

00:35:26.530 --> 00:35:29.630
the entries in SA k are even values.

00:35:29.630 --> 00:35:35.950
And those are the ones that you
extract and form SA k plus 1.

00:35:35.950 --> 00:35:38.600
But it's an arbitrary
subset, that's kind of a--

00:35:38.600 --> 00:35:40.550
you can't just divide
by 2 or something.

00:35:40.550 --> 00:35:42.140
It's not the right thing.

00:35:42.140 --> 00:35:45.680
If I'm given an index into
here, even if it's an even one,

00:35:45.680 --> 00:35:48.470
I need to know what the
corresponding index is

00:35:48.470 --> 00:35:50.270
over here.

00:35:50.270 --> 00:35:53.390
And that, I claim,
is exactly even rank.

00:35:53.390 --> 00:35:59.090
Because what position does
this cell become over here?

00:35:59.090 --> 00:36:03.140
Well, however many even
numbers there are above it.

00:36:03.140 --> 00:36:05.840
So you take-- that's what
this definition was, a number

00:36:05.840 --> 00:36:08.060
of even values in that prefix.

00:36:08.060 --> 00:36:12.360
That is the position you
will be in, in SA k plus 1.

00:36:12.360 --> 00:36:15.510
So this is what I
would call the name--

00:36:15.510 --> 00:36:17.630
we've now rounded to
an even suffix but now

00:36:17.630 --> 00:36:21.530
we need to find the name
of that even suffix--

00:36:21.530 --> 00:36:25.070
in SA k plus 1.

00:36:25.070 --> 00:36:29.120
So that's exactly
what even rank does.

00:36:29.120 --> 00:36:33.050
So now we can dereference
SA k plus 1 of that thing.

00:36:33.050 --> 00:36:38.120
That will give us a
position into the text T k

00:36:38.120 --> 00:36:42.830
plus 1, where that
suffix begins.

00:36:42.830 --> 00:36:47.960
Now that's an index into
this divided by 2 string,

00:36:47.960 --> 00:36:51.560
we need to uncompress that to
an index into the actual string.

00:36:51.560 --> 00:36:52.560
And there are two parts.

00:36:52.560 --> 00:36:55.220
One is we need to multiply by
2, because every letter in T

00:36:55.220 --> 00:36:57.770
k plus 1 is two letters in Tk.

00:36:57.770 --> 00:36:58.670
So multiply by 2.

00:36:58.670 --> 00:37:01.070
And sometimes we
need to subtract 1.

00:37:01.070 --> 00:37:04.670
We basically need to subtract
1 if even successor did

00:37:04.670 --> 00:37:05.505
anything.

00:37:05.505 --> 00:37:08.540
If even successor essentially
moved us to the right by 1,

00:37:08.540 --> 00:37:10.040
now we need to move
back to the left

00:37:10.040 --> 00:37:13.080
by 1, if this moved us at all.

00:37:13.080 --> 00:37:15.620
So I have one more function
here, which is is even suffix.

00:37:15.620 --> 00:37:19.670
Was SA of i an even--

00:37:19.670 --> 00:37:22.490
SA sub k of i, an
even number already.

00:37:22.490 --> 00:37:25.580
Which means that even
successor did nothing.

00:37:25.580 --> 00:37:30.180
If it did nothing,
then 1 minus 1 is 0,

00:37:30.180 --> 00:37:31.550
and so nothing happens.

00:37:31.550 --> 00:37:33.800
If it did something,
then is even suffix

00:37:33.800 --> 00:37:35.780
will be 0, because it was odd.

00:37:35.780 --> 00:37:37.250
And then we're subtracting 1.

00:37:37.250 --> 00:37:40.550
So this just means
subtract 1, if it was odd.

00:37:40.550 --> 00:37:43.130
You might say minus is
odd suffix, instead of

00:37:43.130 --> 00:37:44.510
1 minus is even suffix.

00:37:44.510 --> 00:37:47.330
But it turns out, this is
the thing I want to store,

00:37:47.330 --> 00:37:50.424
so I wrote it in a weird way.

00:37:50.424 --> 00:37:51.590
Why did I write it that way?

00:37:51.590 --> 00:37:57.380
Because is even suffix
is related to even rank.

00:37:57.380 --> 00:38:04.040
Even rank is just rank
sub 1 of is even suffix.

00:38:04.040 --> 00:38:05.900
And we already saw
how to do rank sub 1,

00:38:05.900 --> 00:38:09.650
and so that's why I
wanted to reuse it.

00:38:09.650 --> 00:38:12.800
I think you see now why
this equation holds.
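
NOTE
The equation can be checked for every i on one level. This is a sketch with plain lists in place of the compact encodings; it assumes 0-indexed positions, an even-length text, an even rank that counts strictly before i, and the empty suffix at position n included so the last odd position has a successor. All names are illustrative.
```python
from itertools import accumulate
def suffix_array(letters):
    # Sketch convention: positions 0..n, with the empty suffix at n included.
    return sorted(range(len(letters) + 1), key=lambda i: letters[i:])
def pair_up(letters):
    return [(letters[2 * i], letters[2 * i + 1]) for i in range(len(letters) // 2)]
t_k = "mississippi$"                          # even length
sa_k = suffix_array(t_k)
sa_next = suffix_array(pair_up(t_k))          # SA_{k+1}
is_even = [1 if v % 2 == 0 else 0 for v in sa_k]
even_rank = [0] + list(accumulate(is_even))   # evens strictly before index i
inverse = {pos: r for r, pos in enumerate(sa_k)}
even_succ = [i if is_even[i] else inverse[sa_k[i] + 1] for i in range(len(sa_k))]
def sa_from_next(i):
    # 2 * SA_{k+1}[even_rank(even_succ(i))] - (1 - is_even_suffix(i))
    return 2 * sa_next[even_rank[even_succ[i]]] - (1 - is_even[i])
print(all(sa_from_next(i) == sa_k[i] for i in range(len(sa_k))))
```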

00:38:12.800 --> 00:38:16.820
What remains is how to
store is even suffix,

00:38:16.820 --> 00:38:18.957
even rank, even successor.

00:38:23.340 --> 00:38:26.590
One other thing that
remains, is to say

00:38:26.590 --> 00:38:29.240
when to stop this recursion.

00:38:29.240 --> 00:38:32.885
So I claim it's enough to just
do this recursion for log log n

00:38:32.885 --> 00:38:33.385
levels.

00:38:43.900 --> 00:38:46.900
And then I'll call log
log n, l, the number

00:38:46.900 --> 00:38:48.620
of levels in this recursion.

00:38:48.620 --> 00:38:53.440
Because at that point,
n sub l equals n over--

00:38:53.440 --> 00:38:55.210
it's n over 2 to
the l, so that's

00:38:55.210 --> 00:38:57.760
going to be n over log n.

00:38:57.760 --> 00:39:00.880
Once I have a string
of length n over log n,

00:39:00.880 --> 00:39:04.600
I can afford the regular
boring representation

00:39:04.600 --> 00:39:11.310
of a suffix array.

00:39:11.310 --> 00:39:16.560
I can afford T log T bits,
when T is only n over log n.

00:39:16.560 --> 00:39:18.790
If you want to be a
little extra clever,

00:39:18.790 --> 00:39:22.500
you can put a factor 2 here,
and then there's a square here.

00:39:22.500 --> 00:39:25.140
And so then you're really
paying little o of T

00:39:25.140 --> 00:39:27.485
in order to store that thing.

00:39:27.485 --> 00:39:28.860
So once you get
down to here, you

00:39:28.860 --> 00:39:31.510
can afford a simple
representation.

00:39:31.510 --> 00:39:36.210
Now let's think about
how to compute SA,

00:39:36.210 --> 00:39:40.360
like the original SA,
sub 0, of an index.

00:39:40.360 --> 00:39:46.740
Well I apply this
formula at all times,

00:39:46.740 --> 00:39:49.690
I do all these computations.

00:39:49.690 --> 00:39:52.806
And now I've reduced
the problem to SA 1,

00:39:52.806 --> 00:39:54.180
and then I do
these computations.

00:39:54.180 --> 00:39:56.170
I reduce it to SA 2, and so on.

00:39:56.170 --> 00:40:00.350
After l steps, I'll have
reduced it to an SA query

00:40:00.350 --> 00:40:02.070
in a boring old
suffix array, which

00:40:02.070 --> 00:40:04.090
I've just stored as an array.

00:40:04.090 --> 00:40:07.160
So then I can answer it, and
then I pop up the recursion,

00:40:07.160 --> 00:40:11.460
log log n times, doing these
adjustments as appropriate.

00:40:11.460 --> 00:40:15.780
In the end, I get the correct
index into the original text T.
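
NOTE
The whole query, recursing down l levels and adjusting on the way back up, might be sketched like this. Assumptions: the text length is divisible by 2 to the l, the per-level tables are plain lists standing in for the compact encodings, and the empty suffix at position n is included at every level. The names build and lookup are hypothetical.
```python
from itertools import accumulate
def suffix_array(letters):
    # Sketch convention: positions 0..n, with the empty suffix at n included.
    return sorted(range(len(letters) + 1), key=lambda i: letters[i:])
def pair_up(letters):
    return [(letters[2 * i], letters[2 * i + 1]) for i in range(len(letters) // 2)]
def build(text, levels):
    # Per level: (is_even, even_rank, even_succ) as plain lists, which are
    # stand-ins for the compact encodings; only the last SA is kept explicitly.
    tables, letters = [], list(text)
    for _ in range(levels):
        sa = suffix_array(letters)
        is_even = [1 if v % 2 == 0 else 0 for v in sa]
        even_rank = [0] + list(accumulate(is_even))
        inverse = {pos: r for r, pos in enumerate(sa)}
        even_succ = [i if is_even[i] else inverse[sa[i] + 1] for i in range(len(sa))]
        tables.append((is_even, even_rank, even_succ))
        letters = pair_up(letters)
    return tables, suffix_array(letters)
def lookup(tables, base_sa, i, k=0):
    if k == len(tables):
        return base_sa[i]                      # boring explicit array at level l
    is_even, even_rank, even_succ = tables[k]
    below = lookup(tables, base_sa, even_rank[even_succ[i]], k + 1)
    return 2 * below - (1 - is_even[i])        # undo the halving and the rounding
text = "mississippimiss$"                      # 16 letters, divisible by 2**2
tables, base_sa = build(text, levels=2)
print([lookup(tables, base_sa, i) for i in range(len(text) + 1)] == suffix_array(text))
```
Each call does a constant number of table reads per level, which is where the log log n query time comes from once the tables answer in constant time.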

00:40:15.780 --> 00:40:17.850
How much time did it take?

00:40:17.850 --> 00:40:19.110
Order log log n time.

00:40:23.640 --> 00:40:30.050
So I can do a log log
n time query to SA.

00:40:34.740 --> 00:40:37.940
This is, of course, assuming
that even rank, even successor,

00:40:37.940 --> 00:40:42.030
and is even suffix are all
constant time operations.

00:40:42.030 --> 00:40:44.340
So what remains is
to do each of these

00:40:44.340 --> 00:40:46.740
in small space
and constant time.

00:40:46.740 --> 00:40:50.790
Then my overall query time will
only go up by log log factor.

00:40:50.790 --> 00:40:52.830
This is actually going
to be pretty good,

00:40:52.830 --> 00:40:54.470
we're not going to--

00:40:54.470 --> 00:40:56.550
we're going to
achieve log log query

00:40:56.550 --> 00:40:59.920
when we have T log log T bits.

00:40:59.920 --> 00:41:02.150
That'll be our first
encoding of these things.

00:41:02.150 --> 00:41:03.608
Later on, we're
going to have to go up

00:41:03.608 --> 00:41:05.970
to log to the epsilon, which
is worse than log log n.

00:41:08.620 --> 00:41:09.600
Clear, so far?

00:41:09.600 --> 00:41:12.447
Everything is pretty
easy at this point now.

00:41:12.447 --> 00:41:14.280
It's going to remain
easy, it's just there's

00:41:14.280 --> 00:41:15.780
a lot of pieces to the puzzle.

00:41:15.780 --> 00:41:18.300
This is the first--
this is the big idea.

00:41:18.300 --> 00:41:21.240
Next thing is some
fancy encoding schemes

00:41:21.240 --> 00:41:22.615
to make these
things quite small.

00:41:22.615 --> 00:41:23.114
Question?

00:41:23.114 --> 00:41:25.690
AUDIENCE: [INAUDIBLE] Did you
say what the space [INAUDIBLE]

00:41:25.690 --> 00:41:26.640
was?

00:41:26.640 --> 00:41:27.510
ERIK DEMAINE: We haven't
analyzed space yet,

00:41:27.510 --> 00:41:28.710
because I haven't said
how we're actually

00:41:28.710 --> 00:41:29.812
storing these functions.

00:41:29.812 --> 00:41:31.770
If you stored these
functions explicitly, you'd

00:41:31.770 --> 00:41:34.590
have bad space, probably still
T log T. But it turns out,

00:41:34.590 --> 00:41:38.400
these functions can be encoded
in a clever way, that's small--

00:41:38.400 --> 00:41:41.610
smaller, it's going
to be T log log T.

00:41:41.610 --> 00:41:44.710
And still has
constant time query.

00:41:44.710 --> 00:41:47.266
AUDIENCE: Without the functions,
how much space are we using?

00:41:47.266 --> 00:41:48.765
ERIK DEMAINE: Without
the functions,

00:41:48.765 --> 00:41:51.090
we're using,
essentially, no space.

00:41:51.090 --> 00:41:54.096
I guess, at the end
where we're using--

00:41:54.096 --> 00:41:55.470
the only thing
we've said so far,

00:41:55.470 --> 00:41:58.120
is at the end we use an
explicit suffix array.

00:41:58.120 --> 00:42:00.480
And if you set this
to 2 log log T,

00:42:00.480 --> 00:42:04.010
then this would be like n
over log n bits of space.

00:42:04.010 --> 00:42:07.860
Because it's going
to be this times--

00:42:07.860 --> 00:42:12.980
I mean, the space at the bottom
is going to be nl log nl.

00:42:12.980 --> 00:42:15.150
That's to store an
explicit suffix array,

00:42:15.150 --> 00:42:17.400
so it's going to be
this times log of this,

00:42:17.400 --> 00:42:23.190
which is going to be n over
log n, if we put the 2 in.
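
NOTE
Spelled out, with n_l the text length at the last level and logs base 2. The inequalities are a reconstruction of the spoken argument, not taken from the board.
$$n_\ell = \frac{n}{2^\ell}, \qquad \ell = \log\log n \;\Rightarrow\; n_\ell \log n_\ell \;\le\; \frac{n}{\log n}\cdot\log n \;=\; n$$
$$\ell = 2\log\log n \;\Rightarrow\; n_\ell = \frac{n}{\log^2 n}, \qquad n_\ell \log n_\ell \;\le\; \frac{n}{\log^2 n}\cdot\log n \;=\; \frac{n}{\log n} \;=\; o(n)$$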

00:42:23.190 --> 00:42:26.280
So that part's really cheap,
and that's little o of n.

00:42:26.280 --> 00:42:28.670
Of course, we probably also
have to store the text.

00:42:28.670 --> 00:42:30.720
So that's n bits.

00:42:30.720 --> 00:42:32.572
I didn't mention--
I'm going to assume,

00:42:32.572 --> 00:42:33.780
I don't think we need it yet.

00:42:33.780 --> 00:42:36.600
At some point I will assume
that the alphabet's binary.

00:42:36.600 --> 00:42:39.700
So I'm going to leave off--
when I say n bits, really it's

00:42:39.700 --> 00:42:42.270
n log sigma bits, or n
characters, or whatever.

00:42:42.270 --> 00:42:45.850
But I'm not going to
worry about that here.

00:42:45.850 --> 00:42:48.810
Are there questions?

00:42:48.810 --> 00:42:51.064
So now, it's an
encoding problem.

00:42:51.064 --> 00:42:52.230
How do we encode these guys?

00:42:57.120 --> 00:43:00.419
Actually, even successor is the
only thing that's non-trivial.

00:43:00.419 --> 00:43:02.460
We're going to do the
obvious thing for the rest.

00:43:05.110 --> 00:43:07.910
So let me tell you about
the obvious ones, easy ones.

00:43:16.527 --> 00:43:18.110
At least, the first
revision we're not

00:43:18.110 --> 00:43:20.540
going to do anything fancy
with them, later on we will.

00:43:26.170 --> 00:43:27.865
Sorry, is even suffix.

00:43:37.010 --> 00:43:39.080
We're just going to store
this as a bit vector.

00:43:39.080 --> 00:43:47.990
This is 1 if SA k is
even, 0 if it's odd.

00:43:47.990 --> 00:43:49.840
So if we just store
that as a bit vector,

00:43:49.840 --> 00:43:55.730
this is n sub k bits
that we can afford.

00:43:55.730 --> 00:43:57.439
Because this is a
geometric series,

00:43:57.439 --> 00:43:58.480
it's going to be order n.
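
NOTE
The geometric series being summed, with n_k = n / 2^k and l levels (notation assumed from the earlier definitions):
$$\sum_{k=0}^{\ell-1} n_k \;=\; \sum_{k=0}^{\ell-1} \frac{n}{2^k} \;<\; 2n \;=\; O(n)$$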

00:44:02.030 --> 00:44:03.290
Next is even rank.

00:44:06.830 --> 00:44:10.550
This is just the
rank one structure

00:44:10.550 --> 00:44:14.900
that we covered last
class, on this thing.

00:44:14.900 --> 00:44:19.420
So this is going to be nk--

00:44:19.420 --> 00:44:25.610
I think we did log
log nk over log nk.

00:44:25.610 --> 00:44:28.670
And this can be improved
to nk over log to the k--

00:44:28.670 --> 00:44:31.760
or log to the something of nk.

00:44:31.760 --> 00:44:33.830
But that's an OK bound.

00:44:33.830 --> 00:44:36.740
It's little o of n.
Again, this is geometric,

00:44:36.740 --> 00:44:39.890
so this overall will
be little o of n.

00:44:39.890 --> 00:44:48.032
So those are easy, the remaining
part is doing even successor.

00:45:00.120 --> 00:45:03.370
A little optimization.

00:45:03.370 --> 00:45:09.640
For the i's where
SA k of i is even,

00:45:09.640 --> 00:45:11.320
we don't really need
to store anything.

00:45:11.320 --> 00:45:14.870
Because then, even successor
is the identity function.

00:45:14.870 --> 00:45:16.810
So let's forget
about those guys.

00:45:16.810 --> 00:45:21.640
I'll say, it's trivial
for even successors--

00:45:21.640 --> 00:45:24.490
for even suffixes.

00:45:29.350 --> 00:45:33.130
So what I'd like to do, is store
the answers for odd suffixes.

00:45:33.130 --> 00:45:36.550
That's what we're going to do.

00:45:36.550 --> 00:45:39.880
We're going to store them in
a weird way, as we will see.

00:45:50.388 --> 00:45:52.100
So that's the odd suffixes.

00:45:52.100 --> 00:45:59.054
There are nk over 2 evens,
and there are nk over 2 odds.

00:45:59.054 --> 00:46:00.470
So we've just saved
a factor of 2.

00:46:00.470 --> 00:46:02.350
This wasn't a very
deep observation.

00:46:02.350 --> 00:46:06.320
But it turns out, if you
focus in on the odd ones,

00:46:06.320 --> 00:46:08.500
there's a nice little
structure to them.

00:46:12.330 --> 00:46:14.770
This step isn't
really necessary,

00:46:14.770 --> 00:46:16.060
but it saves a factor of 2.

00:46:24.910 --> 00:46:29.560
Now the kind of
interesting observation.

00:46:29.560 --> 00:46:34.269
What I'd like to do is store
these answers in order by i.

00:46:34.269 --> 00:46:35.560
That's the obvious thing to do.

00:46:35.560 --> 00:46:37.060
I want to store
basically an array.

00:46:40.780 --> 00:46:43.600
Just store it in
order by i, so I'm

00:46:43.600 --> 00:46:46.360
skipping the even suffixes,
just storing the answers

00:46:46.360 --> 00:46:49.240
for the odd suffixes.

00:46:49.240 --> 00:46:52.750
So if I was given a number
i, how would I look it up?

00:46:52.750 --> 00:46:59.180
Well, given an index i
into the suffix array,

00:46:59.180 --> 00:47:00.925
what I need to know is--

00:47:00.925 --> 00:47:05.290
this is basically the inverse
of what we did with SA k plus 1.

00:47:05.290 --> 00:47:07.560
SA k plus 1 is extracting
the even entries,

00:47:07.560 --> 00:47:09.640
here we're extracting
the odd entries.

00:47:09.640 --> 00:47:13.660
So all I need to know
is the odd rank of i,

00:47:13.660 --> 00:47:15.940
and then I look
up into this array

00:47:15.940 --> 00:47:18.110
at position odd rank of i.

00:47:18.110 --> 00:47:20.260
That will give me
the answer I want.

00:47:20.260 --> 00:47:23.350
Well, first I check,
is it an even suffix,

00:47:23.350 --> 00:47:25.000
which I have stored
as a bit vector.

00:47:25.000 --> 00:47:28.990
If it's an even suffix, I
do nothing, I just return i.

00:47:28.990 --> 00:47:32.710
But if it's an odd suffix,
then I compute the odd rank.

00:47:32.710 --> 00:47:34.290
How do I compute the odd rank?

00:47:34.290 --> 00:47:37.720
I take the even rank
and take i minus that.

00:47:37.720 --> 00:47:42.362
Odd rank, we don't need
to store anything for it.

00:47:42.362 --> 00:47:43.820
I mean, you could
if you wanted to,

00:47:43.820 --> 00:47:47.530
but odd rank is just
i minus even rank.

00:47:50.050 --> 00:47:55.590
Because every index
is either odd or even.
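
NOTE
The lookup just described, storing answers only for odd suffixes and indexing by odd rank, might look like this. It is a sketch with a plain list for the answers (the lecture goes on to encode them differently), 0-indexed positions, ranks that count strictly before i, and the empty suffix at position n included. The names are illustrative.
```python
from itertools import accumulate
def suffix_array(text):
    # Sketch convention: positions 0..n, with the empty suffix at n included.
    return sorted(range(len(text) + 1), key=lambda i: text[i:])
text = "mississippi$"                          # even length
sa = suffix_array(text)
is_even = [1 if v % 2 == 0 else 0 for v in sa]
even_rank = [0] + list(accumulate(is_even))    # evens strictly before index i
inverse = {pos: r for r, pos in enumerate(sa)}
# Answers kept only for odd suffixes, listed in increasing order of i.
odd_answers = [inverse[sa[i] + 1] for i in range(len(sa)) if not is_even[i]]
def even_successor(i):
    if is_even[i]:
        return i                               # identity on even suffixes
    odd_rank = i - even_rank[i]                # odds strictly before index i
    return odd_answers[odd_rank]
print([even_successor(i) for i in range(len(sa))])
```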

00:47:55.590 --> 00:47:57.040
OK, great.

00:47:57.040 --> 00:48:00.900
So I can look up odd rank
and then look at this array.

00:48:00.900 --> 00:48:02.580
That'll give me
the answer I need.

00:48:02.580 --> 00:48:04.788
But I'm not going to actually
store this as an array.

00:48:04.788 --> 00:48:07.350
I lied.

00:48:07.350 --> 00:48:09.370
But in any case, let's
worry about how I'm

00:48:09.370 --> 00:48:10.880
going to store it in a moment.

00:48:10.880 --> 00:48:15.670
Let's think about i-- if
I'm storing these answers--

00:48:15.670 --> 00:48:21.091
the even successor answers,
these j values, in order by i.

00:48:21.091 --> 00:48:25.210
I claim that order is a very
special order, because what

00:48:25.210 --> 00:48:27.100
does it mean to order by i?

00:48:27.100 --> 00:48:31.240
Ordering by i, that means the
suffixes are sorted, right?

00:48:31.240 --> 00:48:43.240
So this is the same thing as
ordering by an odd suffix in Tk

00:48:43.240 --> 00:48:47.590
from SA of i onwards.

00:48:47.590 --> 00:48:49.360
That's the suffix that we're--

00:48:49.360 --> 00:48:52.462
sorting by that suffix,
is sorting by i.

00:48:55.060 --> 00:48:56.980
Now we can unpack
an odd suffix--

00:48:56.980 --> 00:49:00.190
it has the first character--
and then an even suffix.

00:49:00.190 --> 00:49:03.850
So this is the same
thing as ordering by--

00:49:03.850 --> 00:49:05.350
this should look
familiar because we

00:49:05.350 --> 00:49:06.460
did the same kinds
of tricks when

00:49:06.460 --> 00:49:07.710
we were building suffix trees.

00:49:21.010 --> 00:49:23.110
This is even.

00:49:26.390 --> 00:49:28.328
In fact, it's the
even successor.

00:49:31.674 --> 00:49:41.560
There's a typo here,
[? see ?] If we follow SA k,

00:49:41.560 --> 00:49:42.700
and then we add 1.

00:49:42.700 --> 00:49:45.820
If we follow SA
k backwards, that

00:49:45.820 --> 00:49:48.430
was the definition
of even successor.

00:49:48.430 --> 00:49:51.200
So I can rewrite this thing.

00:49:51.200 --> 00:50:06.500
This part is the same thing as
Tk SA k even successor k of i,

00:50:06.500 --> 00:50:09.706
closed bracket,
colon, closed bracket.

00:50:09.706 --> 00:50:11.980
Get that right?

00:50:11.980 --> 00:50:14.650
Yes.

00:50:14.650 --> 00:50:16.630
That was the definition
of even successors.

00:50:16.630 --> 00:50:20.950
Even successor is the value j,
for which if I do SA k of j,

00:50:20.950 --> 00:50:22.896
I get SA k of i plus 1.

00:50:22.896 --> 00:50:25.600
That's the definition.

00:50:25.600 --> 00:50:30.185
OK, now Tk of SA of k.

00:50:32.980 --> 00:50:35.400
Sorry, the suffix--
that's not Tk of.

00:50:35.400 --> 00:50:36.580
There's a colon here.

00:50:36.580 --> 00:50:40.690
The suffix of Tk
starting at SA k.

00:50:40.690 --> 00:50:42.903
If I sort by those suffixes--

00:50:46.770 --> 00:50:47.780
they're sorted, right?

00:50:47.780 --> 00:50:50.330
I mean, that was the
point of the suffix array,

00:50:50.330 --> 00:50:51.860
is to sort the suffixes.

00:50:51.860 --> 00:50:57.350
So if I say I'm ordering by the
suffixes given in order by SA

00:50:57.350 --> 00:50:59.660
k, they're already sorted.

00:50:59.660 --> 00:51:02.720
There's no reason to do
this Tk of SA k part.

00:51:02.720 --> 00:51:07.250
This is going to be the
same thing as the order

00:51:07.250 --> 00:51:16.513
by this first letter, Tk SA
k of i comma, even successor.

00:51:20.257 --> 00:51:22.340
The suffix array is defined
to have this property,

00:51:22.340 --> 00:51:23.881
that these orders
are the same thing.

00:51:26.470 --> 00:51:28.510
And sorting by the
suffixes is the same thing

00:51:28.510 --> 00:51:33.590
as sorting by the indices
into the suffix array.

00:51:33.590 --> 00:51:36.702
Interesting, because this is
what I want to store, right?

00:51:36.702 --> 00:51:38.660
Those are the answers
that I'm trying to store.

00:51:38.660 --> 00:51:41.990
I'm trying to store even
successor for every i

00:51:41.990 --> 00:51:43.910
that has an odd--

00:51:43.910 --> 00:51:45.710
that starts in an odd suffix.

00:51:48.500 --> 00:51:51.740
So really, all I need to
do is order by this thing.

00:51:51.740 --> 00:51:54.560
And then once I've
ordered by this thing,

00:51:54.560 --> 00:52:00.450
I'll store these guys
in order by their value.

00:52:00.450 --> 00:52:00.980
Cool.

00:52:00.980 --> 00:52:04.160
So these are the pairs
I'm going to store.

00:52:04.160 --> 00:52:05.780
I'm not going to--

00:52:05.780 --> 00:52:11.110
I'm going to store this comma
this, for all i, in order

00:52:11.110 --> 00:52:12.242
by this value.

00:52:12.242 --> 00:52:13.430
That is my goal.

00:52:13.430 --> 00:52:16.010
If I can store these
in order by this value,

00:52:16.010 --> 00:52:18.560
then by computing
odd rank, I know

00:52:18.560 --> 00:52:20.762
where in this list
of pairs to go.

00:52:20.762 --> 00:52:22.220
And I just look at
the second value

00:52:22.220 --> 00:52:26.720
of the pair, that is my answer.

00:52:26.720 --> 00:52:28.290
Why am I storing this?

00:52:28.290 --> 00:52:28.790
We'll see.

00:52:31.500 --> 00:52:33.500
I don't know if you really
need to, but you can.

00:52:36.080 --> 00:52:38.810
OK.

00:52:38.810 --> 00:52:41.276
So what we're going to--

00:52:41.276 --> 00:52:43.040
I feel like it's cheating.

00:52:43.040 --> 00:52:44.780
I say, actually
store these pairs.

00:52:44.780 --> 00:52:46.700
We're not really going
to actually store them.

00:52:46.700 --> 00:52:49.070
We still have another
trick up our sleeve.

00:52:49.070 --> 00:52:51.920
But more or less, we're
going to store these pairs--

00:52:51.920 --> 00:52:54.860
I'll cross out, actually.

00:52:54.860 --> 00:53:02.906
Store these pairs
in order by value.

00:53:02.906 --> 00:53:04.280
Storing them in
order by value is

00:53:04.280 --> 00:53:06.820
the same thing as order by i.

00:53:06.820 --> 00:53:09.150
That's what we just proved.
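
A quick way to convince yourself of this is to check it on a small string. The sketch below is only an illustration under assumptions: 0-indexed positions, a naive suffix array, and an odd-length example text so each odd suffix has a successor.

```python
text = "abracadabra"  # assumed example; odd length, so every odd position has a successor
sa = sorted(range(len(text)), key=lambda p: text[p:])  # naive suffix array
inverse = {pos: idx for idx, pos in enumerate(sa)}

# One pair (first letter, even successor) per odd suffix, listed in order by i.
pairs = [(text[sa[i]], inverse[sa[i] + 1]) for i in range(len(sa)) if sa[i] % 2 == 1]

# The claim: listing by i already lists the pairs in sorted order by value.
pairs_are_sorted = pairs == sorted(pairs)
```

The reason it works is that a suffix is its first letter followed by the next suffix, so comparing two suffixes is the same as comparing these pairs.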

00:53:09.150 --> 00:53:10.850
And at this point,
is when I'm going

00:53:10.850 --> 00:53:12.350
to assume a binary alphabet.

00:53:16.971 --> 00:53:17.470
OK.

00:53:22.980 --> 00:53:28.330
Maybe, I'll go through here.

00:53:31.000 --> 00:53:31.960
Need lots of stuff.

00:53:35.180 --> 00:53:37.440
Think we don't need this
giant recursion up here.

00:53:41.494 --> 00:53:42.910
Just remember,
it's enough to know

00:53:42.910 --> 00:53:45.550
how to compute even
successor, the rest is easy.

00:54:16.620 --> 00:54:17.300
So here we go.

00:54:19.434 --> 00:54:20.850
We're trying to
store these pairs,

00:54:20.850 --> 00:54:30.740
so we're trying to store
a sorted array of nk

00:54:30.740 --> 00:54:31.480
over 2 values.

00:54:34.550 --> 00:54:37.520
That's how many odd
suffixes there are.

00:54:37.520 --> 00:54:46.460
And they're each 2 to the k
plus log nk bits, I claim.

00:54:46.460 --> 00:54:47.450
Why?

00:54:47.450 --> 00:54:51.510
Because this was a
single character in Tk.

00:54:51.510 --> 00:54:54.329
But a single character in Tk
was actually 2 to the k bits,

00:54:54.329 --> 00:54:56.120
in the original string
for binary alphabet,

00:54:56.120 --> 00:54:59.120
and general sigma to the k.

00:54:59.120 --> 00:55:01.470
So that's that part of
this 2 to the k bits.

00:55:01.470 --> 00:55:03.710
The even successor, well,
that's just an index

00:55:03.710 --> 00:55:05.570
into something of size nk.

00:55:05.570 --> 00:55:08.060
So it's log nk bits.

00:55:08.060 --> 00:55:08.880
OK, fine.

00:55:08.880 --> 00:55:11.390
If I store that explicitly,
I would be in trouble,

00:55:11.390 --> 00:55:16.470
because 2 to the
k times nk is n.

00:55:16.470 --> 00:55:19.460
And so I would be storing
n bits at every level--

00:55:19.460 --> 00:55:23.300
well, so I guess they
get n log log n space.

00:55:23.300 --> 00:55:25.110
That part's actually OK.

00:55:25.110 --> 00:55:27.200
I can afford that
much if I'm just

00:55:27.200 --> 00:55:29.900
going for an n log log n bound.

00:55:29.900 --> 00:55:32.840
This part, not so much.

00:55:32.840 --> 00:55:35.270
Because in particular,
when k equals 0,

00:55:35.270 --> 00:55:38.030
that's going to
be n times log n.

00:55:38.030 --> 00:55:40.431
I don't want to
spend n log n space.

00:55:40.431 --> 00:55:41.930
And the whole point,
is we're trying

00:55:41.930 --> 00:55:43.346
to avoid storing
these explicitly.

00:55:43.346 --> 00:55:45.920
Because if I did, I'd
get n log n space.

00:55:45.920 --> 00:55:48.010
So we're not going to
store them explicitly.

00:55:52.010 --> 00:56:01.397
As follows: we are
going to store-- so, there

00:56:01.397 --> 00:56:02.480
are these big bit vectors.

00:56:02.480 --> 00:56:07.550
We're going to look at
the leading log nk bits.

00:56:07.550 --> 00:56:10.580
This is kind of weird,
because the log nk bits

00:56:10.580 --> 00:56:12.050
we care about are at the end.

00:56:12.050 --> 00:56:14.150
But we're going to look
at the leading log nk bits

00:56:14.150 --> 00:56:20.190
especially, because this is
a sorted list of bit vectors.

00:56:20.190 --> 00:56:23.452
So if you look at the leading
bits, most of the time,

00:56:23.452 --> 00:56:24.660
they're going to be the same.

00:56:24.660 --> 00:56:26.300
They don't change very much.

00:56:26.300 --> 00:56:28.850
Leading bits are going to
be all 0's for a while,

00:56:28.850 --> 00:56:30.530
and then occasionally
they'll increment.

00:56:30.530 --> 00:56:31.904
How many times
will it increment?

00:56:31.904 --> 00:56:36.440
nk times, at most, if we look
at the leading log nk bits.

00:56:48.274 --> 00:56:49.690
Here's the crazy
idea, we're going

00:56:49.690 --> 00:56:53.080
to use unary encoding,
unary differential encoding.

00:56:59.440 --> 00:57:00.970
Differential encoding
means, instead

00:57:00.970 --> 00:57:03.910
of storing a list of values,
you store the first value.

00:57:03.910 --> 00:57:07.540
Then the next value,
minus the first value,

00:57:07.540 --> 00:57:10.284
and then the next value
minus that value, and so on.

00:57:10.284 --> 00:57:11.950
And unary means we're
going to represent

00:57:11.950 --> 00:57:14.650
those differences in unary.

00:57:14.650 --> 00:57:18.370
Seems like a bad idea, but it
turns out it's a good idea.

00:57:18.370 --> 00:57:20.230
So here's what it looks
like, you look at--

00:57:20.230 --> 00:57:22.020
I'm going to write down 0.

00:57:22.020 --> 00:57:27.270
I'm going to write down a bunch
of 0's, however big v1 is.

00:57:27.270 --> 00:57:28.630
Then I'm going to write a 1.

00:57:28.630 --> 00:57:30.670
Then I'm going to write
a bunch of 0's, however

00:57:30.670 --> 00:57:35.510
big v2 minus v1 is.

00:57:35.510 --> 00:57:39.130
Then I'll write a 1, and so on.

00:57:39.130 --> 00:57:43.200
0 to the lead, the
leading bits of v--

00:57:43.200 --> 00:57:44.340
sorry.

00:57:44.340 --> 00:57:49.190
It's the leading bits of v2
minus the leading bits of v1.

00:57:49.190 --> 00:57:51.960
That's what I meant.

00:57:51.960 --> 00:57:56.240
And then leading bits of v3
minus the leading bits of v2.

00:57:56.240 --> 00:57:59.860
And then 1, and so on.

00:57:59.860 --> 00:58:02.290
OK, that is unary
differential encoding.

00:58:02.290 --> 00:58:05.380
I claim this is small,
looks kind of crazy.

00:58:05.380 --> 00:58:09.210
But it's small, because how
many 0's are there total?

00:58:09.210 --> 00:58:11.050
Well, at most, nk 0's.

00:58:11.050 --> 00:58:15.340
Because I start at the value 0.

00:58:15.340 --> 00:58:18.760
With log nk bits, at most
I get up to n k minus 1.

00:58:18.760 --> 00:58:22.505
So the number of times I
increment is, at most, nk.

00:58:25.660 --> 00:58:26.770
How many 1's are there?

00:58:30.040 --> 00:58:34.030
Well, there's one 1, per value.

00:58:34.030 --> 00:58:35.620
So there's nk over 2 1's.

00:58:40.840 --> 00:58:46.030
So total size of this
bit vector is 3/2 nk.

00:58:48.580 --> 00:58:54.630
So storing those leading bits
in this weird way is cheap.
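
Here is a small Python sketch of the encoding just described, as an illustration rather than the lecture's own code. The function name `unary_differential` and the sample leading-bit values are assumptions.

```python
def unary_differential(leads):
    # leads must be sorted; per value, write (gap to previous) 0's, then a single 1.
    bits = []
    previous = 0
    for lead in leads:
        bits.extend([0] * (lead - previous))
        bits.append(1)
        previous = lead
    return bits


leads = [0, 1, 1, 3, 3]           # hypothetical leading-bit values, already sorted
bits = unary_differential(leads)  # [1, 0, 1, 1, 0, 0, 1, 1]
zeros = bits.count(0)             # equals the last (largest) lead value
ones = bits.count(1)              # one per stored value
```

The number of 0's is bounded by the largest leading value and the number of 1's by the number of values, which is where the linear size comes from.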

00:58:54.630 --> 00:58:57.470
Linear-- again, this
geometric series

00:58:57.470 --> 00:58:59.240
is going to add up to 3/2.

00:58:59.240 --> 00:59:04.640
All right, it's going
to add up to 3 times n.

00:59:04.640 --> 00:59:06.924
Cool.

00:59:06.924 --> 00:59:08.340
But that's just
the leading bits--

00:59:08.340 --> 00:59:09.646
I need to store this thing.

00:59:09.646 --> 00:59:11.020
I need to store
the leading bits,

00:59:11.020 --> 00:59:12.644
and I need to store
the remaining bits.

00:59:12.644 --> 00:59:15.660
Now the remaining bits, there's
only 2 to the k remaining bits.

00:59:15.660 --> 00:59:17.040
We switched the order.

00:59:17.040 --> 00:59:18.624
We looked at the
high log nk bits,

00:59:18.624 --> 00:59:20.040
but then the low
end bits, there's

00:59:20.040 --> 00:59:21.390
going to be 2 to the k of them.

00:59:21.390 --> 00:59:23.700
That I already said was OK.

00:59:23.700 --> 00:59:25.890
We could afford that--

00:59:25.890 --> 00:59:30.020
kind of, we'd lose
a log log factor.

00:59:30.020 --> 00:59:34.430
So we store the trailing
2 to the k bits.

00:59:34.430 --> 00:59:36.940
This we actually
store explicitly.

00:59:41.280 --> 00:59:44.350
So this is going to
be 2 to the k times

00:59:44.350 --> 00:59:50.520
nk over 2, which is n/2 bits.

00:59:50.520 --> 00:59:52.840
nk is n over 2 to the k.

00:59:52.840 --> 00:59:56.650
Cancel, n over 2.

00:59:56.650 --> 01:00:00.880
OK, so total number of
bits-- we add these up--

01:00:00.880 --> 01:00:11.130
is going to be 1/2
n plus 3/2 nk plus--

01:00:11.130 --> 01:00:13.410
we'll get to this later.

01:00:13.410 --> 01:00:19.710
And then the total, this we
have to do for log log n levels.

01:00:19.710 --> 01:00:25.170
We're summing k
equals 0 to log log n.

01:00:25.170 --> 01:00:26.940
This thing.

01:00:26.940 --> 01:00:33.604
And this comes out
to 1/2 n log log n.

01:00:33.604 --> 01:00:35.270
This is bad, we want
to get rid of that.

01:00:35.270 --> 01:00:42.660
But that was our first
aim, then we have 5n--

01:00:42.660 --> 01:00:43.800
did I miss a term?

01:00:47.580 --> 01:00:48.080
OK.

01:00:53.555 --> 01:00:57.060
Where did I miss the nk?

01:00:57.060 --> 01:01:01.450
This was the cost
for even successor.

01:01:01.450 --> 01:01:05.010
OK, but there was also, is
even suffix, which was nk bits,

01:01:05.010 --> 01:01:08.010
and there was even rank,
which was little o of that.

01:01:08.010 --> 01:01:13.035
So there's an extra nk
here for is even suffix.

01:01:17.760 --> 01:01:20.490
OK, so we have nk plus 3/2 nk.

01:01:20.490 --> 01:01:22.080
That's 5/2 nk.

01:01:22.080 --> 01:01:23.850
And then the 1/2
disappears because it's

01:01:23.850 --> 01:01:24.840
a geometric series.

01:01:24.840 --> 01:01:28.710
So we end up with 5n,
for what it's worth.
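
Written out, the arithmetic behind the 5n looks like this, using nk equals n over 2 to the k as stated earlier. The per-level 1/2 n for the trailing bits is kept separate, since it does not shrink with k.

```latex
\sum_{k=0}^{\log\log n} \left( n_k + \tfrac{3}{2}\, n_k \right)
  = \tfrac{5}{2} \sum_{k=0}^{\log\log n} \frac{n}{2^k}
  \le \tfrac{5}{2} \cdot 2n = 5n,
\qquad
\sum_{k=0}^{\log\log n} \tfrac{1}{2}\, n
  = \tfrac{1}{2}\, n \,(\log\log n + 1).
```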

01:01:28.710 --> 01:01:30.770
Plus big O of something.

01:01:30.770 --> 01:01:32.730
OK, I left out something,
because there's

01:01:32.730 --> 01:01:35.640
one data structure we
haven't yet described.

01:01:35.640 --> 01:01:37.290
There's one more thing we need.

01:01:37.290 --> 01:01:40.230
And that comes up if you want
to do a query in the structure.

01:01:40.230 --> 01:01:41.280
How do I do a query?

01:01:44.490 --> 01:01:46.620
I already did odd
rank, so I'm just

01:01:46.620 --> 01:01:50.430
trying to look up into the
sorted array, at a given index.

01:01:50.430 --> 01:01:55.740
Well, first thing is to
compute the leading bits.

01:01:55.740 --> 01:01:58.650
Actually, computing
leading bits is really easy

01:01:58.650 --> 01:02:00.720
if I have rank and select.

01:02:00.720 --> 01:02:05.590
What I want, if I'm trying
to index into index i,

01:02:05.590 --> 01:02:07.860
I want the i-th one bit.

01:02:07.860 --> 01:02:09.390
To look at the
i-th one bit, which

01:02:09.390 --> 01:02:18.690
is select sub 1 of i, which
we already know how to do,

01:02:18.690 --> 01:02:24.040
then that corresponds
to the i-th value.

01:02:24.040 --> 01:02:25.980
And in particular,
if I look at how many

01:02:25.980 --> 01:02:30.570
0's are there up
to that point, it's

01:02:30.570 --> 01:02:31.950
going to be the sum of this.

01:02:31.950 --> 01:02:35.220
Plus this, plus this,
it's a telescoping sum.

01:02:35.220 --> 01:02:39.380
It's just going to give
me the leading bits.

01:02:39.380 --> 01:02:42.350
Because this plus this
is just lead of v2.

01:02:42.350 --> 01:02:44.930
This plus that is lead of v3.

01:02:44.930 --> 01:02:45.944
So they all cancel.

01:02:45.944 --> 01:02:47.360
I just count the
number of 0 bits.

01:02:47.360 --> 01:02:50.540
That's exactly the
value I want to know.

01:02:50.540 --> 01:02:56.960
So I want to do rank
sub 0 of that position.

01:02:56.960 --> 01:03:00.780
That will tell me
the leading bits.

01:03:00.780 --> 01:03:08.730
In a query, it's not
really lead of i, I guess.

01:03:08.730 --> 01:03:13.040
Lead of vi is what
we're trying to compute.

01:03:13.040 --> 01:03:14.540
Now, we also need
the trailing bits.

01:03:14.540 --> 01:03:16.090
The trailing bits,
they're just in an array,

01:03:16.090 --> 01:03:17.220
so you just look that up.

01:03:17.220 --> 01:03:18.303
You get the trailing bits.

01:03:18.303 --> 01:03:20.180
You concatenate those
two words, the leading

01:03:20.180 --> 01:03:22.730
bits and the trailing bits--
boom, you have your answer.

01:03:22.730 --> 01:03:25.820
That gives you the
even successor.
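
The query can be sketched in Python as follows. This is a minimal illustration, not the real structure: `select1` and `rank0` are linear scans standing in for the constant-time rank and select structures, indices count from 0, and the function names and sample values are assumptions.

```python
def encode(values, trail_bits):
    # Split each sorted value: leading bits go into the unary differential
    # bit vector, trailing bits are stored explicitly in an array.
    bits, trailing, previous = [], [], 0
    for v in values:
        lead = v >> trail_bits
        bits.extend([0] * (lead - previous))
        bits.append(1)
        previous = lead
        trailing.append(v & ((1 << trail_bits) - 1))
    return bits, trailing


def select1(bits, i):
    # Position of the i-th 1 bit (i counts from 0); naive stand-in for select.
    ones = [pos for pos, b in enumerate(bits) if b == 1]
    return ones[i]


def rank0(bits, pos):
    # Number of 0 bits strictly before pos; naive stand-in for rank.
    return bits[:pos].count(0)


def lookup(bits, trailing, trail_bits, i):
    lead = rank0(bits, select1(bits, i))       # telescoping sum of the gaps
    return (lead << trail_bits) | trailing[i]  # concatenate leading and trailing bits


values = [3, 5, 6, 12, 13]  # hypothetical sorted values
bits, trailing = encode(values, trail_bits=2)
decoded = [lookup(bits, trailing, 2, i) for i in range(len(values))]
```

Counting the 0's before the i-th 1 adds up all the gaps so far, so it recovers the leading bits of the i-th value directly.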

01:03:25.820 --> 01:03:28.130
So the only thing
is we need to store

01:03:28.130 --> 01:03:30.020
rank and a select structure.

01:03:30.020 --> 01:03:36.320
And for rank, we used nk
over log log nk space.

01:03:36.320 --> 01:03:39.130
Again, that can be improved
to nk over polylog nk.

01:03:39.130 --> 01:03:40.880
But let's not worry about that.

01:03:52.290 --> 01:03:54.700
Item 1 complete.

01:03:54.700 --> 01:03:59.410
We now have a T log
log T bit suffix array.

01:03:59.410 --> 01:04:01.950
Next, we need to
make it order T,

01:04:01.950 --> 01:04:05.261
then we need to make
it into suffix tree.

01:04:05.261 --> 01:04:06.760
We're going to move
a little faster.

01:04:11.570 --> 01:04:12.465
Where to go now?

01:04:22.080 --> 01:04:24.270
Now I want a compact
suffix array.

01:04:31.670 --> 01:04:33.770
I'm going to use
the same definition.

01:04:33.770 --> 01:04:36.849
Everything's going to be
more or less the same.

01:04:36.849 --> 01:04:38.765
I just can't afford to
store all these levels.

01:04:44.034 --> 01:04:45.200
There were log log n levels.

01:04:45.200 --> 01:04:47.100
Log log n levels
is too expensive.

01:04:47.100 --> 01:04:49.240
Each one costs linear space.

01:04:49.240 --> 01:04:52.096
So I'm only going to store
a constant number of levels.

01:04:54.850 --> 01:05:00.350
Only store 1 over
epsilon plus 1 levels.

01:05:03.350 --> 01:05:06.410
And not just any levels,
but the first level,

01:05:06.410 --> 01:05:10.390
the epsilon l-th level,
the 2 epsilon l-th level,

01:05:10.390 --> 01:05:11.660
up to the l-th level.

01:05:11.660 --> 01:05:14.486
So it's still log log n levels.

01:05:14.486 --> 01:05:17.120
I'm just going to
skip a lot of them.

01:05:17.120 --> 01:05:19.130
Now, it's going to be different.

01:05:19.130 --> 01:05:21.570
I can't use even
successor anymore.

01:05:21.570 --> 01:05:27.080
Instead, even is
going to be replaced

01:05:27.080 --> 01:05:32.740
with the notion of divisible
by 2 to the epsilon l,

01:05:32.740 --> 01:05:33.950
instead of divisible by 2.

01:05:37.460 --> 01:05:40.310
So I do all this, but
replace the notion of even

01:05:40.310 --> 01:05:44.990
with divisible by 2 to the epsilon l.

01:05:44.990 --> 01:05:57.860
Because this is when you are
in SA sub k plus 1 epsilon l.

01:05:57.860 --> 01:06:00.390
The whole name of
the game is, you're

01:06:00.390 --> 01:06:03.620
trying to do a query
in SA k epsilon l,

01:06:03.620 --> 01:06:07.250
and now you want to reduce
it to SA k plus 1 epsilon l.

01:06:07.250 --> 01:06:10.790
And these are the suffixes that
are explicitly represented.

01:06:10.790 --> 01:06:13.610
Everything else needs to be
rounded to that value, then

01:06:13.610 --> 01:06:17.215
rounded back, like we had
with our giant formula before.

01:06:17.215 --> 01:06:19.340
It's not so easy to write
a single formula anymore,

01:06:19.340 --> 01:06:21.156
it's now really an algorithm.

01:06:24.440 --> 01:06:30.010
So to compute SA
k epsilon l of i,

01:06:30.010 --> 01:06:34.510
what you do is follow
a new thing, which

01:06:34.510 --> 01:06:38.600
I'm going to call
just successor of i,

01:06:38.600 --> 01:06:45.050
repeatedly to get a new index j.

01:06:47.620 --> 01:06:51.810
Or I guess call it i prime,
make it a little clearer--

01:06:51.810 --> 01:06:52.940
until it's even.

01:07:00.080 --> 01:07:02.090
So before, we just
had to make one step,

01:07:02.090 --> 01:07:03.110
and then we were even.

01:07:03.110 --> 01:07:06.680
Now, we're going to have to make
potentially epsilon l steps.

01:07:06.680 --> 01:07:08.620
So this could cost log log n.

01:07:08.620 --> 01:07:10.860
Log log n, that's not much.

01:07:10.860 --> 01:07:14.700
Actually-- sorry, not log log n.

01:07:14.700 --> 01:07:16.850
This is going to
cost 2 to the epsilon

01:07:16.850 --> 01:07:20.020
l, because it's divisible
by 2 to the epsilon l.

01:07:20.020 --> 01:07:23.660
2 to the epsilon l is
log to the epsilon.

01:07:23.660 --> 01:07:33.650
So this now may take log
to the epsilon T steps.

01:07:33.650 --> 01:07:37.460
This is where we're going to get
the log to the epsilon penalty,

01:07:37.460 --> 01:07:38.800
in time.

01:07:38.800 --> 01:07:42.410
OK, but it's simple linear
search, nothing clever here.

01:07:42.410 --> 01:07:44.040
Now, what is successor?

01:07:44.040 --> 01:07:45.950
Well, successor is
just the same thing.

01:07:48.710 --> 01:07:51.740
If you're even in this strong
sense, then nothing happens.

01:07:51.740 --> 01:07:53.640
Otherwise, you just--
same definition.

01:07:53.640 --> 01:07:56.360
This part is exactly the same.

01:07:56.360 --> 01:07:59.180
Just go to the next
position, the next suffix.

01:07:59.180 --> 01:08:01.340
But now we have to
follow it several times,

01:08:01.340 --> 01:08:04.220
until we get to an even one.

01:08:04.220 --> 01:08:05.630
OK.

01:08:05.630 --> 01:08:13.220
Then we recurse, just
like before on SA k plus

01:08:13.220 --> 01:08:15.260
1, epsilon l.

01:08:15.260 --> 01:08:20.456
The next level down of the--

01:08:20.456 --> 01:08:22.430
I think we can still
call it even rank.

01:08:35.520 --> 01:08:46.370
And then we multiply
by 2 to the epsilon l.

01:08:49.319 --> 01:08:57.020
And then subtract the number
of steps we did in step 1.

01:09:03.319 --> 01:09:05.160
We made several
steps here, we need

01:09:05.160 --> 01:09:07.180
to undo those steps at the end.

01:09:07.180 --> 01:09:07.680
That's it.
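
Here is a one-level Python sketch of that algorithm, under several assumptions that are not from the lecture: positions are 0-indexed, `step` stands in for 2 to the epsilon l, the next level is a plain array instead of a recursive structure, rank is a linear scan, and the successor wraps around from the last position to position 0 so that every walk terminates.

```python
def build(text, step):
    sa = sorted(range(len(text)), key=lambda p: text[p:])  # naive suffix array
    n = len(sa)
    inverse = [0] * n
    for idx, pos in enumerate(sa):
        inverse[pos] = idx
    # "Even" now means: position divisible by step.
    is_sampled = [pos % step == 0 for pos in sa]
    # Successor: nothing happens if sampled, else go to the next suffix (wrapping).
    successor = [i if is_sampled[i] else inverse[(sa[i] + 1) % n] for i in range(n)]
    # Next level: sampled positions divided by step, kept in suffix-array order.
    sampled = [pos // step for pos in sa if pos % step == 0]
    return sa, is_sampled, successor, sampled


def lookup(i, n, step, is_sampled, successor, sampled):
    steps = 0
    while not is_sampled[i]:      # step 1: follow successor until "even"
        i = successor[i]
        steps += 1
    rank = sum(is_sampled[:i])    # naive "even rank" of the index we reached
    # Look up the next level, multiply by step, then undo the steps taken.
    return (sampled[rank] * step - steps) % n


text = "abracadabra$"  # assumed example text
sa, is_sampled, successor, sampled = build(text, step=4)
recovered = [lookup(i, len(sa), 4, is_sampled, successor, sampled) for i in range(len(sa))]
```

The walk takes fewer than `step` steps, which is where the log to the epsilon factor in the query time comes from.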

01:09:07.680 --> 01:09:09.846
So it's just the same as
before, except before there

01:09:09.846 --> 01:09:12.010
was one step here, and
at most, one step here.

01:09:12.010 --> 01:09:14.439
Now you just count them,
subtract at the end.

01:09:14.439 --> 01:09:16.620
So exactly the
same template, just

01:09:16.620 --> 01:09:18.689
skipping a lot of the levels.

01:09:18.689 --> 01:09:25.529
And now the space is going to be
1 over epsilon, plus 1 times n.

01:09:25.529 --> 01:09:27.880
That's it.

01:09:27.880 --> 01:09:29.560
OK, so let me
analyze a little bit.

01:09:35.340 --> 01:09:37.920
So you have to check
that all of this works.

01:09:37.920 --> 01:09:39.689
Is-even-suffix, that's easy.

01:09:39.689 --> 01:09:40.680
It's still nk bits.

01:09:40.680 --> 01:09:42.479
Even rank, still nk bits.

01:09:42.479 --> 01:09:45.240
Even successor, we did
all this fancy encoding.

01:09:45.240 --> 01:09:47.472
The one thing you
can't do, is this part.

01:09:47.472 --> 01:09:49.680
I mean, there aren't very
many even suffixes anymore.

01:09:49.680 --> 01:09:54.970
So it really doesn't help you,
it buys you a very tiny factor.

01:09:54.970 --> 01:10:00.840
But 1 over 2 to the epsilon
l are going to be even.

01:10:00.840 --> 01:10:01.891
So that's very few.

01:10:01.891 --> 01:10:04.140
So you still have to store
all the answers, basically.

01:10:04.140 --> 01:10:06.730
But you can do all this
ordering trick, it still works.

01:10:06.730 --> 01:10:10.650
We weren't really exploiting
the fact that it was odd.

01:10:10.650 --> 01:10:13.290
And now you have to-- this
is not a single character,

01:10:13.290 --> 01:10:16.800
it's a bunch of characters.

01:10:16.800 --> 01:10:19.860
But still-- and so now
instead of 2 to the k,

01:10:19.860 --> 01:10:24.480
it's probably 2 to
the k epsilon l.

01:10:24.480 --> 01:10:25.950
But it all works out.

01:10:25.950 --> 01:10:29.160
It's just a renaming
of everything.

01:10:29.160 --> 01:10:32.665
It's still going to be linear
number of bits, I claim.

01:10:32.665 --> 01:10:34.790
I don't want to go through
a formal proof for that,

01:10:34.790 --> 01:10:35.581
we don't have time.

01:10:38.290 --> 01:10:39.550
But all the same tricks work.

01:10:45.730 --> 01:10:53.500
So the claim is
space going to be sum

01:10:53.500 --> 01:10:58.850
k equals 0 to 1 over epsilon, of

01:10:58.850 --> 01:11:05.480
nk epsilon l, plus n,
plus 2 nk epsilon l,

01:11:05.480 --> 01:11:12.702
plus the select bound,
n over log log n.

01:11:17.270 --> 01:11:17.900
Why?

01:11:17.900 --> 01:11:20.930
Because this is storing
the is even structure.

01:11:20.930 --> 01:11:23.180
That was just nk bits.

01:11:23.180 --> 01:11:27.784
And then, this is the successor.

01:11:27.784 --> 01:11:29.049
This is, is even.

01:11:33.270 --> 01:11:36.080
Same as we had over here,
except there's no 1/2 anymore.

01:11:36.080 --> 01:11:38.840
It's just n plus--

01:11:38.840 --> 01:11:43.430
claim is 2 nk epsilon l.

01:11:43.430 --> 01:11:45.590
That's the right answer.

01:11:45.590 --> 01:11:47.694
Yeah, that 3 was because
of this, plus this.

01:11:47.694 --> 01:11:50.110
So we still have the 3, just
don't divide it by 2 anymore.

01:11:55.950 --> 01:12:04.060
So this equals some
constant times n, 6n

01:12:04.060 --> 01:12:05.840
plus 1 over epsilon n.

01:12:09.410 --> 01:12:14.416
Plus order n over
log log n bits.

01:12:18.520 --> 01:12:19.840
OK, not bad.

01:12:19.840 --> 01:12:22.400
Not quite as good as this
bound for binary alphabet,

01:12:22.400 --> 01:12:25.210
so ignore the log sigma.

01:12:25.210 --> 01:12:26.980
Before we had 1 plus
1 over epsilon, now

01:12:26.980 --> 01:12:28.540
we have 6 plus 1 over epsilon.

01:12:32.494 --> 01:12:33.660
Kind of running out of time.

01:12:33.660 --> 01:12:40.260
I'll just tell you, you can
tune this to 1 over epsilon n,

01:12:40.260 --> 01:12:44.050
plus the little o, with
two very simple tricks.

01:12:44.050 --> 01:12:45.310
Two simple observations.

01:12:45.310 --> 01:12:51.540
The first one is, the
successor structure.

01:12:51.540 --> 01:12:55.760
At level 0, there's
nothing to do.

01:12:55.760 --> 01:12:56.260
Why?

01:12:56.260 --> 01:13:02.710
Because level 0--
a single step just

01:13:02.710 --> 01:13:04.900
corresponds to
walking in the string.

01:13:04.900 --> 01:13:08.437
I've got to think about
this a little bit.

01:13:08.437 --> 01:13:16.420
Successor-- Actually not quite
clear to me why that's true,

01:13:16.420 --> 01:13:17.820
but it turns out to be true.

01:13:17.820 --> 01:13:20.660
It's an exercise, I guess.

01:13:20.660 --> 01:13:23.580
At level 0, you don't need
the successor structure.

01:13:23.580 --> 01:13:27.210
So that actually saves you
a big factor, because if you

01:13:27.210 --> 01:13:28.680
can skip the very--

01:13:28.680 --> 01:13:32.340
k equals 0, then you get to
skip-- you get to divide by 2

01:13:32.340 --> 01:13:33.870
to the epsilon l, the space.

01:13:33.870 --> 01:13:38.850
So that gets rid of this term.

01:13:38.850 --> 01:13:43.630
Then there's this other
term, which you can skip,

01:13:43.630 --> 01:13:45.660
or you can store is
even more efficiently.

01:13:45.660 --> 01:13:48.219
So before, is-even
should be a bit vector.

01:13:48.219 --> 01:13:50.010
Because half of them
are even, half of them

01:13:50.010 --> 01:13:52.200
are odd, that's the
optimal thing to do.

01:13:52.200 --> 01:13:55.920
But in this structure,
most of them are not even.

01:13:55.920 --> 01:14:00.240
So you can save a little bit
using succinct dictionaries.

01:14:00.240 --> 01:14:01.800
Because there are
very few ones--

01:14:01.800 --> 01:14:05.160
you can achieve log, the
total number of things,

01:14:05.160 --> 01:14:08.240
choose the number of ones.

01:14:08.240 --> 01:14:10.980
[? Bog ?] of that binomial
coefficient is the number

01:14:10.980 --> 01:14:12.710
of 0's plus 1's.

01:14:12.710 --> 01:14:15.170
Not going to work it out,
it's worked out in the notes.

01:14:15.170 --> 01:14:17.550
But if you store that more
efficient dictionary, which

01:14:17.550 --> 01:14:20.010
we claimed could
be done last time,

01:14:20.010 --> 01:14:23.760
then this turns out to get a
nice sort of cascading thing.

01:14:23.760 --> 01:14:27.210
And it's little o
of n, in the end.

01:14:27.210 --> 01:14:28.920
So that gets rid of this term.

01:14:28.920 --> 01:14:32.580
And so you're left with
just n times 1 over epsilon.

01:14:32.580 --> 01:14:34.680
Plus 1, because you have
to store the text also.

01:14:37.200 --> 01:14:43.410
Or maybe because of
this plus 1, anyway.

01:14:43.410 --> 01:14:45.960
Boom.

01:14:45.960 --> 01:14:48.720
That's all I want to say
about this structure.

01:14:48.720 --> 01:14:51.310
So I wanted to focus on
the ideas, which got us

01:14:51.310 --> 01:14:54.940
the T log log T. Just
apply the same ideas,

01:14:54.940 --> 01:14:56.119
but much more sparsely.

01:14:56.119 --> 01:14:57.910
You lose in running
time, instead of paying

01:14:57.910 --> 01:15:00.240
log log T. Now we pay--

01:15:00.240 --> 01:15:02.520
we pay log to the
epsilon times log log T,

01:15:02.520 --> 01:15:04.350
but that's just log
to some other epsilon.

01:15:06.870 --> 01:15:10.320
So that gives us better space.

01:15:10.320 --> 01:15:13.700
Now linear space, instead
of n log log n space.

01:15:13.700 --> 01:15:16.290
Any questions about that?

01:15:16.290 --> 01:15:16.790
All right.

01:15:19.310 --> 01:15:24.740
Now, I get to hurry through
transforming suffix arrays,

01:15:24.740 --> 01:15:25.730
into suffix trees.

01:15:35.611 --> 01:15:37.110
This is actually a
much older paper.

01:15:37.110 --> 01:15:45.710
It's by Munro,
Raman, and Rao.

01:15:45.710 --> 01:15:49.370
There's two versions of
it in the same paper.

01:15:49.370 --> 01:15:51.680
First version is going to
be compact, second version

01:15:51.680 --> 01:15:52.450
is succinct.

01:15:52.450 --> 01:15:54.950
Probably won't have much time
to cover the succinct version,

01:15:54.950 --> 01:15:57.920
but here's what we do.

01:15:57.920 --> 01:16:00.950
Start with compact.

01:16:00.950 --> 01:16:04.460
Store compressed--
we're going to assume

01:16:04.460 --> 01:16:12.230
binary alphabet again, as
this paper does, I believe.

01:16:12.230 --> 01:16:17.090
Store the suffix tree, but
only store the trie part of it.

01:16:17.090 --> 01:16:19.510
Suffix tree really
consists of trie--

01:16:19.510 --> 01:16:22.260
binary trie, if it's
a binary alphabet.

01:16:22.260 --> 01:16:25.220
Plus, lengths on the edges.

01:16:25.220 --> 01:16:26.960
Don't store the links.

01:16:26.960 --> 01:16:30.980
Or, as Ian likes to
call it, skip the skips.

01:16:30.980 --> 01:16:33.050
The length of an
edge is how many bits

01:16:33.050 --> 01:16:36.530
you're supposed to
skip, so skip those.

01:16:36.530 --> 01:16:39.270
Just store the trie structure.

01:16:39.270 --> 01:16:43.250
So the trie structure
is on 2n plus 1 nodes,

01:16:43.250 --> 01:16:45.705
because there is n
leaves, and minus 1.

01:16:45.705 --> 01:16:47.991
Telling me it's plus
1, I don't know.

01:16:47.991 --> 01:16:50.820
2n plus a constant nodes.

01:16:50.820 --> 01:16:55.160
So this is 4n bits.

01:16:55.160 --> 01:16:57.440
We know how to do
binary tries, finally

01:16:57.440 --> 01:16:59.090
we're using last lecture.

01:16:59.090 --> 01:17:01.100
We use rank and select
a lot, but now are

01:17:01.100 --> 01:17:02.360
using the binary trie.

01:17:02.360 --> 01:17:05.670
We're going to store this using
the balanced paren structure.

01:17:05.670 --> 01:17:08.500
OK, so you have to double that--
this linear number of bits,

01:17:08.500 --> 01:17:11.540
so if we're just looking
for compact, that's fine.

01:17:11.540 --> 01:17:13.630
Now the hard part
is in a search,

01:17:13.630 --> 01:17:18.630
where we go from one
node, to the next node.

01:17:18.630 --> 01:17:20.445
We need to know the
length of this edge,

01:17:20.445 --> 01:17:23.142
we've got to figure that out.

01:17:23.142 --> 01:17:25.100
We need to know whether
the pattern jumped off,

01:17:25.100 --> 01:17:26.690
or something.

01:17:26.690 --> 01:17:31.190
We need to know at position
y, which letter of the pattern

01:17:31.190 --> 01:17:33.620
should we branch on.

01:17:33.620 --> 01:17:36.320
So we need to
measure this length.

01:17:36.320 --> 01:17:37.760
Not too hard.

01:17:37.760 --> 01:17:40.280
What you do, you
look at this subtree.

01:17:40.280 --> 01:17:44.330
You look at the leftmost
leaf and the rightmost leaf.

01:17:44.330 --> 01:17:46.190
You look at their
longest common prefix,

01:17:46.190 --> 01:17:48.304
starting from the
character you care about.

01:17:48.304 --> 01:17:50.720
And you look at the longest
common prefix with the pattern

01:17:50.720 --> 01:17:53.120
P. All sounds easy--

01:17:53.120 --> 01:17:55.430
how do you actually do it?

01:17:55.430 --> 01:17:58.700
So you need to be able to find
the leftmost leaf in a subtree.

01:17:58.700 --> 01:18:02.420
Leaves in the balanced
paren expression--

01:18:02.420 --> 01:18:05.270
I think last class, I mistakenly
thought they were that.

01:18:05.270 --> 01:18:07.190
In fact, they are this.

01:18:07.190 --> 01:18:09.080
Think about it long enough.

01:18:09.080 --> 01:18:11.832
This was leaves in
the rooted ordered tree,

01:18:11.832 --> 01:18:14.040
but what we care about are
leaves in the binary tree.

01:18:14.040 --> 01:18:15.353
And they always look
like open paren,

01:18:15.353 --> 01:18:16.730
closed paren, and closed paren.

01:18:16.730 --> 01:18:19.820
So this is a leaf,
and so what we're

01:18:19.820 --> 01:18:22.250
asking for is in a subtree,
we'll find the first leaf.

01:18:22.250 --> 01:18:26.120
That's actually just going to
be right after this open paren.

01:18:26.120 --> 01:18:32.870
Or, I guess, you do a
select, select sub this,

01:18:32.870 --> 01:18:34.619
to jump to the next leaf.

01:18:34.619 --> 01:18:36.660
Then also, you can jump
to the end of the subtree

01:18:36.660 --> 01:18:39.890
and then go back to the previous
leaf, using rank and select.

01:18:39.890 --> 01:18:42.050
So I won't go into details,
but that's easy to do.

01:18:42.050 --> 01:18:44.000
So you can identify
the two leaves

01:18:44.000 --> 01:18:47.720
using rank sub, this thing.

01:18:47.720 --> 01:18:51.230
I can identify the leaf
number, so I can identify

01:18:51.230 --> 01:18:53.090
where these leaves are.

01:18:53.090 --> 01:18:54.950
Now, I have a suffix array.

01:18:54.950 --> 01:18:58.520
If I look up the suffix array
of these two leaf numbers--

01:18:58.520 --> 01:19:01.490
remember leaves are ordered
by suffix in sorted order

01:19:01.490 --> 01:19:03.222
by suffix array.

01:19:03.222 --> 01:19:05.180
These are really indices
into the suffix array.

01:19:05.180 --> 01:19:07.580
They're giving me-- oh,
this is the i-th suffix,

01:19:07.580 --> 01:19:08.870
this is the j-th suffix.

01:19:08.870 --> 01:19:11.078
So I look at those two
positions of the suffix array,

01:19:11.078 --> 01:19:15.560
I teleport over to the string T.
Now I have the actual suffixes

01:19:15.560 --> 01:19:17.120
corresponding to this and this.

01:19:17.120 --> 01:19:19.160
And I just look at
where they match.

01:19:19.160 --> 01:19:22.940
I know that if I've already
gone down to depth d,

01:19:22.940 --> 01:19:23.870
letter depth d.

01:19:23.870 --> 01:19:26.120
I already know that they
match the first d characters.

01:19:26.120 --> 01:19:27.110
I don't compare those.

01:19:27.110 --> 01:19:28.460
They're guaranteed to match.

01:19:28.460 --> 01:19:30.800
So I start at position d plus 1.

01:19:30.800 --> 01:19:32.930
I know they should match
at least one more letter.

01:19:32.930 --> 01:19:34.430
How many more letters
do they match?

01:19:34.430 --> 01:19:36.900
That is the length
of this thing.

01:19:36.900 --> 01:19:37.415
OK.
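
NOTE A minimal sketch of the edge-length computation just described, not code from the lecture.
It assumes a plain Python string for T and a naively built suffix array; i and j are the leaf
numbers of the leftmost and rightmost leaf, and d is the letter depth already matched.
The function names are hypothetical. Positions are 0-based here, so the lecture's
"position d plus 1" is offset d.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda s: text[s:])


def extra_match(text, sa, i, j, d):
    """How many letters beyond the first d the suffixes at ranks i and j share.

    The first d letters are known to match, so they are never compared.
    """
    a = sa[i] + d  # two suffix array lookups, then jump into the text
    b = sa[j] + d
    k = 0
    while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
        k += 1
    return k


TEXT = "banana$"
SA = suffix_array(TEXT)
# Ranks 2 and 3 are "ana$" and "anana$"; after the shared "a" they share "na".
EDGE = extra_match(TEXT, SA, 2, 3, 1)
```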

01:19:37.415 --> 01:19:38.790
How can I afford
to pay for that?

01:19:38.790 --> 01:19:41.030
I'm just going to spend linear
cost: the total number

01:19:41.030 --> 01:19:42.405
of characters I
compare is going

01:19:42.405 --> 01:19:44.820
to be equal to the
length of the pattern.

01:19:44.820 --> 01:19:47.750
So we're going to end up
getting length of the pattern,

01:19:47.750 --> 01:19:51.762
times the cost to do
a suffix array access.

01:19:51.762 --> 01:19:53.720
Because I have to do this
at every single step,

01:19:53.720 --> 01:19:55.460
in the worst case.

01:19:55.460 --> 01:19:58.010
So not perfect, but pretty good.

01:19:58.010 --> 01:20:01.864
Roughly P, suffix array access
is like log to the epsilon.

01:20:01.864 --> 01:20:03.530
So we're getting a P
log to the epsilon.

01:20:03.530 --> 01:20:08.284
Not quite as good as this
bound, because here the P

01:20:08.284 --> 01:20:09.950
is not multiplied by
log to the epsilon.

01:20:09.950 --> 01:20:12.000
But, it's just log
to the epsilon.

01:20:12.000 --> 01:20:13.760
If you want to see a
better way to do it,

01:20:13.760 --> 01:20:15.434
you can read the
Grossi-Vitter paper.

01:20:15.434 --> 01:20:16.850
But this is a
decent way to do it.

01:20:20.900 --> 01:20:25.997
Now briefly, this is
the compact version,

01:20:25.997 --> 01:20:27.830
and let me tell you how
to make it succinct.

01:20:33.504 --> 01:20:35.170
I'm not going to touch
the suffix array.

01:20:35.170 --> 01:20:37.880
Suffix array, to make
that succinct is harder.

01:20:37.880 --> 01:20:41.200
But if I just want to make the
suffix tree parts succinct,

01:20:41.200 --> 01:20:43.870
I can use this same
idea, but I can't

01:20:43.870 --> 01:20:46.300
afford to store the whole trie.

01:20:46.300 --> 01:20:48.820
So I'm just going to use a
little bit of indirection.

01:20:48.820 --> 01:20:50.320
You can use as
little as you want,

01:20:50.320 --> 01:20:54.650
this is the log log log
log log log log n factor.

01:21:00.150 --> 01:21:15.250
Use the suffix tree
above every b-th suffix.

01:21:15.250 --> 01:21:19.960
So throw away all but a
1/b fraction of the leaves.

01:21:19.960 --> 01:21:22.600
And then, take the
tree that remains.

01:21:22.600 --> 01:21:25.210
So once you do a search, you
won't find exactly the leaf

01:21:25.210 --> 01:21:27.580
you want, but you'll
be within an additive b

01:21:27.580 --> 01:21:29.860
of the leaf you want. b here
can be arbitrarily small.

01:21:29.860 --> 01:21:32.530
This can be log
log log log log n.

01:21:32.530 --> 01:21:34.450
But something superconstant.

01:21:34.450 --> 01:21:37.180
Then if I use this structure,
instead of being n--

01:21:37.180 --> 01:21:38.580
order n over b space--

01:21:38.580 --> 01:21:40.090
instead of being
order n bits, it's

01:21:40.090 --> 01:21:43.000
going to be order n over b bits.

01:21:43.000 --> 01:21:44.026
So, we win.

01:21:44.026 --> 01:21:45.400
The only issue is
now, how do you

01:21:45.400 --> 01:21:49.890
find the correct leaf, as
opposed to the incorrect leaf?

01:21:53.161 --> 01:21:54.910
I don't really have
time to talk about it.

01:21:54.910 --> 01:21:57.100
You can look at the notes.

01:21:57.100 --> 01:21:58.600
Rough idea is,
well, you can have

01:21:58.600 --> 01:22:01.810
a look-up table that lets you
do whatever you want on b bits.

01:22:01.810 --> 01:22:05.140
As long as b is less
than, like, 1/2 log n.

01:22:05.140 --> 01:22:09.730
Then you can encompass the
whole trie, more or less.

01:22:09.730 --> 01:22:11.620
And just hit it with
a big look-up table

01:22:11.620 --> 01:22:13.600
and do everything
in constant time.
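
NOTE A rough sketch of the look-up table idea, not code from the lecture.
It assumes chunks of b = (1/2) log n parentheses and tabulates one example query,
the number of "())" leaf patterns in a chunk; the query choice and the names
build_table and count_leaves are illustrative assumptions. The point is only that
2^b is about sqrt(n) entries, so the table is small and each query is one look-up.

```python
import math
from itertools import product


def count_leaves(chunk):
    """Example query: how many '())' leaf patterns a chunk contains."""
    return sum(chunk[i:i + 3] == "())" for i in range(len(chunk) - 2))


def build_table(b, answer):
    """Precompute answer(chunk) for every string of b parentheses."""
    chunks = ("".join(symbols) for symbols in product("()", repeat=b))
    return {chunk: answer(chunk) for chunk in chunks}


N = 2 ** 16
B = int(0.5 * math.log2(N))            # half of log n
TABLE = build_table(B, count_leaves)   # 2**B entries, about sqrt(N)
```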

01:22:13.600 --> 01:22:20.320
It's not quite so
simple, because--

01:22:20.320 --> 01:22:21.610
easy summary, here.

01:22:32.900 --> 01:22:38.840
Essentially, what
you're doing is--

01:22:38.840 --> 01:22:40.175
these are the blocks.

01:22:40.175 --> 01:22:42.410
So this is length b.

01:22:42.410 --> 01:22:45.560
You're finding this suffix,
and you want to know,

01:22:45.560 --> 01:22:47.439
which of these is
the correct one.

01:22:47.439 --> 01:22:49.730
In some sense, you have to
do the search simultaneously

01:22:49.730 --> 01:22:51.620
for all b of these guys.

01:22:51.620 --> 01:22:53.870
And so you run down
the search again,

01:22:53.870 --> 01:22:55.647
but instead of searching
for one pattern,

01:22:55.647 --> 01:22:57.980
you search for all b of these
patterns at the same time.

01:22:57.980 --> 01:23:00.830
Now they're mostly the
same, and so you can

01:23:00.830 --> 01:23:02.280
prove it doesn't hurt you much.

01:23:02.280 --> 01:23:04.390
Maybe it hurts
you an additive b.

01:23:04.390 --> 01:23:07.730
I believe the correct answer
is, in time, you end up

01:23:07.730 --> 01:23:13.430
paying order P plus b time.

01:23:13.430 --> 01:23:17.449
Sorry, times the cost of
a suffix array access.
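
NOTE A simplified sketch of the "every b-th suffix" idea, not the structure from the notes.
Assumptions: a binary search over the sampled suffixes stands in for the sampled suffix tree,
and a linear scan over fewer than b neighboring suffix array entries stands in for the
simultaneous search. All function names are hypothetical. It only shows that the coarse
answer is within an additive b of the correct leaf and can be fixed with about b more accesses.

```python
def suffix_array(text):
    """Naive suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda s: text[s:])


def lower_bound(text, starts, pattern):
    """Index of the first suffix in `starts` (sorted) that is >= pattern."""
    lo, hi = 0, len(starts)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[starts[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo


def sampled_lower_bound(text, sa, pattern, b):
    """Same answer as lower_bound on sa, but the coarse step sees 1/b of it."""
    sampled = sa[::b]                        # keep ranks 0, b, 2b, ...
    k = lower_bound(text, sampled, pattern)  # coarse answer, off by less than b
    start = max(0, (k - 1) * b + 1)          # rank (k-1)*b is known to be too small
    end = min(len(sa), k * b)                # rank k*b (if any) is known to be >= pattern
    for r in range(start, end):              # fewer than b extra accesses
        if text[sa[r]:] >= pattern:
            return r
    return end


TEXT = "banana$"
SA = suffix_array(TEXT)
```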

01:23:17.449 --> 01:23:19.490
OK, so we're still paying
the log to the epsilon,

01:23:19.490 --> 01:23:21.120
because of the suffix array.

01:23:21.120 --> 01:23:24.310
If that was constant,
it would be free.

01:23:24.310 --> 01:23:29.082
P plus b time is fine, if
b is log log log log n.

01:23:29.082 --> 01:23:30.290
Or you can make it log log n.

01:23:30.290 --> 01:23:33.230
Then you save a log log
n factor in the bits.

01:23:33.230 --> 01:23:34.760
You pay an additive log log n.

01:23:34.760 --> 01:23:37.301
That's going to be absorbed by
the log to the epsilon anyway.

01:23:37.301 --> 01:23:38.750
So it's pretty efficient.

01:23:38.750 --> 01:23:40.625
I guess you can make
this log to the epsilon,

01:23:40.625 --> 01:23:43.280
if you felt like it,
to balance out here.

01:23:43.280 --> 01:23:45.740
Still would be P times
log to the epsilon.

01:23:45.740 --> 01:23:48.080
And so this stuff is
really quite cheap,

01:23:48.080 --> 01:23:50.360
see the notes for details.

01:23:50.360 --> 01:23:54.150
That ends our succinct coverage.

01:23:54.150 --> 01:23:57.700
Sorry, it was a little more
succinct than intended.

01:23:57.700 --> 01:23:59.320
You get the idea.